Floating Point Operations, Memory Chip Organization, Serial Bus Architecture,
Parallel Processing Architecture and Challenges
Mrs. M. Vaishnavi
Parallel Computer
Architecture
• A parallel computer is a collection of processing
elements that cooperate to solve large problems fast
• Broad issues involved:
• Resource Allocation:
• Number of processing elements (PEs).
• Computing power of each element.
• Amount of physical memory used.
• Data access, Communication and Synchronization
• How the elements cooperate and communicate.
• How data is transmitted between processors.
• Abstractions and primitives for cooperation.
• Performance and Scalability
• Performance enhancement of parallelism: Speedup.
• Scalability of performance to larger systems/problems.
Exploiting Program Parallelism
[Figure: levels of parallelism (instruction, loop, thread, process) plotted against
typical grain size, from about 1 instruction up to 1M instructions.]
The Need And Feasibility of Parallel Computing
• Application demands: More computing cycles:
• Scientific computing: CFD, Biology, Chemistry, Physics, ...
• General-purpose computing: Video, Graphics, CAD, Databases, Transaction Processing,
Gaming…
• Mainstream multithreaded programs are similar to parallel programs
• Technology Trends
• Number of transistors on chip growing rapidly
• Clock rates expected to go up but only slowly
• Architecture Trends
• Instruction-level parallelism is valuable but limited
• Coarser-level parallelism, as in multiprocessors (MPs), is the most viable approach
• Economics:
• Today’s microprocessors have multiprocessor support, eliminating the need to design
expensive custom PEs.
• Lower parallel system cost.
• Multiprocessor systems offer a cost-effective replacement for uniprocessor systems in
mainstream computing.
Scientific Computing
Demand
Scientific Supercomputing Trends
• Proving ground and driver for innovative architecture
and advanced techniques:
• Market is much smaller relative to commercial segment
• Dominated by vector machines starting in the 70s
• Meanwhile, microprocessors have made huge gains in
floating-point performance
• High clock rates.
• Pipelined floating point units.
• Instruction-level parallelism.
• Effective use of caches.
• Large-scale multiprocessors replace vector
supercomputers
• Well under way already
Raw Uniprocessor
Performance: LINPACK
[Figure: LINPACK performance (MFLOPS, log scale from 1 to 10,000) versus year, 1975–2000,
for CRAY vector machines (n = 100 and n = 1,000) and microprocessors (n = 100 and n = 1,000).
CRAY systems shown: CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94.
Microprocessors shown: Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, HP 9000/750,
DEC Alpha AXP, DEC Alpha, HP9000/735, MIPS R4400, IBM Power2/990, DEC 8200.]
Raw Parallel Performance:
LINPACK
[Figure: LINPACK performance (GFLOPS, log scale from 0.1 to 10,000) versus year, 1985–1996,
comparing CRAY peak and MPP peak performance.
CRAY systems shown: Xmp/416(4), Ymp/832(8), C90(16), T932(32).
MPPs shown: iPSC/860, nCUBE/2(1024), CM-2, CM-200, CM-5, Delta, Paragon XP/S,
Paragon XP/S MP (1024), Paragon XP/S MP (6768), T3D, ASCI Red.]
Parallelism in Microprocessor VLSI Generations
[Figure: transistor count per chip (log scale, 1,000 to 100,000,000) versus year, 1970–2005,
for microprocessors from the i4004, i8008, i8080, i8086, i80286 and i80386 through the
R2000, R3000, Pentium and R10000. Successive regions mark the shift from bit-level
parallelism to instruction-level parallelism and then (tentatively) thread-level parallelism.]
The Goal of Parallel
Computing
• Goal of applications in using parallel machines: Speedup

      Speedup (p processors) = Performance (p processors) / Performance (1 processor)

• For a fixed problem size (input data set), performance = 1/time, so

      Speedup fixed problem (p processors) = Time (1 processor) / Time (p processors)
Elements of Modern
Computers
[Figure: computing problems drive the choice of algorithms and data structures, which are
expressed in high-level languages and mapped (programming; binding by compile and load)
onto the hardware architecture, operating system and applications software, with
performance evaluation closing the loop.]
Elements of Modern
Computers
1. Computing Problems:
• Numerical computing: science and technology produce numerical
problems that demand intensive integer and floating-point
computations.
• Logical reasoning: artificial intelligence (AI) demands logic
inference, symbolic manipulation and large search spaces.
2. Algorithms and Data Structures:
• Special algorithms and data structures are needed to specify the
computations and communication present in computing problems.
• Most numerical algorithms are deterministic and use regular data
structures.
• Symbolic processing may use heuristics or non-deterministic
searches.
• Parallel algorithm development requires interdisciplinary
interaction.
Elements of Modern
Computers
3. Hardware Resources:
• Processors, memory, and peripheral devices form the
hardware core of a computer system.
• The processor instruction set, processor connectivity and memory
organization influence the system architecture.
4. Operating Systems:
• Manages the allocation of resources to running processes.
• Mapping matches algorithmic structures with the hardware
architecture and vice versa: processor scheduling,
memory mapping, interprocessor communication.
• Parallelism is exploited at algorithm design, program
writing, compilation, and run time.
Elements of Modern
Computers
5. System Software Support:
• Needed for the development of efficient programs in high-level
languages (HLLs).
• Assemblers, loaders.
• Portable parallel programming languages.
• User interfaces and tools.
6. Compiler Support:
• Preprocessor compiler: sequential compiler plus a low-level
library of the target parallel computer.
• Precompiler: some program flow analysis, dependence
checking, and limited optimizations for parallelism detection.
• Parallelizing compiler: can automatically detect
parallelism in source code and transform sequential code
into parallel constructs.
Approaches to Parallel Programming
(a) Implicit parallelism:
    Source code written in sequential languages (C, C++, FORTRAN, LISP, ...)
    → Parallelizing compiler
    → Parallel object code
    → Execution by runtime system

(b) Explicit parallelism:
    Source code written in concurrent dialects of C, C++, FORTRAN, LISP, ...
    → Concurrency-preserving compiler
    → Concurrent object code
    → Execution by runtime system

(The programmer writes the source code in both cases.)
Evolution of
Computer
Architecture
[Figure: evolution from scalar sequential processing through lookahead, I/E overlap and
functional parallelism (multiple functional units, pipelining), to implicit and explicit
vector processing (memory-to-memory and register-to-register), and on to SIMD
(processor arrays, associative processors) and MIMD (multiprocessors, multicomputers),
culminating in massively parallel processors (MPPs).]

I/E: Instruction Fetch and Execute
SIMD: Single Instruction stream over Multiple Data streams
MIMD: Multiple Instruction streams over Multiple Data streams
Parallel Architectures History
Historically, parallel architectures were tied to programming models:
• Divergent architectures, with no predictable pattern of growth.
[Figure: application software and system software layered over divergent architectures —
SIMD, message passing, shared memory, dataflow, systolic arrays.]
Programming Models
• Programming methodology used in coding
applications
• Specifies communication and synchronization
• Examples:
• Multiprogramming:
No communication or synchronization at program level
• Shared memory address space:
Communication through reads and writes to shared locations
• Message passing:
Explicit point-to-point communication
• Data parallel:
More regimented, global actions on data
• Implemented with shared address space or message passing
Flynn’s 1972 Classification of Computer
Architecture
• Single Instruction stream over a Single Data stream
(SISD): Conventional sequential machines.
• Single Instruction stream over Multiple Data streams
(SIMD): Vector computers, array of synchronized
processing elements.
• Multiple Instruction streams and a Single Data stream
(MISD): Systolic arrays for pipelined execution.
• Multiple Instruction streams over Multiple Data
streams (MIMD): Parallel computers:
• Shared memory multiprocessors.
• Multicomputers: Unshared distributed memory, message-passing
used instead.
Flynn’s Classification of Computer Architecture
Fig. 1.3 page 12 in
Advanced Computer Architecture: Parallelism,
Scalability, Programmability, Hwang, 1993.
Current Trends In Parallel Architectures
• The extension of “computer architecture” to support
communication and cooperation:
• OLD: Instruction Set Architecture
• NEW: Communication Architecture
• Defines:
• Critical abstractions, boundaries, and primitives (interfaces)
• Organizational structures that implement interfaces
(hardware or software)
• Compilers, libraries and OS are important bridges today
Modern Parallel Architecture
Layered Framework
[Figure: layered framework —
  Parallel applications: CAD, Database, Scientific modeling
  Programming models: Multiprogramming, Shared address, Message passing, Data parallel
  Communication abstraction (user/system boundary), supported by compilation or library
  Operating systems support
  --- hardware/software boundary ---
  Communication hardware
  Physical communication medium]
Shared Address Space Parallel Architectures
• Any processor can directly reference any memory
location
• Communication occurs implicitly as result of loads and stores
• Convenient:
• Location transparency
• Similar programming model to time-sharing in uniprocessors
• Except processes run on different processors
• Good throughput on multiprogrammed workloads
• Naturally provided on a wide range of platforms
• Wide range of scale: few to hundreds of processors
• Popularly known as shared memory machines or model
• Ambiguous: Memory may be physically distributed among
processors
Shared Address Space (SAS)
Model
• Process: virtual address space plus one or more threads of control
• Portions of address spaces of processes are shared
• Writes to shared address visible to other threads (in other processes too)
• Natural extension of the uniprocessor model:
• Conventional memory operations used for communication
• Special atomic operations needed for synchronization
• OS uses shared memory to coordinate processes
[Figure: virtual address spaces of processes P0 … Pn communicating via shared addresses.
A shared portion of each process's address space maps to common physical addresses in the
machine's physical address space; each process also keeps a private portion (P0 private,
P1 private, …, Pn private). Loads and stores to the shared portion implement communication.]
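To make the model concrete, here is a minimal sketch (not from the original slides) of shared-address-space communication using POSIX threads: an ordinary shared variable carries the data, and a mutex provides the special atomic operation used for synchronization.

```c
/* Minimal shared-address-space sketch using POSIX threads (illustrative). */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long shared_sum = 0;                     /* lives in the shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long my_part = (long)(size_t)arg;           /* each thread's private contribution */

    pthread_mutex_lock(&lock);                  /* special atomic op for synchronization */
    shared_sum += my_part;                      /* ordinary store implements communication */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)(size_t)(i + 1));
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    printf("shared_sum = %ld\n", shared_sum);   /* 1 + 2 + 3 + 4 = 10 */
    return 0;
}
```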
Models of Shared-Memory Multiprocessors
• The Uniform Memory Access (UMA) Model:
• The physical memory is shared by all processors.
• All processors have equal access to all memory addresses.
• Distributed memory or Nonuniform Memory Access
(NUMA) Model:
• Shared memory is physically distributed locally among
processors.
• The Cache-Only Memory Architecture (COMA) Model:
• A special case of a NUMA machine where all distributed
main memory is converted to caches.
• No memory hierarchy at each processor.
Models of Shared-Memory Multiprocessors
[Figure: three shared-memory organizations.
  UMA: processors and I/O controllers connect through a single interconnect
  (bus, crossbar, or multistage network) to shared memory modules.
  NUMA (distributed memory): each node pairs a processor and cache with local memory
  (P, $, M), and the nodes are joined by a network.
  COMA: each node holds a processor, cache and cache directory (P, C, D), with all of
  main memory organized as caches, joined by a network.
Legend — P: processor, M: memory, $/C: cache, D: cache directory.]
Uniform Memory Access Example: Intel Pentium Pro
Quad
• All coherence and multiprocessing glue
in processor module
• Highly integrated, targeted at high
volume
• Low latency and bandwidth
[Figure: four P-Pro modules (each with CPU, 256-KB L2 cache, bus interface and MIU) on the
P-Pro bus (64-bit data, 36-bit address, 66 MHz), together with an interrupt controller, a
memory controller driving 1-, 2-, or 4-way interleaved DRAM, and PCI bridges to PCI buses
and I/O cards.]
Uniform Memory Access Example: SUN
Enterprise
• 16 cards of either type: processors + memory, or I/O
• All memory accessed over bus, so symmetric
• Higher bandwidth, higher latency bus
[Figure: Gigaplane bus (256-bit data, 41-bit address, 83 MHz) connecting CPU/memory cards
(each with two processors, their L2 caches, a memory controller and a bus interface/switch)
and I/O cards (bus interface, SBUS slots, FiberChannel, 100bT, SCSI).]
Distributed Shared-Memory
Multiprocessor System
Example:
Cray T3E
• Scale up to 1024 processors, 480MB/s links
• Memory controller generates communication requests for nonlocal references
• No hardware mechanism for coherence (SGI Origin etc. provide this)
[Figure: each T3E node contains a processor and cache (P, $), a memory controller with
network interface (NI), local memory, and a switch linking it into the 3D (X, Y, Z)
interconnect; external I/O attaches through the network.]
Message-Passing Multicomputers
• Comprised of multiple autonomous computers (nodes).
• Each node consists of a processor, local memory, attached storage
and I/O peripherals.
• Programming model more removed from basic hardware operations
• Local memory is only accessible by local processors.
• A message passing network provides point-to-point static
connections among the nodes.
• Inter-node communication is carried out by message passing through
the static connection network
• Process communication achieved using a message-passing
programming environment.
Message-Passing Abstraction
• Send specifies buffer to be transmitted and receiving process
• Recv specifies sending process and application storage to receive into
• Memory to memory copy, but need to name processes
• Optional tag on send and matching rule on receive
• User process names local data and entities in process/tag space too
• In simplest form, the send/recv match achieves pairwise synch event
• Many overheads: copying, buffer management, protection
[Figure: process P executes "Send X, Q, t", naming its local address X, the destination
process Q and a tag t; process Q executes "Receive Y, P, t", naming its local address Y,
the source process P and the matching tag. The matched send/receive copies the data
between the two local process address spaces.]
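As a hedged sketch, the send/recv abstraction above looks roughly as follows with MPI (the standard named later in this deck); the tag plays the role of t, and the variables x and y are purely illustrative. Run with at least two ranks (e.g. mpirun -np 2).

```c
/* Minimal MPI sketch of the send/recv abstraction (illustrative). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, x, y, tag = 7;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                     /* process P: Send X, Q, t */
        x = 42;
        MPI_Send(&x, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {              /* process Q: Receive Y, P, t */
        MPI_Recv(&y, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", y);
    }

    MPI_Finalize();
    return 0;
}
```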
Message-Passing Example: IBM SP-2
• Made out of essentially
complete RS6000
workstations
• Network interface integrated
in I/O bus (bandwidth limited
by I/O bus)
[Figure: an SP-2 node is essentially a complete RS6000 workstation — Power2 CPU with L2
cache on a memory bus, a memory controller with 4-way interleaved DRAM, and a MicroChannel
I/O bus carrying I/O devices, DRAM and a network interface card (with i860 processor and
DMA) that connects to a general interconnection network formed from 8-port switches.]
Message-Passing Example:
Intel Paragon
[Figure: an Intel Paragon node has two i860 processors with L1 caches on a 64-bit, 50 MHz
memory bus, a memory controller with 4-way interleaved DRAM, and a network interface with
DMA and driver. A processing node attaches to every switch of a 2D grid network with
8-bit, 175 MHz bidirectional links. Sandia's Intel Paragon XP/S-based supercomputer is
built from such nodes.]
Message-Passing
Programming Tools
• Message-passing programming libraries include:
• Message Passing Interface (MPI):
• Provides a standard for writing concurrent message-
passing programs.
• MPI implementations include parallel libraries used by
existing programming languages.
• Parallel Virtual Machine (PVM):
• Enables a collection of heterogeneous computers to be
used as a coherent and flexible concurrent
computational resource.
• PVM support software executes on each machine in a
user-configurable pool and provides a computational
environment for concurrent applications.
• User programs, written for example in C, Fortran or Java,
access PVM through calls to PVM library routines.
Data Parallel Systems (SIMD in Flynn’s taxonomy)
• Programming model
• Operations performed in parallel on each
element of data structure
• Logically single thread of control, performs
sequential or parallel steps
• Conceptually, a processor is associated with
each data element
• Architectural model
• Array of many simple, cheap processors each
with little memory
• Processors don’t sequence through
instructions
• Attached to a control processor that issues
instructions
• Specialized and general communication, cheap
global synchronization
• Some recent machines:
• Thinking Machines CM-1, CM-2 (and CM-5)
• Maspar MP-1 and MP-2,
[Figure: a control processor broadcasts instructions to a grid of processing elements (PEs),
each holding its own portion of the data.]
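The following is a minimal illustration of the data-parallel programming model in code (not the SIMD array hardware sketched above); it uses an OpenMP loop on a shared-memory machine, chosen here only as a convenient way to apply the same logical operation to every element of an array.

```c
/* Data-parallel style sketch: one logical operation per array element (illustrative). */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static float a[N], b[N];            /* zero-initialized data arrays */

    #pragma omp parallel for            /* same operation applied to every element */
    for (int i = 0; i < N; i++)
        b[i] = 2.0f * a[i] + 1.0f;

    printf("b[0] = %f\n", b[0]);
    return 0;
}
```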
Dataflow Architectures
• Represent computation as a graph of
essential dependences
• Logical processor at each node, activated by availability of operands
• Message (tokens) carrying tag of next instruction sent to next processor
• Tag compared with others in matching store; match fires execution
Example program:
  a = (b + 1) × (b − c)
  d = c × e
  f = a × d
[Figure: the corresponding dataflow graph, with one node per operation and edges for the
essential dependences. Tokens flow through a pipeline of token queue, matching store,
instruction fetch (from the program store), execute, and form-token stages connected by
the network; a matched set of operand tokens fires the instruction.]
Systolic Architectures
• Replace single processor with an array of regular processing elements
• Orchestrate data flow for high throughput with less memory access
[Figure: a conventional organization (memory feeding a single PE) contrasted with a
systolic array, in which memory feeds a chain of PEs and data is pumped through them.]
• Different from pipelining
– Nonlinear array structure, multidirection data flow, each PE
may have (small) local instruction and data memory
• Different from SIMD: each PE may do something different
• Initial motivation: VLSI enables inexpensive special-purpose chips
• Represent algorithms directly by chips connected in regular
pattern
Parallel Programs
• Conditions of Parallelism:
• Data Dependence
• Control Dependence
• Resource Dependence
• Bernstein’s Conditions
• Asymptotic Notations for Algorithm Analysis
• Parallel Random-Access Machine (PRAM)
• Example: sum algorithm on P processor PRAM
• Network Model of Message-Passing Multicomputers
• Example: Asynchronous Matrix Vector Product on a Ring
• Levels of Parallelism in Program Execution
• Hardware vs. Software Parallelism
• Parallel Task Grain Size
• Example Motivating Problems with high levels of concurrency
• Limited Concurrency: Amdahl’s Law
• Parallel Performance Metrics: Degree of Parallelism (DOP)
• Concurrency Profile
• Steps in Creating a Parallel Program:
• Decomposition, Assignment, Orchestration, Mapping
• Program Partitioning Example
• Static Multiprocessor Scheduling Example
Conditions of Parallelism:
Data Dependence
1. True data (flow) dependence: statement S2 is data dependent
   on statement S1 if an execution path exists from S1 to S2 and
   at least one output variable of S1 feeds in as an input operand
   used by S2; denoted S1 → S2.
2. Antidependence: statement S2 is antidependent on S1 if S2
   follows S1 in program order and the output of S2 overlaps the
   input of S1; denoted S1 +→ S2 (crossed arrow).
3. Output dependence: two statements are output dependent if
   they produce the same output variable; denoted S1 o→ S2
   (circled arrow).
Conditions of Parallelism: Data Dependence
4. I/O dependence: read and write are I/O statements. I/O
   dependence occurs not because the same variable is involved
   but because the same file is referenced by both I/O statements.
5. Unknown dependence, arising when:
• The subscript of a variable is itself subscripted (indirect
  addressing).
• The subscript does not contain the loop index.
• A variable appears more than once with subscripts having
  different coefficients of the loop variable.
• The subscript is nonlinear in the loop index variable.
Data and I/O Dependence: Examples
S1: Load R1, A      /R1 ← Memory(A)/
S2: Add R2, R1      /R2 ← R2 + R1/
S3: Move R1, R3     /R1 ← R3/
S4: Store B, R1     /Memory(B) ← R1/
[Dependence graph: flow dependences S1 → S2 and S3 → S4 on R1; antidependence
S2 → S3 on R1; output dependence S1 → S3 on R1.]

S1: Read (4), A(I)   /Read array A from tape unit 4/
S2: Rewind (4)       /Rewind tape unit 4/
S3: Write (4), B(I)  /Write array B into tape unit 4/
S4: Rewind (4)       /Rewind tape unit 4/
I/O dependence S1 → S3: caused by the read and write statements accessing the
same file (tape unit 4).
Conditions of Parallelism
• Control Dependence:
• Order of execution cannot be determined before runtime due to
conditional statements.
• Resource Dependence:
• Concerned with conflicts in using shared resources including
functional units (integer, floating point), memory areas, among
parallel tasks.
• Bernstein’s Conditions:
  Two processes P1, P2 with input sets I1, I2 and output sets
  O1, O2 can execute in parallel (denoted P1 || P2) if:
      I1 ∩ O2 = ∅
      I2 ∩ O1 = ∅
      O1 ∩ O2 = ∅
Bernstein’s Conditions: An Example
• For the following instructions P1, P2, P3, P4, P5, assume that:
• Instructions are in program order
• Each instruction requires one step to execute
• Two adders are available
   P1 : C = D × E
   P2 : M = G + C
   P3 : A = B + C
   P4 : C = L + M
   P5 : F = G ÷ E
Using Bernstein’s conditions, after checking statement pairs:
   P1 || P5 , P2 || P3 , P2 || P5 , P5 || P3 , P4 || P5
[Figure: (a) sequential execution of P1–P5 in five steps; (b) parallel execution in three
steps, assuming two adders are available per step; (c) dependence graph of P1–P5, with
data dependences shown as solid lines and resource dependences as dashed lines.]
Asymptotic Notations for Algorithm Analysis
Asymptotic analysis of the computing time f(n) of an algorithm ignores
constant execution factors and concentrates on determining the order of
magnitude of algorithm performance.
Upper bound:
   Used in worst-case analysis of algorithm performance.
       f(n) = O(g(n))
   iff there exist two positive constants c and n0 such that
       |f(n)| ≤ c |g(n)|   for all n > n0
   i.e. g(n) is an upper bound on f(n)
   O(1) < O(log n) < O(n) < O(n log n) < O(n²) < O(n³) < O(2ⁿ)
Asymptotic Notations for Algorithm Analysis
Lower bound:
   Used in the analysis of the lower limit of algorithm performance.
       f(n) = Ω(g(n))
   if there exist positive constants c, n0 such that
       |f(n)| ≥ c |g(n)|   for all n > n0
   i.e. g(n) is a lower bound on f(n)
Tight bound:
   Used in finding a tight limit on algorithm performance.
       f(n) = Θ(g(n))
   if there exist positive constants c1, c2, and n0 such that
       c1 |g(n)| ≤ |f(n)| ≤ c2 |g(n)|   for all n > n0
   i.e. g(n) is both an upper and a lower bound on f(n)
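A short worked instance of the Big-O definition above (illustrative, not from the slides):

```latex
% f(n) = 3n^2 + 5n is O(n^2):
\[
  3n^2 + 5n \;\le\; 3n^2 + 5n^2 \;=\; 8n^2 \quad\text{for all } n \ge 1,
\]
\[
  \text{so } f(n) = O(n^2) \text{ with } c = 8,\; n_0 = 1 .
\]
```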
The Growth Rate of Common Computing Functions
  log n     n     n log n      n²        n³           2ⁿ
    0       1        0          1         1            2
    1       2        2          4         8            4
    2       4        8         16        64           16
    3       8       24         64       512          256
    4      16       64        256      4096        65536
    5      32      160       1024     32768   4294967296
Theoretical Models of Parallel
Computers
• Parallel Random-Access Machine
(PRAM):
• n processors, global shared memory model.
• Models idealized parallel computers with zero synchronization or
memory access overhead.
• Used for parallel algorithm development and for scalability and complexity
analysis.
• PRAM variants: more realistic models than pure PRAM
• EREW-PRAM: simultaneous memory reads or writes to/from the same
memory location are not allowed.
• CREW-PRAM: simultaneous memory writes to the same location are
not allowed (concurrent reads are allowed).
• ERCW-PRAM: simultaneous reads from the same memory location are
not allowed (concurrent writes are allowed).
• CRCW-PRAM: concurrent reads or writes to/from the same memory
location are allowed.
Example: sum algorithm on P processor PRAM
• Input: array A of size n = 2^k in shared memory
• Initialized local variables:
  • the order n,
  • the number of processors p = 2^q ≤ n,
  • the processor number s
• Output: the sum of the elements of A, stored in shared memory
begin
 1. for j = 1 to l (= n/p) do
        Set B(l(s − 1) + j) := A(l(s − 1) + j)
 2. for h = 1 to log n do
     2.1 if (k − h − q ≥ 0) then
            for j = 2^(k−h−q)(s − 1) + 1 to 2^(k−h−q)s do
                Set B(j) := B(2j − 1) + B(2j)
     2.2 else {if (s ≤ 2^(k−h)) then
            Set B(s) := B(2s − 1) + B(2s)}
 3. if (s = 1) then set S := B(1)
end
Running time analysis:
• Step 1 takes O(n/p): each processor executes n/p operations.
• The h-th iteration of step 2 takes O(n/(2^h p)), since each processor has
  to perform ⌈n/(2^h p)⌉ operations.
• Step 3 takes O(1).
• Total running time:

    T_p(n) = O(n/p) + Σ_{h=1}^{log n} O(⌈n/(2^h p)⌉) + O(1) = O(n/p + log n)
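The sketch below is not the PRAM algorithm itself, but a shared-memory analogue using an OpenMP reduction; it follows the same pattern of p partial sums combined in a final logarithmic-depth step.

```c
/* Shared-memory analogue of the PRAM sum (illustrative OpenMP reduction). */
#include <stdio.h>

#define N 8

int main(void)
{
    double A[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    double S = 0.0;

    /* Each thread sums roughly n/p elements into a private partial sum;
     * the runtime then combines the partial sums. */
    #pragma omp parallel for reduction(+:S)
    for (int i = 0; i < N; i++)
        S += A[i];

    printf("S = %g\n", S);   /* 36 */
    return 0;
}
```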
Example: Sum Algorithm on P Processor PRAM
Operation represented by a node
is executed by the processor
indicated below the node.
[Figure: processor allocation for computing the sum of n = 8 elements on a p = 4 processor
PRAM, over 5 time units. In time unit 1, P1–P4 copy A(1)…A(8) into B(1)…B(8), two elements
each. The remaining time units form a binary tree of additions: P1–P4 produce B(1)…B(4),
then P1 and P2 produce B(1) and B(2), then P1 produces B(1), and finally P1 sets S = B(1).]
The Power of The PRAM
Model
• Well-developed techniques and algorithms to handle many
computational problems exist for the PRAM model
• Removes algorithmic details regarding synchronization and
communication, concentrating on the structural properties
of the problem.
• Captures several important parameters of parallel
computations: operations performed in unit time, as well
as processor allocation.
• The PRAM design paradigms are robust and many network
algorithms can be directly derived from PRAM algorithms.
• It is possible to incorporate synchronization and
communication into the shared-memory PRAM model.
Performance of Parallel Algorithms
• Performance of a parallel algorithm is typically
measured in terms of worst-case analysis.
• For problem Q with a PRAM algorithm that runs in
time T(n) using P(n) processors, for an instance
size of n:
• The time-processor product C(n) = T(n) . P(n)
represents the cost of the parallel algorithm.
• For p < P(n) processors, each of the T(n) parallel steps is
simulated in O(P(n)/p) substeps, so the total simulation
takes O(T(n)P(n)/p).
• The following four measures of performance are
asymptotically equivalent:
• P(n) processors and T(n) time
• C(n) = P(n)T(n) cost and T(n) time
• O(T(n)P(n)/p) time for any number of processors p < P(n)
• O(C(n)/p + T(n)) time for any number of processors.
Network Model of Message-Passing Multicomputers
• A network of processors can be viewed as a graph G(N, E)
• Each node i ∈ N represents a processor
• Each edge (i, j) ∈ E represents a two-way communication
link between processors i and j.
• Each processor is assumed to have its own local memory.
• No shared memory is available.
• Operation is synchronous or asynchronous (message
passing).
• Typical message-passing communication constructs:
• send(X,i) a copy of X is sent to processor Pi, execution continues.
• receive(Y, j) execution suspended until the data from processor
Pj is received and stored in Y then execution resumes.
Network Model of Multicomputers
• Routing is concerned with delivering each
message from source to destination over the
network.
• Additional important network topology
parameters:
• The network diameter is the maximum distance
between any pair of nodes.
• The maximum degree of any node in G
• Example:
• Linear array: P processors P1, …, Pp are connected in
a linear array where:
• Processor Pi is connected to Pi-1 and Pi+1 if they exist.
• Diameter is p-1; maximum degree is 2
• A ring is a linear array of processors where
processors P1 and Pp are directly connected.
A Four-Dimensional Hypercube
• Two processors are connected if their binary
indices differ in one bit position.
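A small illustrative helper showing this property in code: the neighbors of a node are obtained by flipping one bit of its binary label (node XOR (1 << bit)).

```c
/* List the neighbors of each node in a d-dimensional hypercube (illustrative). */
#include <stdio.h>

static void print_neighbors(unsigned node, unsigned d)
{
    printf("node %u:", node);
    for (unsigned bit = 0; bit < d; bit++)
        printf(" %u", node ^ (1u << bit));   /* flip exactly one bit position */
    printf("\n");
}

int main(void)
{
    /* 4-dimensional hypercube: 16 nodes, each of degree 4. */
    for (unsigned node = 0; node < 16; node++)
        print_neighbors(node, 4);
    return 0;
}
```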
Example: Asynchronous Matrix Vector Product on a Ring
• Input:
• n x n matrix A ; vector x of order n
• The processor number i. The number of processors
• The ith submatrix B = A( 1:n, (i-1)r +1 ; ir) of size n x r where r = n/p
• The ith subvector w = x(i - 1)r + 1 : ir) of size r
• Output:
• Processor Pi computes the vector y = A1x1 + …. Aixi and passes the result to the right
• Upon completion P1 will hold the product Ax
Begin
1. Compute the matrix vector product z = Bw
2. If i = 1 then set y: = 0
else receive(y,left)
3. Set y: = y +z
4. send(y, right)
5. if i =1 then receive(y,left)
End
Tcomp = k(n2/p)
Tcomm = p(l+ mn)
T = Tcomp + Tcomm
= k(n2/p) + p(l+ mn)
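Below is a hedged MPI sketch of this ring algorithm. Rank 0 plays the role of P1; the matrix size N, the all-ones matrix and vector contents, and the assumptions that p divides N and that at least two ranks are used are choices made for the illustration, not part of the original pseudocode.

```c
/* Ring matrix-vector product sketch in MPI (illustrative; run with >= 2 ranks). */
#include <mpi.h>
#include <stdio.h>

#define N 8                               /* matrix order (assumed divisible by p) */

int main(int argc, char **argv)
{
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int r = N / p;                        /* columns held by this processor */
    int left  = (rank - 1 + p) % p;
    int right = (rank + 1) % p;

    /* Local data: B uses the first r columns; A(i,j) = 1 and x(j) = 1,
     * so the full product Ax should be a vector of N's. */
    double B[N][N], w[N], z[N], y[N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < r; j++)
            B[i][j] = 1.0;
    for (int j = 0; j < r; j++)
        w[j] = 1.0;

    /* Step 1: local product z = B w */
    for (int i = 0; i < N; i++) {
        z[i] = 0.0;
        for (int j = 0; j < r; j++)
            z[i] += B[i][j] * w[j];
    }

    /* Steps 2-4: accumulate partial results around the ring */
    if (rank == 0)
        for (int i = 0; i < N; i++) y[i] = 0.0;
    else
        MPI_Recv(y, N, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    for (int i = 0; i < N; i++) y[i] += z[i];
    MPI_Send(y, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);

    /* Step 5: the completed product comes back to rank 0 (the role of P1) */
    if (rank == 0) {
        MPI_Recv(y, N, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("y[0] = %g (expected %d)\n", y[0], N);
    }

    MPI_Finalize();
    return 0;
}
```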
Creating a Parallel Program
•Assumption: Sequential algorithm to solve
problem is given
• Or a different algorithm with more inherent parallelism is devised.
• Most programming problems have several parallel solutions. The best solution
may differ from that suggested by existing sequential algorithms.
One must:
• Identify work that can be done in parallel
• Partition work and perhaps data among processes
• Manage data access, communication and synchronization
• Note: work includes computation, data access and I/O
Main goal: Speedup (plus low programming effort and resource needs)
   Speedup (p) = Performance(p) / Performance(1)

For a fixed problem:

   Speedup (p) = Time(1) / Time(p)
Some Important Concepts
Task:
• Arbitrary piece of undecomposed work in parallel computation
• Executed sequentially on a single processor; concurrency is only
across tasks
• E.g. a particle/cell in Barnes-Hut, a ray or ray group in Raytrace
• Fine-grained versus coarse-grained tasks
Process (thread):
• Abstract entity that performs the tasks assigned to processes
• Processes communicate and synchronize to perform their tasks
Processor:
• Physical engine on which process executes
• Processes virtualize machine to programmer
• first write program in terms of processes, then
map to processors
Levels of Parallelism in Program Execution
Level 5: Jobs or programs (multiprogramming) — coarse grain
Level 4: Subprograms, job steps or related parts of a program — coarse grain
Level 3: Procedures, subroutines, or co-routines — medium grain
Level 2: Non-recursive loops or unfolded iterations — fine grain
Level 1: Instructions or statements — fine grain

Moving from level 5 down to level 1, the degree of parallelism increases, but so do the
communications demand and the mapping/scheduling overhead.
Hardware and Software Parallelism
• Hardware parallelism:
• Defined by machine architecture, hardware multiplicity
(number of processors available) and connectivity.
• Often a function of cost/performance tradeoffs.
• Characterized in a single processor by the number of
instructions k issued in a single cycle (k-issue processor).
• A multiprocessor system with n k-issue processors can
handle a maximum of nk threads.
• Software parallelism:
• Defined by the control and data dependence of programs.
• Revealed in program profiling or program flow graph.
• A function of algorithm, programming style and compiler
optimization.
Computational Parallelism and Grain Size
• Grain size (granularity) is a measure of the
amount of computation involved in a task in
parallel computation :
• Instruction Level:
• At instruction or statement level.
• 20 instructions grain size or less.
• For scientific applications, parallelism at this level
ranges from 500 to 3000 concurrent statements
• Manual parallelism detection is difficult but assisted
by parallelizing compilers.
• Loop level:
• Iterative loop operations.
• Typically, 500 instructions or less per iteration.
• Optimized on vector parallel computers.
• Independent successive loop operations can be
vectorized or run in SIMD mode.
Computational Parallelism and Grain Size
• Procedure level:
• Medium-size grain; task, procedure, subroutine levels.
• Less than 2000 instructions.
• Parallelism is more difficult to detect than at finer-grain
levels.
• Less communication requirements than fine-grain
parallelism.
• Relies heavily on effective operating system support.
• Subprogram level:
• Job and subprogram level.
• Thousands of instructions per grain.
• Often scheduled on message-passing multicomputers.
• Job (program) level, or multiprogramming:
• Independent programs executed on a parallel computer.
• Grain size in tens of thousands of instructions.
Example Motivating Problems:
Simulating Ocean Currents
• Model as two-dimensional grids
• Discretize in space and time
• finer spatial and temporal resolution => greater accuracy
• Many different computations per time step
• set up and solve equations
• Concurrency across and within grid computations
[Figure: (a) cross sections; (b) spatial discretization of a cross section.]
Example Motivating Problems:
Simulating Galaxy Evolution
• Simulate the interactions of many stars evolving over time
• Computing forces is expensive
 • O(n²) brute force approach
 • Hierarchical methods take advantage of the force law: F = G m1 m2 / r²
• Many time-steps, plenty of concurrency across stars within one
[Figure: the star on which forces are being computed sees nearby stars individually
(too close to approximate), while small and large groups that are far enough away are
approximated by their centers of mass.]
Example Motivating Problems:
Rendering Scenes by Ray Tracing
• Shoot rays into scene through pixels in image
plane
• Follow their paths
• They bounce around as they strike objects
• They generate new rays: ray tree per input ray
• Result is color and opacity for that pixel
• Parallelism across rays
• All above case studies have abundant
concurrency
Limited Concurrency: Amdahl’s Law
• Most fundamental limitation on parallel speedup.
• If fraction s of sequential execution is inherently serial, speedup ≤ 1/s
• Example: 2-phase calculation
 • sweep over n-by-n grid and do some independent computation
 • sweep again and add each value into a global sum
• Time for first phase = n²/p
• Second phase serialized at the global variable, so time = n²
• Speedup ≤ 2n² / (n²/p + n²), or at most 2
• Possible trick: divide second phase into two
 • Accumulate into a private sum during the sweep
 • Add per-process private sums into the global sum
• Parallel time is n²/p + n²/p + p, and speedup is at best
  2n² / (2n²/p + p) = 2n²p / (2n² + p²)
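A quick illustrative check of the two bounds derived above, for a sample grid size n and a range of processor counts (the numbers printed are computed from the formulas, not measured):

```c
/* Evaluate the two speedup bounds from the 2-phase example (illustrative). */
#include <stdio.h>

int main(void)
{
    double n = 1000.0;

    printf("    p   serial 2nd phase   private sums\n");
    for (double p = 2; p <= 1024; p *= 4) {
        double naive = 2*n*n / (n*n/p + n*n);   /* second phase fully serialized */
        double trick = 2*n*n / (2*n*n/p + p);   /* private sums, then p serial adds */
        printf("%5.0f   %12.3f     %10.3f\n", p, naive, trick);
    }
    return 0;
}
```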
Amdahl’s Law Example:
A Pictorial Depiction
[Figure: work done concurrently versus time for the 2n² example —
(a) one processor: 2n² of serial work;
(b) p processors with a serialized second phase: n²/p of concurrent work followed by n² of serial work;
(c) p processors with private sums: two concurrent n²/p phases followed by only p serial additions.]
Parallel Performance Metrics
Degree of Parallelism (DOP)
• For a given time period, DOP reflects the number of processors in a specific parallel computer
actually executing a particular parallel program.
• Average Parallelism:
• given maximum parallelism = m
• n homogeneous processors
• computing capacity of a single processor Δ
• Total amount of work W (instructions, computations):

      W = Δ ∫[t1,t2] DOP(t) dt     or, as a discrete summation,     W = Δ Σ_{i=1}^{m} i·t_i

  where t_i is the total time during which DOP = i, and Σ_{i=1}^{m} t_i = t2 − t1.
• The average parallelism A:

      A = (1/(t2 − t1)) ∫[t1,t2] DOP(t) dt

  or, in discrete form,

      A = ( Σ_{i=1}^{m} i·t_i ) / ( Σ_{i=1}^{m} t_i )
Example: Concurrency Profile of
A Divide-and-Conquer Algorithm
• Execution observed from t1 = 2 to t2 = 27
• Peak parallelism m = 8
• A = (1x5 + 2x3 + 3x4 + 4x6 + 5x2 + 6x2 + 8x3) / (5 + 3+4+6+2+2+3)
= 93/25 = 3.72
[Figure: concurrency profile — degree of parallelism (DOP) versus time from t1 = 2 to
t2 = 27, with a peak DOP of 8; the area under the profile corresponds to the discrete
average parallelism A = (Σ i·t_i) / (Σ t_i) computed above.]
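A tiny illustrative helper that recomputes the average parallelism from the (DOP, time) pairs read off this profile:

```c
/* Average parallelism A = (sum of i*t_i) / (sum of t_i) for the profile above. */
#include <stdio.h>

int main(void)
{
    /* (DOP i, time t_i) pairs observed between t1 = 2 and t2 = 27 */
    int dop[] = {1, 2, 3, 4, 5, 6, 8};
    int t[]   = {5, 3, 4, 6, 2, 2, 3};
    int n = sizeof dop / sizeof dop[0];

    double work = 0.0, total = 0.0;
    for (int k = 0; k < n; k++) {
        work  += dop[k] * t[k];
        total += t[k];
    }
    printf("A = %.0f / %.0f = %.2f\n", work, total, work / total);  /* 93 / 25 = 3.72 */
    return 0;
}
```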
Parallel Performance Example
• The execution time T for three parallel programs is given in terms of processor count P
  and problem size N.
• In each case, we assume that the total computation work performed by an optimal
  sequential algorithm scales as N + N².
1. For the first parallel algorithm: T = N + N²/P
   This algorithm partitions the computationally demanding O(N²) component of the
   algorithm but replicates the O(N) component on every processor. There are no other
   sources of overhead.
2. For the second parallel algorithm: T = (N + N²)/P + 100
   This algorithm optimally divides all the computation among all processors but
   introduces an additional cost of 100.
3. For the third parallel algorithm: T = (N + N²)/P + 0.6P²
   This algorithm also partitions all the computation optimally but introduces an
   additional cost of 0.6P².
• All three algorithms achieve a speedup of about 10.8 when P = 12 and N = 100. However, they
  behave differently in other situations, as shown next.
• With N = 100, all three algorithms perform poorly for larger P, although Algorithm (3) does
  noticeably worse than the other two.
• When N = 1000, Algorithm (2) is much better than Algorithm (1) for larger P.
Parallel Performance Example (continued)
[Figure: speedup versus processor count for the three algorithms.
 All algorithms achieve speedup = 10.8 when P = 12 and N = 100.
 For N = 1000, Algorithm (2) performs much better than Algorithm (1) for larger P.
 Algorithm 1: T = N + N²/P
 Algorithm 2: T = (N + N²)/P + 100
 Algorithm 3: T = (N + N²)/P + 0.6P²]
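An illustrative check of the three execution-time models at P = 12 and N = 100, taking the sequential work as N + N² as stated above:

```c
/* Speedups of the three model algorithms at P = 12, N = 100 (illustrative). */
#include <stdio.h>

int main(void)
{
    double N = 100.0, P = 12.0;
    double seq = N + N*N;                  /* optimal sequential work */

    double t1 = N + N*N/P;                 /* Algorithm 1 */
    double t2 = (N + N*N)/P + 100.0;       /* Algorithm 2 */
    double t3 = (N + N*N)/P + 0.6*P*P;     /* Algorithm 3 */

    printf("speedup 1 = %.1f\n", seq / t1);
    printf("speedup 2 = %.1f\n", seq / t2);
    printf("speedup 3 = %.1f\n", seq / t3);   /* all close to 10.8 */
    return 0;
}
```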
Steps in Creating a Parallel Program
• 4 steps:
Decomposition, Assignment, Orchestration, Mapping
• Done by programmer or system software (compiler,
runtime, ...)
• Issues are the same, so assume programmer does it all
explicitly
[Figure: the sequential computation is decomposed into tasks, the tasks are assigned to
processes (decomposition + assignment = partitioning), orchestration turns the processes
into a parallel program (p0 … p3), and mapping places the processes onto processors
(P0 … P3).]
Decomposition
• Break up computation into concurrent tasks to be divided among processes:
• Tasks may become available dynamically.
• No. of available tasks may vary with time.
• Together with assignment, also called partitioning.
i.e. identify concurrency and decide level at which to exploit it.
• Grain-size problem:
• To determine the number and size of grains or tasks in a parallel program.
• Problem and machine-dependent.
• Solutions involve tradeoffs between parallelism, communication and
scheduling/synchronization overhead.
• Grain packing:
• To combine multiple fine-grain nodes into a coarse grain node (task) to reduce
communication delays and overall scheduling overhead.
Goal: Enough tasks to keep processes busy, but not too many
• No. of tasks available at a time is upper bound on achievable speedup
Assignment
• Specifying mechanisms to divide work up among processes:
• Together with decomposition, also called partitioning.
• Balance workload, reduce communication and management cost
• Partitioning problem:
• To partition a program into parallel branches, modules to give the
shortest possible execution on a specific parallel architecture.
• Structured approaches usually work well:
• Code inspection (parallel loops) or understanding of application.
• Well-known heuristics.
• Static versus dynamic assignment.
• As programmers, we worry about partitioning first:
• Usually independent of architecture or programming model.
• But cost and complexity of using primitives may affect decisions.
Orchestration
• Naming data.
• Structuring communication.
• Synchronization.
• Organizing data structures and scheduling tasks
temporally.
• Goals
• Reduce cost of communication and synch. as seen by processors
• Preserve locality of data reference (incl. data structure organization)
• Schedule tasks to satisfy dependences early
• Reduce overhead of parallelism management
• Closest to architecture (and programming model &
language).
• Choices depend a lot on comm. abstraction, efficiency of primitives.
• Architects should provide appropriate primitives efficiently.
Mapping
• Each task is assigned to a processor in a manner that attempts to satisfy
the competing goals of maximizing processor utilization and minimizing
communication costs.
• Mapping can be specified statically or determined at runtime by load-
balancing algorithms (dynamic scheduling).
• Two aspects of mapping:
• Which processes will run on the same processor, if necessary
• Which process runs on which particular processor
• mapping to a network topology
• One extreme: space-sharing
• Machine divided into subsets, only one app at a time in a subset
• Processes can be pinned to processors, or left to OS.
• Another extreme: complete resource management control to OS
• OS uses the performance techniques we will discuss later.
• Real world is between the two.
• User specifies desires in some aspects, system may ignore
Program Partitioning Example
Example 2.4 page 64
Fig 2.6 page 65
Fig 2.7 page 66
In Advanced Computer
Architecture, Hwang
Static Multiprocessor Scheduling
Dynamic multiprocessor scheduling is an NP-hard problem.
Node Duplication: to eliminate idle time and communication delays, some
nodes may be duplicated in more than one processor.
Fig. 2.8 page 67
Example: 2.5 page 68
In Advanced Computer
Architecture, Hwang
Table 2.1 Steps in the Parallelization Process and Their Goals

Step           Architecture-Dependent?   Major Performance Goals
Decomposition  Mostly no                 Expose enough concurrency, but not too much
Assignment     Mostly no                 Balance workload
                                         Reduce communication volume
Orchestration  Yes                       Reduce noninherent communication via data locality
                                         Reduce communication and synchronization cost
                                           as seen by the processor
                                         Reduce serialization at shared resources
                                         Schedule tasks to satisfy dependences early
Mapping        Yes                       Put related processes on the same processor
                                           if necessary
                                         Exploit locality in network topology
Successive Refinement
Partitioning is often independent of architecture, and may
be
done first:
• View machine as a collection of communicating processors
• Balancing the workload.
• Reducing the amount of inherent communication
• Reducing extra work.
• Above three issues are conflicting.
Then deal with interactions with architecture:
• View machine as an extended memory hierarchy
• Extra communication due to architectural interactions.
• Cost of communication depends on how it is structured
• This may inspire changes in partitioning.
Partitioning for Performance
• Balancing the workload and reducing wait time at synch
points
• Reducing inherent communication.
• Reducing extra work.
These algorithmic issues have extreme trade-offs:
• Minimize communication => run on 1 processor.
=> extreme load imbalance.
• Maximize load balance => random assignment of tiny tasks.
=> no control over communication.
• Good partition may imply extra work to compute or manage it
• The goal is to compromise between the above extremes
• Fortunately, often not difficult in practice.
Load Balancing and Synch Wait Time
Reduction
Limit on speedup:
• Work includes data access and other costs.
• Not just equal work, but must be busy at same time.
Four parts to load balancing and reducing synch wait
time:
1. Identify enough concurrency.
2. Decide how to manage it.
3. Determine the granularity at which to exploit it
4. Reduce serialization and cost of synchronization
   Speedup_problem(p) ≤ Sequential Work / Max Work on any Processor
Managing Concurrency
Static versus Dynamic techniques
Static:
• Algorithmic assignment based on input; won’t
change
• Low runtime overhead
• Computation must be predictable
• Preferable when applicable (except in
multiprogrammed/heterogeneous environment)
Dynamic:
• Adapt at runtime to balance load
• Can increase communication and reduce locality
• Can increase task management overheads
Dynamic Load Balancing
• To achieve best performance of a parallel computing system running a parallel
problem, it’s essential to maximize processor utilization by distributing the
computation load evenly or balancing the load among the available processors.
• Optimal static load balancing, optimal mapping or scheduling, is an intractable NP-
complete problem, except for specific problems on specific networks.
• Hence heuristics are usually used to select processors for processes.
• Even the best static mapping may not offer the best execution time due to changing
conditions at runtime, so the mapping may need to be done dynamically.
• The methods used for balancing the computational load dynamically among
processors
Can be broadly classified as:
1. Centralized dynamic load balancing.
2. Decentralized dynamic load balancing.
Processor Load Balance
& Performance
Dynamic Tasking with Task Queues
Centralized versus distributed queues.
Task stealing with distributed queues.
• Can compromise communication and locality, and increase
synchronization.
• Whom to steal from, how many tasks to steal, ...
• Termination detection
• Maximum imbalance related to size of task
[Figure: (a) a centralized task queue, into which all processes insert tasks and from which
all remove tasks; (b) distributed task queues (one per process, Q0 … Q3): each process Pi
inserts into and removes from its own queue, and other processes may steal from it.]
Implications of Load
Balancing
Extends the speedup limit expression to:

   Speedup_problem(p) ≤ Sequential Work / Max (Work + Synch Wait Time)

Generally, the responsibility of software.
Architecture can support task stealing and synchronization efficiently:
• Fine-grained communication, low-overhead access to queues
 • Efficient support allows smaller tasks, better load balancing
• Naming logically shared data in the presence of task stealing
 • Need to access data of stolen tasks, especially multiply-stolen tasks
   => hardware shared address space advantageous
• Efficient support for point-to-point communication
Reducing Inherent Communication
Measure: communication to computation ratio
Focus here is on inherent communication
• Determined by assignment of tasks to processes
• Actual communication can be greater
• Assign tasks that access same data to same
process
• An optimal solution that reduces communication and
achieves an optimal load balance is NP-hard in
the general case
• Simple heuristic solutions work well in practice:
• Due to specific structure of applications.
Implications of Communication-to-
Computation Ratio
• Architects must examine application needs
• If denominator is execution time, ratio gives average BW needs
• If operation count, gives extremes in impact of latency and bandwidth
• Latency: assume no latency hiding
• Bandwidth: assume all latency hidden
• Reality is somewhere in between
• Actual impact of communication depends on
structure and cost as well:
• Need to keep communication balanced across
processors as well.
   Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost)
Reducing Extra Work (Overheads)
• Common sources of extra work:
• Computing a good partition
e.g. partitioning in Barnes-Hut or sparse matrix
• Using redundant computation to avoid
communication
• Task, data and process management overhead
• Applications, languages, runtime systems, OS
• Imposing structure on communication
• Coalescing messages, allowing effective naming
• Architectural Implications:
• Reduce need by making communication and
orchestration efficient
   Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost + Extra Work)
Extended Memory-Hierarchy
View of Multiprocessors
• Levels in extended hierarchy:
• Registers, caches, local memory, remote memory
(topology)
• Glued together by communication architecture
• Levels communicate at a certain granularity of data
transfer
• Need to exploit spatial and temporal locality in
hierarchy
• Otherwise extra communication may also be caused
• Especially important since communication is expensive
Extended Hierarchy
• Idealized view: local cache hierarchy + single main memory
• But reality is more complex:
• Centralized Memory: caches of other processors
• Distributed Memory: some local, some remote; +
network topology
• Management of levels:
• Caches managed by hardware
• Main memory depends on programming model
• SAS: data movement between local and remote
transparent
• Message passing: explicit
• Improve performance through architecture or
program locality
• Tradeoff with parallelism; need good node
performance and parallelism
Artifactual Communication in Extended Hierarchy
Accesses not satisfied in local portion cause
communication
• Inherent communication, implicit or explicit, causes
transfers
• Determined by program
• Artifactual communication:
• Determined by program implementation and arch.
interactions
• Poor allocation of data across distributed memories
• Unnecessary data in a transfer
• Unnecessary transfers due to system granularities
• Redundant communication of data
• finite replication capacity (in cache or main memory)
• Inherent communication assumes unlimited capacity,
small transfers, perfect knowledge of what is needed.
• More on artifactual communication later; first we consider
replication-induced communication further.
Structuring Communication
Given amount of comm (inherent or artifactual), goal is to reduce cost
• Cost of communication as seen by process:
C = f * ( o + l + + tc - overlap)
• f = frequency of messages
• o = overhead per message (at both ends)
• l = network delay per message
• nc = total data sent
• m = number of messages
• B = bandwidth along path (determined by network, NI,
assist)
• tc = cost induced by contention per message
• overlap = amount of latency hidden by overlap with comp.
or comm.
• Portion in parentheses is cost of a message (as seen by processor)
• That portion, ignoring overlap, is latency of a message
• Goal: reduce terms in latency and increase overlap
nc/m
B
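An illustrative encoding of this cost expression; the parameter values used below are made-up numbers, not measurements from the slides.

```c
/* Communication cost model C = f * (o + l + (n_c/m)/B + t_c - overlap) (illustrative). */
#include <stdio.h>

static double comm_cost(double f, double o, double l, double nc,
                        double m, double B, double tc, double overlap)
{
    return f * (o + l + (nc / m) / B + tc - overlap);
}

int main(void)
{
    /* Example: 1000 messages, 5 us overhead, 2 us delay, 8 MB total data,
     * 100 MB/s bandwidth, no contention, no overlap.                      */
    double cost = comm_cost(1000, 5e-6, 2e-6, 8e6, 1000, 100e6, 0, 0);
    printf("total communication cost = %.4f s\n", cost);
    return 0;
}
```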
Reducing Overhead
• Can reduce no. of messages m or overhead per
message o
• o is usually determined by hardware or system
software
• Program should try to reduce m by coalescing messages
• More control when communication is explicit
• Coalescing data into larger messages:
• Easy for regular, coarse-grained communication
• Can be difficult for irregular, naturally fine-grained
communication
• May require changes to algorithm and extra work
• coalescing data and determining what and to whom to send
• Will discuss more in implications for programming models later
Reducing Network Delay
• Network delay component = f · h · t_h
 • h = number of hops traversed in the network
 • t_h = link + switch latency per hop
• Reducing f: Communicate less, or make messages
larger
• Reducing h:
• Map communication patterns to network topology
e.g. nearest-neighbor on mesh and ring; all-to-all
• How important is this?
• Used to be a major focus of parallel algorithms
• Depends on the number of processors and on how t_h compares with other
components
• Less important on modern machines
• Overheads, processor count, multiprogramming
Overlapping Communication
• Cannot afford to stall for high latencies
• Overlap with computation or communication to hide
latency
• Requires extra concurrency (slackness), higher
bandwidth
• Techniques:
• Prefetching
• Block data transfer
• Proceeding past communication
• Multithreading
Summary of Tradeoffs
• Different goals often have conflicting
demands
• Load Balance
• Fine-grain tasks
• Random or dynamic assignment
• Communication
• Usually coarse grain tasks
• Decompose to obtain locality: not random/dynamic
• Extra Work
• Coarse grain tasks
• Simple assignment
• Communication Cost:
• Big transfers: amortize overhead and latency
• Small transfers: reduce contention
Relationship Between Perspectives
[Figure: maps each performance issue to the parallelization step(s) that address it and to
the processor time component it shows up in.
 Performance issues: load imbalance and synchronization; inherent communication volume;
 artifactual communication and data locality; communication structure.
 Parallelization steps: decomposition, assignment, orchestration, mapping.
 Processor time components: busy-overhead (extra work), data-local, data-remote, synch wait.]
Summary

   Speedup_prob(p) = (Busy(1) + Data(1)) /
                     (Busy_useful(p) + Data_local(p) + Synch(p) + Data_remote(p) + Busy_overhead(p))

• Goal is to reduce the denominator components
• Both programmer and system have a role to play
• Architecture cannot do much about load imbalance or too much communication
• But it can:
 • reduce the incentive for creating ill-behaved programs
   (efficient naming, communication and synchronization)
 • reduce artifactual communication
 • provide efficient naming for flexible assignment
 • allow effective overlapping of communication
Generic Distributed Memory
Organization
[Figure: a scalable network of switches connecting nodes, where each node contains a
processor (P), cache ($), memory (M) and a communication assist (CA). Design questions
raised on the slide:
 • Network bandwidth? Bandwidth demand — independent versus communicating processes?
 • Latency? O(log2 P) increase? Cost scalability of the system?
 • Network: multistage interconnection network (MIN)? Custom-designed?
 • Node: O(10)-processor bus-based SMP? Custom-designed CPU?
 • Node/system integration level — how far? Cray-on-a-chip? SMP-on-a-chip?
 • OS supported? Network protocols?
 • Communication assist: extent of functionality? Message transactions? DMA?
   Global virtual shared address space?]

Similar to Floating Point Operations , Memory Chip Organization , Serial Bus Architecture , Parallel processing A.ppt

Evaluation of morden computer & system attributes in ACA
Evaluation of morden computer &  system attributes in ACAEvaluation of morden computer &  system attributes in ACA
Evaluation of morden computer & system attributes in ACAPankaj Kumar Jain
 
Multicore_Architecture Book.pdf
Multicore_Architecture Book.pdfMulticore_Architecture Book.pdf
Multicore_Architecture Book.pdfSwatantraPrakash5
 
Dedicated fully parallel architecture
Dedicated fully parallel architectureDedicated fully parallel architecture
Dedicated fully parallel architectureGhufran Hasan
 
Overview of HPC.pptx
Overview of HPC.pptxOverview of HPC.pptx
Overview of HPC.pptxsundariprabhu
 
High performance computing for research
High performance computing for researchHigh performance computing for research
High performance computing for researchEsteban Hernandez
 
A Source-To-Source Approach to HPC Challenges
A Source-To-Source Approach to HPC ChallengesA Source-To-Source Approach to HPC Challenges
A Source-To-Source Approach to HPC ChallengesChunhua Liao
 
Lec 2 (parallel design and programming)
Lec 2 (parallel design and programming)Lec 2 (parallel design and programming)
Lec 2 (parallel design and programming)Sudarshan Mondal
 
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated SystemsPetapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated Systemsdairsie
 
Lecture 2
Lecture 2Lecture 2
Lecture 2Mr SMAK
 
Coding For Cores - C# Way
Coding For Cores - C# WayCoding For Cores - C# Way
Coding For Cores - C# WayBishnu Rawal
 
Hpc lunch and learn
Hpc lunch and learnHpc lunch and learn
Hpc lunch and learnJohn D Almon
 
Chap 2 classification of parralel architecture and introduction to parllel p...
Chap 2  classification of parralel architecture and introduction to parllel p...Chap 2  classification of parralel architecture and introduction to parllel p...
Chap 2 classification of parralel architecture and introduction to parllel p...Malobe Lottin Cyrille Marcel
 
Data Parallel and Object Oriented Model
Data Parallel and Object Oriented ModelData Parallel and Object Oriented Model
Data Parallel and Object Oriented ModelNikhil Sharma
 

Similar to Floating Point Operations , Memory Chip Organization , Serial Bus Architecture , Parallel processing A.ppt (20)

Chap 1(one) general introduction
Chap 1(one)  general introductionChap 1(one)  general introduction
Chap 1(one) general introduction
 
Evaluation of morden computer & system attributes in ACA
Evaluation of morden computer &  system attributes in ACAEvaluation of morden computer &  system attributes in ACA
Evaluation of morden computer & system attributes in ACA
 
Multicore_Architecture Book.pdf
Multicore_Architecture Book.pdfMulticore_Architecture Book.pdf
Multicore_Architecture Book.pdf
 
Dedicated fully parallel architecture
Dedicated fully parallel architectureDedicated fully parallel architecture
Dedicated fully parallel architecture
 
Pdc lecture1
Pdc lecture1Pdc lecture1
Pdc lecture1
 
aca mod1.pptx
aca mod1.pptxaca mod1.pptx
aca mod1.pptx
 
Performance (p processors) / Performance (1 processor)
• For a fixed problem size (input data set), performance = 1/time:
Speedup fixed problem (p processors) = Time (1 processor) / Time (p processors)
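The two speedup definitions above can be restated compactly in LaTeX (same quantities as in the slide, gathered for reference):

    \text{Speedup}(p) = \frac{\text{Performance}(p\ \text{processors})}{\text{Performance}(1\ \text{processor})},
    \qquad
    \text{Speedup}_{\text{fixed problem}}(p) = \frac{\text{Time}(1\ \text{processor})}{\text{Time}(p\ \text{processors})}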
Elements of Modern Computers
• Computing Problems
• Algorithms and Data Structures
• High-level Languages
• Hardware Architecture
• Operating System
• Applications Software
• Performance Evaluation
(Diagram: the elements are connected by the Mapping, Programming, and Binding (Compile, Load) steps.)
Elements of Modern Computers
1. Computing Problems:
• Numerical Computing: Science and technology numerical problems demand intensive integer and floating-point computations.
• Logical Reasoning: Artificial intelligence (AI) demands logic inferences, symbolic manipulations and large space searches.
2. Algorithms and Data Structures:
• Special algorithms and data structures are needed to specify the computations and communication present in computing problems.
• Most numerical algorithms are deterministic, using regular data structures.
• Symbolic processing may use heuristics or non-deterministic searches.
• Parallel algorithm development requires interdisciplinary interaction.
Elements of Modern Computers
3. Hardware Resources:
• Processors, memory, and peripheral devices form the hardware core of a computer system.
• Processor instruction set, processor connectivity and memory organization influence the system architecture.
4. Operating Systems:
• Manages the allocation of resources to running processes.
• Mapping to match algorithmic structures with hardware architecture and vice versa: processor scheduling, memory mapping, interprocessor communication.
• Parallelism exploitation at: algorithm design, program writing, compilation, and run time.
Elements of Modern Computers
5. System Software Support:
• Needed for the development of efficient programs in high-level languages (HLLs).
• Assemblers, loaders.
• Portable parallel programming languages.
• User interfaces and tools.
6. Compiler Support:
• Preprocessor compiler: Sequential compiler plus a low-level library of the target parallel computer.
• Precompiler: Some program flow analysis, dependence checking, limited optimizations for parallelism detection.
• Parallelizing compiler: Can automatically detect parallelism in source code and transform sequential code into parallel constructs.
Approaches to Parallel Programming
(a) Implicit Parallelism:
• The programmer writes source code in sequential languages (C, C++, FORTRAN, LISP, ...).
• A parallelizing compiler produces parallel object code.
• Execution by the runtime system.
(b) Explicit Parallelism:
• The programmer writes source code in concurrent dialects of C, C++, FORTRAN, LISP, ...
• A concurrency-preserving compiler produces concurrent object code.
• Execution by the runtime system.
Evolution of Computer Architecture
• Scalar sequential execution evolved through lookahead and I/E overlap into functional parallelism (multiple functional units, pipelining).
• Pipelining led to implicit and explicit vector processing (memory-to-memory and register-to-register).
• Vector processing evolved into SIMD (processor arrays, associative processors) and MIMD (multiprocessors, multicomputers), culminating in massively parallel processors (MPPs).
• I/E: Instruction Fetch and Execute.
• SIMD: Single Instruction stream over Multiple Data streams.
• MIMD: Multiple Instruction streams over Multiple Data streams.
Parallel Architectures History
• Historically, parallel architectures were tied to programming models: SIMD, message passing, shared memory, dataflow, systolic arrays, each with its own application and system software stack.
• Divergent architectures, with no predictable pattern of growth.
Programming Models
• Programming methodology used in coding applications.
• Specifies communication and synchronization.
• Examples:
• Multiprogramming: No communication or synchronization at program level.
• Shared memory address space.
• Message passing: Explicit point-to-point communication.
• Data parallel: More regimented, global actions on data; implemented with shared address space or message passing.
Flynn’s 1972 Classification of Computer Architecture
• Single Instruction stream over a Single Data stream (SISD): Conventional sequential machines.
• Single Instruction stream over Multiple Data streams (SIMD): Vector computers, arrays of synchronized processing elements.
• Multiple Instruction streams over a Single Data stream (MISD): Systolic arrays for pipelined execution.
• Multiple Instruction streams over Multiple Data streams (MIMD): Parallel computers:
• Shared memory multiprocessors.
• Multicomputers: Unshared distributed memory; message passing used instead.
Flynn’s Classification of Computer Architecture
(Fig. 1.3, page 12, in Advanced Computer Architecture: Parallelism, Scalability, Programmability, Hwang, 1993.)
Current Trends In Parallel Architectures
• The extension of “computer architecture” to support communication and cooperation:
• OLD: Instruction Set Architecture.
• NEW: Communication Architecture.
• Defines:
• Critical abstractions, boundaries, and primitives (interfaces).
• Organizational structures that implement interfaces (hardware or software).
• Compilers, libraries and OS are important bridges today.
Modern Parallel Architecture Layered Framework
• Parallel applications: CAD, database, scientific modeling, ...
• Programming models: multiprogramming, shared address space, message passing, data parallel.
• Communication abstraction (user/system boundary), realized by compilation or libraries.
• Operating systems support.
• Communication hardware (hardware/software boundary).
• Physical communication medium.
Shared Address Space Parallel Architectures
• Any processor can directly reference any memory location.
• Communication occurs implicitly as a result of loads and stores.
• Convenient:
• Location transparency.
• Similar programming model to time-sharing on uniprocessors, except processes run on different processors.
• Good throughput on multiprogrammed workloads.
• Naturally provided on a wide range of platforms.
• Wide range of scale: few to hundreds of processors.
• Popularly known as shared memory machines or model.
• Ambiguous: memory may be physically distributed among processors.
Shared Address Space (SAS) Model
• Process: virtual address space plus one or more threads of control.
• Portions of the address spaces of processes are shared.
• Writes to shared addresses are visible to other threads (in other processes too).
• Natural extension of the uniprocessor model:
• Conventional memory operations used for communication.
• Special atomic operations needed for synchronization.
• OS uses shared memory to coordinate processes.
(Figure: virtual address spaces of processes P0..Pn communicating via a shared portion of the machine physical address space; each process also has a private portion.)
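As a minimal illustration of the SAS model (added here, not from the slides), the C sketch below uses POSIX threads: the threads communicate simply by loading and storing a shared variable, and a mutex provides the synchronization operation the slide mentions. The name shared_counter is an illustrative choice.

    #include <pthread.h>
    #include <stdio.h>

    /* Shared portion of the address space: ordinary variables. */
    static long shared_counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);   /* synchronization operation          */
            shared_counter++;            /* communication via an ordinary store */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("shared_counter = %ld\n", shared_counter);  /* expected 400000 */
        return 0;
    }

Compile with, e.g., cc -pthread; no explicit data transfer appears anywhere, which is exactly the point of the shared-address-space abstraction.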
Models of Shared-Memory Multiprocessors
• The Uniform Memory Access (UMA) Model:
• The physical memory is shared by all processors.
• All processors have equal access to all memory addresses.
• Distributed memory or Non-Uniform Memory Access (NUMA) Model:
• Shared memory is physically distributed locally among processors.
• The Cache-Only Memory Architecture (COMA) Model:
• A special case of a NUMA machine where all distributed main memory is converted to caches.
• No memory hierarchy at each processor node.
Models of Shared-Memory Multiprocessors
(Diagrams of the three models; P: Processor, M: Memory, C: Cache, D: Cache directory; interconnect: bus, crossbar, or multistage network.)
• UMA: processors with private caches connected through an interconnect to shared memory modules and I/O controllers.
• NUMA (distributed memory): each processor has local memory; the nodes are connected by a network.
• COMA: each node holds a processor, cache and cache directory; all memory is organized as caches attached to the network.
Uniform Memory Access Example: Intel Pentium Pro Quad
• All coherence and multiprocessing glue is in the processor module.
• Highly integrated, targeted at high volume.
• Low latency and bandwidth.
(Figure: four P-Pro modules, each with CPU, 256-KB L2 cache, interrupt controller and bus interface, on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), plus a memory controller to 1-, 2-, or 4-way interleaved DRAM and PCI bridges to PCI I/O cards.)
Uniform Memory Access Example: SUN Enterprise
• 16 cards of either type: processors + memory, or I/O.
• All memory is accessed over the bus, so the system is symmetric.
• Higher bandwidth, higher latency bus.
(Figure: CPU/memory cards, each with two processors, their caches and a memory controller, and I/O cards with SBUS slots, FiberChannel, 100bT and SCSI, all on the Gigaplane bus: 256-bit data, 41-bit address, 83 MHz.)
Distributed Shared-Memory Multiprocessor System Example: Cray T3E
• Scales up to 1024 processors with 480 MB/s links.
• The memory controller generates communication requests for nonlocal references.
• No hardware mechanism for coherence (SGI Origin and others provide this).
(Figure: each node contains a processor with cache, local memory, and a memory controller / network interface connected to a 3D (X, Y, Z) switch network and external I/O.)
Message-Passing Multicomputers
• Comprised of multiple autonomous computers (nodes).
• Each node consists of a processor, local memory, attached storage and I/O peripherals.
• Programming model is more removed from basic hardware operations.
• Local memory is only accessible by the local processor.
• A message-passing network provides point-to-point static connections among the nodes.
• Inter-node communication is carried out by message passing through the static connection network.
• Process communication is achieved using a message-passing programming environment.
Message-Passing Abstraction
• Send specifies the buffer to be transmitted and the receiving process.
• Recv specifies the sending process and the application storage to receive into.
• Memory-to-memory copy, but processes must be named.
• Optional tag on send and matching rule on receive.
• The user process names local data and entities in process/tag space too.
• In its simplest form, the send/recv match achieves a pairwise synchronization event.
• Many overheads: copying, buffer management, protection.
(Figure: process P executes Send(X, Q, t) from address X in its local address space; process Q executes Receive(Y, P, t) into address Y; the tag t is used to match the two.)
Message-Passing Example: IBM SP-2
• Made out of essentially complete RS6000 workstations.
• Network interface integrated in the I/O bus (bandwidth limited by the I/O bus).
(Figure: each SP-2 node has a Power 2 CPU, L2 cache and memory controller with 4-way interleaved DRAM on the memory bus, and a NIC (i860, DMA, NI, DRAM) on the MicroChannel I/O bus; nodes are joined by a general interconnection network formed from 8-port switches.)
Message-Passing Example: Intel Paragon
• Each Paragon node has two i860 processors with L1 caches, a memory controller with 4-way interleaved DRAM on a 64-bit, 50 MHz memory bus, and a network interface with DMA.
• Nodes attach to every switch of a 2D grid network with 8-bit, 175 MHz, bidirectional links (e.g., Sandia’s Intel Paragon XP/S-based supercomputer).
Message-Passing Programming Tools
• Message-passing programming libraries include:
• Message Passing Interface (MPI):
• Provides a standard for writing concurrent message-passing programs.
• MPI implementations include parallel libraries used by existing programming languages.
• Parallel Virtual Machine (PVM):
• Enables a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource.
• PVM support software executes on each machine in a user-configurable pool and provides a computational environment for concurrent applications.
• User programs written, for example, in C, Fortran or Java access PVM through calls to PVM library routines.
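A small, self-contained sketch of the send/recv abstraction described above, using standard MPI calls (an illustrative example added here, not taken from the slides): rank 0 sends an array to rank 1, which receives it with a matching tag.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, x[4] = {1, 2, 3, 4}, y[4];
        const int tag = 99;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Send X to process 1 with tag t (cf. Send(X, Q, t)). */
            MPI_Send(x, 4, MPI_INT, 1, tag, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receive into Y from process 0 with a matching tag (cf. Receive(Y, P, t)). */
            MPI_Recv(y, 4, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d %d %d %d\n", y[0], y[1], y[2], y[3]);
        }

        MPI_Finalize();
        return 0;
    }

Run with, for example, mpirun -np 2 ./a.out; the explicit naming of processes and tags is exactly the extra burden (and control) that the message-passing model gives the programmer compared with the SAS model.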
Data Parallel Systems (SIMD in Flynn’s taxonomy)
• Programming model:
• Operations performed in parallel on each element of a data structure.
• Logically a single thread of control performs sequential or parallel steps.
• Conceptually, a processor is associated with each data element.
• Architectural model:
• Array of many simple, cheap processors, each with little memory.
• Processors don’t sequence through instructions; they are attached to a control processor that issues instructions.
• Specialized and general communication, cheap global synchronization.
• Some machines: Thinking Machines CM-1, CM-2 (and CM-5); Maspar MP-1 and MP-2.
(Figure: a control processor driving a grid of PEs.)
Dataflow Architectures
• Represent computation as a graph of essential dependences.
• A logical processor at each node is activated by the availability of its operands.
• A message (token) carrying the tag of the next instruction is sent to the next processor.
• The tag is compared with others in a matching store; a match fires execution.
• Example dataflow graph: a = (b + 1) * (b - c); d = c * e; f = a * d.
(Figure: token queue, matching store, instruction fetch from program store, execute, form token, network.)
Systolic Architectures
• Replace a single processor with an array of regular processing elements.
• Orchestrate data flow for high throughput with less memory access.
• Different from pipelining: nonlinear array structure, multidirectional data flow, and each PE may have a (small) local instruction and data memory.
• Different from SIMD: each PE may do something different.
• Initial motivation: VLSI enables inexpensive special-purpose chips.
• Represent algorithms directly by chips connected in a regular pattern.
Parallel Programs
• Conditions of Parallelism: Data Dependence, Control Dependence, Resource Dependence, Bernstein’s Conditions.
• Asymptotic Notations for Algorithm Analysis.
• Parallel Random-Access Machine (PRAM); example: sum algorithm on a p-processor PRAM.
• Network Model of Message-Passing Multicomputers; example: asynchronous matrix-vector product on a ring.
• Levels of Parallelism in Program Execution.
• Hardware vs. Software Parallelism.
• Parallel Task Grain Size.
• Example motivating problems with high levels of concurrency.
• Limited Concurrency: Amdahl’s Law.
• Parallel Performance Metrics: Degree of Parallelism (DOP), Concurrency Profile.
• Steps in Creating a Parallel Program: Decomposition, Assignment, Orchestration, Mapping.
• Program Partitioning Example.
• Static Multiprocessor Scheduling Example.
Conditions of Parallelism: Data Dependence
1. True Data (Flow) Dependence: Statement S2 is data dependent on statement S1 if an execution path exists from S1 to S2 and at least one output variable of S1 feeds in as an input operand used by S2 (denoted S1 -> S2).
2. Antidependence: Statement S2 is antidependent on S1 if S2 follows S1 in program order and the output of S2 overlaps the input of S1.
3. Output dependence: Two statements are output dependent if they produce the same output variable.
Conditions of Parallelism: Data Dependence
4. I/O dependence: Read and write are I/O statements. I/O dependence occurs not because the same variable is involved but because the same file is referenced by both I/O statements.
5. Unknown dependence, for example when:
• The subscript of a variable is itself subscripted (indirect addressing).
• The subscript does not contain the loop index.
• A variable appears more than once with subscripts having different coefficients of the loop variable.
• The subscript is nonlinear in the loop index variable.
Data and I/O Dependence: Examples
Example 1:
S1: Load R1, A
S2: Add R2, R1
S3: Move R1, R3
S4: Store B, R1
(Dependence graph: flow dependences S1 -> S2 and S3 -> S4 on R1; antidependence between S2 and S3, since S3 overwrites the R1 value S2 reads; output dependence between S1 and S3, which both write R1.)
Example 2:
S1: Read (4), A(I)   /Read array A from tape unit 4/
S2: Rewind (4)       /Rewind tape unit 4/
S3: Write (4), B(I)  /Write array B into tape unit 4/
S4: Rewind (4)       /Rewind tape unit 4/
I/O dependence between S1 and S3 is caused by the read and write statements accessing the same file.
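The same dependence types can be seen in an ordinary C fragment (an illustrative example added here, not from the slides); the comments name the dependences that limit reordering or parallel execution of the statements:

    void deps(int a[], const int b[], int n)
    {
        int t = 0;
        for (int i = 1; i < n; i++) {
            a[i] = a[i - 1] + 1;  /* S1: loop-carried flow dependence (a[i-1] written in the previous iteration) */
            t    = a[i] * 2;      /* S2: flow dependence on S1 (reads the a[i] that S1 wrote)                    */
            a[i] = b[i];          /* S3: antidependence with S2 (writes a[i] after S2 read it) and               */
                                  /*     output dependence with S1 (both write a[i])                             */
        }
        (void)t;
    }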
Conditions of Parallelism
• Control Dependence:
• The order of execution cannot be determined before runtime due to conditional statements.
• Resource Dependence:
• Concerned with conflicts in using shared resources, including functional units (integer, floating point) and memory areas, among parallel tasks.
• Bernstein’s Conditions: Two processes P1, P2 with input sets I1, I2 and output sets O1, O2 can execute in parallel (denoted P1 || P2) if:
I1 ∩ O2 = ∅,  I2 ∩ O1 = ∅,  O1 ∩ O2 = ∅
Bernstein’s Conditions: An Example
• For the following instructions P1, ..., P5 in program order, assume each instruction requires one step to execute and two adders are available:
P1: C = D × E
P2: M = G + C
P3: A = B + C
P4: C = L + M
P5: F = G ÷ E
• Using Bernstein’s conditions after checking statement pairs: P1 || P5, P2 || P3, P2 || P5, P3 || P5, P4 || P5.
• Sequential execution takes five steps; parallel execution takes three steps, assuming two adders are available per step.
(Dependence graph: data dependences shown as solid lines, resource dependences as dashed lines.)
Asymptotic Notations for Algorithm Analysis
• Asymptotic analysis of the computing time f(n) of an algorithm ignores constant execution factors and concentrates on determining the order of magnitude of algorithm performance.
• Upper bound (used in worst-case analysis of algorithm performance):
f(n) = O(g(n)) iff there exist two positive constants c and n0 such that |f(n)| ≤ c|g(n)| for all n > n0, i.e. g(n) is an upper bound on f(n).
• O(1) < O(log n) < O(n) < O(n log n) < O(n^2) < O(n^3) < O(2^n)
Asymptotic Notations for Algorithm Analysis
• Lower bound (used in the analysis of the lower limit of algorithm performance):
f(n) = Ω(g(n)) if there exist positive constants c, n0 such that |f(n)| ≥ c|g(n)| for all n > n0, i.e. g(n) is a lower bound on f(n).
• Tight bound (used in finding a tight limit on algorithm performance):
f(n) = Θ(g(n)) if there exist positive constants c1, c2, and n0 such that c1|g(n)| ≤ |f(n)| ≤ c2|g(n)| for all n > n0, i.e. g(n) is both an upper and a lower bound on f(n).
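The three definitions above, restated in LaTeX for reference (same constants and conditions as in the slides):

    f(n) = O(g(n))      \iff \exists\, c, n_0 > 0:\ |f(n)| \le c\,|g(n)|\ \ \forall n > n_0
    f(n) = \Omega(g(n)) \iff \exists\, c, n_0 > 0:\ |f(n)| \ge c\,|g(n)|\ \ \forall n > n_0
    f(n) = \Theta(g(n)) \iff \exists\, c_1, c_2, n_0 > 0:\ c_1|g(n)| \le |f(n)| \le c_2|g(n)|\ \ \forall n > n_0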
The Growth Rate of Common Computing Functions
log n   n     n log n   n^2     n^3      2^n
0       1     0         1       1        2
1       2     2         4       8        4
2       4     8         16      64       16
3       8     24        64      512      256
4       16    64        256     4096     65536
5       32    160       1024    32768    4294967296
Theoretical Models of Parallel Computers
• Parallel Random-Access Machine (PRAM):
• n-processor, global shared-memory model.
• Models idealized parallel computers with zero synchronization or memory access overhead.
• Used in parallel algorithm development and in scalability and complexity analysis.
• PRAM variants (more realistic models than pure PRAM):
• EREW-PRAM: Simultaneous memory reads or writes to/from the same memory location are not allowed.
• CREW-PRAM: Simultaneous memory writes to the same location are not allowed (concurrent reads are).
• ERCW-PRAM: Simultaneous reads from the same memory location are not allowed (concurrent writes are).
• CRCW-PRAM: Concurrent reads and writes to/from the same memory location are allowed.
Example: Sum Algorithm on a p-Processor PRAM
• Input: array A of size n = 2^k in shared memory.
• Initialized local variables: the order n; the number of processors p = 2^q ≤ n; the processor number s.
• Output: the sum of the elements of A, stored in shared memory.
begin
1. for j = 1 to l (= n/p) do
       Set B(l(s - 1) + j) := A(l(s - 1) + j)
2. for h = 1 to log n do
   2.1 if (k - h - q ≥ 0) then
           for j = 2^(k-h-q)(s - 1) + 1 to 2^(k-h-q)s do
               Set B(j) := B(2j - 1) + B(2j)
   2.2 else {if (s ≤ 2^(k-h)) then Set B(s) := B(2s - 1) + B(2s)}
3. if (s = 1) then set S := B(1)
end
Running time analysis:
• Step 1 takes O(n/p): each processor executes n/p operations.
• The h-th iteration of step 2 takes O(n/(2^h p)), since each processor has to perform that many operations.
• Step 3 takes O(1).
• Total running time: T(n) = O(n/p + log n).
Example: Sum Algorithm on a p-Processor PRAM (n = 8, p = 4)
• Processor allocation for computing the sum of 8 elements on a 4-processor PRAM over 5 time units.
• Each processor first copies its two elements of A into B (B(i) = A(i)); the partial sums are then combined in a binary tree: P1..P4 produce the first-level sums, P1 and P2 the second level, P1 the third, and finally P1 sets S = B(1).
• The operation represented by a node of the tree is executed by the processor indicated below that node.
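A sequential C sketch of the same pairwise (tree) reduction, added here for illustration: on a real PRAM the inner loop over j is what the p processors would execute concurrently in each round.

    #include <stdio.h>

    /* Tree-style sum of n = 2^k elements, mirroring step 2 of the PRAM algorithm:
       in each round, B(j) = B(2j-1) + B(2j) over the surviving prefix of B.       */
    long tree_sum(long B[], int n)               /* B is 1-based: B[1..n] */
    {
        for (int len = n; len > 1; len /= 2)     /* one iteration per round h   */
            for (int j = 1; j <= len / 2; j++)   /* done in parallel on a PRAM  */
                B[j] = B[2 * j - 1] + B[2 * j];
        return B[1];                             /* S = B(1)                    */
    }

    int main(void)
    {
        long B[9] = {0, 1, 2, 3, 4, 5, 6, 7, 8}; /* n = 8, elements 1..8 */
        printf("sum = %ld\n", tree_sum(B, 8));   /* prints 36            */
        return 0;
    }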
The Power of The PRAM Model
• Well-developed techniques and algorithms exist to handle many computational problems in the PRAM model.
• Removes algorithmic details regarding synchronization and communication, concentrating on the structural properties of the problem.
• Captures several important parameters of parallel computations: operations performed in unit time, as well as processor allocation.
• The PRAM design paradigms are robust, and many network algorithms can be derived directly from PRAM algorithms.
• It is possible to incorporate synchronization and communication into the shared-memory PRAM model.
Performance of Parallel Algorithms
• The performance of a parallel algorithm is typically measured in terms of worst-case analysis.
• For problem Q with a PRAM algorithm that runs in time T(n) using P(n) processors, for an instance of size n:
• The time-processor product C(n) = T(n) · P(n) represents the cost of the parallel algorithm.
• For p < P(n), each of the T(n) parallel steps is simulated in O(P(n)/p) substeps; the total simulation takes O(T(n)P(n)/p).
• The following four measures of performance are asymptotically equivalent:
• P(n) processors and T(n) time.
• C(n) = P(n)T(n) cost and T(n) time.
• O(T(n)P(n)/p) time for any number of processors p < P(n).
• O(C(n)/p + T(n)) time for any number of processors.
Network Model of Message-Passing Multicomputers
• A network of processors can be viewed as a graph G(N, E):
• Each node i ∈ N represents a processor.
• Each edge (i, j) ∈ E represents a two-way communication link between processors i and j.
• Each processor is assumed to have its own local memory; no shared memory is available.
• Operation is synchronous or asynchronous (message passing).
• Typical message-passing communication constructs:
• send(X, i): a copy of X is sent to processor Pi, and execution continues.
• receive(Y, j): execution is suspended until the data from processor Pj is received and stored in Y, then execution resumes.
Network Model of Multicomputers
• Routing is concerned with delivering each message from source to destination over the network.
• Additional important network topology parameters:
• The network diameter: the maximum distance between any pair of nodes.
• The maximum degree of any node in G.
• Example: Linear array: p processors P1, ..., Pp are connected in a linear array where:
• Processor Pi is connected to Pi-1 and Pi+1 if they exist.
• Diameter is p - 1; maximum degree is 2.
• A ring is a linear array of processors in which P1 and Pp are also directly connected.
A Four-Dimensional Hypercube
• Two processors are connected if their binary indices differ in exactly one bit position.
• A four-dimensional hypercube has 16 nodes, each of degree 4, and diameter 4.
Example: Asynchronous Matrix-Vector Product on a Ring
• Input:
• n × n matrix A; vector x of order n.
• The processor number i; the number of processors p.
• The i-th submatrix B = A(1:n, (i-1)r + 1 : ir) of size n × r, where r = n/p.
• The i-th subvector w = x((i-1)r + 1 : ir) of size r.
• Output:
• Processor Pi computes the vector y = A1x1 + ... + Aixi and passes the result to the right.
• Upon completion, P1 will hold the product Ax.
Begin
1. Compute the matrix-vector product z = Bw
2. If i = 1 then set y := 0 else receive(y, left)
3. Set y := y + z
4. send(y, right)
5. if i = 1 then receive(y, left)
End
• Tcomp = k(n^2/p)
• Tcomm = p(l + mn)
• T = Tcomp + Tcomm = k(n^2/p) + p(l + mn)
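One possible MPI rendering of this ring algorithm in C (an illustrative sketch, not from the slides): it assumes the column block B and subvector w have already been distributed, rank 0 plays the role of P1, and the helper name local_matvec is hypothetical.

    #include <mpi.h>
    #include <stdlib.h>

    /* z = B * w, where B is n x r, stored row-major. */
    static void local_matvec(int n, int r, const double *B, const double *w, double *z)
    {
        for (int row = 0; row < n; row++) {
            z[row] = 0.0;
            for (int col = 0; col < r; col++)
                z[row] += B[row * r + col] * w[col];
        }
    }

    /* Each of the p ranks holds its column block B (n x r) and subvector w (r).
       On return, rank 0 holds y = A x, accumulated around the ring.             */
    void ring_matvec(int n, int r, const double *B, const double *w, double *y)
    {
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        int left  = (rank - 1 + p) % p;              /* ring neighbours */
        int right = (rank + 1) % p;

        double *z = malloc(n * sizeof *z);
        local_matvec(n, r, B, w, z);                 /* step 1: z = B w */

        if (rank == 0)
            for (int i = 0; i < n; i++) y[i] = 0.0;  /* step 2: y = 0   */
        else
            MPI_Recv(y, n, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int i = 0; i < n; i++) y[i] += z[i];    /* step 3: y = y + z */
        MPI_Send(y, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);   /* step 4 */

        if (rank == 0)                               /* step 5: P1 receives Ax */
            MPI_Recv(y, n, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        free(z);
    }

The communication pattern is a single pass around the ring, which is where the p(l + mn) communication term in the slide comes from.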
Creating a Parallel Program
• Assumption: a sequential algorithm to solve the problem is given, or a different algorithm with more inherent parallelism is devised.
• Most programming problems have several parallel solutions; the best solution may differ from that suggested by existing sequential algorithms. One must:
• Identify work that can be done in parallel.
• Partition work and perhaps data among processes.
• Manage data access, communication and synchronization.
• Note: work includes computation, data access and I/O.
• Main goal: speedup (plus low programming effort and resource needs):
Speedup(p) = Performance(p) / Performance(1)
For a fixed problem: Speedup(p) = Time(1) / Time(p)
Some Important Concepts
• Task:
• Arbitrary piece of undecomposed work in a parallel computation.
• Executed sequentially on a single processor; concurrency is only across tasks.
• E.g. a particle/cell in Barnes-Hut, a ray or ray group in Raytrace.
• Fine-grained versus coarse-grained tasks.
• Process (thread):
• Abstract entity that performs the tasks assigned to it.
• Processes communicate and synchronize to perform their tasks.
• Processor:
• Physical engine on which a process executes.
• Processes virtualize the machine to the programmer: first write the program in terms of processes, then map processes to processors.
Levels of Parallelism in Program Execution
• Level 5 (coarse grain): Jobs or programs (multiprogramming).
• Level 4 (coarse grain): Subprograms, job steps or related parts of a program.
• Level 3 (medium grain): Procedures, subroutines, or co-routines.
• Level 2 (fine grain): Non-recursive loops or unfolded iterations.
• Level 1 (fine grain): Instructions or statements.
• Lower levels (finer grain) offer a higher degree of parallelism but increasing communications demand and mapping/scheduling overhead.
Hardware and Software Parallelism
• Hardware parallelism:
• Defined by machine architecture, hardware multiplicity (number of processors available) and connectivity.
• Often a function of cost/performance tradeoffs.
• Characterized in a single processor by the number of instructions k issued in a single cycle (k-issue processor).
• A multiprocessor system with n k-issue processors can handle a maximum of nk threads.
• Software parallelism:
• Defined by the control and data dependences of programs.
• Revealed in program profiling or the program flow graph.
• A function of algorithm, programming style and compiler optimization.
Computational Parallelism and Grain Size
• Grain size (granularity) is a measure of the amount of computation involved in a task in a parallel computation.
• Instruction level:
• At the instruction or statement level; grain size of about 20 instructions or less.
• For scientific applications, parallelism at this level ranges from 500 to 3000 concurrent statements.
• Manual parallelism detection is difficult but is assisted by parallelizing compilers.
• Loop level:
• Iterative loop operations; typically 500 instructions or less per iteration.
• Optimized on vector parallel computers.
• Independent successive loop operations can be vectorized or run in SIMD mode.
Computational Parallelism and Grain Size
• Procedure level:
• Medium-size grain: task, procedure and subroutine levels; less than 2000 instructions.
• Detection of parallelism is more difficult than at finer-grain levels.
• Less communication required than fine-grain parallelism.
• Relies heavily on effective operating system support.
• Subprogram level:
• Job and subprogram level; thousands of instructions per grain.
• Often scheduled on message-passing multicomputers.
• Job (program) level, or multiprogramming:
• Independent programs executed on a parallel computer.
• Grain size in tens of thousands of instructions.
Example Motivating Problems: Simulating Ocean Currents
• Model the ocean as two-dimensional grids (cross sections of the basin).
• Discretize in space and time: finer spatial and temporal resolution => greater accuracy.
• Many different computations per time step: set up and solve equations.
• Concurrency across and within grid computations.
Example Motivating Problems: Simulating Galaxy Evolution
• Simulate the interactions of many stars evolving over time.
• Computing forces is expensive: an O(n^2) brute-force approach.
• Hierarchical methods take advantage of the force law F = G m1 m2 / r^2: a small group far enough away can be approximated by its center of mass, while stars too close must be treated individually.
• Many time-steps, and plenty of concurrency across stars within one.
Example Motivating Problems: Rendering Scenes by Ray Tracing
• Shoot rays into the scene through pixels in the image plane.
• Follow their paths: they bounce around as they strike objects, generating new rays (a ray tree per input ray).
• The result is a color and opacity for each pixel.
• Parallelism across rays.
• All of the above case studies have abundant concurrency.
Limited Concurrency: Amdahl’s Law
• The most fundamental limitation on parallel speedup.
• If a fraction s of the sequential execution is inherently serial, then speedup ≤ 1/s.
• Example: a 2-phase calculation:
• Sweep over an n-by-n grid and do some independent computation.
• Sweep again and add each value into a global sum.
• Time for the first phase = n^2/p.
• The second phase is serialized at the global variable, so its time = n^2.
• Speedup ≤ 2n^2 / (n^2/p + n^2), i.e. at most 2.
• Possible trick: divide the second phase into two:
• Accumulate into a private sum during the sweep.
• Add the per-process private sums into the global sum.
• Parallel time is then n^2/p + n^2/p + p, and speedup is at best 2n^2 / (2n^2/p + p^2).
Amdahl’s Law Example: A Pictorial Depiction
(Figure: (a) the sequential version does n^2 work twice; (b) with the naive parallelization, n^2/p of concurrent work is followed by an n^2 serial phase; (c) with private sums, two n^2/p concurrent phases are followed by only p serial work.)
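For reference, the general form of Amdahl's law (a standard result, not spelled out on the slide) together with the two example bounds above:

    \text{Speedup}(p) \;\le\; \frac{1}{\,s + (1-s)/p\,} \;\le\; \frac{1}{s},
    \qquad
    \text{Speedup}_{\text{naive}} \;\le\; \frac{2n^2}{\,n^2/p + n^2\,} < 2,
    \qquad
    \text{Speedup}_{\text{private sums}} \;\le\; \frac{2n^2}{\,2n^2/p + p^2\,}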
Parallel Performance Metrics: Degree of Parallelism (DOP)
• For a given time period, DOP reflects the number of processors in a specific parallel computer actually executing a particular parallel program.
• Average parallelism, given maximum parallelism m, n homogeneous processors, and computing capacity Δ of a single processor:
• Total amount of work (instructions or computations) performed over [t1, t2]:
W = Δ ∫ from t1 to t2 of DOP(t) dt, or in discrete form W = Δ Σ (i = 1..m) i · t_i,
where t_i is the total time during which DOP = i, and Σ (i = 1..m) t_i = t2 - t1.
• The average parallelism A:
A = (1/(t2 - t1)) ∫ from t1 to t2 of DOP(t) dt, or in discrete form A = (Σ (i = 1..m) i · t_i) / (Σ (i = 1..m) t_i).
Example: Concurrency Profile of a Divide-and-Conquer Algorithm
• Execution observed from t1 = 2 to t2 = 27.
• Peak parallelism m = 8.
• A = (1×5 + 2×3 + 3×4 + 4×6 + 5×2 + 6×2 + 8×3) / (5 + 3 + 4 + 6 + 2 + 2 + 3) = 93/25 = 3.72.
(Figure: degree of parallelism plotted against time from t1 to t2.)
Parallel Performance Example
• The execution time T of three parallel programs is given in terms of processor count P and problem size N.
• In each case, we assume that the total computation work performed by an optimal sequential algorithm scales as N + N^2.
1. First parallel algorithm: T = N + N^2/P.
This algorithm partitions the computationally demanding O(N^2) component but replicates the O(N) component on every processor. There are no other sources of overhead.
2. Second parallel algorithm: T = (N + N^2)/P + 100.
This algorithm optimally divides all the computation among all processors but introduces an additional cost of 100.
3. Third parallel algorithm: T = (N + N^2)/P + 0.6P^2.
This algorithm also partitions all the computation optimally but introduces an additional cost of 0.6P^2.
• All three algorithms achieve a speedup of about 10.8 when P = 12 and N = 100. However, they behave differently in other situations, as shown next.
• With N = 100, all three algorithms perform poorly for larger P, although Algorithm (3) does noticeably worse than the other two.
• When N = 1000, Algorithm (2) is much better than Algorithm (1) for larger P.
Parallel Performance Example (continued)
• All algorithms achieve speedup = 10.8 when P = 12 and N = 100.
• For N = 1000, Algorithm (2) performs much better than Algorithm (1) for larger P.
• Algorithm 1: T = N + N^2/P
• Algorithm 2: T = (N + N^2)/P + 100
• Algorithm 3: T = (N + N^2)/P + 0.6P^2
(Figure: speedup curves of the three algorithms versus P.)
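A few lines of C reproduce the quoted numbers (an illustrative check added here); the sequential time is taken to be N + N^2, so speedup = (N + N^2) / T:

    #include <stdio.h>

    int main(void)
    {
        double N = 100.0, P = 12.0;
        double seq = N + N * N;                     /* optimal sequential work */
        double T1 = N + N * N / P;                  /* algorithm 1 */
        double T2 = (N + N * N) / P + 100.0;        /* algorithm 2 */
        double T3 = (N + N * N) / P + 0.6 * P * P;  /* algorithm 3 */

        printf("speedup 1 = %.1f\n", seq / T1);     /* ~10.8 */
        printf("speedup 2 = %.1f\n", seq / T2);     /* ~10.7 */
        printf("speedup 3 = %.1f\n", seq / T3);     /* ~10.9 */
        return 0;
    }

Changing N and P in this snippet shows the behavior described above: the fixed cost of 100 in Algorithm 2 is amortized at large N, while the 0.6P^2 term in Algorithm 3 grows quickly with P.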
Steps in Creating a Parallel Program
• 4 steps: Decomposition, Assignment, Orchestration, Mapping.
• Done by the programmer or by system software (compiler, runtime, ...).
• The issues are the same, so assume the programmer does it all explicitly.
(Figure: sequential computation → decomposition into tasks → assignment to processes p0..p3 → orchestration of the parallel program → mapping of processes to processors P0..P3.)
Decomposition
• Break up the computation into concurrent tasks to be divided among processes:
• Tasks may become available dynamically.
• The number of available tasks may vary with time.
• Together with assignment, also called partitioning, i.e. identify concurrency and decide the level at which to exploit it.
• Grain-size problem:
• Determine the number and size of grains (tasks) in a parallel program.
• Problem- and machine-dependent.
• Solutions involve tradeoffs between parallelism, communication and scheduling/synchronization overhead.
• Grain packing:
• Combine multiple fine-grain nodes into a coarse-grain node (task) to reduce communication delays and overall scheduling overhead.
• Goal: enough tasks to keep processes busy, but not too many.
• The number of tasks available at a time is an upper bound on achievable speedup.
Assignment
• Specifying mechanisms to divide work up among processes:
• Together with decomposition, also called partitioning.
• Balance the workload, reduce communication and management cost.
• Partitioning problem:
• Partition a program into parallel branches or modules so as to give the shortest possible execution time on a specific parallel architecture.
• Structured approaches usually work well:
• Code inspection (parallel loops) or understanding of the application.
• Well-known heuristics.
• Static versus dynamic assignment.
• As programmers, we worry about partitioning first:
• Usually independent of architecture or programming model.
• But the cost and complexity of using primitives may affect decisions.
Orchestration
• Naming data.
• Structuring communication.
• Synchronization.
• Organizing data structures and scheduling tasks temporally.
• Goals:
• Reduce the cost of communication and synchronization as seen by processors.
• Preserve locality of data reference (including data structure organization).
• Schedule tasks to satisfy dependences early.
• Reduce the overhead of parallelism management.
• Closest to the architecture (and programming model and language).
• Choices depend a lot on the communication abstraction and the efficiency of primitives.
• Architects should provide appropriate primitives efficiently.
Mapping
• Each task is assigned to a processor in a manner that attempts to satisfy the competing goals of maximizing processor utilization and minimizing communication costs.
• Mapping can be specified statically or determined at runtime by load-balancing algorithms (dynamic scheduling).
• Two aspects of mapping:
• Which processes will run on the same processor, if necessary.
• Which process runs on which particular processor (mapping to a network topology).
• One extreme: space-sharing:
• The machine is divided into subsets, with only one application at a time in a subset.
• Processes can be pinned to processors, or left to the OS.
• Another extreme: complete resource management control by the OS:
• The OS uses the performance techniques we will discuss later.
• The real world is between the two: the user specifies desires in some aspects, and the system may ignore them.
Program Partitioning Example
(Example 2.4, page 64; Fig. 2.6, page 65; Fig. 2.7, page 66 in Advanced Computer Architecture, Hwang.)
Static Multiprocessor Scheduling
• Dynamic multiprocessor scheduling is an NP-hard problem.
• Node duplication: to eliminate idle time and communication delays, some nodes may be duplicated in more than one processor.
(Fig. 2.8, page 67; Example 2.5, page 68 in Advanced Computer Architecture, Hwang.)
Table 2.1 Steps in the Parallelization Process and Their Goals
• Decomposition (architecture-dependent? mostly no): expose enough concurrency, but not too much.
• Assignment (architecture-dependent? mostly no): balance the workload; reduce communication volume.
• Orchestration (architecture-dependent? yes): reduce noninherent communication via data locality; reduce communication and synchronization cost as seen by the processor; reduce serialization at shared resources; schedule tasks to satisfy dependences early.
• Mapping (architecture-dependent? yes): put related processes on the same processor if necessary; exploit locality in the network topology.
Successive Refinement
• Partitioning is often independent of architecture, and may be done first:
• View the machine as a collection of communicating processors:
• Balancing the workload.
• Reducing the amount of inherent communication.
• Reducing extra work.
• These three issues are conflicting.
• Then deal with interactions with the architecture:
• View the machine as an extended memory hierarchy:
• Extra communication due to architectural interactions.
• The cost of communication depends on how it is structured.
• This may inspire changes in partitioning.
Partitioning for Performance
• Balancing the workload and reducing wait time at synchronization points.
• Reducing inherent communication.
• Reducing extra work.
• These algorithmic issues have extreme trade-offs:
• Minimize communication => run on 1 processor => extreme load imbalance.
• Maximize load balance => random assignment of tiny tasks => no control over communication.
• A good partition may imply extra work to compute or manage it.
• The goal is to compromise between the above extremes; fortunately, this is often not difficult in practice.
Load Balancing and Synchronization Wait Time Reduction
• Limit on speedup:
Speedup_problem(p) ≤ Sequential Work / Max Work on any Processor
• Work includes data access and other costs.
• It is not enough to have equal work: processors must also be busy at the same time.
• Four parts to load balancing and reducing synchronization wait time:
1. Identify enough concurrency.
2. Decide how to manage it.
3. Determine the granularity at which to exploit it.
4. Reduce serialization and the cost of synchronization.
Managing Concurrency
• Static versus dynamic techniques.
• Static:
• Algorithmic assignment based on input; it won’t change.
• Low runtime overhead.
• Computation must be predictable.
• Preferable when applicable (except in multiprogrammed/heterogeneous environments).
• Dynamic:
• Adapt at runtime to balance the load.
• Can increase communication and reduce locality.
• Can increase task-management overheads.
Dynamic Load Balancing
• To achieve the best performance of a parallel computing system running a parallel problem, it is essential to maximize processor utilization by distributing the computation load evenly, i.e. balancing the load among the available processors.
• Optimal static load balancing (optimal mapping or scheduling) is an intractable NP-complete problem, except for specific problems on specific networks.
• Hence heuristics are usually used to select processors for processes.
• Even the best static mapping may not give the best execution time, due to changing conditions at runtime, so load balancing may need to be done dynamically.
• The methods used for balancing the computational load dynamically among processors can be broadly classified as:
1. Centralized dynamic load balancing.
2. Decentralized dynamic load balancing.
Dynamic Tasking with Task Queues
• Centralized versus distributed queues.
• Task stealing with distributed queues:
• Can compromise communication and locality, and increase synchronization.
• Whom to steal from, how many tasks to steal, ...
• Termination detection.
• Maximum imbalance is related to the size of a task.
(Figure: (a) a centralized task queue Q into which all processes insert tasks and from which all remove tasks; (b) distributed task queues Q0..Q3, one per process; P0..P3 insert into and remove from their own queues, and other processes may steal.)
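A minimal C sketch of the centralized variant using POSIX threads (illustrative only; the task representation and names such as pop_task are hypothetical): workers repeatedly pop tasks from one shared, lock-protected queue until it is empty, which is the dynamic assignment the slide describes.

    #include <pthread.h>
    #include <stdio.h>

    #define NTASKS   64
    #define NWORKERS  4

    /* Centralized task queue: an array of task ids plus a shared index. */
    static int tasks[NTASKS];
    static int next_task = 0;                 /* index of the next task to hand out */
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

    static int pop_task(void)                 /* returns a task id, or -1 when empty */
    {
        pthread_mutex_lock(&qlock);
        int t = (next_task < NTASKS) ? tasks[next_task++] : -1;
        pthread_mutex_unlock(&qlock);
        return t;
    }

    static void *worker(void *arg)
    {
        long done = 0;
        for (int t; (t = pop_task()) != -1; ) /* dynamic assignment: grab whatever is next */
            done += t;                        /* stand-in for real work on task t */
        printf("worker %ld processed task-id sum %ld\n", (long)arg, done);
        return NULL;
    }

    int main(void)
    {
        for (int i = 0; i < NTASKS; i++) tasks[i] = i;
        pthread_t th[NWORKERS];
        for (long i = 0; i < NWORKERS; i++) pthread_create(&th[i], NULL, worker, (void *)i);
        for (int i = 0; i < NWORKERS; i++)  pthread_join(th[i], NULL);
        return 0;
    }

The single lock on the central queue is exactly the serialization that the distributed, task-stealing variant in the figure is designed to avoid.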
Implications of Load Balancing
• Extends the speedup limit expression to:
Speedup_problem(p) ≤ Sequential Work / Max (Work + Synch Wait Time)
• Generally the responsibility of software.
• Architecture can support task stealing and synchronization efficiently:
• Fine-grained communication, low-overhead access to queues.
• Efficient support allows smaller tasks and better load balancing.
• Naming logically shared data in the presence of task stealing:
• Need to access the data of stolen tasks, especially multiply-stolen tasks => a hardware shared address space is advantageous.
• Efficient support for point-to-point communication.
Reducing Inherent Communication
• Measure: communication-to-computation ratio.
• The focus here is on inherent communication:
• Determined by the assignment of tasks to processes.
• Actual communication can be greater.
• Assign tasks that access the same data to the same process.
• An optimal solution that minimizes communication and achieves an optimal load balance is NP-hard in the general case.
• Simple heuristic solutions work well in practice, due to the specific structure of applications.
Implications of Communication-to-Computation Ratio
• Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost)
• Architects must examine application needs:
• If the denominator is execution time, the ratio gives average bandwidth needs.
• If it is an operation count, it gives the extremes in the impact of latency and bandwidth:
• Latency: assume no latency hiding.
• Bandwidth: assume all latency is hidden.
• Reality is somewhere in between.
• The actual impact of communication depends on its structure and cost as well.
• Communication also needs to be kept balanced across processors.
Reducing Extra Work (Overheads)
• Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost + Extra Work)
• Common sources of extra work:
• Computing a good partition, e.g. partitioning in Barnes-Hut or for a sparse matrix.
• Using redundant computation to avoid communication.
• Task, data and process management overhead (applications, languages, runtime systems, OS).
• Imposing structure on communication (coalescing messages, allowing effective naming).
• Architectural implications:
• Reduce the need for extra work by making communication and orchestration efficient.
Extended Memory-Hierarchy View of Multiprocessors
• Levels in the extended hierarchy:
• Registers, caches, local memory, remote memory (topology).
• Glued together by the communication architecture.
• Levels communicate at a certain granularity of data transfer.
• Need to exploit spatial and temporal locality in the hierarchy:
• Otherwise extra communication may also be caused.
• Especially important since communication is expensive.
Extended Hierarchy
• Idealized view: local cache hierarchy plus a single main memory.
• But reality is more complex:
• Centralized memory: caches of other processors.
• Distributed memory: some local, some remote, plus the network topology.
• Management of levels:
• Caches are managed by hardware.
• Main memory depends on the programming model:
• SAS: data movement between local and remote is transparent.
• Message passing: explicit.
• Improve performance through architecture or program locality.
• Tradeoff with parallelism: need both good node performance and parallelism.
Artifactual Communication in Extended Hierarchy
• Accesses not satisfied in the local portion of the hierarchy cause communication.
• Inherent communication, implicit or explicit, causes transfers; it is determined by the program.
• Artifactual communication is determined by the program implementation and architectural interactions:
• Poor allocation of data across distributed memories.
• Unnecessary data in a transfer.
• Unnecessary transfers due to system granularities.
• Redundant communication of data.
• Finite replication capacity (in cache or main memory).
• Inherent communication assumes unlimited capacity, small transfers, and perfect knowledge of what is needed.
• More on artifactual communication later; first consider replication-induced communication further.
Structuring Communication
• Given an amount of communication (inherent or artifactual), the goal is to reduce its cost.
• Cost of communication as seen by a process:
C = f * (o + l + (n_c/m)/B + t_c - overlap)
• f = frequency of messages.
• o = overhead per message (at both ends).
• l = network delay per message.
• n_c = total data sent.
• m = number of messages.
• B = bandwidth along the path (determined by network, NI, assist).
• t_c = cost induced by contention per message.
• overlap = amount of latency hidden by overlap with computation or other communication.
• The portion in parentheses is the cost of a message (as seen by the processor); ignoring overlap, it is the latency of a message.
• Goal: reduce the terms in latency and increase overlap.
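The cost model is easy to turn into a few lines of C for back-of-the-envelope estimates (an illustrative helper added here, not part of the slides):

    /* Communication cost model: C = f * (o + l + (nc/m)/B + tc - overlap).
       Times in microseconds, nc in bytes, B in bytes per microsecond.      */
    double comm_cost(double f, double o, double l,
                     double nc, double m, double B,
                     double tc, double overlap)
    {
        double per_message = o + l + (nc / m) / B + tc - overlap;
        return f * per_message;
    }

    /* Example: 1000 messages, 5 us overhead, 2 us delay, 1 MB total data in
       1000 messages over a 1000 B/us (1 GB/s) path, 1 us contention, no overlap:
       comm_cost(1000, 5, 2, 1e6, 1000, 1000, 1, 0) = 1000 * 9 = 9000 us.          */

Plugging in different m values shows the coalescing effect discussed next: fewer, larger messages pay the per-message overhead o and delay l fewer times for the same total data n_c.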
Reducing Overhead
• Can reduce the number of messages m or the overhead per message o.
• o is usually determined by hardware or system software:
• The program should try to reduce m by coalescing messages.
• More control when communication is explicit.
• Coalescing data into larger messages:
• Easy for regular, coarse-grained communication.
• Can be difficult for irregular, naturally fine-grained communication:
• May require changes to the algorithm and extra work (coalescing data and determining what to send and to whom).
• Discussed further under implications for programming models later.
Reducing Network Delay
• Network delay component = f * h * t_h:
• h = number of hops traversed in the network.
• t_h = link + switch latency per hop.
• Reducing f: communicate less, or make messages larger.
• Reducing h:
• Map communication patterns to the network topology, e.g. nearest-neighbor on mesh and ring, all-to-all.
• How important is this?
• It used to be a major focus of parallel algorithms.
• It depends on the number of processors and on how t_h compares with the other components.
• It is less important on modern machines, where overheads, processor count and multiprogramming dominate.
Overlapping Communication
• Cannot afford to stall for high latencies.
• Overlap communication with computation or other communication to hide latency.
• Requires extra concurrency (slackness) and higher bandwidth.
• Techniques:
• Prefetching.
• Block data transfer.
• Proceeding past communication.
• Multithreading.
Summary of Tradeoffs
• Different goals often have conflicting demands:
• Load balance: fine-grain tasks; random or dynamic assignment.
• Communication: usually coarse-grain tasks; decompose to obtain locality, not randomly or dynamically.
• Extra work: coarse-grain tasks; simple assignment.
• Communication cost:
• Big transfers: amortize overhead and latency.
• Small transfers: reduce contention.
Relationship Between Perspectives
(Table relating each performance issue to the parallelization step(s) that address it and to the processor time component it shows up in:)
• Load imbalance and synchronization → synchronization wait time (decomposition/assignment/orchestration).
• Extra work → busy-overhead time (decomposition/assignment/orchestration).
• Inherent communication volume → data-remote access time (decomposition/assignment).
• Artifactual communication and data locality → data-local access time (orchestration).
• Communication structure → orchestration/mapping.
Summary
• Speedup_prob(p) = (Busy(1) + Data(1)) / (Busy_useful(p) + Data_local(p) + Synch(p) + Data_remote(p) + Busy_overhead(p))
• The goal is to reduce the denominator components.
• Both the programmer and the system have a role to play.
• Architecture cannot do much about load imbalance or too much communication, but it can:
• Reduce the incentive for creating ill-behaved programs (efficient naming, communication and synchronization).
• Reduce artifactual communication.
• Provide efficient naming for flexible assignment.
• Allow effective overlapping of communication.
Generic Distributed Memory Organization
• Each node: processor, cache, memory, and a communication assist (CA) / network interface connected through a switch to a scalable network; a node may itself be an O(10)-processor bus-based SMP.
• Design questions raised by this organization:
• Network bandwidth? Bandwidth demand for independent versus communicating processes?
• Latency? Does it grow as O(log2 P)? Cost scalability of the system?
• Multistage interconnection network (MIN) or custom-designed network? Network protocols?
• Custom-designed CPU? Node/system integration level: how far (Cray-on-a-chip, SMP-on-a-chip)? OS supported?
• Communication assist: extent of functionality? DMA for message transactions? Global virtual shared address space?