By GS kosta
INTRODUCTION TO PARALLEL PROCESSING
Parallel computer structures can be characterized as pipelined computers, array processors, and multiprocessor systems. Several newer computing concepts, including data flow and VLSI approaches, are also introduced.
1.1 EVOLUTION OF COMPUTER SYSTEMS
The evolution of computer systems has been physically marked by the rapid changing of building blocks: from relays and vacuum tubes (1940s-1950s), to discrete diodes and transistors (1950s-1960s), to small- and medium-scale integrated (SSI/MSI) circuits (1960s-1970s), and to large- and very-large-scale integrated (LSI/VLSI) devices (1970s and beyond). Increases in device speed and reliability and reductions in hardware cost and physical size have greatly enhanced computer performance.
1.1.1 Generations of Computer Systems
The first generation (1938-1953). The introduction of the first electronic analog computer in 1938 and the first electronic digital computer, ENIAC (Electronic Numerical Integrator and Computer), in 1946 marked the beginning of the first generation of computers. Electromechanical relays were used as switching devices in the 1940s, and vacuum tubes were used in the 1950s.
The second generation (1952-1963). Transistors were invented in 1948. The first transistorized digital computer, TRADIC, was built by Bell Laboratories in 1954. Discrete transistors and diodes were the building blocks: 800 transistors were used in TRADIC. Printed circuits appeared in this generation.
The third generation (1962-1975). This generation was marked by the use of small-scale integrated (SSI) and medium-scale integrated (MSI) circuits as the basic building blocks. Multilayered printed circuits were used. Core memory was still used in the CDC-6600 and other machines, but by 1968 many fast computers, like the CDC-7600, began to replace cores with solid-state memories.
The fourth generation (1972-present). Present-generation computers emphasize the use of large-scale integrated (LSI) circuits for both logic and memory sections. High-density packaging has appeared. High-level languages are being extended to handle both scalar and vector data, like the extended Fortran in many vector processors.
The future. Computers to be used in the 1990s may be the next generation. Very-large-scale integrated (VLSI) chips will be used along with high-density modular design. Multiprocessors, like the 16 processors in the S-1 project at Lawrence Livermore National Laboratory and in Denelcor's HEP, will be required. The Cray-2, expected to be delivered in 1985, is to have four processors. More than 1000 mega floating-point operations per second (megaflops) are expected in these future supercomputers.
1.1.2 Trends Towards Parallel Processing
According to Sidney Fernbach: "Today's large computers (mainframes) would have been considered 'supercomputers' 10 to 20 years ago. By the same token, today's supercomputers will be considered 'state-of-the-art' standard equipment 10 to 20 years from now." From an application point of view, the mainstream usage of computers is experiencing a trend of four ascending levels of sophistication:
• Data processing
• Information processing
• Knowledge processing
• Intelligence processing
We are in an era which is promoting the use of computers not only for conventional data-information processing, but also toward the building of workable machine knowledge-intelligence systems to advance human civilization. Many computer scientists feel that the degree of parallelism exploitable at the two highest processing levels should be higher than that at the data-information processing levels.
From an operating system point of view, computer systems have improved chronologically in four phases:
• Batch processing
• Multiprogramming
• Time sharing
• Multiprocessing
In these four operating modes, the degree of parallelism increases sharply from phase to phase. The general trend is to emphasize parallel processing of information. In what follows, the term information is used with an extended meaning to include data, information, knowledge, and intelligence. We formally define parallel processing as follows:
Parallel processing is an efficient form of information processing which emphasizes the exploitation of concurrent events in the computing process. Concurrency implies parallelism, simultaneity, and pipelining. Parallel events may occur in multiple resources during the same time interval; simultaneous events may occur at the same time instant; and pipelined events may occur in overlapped time spans.
Parallel processing can be pursued at four programmatic levels:
• Job or program level
• Task or procedure level
• Interinstruction level
• Intrainstruction level
The highest job level is often conducted algorithmically. The lowest intrainstruction level is often implemented directly by hardware means. Hardware roles increase from high to low levels; conversely, software implementations increase from low to high levels. The trade-off between hardware and software approaches to solving a problem is always a very controversial issue. As hardware cost declines and software cost increases, more and more hardware methods are replacing conventional software approaches. The trend is also supported by the increasing demand for faster real-time, resource-sharing, and fault-tolerant computing environments.
As far as parallel processing is concerned, the general architectural trend is shifting away from conventional uniprocessor systems to multiprocessor systems or to arrays of processing elements controlled by one uniprocessor. In all cases, a high degree of pipelining is being incorporated into the various system levels.
1.2 PARALLELISM IN UNIPROCESSOR SYSTEMS
1.2.1 Basic Uniprocessor Architecture
A typical uniprocessor computer consists of three major components: the main memory, the central processing unit (CPU), and the input-output (I/O) subsystem.
The architectures of two commercially available uniprocessor computers are given below to show the possible interconnection of structures among the three subsystems. We will examine the major components in the CPU and in the I/O subsystem.
1.2.2 Parallel Processing Mechanisms
A number of parallel processing mechanisms have been developed in uniprocessor computers. We identify them in the following six categories:
• Multiplicity of functional units
• Parallelism and pipelining within the CPU
• Overlapped CPU and I/O operations
• Use of a hierarchical memory system
• Balancing of subsystem bandwidths
• Multiprogramming and time sharing
Multiplicity of functional units The early computer had only one arithmetic and logic unit (ALU) in its CPU. Furthermore, the ALU could perform only one function at a time, a rather slow process for executing a long sequence of arithmetic-logic instructions. In practice, many of the functions of the ALU can be distributed to multiple, specialized functional units which can operate in parallel. The CDC-6600 (designed in 1964) has 10 functional units built into its CPU (Figure 1.5). These 10 units are independent of each other and may operate simultaneously. A scoreboard is used to keep track of the availability of the functional units and the registers being demanded. With 10 functional units and 24 registers available, the instruction issue rate can be significantly increased.
Another good example of a multifunction uniprocessor is the IBM 360/91 (1968), which has two parallel execution units.
Parallelism and pipelining within the CPU Parallel adders, using such techniques as carry-lookahead and carry-save, are now built into almost all ALUs. This is in contrast to the bit-serial adders used in first-generation machines. High-speed multiplier recoding and convergence division are techniques for exploiting parallelism and the sharing of hardware resources for the multiply and divide functions. The use of multiple functional units is a form of parallelism within the CPU. Various phases of instruction execution are now pipelined, including instruction fetch, decode, operand fetch, arithmetic-logic execution, and store result. To facilitate overlapped instruction execution through the pipe, instruction prefetch and data buffering techniques have been developed.
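As a rough, illustrative sketch (not from the original text), the following Python snippet compares the cycle count of executing k instructions on a non-pipelined unit versus an idealized 5-stage pipeline (fetch, decode, operand fetch, execute, store result), assuming one cycle per stage and no stalls:

```python
# Minimal sketch: cycle counts for sequential vs. pipelined instruction execution.
# Assumes an idealized 5-stage pipeline with one cycle per stage and no hazards.

STAGES = 5  # fetch, decode, operand fetch, execute, store result

def sequential_cycles(k: int) -> int:
    """Each instruction occupies the unit for all stages before the next starts."""
    return k * STAGES

def pipelined_cycles(k: int) -> int:
    """After the pipe fills (STAGES cycles), one instruction completes per cycle."""
    return STAGES + (k - 1) if k > 0 else 0

if __name__ == "__main__":
    for k in (1, 10, 100):
        s, p = sequential_cycles(k), pipelined_cycles(k)
        print(f"{k:>4} instructions: sequential={s:>4} cycles, "
              f"pipelined={p:>4} cycles, speedup={s / p:.2f}")
```

For long instruction streams the speedup approaches the number of pipeline stages, which is why instruction prefetch and data buffering (which keep the pipe full) matter so much.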
Overlapped CPU and I/O operations I/O operations can be performed simultaneously with CPU computations by using separate I/O controllers, channels, or I/O processors. The direct-memory-access (DMA) channel can be used to provide direct information transfer between the I/O devices and the main memory. The DMA is conducted on a cycle-stealing basis, which is transparent to the CPU.
Use of a hierarchical memory system Usually, the CPU is about 1000 times faster than memory access. A hierarchical memory system can be used to close up the speed gap. The computer memory hierarchy is conceptually illustrated in Figure 1.6. The innermost level is the register files, directly addressable by the ALU. Cache memory can be used to serve as a buffer between the CPU and the main memory. Block access of the main memory can be achieved through multiway interleaving across parallel memory modules (see Figure 1.4). Virtual memory space can be established with the use of disks and tape units at the outer levels.
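As a small quantitative aside (not part of the original notes), the benefit of the cache level is commonly summarized by the standard effective-access-time model, where h is the cache hit ratio:

t_{\mathrm{eff}} = h \, t_{\mathrm{cache}} + (1 - h) \, t_{\mathrm{main}}

A high hit ratio lets the hierarchy approach cache speed while retaining main-memory capacity.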
Multiprogramming and Time Sharing
Multiprogramming Within the same time interval, there may be multiple processes active in a computer, competing for memory, I/O, and CPU resources. We are aware of the fact that some computer programs are CPU-bound (computation intensive) and some are I/O-bound (input-output intensive).
Time sharing Multiprogramming on a uniprocessor is centered around the sharing of the CPU by many programs. Sometimes a high-priority program may occupy the CPU for too long to allow others to share. This problem can be overcome by using a time-sharing operating system.
1.3 PARALLEL COMPUTER STRUCTURES
Parallel computers are those systems that emphasize parallel processing.The basic architectural features of parallel computers are
introduced below.' We divide parallel computers into three architectural configurations:
• Pipeline computers
• Array processors
• Multiprocessor systems
A pipeline computer performs overlapped computations to exploit temporal parallelism. An array processor uses multiple synchronized arithmetic logic units to achieve spatial parallelism. A multiprocessor system achieves asynchronous parallelism through a set of interactive processors with shared resources (memories, databases, etc.).
1.4 ARCHITECTURAL CLASSIFICATION SCHEMES
Three computer architectural classification schemes are presented in this section. Flynn's classification (1966) is based on the multiplicity of instruction streams and data streams in a computer system. Feng's scheme (1972) is based on serial versus parallel processing. Handler's classification (1977) is determined by the degree of parallelism and pipelining in various subsystem levels.
1.4.1 Multiplicity of Instruction-Data Streams
In general, digital computers may be classified into four categories, according to the multiplicity of instruction and data
streams. This scheme for classifying computer organizations was introduced by Michael J. Flynn.
Computer organizations are characterized by the multiplicity of the hardware provided to service the instruction and data streams. Listed below are Flynn's four machine organizations:
• Single instruction stream-single data stream (SISD)
• Single instruction stream-multiple data stream (SIMD)
• Multiple instruction stream-single data stream (MISD)
• Multiple instruction stream-multiple data stream (MIMD)
SISD computer organization This organization, shown in Figure 1.16a, represents most serial computers available today.
Instructions are executed sequentially but may be overlapped in their execution stages (pipelining). Most SISD uniprocessor
systems are pipelined. An SISD computer may have more than one functional unit in it. All the functional units are under the
supervision of one control unit.
SIMD computer organization This class corresponds to array processors, introduced in Section 1.3.2. As illustrated in Figure 1.16b, there are multiple processing elements supervised by the same control unit. All PEs receive the same instruction broadcast from the control unit but operate on different data sets from distinct data streams. The shared memory subsystem may contain multiple modules.
MISD computer organization This organization is conceptually illustrated in Figure 1.16c. There are n processor units, each receiving distinct instructions operating over the same data stream and its derivatives. The results (output) of one processor become the input (operands) of the next processor in the micropipe. This structure has received much less attention and has been challenged as impractical by some computer architects. No real embodiment of this class exists.
MIMD computer organization Most multiprocessor systems and multiple-computer systems can be classified in this category (Figure 1.16d). An intrinsic MIMD computer implies interactions among the n processors because all memory streams are derived from the same data space shared by all processors. If the n data streams were derived from disjoint subspaces of the shared memories, then we would have the so-called multiple SISD (MSISD) operation, which is nothing but a set of n independent SISD uniprocessor systems.
1.4.2 Serial Versus Parallel Processing
Tse-yun Feng has suggested the use of the degree of parallelism to classify various computer architectures.
There are four types of processing methods that can be seen from this diagram:
• Word-serial and bit-serial (WSBS)
• Word-parallel and bit-serial (WPBS)
• Word-serial and bit-parallel (WSBP)
• Word-parallel and bit-parallel (WPBP)
WSBS has been called bit-serial processing because one bit (n = m = 1) is processed at a time, a rather slow process. This was done only in first-generation computers. WPBS (n = 1, m > 1) has been called bis (bit-slice) processing because an m-bit slice is processed at a time. WSBP (n > 1, m = 1), as found in most existing computers, has been called word-slice processing because one word of n bits is processed at a time. Finally, WPBP (n > 1, m > 1) is known as fully parallel processing (or simply parallel processing, if no confusion exists), in which an array of n · m bits is processed at one time, the fastest processing mode of the four. In Table 1.4, we have listed a number of computer systems under each processing mode. The system parameters n, m are also shown for each system. The bit-slice processors, like STARAN, MPP, and DAP, all have long bit slices. Illiac-IV and PEPE are two word-slice array processors.
1.4.3 Parallelism Versus Pipelining
Wolfgang Handler has proposed a classification scheme for identifying the degrees of parallelism and pipelining built into the hardware structures of a computer system. He considers parallel-pipeline processing at three subsystem levels:
• Processor control unit (PCU)
• Arithmetic logic unit (ALU)
• Bit-level circuit (BLC)
The functions of the PCU and the ALU should be clear to us. Each PCU corresponds to one processor or one CPU. The ALU is equivalent to the processing element (PE) we specified for SIMD array processors. The BLC corresponds to the combinational logic circuitry needed to perform 1-bit operations in the ALU. A computer system C can be characterized by a triple containing six independent entities, as defined below:
T(C) = <K × K', D × D', W × W'>     (1.13)
where K = the number of processors (PCUs) within the computer
K' = the number of PCUs that can be pipelined
D = the number of ALUs (or PEs) under the control of one PCU
D' = the number of ALUs that can be pipelined (pipeline chaining, to be described in Chapter 4)
W = the word length of an ALU or of a PE
W' = the number of pipeline stages in all ALUs or in a PE
Several real computer examples are used to clarify the above parametric descriptions. The Texas Instruments Advanced Scientific Computer (TI-ASC) has one controller controlling four arithmetic pipelines, each with a 64-bit word length and eight pipeline stages. Thus, we have
T(ASC) = <1 × 1, 4 × 1, 64 × 8> = <1, 4, 64 × 8>
Amdahl's law
In computer architecture, Amdahl's law (or Amdahl's argument[1]) is a formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved. It is named after computer scientist Gene Amdahl, and was presented at the AFIPS Spring Joint Computer Conference in 1967.
Amdahl's law is often used in parallel computing to predict the theoretical speedup when using multiple processors. It applies only to cases where the problem size is fixed.
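As a hedged illustration (not from the original notes), the usual statement of the law is S(n) = 1 / ((1 − p) + p/n), where p is the parallelizable fraction of the workload and n the number of processors. The short Python sketch below evaluates it for a few values:

```python
# Minimal sketch of Amdahl's law: S(n) = 1 / ((1 - p) + p / n),
# where p is the parallelizable fraction and n the number of processors.

def amdahl_speedup(p: float, n: int) -> float:
    """Theoretical speedup for a fixed workload with parallel fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)

if __name__ == "__main__":
    for p in (0.5, 0.9, 0.95):
        for n in (2, 8, 64, 1024):
            print(f"p={p:.2f}, n={n:>5}: speedup={amdahl_speedup(p, n):6.2f}")
        # As n grows, the speedup approaches the 1 / (1 - p) ceiling.
        print(f"  limit as n -> infinity: {1.0 / (1.0 - p):.2f}")
```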
Moore’s Law
The quest for higher-performance digital computers seems unending. In the past two decades, the performance of microprocessors has enjoyed exponential growth. The growth of microprocessor speed/performance by a factor of 2 every 18 months (or about 60% per year) is known as Moore's law. This growth is the result of a combination of two factors:
1. Increase in complexity (related both to higher device density and to larger size) of VLSI chips, projected to rise to around 10M transistors per chip for microprocessors, and 1B for dynamic random-access memories (DRAMs), by the year 2000 [SIA94].
2. Introduction of, and improvements in, architectural features such as on-chip cache memories, large instruction buffers, multiple instruction issue per cycle, multithreading, deep pipelines, out-of-order instruction execution, and branch prediction.
Moore's law was originally formulated in 1965 in terms of the doubling of chip complexity every year (later revised to every 18 months) based only on a small number of data points [Scha97]. Moore's revised prediction matches almost perfectly the actual increases in the number of transistors in DRAM and microprocessor chips. Moore's law seems to hold regardless of how one measures processor performance: counting the number of executed instructions per second (IPS), counting the number of floating-point operations per second (FLOPS), or using sophisticated benchmark suites that attempt to measure the processor's performance on real applications. This is because all of these measures, though numerically different, tend to rise at roughly the same rate. Figure 1.1 shows that the performance of actual processors has in fact followed Moore's law quite closely since 1980 and is on the verge of reaching the GIPS (giga IPS = 10^9 IPS) milestone.
[Figure 1.1. The exponential growth of microprocessor performance, known as Moore's law, shown over the past two decades.]
PRINCIPLES OF SCALABLE PERFORMANCE
1. Performance Metrics and Measures
1.1. Parallelism Profile in Programs
1.1.1. Degree of Parallelism
The number of processors used at any instant to execute a program is called the degree of parallelism (DOP); this can vary over time.
DOP assumes an infinite number of processors are available; this is not achievable in real machines, so some parallel program segments must be executed sequentially as smaller parallel segments. Other resources may impose limiting conditions.
A plot of DOP vs. time is called a parallelism profile.
1.1.2. Average Parallelism - 1
Assume the following:
n homogeneous processors
maximum parallelism in a profile is m
Ideally, n >> m
D, the computing capacity of a processor, is something
like MIPS or Mflops w/o regard for memory latency, etc.
i is the number of processors busy in an observation
period (e.g. DOP = i )
W is the total work (instructions or computations)
performed by a program
A is the average parallelism in the program
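The defining equations appear to have been figures in the original slides; assuming the standard formulation, where t_i denotes the total time during which DOP = i and D is the per-processor computing capacity, the total work and average parallelism are:

W = D \sum_{i=1}^{m} i \, t_i, \qquad A = \frac{\sum_{i=1}^{m} i \, t_i}{\sum_{i=1}^{m} t_i}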
1.1.3. Average Parallelism – 2
1.1.4. Average Parallelism – 3
1.1.5. Available Parallelism
Various studies have shown that the potential parallelism in scientific and engineering calculations can be very
high (e.g. hundreds or thousands of instructions per clock cycle).
But in real machines, the actual parallelism is much smaller (e.g. 10 or 20).
1.1.6. Basic Blocks
A basic block is a sequence or block of instructions with one entry and one exit.
Basic blocks are frequently used as the focus of optimizers in compilers (since it is easier to manage the use of registers utilized in the block).
Limiting optimization to basic blocks limits the instruction level parallelism that can be obtained (to about 2 to 5 in
typical code).
1.1.7. Asymptotic Speedup – 1
1.1.8. Asymptotic Speedup – 2
1.2. Mean Performance
We seek to obtain a measure that characterizes the mean, or average, performance of a set of
benchmark programs with potentially many different execution modes (e.g. scalar, vector, sequential, parallel).
We may also wish to associate weights with these programs to emphasize these different modes and yield a more
meaningful performance measure.
1.2.1. Arithmetic Mean
The arithmetic mean is familiar (sum of the terms divided by the number of terms).
Our measures will use execution rates expressed in MIPS or Mflops.
The arithmetic mean of a set of execution rates is proportional to the sum of the inverses of the execution times; it is not inversely proportional to the sum of the execution times. Thus the arithmetic mean fails to represent the real time consumed by the benchmarks when executed.
1.2.2. Harmonic Mean
Instead of using the arithmetic or geometric mean, we use the harmonic mean execution rate, which is just the inverse of the arithmetic mean of the execution time (thus guaranteeing the inverse relation not exhibited by the other means).
1.2.3. Weighted Harmonic Mean
If we associate weights fi with the benchmarks, then we can compute the weighted harmonic
mean:
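The formula itself was evidently an image in the source; assuming the standard definitions, with R_i the execution rate (in MIPS or Mflops) of benchmark i and f_i its weight, the harmonic mean and weighted harmonic mean execution rates are:

R_h = \frac{m}{\sum_{i=1}^{m} 1/R_i}, \qquad R_h^{*} = \frac{1}{\sum_{i=1}^{m} f_i / R_i}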
1.2.4. Weighted Harmonic Mean Speedup
T1 = 1/R1 = 1 is the sequential execution time on a single processor with rate R1 = 1.
Ti = 1/Ri = 1/i is the execution time using i processors with a combined execution rate of Ri = i.
Now suppose a program has n execution modes with associated weights f1, …, fn. The weighted harmonic mean speedup is defined as:
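Again assuming the standard formulation (with T* the weighted harmonic mean execution time), the equation that presumably appeared here is:

S_n = \frac{T_1}{T^{*}} = \frac{1}{\sum_{i=1}^{n} f_i / R_i}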
1.2.5. Amdahl’s Law
Assume Ri = i, and let the weights be (a, 0, …, 0, 1 − a).
Basically this means the system is used either sequentially (with probability a) or with all n processors (with probability 1 − a).
Substituting these into the weighted harmonic mean speedup yields the speedup equation known as Amdahl's law; the implication is that the best speedup possible is 1/a, regardless of n, the number of processors.
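The equation itself appears to have been a figure in the source; under the stated assumptions (Ri = i, weights (a, 0, …, 0, 1 − a)) it works out to:

S_n = \frac{1}{\sum_i f_i / R_i} = \frac{n}{1 + (n - 1)a} = \frac{1}{a + (1 - a)/n} \;\longrightarrow\; \frac{1}{a} \quad \text{as } n \to \infty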
1.3. Efficiency, Utilizations, and Quality
1.3.1. System Efficiency – 1
Assume the following definitions:
O (n) = total number of “unit operations” performed by an n processor system in completing a program P.
T (n) = execution time required to execute the program P on an n processor system.
O (n) can be considered similar to the total number of instructions executed by the n processors, perhaps scaled by
a constant factor.
If we define O (1) = T (1), then it is logical to expect that T (n) < O (n) when n > 1 if the program P is able to make
any use at all of the extra processor(s).
1.3.2. System Efficiency – 2
Clearly, the speedup factor (how much faster the program runs with n processors) can now be expressed as
S (n) = T (1) / T (n)
Recall that we expect T (n) < T (1), so S (n) ≥ 1.
System efficiency is defined as
E (n) = S (n) / n = T (1) / ( n × T (n) )
It indicates the actual degree of speedup achieved in a system as compared with the maximum possible speedup.
Thus 1 / n ≤ E (n) ≤ 1. The value is 1/n when only one processor is used (regardless of n), and the value is 1 when all processors are fully utilized.
1.3.3. Redundancy
The redundancy in a parallel computation is defined as
R (n) = O (n) / O (1)
What values can R (n) obtain?
R (n) = 1 when O (n) = O (1), or when the number of operations performed is independent of the number of processors, n. This is the ideal case.
R (n) = n when all processors perform the same number of operations as when only a single processor is used; this implies that n completely redundant computations are performed!
The R (n) figure indicates to what extent the software parallelism is carried over to the hardware implementation without extra operations being performed.
1.3.4. System Utilization
System utilization is defined as
U (n) = R (n) × E (n) = O (n) / ( n × T (n) )
It indicates the degree to which the system resources were kept busy during execution of the program. Since 1 ≤ R (n) ≤ n, and 1 / n ≤ E (n) ≤ 1, the best possible value for U (n) is 1, and the worst is 1 / n.
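A small sketch (with made-up example numbers, purely illustrative) showing how these four metrics relate:

```python
# Illustrative sketch of the speedup/efficiency/redundancy/utilization metrics.
# The T and O values below are made-up example measurements, not from the text.

def metrics(t1: float, tn: float, o1: float, on: float, n: int) -> dict:
    s = t1 / tn          # speedup S(n) = T(1)/T(n)
    e = s / n            # efficiency E(n) = S(n)/n
    r = on / o1          # redundancy R(n) = O(n)/O(1)
    u = r * e            # utilization U(n) = R(n)*E(n) = O(n)/(n*T(n)) when O(1)=T(1)
    return {"S": s, "E": e, "R": r, "U": u}

if __name__ == "__main__":
    # Example: a program takes 100 time units on 1 processor (O(1) = T(1) = 100)
    # and 16 time units on 8 processors, executing 120 unit operations in total.
    print(metrics(t1=100, tn=16, o1=100, on=120, n=8))
```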
SPEEDUP PERFORMANCE LAWS
The main objective is to produce the results as early as possible. In other words, minimal turnaround time is the primary goal.
Three performance laws are defined below:
1. Amdahl's Law (1967) is based on a fixed workload or fixed problem size.
2. Gustafson's Law (1987) applies to scalable problems, where the problem size increases with the increase in machine size.
3. The speedup model by Sun and Ni (1993) is for scaled problems bounded by memory capacity.
Amdahl’s Law for fixed workload
In many practical applications the computational workload is often fixed with a fixed problem size. As the number
of processors increases, the fixed workload is distributed.
Speedup obtained for time-critical applications is called fixed-load speedup.
Fixed-Load Speedup
The ideal speedup formula, given below, is based on a fixed workload, regardless of machine size. We consider two cases: DOP < n and DOP ≥ n.
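The formula itself was not recoverable from the source; assuming the standard fixed-load (Amdahl) formulation, with W the fixed workload, a fraction α of it strictly sequential, and n processors:

S_n = \frac{T(1)}{T(n)} = \frac{W}{\alpha W + (1 - \alpha)W/n} = \frac{n}{1 + (n - 1)\alpha}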
Parallel algorithm
In computer science, a parallel algorithm, as opposed to a traditional serial algorithm, is an algorithm which can
be executed a piece at a time on many different processing devices, and then combined together again at the end
to get the correct result.[1]
Many parallel algorithms are executed concurrently – though in general concurrent algorithms are a distinct
concept – and thus these concepts are often conflated, with which aspect of an algorithm is parallel and which is
concurrent not being clearly distinguished. Further, non-parallel, non-concurrent algorithms are often referred to
as "sequential algorithms", by contrast with concurrent algorithms.
Examples of Parallel Algorithms
This section describes and analyzes several parallel algorithms. These algorithms provide examples of how to analyze algorithms in terms of work and depth and of how to use nested data-parallel constructs. They also introduce some important ideas concerning parallel algorithms. We mention again that the main goals are to have the code closely match the high-level intuition of the algorithm, and to make it easy to analyze the asymptotic performance from the code.
Parallel Algorithm Complexity
Analysis of an algorithm helps us determine whether the algorithm is useful or not. Generally, an algorithm is analyzed based on its execution time (time complexity) and the amount of space it requires (space complexity).
Since sophisticated memory devices are available at reasonable cost, storage space is no longer a critical issue. Hence, space complexity is not given as much importance.
Parallel algorithms are designed to improve the computation speed of a computer. For analyzing a parallel algorithm, we normally consider the following parameters −
 Time complexity (Execution Time),
 Total number of processors used, and
 Total cost.
Time Complexity
The main reason behind developing parallel algorithms was to reduce the computation time of an algorithm. Thus, evaluating the execution time of an algorithm is extremely important in analyzing its efficiency.
Execution time is measured on the basis of the time taken by the algorithm to solve a problem. The total execution time is calculated from the moment the algorithm starts executing to the moment it stops. If all the processors do not start or end execution at the same time, then the total execution time of the algorithm is measured from the moment the first processor starts its execution to the moment the last processor stops its execution.
Time complexity of an algorithm can be classified into three categories−
 Worst-case complexity − When the amount of time required by an algorithm for a given input is maximum.
 Average-case complexity − When the amount of time required by an algorithm for a given input is average.
 Best-case complexity − When the amount of time required by an algorithm for a given input is minimum.
Asymptotic Analysis
The complexity or efficiency of an algorithm is the number of steps executed by the algorithm to get the desired output. Asymptotic analysis is done to calculate the complexity of an algorithm in its theoretical analysis. In asymptotic analysis, a large input length is used to calculate the complexity function of the algorithm.
Note − Asymptotic is a condition where a line tends to meet a curve, but they do not intersect. Here the line and the curve are asymptotic to each other.
Asymptotic notation is the easiest way to describe the fastest and slowest possible execution times for an algorithm using upper and lower bounds on speed. For this, we use the following notations −
 Big O notation
 Omega notation
 Theta notation
Big O notation
In mathematics, Big O notation is used to represent the asymptotic characteristics of functions. It represents the behavior of a function for large inputs in a simple and accurate way. It is a method of representing the upper bound of an algorithm's execution time. It represents the longest amount of time that the algorithm could take to complete its execution. The function −
f(n) = O(g(n))
iff there exist positive constants c and n0 such that f(n) ≤ c * g(n) for all n where n ≥ n0.
Omega notation
Omega notation is a method of representing the lower bound of an algorithm's execution time. The function −
f(n) = Ω(g(n))
iff there exist positive constants c and n0 such that f(n) ≥ c * g(n) for all n where n ≥ n0.
Theta Notation
Theta notation is a method of representing both the lower bound and the upper bound of an algorithm's execution time. The function −
f(n) = θ(g(n))
iff there exist positive constants c1, c2, and n0 such that c1 * g(n) ≤ f(n) ≤ c2 * g(n) for all n where n ≥ n0.
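A small worked example (not in the original) applying the three definitions to f(n) = 3n^2 + 5n:

f(n) = 3n^2 + 5n = O(n^2) \quad (\text{take } c = 4,\ n_0 = 5,\ \text{since } 3n^2 + 5n \le 4n^2 \text{ for } n \ge 5)
f(n) = \Omega(n^2) \quad (\text{take } c = 3,\ n_0 = 1,\ \text{since } 3n^2 + 5n \ge 3n^2 \text{ for all } n \ge 1)
f(n) = \Theta(n^2) \quad (\text{take } c_1 = 3,\ c_2 = 4,\ n_0 = 5)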
Speedup of an Algorithm
The performance of a parallel algorithm is determined by calculating its speedup. Speedup is defined as the ratio of the worst-case execution time of the fastest known sequential algorithm for a particular problem to the worst-case execution time of the parallel algorithm.
speedup = Worst-case execution time of the fastest known sequential algorithm for a particular problem / Worst-case execution time of the parallel algorithm
Number of Processors Used
The number of processors used is an important factor in analyzing the efficiency of a parallel algorithm. The cost to buy, maintain, and run the computers is calculated. The larger the number of processors used by an algorithm to solve a problem, the more costly the obtained result becomes.
Total Cost
Total cost of a parallel algorithm is the product of time complexity and the number of processors used in that particular algorithm.
Total Cost = Time complexity × Number of processors used
Therefore, the efficiency of a parallel algorithm is –
Efficiency = Worst case execution time of sequential algorithm / Worst case execution time of the parallel algorithm
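A short sketch (with hypothetical timings, not from the text) tying these measures together:

```python
# Sketch with hypothetical numbers: speedup, total cost, and efficiency
# of a parallel algorithm versus the best known sequential one.

def analyze(seq_time: float, par_time: float, processors: int) -> None:
    speedup = seq_time / par_time                 # speedup = T_seq / T_par
    total_cost = par_time * processors            # cost = time complexity x processors
    efficiency = seq_time / total_cost            # efficiency = T_seq / (p x T_par)
    print(f"speedup={speedup:.2f}, total cost={total_cost:.1f}, efficiency={efficiency:.2%}")

if __name__ == "__main__":
    # Hypothetical: a sequential sort takes 64 time units; a parallel sort on
    # 8 processors takes 10 time units.
    analyze(seq_time=64, par_time=10, processors=8)
```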
Models of Parallel Processing
Parallel processors come in many different varieties.
1. SIMD VERSUS MIMD ARCHITECTURES
Within the SIMD category, two fundamental design choices exist:
1. Synchronous versus asynchronous SIMD. In a SIMD machine, each processor can execute or ignore the instruction being broadcast based on its local state or data-dependent conditions. However, this leads to some inefficiency in executing conditional computations. For example, an "if-then-else" statement is executed by first enabling the processors for which the condition is satisfied and then flipping the "enable" bit before getting into the "else" part. On average, half of the processors will be idle for each branch. The situation is even worse for "case" statements involving multiway branches. A possible cure is to use the asynchronous version of SIMD, known as SPMD (spim-dee, or single-program, multiple data), where each processor runs its own copy of the common program. The advantage of SPMD is that in an "if-then-else" computation, each processor will only spend time on the relevant branch. The disadvantages include the need for occasional synchronization and the higher complexity of each processor, which must now have a program memory and instruction fetch/decode logic.
2. Custom- versus commodity-chip SIMD. A SIMD machine can be designed based on commodity (off-the-shelf) components or with custom chips. In the first approach, components tend to be inexpensive because of mass production. However, such general-purpose components will likely contain elements that may not be needed for a particular design. These extra components may complicate the design, manufacture, and testing of the SIMD machine and may introduce speed penalties as well. Custom components (including ASICs = application-specific ICs, multichip modules, or WSI = wafer-scale integrated circuits) generally offer better performance but lead to much higher cost, in view of their development costs being borne by a relatively small number of parallel machine users (as opposed to commodity microprocessors that are produced in the millions). As integrating multiple processors along with ample memory on a single VLSI chip becomes feasible, a type of convergence between the two approaches appears imminent.
Within the MIMD class, three fundamental issues or design choices are subjects of ongoing debates in the research community:
1. MPP—massively or moderately parallel processor. Is it more cost-effective to build a parallel processor out of a relatively small number of powerful processors or a massive number of very simple processors (the "herd of elephants" or the "army of ants" approach)? Referring to Amdahl's law, the first choice does better on the inherently sequential part of a computation, while the second approach might allow a higher speedup for the parallelizable part. A general answer cannot be given to this question, as the best choice is both application- and technology-dependent.
2. Tightly versus loosely coupled MIMD. Which is a better approach to high-performance computing: using specially designed multiprocessors/multicomputers, or a collection of ordinary workstations that are interconnected by commodity networks (such as Ethernet or ATM) and whose interactions are coordinated by special system software and distributed file systems? The latter choice, sometimes referred to as network of workstations (NOW) or cluster computing, has been gaining popularity in recent years. However, many open problems exist for taking full advantage of such network-based loosely coupled architectures. The hardware, system software, and applications aspects of NOWs are being investigated by numerous research groups.
3. Explicit message passing versus virtual shared memory. Which scheme is better: forcing the users to explicitly specify all messages that must be sent between processors, or allowing them to program in an abstract higher-level model, with the required messages automatically generated by the system software? This question is essentially very similar to the one asked in the early days of high-level languages and virtual memory. At some point in the past, programming in assembly languages and doing explicit transfers between secondary and primary memories could lead to higher efficiency. However, nowadays, software is so complex and compilers and operating systems so advanced (not to mention processing power so cheap) that it no longer makes sense to hand-optimize programs, except in limited time-critical instances. However, we are not yet at that point in parallel processing, and hiding the explicit communication structure of a parallel machine from the programmer has nontrivial consequences for performance.
THE PRAM SHARED-MEMORY MODEL
The theoretical model used for conventional or sequential computers (SISD class) is known as the random-access machine (RAM) (not to be confused with random-access memory, which has the same acronym). The parallel version of RAM [PRAM (pea-ram)] constitutes an abstract model of the class of global-memory parallel processors. The abstraction consists of ignoring the details of the processor-to-memory interconnection network and taking the view that each processor can access any memory location in each machine cycle, independent of what other processors are doing.
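To make the abstraction concrete, here is a hedged sketch (not from the source) of how a PRAM-style parallel sum would proceed: in each synchronous step, every active "processor" reads two cells of the shared array and writes one, so n values are reduced in about log2(n) steps. Plain Python is used only to simulate the model.

```python
import math

def pram_parallel_sum(shared: list) -> int:
    """Simulate a PRAM-style reduction: O(log n) synchronous steps,
    each step performed 'simultaneously' by all active processors."""
    a = list(shared)               # the shared memory
    n = len(a)
    stride = 1
    while stride < n:
        # In a real PRAM, all these additions happen in one machine cycle.
        for i in range(0, n - stride, 2 * stride):
            a[i] = a[i] + a[i + stride]
        stride *= 2
    return a[0]

if __name__ == "__main__":
    data = list(range(1, 17))      # 1 + 2 + ... + 16 = 136
    print(pram_parallel_sum(data), "in about", math.ceil(math.log2(len(data))), "steps")
```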
DISTRIBUTED-MEMORY OR GRAPH MODELS
This network is usually represented as a graph, with vertices corresponding to processor–memory nodes and edges corresponding to communication links. If communication links are unidirectional, then directed edges are used. Undirected edges imply bidirectional communication, although not necessarily in both directions at once. Important parameters of an interconnection network include the following (a small illustrative sketch follows the list):
1. Network diameter: the longest of the shortest paths between various pairs of nodes, which should be relatively small if network latency is to be minimized. The network diameter is more important with store-and-forward routing (when a message is stored in its entirety and retransmitted by intermediate nodes) than with wormhole routing (when a message is quickly relayed through a node in small pieces).
2. Bisection (band)width: the smallest number (total capacity) of links that need to be cut in order to divide the network into two subnetworks of half the size. This is important when nodes communicate with each other in a random fashion. A small bisection (band)width limits the rate of data transfer between the two halves of the network, thus affecting the performance of communication-intensive algorithms.
3. Vertex or node degree: the number of communication ports required of each node, which should be a constant independent of network size if the architecture is to be readily scalable to larger sizes. The node degree has a direct effect on the cost of each node, with the effect being more significant for parallel ports containing several wires or when the node is required to communicate over all of its ports at once.
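As a concrete example (illustrative, not from the source), the well-known values of these three parameters for a d-dimensional binary hypercube are computed below:

```python
# For a d-dimensional binary hypercube with 2**d nodes, the standard results are:
#   diameter = d, node degree = d, bisection width = 2**(d-1).

def hypercube_parameters(d: int) -> dict:
    return {
        "nodes": 2 ** d,
        "diameter": d,                    # longest shortest path: flip up to d address bits
        "node_degree": d,                 # one link per address bit
        "bisection_width": 2 ** (d - 1),  # links crossing the cut on one address bit
    }

if __name__ == "__main__":
    for d in (3, 4, 10):
        print(hypercube_parameters(d))
```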
CIRCUIT MODEL AND PHYSICAL REALIZATIONS
In a sense, the only sure way to predict the performance of a parallel architecture on a given set of problems is to actually build the machine and run the programs on it. Because this is often impossible or very costly, the next best thing is to model the machine at the circuit level, so that all computational and signal propagation delays can be taken into account. Unfortunately, this is also impossible for a complex supercomputer, both because generating and debugging detailed circuit specifications are not much easier than a full-blown implementation and because a circuit simulator would take eons to run the simulation. Despite the above observations, we can produce and evaluate circuit-level designs for specific applications.
GLOBAL VERSUS DISTRIBUTED MEMORY
Within the MIMD class of parallel processors, memory can be global or distributed. Global memory may be visualized as being in a central location where all processors can access it with equal ease (or with equal difficulty, if you are a half-empty-glass type of person). Figure 4.3 shows a possible hardware organization for a global-memory parallel processor. Processors can access memory through a special processor-to-memory network. A global-memory multiprocessor is characterized by the type and number p of processors, the capacity and number m of memory modules, and the network architecture. Even though p and m are independent parameters, achieving high performance typically requires that they be comparable in magnitude (e.g., too few memory modules will cause contention among the processors and too many would complicate the network design).
Distributed-memory architectures can be conceptually viewed as in Fig. 4.5. A collection of p processors, each with its own private memory, communicates through an interconnection network. Here, the latency of the interconnection network may be less critical, as each processor is likely to access its own local memory most of the time. However, the communication bandwidth of the network may or may not be critical, depending on the type of parallel applications and the extent of task interdependencies. Note that each processor is usually connected to the network through multiple links or channels (this is the norm here, although it can also be the case for shared-memory parallel processors).
Cache coherence
In computer architecture, cache coherence is the uniformity of shared resource data that ends up stored in multiple local caches. When clients in a system maintain caches of a common memory resource, problems may arise with incoherent data, which is particularly the case with CPUs in a multiprocessing system.
In the accompanying illustration, consider that both clients have a cached copy of a particular memory block from a previous read. Suppose the client on the bottom updates/changes that memory block; the client on the top could then be left with an invalid copy in its cache without any notification of the change. Cache coherence is intended to manage such conflicts by maintaining a coherent view of the data values in multiple caches.
The following are the requirements for cache coherence:[2]
Write Propagation
Changes to the data in any cache must be propagated to other copies (of that cache line) in the peer caches.
Transaction Serialization
Reads/Writes to a single memory location must be seen by all processors in the same order.
Coherence protocols
Coherence Protocols apply cache coherence in multiprocessor systems. The intention is that two clients must
never see different values of the same shared data.
The protocol must implement the basic requirements for coherence. It can be tailor made for the target
system/application.
Protocols can also be classified as snooping (snoopy/broadcast) or directory based. Typically, early systems used directory-based protocols, where a directory would keep track of the data being shared and the sharers. In snoopy protocols, transaction requests (read/write/upgrade) are sent out to all processors. All processors snoop the request and respond appropriately.
Write Propagation in Snoopy protocols can be implemented by either of the following:
Write Invalidate
When a write operation is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped memory location, thus forcing a read from main memory of the new value on the next access.[4]
Write Update
When a write operation is observed to a location that a cache has a copy of, the cache controller updates
its own copy of the snooped memory location with the new data.
If the protocol design states that whenever any copy of the shared data is changed, all the other copies must be "updated" to reflect the change, then it is a write-update protocol. If the design states that a write to a cached copy by any processor requires other processors to discard/invalidate their cached copies, then it is a write-invalidate protocol.
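Purely as an illustration of the write-invalidate idea (a simplified sketch, not any specific real protocol), the toy simulation below keeps one valid/invalid flag per cache and broadcasts an invalidation on every write:

```python
# Toy write-invalidate snooping sketch: each cache holds a copy of one shared
# memory block plus a valid bit; a write by any cache invalidates all peers.

class ToyCache:
    def __init__(self, name, memory):
        self.name, self.memory = name, memory
        self.value, self.valid = None, False

    def read(self):
        if not self.valid:                      # miss: fetch from main memory
            self.value, self.valid = self.memory["block"], True
        return self.value

    def write(self, value, peers):
        self.value, self.valid = value, True    # update own copy
        self.memory["block"] = value            # write through to memory (for simplicity)
        for peer in peers:                      # snooped invalidation broadcast
            if peer is not self:
                peer.valid = False

if __name__ == "__main__":
    memory = {"block": 0}
    c1, c2 = ToyCache("c1", memory), ToyCache("c2", memory)
    caches = [c1, c2]
    print(c1.read(), c2.read())     # both read 0 and cache it
    c2.write(42, caches)            # c2 writes; c1's copy is invalidated
    print(c1.read(), c2.read())     # c1 re-fetches 42 from memory -> coherent view
```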
However, scalability is one shortcoming of broadcast protocols.
Various models and protocols have been devised for maintaining coherence.
Parallel Algorithm - Models
The model of a parallel algorithm is developed by considering a strategy for dividing the data and the processing method, and by applying a suitable strategy to reduce interactions. In this chapter, we will discuss the following Parallel Algorithm Models −
 Data parallel model
 Task graph model
 Work pool model
 Master slave model
 Producer consumer or pipeline model
 Hybrid model
Data Parallel
In the data parallel model, tasks are assigned to processes and each task performs similar types of operations on different data. Data parallelism is a consequence of a single operation being applied to multiple data items.
Data-parallel model can be applied on shared-address spaces and message-passing
paradigms. In data-parallel model, interaction overheads can be reduced by selecting a
locality preserving decomposition, by using optimized collective interaction routines, or
by overlapping computation and interaction.
The primary characteristic of data-parallel model problems is that the intensity of data
parallelism increases with the size of the problem, which in turn makes it possible to
use more processes to solve larger problems.
Example − Dense matrix multiplication.
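A hedged sketch of the data-parallel idea (using Python's standard multiprocessing module; the decomposition shown, one block of output rows per worker, is just one possible choice):

```python
# Data-parallel sketch: every worker applies the same operation (a row block
# of a dense matrix product) to a different slice of the data.
from multiprocessing import Pool

def row_block(args):
    a_rows, b = args                        # a_rows: slice of A, b: full B
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a_rows]

def parallel_matmul(a, b, workers=4):
    chunk = max(1, len(a) // workers)
    slices = [a[i:i + chunk] for i in range(0, len(a), chunk)]
    with Pool(workers) as pool:
        blocks = pool.map(row_block, [(s, b) for s in slices])
    return [row for block in blocks for row in block]

if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6], [7, 8]]
    B = [[1, 0], [0, 1]]
    print(parallel_matmul(A, B))            # A x I = A
```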
Task Graph Model
In the task graph model, parallelism is expressed by a task graph. A task graph can be either trivial or nontrivial. In this model, the correlation among the tasks is utilized to promote locality or to minimize interaction costs. This model is used to solve problems in which the quantity of data associated with the tasks is huge compared to the amount of computation associated with them. The tasks are assigned so as to reduce the cost of data movement among the tasks.
Examples − Parallel quick sort, sparse matrix factorization, and parallel algorithms
derived via divide-and-conquer approach.
Here, problems are divided into atomic tasks and implemented as a graph. Each task is an independent unit of work that has dependencies on one or more antecedent tasks. After the completion of a task, its output is passed to the dependent tasks. A task with antecedent tasks starts execution only when all of its antecedent tasks are completed. The final output of the graph is received when the last dependent task is completed (Task 6 in the above figure).
Work Pool Model
In the work pool model, tasks are dynamically assigned to the processes for balancing the load. Therefore, any process may potentially execute any task. This model is used when the quantity of data associated with tasks is comparatively smaller than the computation associated with the tasks.
There is no desired pre-assigning of tasks onto the processes. Assigning of tasks is
centralized or decentralized. Pointers to the tasks are saved in a physically shared list,
in a priority queue, or in a hash table or tree, or they could be saved in a physically
distributed data structure.
Tasks may be available at the beginning, or may be generated dynamically. If tasks are generated dynamically and a decentralized assignment of tasks is used, then a termination detection algorithm is required so that all the processes can actually detect the completion of the entire program and stop looking for more tasks.
Example − Parallel tree search
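A hedged sketch of the work pool idea (using Python's multiprocessing queues; the toy "tree search" that labels nodes of a binary tree is invented for the example, and the queue's join() stands in for termination detection):

```python
# Work-pool sketch: tasks (tree nodes) sit in a shared queue; any idle worker
# grabs the next one and may dynamically push newly generated child tasks back.
from multiprocessing import JoinableQueue, Queue, Process

MAX_DEPTH = 3

def worker(tasks: JoinableQueue, results: Queue) -> None:
    while True:
        item = tasks.get()
        if item is None:                    # sentinel: the pool is being shut down
            tasks.task_done()
            return
        depth, label = item
        results.put(label)                  # "process" the node
        if depth < MAX_DEPTH:               # dynamic task generation
            tasks.put((depth + 1, 2 * label))
            tasks.put((depth + 1, 2 * label + 1))
        tasks.task_done()

if __name__ == "__main__":
    tasks, results = JoinableQueue(), Queue()
    workers = [Process(target=worker, args=(tasks, results)) for _ in range(4)]
    for w in workers:
        w.start()
    tasks.put((0, 1))                       # the root task
    tasks.join()                            # crude termination detection
    for _ in workers:
        tasks.put(None)                     # release the workers
    tasks.join()
    visited = [results.get() for _ in range(2 ** (MAX_DEPTH + 1) - 1)]
    print(sorted(visited))                  # labels 1..15 of the explored tree
    for w in workers:
        w.join()
```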
Master-Slave Model
In the master-slave model, one or more master processes generate tasks and allocate them to slave processes. The tasks may be allocated beforehand if −
 the master can estimate the volume of the tasks, or
 a random assigning can do a satisfactory job of balancing load, or
 slaves are assigned smaller pieces of task at different times.
This model is generally equally suitable to shared-address-space or message-passing paradigms, since the interaction is naturally two-way.
In some cases, a task may need to be completed in phases, and the tasks in each phase must be completed before the tasks in the next phase can be generated. The master-slave model can be generalized to a hierarchical or multi-level master-slave model in which the top-level master feeds a large portion of the tasks to the second-level masters, who further subdivide the tasks among their own slaves and may perform a part of the task themselves.
Precautions in using the master-slave model
Care should be taken to ensure that the master does not become a congestion point. This may happen if the tasks are too small or the workers are comparatively fast.
The tasks should be selected in a way that the cost of performing a task dominates the
cost of communication and the cost of synchronization.
Asynchronous interaction may help overlap interaction and the computation associated
with work generation by the master.
Pipeline Model
It is also known as the producer-consumer model. Here a set of data is passed on
through a series of processes, each of which performs some task on it. Here, the arrival
of new data generates the execution of a new task by a process in the queue. The
processes could form a queue in the shape of linear or multidimensional arrays, trees,
or general graphs with or without cycles.
This model is a chain of producers and consumers. Each process in the queue can be
considered as a consumer of a sequence of data items for the process preceding it in
the queue and as a producer of data for the process following it in the queue. The queue
does not need to be a linear chain; it can be a directed graph. The most common
interaction minimization technique applicable to this model is overlapping interaction
with computation.
Example − Parallel LU factorization algorithm.
Hybrid Models
A hybrid algorithm model is required when more than one model may be needed to
solve a problem.
A hybrid model may be composed of either multiple models applied hierarchically or
multiple models applied sequentially to different phases of a parallel algorithm.
Example − Parallel quick sort
Shared memory/Parallel Processing in Memory
In computer science, shared memory is memory that may be
simultaneously accessed by multiple programs with an intent
to provide communication among them or avoid redundant
copies. Shared memory is an efficient means of passing data
between programs. Depending on context, programs may
run on a single processor or on multiple separate
processors.
Using memory for communication inside a single
program, e.g. among its multiple threads, is also referred
to as shared memory.
In hardware
In computer hardware, shared memory refers to a (typically large) block of random access memory (RAM) that
can be accessed by several different central processing units (CPUs) in a multiprocessor computer system.
Shared memory systems may use:[1]
 uniform memory access (UMA): all the processors share the physical memory uniformly;
 non-uniform memory access (NUMA): memory access time depends on the memory location relative to a
processor;
 cache-only memory architecture (COMA): the local memories for the processors at each node are used as cache instead of as actual main memory.
In software
In computer software, shared memory is either
 a method of inter-process communication (IPC), i.e. a way of exchanging data between programs running at the same time. One process will create an area in RAM which other processes can access;
 a method of conserving memory space by directing accesses to what would ordinarily be copies of a piece of data to a single instance instead, by using virtual memory mappings or with explicit support of the program in question. This is most often used for shared libraries and for XIP.
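As a hedged, minimal sketch of shared-memory IPC in software (using Python's multiprocessing.shared_memory module, available since Python 3.8; the block size and written value are arbitrary choices for the example):

```python
# Minimal shared-memory IPC sketch: a parent creates a shared block, a child
# process attaches to it by name and modifies it, and the parent sees the change.
from multiprocessing import Process
from multiprocessing import shared_memory

def child(block_name: str) -> None:
    shm = shared_memory.SharedMemory(name=block_name)   # attach to existing block
    shm.buf[0] = 42                                      # write through shared RAM
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=16)
    shm.buf[0] = 0
    p = Process(target=child, args=(shm.name,))
    p.start()
    p.join()
    print("parent sees:", shm.buf[0])                    # prints 42; no copy was made
    shm.close()
    shm.unlink()                                         # free the shared block
```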
Unit  IUnit  I
Unit I
 
MIS CHAPTER THREE.ppt
MIS CHAPTER THREE.pptMIS CHAPTER THREE.ppt
MIS CHAPTER THREE.ppt
 
Cpu architecture
Cpu architecture Cpu architecture
Cpu architecture
 
Chapter - 1
Chapter - 1Chapter - 1
Chapter - 1
 

Recently uploaded

The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
ankuprajapati0525
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
VENKATESHvenky89705
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
ViniHema
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
AJAYKUMARPUND1
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
Jayaprasanna4
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
bakpo1
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
R&R Consult
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
gerogepatton
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
Kamal Acharya
 
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
MLILAB
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
Robbie Edward Sayers
 
Runway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptxRunway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptx
SupreethSP4
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
Pratik Pawar
 
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang,  ICLR 2024, MLILAB, KAIST AI.pdfJ.Yang,  ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
MLILAB
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
Osamah Alsalih
 
ethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.pptethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.ppt
Jayaprasanna4
 

Recently uploaded (20)

The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
 
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
 
Runway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptxRunway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptx
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
 
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang,  ICLR 2024, MLILAB, KAIST AI.pdfJ.Yang,  ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
ethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.pptethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.ppt
 

INTRODUCTION TO PARALLEL PROCESSING

  • 1. By GS kosta INTRODUCTION TO PARALLEL PROCESSING Parallel computer structures will be characterized as pipelined computers,array processors, and multiprocessor systems. Several new computing concepts,including data flow and VLSI approaches. 1.1 EVOLUTION OF COMPUTER SYSTEMS physically marked by the rapid changing of building blocks from relays and vacuum tubes (l940-1950s) to discrete diodes and transistors (1950 1960s), to small- and medium-scale integrated (SSI/MSI) circuits (l960-1970s), and to large- and very-large-scale integrated (LSI/VLSI) devices (1970s and beyond).Increases in device speed and reliability and reductions in hardware cost and physical size have greatly enhanced computer performance 1.1.1 Generations of Computer Systems The first generation (1938-1953)The introduction of the first electronic analog computer in 1938 and the first electronic digital computer, ENIAC (Electronic Numerical Integrator and Computer), in 1946 marked the beginning of the first generation of computers. Electromechanical relays were used as switching devices in the 1940s, and vacuumtubes were used in the 1950s. The second generation (1952-1963)Transistors were invented in 1948. The first transistorized digital computer, TRAD!C, was built by Bell Laboratories in 1954.Discrete transistors and diodes were the building blocks: 800 transistors were used in TRADIC. Printed circuits appeared The third generation (1962-1975)This generation was marked by the use of small-scale integrated (SSI) and medium-scale integrated (MSI) circuits as the basic building blocks. Multilayered printed circuits were used.Core memory was still used in CDC-6600 and other machines but. by 1968, many fast computers, like CDC-7600, began to replace cores with solid- state memories. The fourth generation (1972-present) The present generation computers emphasize the use of large-scale integrated (LSI) circuits for both logic and memory sections. High-density packaging has appeared. High-level languages are being extended to handle both scalar and vector data. like the extended Fortran in many vector processors, The future Computers to be used in the 1990s may be the next generation. Very large-scale integrated (VLSI) chips will be used along with high-density modular design.Multiprocessors like the 16 processors in the S-1 project at Lawrence Livermore National Laboratory and in the Denelcor's HEP will be required.Cray-2 is expected to have four processors,to be delivered in 1985. More than 1000mega floating point operations persecond(megaflops) are expected in these future supercomputers. 1.1.2 TrendsTowardsParallel Processing ** According to Sidney Fern Bach:" Today's large computers (mainframes)wouldhere beenconsidered 'supercomputers' 10to 20 years ago.By the same token,today's supercomputers willhe considered'state-of-the-art' standard equipment 10 to 20_yearsFrom now." from an application point of view. the mainstream usage of computers is experiencing a trend of four ascending levels of sophistication: • Data processing • Information processing • Knowledge processing • Intelligence processing
From an operating system point of view, computer systems have improved chronologically in four phases:
• Batch processing
• Multiprogramming
• Time sharing
• Multiprocessing
In these four operating modes, the degree of parallelism increases sharply from phase to phase. The general trend is to emphasize parallel processing of information. In what follows, the term information is used with an extended meaning to include data, information, knowledge, and intelligence. We formally define parallel processing as follows: Parallel processing is an efficient form of information processing which emphasizes the exploitation of concurrent events in the computing process. Concurrency implies parallelism, simultaneity, and pipelining. Parallel events may occur in multiple resources during the same time interval; simultaneous events may occur at the same time instant; and pipelined events may occur in overlapped time spans.
Parallel processing can be exploited at four programmatic levels:
• Job or program level
• Task or procedure level
• Interinstruction level
• Intrainstruction level
The highest (job) level is often handled algorithmically, while the lowest (intrainstruction) level is often implemented directly in hardware. Hardware roles increase from high to low levels; conversely, software implementations increase from low to high levels. The trade-off between hardware and software approaches to solving a problem is always a controversial issue. As hardware cost declines and software cost increases, more and more hardware methods are replacing conventional software approaches. The trend is also supported by the increasing demand for faster real-time, resource-sharing, and fault-tolerant computing environments.
As far as parallel processing is concerned, the general architectural trend is shifting away from conventional uniprocessor systems toward multiprocessor systems, or toward an array of processing elements controlled by one uniprocessor. In all cases, a high degree of pipelining is being incorporated into the various system levels.
1.2 PARALLELISM IN UNIPROCESSOR SYSTEMS
1.2.1 Basic Uniprocessor Architecture
A typical uniprocessor computer consists of three major components: the main memory, the central processing unit (CPU), and the input-output (I/O) subsystem.
The architectures of two commercially available uniprocessor computers are given below to show the possible interconnection structures among the three subsystems. We will examine the major components in the CPU and in the I/O subsystem.
1.2.2 Parallel Processing Mechanisms
A number of parallel processing mechanisms have been developed in uniprocessor computers. We identify them in the following six categories:
• Multiplicity of functional units
• Parallelism and pipelining within the CPU
• Overlapped CPU and I/O operations
• Use of a hierarchical memory system
• Balancing of subsystem bandwidths
• Multiprogramming and time sharing
Multiplicity of functional units The early computer had only one arithmetic and logic unit (ALU) in its CPU. Furthermore, the ALU could perform only one function at a time, a rather slow process for executing a long sequence of arithmetic-logic instructions. In practice, many of the functions of the ALU can be distributed to multiple, specialized functional units which can operate in parallel. The CDC-6600 (designed in 1964) has 10 functional units built into its CPU (Figure 1.5). These 10 units are independent of each other and may operate simultaneously. A scoreboard is used to keep track of the availability of the functional units and registers being demanded. With 10 functional units and 24 registers available, the instruction issue rate can be significantly increased. Another good example of a multifunction uniprocessor is the IBM 360/91 (1968), which has two parallel execution units.
Parallelism and pipelining within the CPU Parallel adders, using such techniques as carry-lookahead and carry-save, are now built into almost all ALUs. This is in contrast to the bit-serial adders used in the first-generation machines. High-speed multiplier recoding and convergence division are techniques for exploiting parallelism and sharing hardware resources for the multiply and divide functions. The use of multiple functional units is a form of parallelism within the CPU. Various phases of instruction execution are now pipelined, including instruction fetch, decode, operand fetch, arithmetic-logic execution, and store result. To facilitate overlapped instruction execution through the pipe, instruction prefetch and data buffering techniques have been developed.
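To see why overlapping the instruction phases pays off, here is a rough back-of-the-envelope sketch (not from the text above, and assuming an idealized, stall-free pipeline): with k stages and n instructions, execution takes about k + n - 1 stage times instead of n * k.

```python
# Rough illustration of overlapped (pipelined) instruction execution.
# Assumes an idealized k-stage pipeline with no stalls -- a simplification,
# not a model of any specific machine mentioned in the text.

def unpipelined_cycles(n_instructions: int, k_stages: int) -> int:
    """Each instruction passes through all k stages before the next one starts."""
    return n_instructions * k_stages

def pipelined_cycles(n_instructions: int, k_stages: int) -> int:
    """The first instruction takes k cycles; each following one completes every cycle."""
    return k_stages + n_instructions - 1

if __name__ == "__main__":
    n, k = 1000, 5  # hypothetical instruction count and pipeline depth
    serial = unpipelined_cycles(n, k)
    overlapped = pipelined_cycles(n, k)
    print(f"serial: {serial} cycles, pipelined: {overlapped} cycles, "
          f"speedup ~ {serial / overlapped:.2f}")
```

For long instruction streams the speedup approaches the number of stages k, which is the intuition behind pipelining the fetch-decode-execute-store phases.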
Overlapped CPU and I/O operations I/O operations can be performed simultaneously with CPU computations by using separate I/O controllers, channels, or I/O processors. The direct-memory-access (DMA) channel can be used to provide direct information transfer between the I/O devices and the main memory. The DMA transfer is conducted on a cycle-stealing basis, which is apparent to the CPU.
Use of a hierarchical memory system Usually, the CPU is about 1000 times faster than memory access. A hierarchical memory system can be used to close this speed gap. The computer memory hierarchy is conceptually illustrated in Figure 1.6. The innermost level is the register files directly addressable by the ALU. Cache memory can be used to serve as a buffer between the CPU and the main memory. Block access of the main memory can be achieved through multiway interleaving across parallel memory modules (see Figure 1.4). Virtual memory space can be established with the use of disks and tape units at the outer levels.
Multiprogramming and time sharing
Multiprogramming Within the same time interval, there may be multiple processes active in a computer, competing for memory, I/O, and CPU resources. We are aware of the fact that some computer programs are CPU-bound (computation intensive) and some are I/O-bound (input-output intensive).
Time sharing Multiprogramming on a uniprocessor is centered around the sharing of the CPU by many programs. Sometimes a high-priority program may occupy the CPU for too long to allow others to share. This problem can be overcome by using a time-sharing operating system.
1.3 PARALLEL COMPUTER STRUCTURES
Parallel computers are those systems that emphasize parallel processing. The basic architectural features of parallel computers are introduced below. We divide parallel computers into three architectural configurations:
• Pipeline computers
• Array processors
• Multiprocessor systems
A pipeline computer performs overlapped computations to exploit temporal parallelism. An array processor uses multiple synchronized arithmetic logic units to achieve spatial parallelism. A multiprocessor system achieves asynchronous parallelism through a set of interactive processors with shared resources (memories, databases, etc.).
1.4 ARCHITECTURAL CLASSIFICATION SCHEMES
Three computer architectural classification schemes are presented in this section. Flynn's classification (1966) is based on the multiplicity of instruction streams and data streams in a computer system. Feng's scheme (1972) is based on serial versus parallel processing. Handler's classification (1977) is determined by the degree of parallelism and pipelining in various subsystem levels.
1.4.1 Multiplicity of Instruction-Data Streams
In general, digital computers may be classified into four categories, according to the multiplicity of instruction and data streams. This scheme for classifying computer organizations was introduced by Michael J. Flynn. Computer organizations are characterized by the multiplicity of the hardware provided to service the instruction and data streams. Listed below are Flynn's four machine organizations:
• Single instruction stream-single data stream (SISD)
• Single instruction stream-multiple data stream (SIMD)
• Multiple instruction stream-single data stream (MISD)
• Multiple instruction stream-multiple data stream (MIMD)
SISD computer organization This organization, shown in Figure 1.16a, represents most serial computers available today. Instructions are executed sequentially but may be overlapped in their execution stages (pipelining). Most SISD uniprocessor systems are pipelined. An SISD computer may have more than one functional unit in it; all the functional units are under the supervision of one control unit.
SIMD computer organization This class corresponds to the array processors introduced in Section 1.3.2. As illustrated in Figure 1.16b, there are multiple processing elements (PEs) supervised by the same control unit. All PEs receive the same instruction broadcast from the control unit but operate on different data sets from distinct data streams. The shared memory subsystem may contain multiple modules.
MISD computer organization This organization is conceptually illustrated in Figure 1.16c. There are n processor units, each receiving distinct instructions operating over the same data stream and its derivatives. The results (output) of one processor become the input (operands) of the next processor in the pipeline. This structure has received much less attention and has been challenged as impractical by some computer architects. No real embodiment of this class exists.
MIMD computer organization Most multiprocessor systems and multiple-computer systems can be classified in this category (Figure 1.16d). An intrinsic MIMD computer implies interactions among the n processors because all memory streams are derived from the same data space shared by all processors. If the n data streams were derived from disjoint subspaces of the shared memories, then we would have the so-called multiple SISD (MSISD) operation, which is nothing but a set of n independent SISD uniprocessor systems.
1.4.2 Serial Versus Parallel Processing
Tse-yun Feng has suggested the use of the degree of parallelism to classify various computer architectures. Four types of processing methods can be distinguished:
• Word-serial and bit-serial (WSBS)
• Word-parallel and bit-serial (WPBS)
• Word-serial and bit-parallel (WSBP)
• Word-parallel and bit-parallel (WPBP)
WSBS has been called bit-serial processing because one bit (n = m = 1) is processed at a time, a rather slow process. This was done only in the first-generation computers. WPBS (n = 1, m > 1) has been called bis (bit-slice) processing because an m-bit slice is processed at a time. WSBP (n > 1, m = 1), as found in most existing computers, has been called word-slice processing because one word of n bits is processed at a time. Finally, WPBP (n > 1, m > 1) is known as fully parallel processing (or simply parallel processing, if no confusion exists), in which an array of n x m bits is processed at one time; this is the fastest processing mode of the four. In Table 1.4 we have listed a number of computer systems under each processing mode; the system parameters n, m are also shown for each system. The bit-slice processors, like STARAN, MPP, and DAP, all have long bit slices. Illiac-IV and PEPE are two word-slice array processors.
1.4.3 Parallelism Versus Pipelining
Wolfgang Handler has proposed a classification scheme for identifying the degree of parallelism and degree of pipelining built into the hardware structures of a computer system. He considers parallel-pipeline processing at three subsystem levels:
• Processor control unit (PCU)
• Arithmetic logic unit (ALU)
• Bit-level circuit (BLC)
The functions of the PCU and the ALU should be clear to us. Each PCU corresponds to one processor or one CPU. The ALU is equivalent to the processing element (PE) we specified for SIMD array processors. The BLC corresponds to the combinational logic circuitry needed to perform 1-bit operations in the ALU. A computer system C can be characterized by a triple containing six independent entities, as defined below:

T(C) = <K x K', D x D', W x W'>    (1.13)

where
K = the number of processors (PCUs) within the computer
K' = the number of PCUs that can be pipelined
D = the number of ALUs (or PEs) under the control of one PCU
D' = the number of ALUs that can be pipelined (pipeline chaining, to be described in Chapter 4)
W = the word length of an ALU or of a PE
W' = the number of pipeline stages in all ALUs or in a PE
Several real computer examples are used to clarify the above parametric descriptions. The Texas Instruments Advanced Scientific Computer (TI-ASC) has one controller controlling four arithmetic pipelines, each with a 64-bit word length and eight pipeline stages. Thus we have

T(ASC) = <1 x 1, 4 x 1, 64 x 8> = <1, 4, 64 x 8>
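A small, hypothetical helper for writing down Handler-style triples is sketched below. Only the TI-ASC numbers come from the text above; the "bit-level parallelism" product is just one convenient scalar summary, not part of Handler's scheme as described here.

```python
# Representing Handler triples T(C) = <K x K', D x D', W x W'>.
# Values for TI-ASC are taken from the text; everything else is illustrative.
from dataclasses import dataclass

@dataclass
class HandlerTriple:
    K: int   # processors (PCUs)
    Kp: int  # PCUs that can be pipelined
    D: int   # ALUs/PEs under one PCU
    Dp: int  # ALUs that can be chained
    W: int   # word length of an ALU/PE
    Wp: int  # pipeline stages in an ALU/PE

    def __str__(self) -> str:
        return f"<{self.K} x {self.Kp}, {self.D} x {self.Dp}, {self.W} x {self.Wp}>"

    def bit_parallelism(self) -> int:
        # One possible summary figure: total bit-level activity per cycle.
        return self.K * self.Kp * self.D * self.Dp * self.W * self.Wp

ti_asc = HandlerTriple(K=1, Kp=1, D=4, Dp=1, W=64, Wp=8)
print(ti_asc, ti_asc.bit_parallelism())   # <1 x 1, 4 x 1, 64 x 8> 2048
```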
Amdahl's law
In computer architecture, Amdahl's law (or Amdahl's argument[1]) is a formula that gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved. It is named after computer scientist Gene Amdahl and was presented at the AFIPS Spring Joint Computer Conference in 1967. Amdahl's law is often used in parallel computing to predict the theoretical speedup when using multiple processors, and it applies only to cases where the problem size is fixed.
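A small sketch of Amdahl's law as stated above; the parallel fraction used below is a made-up example value.

```python
# Amdahl's law for a fixed workload: if a fraction f of the work is perfectly
# parallelizable over n processors and the rest (1 - f) stays sequential,
# the best possible speedup is 1 / ((1 - f) + f / n).

def amdahl_speedup(f: float, n: int) -> float:
    assert 0.0 <= f <= 1.0 and n >= 1
    return 1.0 / ((1.0 - f) + f / n)

for n in (2, 4, 16, 256, 1024):
    print(n, round(amdahl_speedup(0.95, n), 2))
# With f = 0.95 the speedup saturates near 1 / (1 - f) = 20,
# no matter how many processors are added.
```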
Moore's Law
The quest for higher-performance digital computers seems unending. In the past two decades, the performance of microprocessors has enjoyed an exponential growth. The growth of microprocessor speed/performance by a factor of 2 every 18 months (or about 60% per year) is known as Moore's law. This growth is the result of a combination of two factors:
1. Increase in complexity (related both to higher device density and to larger size) of VLSI chips, projected to rise to around 10M transistors per chip for microprocessors, and 1B for dynamic random-access memories (DRAMs), by the year 2000 [SIA94]
2. Introduction of, and improvements in, architectural features such as on-chip cache memories, large instruction buffers, multiple instruction issue per cycle, multithreading, deep pipelines, out-of-order instruction execution, and branch prediction
Moore's law was originally formulated in 1965 in terms of the doubling of chip complexity every year (later revised to every 18 months), based only on a small number of data points [Scha97]. Moore's revised prediction matches almost perfectly the actual increases in the number of transistors in DRAM and microprocessor chips. Moore's law seems to hold regardless of how one measures processor performance: counting the number of executed instructions per second (IPS), counting the number of floating-point operations per second (FLOPS), or using sophisticated benchmark suites that attempt to measure the processor's performance on real applications.
Figure 1.1. The exponential growth of microprocessor performance, known as Moore's law, shown over the past two decades.
This is because all of these measures, though
numerically different, tend to rise at roughly the same rate. Figure 1.1 shows that the performance of actual processors has in fact followed Moore's law quite closely since 1980 and is on the verge of reaching the GIPS (giga IPS = 10^9 IPS) milestone.
PRINCIPLES OF SCALABLE PERFORMANCE
1. Performance Metrics and Measures
1.1 Parallelism Profile in Programs
1.1.1 Degree of Parallelism The number of processors used at any instant to execute a program is called the degree of parallelism (DOP); this can vary over time. DOP assumes an infinite number of processors are available; this is not achievable in real machines, so some parallel program segments must be executed sequentially as smaller parallel segments. Other resources may impose limiting conditions. A plot of DOP vs. time is called a parallelism profile.
1.1.2 Average Parallelism – 1 Assume the following:
• n homogeneous processors
• the maximum parallelism in a profile is m
• ideally, n >> m
• Δ, the computing capacity of a processor, is something like MIPS or Mflops without regard for memory latency, etc.
• i is the number of processors busy in an observation period (e.g., DOP = i)
• W is the total work (instructions or computations) performed by a program
• A is the average parallelism in the program
1.1.3 Average Parallelism – 2
1.1.4 Average Parallelism – 3
1.1.5 Available Parallelism Various studies have shown that the potential parallelism in scientific and engineering calculations can be very high (e.g., hundreds or thousands of instructions per clock cycle). But in real machines, the actual parallelism is much smaller (e.g., 10 or 20).
1.1.6 Basic Blocks A basic block is a sequence or block of instructions with one entry and one exit. Basic blocks are frequently used as the focus of optimizers in compilers (since it is easier to manage the use of registers utilized in the block). Limiting optimization to basic blocks limits the instruction-level parallelism that can be obtained (to about 2 to 5 in typical code).
1.1.7 Asymptotic Speedup – 1
1.1.8 Asymptotic Speedup – 2
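The average-parallelism formulas themselves do not appear in the text above (they were figures in the original slides). The standard definition they refer to is A = (sum of i * t_i) / (sum of t_i), where t_i is the total time during which the DOP equals i; the sketch below computes it for a made-up parallelism profile.

```python
# Average parallelism from a parallelism profile.
# profile maps a DOP value i -> total time t_i spent at that DOP.
# A = (sum_i i * t_i) / (sum_i t_i); the profile below is illustrative data only.

def average_parallelism(profile):
    total_time = sum(profile.values())
    weighted_work = sum(i * t for i, t in profile.items())
    return weighted_work / total_time

profile = {1: 4.0, 2: 3.0, 4: 2.0, 8: 1.0}     # hypothetical profile
print(round(average_parallelism(profile), 2))  # (4 + 6 + 8 + 8) / 10 = 2.6
```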
1.2 Mean Performance
We seek to obtain a measure that characterizes the mean, or average, performance of a set of benchmark programs with potentially many different execution modes (e.g., scalar, vector, sequential, parallel). We may also wish to associate weights with these programs to emphasize these different modes and yield a more meaningful performance measure.
1.2.1 Arithmetic Mean The arithmetic mean is familiar (the sum of the terms divided by the number of terms). Our measures will use execution rates expressed in MIPS or Mflops. The arithmetic mean of a set of execution rates is proportional to the sum of the inverses of the execution times; it is not inversely proportional to the sum of the execution times. Thus the arithmetic mean fails to represent the real time consumed by the benchmarks when executed.
1.2.2 Harmonic Mean Instead of using the arithmetic or geometric mean, we use the harmonic mean execution rate, which is just the inverse of the arithmetic mean of the execution times (thus guaranteeing the inverse relation not exhibited by the other means). For m programs with rates R1, ..., Rm, the harmonic mean rate is Rh = m / (1/R1 + ... + 1/Rm).
1.2.3 Weighted Harmonic Mean If we associate weights fi with the benchmarks, then we can compute the weighted harmonic mean rate R* = 1 / (f1/R1 + ... + fm/Rm).
1.2.4 Weighted Harmonic Mean Speedup T1 = 1/R1 = 1 is the sequential execution time on a single processor with rate R1 = 1. Ti = 1/Ri = 1/i is the execution time using i processors with a combined execution rate of Ri = i. Now suppose a program has n execution modes with associated weights f1 … fn. The weighted harmonic mean speedup is defined as S = T1 / T* = 1 / (f1/R1 + f2/R2 + … + fn/Rn).
1.2.5 Amdahl's Law Assume Ri = i, and the weights are (a, 0, …, 0, 1 − a). Basically this means the system is used sequentially (with probability a) or all n processors are used (with probability 1 − a). This yields the speedup equation known as Amdahl's law: S = n / (1 + (n − 1)a). The implication is that the best speedup possible is 1/a, regardless of n, the number of processors.
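A short numeric sketch (with made-up rates) of why the harmonic mean, and not the arithmetic mean, of execution rates reflects the time actually consumed, followed by the weighted harmonic mean speedup for the two-mode case of Section 1.2.5.

```python
# Arithmetic vs. harmonic mean of execution rates (MIPS), assuming each of the
# benchmark programs performs the same amount of work (one "unit" each).
# The rates below are made-up numbers used only to show the difference.

rates = [10.0, 100.0, 1000.0]          # MIPS for three hypothetical benchmarks

arithmetic_mean = sum(rates) / len(rates)
harmonic_mean = len(rates) / sum(1.0 / r for r in rates)

# True aggregate rate = total work / total time, with unit work per program:
total_time = sum(1.0 / r for r in rates)
true_rate = len(rates) / total_time

print(round(arithmetic_mean, 1))  # 370.0 -- far from what was actually achieved
print(round(harmonic_mean, 1))    # 27.0  -- equals the true aggregate rate
print(round(true_rate, 1))        # 27.0

# Weighted harmonic mean speedup with mode weights f_i and rates R_i:
def weighted_harmonic_speedup(weights, rates):
    return 1.0 / sum(f / r for f, r in zip(weights, rates))

# Two modes: sequential (rate 1) with weight a, fully parallel on n CPUs (rate n).
a, n = 0.05, 16
print(round(weighted_harmonic_speedup([a, 1 - a], [1, n]), 2))
# 9.14 -- matches Amdahl's law with sequential fraction a.
```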
1.3 Efficiency, Utilization, and Quality
1.3.1 System Efficiency – 1 Assume the following definitions:
O(n) = total number of "unit operations" performed by an n-processor system in completing a program P.
T(n) = execution time required to execute the program P on an n-processor system.
O(n) can be considered similar to the total number of instructions executed by the n processors, perhaps scaled by a constant factor. If we define O(1) = T(1), then it is logical to expect that T(n) < O(n) when n > 1 if the program P is able to make any use at all of the extra processor(s).
1.3.2 System Efficiency – 2 Clearly, the speedup factor (how much faster the program runs with n processors) can now be expressed as S(n) = T(1) / T(n). Recall that we expect T(n) < T(1), so S(n) ≥ 1. System efficiency is defined as E(n) = S(n) / n = T(1) / (n × T(n)). It indicates the actual degree of speedup achieved in a system as compared with the maximum possible speedup. Thus 1/n ≤ E(n) ≤ 1; the value is 1/n when only one processor is used (regardless of n), and the value is 1 when all processors are fully utilized.
1.3.3 Redundancy The redundancy in a parallel computation is defined as R(n) = O(n) / O(1). What values can R(n) take? R(n) = 1 when O(n) = O(1), that is, when the number of operations performed is independent of the number of processors n; this is the ideal case. R(n) = n when every processor performs the same number of operations as when only a single processor is used; this implies that n completely redundant computations are performed. The R(n) figure indicates to what extent the software parallelism is carried over to the hardware implementation without extra operations being performed.
1.3.4 System Utilization System utilization is defined as U(n) = R(n) × E(n) = O(n) / (n × T(n)). It indicates the degree to which the system resources were kept busy during execution of the program. Since 1 ≤ R(n) ≤ n and 1/n ≤ E(n) ≤ 1, the best possible value for U(n) is 1 and the worst is 1/n.
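A quick numeric sketch of S(n), E(n), R(n), and U(n) using made-up measurements, just to show how the four metrics relate.

```python
# The metrics of Section 1.3 for a single (hypothetical) run:
# T(1), T(n) are measured times; O(1), O(n) are operation counts.

def metrics(T1, Tn, O1, On, n):
    S = T1 / Tn            # speedup
    E = S / n              # efficiency,   1/n <= E <= 1
    R = On / O1            # redundancy,   1   <= R <= n
    U = R * E              # utilization = O(n) / (n * T(n))
    return S, E, R, U

# Made-up numbers: 8 processors, with some extra (redundant) work in the parallel run.
S, E, R, U = metrics(T1=100.0, Tn=16.0, O1=100.0, On=120.0, n=8)
print(f"S={S:.2f}  E={E:.3f}  R={R:.2f}  U={U:.3f}")
# S=6.25  E=0.781  R=1.20  U=0.938
```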
SPEEDUP PERFORMANCE LAWS
The main objective is to produce the results as early as possible; in other words, minimal turnaround time is the primary goal. Three speedup performance laws are defined below:
1. Amdahl's law (1967) is based on a fixed workload or fixed problem size.
2. Gustafson's law (1987) applies to scalable problems, where the problem size increases with the increase in machine size.
3. The speedup model by Sun and Ni (1993) is for scaled problems bounded by memory capacity.
Amdahl's Law for fixed workload In many practical applications the computational workload is often fixed with a fixed problem size. As the number of processors increases, the fixed workload is distributed across them. The speedup obtained for time-critical applications is called fixed-load speedup.
Fixed-Load Speedup The ideal speedup formula is based on a fixed workload, regardless of machine size. We consider two cases: DOP < n and DOP ≥ n.
Parallel algorithm
In computer science, a parallel algorithm, as opposed to a traditional serial algorithm, is an algorithm which can be executed a piece at a time on many different processing devices and then combined together again at the end to get the correct result.[1] Many parallel algorithms are executed concurrently, though in general concurrent algorithms are a distinct concept, and thus these concepts are often conflated, with which aspect of an algorithm is parallel and which is concurrent not being clearly distinguished. Further, non-parallel, non-concurrent algorithms are often referred to as "sequential algorithms", by contrast with concurrent algorithms.
Examples of Parallel Algorithms
This section describes and analyzes several parallel algorithms. These algorithms provide examples of how to analyze algorithms in terms of work and depth and of how to use nested data-parallel constructs. They also introduce some important ideas concerning parallel algorithms. We mention again that the main goals are to have the code closely match the high-level intuition of the algorithm and to make it easy to analyze the asymptotic performance from the code.
Parallel Algorithm Complexity
Analysis of an algorithm helps us determine whether the algorithm is useful or not. Generally, an algorithm is analyzed based on its execution time (time complexity) and the amount of space it requires (space complexity). Since sophisticated memory devices are available at reasonable cost, storage space is no longer an issue, so space complexity is not given as much importance. Parallel algorithms are designed to improve the computation speed of a computer. For analyzing a parallel algorithm, we normally consider the following parameters −
• Time complexity (execution time),
• Total number of processors used, and
• Total cost.
Time Complexity
The main reason behind developing parallel algorithms was to reduce the computation time of an algorithm. Thus, evaluating the execution time of an algorithm is extremely important in analyzing its efficiency. Execution time is measured on the basis of the time taken by the algorithm to solve a problem. The total execution time is calculated from the moment the algorithm starts executing to the moment it stops. If all the processors do not start or end execution at the same time, then the total execution time of the algorithm runs from the moment the first processor starts its execution to the moment the last processor stops its execution.
The time complexity of an algorithm can be classified into three categories:
• Worst-case complexity − when the amount of time required by an algorithm for a given input is maximum.
• Average-case complexity − when the amount of time required by an algorithm for a given input is average.
• Best-case complexity − when the amount of time required by an algorithm for a given input is minimum.
Asymptotic Analysis
The complexity or efficiency of an algorithm is the number of steps executed by the algorithm to get the desired output. Asymptotic analysis is done to calculate the complexity of an algorithm in its theoretical analysis. In asymptotic analysis, a large input length is used to calculate the complexity function of the algorithm.
Note − An asymptote is a line that a curve approaches but does not intersect; here the line and the curve are asymptotic to each other.
Asymptotic notation is the easiest way to describe the fastest and slowest possible execution times of an algorithm using upper and lower bounds on speed. For this, we use the following notations:
• Big O notation
• Omega notation
• Theta notation
Big O notation
In mathematics, Big O notation is used to represent the asymptotic characteristics of functions. It represents the behavior of a function for large inputs in a simple and accurate way. It is a method of representing the upper bound of an algorithm's execution time, i.e., the longest amount of time the algorithm could take to complete its execution. The function:
f(n) = O(g(n)) iff there exist positive constants c and n0 such that f(n) ≤ c * g(n) for all n where n ≥ n0.
Omega notation
Omega notation is a method of representing the lower bound of an algorithm's execution time. The function:
f(n) = Ω(g(n)) iff there exist positive constants c and n0 such that f(n) ≥ c * g(n) for all n where n ≥ n0.
Theta notation
Theta notation is a method of representing both the lower bound and the upper bound of an algorithm's execution time. The function:
f(n) = θ(g(n)) iff there exist positive constants c1, c2, and n0 such that c1 * g(n) ≤ f(n) ≤ c2 * g(n) for all n where n ≥ n0.
Speedup of an Algorithm
The performance of a parallel algorithm is determined by calculating its speedup. Speedup is defined as the ratio of the worst-case execution time of the fastest known sequential algorithm for a particular problem to the worst-case execution time of the parallel algorithm:
speedup = worst-case execution time of the fastest known sequential algorithm for the problem / worst-case execution time of the parallel algorithm
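As a quick illustration of the Big O definition above, the snippet below spot-checks hand-chosen witness constants c and n0 for the made-up pair f(n) = 3n + 10 and g(n) = n; the finite check is only a sanity test, the real claim follows from simple algebra (3n + 10 ≤ 4n whenever n ≥ 10).

```python
# Numeric spot-check of the Big-O witnesses c and n0 for f(n) = 3n + 10, g(n) = n.
f = lambda n: 3 * n + 10
g = lambda n: n
c, n0 = 4, 10

assert all(f(n) <= c * g(n) for n in range(n0, 100_000))
print("f(n) = 3n + 10 is O(n), witnessed by c = 4, n0 = 10")
```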
Number of Processors Used
The number of processors used is an important factor in analyzing the efficiency of a parallel algorithm. The cost to buy, maintain, and run the computers is calculated; the larger the number of processors used by an algorithm to solve a problem, the more costly the obtained result becomes.
Total Cost
The total cost of a parallel algorithm is the product of time complexity and the number of processors used in that particular algorithm:
Total cost = time complexity × number of processors used
Therefore, the efficiency of a parallel algorithm is:
Efficiency = worst-case execution time of the sequential algorithm / (number of processors × worst-case execution time of the parallel algorithm)
Models of Parallel Processing
Parallel processors come in many different varieties.
1. SIMD VERSUS MIMD ARCHITECTURES
Within the SIMD category, two fundamental design choices exist:
1. Synchronous versus asynchronous SIMD. In a SIMD machine, each processor can execute or ignore the instruction being broadcast based on its local state or data-dependent conditions. However, this leads to some inefficiency in executing conditional computations. For example, an "if-then-else" statement is executed by first enabling the processors for which the condition is satisfied and then flipping the "enable" bit before getting into the "else" part. On average, half of the processors will be idle for each branch. The situation is even worse for "case" statements involving multiway branches. A possible cure is to use the asynchronous version of SIMD, known as SPMD (spim-dee, or single-program, multiple data), where each processor runs its own copy of the common program. The advantage of SPMD is that in an "if-then-else" computation, each processor will only spend time on the relevant branch. The disadvantages include the need for occasional synchronization and the higher complexity of each processor, which must now have a program memory and instruction fetch/decode logic.
2. Custom- versus commodity-chip SIMD. A SIMD machine can be designed based on commodity (off-the-shelf) components or with custom chips. In the first approach, components tend to be inexpensive because of mass production. However, such general-purpose components will likely contain elements that may not be needed for a particular design. These extra components may complicate the design, manufacture, and testing of the SIMD machine and may introduce speed penalties as well. Custom components (including ASICs = application-specific ICs, multichip modules, or WSI = wafer-scale integrated circuits) generally offer better performance but lead to much higher cost, in view of their development costs being borne by a relatively small number of parallel machine users (as opposed to commodity microprocessors that are produced in the millions). As integrating multiple processors along with ample memory on a single VLSI chip becomes feasible, a type of convergence between the two approaches appears imminent.
Within the MIMD class, three fundamental issues or design choices are subjects of ongoing debate in the research community:
1. MPP: massively or moderately parallel processor. Is it more cost-effective to build a parallel processor out of a relatively small number of powerful processors or a massive number of very simple processors (the "herd of elephants" or the "army of ants" approach)?
Referring to Amdahl's law, the first choice does better on the inherently sequential part of a computation, while the second approach might allow a higher speedup for the parallelizable part. A general answer cannot be given to this question, as the best choice is both application- and technology-dependent.
2. Tightly versus loosely coupled MIMD. Which is a better approach to high-performance computing: using specially designed multiprocessors/multicomputers, or a collection of ordinary workstations that are interconnected by commodity networks (such as Ethernet or ATM) and whose interactions are coordinated by special system software and distributed file systems? The latter choice, sometimes referred to as network of workstations (NOW) or cluster computing, has been gaining popularity in recent years. However, many open problems exist for taking full advantage of such network-based loosely coupled architectures. The hardware, system software, and applications aspects of NOWs are being investigated by numerous research groups.
3. Explicit message passing versus virtual shared memory. Which scheme is better: forcing the users to explicitly specify all messages that must be sent between processors, or allowing them to program in an abstract higher-level model, with the required messages automatically generated by the system software? This question is essentially very similar to the one asked in the early days of high-level languages and virtual memory. At some point in the past, programming in assembly languages and doing explicit transfers between secondary and primary memories could lead to higher efficiency. However, nowadays software is so complex, and compilers and operating systems so advanced (not to mention processing power so cheap), that it no longer makes sense to hand-optimize programs, except in limited time-critical instances. We are not yet at that point in parallel processing, however, and hiding the explicit communication structure of a parallel machine from the programmer has nontrivial consequences for performance. (A minimal sketch of the explicit message-passing style follows this list.)
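To make design choice 3 concrete, here is a minimal, illustrative sketch of the explicit-message-passing style using Python's multiprocessing module; the partitioning and the summing task are made up for the example. A virtual-shared-memory system would hide this traffic behind ordinary loads and stores.

```python
# Explicit message passing: each worker computes a partial result and
# explicitly sends it back; nothing is shared implicitly between processes.
from multiprocessing import Process, Queue

def worker(chunk, out):
    out.put(sum(chunk))           # the only communication is this explicit message

if __name__ == "__main__":
    data = list(range(1_000))
    chunks = [data[i::4] for i in range(4)]   # hand-partitioned among 4 workers
    out = Queue()
    procs = [Process(target=worker, args=(c, out)) for c in chunks]
    for p in procs:
        p.start()
    total = sum(out.get() for _ in procs)     # gather the explicit messages
    for p in procs:
        p.join()
    print(total == sum(data))                 # True
```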
THE PRAM SHARED-MEMORY MODEL
The theoretical model used for conventional or sequential computers (the SISD class) is known as the random-access machine (RAM), not to be confused with random-access memory, which has the same acronym. The parallel version of RAM [PRAM (pea-ram)] constitutes an abstract model of the class of global-memory parallel processors. The abstraction consists of ignoring the details of the processor-to-memory interconnection network and taking the view that each processor can access any memory location in each machine cycle, independent of what other processors are doing.
DISTRIBUTED-MEMORY OR GRAPH MODELS
The interconnection network is usually represented as a graph, with vertices corresponding to processor-memory nodes and edges corresponding to communication links. If communication links are unidirectional, then directed edges are used. Undirected edges imply bidirectional communication, although not necessarily in both directions at once. Important parameters of an interconnection network include:
1. Network diameter: the longest of the shortest paths between various pairs of nodes, which should be relatively small if network latency is to be minimized. The network diameter is more important with store-and-forward routing (when a message is stored in its entirety and retransmitted by intermediate nodes) than with wormhole routing (when a message is quickly relayed through a node in small pieces).
2. Bisection (band)width: the smallest number (total capacity) of links that need to be cut in order to divide the network into two subnetworks of half the size. This is important when nodes communicate with each other in a random fashion. A small bisection (band)width limits the rate of data transfer between the two halves of the network, thus affecting the performance of communication-intensive algorithms.
3. Vertex or node degree: the number of communication ports required of each node, which should be a constant independent of network size if the architecture is to be readily scalable to larger sizes. The node degree has a direct effect on the cost of each node, with the effect being more significant for parallel ports containing several wires or when the node is required to communicate over all of its ports at once.
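The three parameters just listed can be checked by brute force on small example networks. The sketch below is illustrative only: it uses two made-up 8-node topologies (a ring and a 3-dimensional hypercube) and computes diameter, bisection width, and node degree directly from the graph.

```python
# Brute-force computation of diameter, bisection width, and node degree
# for tiny example networks. Feasible only because the graphs are small.
from collections import deque
from itertools import combinations

def diameter(adj):
    def bfs(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    return max(bfs(v) for v in adj)

def bisection_width(adj):
    nodes = sorted(adj)
    half = len(nodes) // 2
    edges = {(u, v) for u in adj for v in adj[u] if u < v}
    best = len(edges)
    for part in combinations(nodes, half):        # try every balanced split
        a = set(part)
        cut = sum(1 for u, v in edges if (u in a) != (v in a))
        best = min(best, cut)
    return best

ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}       # 8-node ring
cube = {i: [i ^ (1 << b) for b in range(3)] for i in range(8)} # 3-cube

for name, g in (("8-node ring", ring), ("3-cube", cube)):
    degree = max(len(nbrs) for nbrs in g.values())
    print(name, diameter(g), bisection_width(g), degree)
# Expected: ring -> diameter 4, bisection 2, degree 2; 3-cube -> 3, 4, 3.
```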
CIRCUIT MODEL AND PHYSICAL REALIZATIONS
In a sense, the only sure way to predict the performance of a parallel architecture on a given set of problems is to actually build the machine and run the programs on it. Because this is often impossible or very costly, the next best thing is to model the machine at the circuit level, so that all computational and signal-propagation delays can be taken into account. Unfortunately, this is also impossible for a complex supercomputer, both because generating and debugging detailed circuit specifications are not much easier than a full-blown implementation and because a circuit simulator would take eons to run the simulation. Despite the above observations, we can produce and evaluate circuit-level designs for specific applications.
GLOBAL VERSUS DISTRIBUTED MEMORY
Within the MIMD class of parallel processors, memory can be global or distributed. Global memory may be visualized as being in a central location where all processors can access it with equal ease (or with equal difficulty, if you are a half-empty-glass type of person). Figure 4.3 shows a possible hardware organization for a global-memory parallel processor: processors access memory through a special processor-to-memory network. A global-memory multiprocessor is characterized by the type and number p of processors, the capacity and number m of memory modules, and the network architecture. Even though p and m are independent parameters, achieving high performance typically requires that they be comparable in magnitude (e.g., too few memory modules will cause contention among the processors and too many would complicate the network design).
Distributed-memory architectures can be conceptually viewed as in Figure 4.5. A collection of p processors, each with its own private memory, communicates through an interconnection network. Here, the latency of the interconnection network may be less critical, as each processor is likely to access its own local memory most of the time. However, the communication bandwidth of the network may or may not be critical, depending on the type of parallel applications and the extent of task interdependencies. Note that each processor is usually connected to the network through multiple links or channels (this is the norm here, although it can also be the case for shared-memory parallel processors).
Cache coherence
In computer architecture, cache coherence is the uniformity of shared resource data that ends up stored in multiple local caches. When clients in a system maintain caches of a common memory resource, problems may arise with incoherent data, which is particularly the case with CPUs in a multiprocessing system. As an illustration, consider two clients that both hold a cached copy of a particular memory block from a previous read. If one client updates that memory block, the other could be left with an invalid cached copy without any notification of the change. Cache coherence is intended to manage such conflicts by maintaining a coherent view of the data values in multiple caches.
The following are the requirements for cache coherence:[2]
Write propagation: changes to the data in any cache must be propagated to other copies (of that cache line) in the peer caches.
Transaction serialization: reads/writes to a single memory location must be seen by all processors in the same order.
Coherence protocols
Coherence protocols apply cache coherence in multiprocessor systems. The intention is that two clients must never see different values of the same shared data. The protocol must implement the basic requirements for coherence; it can be tailor-made for the target system or application. Protocols can also be classified as snooping (snoopy/broadcast) or directory-based. Typically, early systems used directory-based protocols, where a directory keeps track of the data being shared and of the sharers. In snoopy protocols, transaction requests (read/write/upgrade) are sent out to all processors; all processors snoop the request and respond appropriately.
Write propagation in snoopy protocols can be implemented by either of the following:
Write invalidate: when a write operation is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped memory location, thus forcing reads from main memory of the new value on the next access.[4]
Write update: when a write operation is observed to a location that a cache has a copy of, the cache controller updates its own copy of the snooped memory location with the new data.
If the protocol design states that whenever any copy of the shared data is changed, all the other copies must be "updated" to reflect the change, then it is a write-update protocol. If the design states that a write to a cached copy by any processor requires the other processors to discard/invalidate their cached copies, then it is a write-invalidate protocol. However, scalability is one shortcoming of broadcast protocols. Various models and protocols have been devised for maintaining coherence.
Parallel Algorithm Models
The model of a parallel algorithm is developed by considering a strategy for dividing the data and the processing method, and applying a suitable strategy to reduce interactions. In this chapter, we will discuss the following parallel algorithm models −
• Data parallel model
• Task graph model
• Work pool model
• Master-slave model
• Producer-consumer or pipeline model
• Hybrid model
Data Parallel Model
In the data parallel model, tasks are assigned to processes and each task performs similar types of operations on different data. Data parallelism is a consequence of a single operation being applied to multiple data items. The data-parallel model can be applied to shared-address-space and message-passing paradigms. In the data-parallel model, interaction overheads can be reduced by selecting a locality-preserving decomposition, by using optimized collective interaction routines, or by overlapping computation with interaction. The primary characteristic of data-parallel problems is that the intensity of data parallelism increases with the size of the problem, which in turn makes it possible to use more processes to solve larger problems.
Example − Dense matrix multiplication.
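A minimal sketch of the dense matrix multiplication example in the data-parallel style: the same row-block computation is applied to different pieces of A by a pool of worker processes. This is pure-Python, illustrative code only; a production implementation would use NumPy or a BLAS for each block.

```python
# Data-parallel dense matrix multiplication: C = A * B, with blocks of rows of A
# handed to a pool of worker processes that all perform the same operation.
from multiprocessing import Pool

def rows_times_b(args):
    a_rows, b = args
    k, n = len(b), len(b[0])
    return [[sum(row[x] * b[x][j] for x in range(k)) for j in range(n)] for row in a_rows]

def parallel_matmul(a, b, workers=4):
    chunk = (len(a) + workers - 1) // workers
    pieces = [a[i:i + chunk] for i in range(0, len(a), chunk)]
    with Pool(workers) as pool:
        blocks = pool.map(rows_times_b, [(p, b) for p in pieces])
    return [row for block in blocks for row in block]

if __name__ == "__main__":
    a = [[i + j for j in range(8)] for i in range(8)]
    b = [[(i * j) % 5 for j in range(8)] for i in range(8)]
    serial = [[sum(a[i][x] * b[x][j] for x in range(8)) for j in range(8)] for i in range(8)]
    print(parallel_matmul(a, b) == serial)   # True
```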
Task Graph Model
In the task graph model, parallelism is expressed by a task graph, which can be either trivial or nontrivial. In this model, the correlation among the tasks is utilized to promote locality or to minimize interaction costs. This model is used to solve problems in which the quantity of data associated with the tasks is large compared to the amount of computation associated with them. The tasks are assigned so as to reduce the cost of data movement among them.
Examples − Parallel quicksort, sparse matrix factorization, and parallel algorithms derived via the divide-and-conquer approach.
Here, problems are divided into atomic tasks and implemented as a graph. Each task is an independent unit of work that has dependencies on one or more antecedent tasks. After
• 20. By GS kosta Work Pool Model
In the work pool model, tasks are dynamically assigned to processes in order to balance the load, so any process may potentially execute any task. This model is used when the quantity of data associated with the tasks is comparatively smaller than the computation associated with them. No pre-assignment of tasks onto processes is required, and the assignment may be centralized or decentralized. Pointers to the tasks may be kept in a physically shared list, priority queue, hash table, or tree, or in a physically distributed data structure. The tasks may all be available at the beginning, or they may be generated dynamically. If tasks are generated dynamically and assignment is decentralized, a termination detection algorithm is required so that all processes can detect the completion of the entire program and stop looking for more tasks. Example − Parallel tree search.
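A minimal work-pool sketch, assuming Python threads and a centralized shared queue: any worker may pull any task, and new tasks (the children of a tree node) are generated dynamically and pushed back into the pool. The tree, the sentinel-based shutdown, and the worker count are invented for illustration; queue.Queue.join() plays the role of termination detection in this simplified, centralized setting.

    # Work-pool sketch: workers pull tasks from a shared queue and may add new ones.
    import queue, threading

    tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": [], "a1": [], "a2": []}
    pool = queue.Queue()
    visited, lock = [], threading.Lock()

    def worker():
        while True:
            node = pool.get()                 # pull the next task from the pool
            if node is None:                  # sentinel: no more work
                pool.task_done()
                return
            with lock:
                visited.append(node)
            for child in tree[node]:          # dynamically generate new tasks
                pool.put(child)
            pool.task_done()

    pool.put("root")
    workers = [threading.Thread(target=worker) for _ in range(3)]
    for w in workers:
        w.start()
    pool.join()                               # all queued tasks have been processed
    for _ in workers:                         # shut the workers down
        pool.put(None)
    for w in workers:
        w.join()
    print(sorted(visited))                    # all five tree nodes were explored

In a decentralized or distributed setting, a proper termination-detection algorithm would replace the simple join-plus-sentinel shutdown used here.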
• 21. By GS kosta Master-Slave Model
In the master-slave model, one or more master processes generate tasks and allocate them to slave processes. The tasks may be allocated beforehand if −
• the master can estimate the volume of the tasks, or
• a random assignment can do a satisfactory job of balancing the load, or
• slaves are assigned smaller pieces of the task at different times.
This model is generally equally suitable for shared-address-space and message-passing paradigms, since the interaction is naturally two-way. In some cases, a task may need to be completed in phases, and the tasks of each phase must be completed before the tasks of the next phase can be generated. The master-slave model can be generalized to a hierarchical or multi-level master-slave model, in which the top-level master feeds a large portion of the tasks to second-level masters, each of which further subdivides its tasks among its own slaves and may perform a part of the task itself.
Precautions in using the master-slave model
Care should be taken to ensure that the master does not become a congestion point; this may happen if the tasks are too small or the workers are comparatively fast. Tasks should be chosen so that the cost of performing a task dominates the cost of communication and synchronization. Asynchronous interaction may help overlap interaction with the computation associated with work generation by the master.
Pipeline Model
This model is also known as the producer-consumer model. A stream of data is passed through a series of processes, each of which performs some task on it, and the arrival of new data triggers the execution of a new task by a process in the queue. The processes could form a queue in the shape of a linear or multidimensional array, a tree, or a general graph with or without cycles. The model is a chain of producers and consumers: each process in the queue can be considered a consumer of the sequence of data items produced by the process preceding it and a producer of data for the process following it. The queue need not be a linear chain; it can be a directed graph. The most common interaction-minimization technique applicable to this model is overlapping interaction with computation. Example − Parallel LU factorization algorithm.
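The following sketch illustrates the producer-consumer (pipeline) model with three Python threads connected by queues: each stage consumes items from the queue behind it and produces items for the queue ahead of it. The three stages (generate, square, print) and the DONE sentinel are made up for this example and are not part of any particular parallel LU factorization.

    # Pipeline sketch: a three-stage producer-consumer chain connected by queues.
    import queue, threading

    DONE = object()                      # sentinel marking the end of the stream

    def produce(out_q):
        for x in range(5):               # stage 1: generate raw data
            out_q.put(x)
        out_q.put(DONE)

    def square(in_q, out_q):
        while True:                      # stage 2: transform each item as it arrives
            x = in_q.get()
            if x is DONE:
                out_q.put(DONE)
                return
            out_q.put(x * x)

    def consume(in_q):
        while True:                      # stage 3: final consumer of the stream
            x = in_q.get()
            if x is DONE:
                return
            print(x)                     # 0, 1, 4, 9, 16

    q1, q2 = queue.Queue(), queue.Queue()
    stages = [threading.Thread(target=produce, args=(q1,)),
              threading.Thread(target=square, args=(q1, q2)),
              threading.Thread(target=consume, args=(q2,))]
    for t in stages:
        t.start()
    for t in stages:
        t.join()

Because each stage starts working as soon as its first input arrives, computation in one stage overlaps with the interaction feeding the next, which is the interaction-minimization technique mentioned above.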
• 22. By GS kosta Hybrid Models
A hybrid algorithm model is required when more than one model may be needed to solve a problem. A hybrid model may be composed either of multiple models applied hierarchically or of multiple models applied sequentially to different phases of a parallel algorithm. Example − Parallel quicksort.
Shared Memory / Parallel Processing in Memory
In computer science, shared memory is memory that may be simultaneously accessed by multiple programs, with the intent of providing communication among them or avoiding redundant copies. Shared memory is an efficient means of passing data between programs. Depending on context, the programs may run on a single processor or on multiple separate processors. Using memory for communication inside a single program, e.g. among its multiple threads, is also referred to as shared memory.
In hardware
In computer hardware, shared memory refers to a (typically large) block of random-access memory (RAM) that can be accessed by several different central processing units (CPUs) in a multiprocessor computer system. Shared memory systems may use:[1]
• uniform memory access (UMA): all the processors share the physical memory uniformly;
• non-uniform memory access (NUMA): memory access time depends on the memory location relative to a processor;
• cache-only memory architecture (COMA): the local memory at each node is used as a cache instead of as actual main memory.
In software
In computer software, shared memory is either
• 23. By GS kosta
• a method of inter-process communication (IPC), i.e. a way of exchanging data between programs running at the same time, where one process creates an area in RAM which other processes can access; or
• a method of conserving memory space by directing accesses to what would ordinarily be copies of a piece of data to a single instance instead, by using virtual memory mappings or with explicit support of the program in question. This is most often used for shared libraries and for execute-in-place (XIP).
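As a concrete example of shared memory used for IPC, the sketch below uses Python's multiprocessing.shared_memory module (available since Python 3.8): one process creates a named block of RAM, and a second process attaches to the same block by name and reads the data directly, without copying it through pipes or sockets. The block size, the message, and the reader function are chosen only for illustration.

    # Shared-memory IPC sketch: a writer and a reader share one named block of RAM.
    from multiprocessing import Process, shared_memory

    def reader(name):
        # Attach to the existing shared block by name and read its contents.
        block = shared_memory.SharedMemory(name=name)
        print(bytes(block.buf[:5]))          # b'hello'
        block.close()

    if __name__ == "__main__":
        shm = shared_memory.SharedMemory(create=True, size=16)
        shm.buf[:5] = b"hello"               # writer places data directly in RAM
        p = Process(target=reader, args=(shm.name,))
        p.start()
        p.join()
        shm.close()
        shm.unlink()                         # creator frees the block when both sides are done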