This document discusses parallel processing and the evolution of computer systems. It covers several topics:
- The evolution of computer systems from vacuum tubes to integrated circuits, organized into generations.
- Concepts of parallel processing including Flynn's classification of computer architectures based on instruction and data streams.
- Parallel processing mechanisms in uniprocessor systems including pipelining and memory hierarchies.
- Three classes of parallel computer structures: pipeline computers, array processors, and multiprocessor systems.
- Architectural classification schemes including Flynn's, Feng's based on serial vs parallel processing, and Handler's based on parallelism levels.
1. By GS kosta
INTRODUCTION TO PARALLEL PROCESSING
Parallel computer structures are characterized as pipeline computers, array processors, and multiprocessor systems. Several new computing concepts, including data flow and VLSI approaches, are also introduced.
1.1 EVOLUTION OF COMPUTER SYSTEMS
The evolution of computer systems has been physically marked by the rapid change of building blocks from relays and vacuum tubes (1940s-1950s) to discrete diodes and transistors (1950s-1960s), to small- and medium-scale integrated (SSI/MSI) circuits (1960s-1970s), and to large- and very-large-scale integrated (LSI/VLSI) devices (1970s and beyond). Increases in device speed and reliability and reductions in hardware cost and physical size have greatly enhanced computer performance.
1.1.1 Generations of Computer Systems
The first generation (1938-1953). The introduction of the first electronic analog computer in 1938 and the first electronic digital computer, ENIAC (Electronic Numerical Integrator and Computer), in 1946 marked the beginning of the first generation of computers. Electromechanical relays were used as switching devices in the 1940s, and vacuum tubes were used in the 1950s.
The second generation (1952-1963). Transistors were invented in 1948. The first transistorized digital computer, TRADIC, was built by Bell Laboratories in 1954. Discrete transistors and diodes were the building blocks: 800 transistors were used in TRADIC. Printed circuits appeared.
The third generation (1962-1975). This generation was marked by the use of small-scale integrated (SSI) and medium-scale integrated (MSI) circuits as the basic building blocks. Multilayered printed circuits were used. Core memory was still used in the CDC-6600 and other machines but, by 1968, many fast computers, like the CDC-7600, began to replace cores with solid-state memories.
The fourth generation (1972-present). The present generation computers emphasize the use of large-scale integrated (LSI) circuits for both logic and memory sections. High-density packaging has appeared. High-level languages are being extended to handle both scalar and vector data, like the extended Fortran in many vector processors.
The future. Computers to be used in the 1990s may be the next generation. Very-large-scale integrated (VLSI) chips will be used along with high-density modular design. Multiprocessors like the 16 processors in the S-1 project at Lawrence Livermore National Laboratory and in Denelcor's HEP will be required. The Cray-2 is expected to have four processors, to be delivered in 1985. More than 1000 mega floating-point operations per second (megaflops) are expected in these future supercomputers.
1.1.2 Trends Towards Parallel Processing
According to Sidney Fernbach: "Today's large computers (mainframes) would have been considered 'supercomputers' 10 to 20 years ago. By the same token, today's supercomputers will be considered 'state-of-the-art' standard equipment 10 to 20 years from now." From an application point of view, the mainstream usage of computers is experiencing a trend of four ascending levels of sophistication:
• Data processing
• Information processing
• Knowledge processing
• Intelligence processing
We are in an era which is promoting the use of computers not only for conventional data-information processing, but also toward the building of workable machine knowledge-intelligence systems to advance human civilization. Many computer scientists feel that the degree of parallelism exploitable at the two highest processing levels should be higher than that at the data-information processing levels.
From an operating system point of view, computer systems have improved chronologically in four phases:
• Batch processing
• Multiprogramming
• Time sharing
• Multiprocessing
In these four operating modes, the degree of parallelism increases sharply from phase to phase. The general trend is to emphasize parallel processing of information. In what follows, the term information is used with an extended meaning to include data, information, knowledge, and intelligence. We formally define parallel processing as follows:
Parallel processing is an efficient form of information processing which emphasizes the exploitation of concurrent events in the computing process. Concurrency implies parallelism, simultaneity, and pipelining. Parallel events may occur in multiple resources during the same time interval; simultaneous events may occur at the same time instant; and pipelined events may occur in overlapped time spans.
Parallel processing can be challenged at four programmatic levels:
• Job or program level
• Task or procedure level
• Interinstruction level
• Intrainstruction level
The highest job level is often conducted algorithmically. The lowest intrainstruction level is often implemented directly by hardware means. Hardware roles increase from high to low levels. On the other hand, software implementations increase from low to high levels. The trade-off between hardware and software approaches to solve a problem is always a very controversial issue. As hardware cost declines and software cost increases, more and more hardware methods are replacing the conventional software approaches. The trend is also supported by the increasing demand for a faster real-time, resource-sharing, and fault-tolerant computing environment.
As far as parallel processing is concerned, the general architectural trend is shifting away from conventional uniprocessor systems to multiprocessor systems or to an array of processing elements controlled by one uniprocessor. In all cases, a high degree of pipelining is being incorporated into the various system levels.
1.2 PARALLELISM IN UNIPROCESSOR SYSTEMS
1.2.1 Basic Uniprocessor Architecture
A typical uniprocessor computer consists of three major components: the main memory, the central processing unit (CPU), and the input-output (I/O) subsystem.
The architectures of two commercially available uniprocessor computers are given below to show the possible interconnection of structures among the three subsystems. We will examine major components in the CPU and in the I/O subsystem.
1.2.2 Parallel Processing Mechanisms
A number of parallel processing mechanisms have been developed in uniprocessor computers. We identify them in the following six categories:
• Multiplicity of functional units
• Parallelism and pipelining within the CPU
• Overlapped CPU and I/O operations
• Use of a hierarchical memory system
• Balancing of subsystem bandwidths
• Multiprogramming and time sharing
Multiplicity of functional units. The early computer had only one arithmetic and logic unit in its CPU. Furthermore, the ALU could only perform one function at a time, a rather slow process for executing a long sequence of arithmetic logic instructions. In practice, many of the functions of the ALU can be distributed to multiple and specialized functional units which can operate in parallel. The CDC-6600 (designed in 1964) has 10 functional units built into its CPU (Figure 1.5). These 10 units are independent of each other and may operate simultaneously. A scoreboard is used to keep track of the availability of the functional units and registers being demanded. With 10 functional units and 24 registers available, the instruction issue rate can be significantly increased.
Another good example of a multifunction uniprocessor is the IBM 360/91 (1968), which has two parallel execution units.
Parallelism and pipelining within the CPU. Parallel adders, using such techniques as carry-lookahead and carry-save, are now built into almost all ALUs. This is in contrast to the bit-serial adders used in the first-generation machines. High-speed multiplier recoding and convergence division are techniques for exploring parallelism and the sharing of hardware resources for the functions of multiply and divide. The use of multiple functional units is a form of parallelism within the CPU.
Various phases of instruction execution are now pipelined, including instruction fetch, decode, operand fetch, arithmetic logic execution, and store result. To facilitate overlapped instruction executions through the pipe, instruction prefetch and data buffering techniques have been developed.
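The benefit of overlapping the five phases listed above can be estimated with a small calculation. The sketch below is an idealized model, not a description of any particular machine: it assumes every stage takes exactly one clock cycle and that no stalls occur. The function names are hypothetical.

```python
# Idealized pipeline timing model (assumption: one cycle per stage, no stalls).
STAGES = ["instruction fetch", "decode", "operand fetch",
          "arithmetic logic execution", "store result"]


def sequential_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed when each instruction finishes before the next starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed when a new instruction enters the pipe every cycle.

    The first instruction needs num_stages cycles to fill the pipe; each
    later instruction completes one cycle after its predecessor.
    """
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1


def speedup(num_instructions: int, num_stages: int = len(STAGES)) -> float:
    """Ratio of sequential time to pipelined time under the ideal model."""
    return (sequential_cycles(num_instructions, num_stages)
            / pipelined_cycles(num_instructions, num_stages))


if __name__ == "__main__":
    for n in (1, 10, 100, 10_000):
        print(n, pipelined_cycles(n, len(STAGES)), round(speedup(n), 3))
```

Under these assumptions the speedup approaches the number of stages as the instruction count grows, which is why prefetching and buffering (which keep the pipe full) matter.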
Overlapped CPU and I/O operations. I/O operations can be performed simultaneously with the CPU computations by using separate I/O controllers, channels, or I/O processors. The direct-memory-access (DMA) channel can be used to provide direct information transfer between the I/O devices and the main memory. The DMA is conducted on a cycle-stealing basis, which is apparent to the CPU.
Use of hierarchical memory system. Usually, the CPU is about 1000 times faster than memory access. A hierarchical memory system can be used to close up the speed gap. Computer memory hierarchy is conceptually illustrated in Figure 1.6. The innermost level is the register files directly addressable by the ALU. Cache memory can be used to serve as a buffer between the CPU and the main memory. Block access of the main memory can be achieved through multiway interleaving across parallel memory modules (see Figure 1.4). Virtual memory space can be established with the use of disks and tape units at the outer levels.
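To see how a cache narrows the speed gap, a common back-of-the-envelope model weights each level's access time by how often it serves the request. The sketch below assumes a two-level hierarchy in which a hit costs only the cache time and a miss costs only the main-memory time. The timings and hit ratio are made-up illustrative values, not figures from these notes.

```python
def effective_access_time(hit_ratio: float,
                          cache_time_ns: float,
                          memory_time_ns: float) -> float:
    """Average access time for a cache backed by main memory.

    Assumed model: a hit costs cache_time_ns, a miss costs memory_time_ns.
    Other textbooks charge a miss cache_time_ns + memory_time_ns instead.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must lie between 0 and 1")
    return hit_ratio * cache_time_ns + (1.0 - hit_ratio) * memory_time_ns


if __name__ == "__main__":
    # Hypothetical numbers: 10 ns cache, 100 ns main memory.
    for h in (0.0, 0.80, 0.95, 0.99):
        print(f"hit ratio {h:.2f}: {effective_access_time(h, 10, 100):.1f} ns")
```

Even under this simple model, the average access time moves close to the cache time once the hit ratio is high, which is the motivation for placing a small fast buffer between the CPU and the main memory.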
Multiprogramming and Time Sharing
Multiprogramming. Within the same time interval, there may be multiple processes active in a computer, competing for memory, I/O, and CPU resources. Some computer programs are CPU-bound (computation intensive), and some are I/O-bound (input-output intensive).
Time sharing. Multiprogramming on a uniprocessor is centered around the sharing of the CPU by many programs. Sometimes a high-priority program may occupy the CPU for too long to allow others to share. This problem can be overcome by using a time-sharing operating system.
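One way a time-sharing system keeps a single program from monopolizing the CPU is to hand out fixed time slices in turn. The sketch below uses round-robin scheduling as an illustration; the notes do not name a specific policy, so the policy, the function name, and the example workloads are all assumptions.

```python
from collections import deque


def round_robin(burst_times: dict[str, int], quantum: int) -> list[tuple[str, int]]:
    """Return the sequence of (program, time units run) slices.

    burst_times maps each program name to the total CPU time it needs
    (assumed to be positive). Each program runs for at most `quantum`
    units, then goes to the back of the queue if it still has work left.
    """
    if quantum <= 0:
        raise ValueError("quantum must be positive")
    queue = deque(burst_times.items())
    schedule = []
    while queue:
        name, remaining = queue.popleft()
        run = min(quantum, remaining)
        schedule.append((name, run))
        if remaining > run:
            queue.append((name, remaining - run))
    return schedule


if __name__ == "__main__":
    # Hypothetical workload: A needs 5 time units, B needs 2.
    print(round_robin({"A": 5, "B": 2}, quantum=2))
```

With a quantum of 2, the long program A is interrupted after two units so that B gets its turn, instead of B waiting for A to finish.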
1.3 PARALLEL COMPUTER STRUCTURES
Parallel computers are those systems that emphasize parallel processing. The basic architectural features of parallel computers are introduced below. We divide parallel computers into three architectural configurations:
• Pipeline computers
• Array processors
• Multiprocessor systems
A pipeline computer performs overlapped computations to exploit temporal parallelism. An array processor uses multiple synchronized arithmetic logic units to achieve spatial parallelism. A multiprocessor system achieves asynchronous parallelism through a set of interactive processors with shared resources (memories, database, etc.).
1.4 ARCHITECTURAL CLASSIFICATION SCHEMES
Three computer architectural classification schemes are presented in this section. Flynn's classification (1966) is based on the multiplicity of instruction streams and data streams in a computer system. Feng's scheme (1972) is based on serial versus parallel processing. Handler's classification (1977) is determined by the degree of parallelism and pipelining in various subsystem levels.
1.4.1 Multiplicity of Instruction-Data Streams
In general, digital computers may be classified into four categories, according to the multiplicity of instruction and data
streams. This scheme for classifying computer organizations was introduced by Michael J. Flynn.
Computer organizations are characterized by the multiplicity of the hardware provided to service the instruction and data streams. Listed below are Flynn's four machine organizations:
• Single instruction stream-single data stream (SISD)
• Single instruction stream-multiple data stream (SIMD)
• Multiple instruction stream-single data stream (MISD)
• Multiple instruction stream-multiple data stream (MIMD)
SISD computer organization. This organization, shown in Figure 1.16a, represents most serial computers available today. Instructions are executed sequentially but may be overlapped in their execution stages (pipelining). Most SISD uniprocessor systems are pipelined. An SISD computer may have more than one functional unit in it. All the functional units are under the supervision of one control unit.
SIMD computer organization. This class corresponds to array processors, introduced in Section 1.3.2. As illustrated in Figure 1.16b, there are multiple processing elements (PEs) supervised by the same control unit. All PEs receive the same instruction broadcast from the control unit but operate on different data sets from distinct data streams. The shared memory subsystem may contain multiple modules.
MISD computer organization. This organization is conceptually illustrated in Figure 1.16c. There are n processor units, each receiving distinct instructions operating over the same data stream and its derivatives. The results (output) of one processor become the input (operands) of the next processor in the micropipe. This structure has received much less attention and has been challenged as impractical by some computer architects. No real embodiment of this class exists.
MIMD computer organization. Most multiprocessor systems and multiple computer systems can be classified in this category (Figure 1.16d). An intrinsic MIMD computer implies interactions among the n processors because all memory streams are derived from the same data space shared by all processors. If the n data streams were derived from disjoint subspaces of the shared memories, then we would have the so-called multiple SISD (MSISD) operation, which is nothing but a set of n independent SISD uniprocessor systems.
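Flynn's scheme reduces to two yes/no questions: is there more than one instruction stream, and is there more than one data stream? The following sketch encodes that decision. The function name and the example stream counts are hypothetical and only illustrate the four categories described above.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Map stream multiplicities to one of SISD, SIMD, MISD, MIMD."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a computer needs at least one stream of each kind")
    instr = "S" if instruction_streams == 1 else "M"
    data = "S" if data_streams == 1 else "M"
    return f"{instr}I{data}D"


if __name__ == "__main__":
    # Illustrative stream counts, not measurements of real machines.
    examples = {
        "serial uniprocessor": (1, 1),
        "array processor with 64 PEs": (1, 64),
        "n processors sharing one data stream": (4, 1),
        "multiprocessor with 4 CPUs": (4, 4),
    }
    for name, (i, d) in examples.items():
        print(f"{name}: {flynn_class(i, d)}")
```

Note that the classification looks only at stream counts. It does not distinguish an intrinsic MIMD machine from the MSISD case, which depends on whether the data streams share one data space.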
1.4.2 Serial Versus Parallel Processing
Tse-yun Feng has suggested the use of the degree of parallelism to classify various computer architectures.
There are four types of processing methods that can be seen from this diagram:
• Word-serial and bit-serial (WSBS)
• Word-parallel and bit-serial (WPBS)
• Word-serial and bit-parallel (WSBP)
• Word-parallel and bit-parallel (WPBP)
Here n denotes the word length in bits and m denotes the bit-slice length, that is, the number of words processed in parallel. WSBS has been called bit-serial processing because one bit (n = m = 1) is processed at a time, a rather slow process. This was done only in the first-generation computers. WPBS (n = 1, m > 1) has been called bis (bit-slice) processing because an m-bit slice is processed at a time. WSBP (n > 1, m = 1), as found in most existing computers, has been called word-slice processing because one word of n bits is processed at a time. Finally, WPBP (n > 1, m > 1) is known as fully parallel processing (or simply parallel processing, if no confusion exists), in which an array of n · m bits is processed at one time, the fastest processing mode of the four. In Table 1.4, we have listed a number of computer systems under each processing mode. The system parameters n, m are also shown for each system. The bit-slice processors, like STARAN, MPP, and DAP, all have long bit slices. Illiac-IV and PEPE are two word-slice array processors.
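The four modes follow mechanically from whether n and m exceed 1, and the number of bits handled at one time is the product n · m. A minimal sketch of that rule follows. The function name and the sample (n, m) pairs are hypothetical and are not taken from Table 1.4.

```python
def feng_class(n: int, m: int) -> tuple[str, int]:
    """Classify a machine by word length n and bit-slice length m.

    Returns the mode name and the number of bits processed at one time
    (n * m), following the definitions in the text:
      WSBS: n = 1, m = 1    WPBS: n = 1, m > 1
      WSBP: n > 1, m = 1    WPBP: n > 1, m > 1
    """
    if n < 1 or m < 1:
        raise ValueError("n and m must be at least 1")
    word_part = "WP" if m > 1 else "WS"   # several words at once?
    bit_part = "BP" if n > 1 else "BS"    # several bits of a word at once?
    return word_part + bit_part, n * m


if __name__ == "__main__":
    # Illustrative parameter pairs only.
    for n, m in [(1, 1), (1, 256), (32, 1), (16, 16)]:
        mode, bits = feng_class(n, m)
        print(f"n={n:3d} m={m:3d} -> {mode}, {bits} bits at a time")
```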
1.4.3 Parallelism Versus Pipelining
Wolfgang Handler has proposed a classification scheme for identifying the parallelism degree and pipelining degree built into the hardware structures of a computer system. He considers parallel-pipeline processing at three subsystem levels:
• Processor control unit (PCU)
• Arithmetic logic unit (ALU)
• Bit-level circuit (BLC)
The functions of PCU and ALU should be clear to us. Each PCU corresponds to one processor or one CPU. The ALU is equivalent to the processing element (PE) we specified for SIMD array processors. The BLC corresponds to the combinational logic circuitry needed to perform 1-bit operations in the ALU. A computer system C can be characterized by a triple containing six independent entities, as defined below:
T(C) = <K x K', D x D', W x W'>    (1.13)
where
K = the number of processors (PCUs) within the computer
D = the number of ALUs (or PEs) under the control of one PCU
W = the word length of an ALU or of a PE
W' = the number of pipeline stages in all ALUs or in a PE
D' = the number of ALUs that can be pipelined (pipeline chaining to be described in Chapter 4)
K' = the number of PCUs that can be pipelined
Several real computer examples are used to clarify the above parametric descriptions. The Texas Instruments Advanced Scientific Computer (TI-ASC) has one controller controlling four arithmetic pipelines, each with a 64-bit word length and eight stages. Thus, we have
T(ASC) = <1 x 1, 4 x 1, 64 x 8> = <1, 4, 64 x 8>
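The triple notation can be captured in a small data structure. The sketch below is one possible encoding: the class name and field names are hypothetical, and the printing rule (omit a factor of 1, as in the ASC example) is inferred from that single example rather than from a stated convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandlerTriple:
    """Handler's T(C) = <K x K', D x D', W x W'> for a computer C."""
    k: int            # K: number of PCUs
    d: int            # D: number of ALUs (PEs) per PCU
    w: int            # W: word length of an ALU or PE
    k_prime: int = 1  # K': number of PCUs that can be pipelined
    d_prime: int = 1  # D': number of ALUs that can be pipelined
    w_prime: int = 1  # W': number of pipeline stages in an ALU or PE

    def __str__(self) -> str:
        def pair(value: int, pipelined: int) -> str:
            # A pipelining factor of 1 is dropped, as in <1, 4, 64 x 8>.
            return f"{value}" if pipelined == 1 else f"{value} x {pipelined}"

        return (f"<{pair(self.k, self.k_prime)}, "
                f"{pair(self.d, self.d_prime)}, "
                f"{pair(self.w, self.w_prime)}>")


if __name__ == "__main__":
    # TI-ASC as described in the text: one controller, four arithmetic
    # pipelines, 64-bit words, eight pipeline stages.
    asc = HandlerTriple(k=1, d=4, w=64, w_prime=8)
    print(f"T(ASC) = {asc}")
```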
Amdahl's law
In computer architecture, Amdahl's law (or Amdahl's argument) is a formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved. It is named after computer scientist Gene Amdahl, and was presented at the AFIPS Spring Joint Computer Conference in 1967.
Amdahl's law is often used in parallel computing to predict the theoretical speedup when using multiple processors.
Amdahl's law applies only to the cases where the problem size is fixed.
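In its usual form, the law states that if a fraction p of the execution time benefits from an improvement that speeds that part up by a factor s, the overall speedup is S = 1 / ((1 - p) + p / s). The sketch below evaluates this for the multiprocessor case, taking s to be the number of processors. The function names and the sample fraction of 0.95 are illustrative assumptions.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Theoretical speedup for a fixed-size problem.

    parallel_fraction: share of the original run time that can be spread
                       evenly over the processors (0 <= p <= 1).
    processors:        number of processors applied to that share.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def amdahl_limit(parallel_fraction: float) -> float:
    """Upper bound on speedup as the processor count grows without limit."""
    if parallel_fraction >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    p = 0.95  # hypothetical: 95% of the work is parallelizable
    for n in (1, 2, 8, 64, 1024):
        print(f"{n:5d} processors: speedup {amdahl_speedup(p, n):6.2f}")
    print(f"limit: {amdahl_limit(p):.1f}")
```

The printed values level off well below the processor count because the serial fraction stays fixed. This is the practical consequence of the fixed problem size noted above.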
Moore’s Law
The quest for higher-performance digital computers seems unending. In the past two decades, the performance of microprocessors has enjoyed an exponential growth. The growth of microprocessor speed/performance by a factor of 2 every 18 months (or about 60% per year) is known as Moore’s law. This growth is the result of a combination of two factors:
1. Increase in complexity (related both to higher device density and to larger size) of VLSI chips, projected to rise to around 10M transistors per chip for microprocessors, and 1B for dynamic random-access memories (DRAMs), by the year 2000 [SIA94]
2. Introduction of, and improvements in, architectural features such as on-chip cache memories, large instruction buffers, multiple instruction issue per cycle, multithreading, deep pipelines, out-of-order instruction execution, and branch prediction
Moore’s law was originally formulated in 1965 in terms of the doubling of chip complexity every year (later revised to every 18 months) based only on a small number of data points [Scha97]. Moore’s revised prediction matches almost perfectly the actual increases in the number of transistors in DRAM and microprocessor chips.
Moore’s law seems to hold regardless of how one measures processor performance: counting the number of executed instructions per second (IPS), counting the number of floating-point operations per second (FLOPS), or using sophisticated benchmark suites that attempt to measure the processor's performance on real applications. This is because all of these measures, though numerically different, tend to rise at roughly the same rate. Figure 1.1 shows that the performance of actual processors has in fact followed Moore’s law quite closely since 1980 and is on the verge of reaching the GIPS (giga IPS = 10^9 IPS) milestone.
Figure 1.1. The exponential growth of microprocessor performance, known as Moore’s law, shown over the past two decades.
PRINCIPLES OF SCALABLE PERFORMANCE
1. Performance Metrics and Measures
1.1. Parallelism Profile in Programs
1.1.1. Degree of Parallelism
The number of processors used at any instant to execute a program is called the degree of parallelism (DOP); this can vary over time.
DOP assumes an infinite number of processors are available; this is not achievable in real machines, so some parallel
program segments must be executed sequentially as smaller parallel segments. Other resources may impose limiting
conditions.
A plot of DOP vs. time is called a parallelism profile.
1.1.2. Average Parallelism - 1
Assume the following:
n homogeneous processors
maximum parallelism in a profile is m
Ideally, n >> m
D, the computing capacity of a processor, is something like MIPS or Mflops, without regard for memory latency, etc.
i is the number of processors busy in an observation period (i.e., DOP = i)
W is the total work (instructions or computations)
performed by a program
A is the average parallelism in the program
1.1.3. Average Parallelism – 2
1.1.4. Average Parallelism – 3
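As a minimal sketch of how the quantities defined above fit together, the snippet below computes the total work W and the average parallelism A from a discrete parallelism profile. It assumes the usual discrete form A = (sum of i × t_i) / (sum of t_i), where t_i is the time spent at DOP = i; the function names and the sample profile are illustrative, not taken from these notes.

```python
def total_work(profile, capacity=1.0):
    """Work W = D * sum(i * t_i), where D is the per-processor computing capacity.

    profile is a list of (dop, duration) pairs describing a parallelism profile.
    """
    return capacity * sum(i * t for i, t in profile)


def average_parallelism(profile):
    """Time-weighted mean DOP: A = sum(i * t_i) / sum(t_i)."""
    total_time = sum(t for _, t in profile)
    if total_time <= 0:
        raise ValueError("profile must cover a positive time span")
    return sum(i * t for i, t in profile) / total_time


# Hypothetical profile: DOP 1 for 2 time units, DOP 4 for 3 units, DOP 2 for 5 units.
example_profile = [(1, 2.0), (4, 3.0), (2, 5.0)]
```

For this made-up profile the program is busy for 10 time units and performs 24 units of work, so the average parallelism is 2.4 even though the maximum DOP is 4.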
1.1.5. Available Parallelism
Various studies have shown that the potential parallelism in scientific and engineering calculations can be very
high (e.g. hundreds or thousands of instructions per clock cycle).
But in real machines, the actual parallelism is much smaller (e.g. 10 or 20).
1.1.6. Basic Blocks
A basic block is a sequence or block of instructions with one entry and one exit.
Basic blocks are frequently used as the focus of optimizers in compilers (since it's easier to manage the use of
registers utilized in the block).
Limiting optimization to basic blocks limits the instruction level parallelism that can be obtained (to about 2 to 5 in
typical code).
1.1.7. Asymptotic Speedup – 1
1.1.8. Asymptotic Speedup – 2
1.2. Mean Performance
We seek to obtain a measure that characterizes the mean, or average, performance of a set of
benchmark programs with potentially many different execution modes (e.g. scalar, vector, sequential, parallel).
We may also wish to associate weights with these programs to emphasize these different modes and yield a more
meaningful performance measure.
1.2.1. Arithmetic Mean
The arithmetic mean is familiar (sum of the terms divided by the number of terms).
Our measures will use execution rates expressed in MIPS or Mflops.
The arithmetic mean of a set of execution rates is proportional to the sum of the inverses of the execution times; it
is not inversely proportional to the sum of the execution times.
Thus arithmetic mean fails to represent real times consumed by the benchmarks when executed.
1.2.2. Harmonic Mean
Instead of using the arithmetic or geometric mean, we use the harmonic mean execution rate,
which is just the inverse of the arithmetic mean of the execution time (thus guaranteeing
the inverse relation not exhibited by the other means).
1.2.3. Weighted Harmonic Mean
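If we associate weights fi (summing to 1) with the benchmarks, where benchmark i runs at execution rate Ri, then we can compute the weighted harmonic
mean:
$$R_h^* = \frac{1}{\sum_{i} f_i / R_i}$$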
1.2.4. Weighted Harmonic Mean Speedup
T1 = 1/R1 = 1 is the sequential execution time on a
single processor with rate R1 = 1.
Ti = 1/Ri = 1/i is the execution time using i
processors with a combined execution rate of Ri = i.
Now suppose a program has n execution
modes with associated weights f1 … fn. The weighted
harmonic mean speedup is defined as:
$$S = \frac{T_1}{T^*} = \frac{1}{\sum_{i=1}^{n} f_i / R_i}$$
1.2.5. Amdahl’s Law
Assume Ri = i, and the weights are w = (a, 0, …, 0, 1 − a).
Basically this means the system is used sequentially (with probability a) or
all n processors are used (with probability 1 − a).
This yields the speedup equation known as Amdahl’s law:
$$S_n = \frac{1}{a + (1 - a)/n} = \frac{n}{1 + (n - 1)a}$$
The implication is that the best speedup possible is 1/a, regardless of n, the number of processors.
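The derivation above can be checked numerically. The sketch below evaluates the weighted harmonic mean speedup for arbitrary weights and compares it with the closed form n / (1 + (n − 1)a) obtained from the weights (a, 0, …, 0, 1 − a). The function names are illustrative assumptions, not part of the original notes.

```python
def weighted_harmonic_speedup(weights, rates):
    """S = 1 / sum(f_i / R_i) for execution modes with weights f_i and rates R_i."""
    if len(weights) != len(rates):
        raise ValueError("weights and rates must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return 1.0 / sum(f / r for f, r in zip(weights, rates))


def amdahl_speedup(a, n):
    """Closed form for weights (a, 0, ..., 0, 1 - a) and rates R_i = i."""
    return n / (1.0 + (n - 1) * a)


def amdahl_weights(a, n):
    """Weight vector with sequential fraction a and fully parallel fraction 1 - a (n >= 2)."""
    return [a] + [0.0] * (n - 2) + [1.0 - a]
```

With a = 0.25 and n = 8, both routes give a speedup of about 2.91, and letting n grow very large pushes the value toward the bound 1/a = 4 without ever exceeding it.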
1.3. Efficiency, Utilizations, and Quality
1.3.1. System Efficiency – 1
Assume the following definitions:
O (n) = total number of “unit operations” performed by an n processor system in completing a program P.
T (n) = execution time required to execute the program P on an n processor system.
O (n) can be considered similar to the total number of instructions executed by the n processors, perhaps scaled by
a constant factor.
If we define O (1) = T (1), then it is logical to expect that T (n) < O (n) when n > 1 if the program P is able to make
any use at all of the extra processor(s).
1.3.2. System Efficiency – 2
Clearly, the speedup factor (how much faster the program runs with n processors) can now be expressed as
S (n) = T (1) / T (n)
Recall that we expect T (n) < T (1), so S (n) ≥ 1.
System efficiency is defined as
E (n) = S (n) / n = T (1) / ( n × T (n) )
It indicates the actual degree of speedup achieved in a system as compared with the maximum possible speedup.
Thus 1 / n ≤ E (n) ≤ 1. The value is 1/n when only one
processor is used (regardless of n), and the value is 1 when all processors are fully utilized.
1.3.3. Redundancy
The redundancy in a parallel computation is defined as
R (n) = O (n) / O (1)
What values can R (n) obtain?
R (n) = 1 when O (n) = O (1), or when the number of operations performed is independent of the number of
processors, n. This is the ideal case.
R (n) = n when all processors perform the same number of operations as when only a single processor is used; this
implies that n completely redundant computations are performed!
The R (n) figure indicates to what extent the software parallelism is carried over to the hardware implementation
without having extra operations performed.
1.3.4. System Utilization
System utilization is defined as
U (n) = R (n) × E (n) = O (n) / ( n × T (n) )
It indicates the degree to which the system resources were kept busy during execution of the
program. Since 1 ≤ R (n) ≤ n, and 1 / n ≤ E (n) ≤
1, the best possible value for U (n) is 1, and the
worst is 1 / n.
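The four measures defined in this section can be computed together from T (1), T (n), O (1) and O (n). The following is a small sketch under the convention stated earlier that O (1) = T (1); the function name and the sample figures are hypothetical.

```python
def performance_metrics(n, t1, tn, o1, on):
    """Return speedup S(n), efficiency E(n), redundancy R(n) and utilization U(n)."""
    speedup = t1 / tn               # S(n) = T(1) / T(n)
    efficiency = speedup / n        # E(n) = S(n) / n
    redundancy = on / o1            # R(n) = O(n) / O(1)
    utilization = redundancy * efficiency  # U(n) = R(n) * E(n)
    return {
        "speedup": speedup,
        "efficiency": efficiency,
        "redundancy": redundancy,
        "utilization": utilization,
    }


# Hypothetical measurements for a 4-processor run, with O(1) = T(1) = 100.
example_metrics = performance_metrics(n=4, t1=100.0, tn=40.0, o1=100.0, on=120.0)
```

In this made-up run the program is 2.5 times faster on four processors, performs 20% extra operations, and keeps the machine 75% busy, which matches O (n) / (n × T (n)) = 120 / 160.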
SPEEDUP PERFORMANCE LAWS
The main objective is to produce the results as early as possible. In other words minimal turnaround time is the
primary goal.
Three performance laws are defined below:
1. Amdahl’s Law (1967) is based on a fixed workload or fixed problem size.
2. Gustafson’s Law (1987) is applied to scalable problems, where the problem size increases with the increase in
machine size.
3. The speedup model by Sun and Ni (1993) is for scaled problems bounded by memory capacity.
Amdahl’s Law for fixed workload
In many practical applications the computational workload is often fixed with a fixed problem size. As the number
of processors increases, the fixed workload is distributed.
Speedup obtained for time-critical applications is called fixed-load speedup.
Fixed-Load Speedup
The ideal speedup formula is based on a fixed workload, regardless of machine size.
We consider below two cases, DOP < n and DOP ≥ n.
Parallel algorithm
In computer science, a parallel algorithm, as opposed to a traditional serial algorithm, is an algorithm which can
be executed a piece at a time on many different processing devices, and then combined together again at the end
to get the correct result.[1]
Many parallel algorithms are executed concurrently – though in general concurrent algorithms are a distinct
concept – and thus these concepts are often conflated, with which aspect of an algorithm is parallel and which is
concurrent not being clearly distinguished. Further, non-parallel, non-concurrent algorithms are often referred to
as "sequential algorithms", by contrast with concurrent algorithms.
Examples of Parallel Algorithms
This section describes and analyzes several parallel algorithms. These algorithms provide examples of how to analyze algorithms in terms of work
and depth and of how to use nested data-parallel constructs. They also introduce some important ideas concerning parallel algorithms. We mention
again that the main goals are to have the code closely match the high-level intuition of the algorithm, and to make it easy to analyze the asymptotic
performance from the code.
Parallel Algorithm Complexity
Analysis of an algorithm helps us determine whether the algorithm is useful or not. Generally, an algorithm is analyzed based on its
execution time (Time Complexity) and the amount of space (Space Complexity) it requires.
Since sophisticated memory devices are available at reasonable cost, storage space is often less of a constraint than execution time. Hence, space
complexity is given less importance.
Parallel algorithms are designed to improve the computation speed of a computer. For analyzing a parallel algorithm, we normally
consider the following parameters −
Time complexity (execution time),
Total number of processors used, and
Total cost.
Time Complexity
The main reason behind developing parallel algorithms was to reduce the computation time of an algorithm. Thus, evaluating the
execution time of an algorithm is extremely important in analyzing its efficiency.
Execution time is measured on the basis of the time taken by the algorithm to solve a problem. The total execution time is calculated
from the moment when the algorithm starts executing to the moment it stops. If all the processors do not start or end execution at the
same time, then the total execution time of the algorithm is measured from the moment when the first processor starts its execution to the moment
when the last processor stops its execution.
Time complexity of an algorithm can be classified into three categories −
Worst-case complexity − When the amount of time required by an algorithm for a given input is maximum.
Average-case complexity − When the amount of time required by an algorithm for a given input is average.
Best-case complexity − When the amount of time required by an algorithm for a given input is minimum.
Asymptotic Analysis
The complexity or efficiency of an algorithm is the number of steps executed by the algorithm to get the desired output. Asymptotic
analysis is done to calculate the complexity of an algorithm in its theoretical analysis. In asymptotic analysis, a large length of input
is used to calculate the complexity function of the algorithm.
Note − Asymptotic describes a condition where a line tends to meet a curve, but the two do not intersect. Here the line and the curve are
asymptotic to each other.
Asymptotic notation is the easiest way to describe the fastest and slowest possible execution time for an algorithm using high bounds
and low bounds on speed. For this, we use the following notations −
Big O notation
Omega notation
Theta notation
Big O notation
In mathematics, Big O notation is used to represent the asymptotic characteristics of functions. It represents the behavior of a function
for large inputs in a simple and accurate way. It is a method of representing the upper bound of an algorithm’s execution time. It
represents the longest amount of time that the algorithm could take to complete its execution. The function −
f(n) = O(g(n))
iff there exist positive constants c and n0 such that f(n) ≤ c * g(n) for all n where n ≥ n0.
Omega notation
Omega notation is a method of representing the lower bound of an algorithm’s execution time. The function −
f(n) = Ω(g(n))
iff there exist positive constants c and n0 such that f(n) ≥ c * g(n) for all n where n ≥ n0.
Theta Notation
Theta notation is a method of representing both the lower bound and the upper bound of an algorithm’s execution time. The function −
f(n) = Θ(g(n))
iff there exist positive constants c1, c2, and n0 such that c1 * g(n) ≤ f(n) ≤ c2 * g(n) for all n where n ≥ n0.
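To make the role of the constants c and n0 concrete, the sketch below tests candidate witnesses for the three definitions over a finite range of n. This is only a sanity check under an assumed cut-off n_max, not a proof, since the definitions quantify over all n ≥ n0; the helper names and the sample functions f and g are illustrative.

```python
def holds_upper_bound(f, g, c, n0, n_max=10_000):
    """Check f(n) <= c * g(n) for n0 <= n <= n_max (finite evidence for f = O(g))."""
    return all(f(n) <= c * g(n) for n in range(n0, n_max + 1))


def holds_lower_bound(f, g, c, n0, n_max=10_000):
    """Check f(n) >= c * g(n) for n0 <= n <= n_max (finite evidence for f = Omega(g))."""
    return all(f(n) >= c * g(n) for n in range(n0, n_max + 1))


def holds_theta(f, g, c1, c2, n0, n_max=10_000):
    """Check c1 * g(n) <= f(n) <= c2 * g(n) for n0 <= n <= n_max."""
    return (holds_lower_bound(f, g, c1, n0, n_max)
            and holds_upper_bound(f, g, c2, n0, n_max))


def f(n):
    return 3 * n * n + 5 * n   # sample running-time function


def g(n):
    return n * n               # candidate growth rate
```

For these sample functions, c = 4 works as an upper-bound constant only from n0 = 5 onward (because 5n ≤ n² requires n ≥ 5), while c = 3 is a valid lower-bound constant from n0 = 1.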
Speedup of an Algorithm
The performance of a parallel algorithm is determined by calculating its speedup. Speedup is defined as the ratio of the worst-case
execution time of the fastest known sequential algorithm for a particular problem to the worst-case execution time of the parallel
algorithm.
speedup = Worst case execution time of the fastest known sequential algorithm for a particular problem / Worst case execution time of the parallel
algorithm
Number of Processors Used
The number of processors used is an important factor in analyzing the efficiency of a parallel algorithm. The cost to buy, maintain,
and run the computers is taken into account. The larger the number of processors used by an algorithm to solve a problem, the more costly
the obtained result becomes.
Total Cost
Total cost of a parallel algorithm is the product of time complexity and the number of processors used in that particular algorithm.
Total Cost = Time complexity × Number of processors used
Therefore, the efficiency of a parallel algorithm is –
Efficiency = Worst case execution time of sequential algorithm / (Number of processors used × Worst case execution time of the parallel algorithm)
Models of Parallel Processing
Parallel processors come in many different varieties.
1. SIMD VERSUS MIMD ARCHITECTURES
Within the SIMD category, two fundamental design choices exist:
1. Synchronous versus asynchronous SIMD. In a SIMD machine, each processor can execute or ignore the instruction being broadcast
based on its local state or data-dependent conditions. However, this leads to some inefficiency in executing conditional computations.
For example, an “if-then-else” statement is executed by first enabling the processors for which the condition is satisfied and then flipping
the “enable” bit before getting into the “else” part. On the average, half of the processors will be idle for each branch. The situation is
even worse for “case” statements involving multiway branches. A possible cure is to use the asynchronous version of SIMD, known as
SPMD (spim-dee or single-program, multiple data), where each processor runs its own copy of the common program. The advantage of
SPMD is that in an “if-then-else” computation, each processor will only spend time on the relevant branch. The disadvantages include
the need for occasional synchronization and the higher complexity of each processor, which must now have a program memory and
instruction fetch/decode logic.
2. Custom- versus commodity-chip SIMD. A SIMD machine can be designed based on commodity (off-the-shelf) components or with
custom chips. In the first approach, components tend to be inexpensive because of mass production. However, such general-purpose
components will likely contain elements that may not be needed for a particular design. These extra components may complicate the
design, manufacture, and testing of the SIMD machine and may introduce speed penalties as well. Custom components (including ASICs
= application-specific ICs, multichip modules, or WSI = wafer-scale integrated circuits) generally offer better performance but lead to
much higher cost in view of their development costs being borne by a relatively small number of parallel machine users (as opposed to
commodity microprocessors that are produced in millions). As integrating multiple processors along with ample memory on a single
VLSI chip becomes feasible, a type of convergence between the two approaches appears imminent.
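The inefficiency of conditional execution on a synchronous SIMD machine, described under the first design choice above, can be illustrated with a small simulation. The sketch below models each processing element as one list slot and runs the “then” and “else” parts as two broadcast passes controlled by enable bits; the sequential Python loops merely stand in for lock-step hardware, and all names are illustrative.

```python
def simd_if_then_else(data, condition, then_op, else_op):
    """Simulate a masked if-then-else; return (results, fraction of busy PE slots)."""
    if not data:
        return [], 0.0
    enable = [condition(x) for x in data]   # one enable bit per processing element
    result = list(data)

    # Pass 1: the "then" instructions are broadcast; only enabled PEs execute them.
    for pe, x in enumerate(data):
        if enable[pe]:
            result[pe] = then_op(x)
    busy_slots = sum(enable)

    # Flip the enable bits, then broadcast the "else" instructions.
    enable = [not bit for bit in enable]
    for pe, x in enumerate(data):
        if enable[pe]:
            result[pe] = else_op(x)
    busy_slots += sum(enable)

    # Two passes over len(data) PEs were issued, but each PE worked in only one.
    utilization = busy_slots / (2 * len(data))
    return result, utilization
```

Because every processing element is enabled in exactly one of the two passes, the utilization in this model is always one half, whatever the data; under SPMD each processor would instead spend time only on its own branch.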
Within the MIMD class, three fundamental issues or design choices are subjects of ongoing debates in the research community:
1. MPP: massively or moderately parallel processor. Is it more cost-effective to build a parallel processor out of a relatively small
number of powerful processors or a massive number of very simple processors (the “herd of elephants” or the “army of ants”
approach)? Referring to Amdahl’s law, the first choice does better on the inherently sequential part of a computation while the
second approach might allow a higher speed-up for the parallelizable part. A general answer cannot be given to this question, as
the best choice is both application- and technology-dependent.
2. Tightly versus loosely coupled MIMD. Which is a better approach to high-performance computing, that of using specially
designed multiprocessors/multicomputers or a collection of ordinary workstations that are interconnected by commodity networks
(such as Ethernet or ATM) and whose interactions are coordinated by special system software and distributed file systems? The
latter choice, sometimes referred to as network of workstations (NOW) or cluster computing, has been gaining popularity in
recent years. However, many open problems exist for taking full advantage of such network-based loosely coupled architectures.
The hardware, system software, and applications aspects of NOWs are being investigated by numerous research groups.
3. Explicit message passing versus virtual shared memory. Which scheme is better, that of forcing the users to explicitly specify all
messages that must be sent between processors or to allow them to program in an abstract higher-level model, with the required
messages automatically generated by the system software? This question is essentially very similar to the one asked in the early
days of high-level languages and virtual memory. At some point in the past, programming in assembly languages and doing
explicit transfers between secondary and primary memories could lead to higher efficiency. However, nowadays, software is so
complex and compilers and operating systems so advanced (not to mention processing power so cheap) that it no longer makes
sense to hand-optimize the programs, except in limited time-critical instances. However, we are not yet at that point in parallel
processing, and hiding the explicit communication structure of a parallel machine from the programmer has nontrivial
consequences for performance.
THE PRAM SHARED-MEMORY MODEL
The theoretical model used for conventional or sequential computers (SISD class) is
known as the random-access machine (RAM) (not to be confused with random-access
memory, which has the same acronym). The parallel version of RAM [PRAM (pea-ram)]
constitutes an abstract model of the class of global-memory parallel processors. The
abstraction consists of ignoring the details of the processor-to-memory interconnection
network and taking the view that each processor can access any memory location in each
machine cycle, independent of what other processors are doing.
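A rough way to get a feel for the PRAM abstraction is to simulate one of its classic uses, a tree-style summation in which all active processors read shared memory and then write it within one machine cycle. The sketch below does this sequentially in Python; the two-phase read/write structure, the exclusive-read exclusive-write access pattern, and the function name are assumptions made for illustration and are not described in these notes.

```python
def pram_sum(values):
    """Sum values on a simulated PRAM; return (total, number of machine cycles)."""
    cells = list(values)        # the shared global memory
    n = len(cells)
    cycles = 0
    stride = 1
    while stride < n:
        # Read phase: each active processor i reads cells[i] and cells[i + stride].
        # No two processors touch the same cell, so access is exclusive.
        reads = {
            i: (cells[i], cells[i + stride])
            for i in range(0, n - stride, 2 * stride)
        }
        # Write phase: all processors store their partial sums "simultaneously".
        for i, (left, right) in reads.items():
            cells[i] = left + right
        stride *= 2
        cycles += 1
    total = cells[0] if cells else 0
    return total, cycles
```

With eight values the simulated machine finishes in three cycles instead of the seven additions a sequential RAM would perform one after another, which is exactly the kind of statement the PRAM model lets us make without reference to any interconnection network.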
DISTRIBUTED-MEMORY OR GRAPH MODELS
This network is usually represented as a graph, with vertices corresponding to processor–memory nodes and edges corresponding to
communication links. If communication links are unidirectional, then directed edges are used. Undirected edges imply bidirectional
communication, although not necessarily in both directions at once. Important parameters of an interconnection network include
1. Network diameter: the longest of the shortest paths between various pairs of nodes, which should be relatively small if network latency is to be minimized. The network diameter is more important with store-and-forward routing (when a message is stored in its entirety and retransmitted by intermediate nodes) than with wormhole routing (when a message is quickly relayed through a node in small pieces).
2. Bisection (band)width: the smallest number (total capacity) of links that need to be cut in order to divide the network into two subnetworks of half the size. This is important when nodes communicate with each other in a random fashion. A small bisection (band)width limits the rate of data transfer between the two halves of the network, thus affecting the performance of communication-intensive algorithms.
3. Vertex or node degree: the number of communication ports required of each node, which should be a constant independent of network size if the architecture is to be readily scalable to larger sizes. The node degree has a direct effect on the cost of each node, with the effect being more significant for parallel ports containing several wires or when the node is required to communicate over all of its ports at once.
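The three parameters above can be computed directly for small networks. The sketch below represents a network as an adjacency dictionary and evaluates diameter (by breadth-first search), node degree, and bisection width (by brute force over all equal halves, which is only feasible for a handful of nodes). It assumes a connected, undirected network with an even number of nodes and unit-capacity links; the ring and hypercube builders and all function names are illustrative.

```python
from collections import deque
from itertools import combinations


def ring(n):
    """n-node ring: node i is linked to its two neighbours."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def hypercube(d):
    """d-dimensional hypercube: nodes differing in one address bit are linked."""
    return {i: {i ^ (1 << b) for b in range(d)} for i in range(1 << d)}


def diameter(adj):
    """Longest of the shortest paths, found by a BFS from every node."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        frontier = deque([src])
        while frontier:
            u = frontier.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    frontier.append(v)
        worst = max(worst, max(dist.values()))
    return worst


def node_degree(adj):
    """Largest number of links (ports) at any node."""
    return max(len(neighbours) for neighbours in adj.values())


def bisection_width(adj):
    """Fewest links cut by any split into two halves (exhaustive search)."""
    nodes = sorted(adj)
    best = None
    for part in combinations(nodes, len(nodes) // 2):
        side = set(part)
        cut = sum(1 for u in side for v in adj[u] if v not in side)
        if best is None or cut < best:
            best = cut
    return best
```

For eight nodes, the ring has diameter 4, degree 2 and bisection width 2, whereas the 3-dimensional hypercube has diameter 3, degree 3 and bisection width 4: the hypercube buys lower latency and more cross-network bandwidth at the price of a node degree that grows with network size.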
CIRCUIT MODEL AND PHYSICAL REALIZATIONS
In a sense, the only sure way to predict the performance of a parallel architecture on a
given set of problems is to actually build the machine and run the programs on it. Because
this is often impossible or very costly, the next best thing is to model the machine at the
circuit level, so that all computational and signal propagation delays can be taken into
account. Unfortunately, this is also impossible for a complex supercomputer, both because
generating and debugging detailed circuit specifications are not much easier than a
full-blown implementation and because a circuit simulator would take eons to run the simulation.
Despite the above observations, we can produce and evaluate circuit-level designs for
specific applications.
GLOBAL VERSUS DISTRIBUTED MEMORY
Within the MIMD class of parallel processors, memory can be global or distributed.
Global memory may be visualized as being in a central location where all processors can
access it with equal ease (or with equal difficulty, if you are a half-empty-glass type of
person). Figure 4.3 shows a possible hardware organization for a global-memory parallel
processor. Processors can access memory through a special processor-to-memory network.
A global-memory multiprocessor is characterized by the type and number p of processors,
the capacity and number m of memory modules, and the network architecture. Even though
p and m are independent parameters, achieving high performance typically requires that they
be comparable in magnitude (e.g., too few memory modules will cause contention among
the processors and too many would complicate the network design).
Distributed-memory architectures can be conceptually viewed as in Fig. 4.5. A collection of p processors, each with its own private memory,
communicates through an interconnection network. Here, the latency of the
interconnection network may be less critical, as each processor is likely to
access its own local memory most of the time. However, the communication
bandwidth of the network may or may not be critical, depending on the type of
parallel applications and the extent of task interdependencies. Note that each
processor is usually connected to the network through multiple links or channels
(this is the norm here, although it can also be the case for shared-memory
parallel processors).
Cache coherence
In computer architecture, cache coherence is the uniformity
of shared resource data that ends up stored in multiple local
caches. When clients in a system maintain caches of a
common memory resource, problems may arise with
incoherent data, which is particularly the case with CPUs in
a multiprocessing system.
In the illustration on the right, suppose both clients have
a cached copy of a particular memory block from a previous
read. If the client on the bottom updates
that memory block, the client on the top could be left with an
invalid cached copy without any notification of the
change. Cache coherence is intended to manage such
conflicts by maintaining a coherent view of the data values in
multiple caches.
The following are the requirements for cache coherence:[2]
Write Propagation
Changes to the data in any cache must be propagated to other copies (of that cache line) in the peer
caches.
Transaction Serialization
Reads/Writes to a single memory location must be seen by all processors in the same order.
Coherence protocols
Coherence Protocols apply cache coherence in multiprocessor systems. The intention is that two clients must
never see different values of the same shared data.
The protocol must implement the basic requirements for coherence. It can be tailor made for the target
system/application.
Protocols can also be classified as snooping (snoopy/broadcast) or directory-based. Typically, early systems
used directory-based protocols, where a directory would keep track of the data being shared and the sharers. In
snoopy protocols, the transaction requests (read/write/upgrade) are sent out to all processors. All processors
snoop the request and respond appropriately.
Write Propagation in Snoopy protocols can be implemented by either of the following:
Write Invalidate
When a write operation is observed to a location that a cache has a copy of, the cache controller
invalidates its own copy of the snooped memory location, thus forcing a read from main memory of the
new value on its next access.[4]
Write Update
When a write operation is observed to a location that a cache has a copy of, the cache controller updates
its own copy of the snooped memory location with the new data.
If the protocol design states that whenever any copy of the shared data is changed, all the other copies
must be "updated" to reflect the change, then it is a write update protocol. If the design states that a
write to a cached copy by any processor requires other processors to discard/invalidate their cached
copies, then it is a write invalidate protocol.
However, scalability is one shortcoming of broadcast protocols.
Various models and protocols have been devised for maintaining coherence.
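The difference between the two write-propagation policies can be seen in a toy model. The sketch below is a deliberately simplified snoopy bus, assuming write-through caches so that main memory always holds the latest value, and ignoring cache-line states, capacity and timing; the class name and its fields are hypothetical and do not correspond to any real protocol implementation.

```python
class SnoopyBus:
    """Toy snoopy bus over per-CPU caches; policy is 'invalidate' or 'update'."""

    def __init__(self, n_caches, policy):
        if policy not in ("invalidate", "update"):
            raise ValueError("policy must be 'invalidate' or 'update'")
        self.policy = policy
        self.memory = {}                        # main memory: address -> value
        self.caches = [dict() for _ in range(n_caches)]
        self.memory_reads = 0                   # counts cache misses served by memory

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache:                   # miss: fetch the block from memory
            self.memory_reads += 1
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, cpu, addr, value):
        self.caches[cpu][addr] = value
        self.memory[addr] = value               # write-through (simplifying assumption)
        # Every other cache snoops the write and reacts according to the policy.
        for other, cache in enumerate(self.caches):
            if other != cpu and addr in cache:
                if self.policy == "invalidate":
                    del cache[addr]             # next read will miss and refetch
                else:
                    cache[addr] = value         # copy is refreshed in place
```

Under write invalidate the reader pays an extra memory access after each remote write, while under write update it keeps hitting in its own cache at the cost of more data on the bus; in both cases no processor can observe the stale value.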
Parallel Algorithm - Models
The model of a parallel algorithm is developed by considering a strategy for dividing the
data and processing method and applying a suitable strategy to reduce interactions. In
this chapter, we will discuss the following Parallel Algorithm Models −
Data parallel model
Task graph model
Work pool model
Master slave model
Producer consumer or pipeline model
Hybrid model
Data Parallel
In the data-parallel model, tasks are assigned to processes and each task performs similar
types of operations on different data. Data parallelism is a consequence of a single
operation being applied to multiple data items.
The data-parallel model can be applied to shared-address-space and message-passing
paradigms. In the data-parallel model, interaction overheads can be reduced by selecting a
locality-preserving decomposition, by using optimized collective interaction routines, or
by overlapping computation and interaction.
The primary characteristic of data-parallel problems is that the degree of data
parallelism increases with the size of the problem, which in turn makes it possible to
use more processes to solve larger problems.
Example − Dense matrix multiplication.
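As a sketch of the data-parallel idea for dense matrix multiplication, the code below splits the rows of the first matrix into contiguous blocks and applies the same row-times-matrix operation to each block in a separate worker. It uses a thread pool only to show the decomposition; in CPython, threads are not expected to speed up this pure-Python arithmetic, and the function names and block-by-rows layout are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_rows(rows, b):
    """Multiply a block of rows of A by the full matrix B."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns]
            for row in rows]


def parallel_matmul(a, b, workers=2):
    """Row-block decomposition: every worker runs the same operation on its block."""
    chunk = max(1, -(-len(a) // workers))            # ceiling division
    blocks = [a[i:i + chunk] for i in range(0, len(a), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(multiply_rows, blocks, [b] * len(blocks))
        # map() preserves block order, so the rows come back in their original order.
        return [row for block in partial for row in block]
```

Because each block of output rows depends only on its own rows of A and on B, the workers never need to interact until the results are concatenated, which is the locality-preserving decomposition mentioned above.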
Task Graph Model
In the task graph model, parallelism is expressed by a task graph. A task graph can
be either trivial or nontrivial. In this model, the relationships among the tasks are used
to promote locality or to minimize interaction costs. This model is used to solve
problems in which the quantity of data associated with the tasks is large compared to
the amount of computation associated with them. The tasks are assigned so as to
reduce the cost of data movement among them.
Examples − Parallel quick sort, sparse matrix factorization, and parallel algorithms
derived via the divide-and-conquer approach.
Here, problems are divided into atomic tasks and implemented as a graph. Each task is
an independent unit of work that may depend on one or more antecedent tasks. After
the completion of an antecedent task, its output is passed to the dependent
task. A task with antecedent tasks starts execution only when all of its antecedent tasks
are completed. The final output of the graph is received when the last dependent task is
completed (Task 6 in the above figure).
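The execution rule described above, where a task starts only after all of its antecedent tasks have finished, can be sketched with the standard-library `graphlib` module (Python 3.9 or later). The six-task graph below is a hypothetical example and is not the graph from the figure; tasks that become ready together are grouped into "waves" that could run in parallel, although this sketch runs them one after another.

```python
from graphlib import TopologicalSorter


def run_task_graph(dependencies, actions):
    """Run tasks in dependency order; return (results, waves of ready tasks)."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    results = {}
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())      # all antecedents of these are done
        waves.append(ready)
        for task in ready:
            inputs = [results[d] for d in sorted(dependencies.get(task, ()))]
            results[task] = actions[task](inputs)   # outputs flow to dependents
            sorter.done(task)
    return results, waves


# Hypothetical graph: each key lists the antecedent tasks it depends on.
example_dependencies = {1: set(), 2: set(), 3: {1, 2}, 4: {2}, 5: {3}, 6: {4, 5}}
example_actions = {
    1: lambda inputs: 1,
    2: lambda inputs: 2,
    3: sum,
    4: sum,
    5: sum,
    6: sum,
}
```

For this example the waves are [1, 2], [3, 4], [5] and [6], so at most two tasks can ever run at the same time, and the final output is produced when task 6 completes.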
Work Pool Model
In the work pool model, tasks are dynamically assigned to the processes for balancing the
load. Therefore, any process may potentially execute any task. This model is used when
the quantity of data associated with tasks is comparatively smaller than the computation
associated with the tasks.
There is no pre-assignment of tasks to the processes. Assignment of tasks may be
centralized or decentralized. Pointers to the tasks are saved in a physically shared list,
in a priority queue, or in a hash table or tree, or they could be saved in a physically
distributed data structure.
The tasks may be available at the beginning, or may be generated dynamically. If
tasks are generated dynamically and assignment is decentralized, then a
termination detection algorithm is required so that all the processes can actually detect
the completion of the entire program and stop looking for more tasks.
Example − Parallel tree search
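A minimal sketch of a centralized work pool is shown below: all tasks sit in one shared queue, and each worker repeatedly takes whatever task is next, so no task is pre-assigned to a particular worker. The sketch assumes that every task is available at the start, which is why an empty queue is enough to detect termination; the function name and the use of threads are illustrative choices.

```python
import queue
import threading


def run_work_pool(tasks, worker_fn, n_workers=3):
    """Process tasks from a shared pool; return a list of (task, result) pairs."""
    pool = queue.Queue()
    for task in tasks:
        pool.put(task)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = pool.get_nowait()     # any worker may pick up any task
            except queue.Empty:
                return                       # pool drained: this worker stops
            outcome = worker_fn(task)
            with lock:                       # protect the shared result list
                results.append((task, outcome))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

The order of the returned pairs depends on scheduling, so callers that need a fixed order should sort the result. If tasks could create new tasks while running, the simple "queue is empty" test would no longer be a safe stopping rule, which is the situation where a termination detection algorithm is needed.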
Master-Slave Model
In the master-slave model, one or more master processes generate tasks and allocate them
to slave processes. The tasks may be allocated beforehand if −
the master can estimate the volume of the tasks, or
a random assignment can do a satisfactory job of balancing load, or
slaves are assigned smaller pieces of task at different times.
This model is generally equally suitable to shared-address-space or message-passing
paradigms, since the interaction is naturally two-way.
In some cases, a task may need to be completed in phases, and the task in each phase
must be completed before the task in the next phase can be generated. The master-slave
model can be generalized to a hierarchical or multi-level master-slave
model in which the top-level master feeds a large portion of the tasks to the second-level
masters, who further subdivide the tasks among their own slaves and may perform a part
of the task themselves.
Precautions in using the master-slave model
Care should be taken to ensure that the master does not become a congestion point. This
may happen if the tasks are too small or the workers are comparatively fast.
The tasks should be selected in a way that the cost of performing a task dominates the
cost of communication and the cost of synchronization.
Asynchronous interaction may help overlap interaction and the computation associated
with work generation by the master.
Pipeline Model
It is also known as the producer-consumer model. Here a set of data is passed on
through a series of processes, each of which performs some task on it. Here, the arrival
of new data triggers the execution of a new task by a process in the queue. The
processes could form a queue in the shape of linear or multidimensional arrays, trees,
or general graphs with or without cycles.
This model is a chain of producers and consumers. Each process in the queue can be
considered as a consumer of a sequence of data items for the process preceding it in
the queue and as a producer of data for the process following it in the queue. The queue
does not need to be a linear chain; it can be a directed graph. The most common
interaction minimization technique applicable to this model is overlapping interaction
with computation.
Example − Parallel LU factorization algorithm.
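The chain of producers and consumers can be sketched with one thread per stage and a queue between neighbouring stages. This is a minimal illustration of a linear chain only, assuming Python threads; the names `stage`, `run_pipeline` and `DONE` are hypothetical, and the sketch is not the LU factorization example named above.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the data stream

def stage(func, inbox, outbox):
    """Consume items from the preceding stage, produce results for the next one."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # propagate shutdown down the chain
            break
        outbox.put(func(item))

def run_pipeline(items, funcs):
    """Push items through the stages in funcs and collect the final outputs."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
               for i, f in enumerate(funcs)]
    for t in threads:
        t.start()

    for item in items:
        queues[0].put(item)  # the arrival of data triggers work in stage 0
    queues[0].put(DONE)

    out = []
    while True:
        item = queues[-1].get()
        if item is DONE:
            break
        out.append(item)
    for t in threads:
        t.join()
    return out
```

For instance, `run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 2])` passes each item through both stages in order. While a later stage works on one item, an earlier stage can already process the next, which is where the overlap comes from.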
Hybrid Models
A hybrid algorithm model is used when more than one model is needed to
solve a problem.
A hybrid model may be composed of either multiple models applied hierarchically or
multiple models applied sequentially to different phases of a parallel algorithm.
Example − Parallel quick sort
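One possible reading of "multiple models applied sequentially to different phases" is sketched below for quick sort. It assumes a first phase that splits the data around pivots into ordered buckets and a second phase that sorts the buckets as independent tasks in a thread pool. The names `partition` and `hybrid_quicksort` are hypothetical, and the first phase runs serially here for brevity, so this is an illustration of the phase structure, not a definitive parallel quick sort.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, depth):
    """Phase 1: recursively split data around pivots into ordered buckets."""
    if depth == 0 or len(data) <= 1:
        return [data]
    pivot = data[len(data) // 2]
    less = [x for x in data if x < pivot]
    equal = [x for x in data if x == pivot]
    greater = [x for x in data if x > pivot]
    return partition(less, depth - 1) + [equal] + partition(greater, depth - 1)

def hybrid_quicksort(data, depth=2, workers=4):
    """Phase 2: sort every bucket as an independent task, then concatenate."""
    buckets = partition(list(data), depth)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the buckets.
        sorted_buckets = list(pool.map(sorted, buckets))
    return [x for bucket in sorted_buckets for x in bucket]
```

Because every element of a bucket is no larger than every element of the buckets that follow it, concatenating the sorted buckets yields a sorted result.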
Shared memory/Parallel Processing in Memory
In computer science, shared memory is memory that may be
simultaneously accessed by multiple programs with an intent
to provide communication among them or avoid redundant
copies. Shared memory is an efficient means of passing data
between programs. Depending on context, programs may
run on a single processor or on multiple separate
processors.
Using memory for communication inside a single
program, e.g. among its multiple threads, is also referred
to as shared memory.
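Shared memory among the threads of a single program can be sketched as follows. This is a minimal illustration, assuming Python threads; the function name `count_in_parallel` and the counter task are hypothetical.

```python
import threading

def count_in_parallel(num_threads=4, increments=10_000):
    """Let several threads update one counter that lives in memory they all share."""
    counter = {"value": 0}   # a single object, visible to every thread
    lock = threading.Lock()  # guards the read-modify-write on the counter

    def work():
        for _ in range(increments):
            with lock:
                counter["value"] += 1

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

No data is copied between the threads; they communicate only through the shared object, and the lock keeps concurrent updates from being lost.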
In hardware
In computer hardware, shared memory refers to a (typically large) block of random access memory (RAM) that
can be accessed by several different central processing units (CPUs) in a multiprocessor computer system.
Shared memory systems may use:[1]
uniform memory access (UMA): all the processors share the physical memory uniformly;
non-uniform memory access (NUMA): memory access time depends on the memory location relative to a
processor;
cache-only memory architecture (COMA): the local memories for the processors at each node are used as
cache instead of as actual main memory.
In software
In computer software, shared memory is either
a method of inter-process communication (IPC), i.e. a way of exchanging data between programs running at
the same time. One process will create an area in RAM which other processes can access;
a method of conserving memory space by directing accesses to what would ordinarily be copies of a piece of
data to a single instance instead, by using virtual memory mappings or with explicit support of the program in
question. This is most often used for shared libraries and for execute in place (XIP).
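The IPC use of shared memory, in which one party creates an area in RAM and another accesses it, can be sketched with Python's standard `multiprocessing.shared_memory` module. The helper name `share_bytes` is hypothetical, and to keep the sketch self-contained both handles are opened in the same process; in real use the second handle would be opened by a different process that learned the block's name through some other channel.

```python
from multiprocessing import shared_memory

def share_bytes(payload):
    """Write payload through one handle, read it back through a second handle
    attached by name. The payload must be non-empty (size must be positive)."""
    creator = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        creator.buf[:len(payload)] = payload
        # A second party attaches to the same block using only its name.
        reader = shared_memory.SharedMemory(name=creator.name)
        try:
            return bytes(reader.buf[:len(payload)])
        finally:
            reader.close()
    finally:
        creator.close()
        creator.unlink()  # release the block once nobody needs it
```

The reader sees the bytes without any copy being sent to it; only the final `bytes(...)` call copies the data out of the shared block.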