What is Parallel Computing?
•Traditionally, software has been written for serial computation:
–To be run on a single computer having a single Central Processing Unit (CPU)
–A problem is broken into a discrete series of instructions
–Instructions are executed one after another
–Only one instruction may execute at any moment in time
•Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
–To be run using multiple CPUs
–A problem is broken into discrete parts that can be solved concurrently
–Each part is further broken down into a series of instructions
•Instructions from each part execute simultaneously on different CPUs (a sketch follows below)
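The contrast can be made concrete with a minimal C sketch, assuming a compiler with OpenMP support (e.g., gcc -fopenmp); the array size and variable names are illustrative only. The first loop runs serially, one instruction at a time; in the second, OpenMP splits the iterations across multiple CPUs.

#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* Serial computation: a single instruction stream, executed
       one instruction after another. */
    for (int i = 0; i < N; i++)
        a[i] = (double)i;

    /* Parallel computation: the problem is broken into parts that
       run concurrently on different CPUs; the reduction combines
       the partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.0f\n", sum);
    return 0;
}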
Why Parallel Computing?
•The primary reasons for using parallel computing:
–Save time
–Solve large problems
–Provide concurrency (do multiple things at the same time)
–Take advantage of non-local resources
–Overcome memory constraints
–Save costs
Basic Design
•Memory is used to store both program instructions and data
–Program instructions are coded data which tell the computer to do something
–Data is simply information to be used by the program
•A central processing unit (CPU) gets instructions and/or data from memory, decodes the instructions and then sequentially performs them (a toy sketch follows below)
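The fetch-decode-execute cycle just described can be illustrated with a toy simulator in C. This is a hypothetical machine with an invented opcode encoding, not any real CPU; note that instructions and data share the same memory, exactly as stated above.

#include <stdio.h>

/* Invented opcodes for a hypothetical accumulator machine. */
enum { HALT = 0, LOAD = 1, ADD = 2, STORE = 3 };

int main(void) {
    /* One memory holds both the program (cells 0..6) and the data
       (cells 10..12): compute memory[12] = memory[10] + memory[11]. */
    int memory[16] = { LOAD, 10, ADD, 11, STORE, 12, HALT,
                       0, 0, 0, 5, 7, 0 };
    int pc = 0, acc = 0, running = 1;

    while (running) {
        int op = memory[pc++];                           /* fetch */
        switch (op) {                                    /* decode */
        case LOAD:  acc  = memory[memory[pc++]]; break;  /* execute */
        case ADD:   acc += memory[memory[pc++]]; break;
        case STORE: memory[memory[pc++]] = acc;  break;
        case HALT:  running = 0;                 break;
        }
    }
    printf("memory[12] = %d\n", memory[12]);   /* prints 12 (5 + 7) */
    return 0;
}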
Parallel Computer Models
Classification of parallel architectures
•Flynn’s taxonomy
•Classification based on the memory arrangement
•Classification based on type of interconnection
Flynn’s Taxonomy
–The most universally accepted method of classifying computer systems
–Any computer can be placed in one of 4 broad categories:
»SISD: Single instruction stream, single data stream
»SIMD: Single instruction stream, multiple data streams
»MISD: Multiple instruction streams, single data stream
»MIMD: Multiple instruction streams, multiple data streams
SISD
[Diagram: a control unit issues a single instruction stream (IS) to one processing element (PE), which exchanges a single data stream (DS) with main memory (M).]
Single Instruction, Single Data (SISD)
•A serial (non-parallel) computer
•Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
•Single data: only one data stream is being used as input during any one clock cycle
•This is the oldest and, until recently, the most prevalent form of computer
•Examples: most PCs, single-CPU workstations and mainframes
SIMD
•A type of parallel computer
•Single instruction: all processing units execute the same instruction at any given clock cycle
•Multiple data: each processing unit can operate on a different data element (illustrated in the sketch after this list)
•Best suited for specialized problems characterized by a high degree of regularity, such as image processing
•Two varieties: Processor Arrays and Vector Pipelines
•Examples:
–Processor Arrays: Connection Machine CM-2, MasPar MP-1 and MP-2
–Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S820
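The SIMD idea survives today inside ordinary CPUs. Below is a minimal sketch using x86 SSE intrinsics (it assumes an x86 compiler with SSE enabled; the values are illustrative). A single add instruction operates on four data elements at once: one instruction stream, multiple data streams.

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    float a[4] = {1, 2, 3, 4};
    float b[4] = {10, 20, 30, 40};
    float c[4];

    __m128 va = _mm_loadu_ps(a);     /* load four floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  /* one instruction, four adds */
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++)
        printf("%.0f ", c[i]);       /* prints: 11 22 33 44 */
    printf("\n");
    return 0;
}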
Applications
•Image processing
•Matrix manipulations
•Sorting
MISD
•A single data stream is fed into multiple processing units.
•Each processing unit operates on the data independently via independent instruction streams.
•Many functional units perform different operations on the same data.
Applications
•Classification
•Robot vision
MIMD
•Currently the most common type of parallel computer; most modern computers fall into this category
•Multiple instruction: every processor may be executing a different instruction stream
•Multiple data: every processor may be working with a different data stream
•Execution can be synchronous or asynchronous, deterministic or non-deterministic (see the sketch below)
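A small MIMD sketch using POSIX threads in C; the two task functions are invented for illustration. Each thread executes its own instruction stream on the data, and the threads run asynchronously with respect to each other.

#include <stdio.h>
#include <pthread.h>

static int nums[4] = {3, 1, 4, 1};

void *sum_task(void *arg) {       /* one instruction stream */
    (void)arg;
    int s = 0;
    for (int i = 0; i < 4; i++) s += nums[i];
    printf("sum = %d\n", s);
    return NULL;
}

void *max_task(void *arg) {       /* a different instruction stream */
    (void)arg;
    int m = nums[0];
    for (int i = 1; i < 4; i++) if (nums[i] > m) m = nums[i];
    printf("max = %d\n", m);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, sum_task, NULL);   /* different code ... */
    pthread_create(&t2, NULL, max_task, NULL);   /* ... runs concurrently */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}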
Classification based on memory arrangement
[Diagram: shared memory - multiprocessors: processors PE1…PEn access a common shared memory through an interconnection network. Message passing - multicomputers: nodes PE1…PEn, each pairing a processor Pi with its own local memory Mi and I/O, communicate over an interconnection network.]
•Multiple processors can operate independently but share the same memory resources.
•Changes in a memory location effected by one processor are visible to all other processors.
•Processors easily communicate by means of shared variables.
•Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA.
Shared Memory: Pro and Con
•Advantages
–Global address space provides a user-friendly programming perspective to memory
–Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs
•Disadvantages
–The primary disadvantage is the lack of scalability between memory and CPUs: adding more CPUs can geometrically increase traffic on the shared memory
–The programmer is responsible for synchronization constructs that ensure "correct" access of global memory (see the sketch after this list)
–Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever-increasing numbers of processors
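As a sketch of the synchronization responsibility noted above, here is a minimal shared-memory example using POSIX threads; the counter and iteration count are illustrative. Both threads see the same global variable; without the mutex, concurrent increments could interleave and updates could be lost.

#include <stdio.h>
#include <pthread.h>

static long counter = 0;   /* shared memory: visible to all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* synchronization construct */
        counter++;                    /* protected access to shared memory */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* always 200000 with the lock */
    return 0;
}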
Distributed memory multicomputers
[Diagram: multiple nodes, each pairing a processing element (PE) with its own local memory (M), connected by an interconnection network.]
•Processors have their own local memory.
•Memory addresses in one processor do not map to another processor, so there is no concept of a global address space across all processors.
•Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have no effect on the memory of other processors; hence, the concept of cache coherency does not apply.
•When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility (see the sketch below).
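A minimal message-passing sketch using MPI (it assumes an MPI installation; compile with mpicc and run with, e.g., mpirun -np 2). The two processes have separate address spaces, so data moves between them only through explicit send and receive calls, exactly the programmer responsibility described above.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Explicitly communicate: no shared address space exists. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}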
Distributed Memory: Pro and Con
•Advantages
–Memory is scalable with the number of processors: increase the number of processors and the size of memory increases proportionately
–Each processor can rapidly access its own memory without interference and without the overhead incurred in trying to maintain cache coherency
–Cost effectiveness: can use commodity, off-the-shelf processors and networking
•Disadvantages
–The programmer is responsible for many of the details associated with data communication between processors
–It may be difficult to map existing data structures, based on global memory, to this memory organization
–Non-uniform memory access (NUMA) times
Classification based on type of interconnections
•Static networks
•Dynamic networks
Scalar and vector processors
•Scalar processors are the most basic type of processor. They process one item at a time, typically an integer or a floating-point number (a format for representing values, such as very large or very small quantities, that cannot be expressed as plain integers).
•As each instruction is handled sequentially, basic scalar processing can take up some time.
•Vector processors operate on an array of data points. Rather than handling each item individually, multiple items that all require the same instruction can be handled at once.
•This can save time over scalar processing, but it also adds complexity to a system, which can slow other functions.
•Vector processing works best when there is a large amount of data to be processed, groups of which can be handled by one instruction.
Conventional scalar processor
Initialize I = 1
20 Read A(I)
Read B(I)
Store C(I) = A(I) + B(I)
Increment I = I + 1
If I ≤ 100 go to 20
Continue
Vector processor
Single vector instruction
C(1:100) = A(1:100) + B(1:100)
Vector processor can process:
•Vector data type
•Apply same operation on all elements of the vector
•No dependencies amongst elements
•Same motivation as SIMD
What is vector processing?
•A vector processor is one that can compute operations on entire vectors with one simple instruction.
•A vector compiler will attempt to translate loops into single vector instructions.
•Example: suppose we have the following for loop:
for i = 1, n
    X(i) = Y(i) + Z(i)
continue
•This will be translated into one long vector of length n, and a vector add instruction will be executed (a C version follows below).
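The same loop written in C, as a hedged illustration: because the iterations are independent (restrict asserts that the arrays do not overlap), a vectorizing compiler such as gcc at -O3 may replace groups of iterations with vector add instructions, mirroring the single vector instruction above.

#include <stddef.h>

/* x(i) = y(i) + z(i) for i = 1..n, written with 0-based C arrays. */
void vec_add(float *restrict x, const float *restrict y,
             const float *restrict z, size_t n) {
    for (size_t i = 0; i < n; i++)
        x[i] = y[i] + z[i];    /* independent iterations: vectorizable */
}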
Why is this more efficient?
#1: Because only one instruction is needed, the vector processor does not have to fetch and decode as many instructions; thus, memory bandwidth and control unit overhead are reduced considerably.
#2: The vector processor, after receiving the instruction, is told that it must fetch x pairs of operands. When received, they are passed directly to a pipelined data unit that processes them.
There are 2 specific kinds of machines:
#1: Memory to memory: operands are fetched from memory and passed directly to the functional unit. The results are then written back out to memory to complete the process.
#2: Register to register: operands are loaded into a set of vector registers, the operands are fetched from the vector registers, and the results are returned to a vector register.
Vector Instruction Set Advantages
•Compact
–one short instruction encodes N operations
•Expressive, tells hardware that these N
operations:
–are independent
–use the same functional unit
–access disjoint registers
–access registers in the same pattern as previous instructions
–access a contiguous block of memory (unit-stride load/store)
–access memory in a known pattern (strided load/store); both patterns are illustrated after this list
•Scalable
–can run same object code on more parallel pipelines or lanes
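As a small illustration of the unit-stride and strided patterns named above, consider a row-major matrix; the helper functions are hypothetical. Sweeping a row touches a contiguous block of memory (unit-stride load/store), while sweeping a column touches every cols-th element, a constant and known stride (strided load/store).

#include <stddef.h>

void scale_row(float *row, size_t n, float s) {
    for (size_t i = 0; i < n; i++)
        row[i] *= s;            /* unit stride: contiguous block */
}

void scale_column(float *m, size_t rows, size_t cols,
                  size_t col, float s) {
    for (size_t r = 0; r < rows; r++)
        m[r * cols + col] *= s; /* constant stride of 'cols' elements */
}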
Disadvantages
•Not as fast on scalar (non-vector) instructions
•Complexity
•Difficulties in implementing
•High price of on-chip vector memory systems
•Increased code complexity
Applications
•Servers
•Home Cinema
•Supercomputing
•Cluster Computing
•Mainframes
Array Computers
•An array processor is a synchronous parallel computer with multiple arithmetic logic units, called processing elements (PEs), that can operate in parallel.
•The PEs are synchronized to perform the same function at the same time.
•Only a few array computers are designed primarily for numerical computation; the others are for research purposes.
Functional structure of array computer
•Array processors are also known as multiprocessors or vector processors. They perform computations on large arrays of data and are thus used to improve the performance of the computer.
•Two types of array processor:
–Attached Array Processors
–SIMD Array Processors
Attached Array Processors:
•An attached array processor is a processor attached to a general-purpose computer; its purpose is to enhance and improve the performance of that computer in numerical computational tasks.
•It achieves high performance by means of parallel processing with multiple functional units.
SIMD Array Processors
•SIMD is the organization of a single computer containing multiple processors operating in parallel.
•The processing units operate under the control of a common control unit, thus providing a single instruction stream and multiple data streams.
•A general block diagram of an array processor is on the next slide.
•It contains a set of identical processing elements (PEs), each of which has a local memory M.
•Each processing element includes an ALU and registers.
•The master control unit controls all the operations of the processing elements. It also decodes the instructions and determines how each instruction is to be executed.
•The main memory is used for storing the program.
•The control unit is responsible for fetching the instructions.
•Vector instructions are sent to all PEs simultaneously, and results are returned to memory.
•The best-known SIMD array processor is the ILLIAC IV computer developed by the Burroughs Corporation. SIMD processors are highly specialized computers.
•They are only suitable for numerical problems that can be expressed in vector or matrix form; they are not suitable for other types of computations.
Why use the Array Processor?
•Array processors increase the overall instruction processing speed.
•Most array processors operate asynchronously from the host CPU, which improves the overall capacity of the system.
•Array processors have their own local memory, providing extra memory for systems with low memory.