2. COMPUTER ARCHITECTURE
Sequential vs. Parallel Processing
Let’s assume we make one sandwich by taking one slice of bread,
then a slice of cheese, perhaps a piece of meat, and finishing with a
second slice of bread. A normal sequential process makes
sandwiches by repeating the same sequence: a slice of bread, a
slice of cheese, then meat, and then another slice of bread.
In parallel processing, multiple sandwiches can be made in less
time by laying out multiple slices of bread, then multiple slices of
cheese, then multiple pieces of meat, and then multiple top slices of
bread, placing each item in the right position so that every
sandwich is completed at once.
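The analogy can be sketched in code. This is a minimal illustration, not a real kitchen: the step names and sleep durations are made up, and threads stand in for the parallel sandwich makers.

```python
import time
from concurrent.futures import ThreadPoolExecutor

STEPS = ["bread", "cheese", "meat", "bread"]  # the four assembly steps

def make_sandwich(_):
    layers = []
    for step in STEPS:
        time.sleep(0.01)          # stand-in for the time one step takes
        layers.append(step)
    return layers

# Sequential: one sandwich after another.
start = time.perf_counter()
sequential = [make_sandwich(i) for i in range(4)]
t_seq = time.perf_counter() - start

# Parallel: four workers each assemble a sandwich at the same time.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(make_sandwich, range(4)))
t_par = time.perf_counter() - start
```

The parallel batch produces the same four sandwiches, but the elapsed time is close to the cost of one sandwich rather than four.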
3. COMPUTER ARCHITECTURE
Parallel Processing
Parallel processing is the division of a program’s instructions
among multiple processors so that the program runs in less time.
A computation-intensive program that took one hour to run and a tape-
copying program that took one hour to run would take a total of two
hours to run. ( Sequential Processing )
An early form of parallel processing allowed the interleaved execution of
both programs together. The computer would start an I/O operation,
and while it was waiting for the operation to complete, it would execute
the processor-intensive program. The total execution time for the two
jobs would be a little over one hour. ( Parallel Processing )
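The overlap described above can be sketched with threads. In this sketch a sleep stands in for the tape program’s I/O waits and a summation loop stands in for the processor-intensive job; the durations are illustrative only.

```python
import threading
import time

def tape_copy():
    # I/O-bound job: mostly waiting, not computing.
    time.sleep(0.2)               # stand-in for the tape program's I/O waits

def compute():
    # Processor-bound job.
    return sum(i * i for i in range(200_000))

start = time.perf_counter()
io_job = threading.Thread(target=tape_copy)
io_job.start()                    # start the I/O operation...
result = compute()                # ...and compute while it is waiting
io_job.join()
elapsed = time.perf_counter() - start
# elapsed is close to max(I/O time, compute time), not their sum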
4. COMPUTER ARCHITECTURE
Processor Systems
S.I.S.D.
S.I.M.D.
Single Instruction, Single Data Stream
A single processor executes a single instruction stream to
operate on data stored in a single memory.
Single Instruction, Multiple Data Stream
A single machine instruction controls the simultaneous
execution of a number of processing elements on a lockstep
basis. Each processing element has an associated data
memory, so that each instruction is executed on a different set
of data by the different processors.
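The lockstep idea can be modeled in a few lines. Real SIMD hardware (vector units) performs this in a single machine instruction; the Python function below is only a model in which each list index plays the role of a processing element with its own data memory.

```python
def simd_add(lanes_a, lanes_b):
    # One "instruction" (add), applied to every lane pair in lockstep.
    # Each index i models a processing element with its own data memory.
    return [a + b for a, b in zip(lanes_a, lanes_b)]

result = simd_add([1, 2, 3, 4], [10, 20, 30, 40])
# result == [11, 22, 33, 44]: the same operation, four different data sets
```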
5. COMPUTER ARCHITECTURE
Processor Systems
M.I.S.D.
M.I.M.D.
Multiple Instruction, Single Data Stream
A sequence of data is transmitted to a set of processors,
each of which executes a different instruction sequence.
This structure has not been commercially implemented.
Multiple Instruction, Multiple Data Stream
A set of processors simultaneously executes different
instruction sequences on different data sets.
SMPs, clusters, and NUMA systems fit this category.
6. COMPUTER ARCHITECTURE
Parallel Processor Classification
With the MIMD organization, the processors are general purpose;
each is able to process all of the instructions necessary to perform
the appropriate data transformation. MIMDs can be further
subdivided by the means by which the processors communicate. If
the processors share a common memory, then each processor
accesses programs and data stored in the shared memory.
7. COMPUTER ARCHITECTURE
Alternative Computer Organization
Single Instruction, Single Data Stream
There is some sort of control unit that provides an instruction stream
to a processing unit. The processing unit operates on a single data
stream from a memory unit.
8. COMPUTER ARCHITECTURE
Alternative Computer Organization
Single Instruction, Multiple Data Stream
There is still a single control unit, which provides a single instruction stream to
multiple processing units. Each processing unit may have its own dedicated
memory, or there may be a shared memory.
9. COMPUTER ARCHITECTURE
Alternative Computer Organization
Multiple Instruction, Multiple Data Stream (Shared Memory)
In MIMD, there are multiple control units, each feeding a separate instruction stream to its own
processing unit. The MIMD may be a shared-memory multiprocessor or a distributed-memory
multicomputer.
10. COMPUTER ARCHITECTURE
Alternative Computer Organization
Multiple Instruction, Multiple Data Stream (Distributed Memory)
In MIMD, there are multiple control units, each feeding a separate instruction
stream to its own processing unit. The MIMD may be a shared-memory
multiprocessor or a distributed-memory multicomputer.
11. COMPUTER ARCHITECTURE
Multiprocessor Operating System Design
Considerations
OS routines must allow several processors to execute the same OS code at the same
time. Kernel data structures must be managed properly to avoid deadlock or invalid
operations. ( Simultaneous Parallel Processing )
Any processor may perform scheduling, so conflicts must be avoided, and the scheduler must
assign ready processes to available processors. ( Scheduling )
Care must be taken to provide effective synchronization. Synchronization is a facility
that enforces mutual exclusion and event ordering. ( Synchronization )
The OS needs to exploit the available hardware parallelism to achieve the best performance. (
Memory Management )
The scheduler and other portions of the operating system must recognize the loss of a
processor and restructure accordingly. The OS should provide graceful degradation in the face of
processor failure. ( Reliability and Fault Tolerance )
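The synchronization point above can be illustrated with a lock enforcing mutual exclusion. In this sketch, threads stand in for processors and a plain counter stands in for a shared kernel structure; the names and counts are illustrative.

```python
import threading

counter = 0                        # stands in for a shared kernel structure
lock = threading.Lock()

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:                 # mutual exclusion: only one thread may
            counter += 1           # touch the shared structure at a time

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter == 40_000 on every run; without the lock, increments can be lost
```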
12. COMPUTER ARCHITECTURE
Symmetric Multiprocessor System
Symmetric means all processors can perform the same functions.
Two or more similar processors of comparable capacity.
Processors share the same memory and I/O facilities.
Processors are connected by a bus or other internal connection.
Memory access time is approximately the same for each processor.
All processors share access to I/O devices,
either through the same channels or through different channels giving
paths to the same devices.
The system is controlled by an integrated operating system, which
provides interaction between processors and their programs at the job,
task, file, and data element levels.
13. COMPUTER ARCHITECTURE
Symmetric Multiprocessor System
The processors can intercommunicate through shared memory. It may also be possible for
processors to exchange signals directly. The memory is often organized so that multiple
simultaneous accesses to separate blocks of memory are possible. In some configurations, each
processor may also have its own private main memory and I/O channels in addition to the shared
resources.
14. COMPUTER ARCHITECTURE
Bus Organization In S.M.P.
Advantages
Simplest approach to multiprocessor organization.
Easy to expand the system by attaching more processors to the bus.
The bus is essentially a passive medium, so the failure of any attached
device should not cause failure of the whole system.
Disadvantages
Performance is limited by bus-cycle time, because all memory
references pass through the shared bus.
Each processor should have a cache, which reduces the number of bus
accesses, but this leads to problems with cache coherence.
15. COMPUTER ARCHITECTURE
Cache Coherence
Definition
Cache coherence is the consistency of shared-resource data that
ends up stored in multiple local caches.
When clients in a system maintain caches of a common memory
resource, problems may arise with inconsistent data, which is
particularly the case with CPUs in a multiprocessing system.
Writing Policies
When a system writes data to cache,
it must at some point write that data
to the backing store. The timing of
this write is controlled by what is
known as the write policy.
APPROACHES:
WRITE-THROUGH: Write is done
synchronously both to the cache and to
the backing store
WRITE-BACK: The write to the backing
store is postponed until the cache
blocks containing the data are about to
be modified/replaced by new content.
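The two write policies can be sketched with a toy cache. This is a minimal model, not a real cache design: the class name, the address values, and the dictionary-backed "backing store" are all illustrative.

```python
class Cache:
    """Toy single-level cache illustrating the two write policies."""

    def __init__(self, policy):
        self.policy = policy       # "write-through" or "write-back"
        self.lines = {}            # addr -> cached value
        self.dirty = set()         # addrs written but not yet flushed
        self.backing = {}          # stand-in for the backing store

    def write(self, addr, value):
        self.lines[addr] = value
        if self.policy == "write-through":
            self.backing[addr] = value   # backing store updated at once
        else:
            self.dirty.add(addr)         # backing-store update postponed

    def evict(self, addr):
        # On replacement, a write-back cache must flush dirty data first.
        if addr in self.dirty:
            self.backing[addr] = self.lines[addr]
            self.dirty.discard(addr)
        self.lines.pop(addr, None)

wt = Cache("write-through")
wb = Cache("write-back")
wt.write(0x10, 42)
wb.write(0x10, 42)
wb_before_evict = dict(wb.backing)   # still empty: the write was postponed
wb.evict(0x10)                       # flush happens only at replacement
```

After the two writes, the write-through backing store already holds the value, while the write-back store receives it only when the line is evicted.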
16. COMPUTER ARCHITECTURE
Possible Problem With Cache Coherence Using Bus Organization
Multiple copies of the same data can exist in different caches
simultaneously, and if processors are allowed to update their own
copies freely, an inconsistent view of memory can result.
17. COMPUTER ARCHITECTURE
Solutions To The Cache Coherence Problems
Software Based
Software-based protocols rely upon the operating system and compiler.
A compiler-based protocol performs analysis on the code to determine
which data items are unsafe for caching, and then marks those items
accordingly; the operating system then prevents non-cacheable items
from being cached.
Software-based protocols are attractive because the overhead of the
problem is transferred from run time to compile time.
Hardware Based
Also known as cache coherence protocols.
These solutions provide dynamic recognition at run time of potential
inconsistency conditions.
Hardware-based protocols lead to improved performance over a
software approach.
The approaches are transparent to the programmer and the compiler,
reducing the software development burden.
They can be divided into two categories:
Directory protocols
Snoopy protocols
18. COMPUTER ARCHITECTURE
Solutions To The Cache Coherence Problems
Hardware Based
DIRECTORY PROTOCOL:
Collects and maintains information about where copies of lines
reside.
A central directory holds state information about the contents of
the various local caches and keeps that information up to date.
The directory manages the information about which caches hold a
copy of a line.
DRAWBACK: the central directory is a potential bottleneck, and there is
communication overhead between the cache controllers and the
directory; nevertheless, directory schemes are effective in large-scale
systems with multiple buses.
19. COMPUTER ARCHITECTURE
Solutions To The Cache Coherence Problems
Hardware Based
SNOOPY CACHE PROTOCOL: Distributes the responsibility for
maintaining cache coherence among all of the cache controllers in a
multiprocessor.
BASIC APPROACHES: Write Invalidate and Write Update.
Write Invalidate Protocol: Multiple readers but a single writer; only one
cache may write to a line at a time.
Write Update Protocol: Multiple readers and multiple writers; the
updated data is distributed to all caches.
20. COMPUTER ARCHITECTURE
Cluster
A cluster is a group of tightly or loosely coupled computers that
work together as a single computer.
The computers are commonly, but not always, connected through
fast local area networks.
A group of interconnected WHOLE COMPUTERS working together
can create the illusion of being one machine with parallel
processing.
A system that can run on its own, apart from the cluster, as used
in server systems, is called a whole computer.
Each computer in a cluster is called a NODE.
21. COMPUTER ARCHITECTURE
Cluster Products
In Picture: IBM Hydro Cluster
The VAXcluster was developed by D.E.C. in
the 1980s.
Microsoft, Sun Microsystems, and
other companies also offer cluster
packages of computers.
Linux has long been the most widely
used operating system for cluster
computers around the world.
22. COMPUTER ARCHITECTURE
Cluster Architecture
The individual computers are connected by
some high-speed LAN or switch hardware.
Each computer is capable of operating
independently. In addition, a middleware
layer of software is installed in each
computer to enable cluster operation. The
cluster middleware provides a unified
system image to the user, known as a
single-system image. The middleware is
also responsible for providing high
availability, by means of load balancing and
responding to failures in individual
components. A cluster will also include
software tools for enabling the efficient
execution of programs that are capable of
parallel execution.
23. COMPUTER ARCHITECTURE
Comparing Clusters With Symmetric Multiprocessors
Symmetric Multiprocessor
Easier to manage and
configure.
Less physical space and lower
power consumption.
Well established and stable.
Clusters
Far superior in terms of
incremental and absolute
scalability.
Superior in terms of availability.
All components of the system can
readily be made highly redundant.
Both provide a configuration with multiple processors to support high-demand applications.
Both solutions are available commercially.
24. COMPUTER ARCHITECTURE
Parallelized Computing
Effective use of a cluster requires executing software from a single
application in parallel.
The following are three general approaches to the problem:
PARALLELIZING COMPILER:
Determines at compile time which parts of an application can be executed in parallel.
These parts are then split off to be assigned to different computers in the cluster.
PARALLELIZED APPLICATION:
An application written from the outset to run on a cluster, using message passing to move data
between cluster nodes.
PARAMETRIC COMPUTING:
Can be used if the essence of the application is an algorithm or program that must be executed
a large number of times, each time with a different set of starting conditions or parameters.
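Parametric computing can be sketched as a parameter sweep. A minimal illustration: the `simulate` function and its parameter sets are made up, and a thread pool stands in for the cluster; on a real cluster each parameter set would be dispatched to a different node.

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(params):
    # Stand-in for the real application, run once per parameter set.
    rate, steps = params
    value = 1.0
    for _ in range(steps):
        value *= 1 + rate
    return round(value, 4)

# The sweep: the same program, many different starting conditions.
sweep = [(0.01, 10), (0.02, 10), (0.05, 10)]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(simulate, sweep))
```

Because the runs are independent, they need no communication with one another, which is what makes this style of workload a natural fit for a cluster.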
25. COMPUTER ARCHITECTURE
Non-Uniform Memory Access
Alternative to SMP and clustering
Uniform memory access (UMA)
All processors have access to all parts of main memory using loads and stores
Access time to all regions of memory is the same
Access time to memory for different processors is the same
Non-uniform memory access (NUMA)
All processors have access to all parts of main memory using loads and stores
Access time of processor differs depending on which region of main memory is
being accessed
Different processors access different regions of memory at different speeds
Cache-coherent NUMA (CC-NUMA)
A NUMA system in which cache coherence is maintained among the caches of the
various processors
26. COMPUTER ARCHITECTURE
Objective Of N.U.M.A. In Comparison
SYMMETRIC MULTIPROCESSOR
Has a practical limit to the number of
processors that can be used:
bus traffic limits the system to
between 16 and 64 processors.
CLUSTER
Each node has its own
private main memory.
Coherency is maintained by software
rather than hardware.
NON-UNIFORM MEMORY ACCESS
NUMA preserves the SMP look and
feel while allowing large-scale
multiprocessing.
The objective is to maintain a
transparent system-wide memory
while permitting multiple
multiprocessor nodes, each with its
own bus or internal interconnect
system.
27. COMPUTER ARCHITECTURE
Cache-Coherent Non-Uniform Memory Access Organization
There are multiple independent nodes, each of which is, in
effect, an SMP organization. Thus, each node contains multiple
processors, each with its own L1 and L2 caches, plus main memory.
The node is the basic building block of the overall CC-NUMA
organization. For example, each Silicon Graphics Origin node
includes two MIPS R10000 processors; each Sequent NUMA-Q
node includes four Pentium II processors. The nodes are
interconnected by means of some communications facility,
which could be a switching mechanism, a ring, or some other
networking facility. Each node in the CC-NUMA system includes
some main memory. From the point of view of the processors,
however, there is only a single addressable memory, with each
location having a unique system-wide address.
28. COMPUTER ARCHITECTURE
Cache-Coherent Non-Uniform Memory Access Organization
When a processor initiates a memory access, if the requested
memory location is not in that processor’s cache, then the L2
cache initiates a fetch operation. If the desired line is in the local
portion of the main memory, the line is fetched across the local
bus. If the desired line is in a remote portion of the main
memory, then an automatic request is sent out to fetch that line
across the interconnection network, deliver it to the local bus,
and then deliver it to the requesting cache on that bus. All of
this activity is automatic and transparent to the processor and its
cache. In this configuration, cache coherence is a central
concern. Although implementations differ as to details, in
general terms we can say that each node must maintain some
sort of directory that gives it an indication of the location of
various portions of memory and also cache status information.
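The local-versus-remote fetch path described above can be sketched as follows. This is a simplified model, not a real CC-NUMA design: the class names are invented, the directory lookup is reduced to a linear search, and coherence traffic is omitted entirely.

```python
class Interconnect:
    """Very simplified stand-in for the interconnection network/directory."""

    def __init__(self, nodes):
        self.nodes = nodes

    def fetch(self, addr):
        # Directory lookup, reduced here to a linear search over nodes.
        for node in self.nodes:
            if addr in node.local_memory:
                return node.local_memory[addr]
        raise KeyError(addr)

class NUMANode:
    """One SMP node in a CC-NUMA system, owning a slice of memory."""

    def __init__(self, local_memory):
        self.local_memory = local_memory   # this node's portion of memory
        self.cache = {}

    def read(self, addr, interconnect):
        if addr in self.cache:                   # cache hit
            return self.cache[addr]
        if addr in self.local_memory:            # miss, local portion
            value = self.local_memory[addr]
        else:                                    # miss, remote portion:
            value = interconnect.fetch(addr)     # cross the interconnect
        self.cache[addr] = value                 # transparent to the CPU
        return value

n0 = NUMANode({0x00: "a"})
n1 = NUMANode({0x10: "b"})
net = Interconnect([n0, n1])
local = n0.read(0x00, net)    # served from node 0's own memory
remote = n0.read(0x10, net)   # fetched across the interconnection network
```

Either way the requesting processor just issues a read; whether the line came from local or remote memory is invisible to it, which is exactly the transparency the slide describes.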
29. COMPUTER ARCHITECTURE
N.U.M.A.
It can deliver effective performance at higher levels of parallelism
than SMP without requiring major software changes.
Bus traffic on any individual node is limited to a demand that the bus
can handle.
If many of the memory accesses are to remote nodes, performance
begins to break down.
It does not transparently look like an SMP:
software changes will be required to move an operating system and
applications from an SMP to a CC-NUMA system.
Ease of use is also a concern.