2. COMPUTER ARCHITECTURE
Sequential vs. Parallel Processing
Let’s assume we make one sandwich by taking one slice of bread,
then a slice of cheese, perhaps a piece of meat, and finishing with a
second slice of bread. A normal sequential process makes
sandwiches by repeating the same sequence: a slice of bread, a
slice of cheese, then meat, and then another slice of bread.
In parallel processing, multiple sandwiches can be made in less
time by laying out multiple slices of bread, then multiple slices of
cheese, then multiple pieces of meat, and then multiple top slices of
bread, placing each item in the right position so that every
sandwich is completed at once.
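The analogy can be sketched in code. This is a minimal illustration, not a real kitchen: the step names and sleep durations are made up, and threads stand in for the parallel sandwich makers.

```python
import time
from concurrent.futures import ThreadPoolExecutor

STEPS = ["bread", "cheese", "meat", "bread"]  # the four assembly steps

def make_sandwich(_):
    layers = []
    for step in STEPS:
        time.sleep(0.01)          # stand-in for the time one step takes
        layers.append(step)
    return layers

# Sequential: one sandwich after another.
start = time.perf_counter()
sequential = [make_sandwich(i) for i in range(4)]
t_seq = time.perf_counter() - start

# Parallel: four workers each assemble a sandwich at the same time.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(make_sandwich, range(4)))
t_par = time.perf_counter() - start
```

The parallel batch produces the same four sandwiches, but the elapsed time is close to the cost of one sandwich rather than four.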
3. COMPUTER ARCHITECTURE
Parallel Processing
Parallel processing is the division of a program’s instructions
among multiple processors so that the program runs in less time.
A computation-intensive program that took one hour to run and a tape-
copying program that took one hour to run would take a total of two
hours to run. ( Sequential Processing )
An early form of parallel processing allowed the interleaved execution of
both programs together. The computer would start an I/O operation,
and while it was waiting for the operation to complete, it would execute
the processor-intensive program. The total execution time for the two
jobs would be a little over one hour. ( Parallel Processing )
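The overlap described above can be sketched with threads. In this sketch a sleep stands in for the tape program’s I/O waits and a summation loop stands in for the processor-intensive job; the durations are illustrative only.

```python
import threading
import time

def tape_copy():
    # I/O-bound job: mostly waiting, not computing.
    time.sleep(0.2)               # stand-in for the tape program's I/O waits

def compute():
    # Processor-bound job.
    return sum(i * i for i in range(200_000))

start = time.perf_counter()
io_job = threading.Thread(target=tape_copy)
io_job.start()                    # start the I/O operation...
result = compute()                # ...and compute while it is waiting
io_job.join()
elapsed = time.perf_counter() - start
# elapsed is close to max(I/O time, compute time), not their sum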
4. COMPUTER ARCHITECTURE
Processor Systems
S.I.S.D.
S.I.M.D.
Single Instruction, Single Data Stream
A single processor executes a single instruction stream to
operate on data stored in a single memory.
Single Instruction, Multiple Data Stream
A single machine instruction controls the simultaneous
execution of a number of processing elements on a lockstep
basis. Each processing element has an associated data
memory, so that each instruction is executed on a different set
of data by the different processors.
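The lockstep idea can be modeled in a few lines. Real SIMD hardware (vector units) performs this in a single machine instruction; the Python function below is only a model in which each list index plays the role of a processing element with its own data memory.

```python
def simd_add(lanes_a, lanes_b):
    # One "instruction" (add), applied to every lane pair in lockstep.
    # Each index i models a processing element with its own data memory.
    return [a + b for a, b in zip(lanes_a, lanes_b)]

result = simd_add([1, 2, 3, 4], [10, 20, 30, 40])
# result == [11, 22, 33, 44]: the same operation, four different data sets
```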
5. COMPUTER ARCHITECTURE
Processor Systems
M.I.S.D.
M.I.M.D.
Multiple Instruction, Single Data Stream
A sequence of data is transmitted to a set of processors,
each of which executes a different instruction sequence.
This structure has not been commercially implemented.
Multiple Instruction, Multiple Data Stream
A set of processors simultaneously executes different
instruction sequences on different data sets.
SMPs, clusters, and NUMA systems fit this category.
6. COMPUTER ARCHITECTURE
Parallel Processor Classification
With the MIMD organization, the processors are general purpose;
each is able to process all of the instructions necessary to perform
the appropriate data transformation. MIMDs can be further
subdivided by the means by which the processors communicate. If
the processors share a common memory, then each processor
accesses programs and data stored in the shared memory.
7. COMPUTER ARCHITECTURE
Alternative Computer Organization
Single Instruction, Single Data Stream
There is some sort of control unit that provides an instruction stream
to a processing unit. The processing unit operates on a single data
stream from a memory unit.
8. COMPUTER ARCHITECTURE
Alternative Computer Organization
Single Instruction, Multiple Data Stream
There is still a single control unit, which provides a single instruction stream to
multiple processing units. Each processing unit may have its own dedicated
memory, or there may be a shared memory.
9. COMPUTER ARCHITECTURE
Alternative Computer Organization
Multiple Instruction, Multiple Data Stream (Shared Memory)
In MIMD, there are multiple control units, each feeding a separate instruction stream to its own
processing unit. The MIMD may be a shared-memory multiprocessor or a distributed-memory
multicomputer.
10. COMPUTER ARCHITECTURE
Alternative Computer Organization
Multiple Instruction, Multiple Data Stream (Distributed Memory)
In MIMD, there are multiple control units, each feeding a separate instruction
stream to its own processing unit. The MIMD may be a shared-memory
multiprocessor or a distributed-memory multicomputer.
11. COMPUTER ARCHITECTURE
Multiprocessor Operating System Design
Considerations
OS routines must allow several processors to execute the same OS code at the same
time. Kernel data structures must be managed properly to avoid deadlock or invalid
operations. ( Simultaneous Parallel Processing )
Any processor may perform scheduling, so conflicts must be avoided, and the scheduler must
assign ready processes to available processors. ( Scheduling )
Care must be taken to provide effective synchronization. Synchronization is a facility
that enforces mutual exclusion and event ordering. ( Synchronization )
The OS needs to exploit the available hardware parallelism to achieve the best performance. (
Memory Management )
The scheduler and other portions of the operating system must recognize the loss of a
processor and restructure accordingly. The OS should provide graceful degradation in the face of
processor failure. ( Reliability and Fault Tolerance )
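The synchronization point above can be illustrated with a lock enforcing mutual exclusion. In this sketch, threads stand in for processors and a plain counter stands in for a shared kernel structure; the names and counts are illustrative.

```python
import threading

counter = 0                        # stands in for a shared kernel structure
lock = threading.Lock()

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:                 # mutual exclusion: only one thread may
            counter += 1           # touch the shared structure at a time

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter == 40_000 on every run; without the lock, increments can be lost
```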
12. COMPUTER ARCHITECTURE
Symmetric Multiprocessor System
Symmetric means all processors can perform the same functions.
Two or more similar processors of comparable capacity.
Processors share the same memory and I/O facilities.
Processors are connected by a bus or other internal connection.
Memory access time is approximately the same for each processor.
All processors share access to I/O devices,
either through the same channels or through different channels giving
paths to the same devices.
The system is controlled by an integrated operating system, which
provides interaction between processors and their programs at the job,
task, file, and data element levels.
13. COMPUTER ARCHITECTURE
Symmetric Multiprocessor System
The processors can intercommunicate through shared memory. It may also be possible for
processors to exchange signals directly. The memory is often organized so that multiple
simultaneous accesses to separate blocks of memory are possible. In some configurations, each
processor may also have its own private main memory and I/O channels in addition to the shared
resources.
14. COMPUTER ARCHITECTURE
Bus Organization In S.M.P.
Advantages
Simplest approach to multiprocessor organization.
Easy to expand the system by attaching more processors to the bus.
The bus is essentially a passive medium, so the failure of any attached
device should not cause failure of the whole system.
Disadvantages
Performance is limited by bus-cycle time, because all memory
references pass through the shared bus.
Each processor should have a cache, which reduces the number of bus
accesses, but this leads to problems with cache coherence.
15. COMPUTER ARCHITECTURE
Cache Coherence
Definition
Cache coherence is the consistency of shared-resource data that
ends up stored in multiple local caches.
When clients in a system maintain caches of a common memory
resource, problems may arise with inconsistent data, which is
particularly the case with CPUs in a multiprocessing system.
Writing Policies
When a system writes data to cache,
it must at some point write that data
to the backing store. The timing of
this write is controlled by what is
known as the write policy.
APPROACHES:
WRITE-THROUGH: Write is done
synchronously both to the cache and to
the backing store
WRITE-BACK: The write to the backing
store is postponed until the cache
blocks containing the data are about to
be modified/replaced by new content.
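The two write policies can be sketched with a toy cache. This is a minimal model, not a real cache design: the class name, the address values, and the dictionary-backed "backing store" are all illustrative.

```python
class Cache:
    """Toy single-level cache illustrating the two write policies."""

    def __init__(self, policy):
        self.policy = policy       # "write-through" or "write-back"
        self.lines = {}            # addr -> cached value
        self.dirty = set()         # addrs written but not yet flushed
        self.backing = {}          # stand-in for the backing store

    def write(self, addr, value):
        self.lines[addr] = value
        if self.policy == "write-through":
            self.backing[addr] = value   # backing store updated at once
        else:
            self.dirty.add(addr)         # backing-store update postponed

    def evict(self, addr):
        # On replacement, a write-back cache must flush dirty data first.
        if addr in self.dirty:
            self.backing[addr] = self.lines[addr]
            self.dirty.discard(addr)
        self.lines.pop(addr, None)

wt = Cache("write-through")
wb = Cache("write-back")
wt.write(0x10, 42)
wb.write(0x10, 42)
wb_before_evict = dict(wb.backing)   # still empty: the write was postponed
wb.evict(0x10)                       # flush happens only at replacement
```

After the two writes, the write-through backing store already holds the value, while the write-back store receives it only when the line is evicted.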
16. COMPUTER ARCHITECTURE
Possible Problem With Cache Coherence Using Bus Organization
Multiple copies of the same data can exist in different caches
simultaneously, and if processors are allowed to update their own
copies freely, an inconsistent view of memory can result.
17. COMPUTER ARCHITECTURE
Solutions To The Cache Coherence Problems
Software Based
Software-based protocols rely upon the operating system and compiler.
A compiler-based protocol performs analysis on the code to determine
which data items are unsafe for caching, and then marks those items
accordingly; the operating system then prevents non-cacheable items
from being cached.
Software-based protocols are attractive because the overhead of the
problem is transferred from run time to compile time.
Hardware Based
Also known as cache coherence protocols.
These solutions provide dynamic recognition at run time of potential
inconsistency conditions.
Hardware-based protocols lead to improved performance over a
software approach.
The approaches are transparent to the programmer and the compiler,
reducing the software development burden.
They can be divided into two categories:
Directory protocols
Snoopy protocols
18. COMPUTER ARCHITECTURE
Solutions To The Cache Coherence Problems
Hardware Based
DIRECTORY PROTOCOL:
Collects and maintains information about where copies of lines
reside.
A central directory holds state information about the contents of
the various local caches and keeps that information up to date.
The directory manages the information about which caches hold a
copy of a line.
DRAWBACK: the central directory is a potential bottleneck, and there is
communication overhead between the cache controllers and the
directory; nevertheless, directory schemes are effective in large-scale
systems with multiple buses.
19. COMPUTER ARCHITECTURE
Solutions To The Cache Coherence Problems
Hardware Based
SNOOPY CACHE PROTOCOL: Distributes the responsibility for
maintaining cache coherence among all of the cache controllers in a
multiprocessor.
BASIC APPROACHES: Write Invalidate and Write Update.
Write Invalidate Protocol: Multiple readers but a single writer; only one
cache may write to a line at a time.
Write Update Protocol: Multiple readers and multiple writers; the
updated data is distributed to all caches.
20. COMPUTER ARCHITECTURE
Cluster
A cluster is a group of tightly or loosely coupled computers that
work together as a single computer.
The computers are commonly, but not always, connected through
fast local area networks.
A group of interconnected WHOLE COMPUTERS working together
can create the illusion of being one machine with parallel
processing.
A system that can run on its own, apart from the cluster, as used
in server systems, is called a whole computer.
Each computer in a cluster is called a NODE.
21. COMPUTER ARCHITECTURE
Cluster Products
In Picture: IBM Hydro Cluster
The VAXcluster was developed by D.E.C. in
the 1980s.
Microsoft, Sun Microsystems, and
other companies also offer cluster
packages of computers.
Linux has long been the most widely
used operating system for cluster
computers around the world.
22. COMPUTER ARCHITECTURE
Cluster Architecture
The individual computers are connected by
some high-speed LAN or switch hardware.
Each computer is capable of operating
independently. In addition, a middleware
layer of software is installed in each
computer to enable cluster operation. The
cluster middleware provides a unified
system image to the user, known as a
single-system image. The middleware is
also responsible for providing high
availability, by means of load balancing and
responding to failures in individual
components. A cluster will also include
software tools for enabling the efficient
execution of programs that are capable of
parallel execution.
23. COMPUTER ARCHITECTURE
Comparing Clusters With Symmetric Multiprocessors
Symmetric Multiprocessor
Easier to manage and
configure.
Less physical space and lower
power consumption.
Well established and stable.
Clusters
Far superior in terms of
incremental and absolute
scalability.
Superior in terms of availability.
All components of the system can
readily be made highly redundant.
Both provide a configuration with multiple processors to support high-demand applications.
Both solutions are available commercially.
24. COMPUTER ARCHITECTURE
Parallelized Computing
Effective use of a cluster requires executing software from a single
application in parallel.
The following are three general approaches to the problem:
PARALLELIZING COMPILER:
Determines at compile time which parts of an application can be executed in parallel.
These parts are then split off to be assigned to different computers in the cluster.
PARALLELIZED APPLICATION:
An application written from the outset to run on a cluster, using message passing to move data
between cluster nodes.
PARAMETRIC COMPUTING:
Can be used if the essence of the application is an algorithm or program that must be executed
a large number of times, each time with a different set of starting conditions or parameters.
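Parametric computing can be sketched as a parameter sweep. A minimal illustration: the `simulate` function and its parameter sets are made up, and a thread pool stands in for the cluster; on a real cluster each parameter set would be dispatched to a different node.

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(params):
    # Stand-in for the real application, run once per parameter set.
    rate, steps = params
    value = 1.0
    for _ in range(steps):
        value *= 1 + rate
    return round(value, 4)

# The sweep: the same program, many different starting conditions.
sweep = [(0.01, 10), (0.02, 10), (0.05, 10)]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(simulate, sweep))
```

Because the runs are independent, they need no communication with one another, which is what makes this style of workload a natural fit for a cluster.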
25. COMPUTER ARCHITECTURE
Non-Uniform Memory Access
Alternative to SMP and clustering
Uniform memory access (UMA)
All processors have access to all parts of main memory using loads and stores
Access time to all regions of memory is the same
Access time to memory for different processors is the same
Non-uniform memory access (NUMA)
All processors have access to all parts of main memory using loads and stores
Access time of processor differs depending on which region of main memory is
being accessed
Different processors access different regions of memory at different speeds
Cache-coherent NUMA (CC-NUMA)
A NUMA system in which cache coherence is maintained among the caches of the
various processors
26. COMPUTER ARCHITECTURE
Objective Of N.U.M.A. In Comparison
SYMMETRIC MULTIPROCESSOR
Has a practical limit to the number of
processors that can be used:
bus traffic limits the system to
between 16 and 64 processors.
CLUSTER
Each node has its own
private main memory.
Coherency is maintained by software
rather than hardware.
NON-UNIFORM MEMORY ACCESS
NUMA preserves the SMP look and
feel while allowing large-scale
multiprocessing.
The objective is to maintain a
transparent system-wide memory
while permitting multiple
multiprocessor nodes, each with its
own bus or internal interconnect
system.
27. COMPUTER ARCHITECTURE
Cache-Coherent Non-Uniform Memory Access Organization
There are multiple independent nodes, each of which is, in
effect, an SMP organization. Thus, each node contains multiple
processors, each with its own L1 and L2 caches, plus main memory.
The node is the basic building block of the overall CC-NUMA
organization. For example, each Silicon Graphics Origin node
includes two MIPS R10000 processors; each Sequent NUMA-Q
node includes four Pentium II processors. The nodes are
interconnected by means of some communications facility,
which could be a switching mechanism, a ring, or some other
networking facility. Each node in the CC-NUMA system includes
some main memory. From the point of view of the processors,
however, there is only a single addressable memory, with each
location having a unique system-wide address.
28. COMPUTER ARCHITECTURE
Cache-Coherent Non-Uniform Memory Access Organization
When a processor initiates a memory access, if the requested
memory location is not in that processor’s cache, then the L2
cache initiates a fetch operation. If the desired line is in the local
portion of the main memory, the line is fetched across the local
bus. If the desired line is in a remote portion of the main
memory, then an automatic request is sent out to fetch that line
across the interconnection network, deliver it to the local bus,
and then deliver it to the requesting cache on that bus. All of
this activity is automatic and transparent to the processor and its
cache. In this configuration, cache coherence is a central
concern. Although implementations differ as to details, in
general terms we can say that each node must maintain some
sort of directory that gives it an indication of the location of
various portions of memory and also cache status information.
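The local-versus-remote fetch path described above can be sketched as follows. This is a simplified model, not a real CC-NUMA design: the class names are invented, the directory lookup is reduced to a linear search, and coherence traffic is omitted entirely.

```python
class Interconnect:
    """Very simplified stand-in for the interconnection network/directory."""

    def __init__(self, nodes):
        self.nodes = nodes

    def fetch(self, addr):
        # Directory lookup, reduced here to a linear search over nodes.
        for node in self.nodes:
            if addr in node.local_memory:
                return node.local_memory[addr]
        raise KeyError(addr)

class NUMANode:
    """One SMP node in a CC-NUMA system, owning a slice of memory."""

    def __init__(self, local_memory):
        self.local_memory = local_memory   # this node's portion of memory
        self.cache = {}

    def read(self, addr, interconnect):
        if addr in self.cache:                   # cache hit
            return self.cache[addr]
        if addr in self.local_memory:            # miss, local portion
            value = self.local_memory[addr]
        else:                                    # miss, remote portion:
            value = interconnect.fetch(addr)     # cross the interconnect
        self.cache[addr] = value                 # transparent to the CPU
        return value

n0 = NUMANode({0x00: "a"})
n1 = NUMANode({0x10: "b"})
net = Interconnect([n0, n1])
local = n0.read(0x00, net)    # served from node 0's own memory
remote = n0.read(0x10, net)   # fetched across the interconnection network
```

Either way the requesting processor just issues a read; whether the line came from local or remote memory is invisible to it, which is exactly the transparency the slide describes.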
29. COMPUTER ARCHITECTURE
N.U.M.A.
It can deliver effective performance at higher levels of parallelism
than SMP without requiring major software changes.
Bus traffic on any individual node is limited to a demand that the bus
can handle.
If many of the memory accesses are to remote nodes, performance
begins to break down.
It does not transparently look like an SMP:
software changes will be required to move an operating system and
applications from an SMP to a CC-NUMA system.
Ease of use is also a concern.