The document discusses parallel computing and different types of parallel architectures. It begins with an overview of parallelism and how problems can be solved simultaneously using multiple processors. It then describes different classifications of parallel architectures including SISD, SIMD, MIMD, MISD, and vector processors. Specific examples like symmetric multiprocessors and hardware multithreading are also summarized. The document provides information on parallel computing concepts at a high level.
COA-Unit4-PPT.pptx
1. CHRIST
Deemed to be University
Unit IV - PARALLELISM
● Parallel processing challenges – Flynn’s classification – SISD, MIMD,
SIMD, SPMD, and Vector Architectures - Hardware multithreading –
Multi-core processors and other Shared Memory Multiprocessors -
Introduction to Graphics Processing Units, Clusters, Warehouse Scale
Computers and other Message-Passing Multiprocessors.
Excellence and Service
What is Parallelism?
● Doing Things Simultaneously
○ Same thing or different things
○ Solving one larger problem
● Serial Computing
○ The problem is broken into a stream of instructions that are executed
sequentially, one after another, on a single processor.
○ One instruction executes at a time.
● Parallel Computing
○ Problem divided into parts that can be solved concurrently.
○ Each part is further broken into a stream of instructions.
○ Instructions from different parts execute simultaneously.
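The contrast between serial and parallel computing can be sketched in a few lines of Python. This is only an illustration: the function names `serial_sum` and `parallel_sum` and the choice of four parts are assumptions, not from the slides. In standard CPython the threads show the decomposition into concurrently solvable parts rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def serial_sum(values):
    """Serial computing: a single instruction stream walks the whole problem."""
    total = 0
    for v in values:
        total += v
    return total


def parallel_sum(values, parts=4):
    """Parallel computing: divide the problem into parts solved concurrently."""
    size = max(1, (len(values) + parts - 1) // parts)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        # Each part is itself a stream of instructions (serial_sum).
        partial_sums = list(pool.map(serial_sum, chunks))
    return sum(partial_sums)  # combine the partial results


data = list(range(1, 101))
print(serial_sum(data), parallel_sum(data))
```

Both calls return the same total; only the way the work is divided differs.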
Serial computation
● Traditionally, serial computation uses a single computer
with a single Central Processing Unit (CPU).
● In serial computation, a large problem is broken into smaller parts,
but these sub-parts are executed one by one.
● Only a single instruction may execute at a time, so solving a large
problem takes a long time.
Serial computation (cont.)
[Figure: a problem is broken into instructions 1, 2, …, N-1, N, which a single CPU executes one at a time.]
Parallel Computing
[Figure: a problem is divided into sub-problems; each sub-problem's instructions run on a separate CPU.]
Different forms of parallel computing
● Bit level
● Instruction level
● Data parallelism
● Task parallelism
Advantages of Parallel Computing
● Solves large problems more easily.
● Saves time and money.
● Transmits data quickly.
● Provides concurrency.
● Communicates in the proper way.
● Gives good performance.
● Lets designers choose the best hardware and software primitives.
Use of Parallel Computing
● Atmosphere, Earth, Environment
● Physics - applied, nuclear, particle, condensed matter, high pressure,
fusion, photonics
● Bioscience, Biotechnology, Genetics
● Chemistry, Molecular Sciences
● Geology, Seismology
● Mechanical Engineering - from prosthetics to spacecraft
● Electrical Engineering, Circuit Design, Microelectronics
● Computer Science, Mathematics
Use of Parallel Computing
● Scientific Computing.
○ Numerically Intensive Simulations
● Database Operations and Information Systems
○ Web based services, Web search engines, Online transaction
processing.
○ Client and inventory database management, Data mining, MIS
○ Geographic information systems, Seismic data Processing
● Artificial intelligence, Machine Learning, Deep Learning
● Real time systems and Control Applications
○ Hardware and Robotic Control, Speech processing, Pattern
Recognition.
Parallel Computer Architectural Model
Parallel architectural models are classified into two categories:
○ Shared memory
○ Distributed memory
Flynn’s & Feng’s Classification Taxonomy
● SISD: Single Instruction, Single Data
● SIMD: Single Instruction, Multiple Data
● MISD: Multiple Instruction, Single Data
● MIMD: Multiple Instruction, Multiple Data
A Taxonomy of Parallel Processor Architectures
SISD (single-instruction single-data streams)
● SISD is a serial (non-parallel) computer system. This is the oldest
type of computer. It uses only a single instruction stream and a single
data stream.
● Single Instruction: Only one instruction stream is being
used by the CPU.
● Single Data: Only one data stream is being used as input.
SISD (cont.)
[Figure: block diagram of SISD]
SIMD architecture
● A type of parallel computer
● Single instruction: All processing units execute the same instruction at
any given clock cycle
● Multiple data: Each processing unit can operate on a different data
element
● Best suited for specialized problems characterized by a high degree of
regularity, such as graphics/image processing.
● Synchronous (lockstep) and deterministic execution
● Two varieties: Processor Arrays and Vector Pipelines
● Examples:
○ Processor Arrays: Connection Machine CM-2, MasPar MP-1 &
MP-2, ILLIAC IV
○ Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu
VP, NEC SX-2, Hitachi S820, ETA10
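The "single instruction, multiple data" idea can be modelled in software. The sketch below is not how SIMD hardware is programmed; `simd_execute` and `brighten` are hypothetical names chosen for illustration, and the list comprehension merely stands in for processing units that would all run the same instruction in lockstep.

```python
def simd_execute(instruction, lanes):
    """Apply ONE instruction to EVERY data element, as if in lockstep.

    `lanes` holds one data element per (simulated) processing unit.
    """
    return [instruction(element) for element in lanes]


def brighten(pixel):
    """A regular image-processing operation: add 50, saturating at 255."""
    return min(pixel + 50, 255)


pixels = [10, 100, 220, 250]
print(simd_execute(brighten, pixels))  # [60, 150, 255, 255]
```

Image brightening is a typical fit because every pixel receives exactly the same operation, which is the regularity the slide refers to.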
MIMD architecture
● Currently, the most common type of parallel computer. Most modern
computers fall into this category.
● Multiple Instruction: every processor may be executing a different
instruction stream
● Multiple Data: every processor may be working with a different data
stream
● Execution can be synchronous or asynchronous, deterministic or non-
deterministic
● Examples: most current supercomputers, networked parallel computer
clusters and "grids", multi-processor SMP computers, multi-core PCs.
MIMD (with shared memory)
MIMD (with distributed memory)
Multiple Instruction, Single Data (MISD)
● A single data stream is fed into multiple processing units.
● Each processing unit operates on the data independently via
independent instruction streams.
● Few actual examples of this class of parallel computer have ever
existed. One is the experimental Carnegie-Mellon C.mmp computer
(1971).
● Some conceivable uses might be:
○ multiple frequency filters operating on a single signal stream
○ multiple cryptography algorithms attempting to crack a single
coded message.
Instruction and Data Streams
● An alternate classification: parallel system
● SPMD: Single Program Multiple Data
○ A parallel program on a MIMD computer
○ Conditional code for different processors
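Instruction streams versus data streams:
● Single instruction stream, single data stream: SISD (Intel Pentium 4)
● Single instruction stream, multiple data streams: SIMD (SSE instructions of x86)
● Multiple instruction streams, single data stream: MISD (no examples today)
● Multiple instruction streams, multiple data streams: MIMD (Intel Xeon e5345)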
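SPMD can be sketched as one program started several times, with each copy given its own rank. In the sketch below, threads stand in for the processors of a MIMD machine; the names `spmd_program` and `run_spmd` and the strided data split are assumptions made for illustration.

```python
import threading


def spmd_program(rank, size, data, results):
    """The single program that every worker runs on its own share of the data."""
    my_share = data[rank::size]       # each rank takes a strided slice
    results[rank] = sum(my_share)
    if rank == 0:
        # Conditional code: only the processor with rank 0 executes this.
        results["coordinator"] = rank


def run_spmd(data, size=4):
    results = {}
    workers = [
        threading.Thread(target=spmd_program, args=(rank, size, data, results))
        for rank in range(size)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


print(run_spmd(list(range(8))))
```

Every worker executes the same code; behaviour differs only through the rank-dependent data slice and the conditional branch.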
Vector Processors
● Highly pipelined function units
● Stream data from/to vector registers (with multiple elements in a vector
register) to units
○ Data collected from memory into registers
○ Results stored from registers to memory
● Example: Vector extension to MIPS
○ 32 × 64-element registers (64-bit elements)
○ Vector instructions
■ lv, sv: load/store to/from vector registers
■ addv.d: add vectors of double
■ addvs.d: add scalar to each element of vector of double
● Significantly reduces instruction-fetch bandwidth
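To see why vector instructions cut instruction-fetch bandwidth, the vector operations listed above can be modelled in Python. This is a rough software model, not MIPS code: the Python functions `lv`, `sv`, `addv_d` and `addvs_d` only mimic the named instructions, and the instruction counter ignores loop-control overhead.

```python
VLEN = 64  # elements per vector register, as in the MIPS vector extension


def lv(memory, base):
    """Model of lv: load up to VLEN elements from memory into a vector register."""
    return memory[base:base + VLEN]


def sv(memory, base, vreg):
    """Model of sv: store a vector register back to memory."""
    memory[base:base + len(vreg)] = vreg


def addv_d(v1, v2):
    """Model of addv.d: element-wise add of two vectors of doubles."""
    return [a + b for a, b in zip(v1, v2)]


def addvs_d(v, scalar):
    """Model of addvs.d: add a scalar to each element of a vector."""
    return [a + scalar for a in v]


def vector_add(x, y):
    """Compute z = x + y in 64-element strips; count vector instructions issued."""
    z = [0.0] * len(x)
    issued = 0
    for base in range(0, len(x), VLEN):
        vx = lv(x, base)
        vy = lv(y, base)
        sv(z, base, addv_d(vx, vy))
        issued += 4  # lv, lv, addv.d, sv
    return z, issued


x = [float(i) for i in range(128)]
y = [1.0] * 128
z, issued = vector_add(x, y)
print(issued)  # 8 vector instructions cover all 128 element additions
```

A scalar loop would fetch a load, load, add and store for each of the 128 elements, so the model makes the reduction in fetched instructions visible.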
Vector Processors
● In computing, a vector processor or array processor is a central
processing unit (CPU) that implements an instruction
set containing instructions that operate on one-dimensional arrays of
data called vectors.
● This contrasts with scalar processors, whose instructions operate on
single data items.
● Vector processors can greatly improve performance on certain
workloads, notably numerical simulation and similar tasks.
● Vector machines appeared in the early 1970s and
dominated supercomputer design through the 1970s into the 1990s,
notably the various Cray platforms.
● The rapid fall in the price-to-performance ratio of
conventional microprocessor designs led to the vector supercomputer's
demise in the later 1990s.
Vector Processors
● An older and, as we shall see, more elegant interpretation of SIMD is
called a vector architecture, which has been closely identified with
computers designed by Seymour Cray starting in the 1970s.
● It is also a great match to problems with lots of data-level parallelism.
● Rather than having 64 ALUs perform 64 additions simultaneously, like
the old array processors, the vector architectures pipelined the ALU to
get good performance at lower cost.
Vector versus Scalar
Vector instructions have several important properties compared to
conventional instruction set architectures, which are called scalar
architectures in this context:
● A single vector instruction is equivalent to executing an entire loop.
The instruction fetch and decode bandwidth needed is dramatically
reduced.
● Hardware does not have to check for data hazards within a vector
instruction.
● Vector architectures and compilers have a reputation for making it
much easier than MIMD multiprocessors to write efficient
applications that contain data-level parallelism.
Vector versus Scalar
● Hardware need only check for data hazards between two vector
instructions once per vector operand, not once for every element
within the vectors.
● The cost of the latency to main memory is seen only once for the entire
vector, rather than once for each word of the vector.
● Control hazards that would normally arise from the loop branch are
non-existent.
● The savings in instruction bandwidth and hazard checking plus the
efficient use of memory bandwidth give vector architectures
advantages in power and energy versus scalar architectures.
Vector Processor Classification
● Memory to memory architecture
● Register to register architecture
Vector Processor Classification
Memory to memory architecture
• In memory to memory architecture, source operands are read directly
from the main memory, and intermediate and final results are written
directly back to it.
• For memory to memory vector instructions, the information of the base
address, the offset, the increment, and the vector length must be
specified in order to enable streams of data transfers between the main
memory and pipelines.
• Processors such as the TI-ASC, CDC STAR-100, and Cyber-205 have
vector instructions in memory to memory formats.
• The main points about memory to memory architecture are:
• There is no limitation on vector size.
• Speed is comparatively slow in this architecture.
Vector Processor Classification
Register to register architecture
• In register to register architecture, operands and results are retrieved
indirectly from the main memory through a large number of
vector registers or scalar registers.
• Processors such as the Cray-1 and the Fujitsu VP-200 use vector
instructions in register to register formats.
• The main points about register to register architecture are:
• Vector size is limited by the registers.
• Speed is very high compared to the memory to memory
architecture.
• The hardware cost is high in this architecture.
Symmetric Multiprocessors
● A stand alone computer with the following characteristics
○ Two or more similar processors of comparable capacity
○ Processors share same memory and I/O
○ Processors are connected by a bus or other internal connection
○ Memory access time is approximately the same for each processor
○ All processors share access to I/O
■ Either through same channels or different channels giving
paths to same devices
○ All processors can perform the same functions (hence symmetric)
○ System controlled by integrated operating system
■ providing interaction between processors
■ Interaction at job, task, file and data element levels
SMP Advantages
● Performance
○ If some work can be done in parallel
● Availability
○ Since all processors can perform the same functions, failure of a
single processor does not halt the system
● Incremental growth
○ User can enhance performance by adding additional processors
● Scaling
○ Vendors can offer range of products based on number of
processors
Block Diagram of Tightly Coupled Multiprocessor
Symmetric Multiprocessor Organization
Multithreading: Basics
● Thread
○ Instruction stream with state (registers and memory)
○ Register state is also called “thread context”
● Threads could be part of the same process (program) or from different
programs
○ Threads in the same program share the same address space (shared
memory model)
● Traditionally, the processor keeps track of the context of a single
thread
● Multitasking: When a new thread needs to be executed, the old thread’s
context in hardware is written back to memory and the new thread’s context
is loaded
Multithreading: Basics
● The most important measure of performance for a processor is the rate
at which it executes instructions. This can be expressed as
MIPS rate = f * IPC
● where f is the processor clock frequency, in MHz, and IPC
(instructions per cycle) is the average number of instructions executed
per cycle.
● Accordingly, designers have pursued the goal of increased performance
on two fronts:
○ increasing clock frequency and
○ increasing the number of instructions executed or, more properly,
the number of instructions that complete during a processor cycle.
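A small worked example makes the formula concrete. The clock frequencies and IPC values below are hypothetical, chosen only to show that either factor raises the MIPS rate; `mips_rate` is an illustrative helper name.

```python
def mips_rate(f_mhz, ipc):
    """MIPS rate = f * IPC, with f in MHz and IPC in instructions per cycle."""
    return f_mhz * ipc


base = mips_rate(2000, 1.5)          # hypothetical 2000 MHz processor, IPC 1.5
faster_clock = mips_rate(3000, 1.5)  # front 1: raise the clock frequency
higher_ipc = mips_rate(2000, 3.0)    # front 2: complete more instructions per cycle

print(base, faster_clock, higher_ipc)  # 3000.0 4500.0 6000.0
```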
Multithreading: Basics
● Designers have increased IPC by using an instruction pipeline and then
by using multiple parallel instruction pipelines in a superscalar
architecture.
● With pipelined and multiple-pipeline designs, the principal problem is
to maximize the utilization of each pipeline stage.
● An alternative approach, which allows for a high degree of instruction-
level parallelism without increasing circuit complexity or power
consumption, is called multithreading.
● In essence, the instruction stream is divided into several smaller
streams, known as threads, such that the threads can be executed in
parallel.
Multithreading: Basics
● Process: An instance of a program running on a computer. A process
embodies two key characteristics:
○ Resource ownership: A process includes a virtual address space to
hold the process image; the process image is the collection of
program, data, stack, and attributes that define the process.
○ Scheduling/execution: The execution of a process follows an
execution path (trace) through one or more programs.
● Process switch: An operation that switches the processor from one
process to another, by saving all the process control data, registers, and
other information for the first and replacing them with the process
information for the second
Multithreading: Basics
● Thread: A dispatchable unit of work within a process. It includes a
processor context (which includes the program counter and stack
pointer) and its own data area for a stack (to enable subroutine
branching).
● Thread switch: The act of switching processor control from one
thread to another within the same process
Hardware Multithreading
● General idea: Have multiple thread contexts in a single processor
○ When the hardware executes from those hardware contexts
determines the granularity of multithreading
● Why?
○ To tolerate latency (initial motivation)
■ Latency of memory operations, dependent instructions, branch
resolution
■ By utilizing processing resources more efficiently
○ To improve system throughput
■ By exploiting thread-level parallelism
■ By improving superscalar processor utilization
○ To reduce context switch penalty
Hardware Multithreading
● Benefit
+ Latency tolerance
+ Better hardware utilization (when?)
+ Reduced context switch penalty
● Cost
- Requires multiple thread contexts to be implemented in hardware
(area, power, latency cost)
- Usually reduced single-thread performance
- Resource sharing, contention
- Switching penalty (can be reduced with additional hardware)
Types of Multithreading
● Fine-grained (Interleaved multithreading)
○ Cycle by cycle
● Coarse-grained (Blocked multithreading)
○ Switch on event (e.g., cache miss)
○ Switch on quantum/timeout
● Simultaneous multithreading (SMT)
○ Instructions from multiple threads executed concurrently in the
same cycle
● Chip multiprocessing
○ In this case, multiple cores are implemented on a single chip and
each core handles separate threads
Fine-grained Multithreading
● Idea: Switch to another thread every cycle such that no two instructions
from the same thread are in the pipeline concurrently
● Improves pipeline utilization by taking advantage of multiple threads
● Alternative way of looking at it: Tolerates the control and data
dependency latencies by overlapping the latency with useful work from
other threads
Fine-grained Multithreading
● Advantages
+ No need for dependency checking between instructions
(only one instruction in pipeline from a single thread)
+ No need for branch prediction logic
+ Otherwise-bubble cycles used for executing useful instructions
from different threads
+ Improved system throughput, latency tolerance, utilization
● Disadvantages
- Extra hardware complexity: multiple hardware contexts, thread
selection logic
- Reduced single thread performance (one instruction fetched every
N cycles)
- Resource contention between threads in caches and memory
- Dependency checking logic between threads remains (load/store)
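Fine-grained multithreading can be pictured as a round-robin issue order. The following is a toy scheduler, not a pipeline model: the function name `fine_grained_schedule` and the instruction labels are hypothetical, and each loop iteration stands for one clock cycle.

```python
from collections import deque


def fine_grained_schedule(threads):
    """Return the per-cycle issue order, switching to another thread every cycle.

    `threads` maps a thread name to its list of instructions.
    Threads that have run out of instructions are skipped.
    """
    ready = deque(
        (name, deque(instrs)) for name, instrs in threads.items() if instrs
    )
    trace = []
    while ready:
        name, instrs = ready.popleft()
        trace.append((name, instrs.popleft()))  # issue one instruction this cycle
        if instrs:
            ready.append((name, instrs))        # back of the queue: wait a full round
    return trace


trace = fine_grained_schedule({
    "T0": ["a0", "a1", "a2"],
    "T1": ["b0"],
    "T2": ["c0", "c1"],
})
print(trace)
```

While several threads are ready, consecutive cycles come from different threads, which is why no dependency checking is needed within a thread. Once only T0 remains, it issues alone, and a real fine-grained design would need idle cycles or extra logic at that point.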
Coarse-grained Multithreading
● Idea: When a thread is stalled due to some event, switch to
a different hardware context
○ Switch-on-event multithreading
● Possible stall events
○ Cache misses
○ Synchronization events (e.g., load an empty location)
○ FP operations
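Switch-on-event multithreading can be sketched the same way. This is a simplified model under stated assumptions: the name `coarse_grained_schedule` and the operation labels are hypothetical, and the stall is assumed to have resolved by the time the thread is scheduled again, so miss latency is not modelled.

```python
from collections import deque


def coarse_grained_schedule(threads, stall_events=("cache_miss",)):
    """Run each thread until it issues a stalling operation, then switch context.

    `threads` maps a thread name to its list of operations.
    """
    ready = deque((name, deque(ops)) for name, ops in threads.items() if ops)
    trace = []
    while ready:
        name, ops = ready.popleft()
        while ops:
            op = ops.popleft()
            trace.append((name, op))
            if op in stall_events:
                break                     # stall event: switch hardware context
        if ops:
            ready.append((name, ops))     # resume this thread later
    return trace


trace = coarse_grained_schedule({
    "T0": ["add", "cache_miss", "mul"],
    "T1": ["sub", "or"],
})
print(trace)
```

Unlike the cycle-by-cycle interleaving of fine-grained multithreading, a thread here keeps the pipeline until an event forces the switch.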
Fine-grained vs. Coarse-grained MT
● Fine-grained advantages
+ Simpler to implement, can eliminate dependency checking, branch
prediction logic completely
+ Switching need not have any performance overhead (i.e. dead
cycles)
+ Coarse-grained, by contrast, requires a pipeline flush or a lot of hardware
to save pipeline state, which means higher performance overhead with deep
pipelines and large windows
● Fine-grained disadvantages
- Low single thread performance: each thread gets 1/Nth of the
bandwidth of the pipeline
Simultaneous multithreading (SMT)
● SMT is a variation on hardware multithreading that uses the resources of
a multiple-issue, dynamically scheduled pipelined processor to exploit
thread-level parallelism at the same time it exploits instruction-level
parallelism.
● The key insight that motivates SMT is that multiple-issue processors
often have more functional unit parallelism available than most single
threads can effectively use.
● Furthermore, with register renaming and dynamic scheduling, multiple
instructions from independent threads can be issued without regard to the
dependences among them; the resolution of the dependences can be
handled by the dynamic scheduling capability
Approaches to Explicit Multithreading
● Interleaved
○ Fine-grained
○ Processor deals with two or more thread contexts at a time
○ Switching thread at each clock cycle
○ If thread is blocked it is skipped
● Blocked
○ Coarse-grained
○ Thread executed until event causes delay
○ E.g. cache miss
○ Effective on in-order processor
○ Avoids pipeline stall
● Simultaneous (SMT)
○ Instructions simultaneously issued from multiple threads to execution
units of superscalar processor
● Chip multiprocessing
○ Processor is replicated on a single chip
○ Each processor handles separate threads
Scalar Processor Approaches
● Single-threaded scalar
○ Simple pipeline
○ No multithreading
● Interleaved multithreaded scalar
○ Easiest multithreading to implement
○ Switch threads at each clock cycle
○ Pipeline stages kept close to fully occupied
○ Hardware needs to switch thread context between cycles
● Blocked multithreaded scalar
○ Thread executed until a latency event occurs that would stop the
pipeline
○ Processor switches to another thread
Multiple Instruction Issue Processors (1)
● Superscalar
○ No multithreading
● Interleaved multithreading superscalar:
○ Each cycle, as many instructions as possible issued from single thread
○ Delays due to thread switches eliminated
○ Number of instructions issued in cycle limited by dependencies
● Blocked multithreaded superscalar
○ Instructions from one thread
○ Blocked multithreading used
Multiple Instruction Issue Diagram (1)
Multiple Instruction Issue Processors (2)
● Very long instruction word (VLIW)
○ E.g. IA-64
○ Multiple instructions in single word
○ Typically constructed by compiler
○ Operations that may be executed in parallel in same word
○ May pad with no-ops
● Interleaved multithreading VLIW
○ Similar efficiencies to interleaved multithreading on superscalar
architecture
● Blocked multithreaded VLIW
○ Similar efficiencies to blocked multithreading on superscalar
architecture
Multiple Instruction Issue Diagram (2)
Parallel, Simultaneous Execution of Multiple Threads
● Simultaneous multithreading
○ Issue multiple instructions at a time
○ One thread may fill all horizontal slots
○ Instructions from two or more threads may be issued
○ With enough threads, can issue maximum number of instructions on
each cycle
● Chip multiprocessor
○ Multiple processors
○ Each has two-issue superscalar processor
○ Each processor is assigned thread
■ Can issue up to two instructions per cycle per thread
Examples
● Some Pentium 4
○ Intel calls it hyperthreading
○ SMT with support for two threads
○ Single multithreaded processor, logically two processors
● IBM Power5
○ High-end PowerPC
○ Combines chip multiprocessing with SMT
○ Chip has two separate processors
○ Each supporting two threads concurrently using SMT
Intel® Hyper-Threading Technology
● Intel® Hyper-Threading Technology is a hardware innovation that allows
more than one thread to run on each core. More threads means more work
can be done in parallel.
● How does Hyper-Threading work?
● When Intel® Hyper-Threading Technology is active, the CPU exposes two
execution contexts per physical core. This means that one physical core
now works like two “logical cores” that can handle different software
threads.
● The ten-core Intel® Core™ i9-10900K processor, for example, has 20 threads
when Hyper-Threading is enabled.
● Two logical cores can work through tasks more efficiently than a traditional
single-threaded core. By taking advantage of idle time when the core would
formerly be waiting for other tasks to complete, Intel® Hyper-Threading
Technology improves CPU throughput (by up to 30% in server
applications).
Clusters
● Alternative to SMP
● High performance
● High availability
● Server applications
● A group of interconnected whole computers
● Working together as unified resource
● Illusion of being one machine
● Each computer called a node
Cluster Benefits
● Absolute scalability
● Incremental scalability
● High availability
● Superior price/performance
Cluster Configurations - Standby Server, No Shared Disk
Cluster Configurations - Shared Disk
Operating Systems Design Issues
● Failure Management
○ High availability
○ Fault tolerant
○ Failover
■ Switching applications & data from failed system to alternative
within cluster
○ Failback
■ Restoration of applications and data to original system
■ After problem is fixed
● Load balancing
○ Incremental scalability
○ Automatically include new computers in scheduling
○ Middleware needs to recognise that processes may switch between
machines
Parallelizing
● Single application executing in parallel on a number of machines in cluster
○ Compiler
■ Determines at compile time which parts can be executed in parallel
■ Split off for different computers
○ Application
■ Application written from scratch to be parallel
■ Message passing to move data between nodes
■ Hard to program
■ Best end result
○ Parametric computing
■ If a problem is repeated execution of algorithm on different sets of
data
■ e.g. simulation using different scenarios
■ Needs effective tools to organize and run
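Parametric computing with message passing can be sketched on a single machine. In this illustration, threads stand in for cluster nodes and `queue.Queue` objects stand in for the network; the names `node`, `simulate` and `run_scenarios`, and the compound-growth "simulation", are all hypothetical.

```python
import queue
import threading


def simulate(rate, years, principal=100.0):
    """The algorithm that is executed repeatedly on different data sets."""
    value = principal
    for _ in range(years):
        value *= 1 + rate
    return value


def node(inbox, outbox):
    """A worker node: receive scenario messages, send back result messages."""
    while True:
        message = inbox.get()
        if message is None:              # sentinel message: no more work
            break
        scenario_id, rate, years = message
        outbox.put((scenario_id, simulate(rate, years)))


def run_scenarios(scenarios, n_nodes=3):
    inbox, outbox = queue.Queue(), queue.Queue()
    nodes = [threading.Thread(target=node, args=(inbox, outbox))
             for _ in range(n_nodes)]
    for t in nodes:
        t.start()
    for scenario_id, (rate, years) in enumerate(scenarios):
        inbox.put((scenario_id, rate, years))   # message out to some node
    for _ in nodes:
        inbox.put(None)                         # one sentinel per node
    for t in nodes:
        t.join()
    results = {}
    while not outbox.empty():
        scenario_id, value = outbox.get()
        results[scenario_id] = value
    return results


print(run_scenarios([(0.0, 5), (1.0, 1), (1.0, 3)]))
```

The scenarios are independent, so no node needs another node's data; all communication is the scenario going out and the result coming back.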
Cluster Computer Architecture
Cluster Middleware
● Unified image to user
○ Single system image
● Single point of entry
● Single file hierarchy
● Single control point
● Single virtual networking
● Single memory space
● Single job management system
● Single user interface
● Single I/O space
● Single process space
● Checkpointing
● Process migration
Blade Servers
● Common implementation of cluster
● Server houses multiple server modules (blades) in single chassis
○ Save space
○ Improve system management
○ Chassis provides power supply
○ Each blade has processor, memory, disk
Cluster v. SMP
● Both provide multiprocessor support to high demand applications.
● Both available commercially
○ SMP for longer
● SMP:
○ Easier to manage and control
○ Closer to single processor systems
■ Scheduling is main difference
■ Less physical space
■ Lower power consumption
● Clustering:
○ Superior incremental & absolute scalability
○ Superior availability
■ Redundancy
Nonuniform Memory Access (NUMA)
● Alternative to SMP & clustering
● Uniform memory access
○ All processors have access to all parts of memory
■ Using load & store
○ Access time to all regions of memory is the same
○ Access time to memory for different processors same
○ As used by SMP
● Nonuniform memory access
○ All processors have access to all parts of memory
■ Using load & store
○ Access time of processor differs depending on region of memory
○ Different processors access different regions of memory at different speeds
● Cache coherent NUMA
○ Cache coherence is maintained among the caches of the various processors
○ Significantly different from SMP and clusters
Motivation
● SMP has practical limit to number of processors
○ Bus traffic limits to between 16 and 64 processors
● In clusters each node has own memory
○ Apps do not see large global memory
○ Coherence maintained by software not hardware
● NUMA retains SMP flavour while giving large scale
multiprocessing
○ e.g. Silicon Graphics Origin NUMA system, which supports up to 1024 MIPS R10000 processors
● Objective is to maintain transparent system wide memory
while permitting multiprocessor nodes, each with own bus or
internal interconnection system
CC-NUMA Operation
● Each processor has own L1 and L2 cache
● Each node has own main memory
● Nodes connected by some networking facility
● Each processor sees single addressable memory space
● Memory request order:
○ L1 cache (local to processor)
○ L2 cache (local to processor)
○ Main memory (local to node)
○ Remote memory
■ Delivered to requesting (local to processor) cache
● Automatic and transparent
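The memory request order above can be written as a lookup chain. This is a deliberately small model: plain dictionaries stand in for the caches and memories, `numa_read` is a hypothetical name, and filling both caches on every miss is a simplifying assumption.

```python
def numa_read(address, l1, l2, local_memory, remote_memory):
    """Return (value, level), searching L1, L2, local memory, then remote memory."""
    if address in l1:
        return l1[address], "L1"
    if address in l2:
        value, level = l2[address], "L2"
    elif address in local_memory:
        value, level = local_memory[address], "local memory"
    else:
        value, level = remote_memory[address], "remote memory"
    # Deliver the data to the requesting processor's caches (assumed fill policy).
    l2[address] = value
    l1[address] = value
    return value, level


l1, l2 = {}, {}
local_memory = {100: "local data"}
remote_memory = {798: "remote data"}

first = numa_read(798, l1, l2, local_memory, remote_memory)
second = numa_read(798, l1, l2, local_memory, remote_memory)
print(first, second)
```

The caller never names the level it wants, which is the "automatic and transparent" property: only the access time differs between the first and second read.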
Memory Access Sequence
● Each node maintains directory of location of portions of
memory and cache status
● e.g. node 2 processor 3 (P2-3) requests location 798 which is
in memory of node 1
○ P2-3 issues read request on snoopy bus of node 2
○ Directory on node 2 recognises location is on node 1
○ Node 2 directory requests node 1’s directory
○ Node 1 directory requests contents of 798
○ Node 1 memory puts data on (node 1 local) bus
○ Node 1 directory gets data from (node 1 local) bus
○ Data transferred to node 2’s directory
○ Node 2 directory puts data on (node 2 local) bus
○ Data picked up, put in P2-3’s cache and delivered to processor
Cache Coherence
● Node 1 directory keeps note that node 2 has copy of data
● If data modified in cache, this is broadcast to other nodes
● Local directories monitor and purge local cache if necessary
● Local directory monitors changes to local data in remote caches and marks
memory invalid until writeback
● Local directory forces writeback if memory location requested by another
processor
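A directory that tracks which nodes hold a copy, and purges stale copies on a write, might be sketched as follows. This is a heavily simplified model and not the protocol described above in full: the `Directory` class is hypothetical, and it writes through to memory immediately instead of marking memory invalid until writeback.

```python
class Directory:
    """Toy home-node directory: records which nodes cache each address."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.sharers = {}                 # address -> set of node ids with a copy

    def read(self, node_id, address, caches):
        # Keep note that this node now has a copy of the data.
        self.sharers.setdefault(address, set()).add(node_id)
        caches[node_id][address] = self.memory[address]
        return caches[node_id][address]

    def write(self, node_id, address, value, caches):
        # Purge the copy held by every other node.
        for other in self.sharers.get(address, set()) - {node_id}:
            caches[other].pop(address, None)
        self.sharers[address] = {node_id}
        caches[node_id][address] = value
        self.memory[address] = value      # write-through (simplifying assumption)


caches = {1: {}, 2: {}}
directory = Directory({798: 5})
directory.read(1, 798, caches)
directory.read(2, 798, caches)
directory.write(2, 798, 9, caches)        # node 1's copy is purged
print(caches[1], directory.read(1, 798, caches))
```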
NUMA Pros & Cons
● Effective performance at higher levels of parallelism than SMP
● No major software changes
● Performance can break down if there is too much access to remote memory
○ Can be avoided by:
■ L1 & L2 cache design reducing all memory access
(needs good temporal locality of software)
■ Good spatial locality of software
■ Virtual memory management moving pages to nodes that are using
them most
● Not transparent
○ Page allocation, process allocation and load balancing changes needed
● Availability?
Vector Computation
● Maths problems involving physical processes present different difficulties
for computation
○ Aerodynamics, seismology, meteorology
○ Continuous field simulation
● High precision
● Repeated floating point calculations on large arrays of numbers
● Supercomputers handle these types of problem
○ Hundreds of millions of flops
○ $10-15 million
○ Optimised for calculation rather than multitasking and I/O
○ Limited market
■ Research, government agencies, meteorology
● Array processor
○ Alternative to supercomputer
○ Configured as peripherals to mainframe & mini
○ Just run vector portion of problems
Vector Addition Example
Approaches
● General purpose computers rely on iteration to do vector calculations
● In example this needs six calculations
● Vector processing
○ Assume possible to operate on one-dimensional vector of data
○ All elements in a particular row can be calculated in parallel
● Parallel processing
○ Independent processors functioning in parallel
○ Use FORK N to start individual process at location N
○ JOIN N causes N independent processes to join and merge following
JOIN
■ O/S Co-ordinates JOINs
■ Execution is blocked until all N processes have reached JOIN
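The FORK/JOIN pattern maps naturally onto thread start and join. The sketch below uses element-wise vector addition with made-up data; `fork_join_add` is a hypothetical name, and Python threads stand in for the independent processes that FORK would start.

```python
import threading


def fork_join_add(a, b):
    """Compute c = a + b with one independent worker per element."""
    c = [None] * len(a)

    def worker(i):
        c[i] = a[i] + b[i]

    # FORK: start N independent workers, one per element.
    workers = [threading.Thread(target=worker, args=(i,)) for i in range(len(a))]
    for w in workers:
        w.start()
    # JOIN: execution is blocked here until all N workers have finished.
    for w in workers:
        w.join()
    return c


print(fork_join_add([1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60]))
```

One worker per element is far too fine-grained for real use; it is chosen here only to mirror the N-way FORK and JOIN described above.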
Processor Designs
● Pipelined ALU
○ Within operations
○ Across operations
● Parallel ALUs
● Parallel processors