CHRIST
Deemed to be University
Unit IV - PARALLELISM
● Parallel processing challenges – Flynn’s classification – SISD, MIMD,
SIMD, SPMD, and Vector Architectures - Hardware multithreading –
Multi-core processors and other Shared Memory Multiprocessors -
Introduction to Graphics Processing Units, Clusters, Warehouse Scale
Computers and other Message-Passing Multiprocessors.
What is Parallelism?
● Doing Things Simultaneously
○ Same thing or different things
○ Solving one larger problem
● Serial Computing
○ Problem is broken into stream of instructions that are executed
sequentially one after another on a single processor.
○ One instruction executes at a time.
● Parallel Computing
○ Problem divided into parts that can be solved concurrently.
○ Each part further broken into stream of instructions
○ Instructions from different parts execute simultaneously, as sketched in the example below.
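A minimal sketch of this contrast, assuming a POSIX system with pthreads (compile with cc -pthread); the array size, thread count, and names below are illustrative, not from the slides. The problem (summing an array) is divided into parts, and each part's instruction stream executes concurrently on its own thread:

```c
#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4                      /* number of concurrent parts (assumption) */

static double data[N];
static double partial[NTHREADS];

/* Each thread sums its own slice of the array. */
static void *sum_part(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NTHREADS);
    long hi = (id == NTHREADS - 1) ? N : lo + N / NTHREADS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++)
        data[i] = 1.0;

    pthread_t tid[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)      /* solve the parts concurrently */
        pthread_create(&tid[t], NULL, sum_part, (void *)t);

    double total = 0.0;
    for (long t = 0; t < NTHREADS; t++) {    /* wait for each part, then combine */
        pthread_join(tid[t], NULL);
        total += partial[t];
    }
    printf("sum = %f\n", total);
    return 0;
}
```

Removing the thread creation and calling sum_part for each slice in a loop gives the serial version: the same instructions, executed one after another on one processor.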
Serial computation
● Traditionally, serial computation used only a single computer having a
single Central Processing Unit (CPU).
● In serial computation, a large problem is broken into smaller parts, but
these sub-parts are executed one by one.
● Only a single instruction may execute at a time, so solving a large
problem takes a lot of time.
Serial computation (cont.)
[Figure: the problem is broken into a stream of instructions 1, 2, …, N−1, N that are executed one after another on a single CPU]
Parallel Computing
[Figure: the problem is divided into sub-problems; each sub-problem becomes its own stream of instructions and runs concurrently on a separate CPU]
Different forms of parallel computing
● Bit level
● Instruction level
● Data parallelism
● Task parallelism
Advantages of Parallel Computing
● Solve large problems easily.
● Save money and time.
● Data can be transmitted faster.
● Provide concurrency.
● Communicate in the proper way.
● Deliver good performance.
● Choose the best hardware and software primitives.
Use of Parallel Computing
● Atmosphere, Earth, Environment
● Physics - applied, nuclear, particle, condensed matter, high pressure,
fusion, photonics
● Bioscience, Biotechnology, Genetics
● Chemistry, Molecular Sciences
● Geology, Seismology
● Mechanical Engineering - from prosthetics to spacecraft
● Electrical Engineering, Circuit Design, Microelectronics
● Computer Science, Mathematics
Use of Parallel Computing
● Scientific Computing.
○ Numerically Intensive Simulations
● Database Operations and Information Systems
○ Web based services, Web search engines, Online transaction
processing.
○ Client and inventory database management, Data mining, MIS
○ Geographic information systems, Seismic data Processing
● Artificial intelligence, Machine Learning, Deep Learning
● Real time systems and Control Applications
○ Hardware and Robotic Control, Speech processing, Pattern
Recognition.
Parallel Computer Architectural Model
Parallel architectural models are classified into two categories:
○ Shared memory
○ Distributed memory
Flynn’s & Feng’s Classification Taxonomy
● SISD: Single Instruction, Single Data
● SIMD: Single Instruction, Multiple Data
● MISD: Multiple Instruction, Single Data
● MIMD: Multiple Instruction, Multiple Data
A Taxonomy of Parallel Processor Architectures
SISD (single-instruction single-data streams)
● SISD is a serial, non-parallel computer system. This has historically
been the most common type of computer. Such a system uses only a single
instruction stream and a single data stream.
● Single Instruction: Only one instruction stream is being
used by the CPU.
● Single Data: Only one data stream is being used as input.
SISD (cont.)
● Block Diagram of SISD :
SIMD architecture
SIMD architecture (cont.)
● A type of parallel computer
● Single instruction: All processing units execute the same instruction at
any given clock cycle
● Multiple data: Each processing unit can operate on a different data
element
● Best suited for specialized problems characterized by a high degree of
regularity, such as graphics/image processing.
● Synchronous (lockstep) and deterministic execution
● Two varieties: Processor Arrays and Vector Pipelines
● Examples:
○ Processor Arrays: Connection Machine CM-2, MasPar MP-1 &
MP-2, ILLIAC IV
○ Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu
VP, NEC SX-2, Hitachi S820, ETA10
MIMD architecture
MIMD architecture (cont.)
● Currently, the most common type of parallel computer. Most modern
computers fall into this category.
● Multiple Instruction: every processor may be executing a different
instruction stream
● Multiple Data: every processor may be working with a different data
stream
● Execution can be synchronous or asynchronous, deterministic or
non-deterministic
● Examples: most current supercomputers, networked parallel computer
clusters and "grids", multi-processor SMP computers, multi-core PCs.
MIMD (with shared memory)
MIMD (with distributed memory)
Multiple Instruction, Single Data (MISD)
● A single data stream is fed into multiple processing units.
● Each processing unit operates on the data independently via
independent instruction streams.
● Few actual examples of this class of parallel computer have ever
existed. One is the experimental Carnegie-Mellon C.mmp computer
(1971).
● Some conceivable uses might be:
○ multiple frequency filters operating on a single signal stream
○ multiple cryptography algorithms attempting to crack a single
coded message.
Instruction and Data Streams
● An alternate classification of parallel systems
● SPMD: Single Program Multiple Data
○ A parallel program on a MIMD computer
○ Conditional code for different processors (see the sketch below)
Instruction streams vs. data streams, with example processors:
● SISD (single instruction stream, single data stream): Intel Pentium 4
● SIMD (single instruction stream, multiple data streams): SSE instructions of x86
● MISD (multiple instruction streams, single data stream): no examples today
● MIMD (multiple instruction streams, multiple data streams): Intel Xeon e5345
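A hedged sketch of the SPMD idea above, assuming an MPI installation (built with mpicc and launched with mpirun; the MPI calls are standard, the program itself is illustrative): one program runs on every processor of a message-passing MIMD machine, and conditional code on the process rank makes different processors do different work.

```c
#include <mpi.h>
#include <stdio.h>

/* Single Program Multiple Data: the same executable runs on every process;
 * the rank decides which branch each process takes. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* conditional code for the "master" processor */
        printf("rank 0 of %d: distributing work\n", size);
    } else {
        /* conditional code for the "worker" processors */
        printf("rank %d of %d: computing my share\n", rank, size);
    }

    MPI_Finalize();
    return 0;
}
```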
Vector Processors
Vector Processors
● Highly pipelined function units
● Stream data from/to vector registers (with multiple elements in a vector
register) to units
○ Data collected from memory into registers
○ Results stored from registers to memory
● Example: Vector extension to MIPS
○ 32 × 64-element registers (64-bit elements)
○ Vector instructions
■ lv, sv: load/store to/from vector registers
■ addv.d: add vectors of double
■ addvs.d: add scalar to each element of vector of double
● Significantly reduces instruction-fetch bandwidth
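The loop below is a C sketch (not from the slides) of the kind of code the vector extension above targets; the register names and operand order in the comment are illustrative. On a scalar machine every element costs its own instruction fetches, whereas a vector machine can express a 64-element strip of the loop with roughly one lv per source vector, one addv.d, and one sv.

```c
#define N 64   /* one strip matches the 64-element vector registers (assumption) */

/* c[i] = a[i] + b[i]; a vectorizing compiler could map this strip to roughly:
 *   lv     v1, a        ; load 64 doubles of a
 *   lv     v2, b        ; load 64 doubles of b
 *   addv.d v3, v1, v2   ; add the two vectors of doubles
 *   sv     v3, c        ; store 64 doubles of c
 */
void vadd(const double a[N], const double b[N], double c[N]) {
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}
```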
Vector Processors
● In computing, a vector processor or array processor is a central
processing unit (CPU) that implements an instruction set containing
instructions that operate on one-dimensional arrays of data called
vectors, in contrast to scalar processors, whose instructions operate on
single data items.
● Vector processors can greatly improve performance on certain
workloads, notably numerical simulation and similar tasks.
● Vector machines appeared in the early 1970s and
dominated supercomputer design through the 1970s into the 1990s,
notably the various Cray platforms.
● The rapid fall in the price-to-performance ratio of
conventional microprocessor designs led to the vector supercomputer's
demise in the later 1990s.
Vector Processors
● An older and, as we shall see, more elegant interpretation of SIMD is
called a vector architecture, which has been closely identified with
computers designed by Seymour Cray starting in the 1970s.
● It is also a great match to problems with lots of data-level parallelism.
● Rather than having 64 ALUs perform 64 additions simultaneously, like
the old array processors, the vector architectures pipelined the ALU to
get good performance at lower cost.
Vector versus Scalar
Vector instructions have several important properties compared to
conventional instruction set architectures, which are called scalar
architectures in this context:
● A single vector instruction is equivalent to executing an entire loop.
The instruction fetch and decode bandwidth needed is dramatically
reduced.
● Hardware does not have to check for data hazards within a vector
instruction.
● Vector architectures and compilers have a reputation for making it
much easier to write efficient applications that contain data-level
parallelism than MIMD multiprocessors do.
Vector versus Scalar
● Hardware need only check for data hazards between two vector
instructions once per vector operand
● The cost of the latency to main memory is seen only once for the entire
vector, rather than once for each word of the vector.
● Control hazards that would normally arise from the loop branch are
non-existent.
● The savings in instruction bandwidth and hazard checking plus the
efficient use of memory bandwidth give vector architectures
advantages in power and energy versus scalar architectures.
Vector Processor
Vector Processor Classification
● Memory to memory architecture
● Register to register architecture
Vector Processor Classification
Memory to memory architecture
• In memory to memory architecture, source operands, intermediate results,
and final results are accessed (read and written) directly in the main memory.
• For memory to memory vector instructions, the information of the base
address, the offset, the increment, and the vector length must be
specified in order to enable streams of data transfers between the main
memory and pipelines.
• The processors like TI-ASC, CDC STAR-100, and Cyber-205 have
vector instructions in memory to memory formats.
• The main points about memory to memory architecture are:
• There is no limitation on vector size
• Speed is comparatively slow in this architecture
Vector Processor Classification
Register to register architecture
• In register to register architecture, operands and results are retrieved
indirectly from the main memory through the use of a large number of
vector registers or scalar registers.
• The processors like Cray-1 and the Fujitsu VP-200 use vector
instructions in register to register formats.
• The main points about register to register architecture are:
• Register to register architecture has limited size.
• Speed is very high as compared to the memory to memory
architecture.
• The hardware cost is high in this architecture.
Processors
Symmetric Multiprocessors
● A stand alone computer with the following characteristics
○ Two or more similar processors of comparable capacity
○ Processors share same memory and I/O
○ Processors are connected by a bus or other internal connection
○ Memory access time is approximately the same for each processor
○ All processors share access to I/O
■ Either through same channels or different channels giving
paths to same devices
○ All processors can perform the same functions (hence symmetric)
○ System controlled by integrated operating system
■ providing interaction between processors
■ Interaction at job, task, file and data element levels
Symmetric Multiprocessors
SMP Advantages
● Performance
○ If some work can be done in parallel
● Availability
○ Since all processors can perform the same functions, failure of a
single processor does not halt the system
● Incremental growth
○ User can enhance performance by adding additional processors
● Scaling
○ Vendors can offer range of products based on number of
processors
Block Diagram of Tightly Coupled
Multiprocessor
Symmetric Multiprocessor Organization
Multithreading: Basics
● Thread
○ Instruction stream with state (registers and memory)
○ Register state is also called “thread context”
● Threads could be part of the same process (program) or from different
programs
○ Threads in the same program share the same address space (shared
memory model)
● Traditionally, the processor keeps track of the context of a single
thread
● Multitasking: When a new thread needs to be executed, old thread’s
context in hardware written back to memory and new thread’s context
loaded
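A toy sketch, purely illustrative (not an OS or hardware interface), of what "thread context" means here: the register state that multitasking must write back to memory and reload when the processor switches threads.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical context: program counter, stack pointer, general registers. */
struct thread_context {
    uint64_t pc;
    uint64_t sp;
    uint64_t regs[16];
};

static struct thread_context hw_context;   /* the single context the CPU tracks */

/* Software multitasking: save the running thread's context to memory,
 * then load the next thread's context into the "hardware". */
static void context_switch(struct thread_context *old_thr,
                           struct thread_context *new_thr) {
    memcpy(old_thr, &hw_context, sizeof hw_context);   /* write back old context */
    memcpy(&hw_context, new_thr, sizeof hw_context);   /* load new context */
}

int main(void) {
    struct thread_context t1 = { .pc = 0x1000, .sp = 0x7fff0000 };
    struct thread_context t2 = { .pc = 0x2000, .sp = 0x7ffe0000 };
    hw_context = t1;                  /* t1 is currently running */
    context_switch(&t1, &t2);         /* multitasking switches to t2 */
    printf("now executing at pc=0x%llx\n", (unsigned long long)hw_context.pc);
    return 0;
}
```

Hardware multithreading (later slides) avoids much of this copying by keeping more than one such context in the processor itself.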
Multithreading: Basics
● The most important measure of performance for a processor is the rate
at which it executes instructions. This can be expressed as
MIPS rate = f * IPC
● where f is the processor clock frequency, in MHz, and IPC
(instructions per cycle) is the average number of instructions executed
per cycle (a worked example follows this list).
● Accordingly, designers have pursued the goal of increased performance
on two fronts:
○ increasing clock frequency and
○ increasing the number of instructions executed or, more properly,
the number of instructions that complete during a processor cycle.
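As a quick worked example of the formula above (an illustration, not taken from the slides): a processor clocked at f = 2000 MHz (2 GHz) that completes an average of IPC = 2 instructions per cycle achieves MIPS rate = 2000 × 2 = 4000 MIPS, i.e. four billion instructions per second. Doubling either the clock frequency or the IPC doubles the rate, which is why designers pursue both fronts.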
Multithreading: Basics
● Designers have increased IPC by using an instruction pipeline and then
by using multiple parallel instruction pipelines in a superscalar
architecture.
● With pipelined and multiple-pipeline designs, the principal problem is
to maximize the utilization of each pipeline stage.
● An alternative approach, which allows for a high degree of instruction-
level parallelism without increasing circuit complexity or power
consumption, is called multithreading.
● In essence, the instruction stream is divided into several smaller
streams, known as threads, such that the threads can be executed in
parallel.
Multithreading: Basics
● Process: An instance of a program running on a computer. A process
embodies two key characteristics:
○ Resource ownership: A process includes a virtual address space to
hold the process image; the process image is the collection of
program, data, stack, and attributes that define the process.
○ Scheduling/execution: The execution of a process follows an
execution path (trace) through one or more programs.
● Process switch: An operation that switches the processor from one
process to another, by saving all the process control data, registers, and
other information for the first and replacing them with the process
information for the second
Multithreading: Basics
● Thread: A dispatchable unit of work within a process. It includes a
processor context (which includes the program counter and stack
pointer) and its own data area for a stack (to enable subroutine
branching).
● Thread switch: The act of switching processor control from one
thread to another within the same process
Hardware Multithreading
● General idea: Have multiple thread contexts in a single processor
○ When the hardware executes from those hardware contexts
determines the granularity of multithreading
● Why?
○ To tolerate latency (initial motivation)
■ Latency of memory operations, dependent instructions, branch
resolution
■ By utilizing processing resources more efficiently
○ To improve system throughput
■ By exploiting thread-level parallelism
■ By improving superscalar processor utilization
○ To reduce context switch penalty
Hardware Multithreading
● Benefit
+ Latency tolerance
+ Better hardware utilization (when?)
+ Reduced context switch penalty
● Cost
- Requires multiple thread contexts to be implemented in hardware
(area, power, latency cost)
- Usually reduced single-thread performance
- Resource sharing, contention
- Switching penalty (can be reduced with additional hardware)
Types of Multithreading
● Fine-grained (Interleaved multithreading)
○ Cycle by cycle
● Coarse-grained (Blocked multithreading)
○ Switch on event (e.g., cache miss)
○ Switch on quantum/timeout
● Simultaneous multithreading (SMT)
○ Instructions from multiple threads executed concurrently in the
same cycle
● Chip multiprocessing
○ In this case, multiple cores are implemented on a single chip and
each core handles separate threads
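A small simulation sketch (toy C code, not a pipeline model; the miss pattern is invented) contrasting the first two policies above: fine-grained multithreading issues from a different thread every cycle, while coarse-grained multithreading stays on one thread until an event such as a cache miss.

```c
#include <stdbool.h>
#include <stdio.h>

#define NTHREADS 2
#define CYCLES   8

/* Toy model: thread 0 misses the cache on cycle 3, nothing else stalls. */
static bool cache_miss(int thread, int cycle) {
    return thread == 0 && cycle == 3;
}

int main(void) {
    /* Fine-grained (interleaved): issue from a different thread each cycle. */
    printf("fine-grained:   ");
    for (int c = 0; c < CYCLES; c++)
        printf("T%d ", c % NTHREADS);
    printf("\n");

    /* Coarse-grained (blocked): stay on one thread until it stalls. */
    printf("coarse-grained: ");
    int cur = 0;
    for (int c = 0; c < CYCLES; c++) {
        printf("T%d ", cur);
        if (cache_miss(cur, c))
            cur = (cur + 1) % NTHREADS;   /* switch on event */
    }
    printf("\n");
    return 0;
}
```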
Fine-grained Multithreading
● Idea: Switch to another thread every cycle such that no two instructions
from the same thread are in the pipeline concurrently
● Improves pipeline utilization by taking advantage of multiple threads
● Alternative way of looking at it: Tolerates the control and data
dependency latencies by overlapping the latency with useful work from
other threads
Fine-grained Multithreading
● Advantages
+ No need for dependency checking between instructions
(only one instruction in pipeline from a single thread)
+ No need for branch prediction logic
+ Otherwise-bubble cycles used for executing useful instructions
from different threads
+ Improved system throughput, latency tolerance, utilization
● Disadvantages
- Extra hardware complexity: multiple hardware contexts, thread
selection logic
- Reduced single thread performance (one instruction fetched every
N cycles)
- Resource contention between threads in caches and memory
- Dependency checking logic between threads remains (load/store)
Coarse-grained Multithreading
● Idea: When a thread is stalled due to some event, switch to
a different hardware context
○ Switch-on-event multithreading
● Possible stall events
○ Cache misses
○ Synchronization events (e.g., load an empty location)
○ FP operations
Fine-grained vs. Coarse-grained MT
● Fine-grained advantages
+ Simpler to implement, can eliminate dependency checking, branch
prediction logic completely
+ Switching need not have any performance overhead (i.e. dead
cycles)
+ Coarse-grained requires a pipeline flush or a lot of hardware
to save pipeline state
■ Higher performance overhead with deep pipelines and
large windows
● Disadvantages
- Low single thread performance: each thread gets 1/Nth of the
bandwidth of the pipeline
Simultaneous multithreading (SMT)
● SMT is a variation on hardware multithreading that uses the resources of
a multiple-issue, dynamically scheduled pipelined processor to exploit
thread-level parallelism at the same time it exploits instruction level
parallelism.
● The key insight that motivates SMT is that multiple-issue processors
often have more functional unit parallelism available than most single
threads can effectively use.
● Furthermore, with register renaming and dynamic scheduling, multiple
instructions from independent threads can be issued without regard to the
dependences among them; the resolution of the dependences can be
handled by the dynamic scheduling capability
Approaches to Explicit Multithreading
● Interleaved
○ Fine-grained
○ Processor deals with two or more thread contexts at a time
○ Switching thread at each clock cycle
○ If thread is blocked it is skipped
● Blocked
○ Coarse-grained
○ Thread executed until event causes delay
○ E.g., cache miss
○ Effective on in-order processor
○ Avoids pipeline stall
● Simultaneous (SMT)
○ Instructions simultaneously issued from multiple threads to execution
units of superscalar processor
● Chip multiprocessing
○ Processor is replicated on a single chip
○ Each processor handles separate threads
Scalar Processor Approaches
● Single-threaded scalar
○ Simple pipeline
○ No multithreading
● Interleaved multithreaded scalar
○ Easiest multithreading to implement
○ Switch threads at each clock cycle
○ Pipeline stages kept close to fully occupied
○ Hardware needs to switch thread context between cycles
● Blocked multithreaded scalar
○ Thread executed until latency event occurs
○ Would stop pipeline
○ Processor switches to another thread
Scalar Diagrams
Multiple Instruction Issue Processors (1)
● Superscalar
○ No multithreading
● Interleaved multithreading superscalar:
○ Each cycle, as many instructions as possible issued from single thread
○ Delays due to thread switches eliminated
○ Number of instructions issued in cycle limited by dependencies
● Blocked multithreaded superscalar
○ Instructions from one thread
○ Blocked multithreading used
Multiple Instruction Issue Diagram (1)
Multiple Instruction Issue Processors (2)
● Very long instruction word (VLIW)
○ E.g. IA-64
○ Multiple instructions in single word
○ Typically constructed by compiler
○ Operations that may be executed in parallel in same word
○ May pad with no-ops
● Interleaved multithreading VLIW
○ Similar efficiencies to interleaved multithreading on superscalar
architecture
● Blocked multithreaded VLIW
○ Similar efficiencies to blocked multithreading on superscalar
architecture
Multiple Instruction Issue Diagram (2)
Parallel, Simultaneous Execution of Multiple
Threads
● Simultaneous multithreading
○ Issue multiple instructions at a time
○ One thread may fill all horizontal slots
○ Instructions from two or more threads may be issued
○ With enough threads, can issue maximum number of instructions on
each cycle
● Chip multiprocessor
○ Multiple processors
○ Each has two-issue superscalar processor
○ Each processor is assigned thread
■ Can issue up to two instructions per cycle per thread
Parallel Diagram
Examples
● Some Pentium 4
○ Intel calls it hyperthreading
○ SMT with support for two threads
○ Single multithreaded processor, logically two processors
● IBM Power5
○ High-end PowerPC
○ Combines chip multiprocessing with SMT
○ Chip has two separate processors
○ Each supporting two threads concurrently using SMT
Intel® Hyper-Threading Technology
● Intel® Hyper-Threading Technology is a hardware innovation that allows
more than one thread to run on each core. More threads means more work
can be done in parallel.
● How does Hyper-Threading work?
● When Intel® Hyper-Threading Technology is active, the CPU exposes two
execution contexts per physical core. This means that one physical core
now works like two “logical cores” that can handle different software
threads.
● The ten-core Intel® Core™ i9-10900K processor, for example, has 20 threads
when Hyper-Threading is enabled.
● Two logical cores can work through tasks more efficiently than a traditional
single-threaded core. By taking advantage of idle time when the core would
formerly be waiting for other tasks to complete, Intel® Hyper-Threading
Technology improves CPU throughput (by up to 30% in server
applications).
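One hedged way to observe logical versus physical cores from software, assuming a Linux/glibc system where sysconf(_SC_NPROCESSORS_ONLN) is available (a common extension rather than a guarantee of the C standard): the value reported is the number of logical processors, which typically doubles when Hyper-Threading exposes two contexts per physical core.

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Number of logical processors currently online; with Hyper-Threading
     * enabled this is usually 2x the number of physical cores. */
    long logical = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical processors: %ld\n", logical);
    return 0;
}
```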
Clusters
● Alternative to SMP
● High performance
● High availability
● Server applications
● A group of interconnected whole computers
● Working together as unified resource
● Illusion of being one machine
● Each computer called a node
Clusters Benefits
● Absolute scalability
● Incremental scalability
● High availability
● Superior price/performance
Cluster Configurations - Standby Server, No
Shared Disk
Cluster Configurations - Shared Disk
Clustering Methods: Benefits and Limitations
Operating Systems Design Issues
● Failure Management
○ High availability
○ Fault tolerant
○ Failover
■ Switching applications & data from failed system to alternative
within cluster
○ Failback
■ Restoration of applications and data to original system
■ After problem is fixed
● Load balancing
○ Incremental scalability
○ Automatically include new computers in scheduling
○ Middleware needs to recognise that processes may switch between
machines
Parallelizing
● Single application executing in parallel on a number of machines in cluster
○ Compiler
■ Determines at compile time which parts can be executed in parallel
■ Split off for different computers
○ Application
■ Application written from scratch to be parallel
■ Message passing to move data between nodes (see the sketch after this list)
■ Hard to program
■ Best end result
○ Parametric computing
■ If a problem is repeated execution of algorithm on different sets of
data
■ e.g. simulation using different scenarios
■ Needs effective tools to organize and run
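As referenced in the message-passing bullet above, a minimal sketch of moving data between cluster nodes, assuming MPI (the calls are standard MPI; the buffer contents and tag are illustrative): rank 0 sends an array to rank 1.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[4] = {1.0, 2.0, 3.0, 4.0};
    if (rank == 0) {
        /* move data to the process running on another node */
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %g %g %g %g\n", buf[0], buf[1], buf[2], buf[3]);
    }

    MPI_Finalize();
    return 0;
}
```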
Cluster Computer Architecture
Cluster Middleware
● Unified image to user
○ Single system image
● Single point of entry
● Single file hierarchy
● Single control point
● Single virtual networking
● Single memory space
● Single job management system
● Single user interface
● Single I/O space
● Single process space
● Checkpointing
● Process migration
Blade Servers
● Common implementation of cluster
● Server houses multiple server modules (blades) in single chassis
○ Save space
○ Improve system management
○ Chassis provides power supply
○ Each blade has processor, memory, disk
Cluster v. SMP
● Both provide multiprocessor support to high demand applications.
● Both available commercially
○ SMP for longer
● SMP:
○ Easier to manage and control
○ Closer to single processor systems
■ Scheduling is main difference
■ Less physical space
■ Lower power consumption
● Clustering:
○ Superior incremental & absolute scalability
○ Superior availability
■ Redundancy
Nonuniform Memory Access (NUMA)
● Alternative to SMP & clustering
● Uniform memory access
○ All processors have access to all parts of memory
■ Using load & store
○ Access time to all regions of memory is the same
○ Access time to memory for different processors same
○ As used by SMP
● Nonuniform memory access
○ All processors have access to all parts of memory
■ Using load & store
○ Access time of processor differs depending on region of memory
○ Different processors access different regions of memory at different speeds
● Cache coherent NUMA
○ Cache coherence is maintained among the caches of the various processors
○ Significantly different from SMP and clusters
Motivation
● SMP has practical limit to number of processors
○ Bus traffic limits to between 16 and 64 processors
● In clusters each node has own memory
○ Apps do not see large global memory
○ Coherence maintained by software not hardware
● NUMA retains SMP flavour while giving large scale
multiprocessing
○ e.g. Silicon Graphics Origin NUMA 1024 MIPS R10000 processors
● Objective is to maintain transparent system wide memory
while permitting multiprocessor nodes, each with own bus or
internal interconnection system
CC-NUMA Organization
CC-NUMA Operation
● Each processor has own L1 and L2 cache
● Each node has own main memory
● Nodes connected by some networking facility
● Each processor sees single addressable memory space
● Memory request order:
○ L1 cache (local to processor)
○ L2 cache (local to processor)
○ Main memory (local to node)
○ Remote memory
■ Delivered to requesting (local to processor) cache
● Automatic and transparent
Memory Access Sequence
● Each node maintains directory of location of portions of
memory and cache status
● e.g. node 2 processor 3 (P2-3) requests location 798 which is
in memory of node 1
○ P2-3 issues read request on snoopy bus of node 2
○ Directory on node 2 recognises location is on node 1
○ Node 2 directory requests node 1’s directory
○ Node 1 directory requests contents of 798
○ Node 1 memory puts data on (node 1 local) bus
○ Node 1 directory gets data from (node 1 local) bus
○ Data transferred to node 2’s directory
○ Node 2 directory puts data on (node 2 local) bus
○ Data picked up, put in P2-3’s cache and delivered to processor
Cache Coherence
● Node 1 directory keeps note that node 2 has copy of data
● If data modified in cache, this is broadcast to other nodes
● Local directories monitor and purge local cache if necessary
● Local directory monitors changes to local data in remote caches and marks
memory invalid until writeback
● Local directory forces writeback if memory location requested by another
processor
NUMA Pros & Cons
● Effective performance at higher levels of parallelism than SMP
● No major software changes
● Performance can break down if there is too much access to remote memory
○ Can be avoided by:
■ L1 & L2 cache design reducing all memory access
● Need good temporal locality of software
■ Good spatial locality of software
■ Virtual memory management moving pages to nodes that are using
them most
● Not transparent
○ Page allocation, process allocation and load balancing changes needed
● Availability?
Vector Computation
● Maths problems involving physical processes present different difficulties
for computation
○ Aerodynamics, seismology, meteorology
○ Continuous field simulation
● High precision
● Repeated floating point calculations on large arrays of numbers
● Supercomputers handle these types of problem
○ Hundreds of millions of flops
○ $10-15 million
○ Optimised for calculation rather than multitasking and I/O
○ Limited market
■ Research, government agencies, meteorology
● Array processor
○ Alternative to supercomputer
○ Configured as peripherals to mainframe & mini
○ Just run vector portion of problems
Vector Addition Example
Approaches
● General purpose computers rely on iteration to do vector calculations
● In the example this needs six calculations
● Vector processing
○ Assume possible to operate on one-dimensional vector of data
○ All elements in a particular row can be calculated in parallel
● Parallel processing
○ Independent processors functioning in parallel
○ Use FORK N to start individual process at location N
○ JOIN N causes N independent processes to join and merge following
JOIN
■ O/S Co-ordinates JOINs
■ Execution is blocked until all N processes have reached JOIN
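A hedged sketch of the FORK/JOIN idea above, using OpenMP as a stand-in (the slides' FORK N / JOIN N is a generic construct; compile with e.g. gcc -fopenmp, and without that flag the pragma is ignored and the loop simply runs serially): the runtime forks a team of threads for the loop, and the implicit barrier at the end of the parallel region plays the role of JOIN.

```c
#include <stdio.h>

#define N 6

int main(void) {
    double a[N] = {1, 2, 3, 4, 5, 6}, b[N] = {6, 5, 4, 3, 2, 1}, c[N];

    /* "FORK": a team of threads is started and the iterations are divided
     * among them; the implicit barrier at the end acts as the "JOIN". */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    for (int i = 0; i < N; i++)
        printf("%g ", c[i]);
    printf("\n");
    return 0;
}
```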
Processor Designs
● Pipelined ALU
○ Within operations
○ Across operations
● Parallel ALUs
● Parallel processors
More Related Content

Similar to COA-Unit4-PPT.pptx

High Performance Computer Architecture
High Performance Computer ArchitectureHigh Performance Computer Architecture
High Performance Computer ArchitectureSubhasis Dash
 
Parallel architecture-programming
Parallel architecture-programmingParallel architecture-programming
Parallel architecture-programmingShaveta Banda
 
Parallel architecture &programming
Parallel architecture &programmingParallel architecture &programming
Parallel architecture &programmingIsmail El Gayar
 
Software Design Practices for Large-Scale Automation
Software Design Practices for Large-Scale AutomationSoftware Design Practices for Large-Scale Automation
Software Design Practices for Large-Scale AutomationHao Xu
 
CSA unit5.pptx
CSA unit5.pptxCSA unit5.pptx
CSA unit5.pptxAbcvDef
 
Flynn's Classification .pptx
Flynn's Classification .pptxFlynn's Classification .pptx
Flynn's Classification .pptxNayan Gupta
 
Netflix machine learning
Netflix machine learningNetflix machine learning
Netflix machine learningAmer Ather
 
Lecture 1 introduction to parallel and distributed computing
Lecture 1   introduction to parallel and distributed computingLecture 1   introduction to parallel and distributed computing
Lecture 1 introduction to parallel and distributed computingVajira Thambawita
 
Cluster Tutorial
Cluster TutorialCluster Tutorial
Cluster Tutorialcybercbm
 
Chap 2 classification of parralel architecture and introduction to parllel p...
Chap 2  classification of parralel architecture and introduction to parllel p...Chap 2  classification of parralel architecture and introduction to parllel p...
Chap 2 classification of parralel architecture and introduction to parllel p...Malobe Lottin Cyrille Marcel
 
intro, definitions, basic laws+.pptx
intro, definitions, basic laws+.pptxintro, definitions, basic laws+.pptx
intro, definitions, basic laws+.pptxssuser413a98
 
Data-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsData-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsDilum Bandara
 
Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism) Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism) A B Shinde
 

Similar to COA-Unit4-PPT.pptx (20)

aca mod1.pptx
aca mod1.pptxaca mod1.pptx
aca mod1.pptx
 
High Performance Computer Architecture
High Performance Computer ArchitectureHigh Performance Computer Architecture
High Performance Computer Architecture
 
Parallel architecture-programming
Parallel architecture-programmingParallel architecture-programming
Parallel architecture-programming
 
Parallel architecture &programming
Parallel architecture &programmingParallel architecture &programming
Parallel architecture &programming
 
Software Design Practices for Large-Scale Automation
Software Design Practices for Large-Scale AutomationSoftware Design Practices for Large-Scale Automation
Software Design Practices for Large-Scale Automation
 
CSA unit5.pptx
CSA unit5.pptxCSA unit5.pptx
CSA unit5.pptx
 
Flynn's Classification .pptx
Flynn's Classification .pptxFlynn's Classification .pptx
Flynn's Classification .pptx
 
CA UNIT IV.pptx
CA UNIT IV.pptxCA UNIT IV.pptx
CA UNIT IV.pptx
 
Netflix machine learning
Netflix machine learningNetflix machine learning
Netflix machine learning
 
Pdc lecture1
Pdc lecture1Pdc lecture1
Pdc lecture1
 
Lecture 1 introduction to parallel and distributed computing
Lecture 1   introduction to parallel and distributed computingLecture 1   introduction to parallel and distributed computing
Lecture 1 introduction to parallel and distributed computing
 
Cluster Tutorial
Cluster TutorialCluster Tutorial
Cluster Tutorial
 
Chap 2 classification of parralel architecture and introduction to parllel p...
Chap 2  classification of parralel architecture and introduction to parllel p...Chap 2  classification of parralel architecture and introduction to parllel p...
Chap 2 classification of parralel architecture and introduction to parllel p...
 
Grid computing
Grid computingGrid computing
Grid computing
 
Current Trends in HPC
Current Trends in HPCCurrent Trends in HPC
Current Trends in HPC
 
Multicore architectures
Multicore architecturesMulticore architectures
Multicore architectures
 
Flynn taxonomies
Flynn taxonomiesFlynn taxonomies
Flynn taxonomies
 
intro, definitions, basic laws+.pptx
intro, definitions, basic laws+.pptxintro, definitions, basic laws+.pptx
intro, definitions, basic laws+.pptx
 
Data-Level Parallelism in Microprocessors
Data-Level Parallelism in MicroprocessorsData-Level Parallelism in Microprocessors
Data-Level Parallelism in Microprocessors
 
Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism) Pipelining and ILP (Instruction Level Parallelism)
Pipelining and ILP (Instruction Level Parallelism)
 

More from Ruhul Amin

COA-Unit5-ppt2.pptx
COA-Unit5-ppt2.pptxCOA-Unit5-ppt2.pptx
COA-Unit5-ppt2.pptxRuhul Amin
 
coa-Unit5-ppt1 (1).pptx
coa-Unit5-ppt1 (1).pptxcoa-Unit5-ppt1 (1).pptx
coa-Unit5-ppt1 (1).pptxRuhul Amin
 
COA-Unit3-ppt.pptx
COA-Unit3-ppt.pptxCOA-Unit3-ppt.pptx
COA-Unit3-ppt.pptxRuhul Amin
 
COA-Unit2-flynns.pptx
COA-Unit2-flynns.pptxCOA-Unit2-flynns.pptx
COA-Unit2-flynns.pptxRuhul Amin
 
COA-unit-2-Arithmetic.ppt
COA-unit-2-Arithmetic.pptCOA-unit-2-Arithmetic.ppt
COA-unit-2-Arithmetic.pptRuhul Amin
 
risc_and_cisc.ppt
risc_and_cisc.pptrisc_and_cisc.ppt
risc_and_cisc.pptRuhul Amin
 
COA-Unit-1-Basics.ppt
COA-Unit-1-Basics.pptCOA-Unit-1-Basics.ppt
COA-Unit-1-Basics.pptRuhul Amin
 

More from Ruhul Amin (7)

COA-Unit5-ppt2.pptx
COA-Unit5-ppt2.pptxCOA-Unit5-ppt2.pptx
COA-Unit5-ppt2.pptx
 
coa-Unit5-ppt1 (1).pptx
coa-Unit5-ppt1 (1).pptxcoa-Unit5-ppt1 (1).pptx
coa-Unit5-ppt1 (1).pptx
 
COA-Unit3-ppt.pptx
COA-Unit3-ppt.pptxCOA-Unit3-ppt.pptx
COA-Unit3-ppt.pptx
 
COA-Unit2-flynns.pptx
COA-Unit2-flynns.pptxCOA-Unit2-flynns.pptx
COA-Unit2-flynns.pptx
 
COA-unit-2-Arithmetic.ppt
COA-unit-2-Arithmetic.pptCOA-unit-2-Arithmetic.ppt
COA-unit-2-Arithmetic.ppt
 
risc_and_cisc.ppt
risc_and_cisc.pptrisc_and_cisc.ppt
risc_and_cisc.ppt
 
COA-Unit-1-Basics.ppt
COA-Unit-1-Basics.pptCOA-Unit-1-Basics.ppt
COA-Unit-1-Basics.ppt
 

Recently uploaded

Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdfKamal Acharya
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Arindam Chakraborty, Ph.D., P.E. (CA, TX)
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersMairaAshraf6
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXssuser89054b
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTbhaskargani46
 
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...Amil baba
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARKOUSTAV SARKAR
 
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLEGEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLEselvakumar948
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxchumtiyababu
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startQuintin Balsdon
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxJuliansyahHarahap1
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdfKamal Acharya
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayEpec Engineered Technologies
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VDineshKumar4165
 
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxS1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxSCMS School of Architecture
 
Wadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptxWadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptxNadaHaitham1
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.Kamal Acharya
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . pptDineshKumar4165
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdfKamal Acharya
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"mphochane1998
 

Recently uploaded (20)

Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdf
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to Computers
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
 
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLEGEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxS1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
 
Wadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptxWadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptx
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdf
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 

COA-Unit4-PPT.pptx

  • 1. CHRIST Deemed to be University Unit IV - PARALLELISM ● Parallel processing challenges – Flynn’s classification – SISD, MIMD, SIMD, SPMD, and Vector Architectures - Hardware multithreading – Multi-core processors and other Shared Memory Multiprocessors - Introduction to Graphics Processing Units, Clusters, Warehouse Scale Computers and other Message-Passing Multiprocessors. Excellence and Service
  • 2. CHRIST Excellence and Service Deemed to be University What is Parallelism? ● Doing Things Simultaneously ○ Same thing or different things ○ Solving one larger problem ● Serial Computing ○ Problem is broken into stream of instructions that are executed sequentially one after another on a single processor. ○ One instruction executes at a time. ● Parallel Computing ○ Problem divided into parts that can be solved concurrently. ○ Each part further broken into stream of instructions ○ Instructions from different parts executes simultaneously.
  • 3. CHRIST Deemed to be University Serial computation ● Traditionally in serial computation, used only a single computer having a single Central Processing Unit (CPU). ● In the serial computation, a large problem is broken into smaller parts but these sub part are executed one by one. ● Only a single instruction may execute at a time. So it takes lot of time for solving a large problem. Excellence and Service
  • 4. CHRIST Deemed to be University Serial computation Cont…….. Problem CPU N Excellence and Service 2 1 N-1 ……. Instructions
  • 5. CHRIST Deemed to be University Parallel Computing Sub problems Instructions Problem CPU CPU CPU CPU Excellence and Service
  • 6. CHRIST Deemed to be University Different forms of parallel computing ● Bit level ● Instruction level ● Data parallelism ● Task parallelism Excellence and Service
  • 7. CHRIST Deemed to be University Advantages of Parallel Computing ● Solve large problem easily. ● Save money and time. ● Data are transmitted fast. ● Provide concurrency. ● Communicate in the proper way. ● Good performance. ● Choose best hardware and software primitives. Excellence and Service
  • 8. CHRIST Deemed to be University Use of Parallel Computing ● Atmosphere, Earth, Environment ● Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics ● Bioscience, Biotechnology, Genetics ● Chemistry, Molecular Sciences ● Geology, Seismology ● Mechanical Engineering - from prosthetics to spacecraft ● Electrical Engineering, Circuit Design, Microelectronics ● Computer Science, Mathematics Excellence and Service
  • 9. CHRIST Excellence and Service Deemed to be University Use of Parallel Computing ● Scientific Computing. ○ Numerically Intensive Simulations ● Database Operations and Information Systems ○ Web based services, Web search engines, Online transaction processing. ○ Client and inventory database management, Data mining, MIS ○ Geographic information systems, Seismic data Processing ● Artificial intelligence, Machine Learning, Deep Learning ● Real time systems and Control Applications ○ Hardware and Robotic Control, Speech processing, Pattern Recognition.
  • 10. CHRIST Deemed to be University Parallel Computer Architectural Model Parallel architectural model is classified into two categories as below. ○ Shared memory ○ Distributed memory Excellence and Service
  • 11. CHRIST Deemed to be University Flynn’s & Feng’s Classification Taxonomy S I S D Single Instruction, Single Data S I M D Single Instruction, Multiple Data M I S D M I M D Multiple Instruction, Multiple Instruction, Single Data Multiple Data Excellence and Service
  • 12. CHRIST Deemed to be University A Taxonomy of Parallel Processor Architectures Excellence and Service
  • 13. CHRIST Deemed to be University SISD (single-instruction single-data streams) ● SISD is a serial computer or it is a non – parallel computer system. This is the most common type of computer. In this computer system use only single instruction and single data stream. ● Single Instruction: Only one instruction stream is being used by the CPU. ● Single Data: Only one data stream is being used as input. Excellence and Service
  • 14. CHRIST Deemed to be University SISD cont……… ● Block Diagram of SISD : Excellence and Service
  • 15. CHRIST Deemed to be University SIMD architecture Excellence and Service
  • 16. CHRIST Deemed to be University SIMD architecture……… ● A type of parallel computer ● Single instruction: All processing units execute the same instruction at any given clock cycle ● Multiple data: Each processing unit can operate on a different data element ● Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing. ● Synchronous (lockstep) and deterministic execution ● Two varieties: ProcessorArrays and Vector Pipelines ● Examples: ○ Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV ○ Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10 Excellence and Service
  • 17. CHRIST Deemed to be University MIMD architecture Excellence and Service
  • 18. CHRIST Deemed to be University MIMD architecture…… ● Currently, the most common type of parallel computer. Most modern computers fall into this category. ● Multiple Instruction: every processor may be executing a different instruction stream ● Multiple Data: every processor may be working with a different data stream ● Execution can be synchronous or asynchronous, deterministic or non- deterministic ● Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs. Excellence and Service
  • 19. CHRIST Deemed to be University MIMD (with shared memory) Excellence and Service
  • 20. CHRIST Deemed to be University MIMD (with distributed memory) Excellence and Service
  • 21. CHRIST Deemed to be University Multiple Instruction, Single Data (MISD) ● A single data stream is fed into multiple processing units. ● Each processing unit operates on the data independently via independent instruction streams. ● Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971). ● Some conceivable uses might be: ○ multiple frequency filters operating on a single signal stream multiple cryptography algorithms attempting to crack a single coded message. Excellence and Service
  • 22. CHRIST Excellence and Service Deemed to be University Instruction and Data Streams ● An alternate classification: parallel system ● SPMD: Single Program Multiple Data ○ A parallel program on a MIMD computer ○ Conditional code for different processors Data Streams Single Multiple Instruction Streams Single SISD: Intel Pentium 4 SIMD: SSE instructions of x86 Multiple MISD: No examples today MIMD: Intel Xeon e5345
  • 23. CHRIST Deemed to be University Vector Processors Excellence and Service
  • 24. CHRIST Excellence and Service Deemed to be University Vector Processors ● Highly pipelined function units ● Stream data from/to vector registers (with multiple elements in a vector register) to units ○ Data collected from memory into registers ○ Results stored from registers to memory ● Example: Vector extension to MIPS ○ 32 × 64-element registers (64-bit elements) ○ Vector instructions ■ lv, sv: load/store to /from vector registers ■ addv.d: add vectors of double ■ addvs.d: add scalar to each element of vector of double ● Significantly reduces instruction-fetch bandwidth
  • 25. CHRIST Excellence and Service Deemed to be University Vector Processors ● In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors, ● Compared to the scalar processors, whose instructions operate on single data items. ● Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. ● Vector machines appeared in the early 1970s and dominated supercomputer design through the 1970s into the 1990s, notably the various Cray platforms. ● The rapid fall in the price-to-performance ratio of conventional microprocessor designs led to the vector supercomputer's demise in the later 1990s.
  • 26. CHRIST Excellence and Service Deemed to be University Vector Processors ● An older and, as we shall see, more elegant interpretation of SIMD is called a vector architecture, which has been closely identified with computers designed by Seymour Cray starting in the 1970s. ● It is also a great match to problems with lots of data-level parallelism. ● Rather than having 64 ALUs perform 64 additions simultaneously, like the old array processors, the vector architectures pipelined the ALU to get good performance at lower cost.
  • 28. CHRIST Deemed to be University Excellence and Service Vector versus Scalar Vector instructions have several important properties compared to conventional instruction set architectures, which are called scalar architectures in this context: ● A single vector instruction is equivalent to executing an entire loop. The instruction fetch and decode bandwidth needed is dramatically reduced. ● Hardware does not have to check for data hazards within a vector instruction. ● Vector architectures and compilers have a reputation for making it much easier than MIMD multiprocessors to write efficient applications when those applications contain data-level parallelism.
  • 29. CHRIST Deemed to be University Excellence and Service Vector versus Scalar ● Hardware need only check for data hazards between two vector instructions once per vector operand ● The cost of the latency to main memory is seen only once for the entire vector, rather than once for each word of the vector. ● Control hazards that would normally arise from the loop branch are non-existent. ● The savings in instruction bandwidth and hazard checking plus the efficient use of memory bandwidth give vector architectures advantages in power and energy versus scalar architectures.
  • 30. CHRIST Deemed to be University Vector Processor Excellence and Service
  • 31. CHRIST Deemed to be University Excellence and Service Vector Processor Classification ● Memory to memory architecture ● Register to register architecture
  • 32. CHRIST Deemed to be University Excellence and Service Vector Processor Classification Memory to memory architecture • In memory to memory architecture, source operands, intermediate results, and final results are accessed directly in main memory: operands are read from memory and results are written back to memory. • For memory to memory vector instructions, the base address, the offset, the increment, and the vector length must be specified to enable streams of data transfers between main memory and the pipelines. • Processors such as the TI-ASC, CDC STAR-100, and Cyber-205 provide vector instructions in memory to memory format. • The main points about memory to memory architecture are: • There is no limit on vector size • Speed is comparatively slow in this architecture
  • 33. CHRIST Deemed to be University Excellence and Service Vector Processor Classification Register to register architecture • In register to register architecture, operands and results are accessed indirectly from main memory through a large number of vector registers or scalar registers. • Processors such as the Cray-1 and the Fujitsu VP-200 use vector instructions in register to register format. • The main points about register to register architecture are: • Vector register length is limited. • Speed is very high compared to the memory to memory architecture. • The hardware cost is high in this architecture.
  • 34. CHRIST Deemed to be University Processors Excellence and Service
  • 35. CHRIST Excellence and Service Deemed to be University Symmetric Multiprocessors ● A stand-alone computer with the following characteristics: ○ Two or more similar processors of comparable capacity ○ Processors share the same memory and I/O ○ Processors are connected by a bus or other internal connection ○ Memory access time is approximately the same for each processor ○ All processors share access to I/O ■ Either through the same channels or different channels giving paths to the same devices ○ All processors can perform the same functions (hence symmetric) ○ System controlled by an integrated operating system ■ providing interaction between processors ■ Interaction at job, task, file and data element levels
  • 36. CHRIST Deemed to be University Symmetric Multiprocessors Excellence and Service
  • 37. CHRIST Excellence and Service Deemed to be University SMP Advantages ● Performance ○ If some work can be done in parallel ● Availability ○ Since all processors can perform the same functions, failure of a single processor does not halt the system ● Incremental growth ○ User can enhance performance by adding additional processors ● Scaling ○ Vendors can offer range of products based on number of processors
  • 38. CHRIST Deemed to be University Block Diagram of Tightly Coupled Multiprocessor Excellence and Service
  • 39. CHRIST Deemed to be University Symmetric Multiprocessor Organization Excellence and Service
  • 40. CHRIST Excellence and Service Deemed to be University Multithreading: Basics ● Thread ○ Instruction stream with state (registers and memory) ○ Register state is also called “thread context” ● Threads could be part of the same process (program) or from different programs ○ Threads in the same program share the same address space (shared memory model; see the sketch below) ● Traditionally, the processor keeps track of the context of a single thread ● Multitasking: When a new thread needs to be executed, the old thread’s context in hardware is written back to memory and the new thread’s context is loaded
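A minimal pthreads sketch (an assumed API choice, since the slide names no threading library) of the "same address space" point: two threads of one process update the same global counter, so both see one shared memory.

    #include <stdio.h>
    #include <pthread.h>

    static long counter = 0;                           /* shared: lives in the one address space */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);                 /* shared data needs synchronization */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);       /* two threads, same process image */
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);            /* both threads updated the same variable */
        return 0;
    }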
  • 41. CHRIST Excellence and Service Deemed to be University Multithreading: Basics ● The most important measure of performance for a processor is the rate at which it executes instructions. This can be expressed as MIPS rate = f * IPC ● where f is the processor clock frequency, in MHz, and IPC (instructions per cycle) is the average number of instructions executed per cycle. ● Accordingly, designers have pursued the goal of increased performance on two fronts: ○ increasing clock frequency and ○ increasing the number of instructions executed or, more properly, the number of instructions that complete during a processor cycle.
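As a worked example with illustrative numbers (not from the slides): a processor clocked at f = 2000 MHz that sustains an average IPC of 2 achieves a MIPS rate of 2000 × 2 = 4000 MIPS; doubling either the clock frequency or the IPC doubles the rate.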
  • 42. CHRIST Excellence and Service Deemed to be University Multithreading: Basics ● Designers have increased IPC by using an instruction pipeline and then by using multiple parallel instruction pipelines in a superscalar architecture. ● With pipelined and multiple-pipeline designs, the principal problem is to maximize the utilization of each pipeline stage. ● An alternative approach, which allows for a high degree of instruction- level parallelism without increasing circuit complexity or power consumption, is called multithreading. ● In essence, the instruction stream is divided into several smaller streams, known as threads, such that the threads can be executed in parallel.
  • 43. CHRIST Excellence and Service Deemed to be University Multithreading: Basics ● Process: An instance of a program running on a computer. A process embodies two key characteristics: ○ Resource ownership: A process includes a virtual address space to hold the process image; the process image is the collection of program, data, stack, and attributes that define the process. ○ Scheduling/execution: The execution of a process follows an execution path (trace) through one or more programs. ● Process switch: An operation that switches the processor from one process to another, by saving all the process control data, registers, and other information for the first and replacing them with the process information for the second
  • 44. CHRIST Excellence and Service Deemed to be University Multithreading: Basics ● Thread: A dispatchable unit of work within a process. It includes a processor context (which includes the program counter and stack pointer) and its own data area for a stack (to enable subroutine branching). ● Thread switch: The act of switching processor control from one thread to another within the same process
  • 45. CHRIST Excellence and Service Deemed to be University Hardware Multithreading ● General idea: Have multiple thread contexts in a single processor ○ When (i.e., on which cycles) the hardware executes from each of those hardware contexts determines the granularity of multithreading ● Why? ○ To tolerate latency (initial motivation) ■ Latency of memory operations, dependent instructions, branch resolution ■ By utilizing processing resources more efficiently ○ To improve system throughput ■ By exploiting thread-level parallelism ■ By improving superscalar processor utilization ○ To reduce context switch penalty
  • 46. CHRIST Excellence and Service Deemed to be University Hardware Multithreading ● Benefit + Latency tolerance + Better hardware utilization (when?) + Reduced context switch penalty ● Cost - Requires multiple thread contexts to be implemented in hardware (area, power, latency cost) - Usually reduced single-thread performance - Resource sharing, contention - Switching penalty (can be reduced with additional hardware)
  • 47. CHRIST Deemed to be University Types of Multithreading ● Fine-grained (Interleaved multithreading) ○ Cycle by cycle ● Coarse-grained (Blocked multithreading) ○ Switch on event (e.g., cache miss) ○ Switch on quantum/timeout ● Simultaneous multithreading (SMT) ○ Instructions from multiple threads executed concurrently in the same cycle ● Chip multiprocessing ○ In this case, multiple cores are implemented on a single chip and each core handles separate threads Excellence and Service
  • 48. CHRIST Deemed to be University Fine-grained Multithreading ● Idea: Switch to another thread every cycle such that no two instructions from the same thread are in the pipeline concurrently ● Improves pipeline utilization by taking advantage of multiple threads ● Alternative way of looking at it: Tolerates the control and data dependency latencies by overlapping the latency with useful work from other threads (a toy round-robin sketch follows below) Excellence and Service
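The cycle-by-cycle switching policy can be modeled as a toy software scheduler (purely illustrative, not a hardware description): on cycle c the pipeline issues from thread c mod N, so consecutive issue slots never come from the same thread.

    #include <stdio.h>

    /* Toy model of fine-grained (interleaved) multithreading: each cycle
       issues one instruction from thread (cycle % N_THREADS), so two
       back-to-back pipeline slots never hold instructions from the same
       thread. Illustrative only; the thread count is chosen arbitrarily. */
    #define N_THREADS 4

    int main(void) {
        int pc[N_THREADS] = {0};                /* one program counter per hardware context */
        for (int cycle = 0; cycle < 12; cycle++) {
            int t = cycle % N_THREADS;          /* round-robin thread selection */
            printf("cycle %2d: issue from thread %d, instruction %d\n",
                   cycle, t, pc[t]++);
        }
        return 0;
    }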
  • 49. CHRIST Deemed to be University Fine-grained Multithreading ● Advantages + No need for dependency checking between instructions (only one instruction in pipeline from a single thread) + No need for branch prediction logic + Otherwise-bubble cycles used for executing useful instructions from different threads + Improved system throughput, latency tolerance, utilization ● Disadvantages - Extra hardware complexity: multiple hardware contexts, thread selection logic - Reduced single thread performance (one instruction fetched every N cycles) - Resource contention between threads in caches and memory - Dependency checking logic between threads remains (load/store) Excellence and Service
  • 50. CHRIST Deemed to be University Coarse-grained Multithreading ● Idea: When a thread is stalled due to some event, switch to a different hardware context ○ Switch-on-event multithreading ● Possible stall events ○ Cache misses ○ Synchronization events (e.g., load an empty location) ○ FP operations Excellence and Service
  • 51. CHRIST Deemed to be University Fine-grained vs. Coarse-grained MT ● Fine-grained advantages + Simpler to implement, can eliminate dependency checking, branch prediction logic completely + Switching need not have any performance overhead (i.e. dead cycles) + Coarse-grained requires a pipeline flush or a lot of hardware to save pipeline state → higher performance overhead with deep pipelines and large windows ● Disadvantages - Low single thread performance: each thread gets 1/Nth of the bandwidth of the pipeline Excellence and Service
  • 53. CHRIST Excellence and Service Deemed to be University Simultaneous multithreading (SMT) ● SMT is a variation on hardware multithreading that uses the resources of a multiple-issue, dynamically scheduled pipelined processor to exploit thread-level parallelism at the same time it exploits instruction level parallelism. ● The key insight that motivates SMT is that multiple-issue processors often have more functional unit parallelism available than most single threads can effectively use. ● Furthermore, with register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to the dependences among them; the resolution of the dependences can be handled by the dynamic scheduling capability
  • 54. CHRIST Deemed to be University Approaches to Explicit Multithreading ● Interleaved ○ Fine-grained ○ Processor deals with two or more thread contexts at a time ○ Switching thread at each clock cycle ○ If thread is blocked it is skipped ● Blocked ○ Coarse-grained ○ Thread executed until event causes delay ○ E.g. cache miss ○ Effective on in-order processor ○ Avoids pipeline stall ● Simultaneous (SMT) ○ Instructions simultaneously issued from multiple threads to execution units of superscalar processor ● Chip multiprocessing ○ Processor is replicated on a single chip ○ Each processor handles separate threads Excellence and Service
  • 55. CHRIST Deemed to be University Scalar Processor Approaches ● Single-threaded scalar ○ Simple pipeline ○ No multithreading ● Interleaved multithreaded scalar ○ Easiest multithreading to implement ○ Switch threads at each clock cycle ○ Pipeline stages kept close to fully occupied ○ Hardware needs to switch thread context between cycles ● Blocked multithreaded scalar ○ Thread executed until latency event occurs ○ Would stop pipeline ○ Processor switches to another thread Excellence and Service
  • 56. CHRIST Deemed to be University Scalar Diagrams Excellence and Service
  • 57. CHRIST Excellence and Service Deemed to be University Multiple Instruction Issue Processors (1) ● Superscalar ○ No multithreading ● Interleaved multithreading superscalar: ○ Each cycle, as many instructions as possible issued from single thread ○ Delays due to thread switches eliminated ○ Number of instructions issued in cycle limited by dependencies ● Blocked multithreaded superscalar ○ Instructions from one thread ○ Blocked multithreading used
  • 58. CHRIST Deemed to be University Multiple Instruction Issue Diagram (1) Excellence and Service
  • 59. CHRIST Excellence and Service Deemed to be University Multiple Instruction Issue Processors (2) ● Very long instruction word (VLIW) ○ E.g. IA-64 ○ Multiple instructions in single word ○ Typically constructed by compiler ○ Operations that may be executed in parallel in same word ○ May pad with no-ops ● Interleaved multithreading VLIW ○ Similar efficiencies to interleaved multithreading on superscalar architecture ● Blocked multithreaded VLIW ○ Similar efficiencies to blocked multithreading on superscalar architecture
  • 60. CHRIST Deemed to be University Multiple Instruction Issue Diagram (2) Excellence and Service
  • 61. CHRIST Deemed to be University Parallel, Simultaneous Execution of Multiple Threads ● Simultaneous multithreading ○ Issue multiple instructions at a time ○ One thread may fill all horizontal slots ○ Instructions from two or more threads may be issued ○ With enough threads, can issue maximum number of instructions on each cycle ● Chip multiprocessor ○ Multiple processors ○ Each has two-issue superscalar processor ○ Each processor is assigned thread ■ Can issue up to two instructions per cycle per thread Excellence and Service
  • 62. CHRIST Deemed to be University Parallel Diagram Excellence and Service
  • 63. CHRIST Deemed to be University Examples ● Some Pentium 4 ○ Intel calls it hyperthreading ○ SMT with support for two threads ○ Single multithreaded processor, logically two processors ● IBM Power5 ○ High-end PowerPC ○ Combines chip multiprocessing with SMT ○ Chip has two separate processors ○ Each supporting two threads concurrently using SMT Excellence and Service
  • 64. CHRIST Deemed to be University Intel® Hyper-Threading Technology ● Intel® Hyper-Threading Technology is a hardware innovation that allows more than one thread to run on each core. More threads means more work can be done in parallel. ● How does Hyper-Threading work? ● When Intel® Hyper-Threading Technology is active, the CPU exposes two execution contexts per physical core. This means that one physical core now works like two “logical cores” that can handle different software threads. ● The ten-core Intel® Core™ i9-10900K processor, for example, has 20 threads when Hyper-Threading is enabled. ● Two logical cores can work through tasks more efficiently than a traditional single-threaded core. By taking advantage of idle time when the core would formerly be waiting for other tasks to complete, Intel® Hyper-Threading Technology improves CPU throughput (by up to 30% in server applications). Excellence and Service
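On a Linux system, the logical cores exposed by Hyper-Threading are what the scheduler sees; a quick way to check (glibc/Linux-specific, shown as an assumption rather than a portable API) is:

    #include <stdio.h>
    #include <unistd.h>   /* sysconf(): available on Linux/glibc and other POSIX-like systems */

    int main(void) {
        long logical = sysconf(_SC_NPROCESSORS_ONLN);   /* logical processors currently online */
        printf("Logical processors online: %ld\n", logical);
        /* On a 10-core part with Hyper-Threading active this would report 20. */
        return 0;
    }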
  • 65. CHRIST Deemed to be University Clusters ● Alternative to SMP ● High performance ● High availability ● Server applications ● A group of interconnected whole computers ● Working together as unified resource ● Illusion of being one machine ● Each computer called a node Excellence and Service
  • 66. CHRIST Deemed to be University Clusters Benefits ● Absolute scalability ● Incremental scalability ● High availability ● Superior price/performance Excellence and Service
  • 67. CHRIST Deemed to be University Cluster Configurations - Standby Server, No Shared Disk Excellence and Service
  • 68. CHRIST Deemed to be University Cluster Configurations - Shared Disk Excellence and Service
  • 69. CHRIST Excellence and Service Deemed to be University Clustering Methods: Benefits and Limitations
  • 70. CHRIST Deemed to be University Operating Systems Design Issues ● Failure Management ○ High availability ○ Fault tolerant ○ Failover ■ Switching applications & data from failed system to alternative within cluster ○ Failback ■ Restoration of applications and data to original system ■ After problem is fixed ● Load balancing ○ Incremental scalability ○ Automatically include new computers in scheduling ○ Middleware needs to recognise that processes may switch between machines Excellence and Service
  • 71. CHRIST Excellence and Service Deemed to be University Parallelizing ● Single application executing in parallel on a number of machines in the cluster ○ Compiler ■ Determines at compile time which parts can be executed in parallel ■ Split off for different computers ○ Application ■ Application written from scratch to be parallel ■ Message passing to move data between nodes (see the sketch below) ■ Hard to program ■ Best end result ○ Parametric computing ■ If a problem is repeated execution of an algorithm on different sets of data ■ e.g. simulation using different scenarios ■ Needs effective tools to organize and run
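The "message passing to move data between nodes" bullet can be sketched with MPI point-to-point calls (again an assumed library choice): one process sends a buffer, another receives it, and no memory is shared.

    #include <stdio.h>
    #include <mpi.h>

    /* Sketch of explicit message passing between two cluster processes:
       rank 0 sends an array, rank 1 receives it. Assumes an MPI runtime
       launched with at least two processes. */
    int main(int argc, char **argv) {
        int rank;
        double data[4] = {1.0, 2.0, 3.0, 4.0};
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Send(data, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);          /* to rank 1, tag 0 */
        } else if (rank == 1) {
            MPI_Recv(data, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                                   /* from rank 0 */
            printf("rank 1 received %.1f ... %.1f\n", data[0], data[3]);
        }

        MPI_Finalize();
        return 0;
    }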
  • 72. CHRIST Deemed to be University Cluster Computer Architecture Excellence and Service
  • 73. CHRIST Deemed to be University Cluster Middleware ● Unified image to user ○ Single system image ● Single point of entry ● Single file hierarchy ● Single control point ● Single virtual networking ● Single memory space ● Single job management system ● Single user interface ● Single I/O space ● Single process space ● Checkpointing ● Process migration Excellence and Service
  • 74. CHRIST Deemed to be University Blade Servers ● Common implementation of cluster ● Server houses multiple server modules (blades) in single chassis ○ Save space ○ Improve system management ○ Chassis provides power supply ○ Each blade has processor, memory, disk Excellence and Service
  • 75. CHRIST Deemed to be University Cluster v. SMP ● Both provide multiprocessor support to high demand applications. ● Both available commercially ○ SMP for longer ● SMP: ○ Easier to manage and control ○ Closer to single processor systems ■ Scheduling is main difference ■ Less physical space ■ Lower power consumption ● Clustering: ○ Superior incremental & absolute scalability ○ Superior availability ■ Redundancy Excellence and Service
  • 76. CHRIST Deemed to be University Nonuniform Memory Access (NUMA) ● Alternative to SMP & clustering ● Uniform memory access ○ All processors have access to all parts of memory ■ Using load & store ○ Access time to all regions of memory is the same ○ Access time to memory for different processors same ○ As used by SMP ● Nonuniform memory access ○ All processors have access to all parts of memory ■ Using load & store ○ Access time of processor differs depending on region of memory ○ Different processors access different regions of memory at different speeds ● Cache coherent NUMA ○ Cache coherence is maintained among the caches of the various processors ○ Significantly different from SMP and clusters Excellence and Service
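On Linux, placement of memory on a particular NUMA node can be controlled with libnuma; the sketch below (assuming libnuma is installed and the program is linked with -lnuma) allocates a buffer local to node 0, so that node's processors get local access while other nodes would see it as remote memory.

    #include <stdio.h>
    #include <numa.h>    /* Linux libnuma; link with -lnuma (assumed available) */

    int main(void) {
        if (numa_available() < 0) {
            printf("NUMA is not supported on this system\n");
            return 1;
        }
        size_t bytes = 1 << 20;                        /* 1 MiB */
        void *buf = numa_alloc_onnode(bytes, 0);       /* memory placed local to node 0 */
        if (buf != NULL) {
            printf("Allocated 1 MiB on node 0; nodes available: %d\n",
                   numa_max_node() + 1);
            numa_free(buf, bytes);
        }
        return 0;
    }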
  • 77. CHRIST Deemed to be University Motivation ● SMP has a practical limit to the number of processors ○ Bus traffic limits scaling to between 16 and 64 processors ● In clusters each node has its own memory ○ Apps do not see a large global memory ○ Coherence maintained by software not hardware ● NUMA retains the SMP flavour while giving large scale multiprocessing ○ e.g. the Silicon Graphics Origin NUMA scales to 1024 MIPS R10000 processors ● Objective is to maintain a transparent system wide memory while permitting multiprocessor nodes, each with its own bus or internal interconnection system Excellence and Service
  • 78. CHRIST Excellence and Service Deemed to be University CC-NUMA Organization
  • 79. CHRIST Excellence and Service Deemed to be University CC-NUMA Operation ● Each processor has own L1 and L2 cache ● Each node has own main memory ● Nodes connected by some networking facility ● Each processor sees single addressable memory space ● Memory request order: ○ L1 cache (local to processor) ○ L2 cache (local to processor) ○ Main memory (local to node) ○ Remote memory ■ Delivered to requesting (local to processor) cache ● Automatic and transparent
  • 80. CHRIST Deemed to be University Excellence and Service Memory Access Sequence ● Each node maintains directory of location of portions of memory and cache status ● e.g. node 2 processor 3 (P2-3) requests location 798 which is in memory of node 1 ○ P2-3 issues read request on snoopy bus of node 2 ○ Directory on node 2 recognises location is on node 1 ○ Node 2 directory requests node 1’s directory ○ Node 1 directory requests contents of 798 ○ Node 1 memory puts data on (node 1 local) bus ○ Node 1 directory gets data from (node 1 local) bus ○ Data transferred to node 2’s directory ○ Node 2 directory puts data on (node 2 local) bus ○ Data picked up, put in P2-3’s cache and delivered to processor
  • 81. CHRIST Excellence and Service Deemed to be University Cache Coherence ● Node 1 directory keeps note that node 2 has copy of data ● If data modified in cache, this is broadcast to other nodes ● Local directories monitor and purge local cache if necessary ● Local directory monitors changes to local data in remote caches and marks memory invalid until writeback ● Local directory forces writeback if memory location requested by another processor
  • 82. CHRIST Excellence and Service Deemed to be University NUMA Pros & Cons ● Effective performance at higher levels of parallelism than SMP ● No major software changes ● Performance can break down if there is too much access to remote memory ○ Can be avoided by: ■ L1 & L2 cache design reducing all memory access ● Needs good temporal locality of software ■ Good spatial locality of software ■ Virtual memory management moving pages to nodes that are using them most ● Not transparent ○ Page allocation, process allocation and load balancing changes needed ● Availability?
  • 83. CHRIST Excellence and Service Deemed to be University Vector Computation ● Maths problems involving physical processes present different difficulties for computation ○ Aerodynamics, seismology, meteorology ○ Continuous field simulation ● High precision ● Repeated floating point calculations on large arrays of numbers ● Supercomputers handle these types of problem ○ Hundreds of millions of flops ○ $10-15 million ○ Optimised for calculation rather than multitasking and I/O ○ Limited market ■ Research, government agencies, meteorology ● Array processor ○ Alternative to supercomputer ○ Configured as peripherals to mainframe & mini ○ Just run vector portion of problems
  • 84. CHRIST Deemed to be University Vector Addition Example Excellence and Service
  • 85. CHRIST Excellence and Service Deemed to be University Approaches ● General purpose computers rely on iteration to do vector calculations ● In the example this needs six calculations ● Vector processing ○ Assume it is possible to operate on a one-dimensional vector of data ○ All elements in a particular row can be calculated in parallel ● Parallel processing ○ Independent processors functioning in parallel (a fork/join sketch follows below) ○ Use FORK N to start an individual process at location N ○ JOIN N causes N independent processes to join and merge following JOIN ■ O/S co-ordinates JOINs ■ Execution is blocked until all N processes have reached JOIN
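The FORK N / JOIN N pattern maps naturally onto thread creation and joining; here is a minimal pthreads sketch (an assumed API, not the slide's notation): the loop of pthread_create plays the role of FORK, and the loop of pthread_join blocks until all N workers have finished, like JOIN.

    #include <stdio.h>
    #include <pthread.h>

    #define N 4   /* number of independent parallel activities to fork */

    static void *body(void *arg) {
        long id = (long)arg;                            /* worker index passed by value */
        printf("parallel work in worker %ld\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t tid[N];
        for (long i = 0; i < N; i++)
            pthread_create(&tid[i], NULL, body, (void *)i);   /* FORK: start N workers */
        for (int i = 0; i < N; i++)
            pthread_join(tid[i], NULL);                       /* JOIN: wait for all N */
        printf("all %d workers joined; execution continues here\n", N);
        return 0;
    }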
  • 86. CHRIST Excellence and Service Deemed to be University Processor Designs ● Pipelined ALU ○ Within operations ○ Across operations ● Parallel ALUs ● Parallel processors