Dr. K.K. THYAGHARAJAN
Professor & Dean (Academic)
Department of Electronics and Communication Engineering
RMD ENGINEERING COLLEGE
Contact E-mail: acdean@rmd.ac.in, kkthyagharajan@yahoo.com, kkthyagharajan@gmail.com
PROCESSOR ORGANIZATION
Click on the links given below to view videos.
UNIT – I Computer Architecture
PPT-PDF http://dx.doi.org/10.13140/RG.2.2.28687.20643
https://youtu.be/DcMM_dIxWEE
https://youtu.be/JoSONsTuopk
UNIT – II Computer Architecture
PPT-PDF http://dx.doi.org/10.13140/RG.2.2.36236.95363
https://youtu.be/thC8B4B-PyY
https://youtu.be/m7JtcP5QmFA
https://youtu.be/NbfTKSm4ubM
https://youtu.be/RhiBtztCESI
UNIT – IV Computer Architecture & Microprocessors
PPT-PDF http://dx.doi.org/10.13140/RG.2.2.20718.02880
https://youtu.be/LroA8T-_vqs
https://youtu.be/CU1wx8EZmvc
https://youtu.be/zYADaZ5sfY0
https://youtu.be/GuC7sZEw-uM
FLYNN’S CLASSIFICATION OF MULTIPROCESSOR ORGANIZATION
This video explains SISD, SIMD, MISD and MIMD classification
Flynn’s Classification of Processor Organization
1. SISD (Single Instruction stream Single Data stream)
2. SIMD (Single Instruction stream Multiple Data stream)
3. MISD (Multiple Instruction stream Single Data stream)
4. MIMD (Multiple Instruction stream Multiple Data stream)
Processor organization deals with how the parts of a processor, such as control units and processing elements (ALUs), are linked together in a multiprocessor to improve performance.
Flynn’s Classification of Processor Organization
1. SISD
[Figure: SISD organization. The Control Unit sends a single instruction stream (IS) to the ALU, and the ALU exchanges a single data stream (DS) with the Main Memory.]
SISD has a single control unit and fetches a single instruction from the main memory at a time.
It has one processing element (ALU) and uses one data stream connected to the main memory.
The processing unit may have several functional units (add, multiply, load, etc.).
Instruction streams (IS) = Data streams (DS) = 1
2. SIMD
SIMD is well suited to handling arrays in for loops, and data parallelism is achieved (see the code sketch after the figure).
SIMD fetches only one copy of the code from the main memory, and the operation is performed on multiple independent data items obtained from the main memory, as shown in the figure.
This reduces the instruction bandwidth and instruction storage space.
It is not suitable for switch-case statements, because the execution units (processing units) must perform different operations depending on the data.
The processor must complete the current instruction before it takes up the next one, i.e. execution of the instruction is synchronous.
[Figure: SIMD organization. One Control Unit broadcasts the instruction stream (IS) to processing elements PE1 ... PEn; each PE works on its own data stream (DS1 ... DSn) drawn from main-memory modules MM1 ... MMn. PE = Processing Element, MM = Main Memory]
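As a concrete illustration of the for-loop point above, here is a minimal C sketch (the function name and sizes are illustrative, not from the slides): the loop applies one operation to many independent array elements, which is exactly the pattern a SIMD machine or a vectorizing compiler can execute in parallel.

/* vector_add.c: one instruction stream, many independent data elements */
#include <stddef.h>

void vector_add(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)   /* the same "add" operation ...      */
        c[i] = a[i] + b[i];          /* ... applied to independent elements */
}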
3. MISD
[Figure: MISD organization. Control units CU1 ... CUn each issue their own instruction stream (IS1 ... ISn) to processing elements PE1 ... PEn, while a single data stream (DS) from the common shared memory passes through all the PEs. PE = Processing Element, CU = Control Unit, MM = Main Memory]
IS > 1 and DS = 1
Multiple control units are used to control multiple processing units.
Each control unit handles one instruction stream and processes it through its corresponding processing element.
Only one data stream passes through all the processing elements at a time, drawn from a common shared memory.
4. MIMD
[Figure: MIMD organization. Control units CU1 ... CUn issue independent instruction streams (IS1 ... ISn) to processing elements PE1 ... PEn, and each PE draws its own data stream (DS1 ... DSn) from main-memory modules MM1 ... MMn. PE = Processing Element, CU = Control Unit]
IS > 1 and DS > 1
Multiple control units are used to handle multiple instructions at the same time.
Multiple processing elements are used, with a separate data stream drawn from main memory for each processing element.
Each processor works on its own instructions and its own data.
Tasks executed by different processors are asynchronous, i.e. each task can start or finish at a different time.
This organization represents a real parallel computer. Example: the Graphics Processing Unit (GPU).
SIMD (Single Instruction stream Multiple Data stream)
SIMD is a vector architecture and is used for data-level parallelism.
In a vector architecture, data are collected from memory and placed in proper order into a set of vector registers. These registers are operated on sequentially using pipelined execution units, and the results are written back to memory.
If two vectors A and B are to be added and the result is to be stored in the vector C, this is written as C = A + B, where A(1), A(2), etc. are vector elements.
Figure 1 shows a single 'add' pipeline that adds vectors A and B and stores the result in vector C. Here one addition is performed per cycle.
Figure 2 uses four add pipelines, or lanes, so it completes four additions per cycle. The number of clock cycles required to execute a vector addition is therefore reduced by a factor of 4. Each vector lane uses a portion of the vector register.
Figure 3 also uses four lanes, but each lane contains more than one functional unit: three functional units (an FP add, an FP multiply, and a load/store unit) are provided. The elements of a single vector are interleaved across the four lanes; the vector storage is divided across the lanes, and each lane holds every fourth element of each vector register.
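The lane interleaving just described can be made concrete with a small C sketch (the 4-lane layout follows the figures; the names are illustrative): element i of a vector register is held by lane i mod 4.

#include <stdio.h>

#define LANES 4   /* four lanes, as in figures 2 and 3 */

int main(void)
{
    for (int i = 0; i < 8; i++) {       /* first few of the 64 elements */
        int lane = i % LANES;           /* lane that holds element i    */
        int slot = i / LANES;           /* its position within the lane */
        printf("element %d -> lane %d, slot %d\n", i, lane, slot);
    }
    return 0;
}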
Older array processors used 64 ALUs to perform 64 additions simultaneously.
SIMD vector architectures instead use a smaller number of ALUs (even one) and pass the data through lanes and pipelines. This reduces the hardware cost.
In the MIPS vector architecture, 32 vector registers are provided, and each register holds 64 vector elements, each 64 bits in size.
The hardware gets the addresses of the vector elements from these vector registers; such indexed accesses are called gather-scatter.
The data need not be contiguous in main memory. Indexed load instructions gather data from the main memory and put them into contiguous vector elements, and indexed store instructions scatter vector elements across main memory.
The number of elements in a vector operation is not encoded in the instruction or opcode; it is held in a separate register.
MIPS vector instructions are obtained by appending the letter 'v' to MIPS instructions, for example:
addv.d # adds two double-precision vectors; this instruction accesses its inputs using two vector registers
addvs.d # takes one input from a scalar register and the other from a vector register; the scalar input is added to each element of the vector
lv # load vector (double-precision data)
sv # store vector (double-precision data)
Problem:
Write a MIPS program using vector instructions to solve Y = a*X + Y, where X and Y are vectors (arrays) of 64 double-precision floating-point numbers, i.e. 64 numbers of 64-bit size each, stored in memory.
Assume that the starting address of X is in $s0 and the starting address of Y is in $s1.
Solution:
l.d $f0, a($sp) # load scalar ‘a’ into f0 register
lv $v1, 0($s0) # load vector X pointed by register s0 into v1 register
mulvs.d $v2, $v1, $f0 # multiply vector v1 by scalar f0 and store the result in vector v2
lv $v3, 0($s1) # load vector Y in v3
addv.d $v4, $v2, $v3 # add vector Y (in v3) to the product (in v2) and store the result in vector v4
sv $v4, 0($s1) # store the result v4 in the vector Y (pointed by register s1)
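For comparison, here is the same computation (the classic DAXPY kernel) as a plain scalar C loop; the six vector instructions above replace this entire 64-iteration loop:

/* Scalar C equivalent of the vector program: Y = a*X + Y (DAXPY) */
void daxpy(double a, const double *x, double *y)
{
    for (int i = 0; i < 64; i++)   /* 64 double-precision elements */
        y[i] = a * x[i] + y[i];
}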
Multithreading
Hardware Multithreading
This video explains three types of hardware multithreading
1. Hardware Multithreading
Thread: a thread is a sequence of instructions that can run independently of other programs.
When a sequence of instructions is being executed, the processor may have to wait if the next instruction or its data is not available. This is called stalling.
Instead of waiting, the processor may switch to another thread, execute it, and come back to this thread later.
Multithreading: switching from one thread (the stalled thread) to another is known as multithreading. All the threads generally share a single address space, while each thread keeps its own program counter, stack, and register state.
Process: a process includes one or more threads and their address space. Switching from one process to another invokes the operating system (OS).
The main difference between switching threads and switching processes: multithreading stays within a single address space and does not invoke the OS, whereas a process switch crosses address spaces and requires the help of the operating system. A thread can therefore be viewed as a lightweight process, smaller than a full process.
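The shared-address-space point can be illustrated with a small software sketch using POSIX threads (pthreads is an assumption here; the slides discuss hardware threads, but the address-space behaviour is the same): both threads update the same global variable, and no OS address-space switch is needed.

#include <pthread.h>
#include <stdio.h>

int shared_counter = 0;          /* one copy, visible to every thread */

void *worker(void *arg)
{
    (void)arg;
    shared_counter++;            /* unsynchronized here; locks are covered later */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", shared_counter);   /* both increments hit the same variable */
    return 0;
}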
Types of multithreading:
Fine-grained multithreading
Coarse-grained multithreading
Simultaneous multithreading
1.1 Fine-grained multithreading
Switching between threads happens on each instruction, i.e. on every clock cycle.
Switching is done in a round-robin fashion: the 1st instruction of the 1st thread is executed, then the 1st instruction of the 2nd thread, and so on until the 1st instruction of the last thread completes; then the 2nd instruction of the 1st thread is executed, then the 2nd instruction of the 2nd thread, and so on.
Advantage: if any thread is stalled, it is skipped and the next thread continues. This is called interleaving, and it improves throughput.
Disadvantage: threads that are ready to execute their next instruction without stalling must still wait for their turn behind the other threads. This slows down the execution of individual threads.
1.2 Coarse-grained multithreading
Threads are switched only when costly stalls (e.g. a last-level cache miss) occur. Frequent thread switching is thus avoided, and so is the resulting slowdown of individual thread execution.
Disadvantage: because instructions are issued from a single thread, the pipeline must be emptied or frozen on a switch, and the new thread starts completing instructions only after the pipeline has been refilled. This is called start-up overhead. For shorter stalls the processor keeps issuing instructions from the same thread, so the throughput lost to short stalls cannot be recovered.
Advantage: this approach reduces the penalty of high-cost stalls, because the pipeline refill time is negligible compared to the stall time.
1.3 Simultaneous multithreading (SMT)
Multiple instructions are executed from multiple independent threads using register renaming. Dependencies between threads are handled by dynamic scheduling. SMT does not switch resources on every clock cycle.
SMT processors have dynamically scheduled pipelines that exploit both thread-level parallelism and instruction-level parallelism, and they provide more functional units to support this parallelism.
Intel calls this hyper-threading; AMD calls it SMT.
Figure 1 shows three threads that would each execute independently, with stalls. The empty rows indicate unused clock cycles (stalls). One row of each thread issues to the pipeline per clock cycle.
Figures 2a, 2b, and 2c show how the three threads of figure 1 are executed when fine-grained, coarse-grained, and simultaneous multithreading, respectively, are applied.
[Figure 1: three independent threads A, B, and C, each containing costly stalls. Figure 2: the same three threads executed with (2a) fine-grained, (2b) coarse-grained, and (2c) simultaneous multithreading; time runs down the vertical axis.]
Multiprocessing
Multiprocessing Systems
Multicore processor & Multiprocessor
1. Multiprocessing Systems
1. Multiprocessor system
1.1 Shared-memory system (tightly coupled system)
1.1.1 Uniform Memory Access (UMA)
1.1.2 Non-uniform Memory Access (NUMA)
1.2 Distributed-memory system (loosely coupled system): cluster
Two types of multiprocessing:
More than one processor on a single chip: a multicore processor. A quad-core processor can handle 4 threads and an octa-core 8 threads; more cores give better multiprocessing. It uses an on-chip network to connect the processors.
More than one processor connected in a single system: a multiprocessor system. It uses a local area network (LAN) to connect the systems.
1.1 Shared Memory (tightly coupled) System
• All the processors share a single global memory. The global memory may be divided into many modules, but a single address space is used (e.g. a multicore processor).
• Processors communicate using shared locations (variables) in the global memory.
• These shared data are coordinated using locks (synchronization primitives), which allow the data to be accessed by only one processor at a time (see the sketch below).
• Shared-memory systems use a common bus, a crossbar, or a multistage network to connect the processors, memory, and I/O devices.
• Programs stored in the virtual address space of each processor can run independently.
• This organization is used for high-speed real-time processing and provides high throughput compared to loosely coupled systems.
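A minimal pthreads sketch of the lock idea from the list above (pthreads and the variable names are illustrative assumptions): the mutex guarantees that only one thread updates the shared location at a time.

#include <pthread.h>
#include <stdio.h>

int shared_data = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* the synchronization primitive */

void *safe_update(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);      /* acquire: other threads now wait       */
    shared_data++;                  /* only one thread is in here at a time  */
    pthread_mutex_unlock(&lock);    /* release: the next thread may enter    */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, safe_update, NULL);
    pthread_create(&t2, NULL, safe_update, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared_data = %d\n", shared_data);   /* always 2 */
    return 0;
}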
1.1.1 Uniform Memory Access (UMA) System
UMA systems are divided into two types:
1. Symmetric UMA (SUMA)
2. Asymmetric UMA (AUMA)
In SUMA, all processors are identical. Processors may have local cache memories and I/O devices. Physical memory is uniformly shared by all processors, with equal access time to all words.
In AUMA, one master processor executes the operating system, and the other processors may be dedicated to special tasks such as graphics rendering or mathematical computation.
[Figure: UMA organization. Processors 1 ... n, each with its own cache, connect through an interconnection network to the shared memory and I/O.]
1.1.2 Non-uniform Memory Access (NUMA) System
In a NUMA system, each processor may have a local memory holding its own private program and private data. The collection of all local memories forms the global memory, i.e. the local memory of one processor may be accessed by another processor using shared variables.
The time taken for a remote processor to access the local memory of another processor is not uniform: it depends on the locations of the processor and the memory.
NUMA machines can scale to larger sizes, with lower latency (access time) to local memory.
[Figure: NUMA organization. Processors P1 ... Pn with local memories M1 ... Mn connect through a processor-memory interconnection network and an interrupt-signal interconnection network; an I/O processor reaches devices D1 ... Dn through I/O channels and an I/O interconnection network.]
1.2 Distributed Memory System (DMS)
(Loosely Coupled System)
• DMS systems do not use a global shared memory, because a global memory creates memory conflicts and slows down execution.
• A DMS has multiple processors, and each processor has a large local memory and a set of I/O devices that are not shared with any other processor. So, this system is also called a distributed multicomputer system.
• The group of computers connected together is called a cluster, and each computer is called a node.
• These computers communicate with each other by passing messages through an interconnection network.
• To pass messages to other computers in the cluster, a 'send message' routine is used; to receive messages from other computers, a 'receive message' routine is used (see the sketch below).
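As an illustration of the send/receive routines, here is a minimal message-passing sketch using MPI (the slides name no specific library, so MPI is an assumption): node 0 sends an integer and node 1 receives it. Compile with mpicc and run with mpirun -np 2.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which node of the cluster am I? */

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* 'send message'    */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                          /* 'receive message' */
        printf("node 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}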
Multiprocessor Network Topologies
o The way the nodes (processor and memory) are connected is called the network topology.
o Multicore chips require on-chip networks to connect the cores together.
o Clusters require local area networks to connect the servers together.
o Networks are drawn as graphs, and the edges of the graph represent the links of the communication network.
o Nodes (computers or processor-memory nodes) are connected to this graph through network switches.
o In the following diagrams, coloured circles represent switches and black squares represent processor-memory nodes.
o Network costs include the number of switches, the number of links per switch, the length of the links, and the width (number of bits) per link.
Bus Topology
1. Uses a shared set of wires that allows broadcasting messages to all nodes at the same time.
2. Network bandwidth = bandwidth of the bus.
Ring Topology
1. Messages must travel along intermediate nodes until they arrive at the final destination. A ring is capable of many simultaneous transfers.
2. Network bandwidth = bandwidth of each link x number of links. If a link is as fast as the bus, a ring is P times faster than a bus in the best case.
Fully Connected Network
3. In a fully connected network, every processor (P processors in total) has a bidirectional link to every other processor.
4. Total network bandwidth = P x (P - 1)/2 times the link bandwidth.
5. Bisection bandwidth = (P/2)^2 times the link bandwidth. Bisection bandwidth is calculated by dividing the machine into two halves and adding up the bandwidth of the links that cross the cut.
[Figure: a ring of switches (circles), each attached to a node (square) by a link.]
• Instead of placing a processor at every node in a network, a switch can be placed at some of these nodes.
• Switches are smaller than processor-memory-switch nodes.
Star Topology
In a star topology, all nodes are connected to a central device called a hub using point-to-point connections.
Boolean Cube (n-cube) Network
For a cube, n = 3, so 2^n = 2^3 = 8 nodes are connected. Each switch uses n (here 3) links to other switches, and one more link goes to the processor.
2D Grid or Mesh Network
Here n = 2, so each switch uses n (here 2) links to neighbouring switches, and one link goes to the processor.
[Figure: an n-cube of 8 nodes and a 2D mesh; circles are switches, squares are nodes.]
Multistage networks: messages may travel in multiple steps.
1. Fully connected or crossbar networks: any node can communicate with any other node in one pass through the network.
2. Omega networks: use less hardware than the crossbar network, but contention may occur between messages.
Crossbar Network
n = number of processors = 8
Number of switches = n^2 = 64
Any node can communicate with any other node in one pass through the network.
Omega Network
Uses less hardware than the crossbar network. There are 12 switch boxes in the network shown in the figure, and each switch box has 4 smaller switches, so the number of switches used = 12 x 4 = 48.
This is given by the formula 2n log2 n; here n = number of processors = 8, so the number of switches used = 2 x 8 x log2 8 = 2 x 8 x 3 = 48, as above.
This network cannot support all combinations of message passing, and contention may occur between messages. For example, in the network shown, P0 cannot send messages to P6 because these two are not connected, and if P1 sends a message to P4, then P0 may not send a message to P4 or P5 at the same time.
[Figure: an 8-processor omega network built from switch boxes A, B, C, and D; each switch box contains 4 smaller switches.]
Graphics Processing Unit (GPU)
Architecture of GPU (MIMD Processor)
Graphics Processing Unit (GPU)
o The GPU is a processor specially designed for handling graphics-rendering tasks.
o GPUs are used to accelerate processing in video editing, video-game rendering, 3D modelling (e.g. AutoCAD), AI-based tasks, etc.
o The GPU breaks complex problems into many tasks and works on them in parallel.
o GPUs are highly multithreaded.
Central Processing Unit (CPU)
1. The CPU has general-purpose instructions and is more suitable for serial processing.
2. CPUs have just a few cores with caches and can handle only a few threads at a time.
3. The CPU has a large main memory that is oriented toward low latency.
4. The CPU is mainly designed for instruction-level parallelism.
Graphics Processing Unit (GPU)
1. The GPU has its own special instructions to handle graphics and is more suitable for parallel processing.
2. GPUs have hundreds of cores and can handle thousands of threads simultaneously.
3. The GPU has a separate large main memory that is oriented toward bandwidth rather than latency and provides high throughput.
4. The GPU is designed for data-level parallelism.
GPU Architecture - NVIDIA
NVIDIA is an American company. It developed the 'Compute Unified Device Architecture' (CUDA) for its graphics processing units; Fermi is the commercial name of one such architecture. GeForce is a brand of graphics processing units designed by NVIDIA.
A GPU contains a collection of multithreaded SIMD processors and hence is an MIMD processor. In the Fermi architecture, the number of SIMD processors is 7, 11, 14, or 15.
A CUDA program launches kernels that run in parallel. The GPU executes a kernel on a grid; a grid has many thread blocks, and a thread block consists of many threads that are executed in parallel (figure 2). A code sketch follows the list of schedulers below.
Each thread within the thread block is called a machine object; it has a thread ID, program instructions, a program counter, registers, per-thread private memory, inputs, and outputs. The machine object is created, managed, scheduled, and executed by the GPU.
The GPU has two schedulers:
1. Thread Block Scheduler: assigns blocks of threads to the multithreaded SIMD processors.
2. SIMD Thread Scheduler: located within each SIMD processor, with its own controller. It identifies the threads that are ready to run, schedules them for execution, and sends them to the dispatcher when needed.
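The kernel/grid/thread-block hierarchy can be seen directly in CUDA source code. Below is a minimal sketch (the kernel name, sizes, and use of unified memory are illustrative choices, not from the slides): the kernel is launched on a grid of thread blocks, and each thread derives its own ID from its block and thread indices.

#include <cstdio>

__global__ void scale(float *x, float a, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  /* per-thread ID        */
    if (tid < n)
        x[tid] *= a;                 /* each thread handles one element       */
}

int main()
{
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));   /* memory visible to CPU and GPU */
    for (int i = 0; i < n; i++) x[i] = 1.0f;

    int threadsPerBlock = 256;                  /* threads in one thread block  */
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  /* blocks in grid */
    scale<<<blocks, threadsPerBlock>>>(x, 2.0f, n);            /* launch kernel  */
    cudaDeviceSynchronize();                    /* wait for the grid to finish  */

    printf("x[0] = %f\n", x[0]);                /* prints 2.0 */
    cudaFree(x);
    return 0;
}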
The off-chip DRAM is shared by all thread blocks; it is called the global memory or GPU memory.
The local memory of each SIMD processor is shared by the lanes within that processor, but it is not shared between two SIMD processors.
Depending on the Fermi variant, 7, 11, 14, or 15 SIMD processors are used.
[Figure 1: a multithreaded SIMD processor. Figure 2: the GPU memory structure, showing threads grouped into thread blocks within grids 0 and 1.]
Scheduling
Parallel processing involves:
o Scheduling: a method of sharing resources (processor time, data lines) among threads and processes and balancing the load to achieve good quality of service.
o Partitioning the work into parallel pieces: the task must be divided equally among all processors to avoid idle time.
o Balancing the load evenly between the processors: processors should not stay idle for long.
o Time required for synchronization: each processor should complete the allotted work in the specified time and should get its next work in time; otherwise parallel processing will not be possible.
o Overhead for communication: inter-process communication has a time overhead, because the system has to know which processes have to communicate and which do not.
Types of scheduling
Long term scheduling
Medium term scheduling
Short term scheduling
Dispatcher
Problem:
To achieve a speed-up of 90 with 100 processors, what percentage of the original computation can be sequential?
Solution (Amdahl's Law):
Speed-up = 1 / ((1 - F) + F/100), where F is the fraction of the execution time that can use the 100 processors.
Substituting Speed-up = 90:
90 = 1 / ((1 - F) + F/100)
Solving for F: 1 - F x (99/100) = 1/90, so F = (1 - 1/90) x (100/99) ≈ 0.999.
So, to achieve a speed-up of 90 with 100 processors, the sequential percentage can be only about 0.1%.
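A quick numeric check of this result (a standalone C sketch; the variable names are illustrative): solve the speed-up formula for F, then substitute back.

#include <stdio.h>

int main(void)
{
    double S = 90.0;     /* target speed-up      */
    double P = 100.0;    /* number of processors */

    /* From S = 1 / ((1 - F) + F/P), solving for F: */
    double F = (1.0 - 1.0 / S) * (P / (P - 1.0));
    double speedup = 1.0 / ((1.0 - F) + F / P);   /* substitute back */

    printf("parallel fraction F = %.5f (sequential = %.3f%%)\n", F, (1.0 - F) * 100.0);
    printf("check: speed-up = %.1f\n", speedup);  /* prints 90.0 */
    return 0;
}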