Dr. K.K. THYAGHARAJAN
Professor & Dean (Academic)
Department of Electronics and Communication Engineering
RMD ENGINEERING COLLEGE
Contact E-mail: acdean@rmd.ac.in, kkthyagharajan@yahoo.com, kkthyagharajan@gmail.com
PROCESSOR ORGANIZATION
Click on the links given below to view videos.
UNIT – I Computer Architecture
PPT-PDF http://dx.doi.org/10.13140/RG.2.2.28687.20643
https://youtu.be/DcMM_dIxWEE
https://youtu.be/JoSONsTuopk
UNIT – II Computer Architecture
PPT-PDF http://dx.doi.org/10.13140/RG.2.2.36236.95363
https://youtu.be/thC8B4B-PyY
https://youtu.be/m7JtcP5QmFA
https://youtu.be/NbfTKSm4ubM
https://youtu.be/RhiBtztCESI
UNIT – IV Computer Architecture & Microprocessors
PPT-PDF http://dx.doi.org/10.13140/RG.2.2.20718.02880
https://youtu.be/LroA8T-_vqs
https://youtu.be/CU1wx8EZmvc
https://youtu.be/zYADaZ5sfY0
https://youtu.be/GuC7sZEw-uM
FLYNN’S CLASSIFICATION OF MULTIPROCESSOR ORGANIZATION
This video explains SISD, SIMD, MISD and MIMD classification
Flynn’s Classification of Processor Organization
1. SISD (Single Instruction stream Single Data stream)
2. SIMD (Single Instruction stream Multiple Data stream)
3. MISD (Multiple Instruction stream Single Data stream)
4. MIMD (Multiple Instruction stream Multiple Data stream)
Processor organization deals with how the parts of a processor, such as control units and processing elements (ALUs), are linked together in a multiprocessor to improve performance.
Flynn’s Classification of Processor Organization
1. SISD
[Figure: SISD organization. The Control Unit sends a single instruction stream (IS) to the ALU, and the ALU exchanges a single data stream (DS) with the Main Memory.]
SISD has a single control unit and fetches a single instruction from the main memory at a time.
It has one processing element (ALU) and uses one data stream connected to the main memory.
The processing unit may have several functional units (add, multiply, load, etc.).
Instruction streams (IS) = Data streams (DS) = 1
2. SIMD
SIMD is well suited to handling arrays in for loops, and data parallelism is achieved (see the code sketch after the figure).
SIMD fetches only one copy of the code from the main memory, and the operation is performed on multiple independent data items obtained from the main memory, as shown in the figure.
This reduces the instruction bandwidth and instruction storage space.
It is not suitable for switch-case statements, because the execution units (processing units) must perform different operations depending on the data.
The processor must complete the current instruction before it takes up the next one, i.e. execution of the instruction is synchronous.
[Figure: SIMD organization. One Control Unit broadcasts the instruction stream (IS) to processing elements PE1 ... PEn; each PE works on its own data stream (DS1 ... DSn) drawn from main-memory modules MM1 ... MMn. PE = Processing Element, MM = Main Memory]
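As a concrete illustration of the for-loop point above, here is a minimal C sketch (the function name and sizes are illustrative, not from the slides): the loop applies one operation to many independent array elements, which is exactly the pattern a SIMD machine or a vectorizing compiler can execute in parallel.

/* vector_add.c: one instruction stream, many independent data elements */
#include <stddef.h>

void vector_add(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)   /* the same "add" operation ...      */
        c[i] = a[i] + b[i];          /* ... applied to independent elements */
}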
3. MISD
[Figure: MISD organization. Control units CU1 ... CUn each issue their own instruction stream (IS1 ... ISn) to processing elements PE1 ... PEn, while a single data stream (DS) from the common shared memory passes through all the PEs. PE = Processing Element, CU = Control Unit, MM = Main Memory]
IS > 1 and DS = 1
Multiple control units are used to control multiple processing units.
Each control unit handles one instruction stream and processes it through its corresponding processing element.
Only one data stream passes through all the processing elements at a time, drawn from a common shared memory.
4. MIMD
[Figure: MIMD organization. Control units CU1 ... CUn issue independent instruction streams (IS1 ... ISn) to processing elements PE1 ... PEn, and each PE draws its own data stream (DS1 ... DSn) from main-memory modules MM1 ... MMn. PE = Processing Element, CU = Control Unit]
IS > 1 and DS > 1
Multiple control units are used to handle multiple instructions at the same time.
Multiple processing elements are used, with a separate data stream drawn from main memory for each processing element.
Each processor works on its own instructions and its own data.
Tasks executed by different processors are asynchronous, i.e. each task can start or finish at a different time.
This organization represents a real parallel computer. Example: the Graphics Processing Unit (GPU).
SIMD (Single Instruction stream Multiple Data stream)
SIMD is a vector architecture and is used for data-level parallelism.
In a vector architecture, data are collected from memory and placed in proper order into a set of vector registers. These registers are operated on sequentially using pipelined execution units, and the results are written back to memory.
If two vectors A and B are to be added and the result is to be stored in the vector C, this is written as C = A + B, where A(1), A(2), etc. are vector elements.
Figure 1 shows a single 'add' pipeline that adds vectors A and B and stores the result in vector C. Here one addition is performed per cycle.
Figure 2 uses four add pipelines, or lanes, so it completes four additions per cycle. The number of clock cycles required to execute a vector addition is therefore reduced by a factor of 4. Each vector lane uses a portion of the vector register.
Figure 3 also uses four lanes, but each lane contains more than one functional unit: three functional units (an FP add, an FP multiply, and a load/store unit) are provided. The elements of a single vector are interleaved across the four lanes; the vector storage is divided across the lanes, and each lane holds every fourth element of each vector register.
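The lane interleaving just described can be made concrete with a small C sketch (the 4-lane layout follows the figures; the names are illustrative): element i of a vector register is held by lane i mod 4.

#include <stdio.h>

#define LANES 4   /* four lanes, as in figures 2 and 3 */

int main(void)
{
    for (int i = 0; i < 8; i++) {       /* first few of the 64 elements */
        int lane = i % LANES;           /* lane that holds element i    */
        int slot = i / LANES;           /* its position within the lane */
        printf("element %d -> lane %d, slot %d\n", i, lane, slot);
    }
    return 0;
}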
Older array processors used 64 ALUs to perform 64 additions simultaneously.
SIMD vector architectures instead use a smaller number of ALUs (even one) and pass the data through lanes and pipelines. This reduces the hardware cost.
In the MIPS vector architecture, 32 vector registers are provided, and each register holds 64 vector elements, each 64 bits in size.
The hardware gets the addresses of the vector elements from these vector registers; such indexed accesses are called gather-scatter.
The data need not be contiguous in main memory. Indexed load instructions gather data from the main memory and put them into contiguous vector elements, and indexed store instructions scatter vector elements across main memory.
The number of elements in a vector operation is not encoded in the instruction or opcode; it is held in a separate register.
MIPS vector instructions are obtained by appending the letter 'v' to MIPS instructions, for example:
addv.d # adds two double-precision vectors; this instruction accesses its inputs using two vector registers
addvs.d # takes one input from a scalar register and the other from a vector register; the scalar input is added to each element of the vector
lv # load vector (double-precision data)
sv # store vector (double-precision data)
Problem:
Write a MIPS program using vector instructions to solve Y = a*X + Y, where X and Y are vectors (arrays) of 64 double-precision floating-point numbers, i.e. 64 numbers of 64-bit size each, stored in memory.
Assume that the starting address of X is in $s0 and the starting address of Y is in $s1.
Solution:
l.d $f0, a($sp) # load scalar ‘a’ into f0 register
lv $v1, 0($s0) # load vector X pointed by register s0 into v1 register
mulvs.d $v2, $v1, $f0 # multiply vector v1 by scalar f0 and store the result in vector v2
lv $v3, 0($s1) # load vector Y in v3
addv.d $v4, $v2, $v3 # add vector Y (in v3) to the product (in v2) and store the result in vector v4
sv $v4, 0($s1) # store the result v4 in the vector Y (pointed by register s1)
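For comparison, here is the same computation (the classic DAXPY kernel) as a plain scalar C loop; the six vector instructions above replace this entire 64-iteration loop:

/* Scalar C equivalent of the vector program: Y = a*X + Y (DAXPY) */
void daxpy(double a, const double *x, double *y)
{
    for (int i = 0; i < 64; i++)   /* 64 double-precision elements */
        y[i] = a * x[i] + y[i];
}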
Multithreading
Hardware Multithreading
This video explains three types of hardware multithreading
1. Hardware Multithreading
Thread: a thread is a sequence of instructions that can run independently of other programs.
When a sequence of instructions is being executed, the processor may have to wait if the next instruction or its data is not available. This is called stalling.
Instead of waiting, the processor may switch to another thread, execute it, and come back to this thread later.
Multithreading: switching from one thread (the stalled thread) to another is known as multithreading. All the threads generally share a single address space, while each thread keeps its own program counter, stack, and register state.
Process: a process includes one or more threads and their address space. Switching from one process to another invokes the operating system (OS).
The main difference between switching threads and switching processes: multithreading stays within a single address space and does not invoke the OS, whereas a process switch crosses address spaces and requires the help of the operating system. A thread can therefore be viewed as a lightweight process, smaller than a full process.
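The shared-address-space point can be illustrated with a small software sketch using POSIX threads (pthreads is an assumption here; the slides discuss hardware threads, but the address-space behaviour is the same): both threads update the same global variable, and no OS address-space switch is needed.

#include <pthread.h>
#include <stdio.h>

int shared_counter = 0;          /* one copy, visible to every thread */

void *worker(void *arg)
{
    (void)arg;
    shared_counter++;            /* unsynchronized here; locks are covered later */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", shared_counter);   /* both increments hit the same variable */
    return 0;
}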
Types of multithreading:
Fine-grained multithreading
Coarse-grained multithreading
Simultaneous multithreading
1.1 Fine-grained multithreading
Switching between threads happens on each instruction, i.e. on every clock cycle.
Switching is done in a round-robin fashion: the 1st instruction of the 1st thread is executed, then the 1st instruction of the 2nd thread, and so on until the 1st instruction of the last thread completes; then the 2nd instruction of the 1st thread is executed, then the 2nd instruction of the 2nd thread, and so on.
Advantage: if any thread is stalled, it is skipped and the next thread continues. This is called interleaving, and it improves throughput.
Disadvantage: threads that are ready to execute their next instruction without stalling must still wait for their turn behind the other threads. This slows down the execution of individual threads.
1.2 Coarse-grained multithreading
Threads are switched only when costly stalls (e.g. a last-level cache miss) occur. Frequent thread switching is thus avoided, and so is the resulting slowdown of individual thread execution.
Disadvantage: because instructions are issued from a single thread, the pipeline must be emptied or frozen on a switch, and the new thread starts completing instructions only after the pipeline has been refilled. This is called start-up overhead. For shorter stalls the processor keeps issuing instructions from the same thread, so the throughput lost to short stalls cannot be recovered.
Advantage: this approach reduces the penalty of high-cost stalls, because the pipeline refill time is negligible compared to the stall time.
1.3 Simultaneous multithreading (SMT)
Multiple instructions are executed from multiple independent threads using register renaming. Dependencies between threads are handled by dynamic scheduling. SMT does not switch resources on every clock cycle.
SMT processors have dynamically scheduled pipelines that exploit both thread-level parallelism and instruction-level parallelism, and they provide more functional units to support this parallelism.
Intel calls this hyper-threading; AMD calls it SMT.
Figure 1 shows three threads that would each execute independently, with stalls. The empty rows indicate unused clock cycles (stalls). One row of each thread issues to the pipeline per clock cycle.
Figures 2a, 2b, and 2c show how the three threads of figure 1 are executed when fine-grained, coarse-grained, and simultaneous multithreading, respectively, are applied.
[Figure 1: three independent threads A, B, and C, each containing costly stalls. Figure 2: the same three threads executed with (2a) fine-grained, (2b) coarse-grained, and (2c) simultaneous multithreading; time runs down the vertical axis.]
Multiprocessing
Multiprocessing Systems
Multicore processor & Multiprocessor
1. Multiprocessing Systems
1. Multiprocessor system
1.1 Shared-memory system (tightly coupled system)
1.1.1 Uniform Memory Access (UMA)
1.1.2 Non-uniform Memory Access (NUMA)
1.2 Distributed-memory system (loosely coupled system): cluster
Two types of multiprocessing:
More than one processor on a single chip: a multicore processor. A quad-core processor can handle 4 threads and an octa-core 8 threads; more cores give better multiprocessing. It uses an on-chip network to connect the processors.
More than one processor connected in a single system: a multiprocessor system. It uses a local area network (LAN) to connect the systems.
1.1 Shared Memory (tightly coupled) System
• All the processors share a single global memory. The global memory may be divided into many modules, but a single address space is used (e.g. a multicore processor).
• Processors communicate using shared locations (variables) in the global memory.
• These shared data are coordinated using locks (synchronization primitives), which allow the data to be accessed by only one processor at a time (see the sketch below).
• Shared-memory systems use a common bus, a crossbar, or a multistage network to connect the processors, memory, and I/O devices.
• Programs stored in the virtual address space of each processor can run independently.
• This organization is used for high-speed real-time processing and provides high throughput compared to loosely coupled systems.
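A minimal pthreads sketch of the lock idea from the list above (pthreads and the variable names are illustrative assumptions): the mutex guarantees that only one thread updates the shared location at a time.

#include <pthread.h>
#include <stdio.h>

int shared_data = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* the synchronization primitive */

void *safe_update(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);      /* acquire: other threads now wait       */
    shared_data++;                  /* only one thread is in here at a time  */
    pthread_mutex_unlock(&lock);    /* release: the next thread may enter    */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, safe_update, NULL);
    pthread_create(&t2, NULL, safe_update, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared_data = %d\n", shared_data);   /* always 2 */
    return 0;
}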
1.1.1 Uniform Memory Access (UMA) System
UMA systems are divided into two types:
1. Symmetric UMA (SUMA)
2. Asymmetric UMA (AUMA)
In SUMA, all processors are identical. Processors may have local cache memories and I/O devices. Physical memory is uniformly shared by all processors, with equal access time to all words.
In AUMA, one master processor executes the operating system, and the other processors may be dedicated to special tasks such as graphics rendering or mathematical computation.
[Figure: UMA organization. Processors 1 ... n, each with its own cache, connect through an interconnection network to the shared memory and I/O.]
1.1.2 Non-uniform Memory Access (NUMA) System
In a NUMA system, each processor may have a local memory holding its own private program and private data. The collection of all local memories forms the global memory, i.e. the local memory of one processor may be accessed by another processor using shared variables.
The time taken for a remote processor to access the local memory of another processor is not uniform: it depends on the locations of the processor and the memory.
NUMA machines can scale to larger sizes, with lower latency (access time) to local memory.
[Figure: NUMA organization. Processors P1 ... Pn with local memories M1 ... Mn connect through a processor-memory interconnection network and an interrupt-signal interconnection network; an I/O processor reaches devices D1 ... Dn through I/O channels and an I/O interconnection network.]
1.2 Distributed Memory System (DMS)
(Loosely Coupled System)
• DMS systems do not use a global shared memory, because a global memory creates memory conflicts and slows down execution.
• A DMS has multiple processors, and each processor has a large local memory and a set of I/O devices that are not shared with any other processor. So, this system is also called a distributed multicomputer system.
• The group of computers connected together is called a cluster, and each computer is called a node.
• These computers communicate with each other by passing messages through an interconnection network.
• To pass messages to other computers in the cluster, a 'send message' routine is used; to receive messages from other computers, a 'receive message' routine is used (see the sketch below).
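As an illustration of the send/receive routines, here is a minimal message-passing sketch using MPI (the slides name no specific library, so MPI is an assumption): node 0 sends an integer and node 1 receives it. Compile with mpicc and run with mpirun -np 2.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which node of the cluster am I? */

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* 'send message'    */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                          /* 'receive message' */
        printf("node 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}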
Multiprocessor Network Topologies
o The way the nodes (processor and memory) are connected is called the network topology.
o Multicore chips require on-chip networks to connect the cores together.
o Clusters require local area networks to connect the servers together.
o Networks are drawn as graphs, and the edges of the graph represent the links of the communication network.
o Nodes (computers or processor-memory nodes) are connected to this graph through network switches.
o In the following diagrams, coloured circles represent switches and black squares represent processor-memory nodes.
o Network costs include the number of switches, the number of links per switch, the length of the links, and the width (number of bits) per link.
Bus Topology
1. Uses a shared set of wires that allows broadcasting messages to all nodes at the same time.
2. Network bandwidth = bandwidth of the bus.
Ring Topology
1. Messages must travel along intermediate nodes until they arrive at the final destination. A ring is capable of many simultaneous transfers.
2. Network bandwidth = bandwidth of each link x number of links. If a link is as fast as the bus, a ring is P times faster than a bus in the best case.
Fully Connected Network
3. In a fully connected network, every processor (P processors in total) has a bidirectional link to every other processor.
4. Total network bandwidth = P x (P - 1)/2 times the link bandwidth.
5. Bisection bandwidth = (P/2)^2 times the link bandwidth. Bisection bandwidth is calculated by dividing the machine into two halves and adding up the bandwidth of the links that cross the cut.
[Figure: a ring of switches (circles), each attached to a node (square) by a link.]
• Instead of placing a processor at every node in a network, a switch can be placed at some of these nodes.
• Switches are smaller than processor-memory-switch nodes.
Star Topology
In a star topology, all nodes are connected to a central device called a hub using point-to-point connections.
Boolean Cube (n-cube) Network
For a cube, n = 3, so 2^n = 2^3 = 8 nodes are connected. Each switch uses n (here 3) links to other switches, and one more link goes to the processor.
2D Grid or Mesh Network
Here n = 2, so each switch uses n (here 2) links to neighbouring switches, and one link goes to the processor.
[Figure: an n-cube of 8 nodes and a 2D mesh; circles are switches, squares are nodes.]
Multistage networks: messages may travel in multiple steps.
1. Fully connected or crossbar networks: any node can communicate with any other node in one pass through the network.
2. Omega networks: use less hardware than the crossbar network, but contention may occur between messages.
Crossbar Network
n = number of processors = 8
Number of switches = n^2 = 64
Any node can communicate with any other node in one pass through the network.
Omega Network
Uses less hardware than the crossbar network. There are 12 switch boxes in the network shown in the figure, and each switch box has 4 smaller switches, so the number of switches used = 12 x 4 = 48.
This is given by the formula 2n log2 n; here n = number of processors = 8, so the number of switches used = 2 x 8 x log2 8 = 2 x 8 x 3 = 48, as above.
This network cannot support all combinations of message passing, and contention may occur between messages. For example, in the network shown, P0 cannot send messages to P6 because these two are not connected, and if P1 sends a message to P4, then P0 may not send a message to P4 or P5 at the same time.
[Figure: an 8-processor omega network built from switch boxes A, B, C, and D; each switch box contains 4 smaller switches.]
Graphics Processing Unit (GPU)
Architecture of GPU (MIMD Processor)
Graphics Processing Unit (GPU)
o The GPU is a processor specially designed for handling graphics-rendering tasks.
o GPUs are used to accelerate processing in video editing, video-game rendering, 3D modelling (e.g. AutoCAD), AI-based tasks, etc.
o The GPU breaks complex problems into many tasks and works on them in parallel.
o GPUs are highly multithreaded.
Central Processing Unit (CPU)
1. The CPU has general-purpose instructions and is more suitable for serial processing.
2. CPUs have just a few cores with caches and can handle only a few threads at a time.
3. The CPU has a large main memory that is oriented toward low latency.
4. The CPU is mainly designed for instruction-level parallelism.
Graphics Processing Unit (GPU)
1. The GPU has its own special instructions to handle graphics and is more suitable for parallel processing.
2. GPUs have hundreds of cores and can handle thousands of threads simultaneously.
3. The GPU has a separate large main memory that is oriented toward bandwidth rather than latency and provides high throughput.
4. The GPU is designed for data-level parallelism.
GPU Architecture - NVIDIA
NVIDIA is an American company. It developed the 'Compute Unified Device Architecture' (CUDA) for its graphics processing units; Fermi is the commercial name of one such architecture. GeForce is a brand of graphics processing units designed by NVIDIA.
A GPU contains a collection of multithreaded SIMD processors and hence is an MIMD processor. In the Fermi architecture, the number of SIMD processors is 7, 11, 14, or 15.
A CUDA program launches kernels that run in parallel. The GPU executes a kernel on a grid; a grid has many thread blocks, and a thread block consists of many threads that are executed in parallel (figure 2). A code sketch follows the list of schedulers below.
Each thread within the thread block is called a machine object; it has a thread ID, program instructions, a program counter, registers, per-thread private memory, inputs, and outputs. The machine object is created, managed, scheduled, and executed by the GPU.
The GPU has two schedulers:
1. Thread Block Scheduler: assigns blocks of threads to the multithreaded SIMD processors.
2. SIMD Thread Scheduler: located within each SIMD processor, with its own controller. It identifies the threads that are ready to run, schedules them for execution, and sends them to the dispatcher when needed.
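The kernel/grid/thread-block hierarchy can be seen directly in CUDA source code. Below is a minimal sketch (the kernel name, sizes, and use of unified memory are illustrative choices, not from the slides): the kernel is launched on a grid of thread blocks, and each thread derives its own ID from its block and thread indices.

#include <cstdio>

__global__ void scale(float *x, float a, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  /* per-thread ID        */
    if (tid < n)
        x[tid] *= a;                 /* each thread handles one element       */
}

int main()
{
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));   /* memory visible to CPU and GPU */
    for (int i = 0; i < n; i++) x[i] = 1.0f;

    int threadsPerBlock = 256;                  /* threads in one thread block  */
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  /* blocks in grid */
    scale<<<blocks, threadsPerBlock>>>(x, 2.0f, n);            /* launch kernel  */
    cudaDeviceSynchronize();                    /* wait for the grid to finish  */

    printf("x[0] = %f\n", x[0]);                /* prints 2.0 */
    cudaFree(x);
    return 0;
}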
The off-chip DRAM is shared by all thread blocks; it is called the global memory or GPU memory.
The local memory of each SIMD processor is shared by the lanes within that processor, but it is not shared between two SIMD processors.
Depending on the Fermi variant, 7, 11, 14, or 15 SIMD processors are used.
[Figure 1: a multithreaded SIMD processor. Figure 2: the GPU memory structure, showing threads grouped into thread blocks within grids 0 and 1.]
Scheduling
Parallel processing involves:
o Scheduling: a method of sharing resources (processor time, data lines) among threads and processes and balancing the load to achieve good quality of service.
o Partitioning the work into parallel pieces: the task must be divided equally among all processors to avoid idle time.
o Balancing the load evenly between the processors: processors should not stay idle for long.
o Time required for synchronization: each processor should complete the allotted work in the specified time and should get its next work in time; otherwise parallel processing will not be possible.
o Overhead for communication: inter-process communication has a time overhead, because the system has to know which processes have to communicate and which do not.
Types of scheduling
Long term scheduling
Medium term scheduling
Short term scheduling
Dispatcher
Problem:
To achieve a speed-up of 90 with 100 processors, what percentage of the original computation can be sequential?
Solution (Amdahl's Law):
Speed-up = 1 / ((1 - F) + F/100), where F is the fraction of the execution time that can use the 100 processors.
Substituting Speed-up = 90:
90 = 1 / ((1 - F) + F/100)
Solving for F: 1 - F x (99/100) = 1/90, so F = (1 - 1/90) x (100/99) ≈ 0.999.
So, to achieve a speed-up of 90 with 100 processors, the sequential percentage can be only about 0.1%.
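A quick numeric check of this result (a standalone C sketch; the variable names are illustrative): solve the speed-up formula for F, then substitute back.

#include <stdio.h>

int main(void)
{
    double S = 90.0;     /* target speed-up      */
    double P = 100.0;    /* number of processors */

    /* From S = 1 / ((1 - F) + F/P), solving for F: */
    double F = (1.0 - 1.0 / S) * (P / (P - 1.0));
    double speedup = 1.0 / ((1.0 - F) + F / P);   /* substitute back */

    printf("parallel fraction F = %.5f (sequential = %.3f%%)\n", F, (1.0 - F) * 100.0);
    printf("check: speed-up = %.1f\n", speedup);  /* prints 90.0 */
    return 0;
}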