July – Oct 2021 CSC 457 Lecture Notes 1
(Knowledge for development)
KIBABII UNIVERSITY(KIBU)
SCHOOL OF COMPUTING AND INFORMATICS
CSC 457E: ADVANCED MICROPROCESSOR ARCHITECTURE
COURSE OUTLINE
TIME: Tuesdays 11am – 1PM Room: ABB302
Lecturer: Eric Sifuna, BSc EEE ; MSc IS; R.Eng; MIEEE
Cellphone: 0707327418 Email: sifunaes@gmail.com
July – Oct 2021 CSC 457 Lecture Notes 2
Aim/Purpose:
 The purpose of this course is to teach students the
fundamentals of microprocessor and microcontroller
systems.
July – Oct 2021 CSC 457 Lecture Notes 3
 Learning outcomes:
At the end of this course, successful students should be able to:
 Describe how the hardware and software components of a microprocessor-based
system work together to implement system-level features
 Integrate both hardware and software aspects of digital devices (such as memory
and I/O interfaces) into microprocessor-based systems
 Gain hands-on experience with common microprocessor peripherals such as
UARTs, timers, and analog-to-digital and digital-to-analog converters
 Get practical experience in applied digital logic design and assembly-language
programming
 Use the tools and techniques used by practicing engineers to design, implement,
and debug microprocessor-based systems
July – Oct 2021 CSC 457 Lecture Notes 4
 High Performance microprocessor design:
 Computational Models
 An Argument for Parallel Architectures
 Internetworking Performance Issues and Scalability of Parallel Architectures
 Performance Evaluation:
 Performance Modelling Methods
 Pipeline Freeze Strategies, Prediction Strategies, Composite Strategies,
Benchmark Performance
 Pipelined processors and superpipeline concepts, Solutions to pipeline
hazards (e.g. branch prediction and delayed branching).
Course Topics
July – Oct 2021 CSC 457 Lecture Notes 5
 Memory and I/O systems:
 Cache Memory, Cache addressing, Multilevel caches, Virtual Memory,
Paged, Segmented, and Paged-Segmented Organizations;
 Address Translation:
 Direct Page Table Translation, Inverted Page Table, Translation Lookaside Buffer,
Virtual Memory Accessing rules, Shared Memory Multiprocessors,
Partitioning, Scheduling, Communication and Synchronization, Memory
Coherency.
 Superscalar Processor Design:
 Superscalar Concepts, Execution Model, Exception Recovery
 Register DataFlow, Out-of-Order Issue and Basic Software Scheduling.
Course Topics (cont..)
July – Oct 2021 CSC 457 Lecture Notes 6
 Instruction Level Parallelism Exploration:
 VLIW, simultaneous multithreading, processor coupling.
 Advanced Speculation Techniques:
 Speculation Techniques for Improving Load Related Instruction
Scheduling
 Performance Analysis for OpenMP Applications
 Fine-Grain Distributed Shared Memory on Clusters
 Future Processor Architectures:
 MAJC, Raw Network Computing, Quantum Computing
Course Topics (cont..)
July – Oct 2021 CSC 457 Lecture Notes 7
Delivery: Blended Learning, small group discussion, case
studies, individual projects and tutorials
Instructional Material and/or Equipment: Computers,
Learning Management System, writing boards, writing materials,
projectors etc.
Recommended Core Reading:
1. Microprocessors and Programmed Logic, Short, K., Prentice Hall.
Other references
1. Texts, Audio and video cassettes, computer software, other
resources
CSC 457 Lecture Notes 8
 High Performance microprocessor design:
 Computational Models
 An Argument for Parallel Architectures
 Internetworking Performance Issues and Scalability of
Parallel Architectures
In this introduction …
July – Oct 2021
CSC 457 Lecture Notes 9
1. Computational Models for High Performance
Microprocessors
July – Oct 2021
CSC 457 Lecture Notes 10
 Computational Models
o High-performance RISC-based microprocessors have defined
much of the recent history of high-performance computing
o A Complex Instruction Set Computer (CISC)
instruction set is made up of powerful primitives,
close in functionality to the primitives of high-level
languages
o “If RISC is faster, why did people bother with
CISC designs in the first place?”
 RISC wasn’t always both feasible and affordable
July – Oct 2021
CSC 457 Lecture Notes 11
Computational Models
o High-level language compilers were commonly available, but
they didn’t generate the fastest code, and they weren’t terribly
thrifty with memory.
o When programming, you needed to save both space and time.
A good instruction set was both easy to use and powerful
o Computers had very little storage by today's standards. An
instruction that could roll all the steps of a complex operation,
such as a do-loop, into a single opcode was a plus, because
memory was precious.
o Complex instructions saved time, too. Almost every large computer
following the IBM 704 had a memory system that was slower than its
central processing unit (CPU). When a single instruction can perform
several operations, the overall number of instructions retrieved from
memory can be reduced.
July – Oct 2021
CSC 457 Lecture Notes 12
Computational Models
o There were several obvious pressures that
affected the development of RISC:
- The number of transistors that could fit on a single chip was
increasing. It was clear that one would eventually be able to fit all
the components from a processor board onto a single chip.
- Techniques such as pipelining were being explored to improve
performance. Variable-length instructions and variable-length
instruction execution times (due to varying numbers of microcode
steps) made implementing pipelines more difficult.
- As compilers improved, they found that well-optimized sequences
of streamlined instructions often outperformed the equivalent
complicated multi-cycle instructions.
July – Oct 2021
CSC 457 Lecture Notes 13
Computational Models
o The RISC designers sought to create a high performance
single-chip processor with a fast clock rate.
o When a CPU can fit on a single chip, its cost is decreased,
its reliability is increased, and its clock speed can be
increased.
o While not all RISC processors are single-chip
implementations, most use a single chip.
o To accomplish this task, it was necessary to discard the
existing CISC instruction sets and develop a new minimal
instruction set that could fit on a single chip. Hence the
term Reduced Instruction Set Computer.
July – Oct 2021
CSC 457 Lecture Notes 14
Computational Models
o The earliest RISC processors had no floating-point support in
hardware, and some did not even support integer multiply in
hardware. However, these instructions could be implemented using
software routines that combined other instructions (a microcode of
sorts).
o These earliest RISC processors (most severely reduced) were not
overwhelming successes for four reasons:
 It took time for compilers, operating systems, and user software to be retuned to
take advantage of the new processors.
 If an application depended on the performance of one of the software-implemented
instructions, its performance suffered dramatically.
 Because RISC instructions were simpler, more instructions were needed to
accomplish the task.
 Because all the RISC instructions were 32 bits long, and commonly used CISC
instructions were as short as 8 bits, RISC program executables were often larger.
July – Oct 2021
CSC 457 Lecture Notes 15
Computational Models
o As a result of these last two issues, a RISC program may have to fetch
more memory for its instructions than a CISC program. This increased
appetite for instructions actually clogged the memory bottleneck until
sufficient caches were added to the RISC processors.
o RISC processors quickly became known for their affordable high-speed
floating-point capability compared to CISC processors. This excellent
performance on scientific and engineering applications effectively
created a new type of computer system, the workstation.
July – Oct 2021
CSC 457 Lecture Notes 16
 Parallel Architectures
o Concurrency and parallelism are related concepts, but they are distinct.
Concurrent programming happens when several computations are happening
in overlapping time periods. Your laptop, for example, seems like it is doing a
lot of things at the same time even though there are only 1, 2, or 4 cores. So,
we have concurrency without parallelism.
o At the other end of the spectrum, the CPU in your laptop is carrying out pieces
of the same computation in parallel to speed up the execution of the
instruction stream.
July – Oct 2021
CSC 457 Lecture Notes 17
 Parallel Architectures
o Parallel computing occupies a unique spot in the universe
of distributed systems.
o Parallel computing is centralized—all of the processes are typically under
the control of a single entity.
o Parallel computing is usually hierarchical—parallel architectures are
frequently described as grids, trees, or pipelines.
o Parallel computing is co-located—for efficiency, parallel processes are
typically located very close to each other, often in the same chassis or at
least the same data center.
o These choices are driven by the problem space and the
need for high performance.
July – Oct 2021
CSC 457 Lecture Notes 18
 Parallel Architectures
o Definition of a parallel computer: A set of independent
processors that can work cooperatively to solve a problem
o A parallel system consists of an algorithm and the parallel
architecture on which the algorithm is implemented.
o Note that an algorithm may have different performance on
different parallel architectures.
o For example, an algorithm may perform differently on a
linear array of processors and on a hypercube of
processors
July – Oct 2021
CSC 457 Lecture Notes 19
 Parallel Architectures
o Why Use Parallel Computing?
 Single processor speeds are reaching their ultimate limits
 Multi-core processors and multiple processors are the most
promising paths to performance improvements
o Concurrency: The property of a parallel algorithm that a number of
operations can be performed by separate processors at the same time.
Concurrency is the key concept in the design of parallel algorithms:
 Requires a different way of looking at the strategy to solve a
problem
 May require a very different approach from a serial program to
achieve high efficiency
July – Oct 2021
CSC 457 Lecture Notes 20
 Parallel Architectures
July – Oct 2021
CSC 457 Lecture Notes 21
 Parallel Architectures
o Protein folding problems involve a large number of independent
calculations that do not depend on data from other calculations
o Concurrent calculations with no dependence on the data from
other calculations are termed Embarrassingly Parallel
o These embarrassingly parallel problems are ideal for solution by
HPC methods, and can realize nearly ideal concurrency and
scalability
o Flexibility in the way a problem is solved is beneficial to finding
a parallel algorithm that yields a good parallel scaling.
o Often, one has to employ substantial creativity in the way a
parallel algorithm is implemented to achieve good scalability.
July – Oct 2021
CSC 457 Lecture Notes 22
 Parallel Architectures
o Understand the Dependencies
o One must understand all aspects of the problem to be solved, in
particular the possible dependencies of the data.
o It is important to understand fully all parts of a serial code that
you wish to parallelize. Example: pressure forces (local) vs.
gravitational forces (global)
o When designing a parallel algorithm, always remember:
 Computation is FAST
 Communication is SLOW
 Input/Output (I/O) is INCREDIBLY SLOW
o In addition to concurrency and scalability, there are a number of
other important factors in the design of parallel algorithms:
Locality; Granularity; Modularity; Flexibility; Load balancing
July – Oct 2021
CSC 457 Lecture Notes 23
 Parallel Architectures
Parallel Computer Architectures
o Virtually all computers follow the basic design of the von
Neumann architecture, as follows:
 Memory stores both instructions and data
 Control unit fetches instructions from memory, decodes
instructions, and then sequentially performs operations to
perform programmed task
 Arithmetic Unit performs mathematical operations
 Input/Output is the interface to the user
July – Oct 2021
CSC 457 Lecture Notes 24
 Parallel Architectures
Flynn’s Taxonomy
o SISD: This is a standard serial computer: one set of instructions, one
data stream
o SIMD: All units execute same instructions on different data streams
(vector)
- Useful for specialized problems, such as graphics/image processing
- Old Vector Supercomputers worked this way, as do modern GPUs
o MISD: Single data stream operated on by different sets of instructions,
not generally used for parallel computers
o MIMD: Most common parallel computer, each processor can execute
different instructions on different data streams
-Often constructed of many SIMD subcomponents
July – Oct 2021
CSC 457 Lecture Notes 26
 Parallel Architectures
Parallel Computer Memory Architectures
o Shared Memory – memory shared among various CPUs
o Distributed Memory - each CPU has its own memory
o Hybrid Distributed Shared Memory
July – Oct 2021
CSC 457 Lecture Notes 27
 Parallel Architectures
Relation to Parallel Programming Models
o OpenMP: Multi-threaded calculations occur within shared-
memory components of systems, with different threads working
on the same data (a short OpenMP sketch in C follows this list).
o MPI: Based on a distributed-memory model; data associated with
another processor must be communicated over the network
connection.
o GPUs: Graphics Processing Units (GPUs) incorporate many
(hundreds of) computing cores with a single control unit, so this
is a shared-memory model.
o Processors vs. Cores: most modern processors contain several
cores on one chip; each core can execute its own instruction
stream, so a single multicore chip behaves like a small
shared-memory parallel computer.
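To make the OpenMP bullet above concrete, here is a minimal sketch in C (assuming a compiler with OpenMP support, e.g. gcc -fopenmp); the array, its size N, and the computation are illustrative only, not part of the course material.

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N];
    double sum = 0.0;

    /* All threads share the array a[]; the loop iterations are divided
       among them, matching the shared-memory model described above. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = (double)i * 0.5;   /* each thread works on its own chunk  */
        sum += a[i];              /* partial sums combined by reduction  */
    }

    printf("threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}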
July – Oct 2021
CSC 457 Lecture Notes 28
 Parallel Architectures
Embarrassingly Parallel
o Refers to an approach that involves solving many similar but
independent tasks simultaneously
o Little to no coordination (and thus no communication) between
tasks
o Each task can be a simple serial program
o This is the “easiest” type of problem to implement in a parallel
manner. Essentially requires automatically coordinating many
independent calculations and possibly collating the results.
o Examples: Computer Graphics and Image Processing; Protein
Folding Calculations in Biology; Geographic Land Management
Simulations in Geography; Data Mining in numerous fields;
Event simulation and reconstruction in Particle Physics
July – Oct 2021
CSC 457 Lecture Notes 29
 Internetworking Performance Issues
and Scalability of Parallel Architectures
o Performance Limitations of Parallel Architectures
o Adding additional resources doesn’t necessarily speed up a
computation. There’s a limit defined by Amdahl’s Law.
o The basic idea of Amdahl’s law is that a parallel computation’s
maximum performance gain is limited by the portion of the
computation that has to happen serially, which creates a
bottleneck.
o The serial portion includes scheduling, resource allocation,
communication, synchronization, etc.
o For example, if a computation that takes 20 hours on a single CPU has a serial
portion that takes 1 hour (5%), then Amdahl’s law shows that no matter how
many processors you put on the task, the maximum speed up is 20x.
Consequently, after a point, putting additional processors on the job is just
wasted resource.
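A small sketch of the calculation above, using the standard form of Amdahl's law, S(N) = 1 / ((1 - p) + p/N), where p is the parallelizable fraction and N the number of processors. The 20-hour/1-hour figures come from the example above; everything else is illustrative.

#include <stdio.h>

/* Amdahl's law: speedup with N processors when a fraction p of the
   work can be parallelized. */
static double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    double serial_hours = 1.0, total_hours = 20.0;
    double p = (total_hours - serial_hours) / total_hours;   /* 0.95 */

    for (int n = 1; n <= 4096; n *= 4)
        printf("N = %4d  speedup = %.2f\n", n, amdahl_speedup(p, n));

    /* As N grows, the speedup approaches 1/(1-p) = 20x, so adding
       processors beyond a point is just wasted resource. */
    printf("limit = %.1fx\n", 1.0 / (1.0 - p));
    return 0;
}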
July – Oct 2021
CSC 457 Lecture Notes 30
 Internetworking Performance Issues
and Scalability of Parallel Architectures
July – Oct 2021
CSC 457 Lecture Notes 31
 Internetworking Performance Issues
and Scalability of Parallel Architectures
Process Interaction
o Except for embarrassingly parallel algorithms, the threads in a
parallel computation need to communicate with each other.
There are two ways they can do this;
o Shared memory – the processes can share a storage location
that they use for communicating. Shared memory can also be
used to synchronize threads, by using the shared location as a
semaphore (a minimal pthreads sketch follows this list).
o Messaging – the processes communicate via messages. This
could be over a network or a special-purpose bus. Networks for
this use are typically hyper-local and designed for this purpose.
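A minimal sketch of the shared-memory style of interaction described above, assuming POSIX threads (compile with -lpthread); the shared counter and the thread count are arbitrary illustrations, with a mutex standing in for the semaphore mentioned above.

#include <pthread.h>
#include <stdio.h>

static long shared_counter = 0;                   /* shared storage location */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);                /* synchronize access      */
        shared_counter++;                         /* communicate via memory  */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("final counter = %ld\n", shared_counter);   /* expect 200000 */
    return 0;
}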
July – Oct 2021
CSC 457 Lecture Notes 32
 Internetworking Performance Issues
and Scalability of Parallel Architectures
Consistent Execution
o The threads of execution for most parallel algorithms must be coupled
to achieve consistent execution.
o Parallel threads of execution communicate to transfer values between
processes. Parallel algorithms communicate not only to calculate the result,
but to achieve deterministic execution.
o For any given set of inputs, the parallel version of an algorithm should return
the same answer each time it is run, as well as the same answer that a
sequential version of the algorithm would return.
o Parallel algorithms achieve this by locking memory or otherwise sequencing
operations between threads. This communication, together with the
waiting required for sequencing, imposes a performance overhead.
o As we saw in our discussion of Amdahl’s Law, these sequential portions of a
parallel algorithm are the limiting factor in speeding up execution.
July – Oct 2021
CSC 457 Lecture Notes 33
 Internetworking Performance Issues
and Scalability of Parallel Architectures
o (Read tutorial on Performance and Scalability on Parallel Computing
attached)
July – Oct 2021
CSC 457 Lecture Notes 34
Next …....
Performance Evaluation
July – Oct 2021
CSC 457 Lecture Notes 35
 Performance Modelling
o The goal of performance modeling is to gain
understanding of a computer system’s performance on
various applications, by means of measurement and
analysis, and then to encapsulate these characteristics in a
compact formula.
o The resulting model can be used to gain greater
understanding of the performance phenomena involved
and to project performance to other system/application
combinations
July – Oct 2021
CSC 457 Lecture Notes 36
 Performance Modelling
o The performance profile of a given system/application
combination depends on numerous factors, including:
(1) System size; (2) System architecture; (3) Processor speed;
(4) Multi-level cache latency and bandwidth;
(5) Interprocessor network latency and bandwidth;
(6) System software efficiency; (7) Type of application;
(8) Algorithms used; (9) Programming language used;
(10) Problem size; (11) Amount of I/O.
July – Oct 2021
CSC 457 Lecture Notes 37
 Performance Modelling
o Performance models can be used to improve architecture
design, inform procurement, and guide application tuning
o It has been observed that, due to the difficulty of
developing performance models for new applications, as
well as the increasing complexity of new systems, our
supercomputers have become better at predicting and
explaining natural phenomena (such as the weather) than
at predicting and explaining their own performance or
that of other computers.
July – Oct 2021
CSC 457 Lecture Notes 38
 Performance Modelling
Applications of Performance Modelling
o Performance modeling can be used in numerous ways.
Here is a brief summary of these usages, both present-day
and future possibilities;
1. System design.
o Performance models are frequently employed by computer vendors in
their design of future systems. Typically engineers construct a
performance model for one or two key applications, and then compare
future technology options based on performance model projections.
Once performance modeling techniques are better developed, it may
be possible to target many more applications and technology options
July – Oct 2021
CSC 457 Lecture Notes 39
 Performance Modelling
Applications of Performance Modelling
2. Runtime estimation
o The most common application for a performance model is to enable a
scientist to estimate the runtime of a job when the input parameters
for the job are changed, or when a different number of processors is
used in a parallel computer system.
o One can also estimate the largest size of system that can be used to
run a given problem before the parallel efficiency drops to an
unacceptable level.
3. System tuning
o An example of using performance modeling for system tuning is where
a performance model is used to diagnose and rectify a misconfigured
channel buffer, yielding a doubling of network performance for
programs sending short messages
July – Oct 2021
CSC 457 Lecture Notes 40
 Performance Modelling
Applications of Performance Modelling
4. Application Tuning
o If a memory performance model is combined with application
parameters, one can predict how cache hit rates would change if a
different cache blocking factor were used in the application (a
loop-blocking sketch in C follows this list).
o Once the optimal cache blocking factor has been identified, the code
can be permanently changed.
o Simple performance models can even be incorporated into an
application code, permitting on-the-fly selection of different program
options.
o Performance models, by providing performance expectations based on
the fundamental computational characteristics of algorithms, can also
enable algorithmic choice before going to the trouble to implement all
the possible choices
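A hedged sketch of the cache-blocking (loop tiling) idea referred to above: a matrix transpose written with a tunable blocking factor B. The matrix size N and the value of B are illustrative; B is exactly the kind of parameter a memory performance model would help choose.

#include <stdio.h>

#define N 1024
#define B 64                       /* cache blocking factor (tunable) */

static double a[N][N], t[N][N];

/* Blocked (tiled) transpose: each BxB tile is small enough to stay in
   cache, raising the hit rate compared with a plain row/column sweep. */
static void transpose_blocked(void)
{
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    t[j][i] = a[i][j];
}

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = i + j * 0.001;

    transpose_blocked();
    printf("t[1][2] = %f\n", t[1][2]);   /* equals a[2][1] */
    return 0;
}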
July – Oct 2021
CSC 457 Lecture Notes 41
 Pipeline Freeze Strategies
July – Oct 2021
CSC 457 Lecture Notes 42
 Pipeline Freeze Strategies
July – Oct 2021
CSC 457 Lecture Notes 43
 Branch prediction Strategies
o In a highly parallel system, conditional instructions break the
continuous flow of the program and decrease the performance of the
pipelined processor by introducing delays.
o To decrease this delay, prediction of the branch direction is necessary.
The variety of branch behaviour demands accurate branch prediction
strategies, so branch prediction is a vital part of present-day pipelined
processors.
o Branch prediction is the process of making an educated guess as to
whether a branch will be taken or not taken, based on a preset
algorithm.
o A branch is a category of instruction that causes the code to move to
another block to continue execution. Branch prediction can be either
static or dynamic
July – Oct 2021
CSC 457 Lecture Notes 44
 Prediction Strategies
o Static branch prediction means that a given branch will always be
predicted as taken or not taken without possibility of change
throughout the duration of the program.
o Dynamic branch prediction means that the predicted outcome of a
branch is dependent on an algorithm, and the prediction may change
throughout the course of the program.
o Code is able to use a combination of both static and dynamic branch
predictors based on the type of branch.
o The improvement from branch prediction depends on the number of
branches in the code, as well as on the type of prediction being used,
since different prediction methods have different rates of success.
o Overall, branch prediction provides an increase in performance for
code containing branches. The improvement comes from the
computational cycles that can be spent on useful work rather than
wasted, as they would be in a system that does not use branch prediction.
July – Oct 2021
CSC 457 Lecture Notes 45
 Prediction Strategies
o There are three different kinds of branches: forward conditional,
backward conditional, and unconditional branches.
o Forward conditional branches are when a branch evaluates to a target
that is somewhere forward in the instruction stream.
o Backward conditional branches are when a branch evaluates to a
target that is somewhere backwards in the instruction stream.
Common instances of backward conditional branches are loops.
o Unconditional branches are branches which will always occur.
July – Oct 2021
CSC 457 Lecture Notes 46
 Prediction Strategies
o A static or dynamic prediction strategy will determine which different
algorithms or methods are available for use.
o For static branch prediction, the strategy may either be predict taken, predict not
taken, or some combination that depends on the branch type, such as backward branch
predict taken, forward branch predict not taken. The third strategy is advantageous
for programs with loops because it will have a higher percentage of correctly
predicted branches for backward branches.
o Dynamic branch prediction is able to use one-level prediction, two-
level adaptive prediction, or a tournament predictor. One-level
prediction uses a counter associated with a specific branch, using that
branch’s history to predict its future outcomes.
o The address of the branch is used as an index into a table where these counters are
stored. In effect, the counter is incremented each time the branch is actually taken and
decremented each time it is not taken, so its value summarizes the branch’s recent
behaviour (a 2-bit saturating-counter sketch in C follows below).
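A minimal sketch of the one-level scheme just described, assuming a small table of 2-bit saturating counters indexed by the low bits of the branch address; the table size, address, and branch pattern are illustrative only.

#include <stdio.h>
#include <stdint.h>

#define TABLE_SIZE 1024                  /* number of 2-bit counters */

static uint8_t counters[TABLE_SIZE];     /* 0,1 = predict not taken; 2,3 = predict taken */

static int predict(uint32_t branch_addr)
{
    return counters[branch_addr % TABLE_SIZE] >= 2;    /* 1 = taken */
}

static void update(uint32_t branch_addr, int actually_taken)
{
    uint8_t *c = &counters[branch_addr % TABLE_SIZE];
    if (actually_taken && *c < 3) (*c)++;        /* saturate at 3 */
    else if (!actually_taken && *c > 0) (*c)--;  /* saturate at 0 */
}

int main(void)
{
    /* A loop branch at one fixed address, taken 9 times then not taken. */
    uint32_t addr = 0x400123;
    int correct = 0;
    for (int i = 0; i < 10; i++) {
        int taken = (i < 9);
        correct += (predict(addr) == taken);
        update(addr, taken);
    }
    printf("correct predictions: %d / 10\n", correct);
    return 0;
}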
July – Oct 2021
CSC 457 Lecture Notes 47
 Prediction Strategies
o The two-level adaptive branch prediction is very similar to the one-level
branch prediction strategy. The two-level strategy uses the same counter
concept as the one-level, except the two-level implements this counter while
taking input from other branches. This strategy may also be used to predict
the direction of the branch based on the direction and outcomes of other
branches in the program. This strategy is also called a global history counter.
o Hybrid or tournament prediction strategies use a combination of two or more
other prediction strategies. For example, any static prediction used in
conjunction with a dynamic prediction strategy would be considered a hybrid
strategy.
o All of the strategies listed here are used in practice. The two-bit counter
presented in the one-level branch prediction strategy is used in a number of
other branch prediction strategies, including a predictor for choosing which
predictor to use.
o One disadvantage to each of these strategies is that their level of
improvement for a given code will vary depending on what is written into the
code
July – Oct 2021
CSC 457 Lecture Notes 48
 Composite Strategies
July – Oct 2021
(Blank!!)
CSC 457 Lecture Notes 49
 Benchmark Performance
(Blank!!)
July – Oct 2021
CSC 457 Lecture Notes 50
 Pipeline Processor Concepts
o High performance is an important issue in microprocessor
design, and its importance has been increasing over the
years.
o To improve performance, two alternative methods exist:
(a) improve the hardware by providing faster circuits;
(b) arrange the hardware so that multiple operations can be
performed simultaneously.
o Pipelining is a way of arranging the hardware elements of
the CPU so that its overall performance is increased:
simultaneous execution of more than one instruction takes
place in a pipelined processor
July – Oct 2021
CSC 457 Lecture Notes 51
 Pipeline Processor Concepts
o A pipeline processor is composed of a sequential, linear list
of segments, where each segment performs one computational task or
group of tasks.
o There are three things that one must observe about the pipeline.
1. First, the work (in a computer, the ISA) is divided up into pieces that
more or less fit into the segments allotted for them.
2. Second, this implies that in order for the pipeline to work efficiently
and smoothly, the work partitions must each take about the same time
to complete. Otherwise, the longest partition requiring time T would
hold up the pipeline, and every segment would have to take time T to
complete its work. For fast segments, this would mean much idle time.
3. Third, in order for the pipeline to work smoothly, there must be few (if
any) exceptions or hazards that cause errors or delays within the
pipeline. Otherwise, the instruction will have to be reloaded and the
pipeline restarted with the same instruction that causes the exception.
July – Oct 2021
CSC 457 Lecture Notes 52
 Pipeline Processor Concepts
o Work Partitioning: A multicycle datapath is based on the
assumption that the computational work associated with the
execution of an instruction can be partitioned into a five-
step process: instruction fetch (IF), instruction decode and
register fetch (ID), execution (EX), memory access (MEM),
and write-back (WB).
July – Oct 2021
CSC 457 Lecture Notes 53
 Pipeline Processor Concepts
o Pipelining is one way of improving the overall processing performance of
a processor. This architectural approach allows the simultaneous
execution of several instructions.
o Pipelining is transparent to the programmer; it exploits parallelism at the
instruction level by overlapping the execution process of instructions.
o It is analogous to an assembly line where workers perform a specific
task and pass the partially completed product to the next worker
o The pipeline design technique decomposes a sequential process into
several subprocesses, called stages or segments. A stage performs a
particular function and produces an intermediate result.
o It consists of an input latch, also called a register or buffer, followed by
a processing circuit. (A processing circuit can be a combinational or
sequential circuit.)
July – Oct 2021
CSC 457 Lecture Notes 54
 Pipeline Processor Concepts
o At each clock pulse, every stage transfers its intermediate result to the
input latch of the next stage. In this way, the final result is produced
after the input data have passed through the entire pipeline, completing
one stage per clock pulse.
o The period of the clock pulse should be large enough to provide
sufficient time for a signal to traverse through the slowest stage, which
is called the bottleneck (i.e., the stage needing the longest amount of
time to complete).
o In addition, there should be enough time for a latch to store its input
signals.
o If the clock's period, P, is expressed as P = tb + tl, then tb should be
at least as large as the maximum delay of the bottleneck stage, and tl
should be sufficient for storing data into a latch
July – Oct 2021
CSC 457 Lecture Notes 55
 Pipeline Processor Concepts
Completion Time for pipelined processor
o The ability to overlap stages of a sequential process for different input
tasks (data or operations) results in an overall theoretical completion
time of Tpipe = m*P + (n-1)*P, where n is the number of input tasks, m is
the number of stages in the pipeline, and P is the clock period
o The term m*P is the time required for the first input task to get through
the pipeline, and the term (n-1)*P is the time required for the remaining
tasks.
o After the pipeline has been filled, it generates an output on each clock
cycle. In other words, after the pipeline is loaded, it will generate output
only as fast as its slowest stage.
o Even with this limitation, the pipeline will greatly outperform nonpipelined techniques,
which require each task to complete before another task’s execution sequence begins.
To be more specific, when n is large, a pipelined processor can produce output
approximately m times faster than a nonpipelined processor.
July – Oct 2021
CSC 457 Lecture Notes 56
 Pipeline Processor Concepts
July – Oct 2021
CSC 457 Lecture Notes 57
 Pipeline Processor Concepts
Pipeline Performance Measures
1. Speedup
o Now, speedup (S) may be represented as:
S = Tseq / Tpipe = n*m / (m+n -1)
The value S approaches m when n → ∞. That is, the maximum
speedup, also called ideal speedup, of a pipeline processor
with m stages over an equivalent nonpipelined processor is m.
In other words, the ideal speedup is equal to the number of
pipeline stages. That is, when n is very large, a pipelined
processor can produce output approximately m times faster
than a nonpipelined processor. When n is small, the speedup
decreases; in fact, for n=1 the pipeline has the minimum
speedup of 1.
July – Oct 2021
CSC 457 Lecture Notes 58
 Pipeline Processor Concepts
Pipeline Performance Measures
2. Efficiency
o The efficiency E of a pipeline with m stages is defined as:
E = S/m = [n*m / (m+n -1)] / m = n / (m+n -1).
The efficiency E, which represents the speedup per stage,
approaches its maximum value of 1 when n → ∞. When n = 1, E will
have the value 1/m, which is the lowest obtainable value.
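The formulas above can be checked with a short calculation. The values m = 5 stages, n = 4 instructions, and P = 10 ns anticipate the worked instruction-pipeline example a few slides further on; nothing else is assumed.

#include <stdio.h>

int main(void)
{
    double m = 5, n = 4, P = 10;             /* stages, tasks, clock period (ns) */

    double t_pipe = m * P + (n - 1) * P;     /* pipelined completion time        */
    double t_seq  = n * m * P;               /* nonpipelined completion time     */
    double S = t_seq / t_pipe;               /* speedup    = n*m / (m+n-1)       */
    double E = S / m;                        /* efficiency = n / (m+n-1)         */

    printf("Tpipe = %.0f ns, Tseq = %.0f ns\n", t_pipe, t_seq);
    printf("speedup S = %.2f (ideal %g), efficiency E = %.2f\n", S, m, E);
    return 0;
}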
July – Oct 2021
CSC 457 Lecture Notes 59
 Pipeline Processor Concepts
July – Oct 2021
CSC 457 Lecture Notes 60
 Pipeline Processor Concepts
Pipeline Types
o Pipelines are usually divided into two classes: instruction pipelines and arithmetic
pipelines. A pipeline in each of these classes can be designed in two ways: static
or dynamic.
o A static pipeline can perform only one operation (such as addition or
multiplication) at a time. The operation of a static pipeline can only be changed
after the pipeline has been drained. (A pipeline is said to be drained when the
last input data leave the pipeline.) For example, consider a static pipeline that is
able to perform addition and multiplication. Each time that the pipeline switches
from a multiplication operation to an addition operation, it must be drained and
set for the new operation.
o The performance of static pipelines is severely degraded when the operations
change often, since this requires the pipeline to be drained and refilled each
time.
o A dynamic pipeline can perform more than one operation at a time. To perform
a particular operation on an input data, the data must go through a certain
sequence of stages. In dynamic pipelines the mechanism that controls when data should be fed to the pipeline is much
more complex than in static pipelines
July – Oct 2021
CSC 457 Lecture Notes 61
 Pipeline Processor Concepts
Instruction Pipeline
o An instruction pipeline increases the performance of a processor by
overlapping the processing of several different instructions. An
instruction pipeline often consists of five stages, as follows:
1. Instruction fetch (IF). Retrieval of instructions from cache (or main memory).
2. Instruction decoding (ID). Identification of the operation to be performed.
3. Operand fetch (OF). Decoding and retrieval of any required operands.
4. Execution (EX). Performing the operation on the operands.
5. Write-back (WB). Updating the destination operands.
An instruction pipeline overlaps the process of the preceding stages for
different instructions to achieve a much lower total completion time,
on average, for a series of instructions.
July – Oct 2021
CSC 457 Lecture Notes 62
 Pipeline Processor Concepts
Instruction Pipeline
o During the first cycle, or clock pulse, instruction i1 is fetched from
memory. Within the second cycle, instruction i1 is decoded while
instruction i2 is fetched. This process continues until all the
instructions are executed. The last instruction finishes the write-
back stage after the eighth clock cycle.
o Therefore, it takes 80 nanoseconds (ns) to complete execution of all
the four instructions when assuming the clock period to be 10 ns.
The total completion time is,
Tpipe = m*P+(n-1)*P
=5*10+(4-1)*10=80 ns.
Note that in a nonpipelined design the completion time will be much
higher.
July – Oct 2021
CSC 457 Lecture Notes 63
 Pipeline Processor Concepts
Instruction Pipeline
o Note that in a nonpipelined design the completion time will be much
higher.
Tseq = n*m*P = 4*5*10 = 200 ns
o It is worth noting that a pipeline simply takes advantage of these
naturally occurring stages to improve processing efficiency.
o Henry Ford made the same connection when he realized that all cars
were built in stages and invented the assembly line in the early
1900s.
o Even though pipelining speeds up the execution of instructions, it
does pose potential problems. Some of these problems and possible
solutions are discussed next
July – Oct 2021
CSC 457 Lecture Notes 64
 Pipeline Processor Concepts
Improving the Throughput of an Instruction Pipeline
o Three sources of architectural problems may affect the throughput
of an instruction pipeline. They are fetching, bottleneck, and issuing
problems. Some solutions are given for each.
1. The fetching problem
o In general, supplying instructions rapidly through a pipeline is costly
in terms of chip area. Buffering the data to be sent to the pipeline is
one simple way of improving the overall utilization of a pipeline. The
utilization of a pipeline is defined as the percentage of time that the
stages of the pipeline are used over a sufficiently long period of
time. A pipeline is utilized 100% of the time when every stage is
used (utilized) during each clock cycle.
July – Oct 2021
CSC 457 Lecture Notes 65
 Pipeline Processor Concepts
Improving the Throughput of an Instruction Pipeline
1. The fetching problem
o Occasionally, the pipeline has to be drained and refilled, for example, whenever
an interrupt or a branch occurs. The time spent refilling the pipeline can be
minimized by having instructions and data loaded ahead of time into various
geographically close buffers (like on-chip caches) for immediate transfer into the
pipeline. If instructions and data for normal execution can be fetched before they
are needed and stored in buffers, the pipeline will have a continuous source of
information with which to work. Prefetch algorithms are used to make sure
potentially needed instructions are available most of the time. Delays from
memory access conflicts can thereby be reduced if these algorithms are used,
since the time required to transfer data from main memory is far greater than the
time required to transfer data from a buffer.
July – Oct 2021
CSC 457 Lecture Notes 66
 Pipeline Processor Concepts
Improving the Throughput of an Instruction Pipeline
2. The bottleneck problem
o The bottleneck problem relates to the amount of load (work) assigned to a stage
in the pipeline.
o If too much work is applied to one stage, the time taken to complete an operation
at that stage can become unacceptably long.
o This relatively long time spent by the instruction at one stage will inevitably create
a bottleneck in the pipeline system.
o In such a system, it is better to remove the bottleneck that is the source of
congestion. One solution to this problem is to further subdivide the stage.
Another solution is to build multiple copies of this stage into the pipeline.
July – Oct 2021
CSC 457 Lecture Notes 67
 Pipeline Processor Concepts
Improving the Throughput of an Instruction Pipeline
3. The issuing problem
o If an instruction is available, but cannot be executed for some reason, a hazard
exists for that instruction. These hazards create issuing problems; they prevent
issuing an instruction for execution. Three types of hazard are discussed here.
They are called structural hazard, data hazard, and control hazard.
o A structural hazard refers to a situation in which a required resource is not
available (or is busy) for executing an instruction.
o A data hazard refers to a situation in which there exists a data dependency
(operand conflict) with a prior instruction.
o A control hazard refers to a situation in which an instruction, such as branch,
causes a change in the program flow. Each of these hazards is explained next.
July – Oct 2021
CSC 457 Lecture Notes 68
 Pipeline Processor Concepts
1. Structural Hazard
o A structural hazard occurs as a result of resource conflicts between instructions.
One type of structural hazard that may occur is due to the design of execution
units. If an execution unit that requires more than one clock cycle (such as
multiply) is not fully pipelined or is not replicated, then a sequence of instructions
that uses the unit cannot be subsequently (one per clock cycle) issued for
execution. Replicating and/or pipelining execution units increases the number of
instructions that can be issued simultaneously.
o Another type of structural hazard that may occur is due to the design of register
files. If a register file does not have multiple write (read) ports, multiple writes
(reads) to (from) registers cannot be performed simultaneously. For example,
under certain situations the instruction pipeline might want to perform two
register writes in a clock cycle. This may not be possible when the register file has
only one write port. The effect of a structural hazard can be reduced fairly simply
by implementing multiple execution units and using register files with multiple
input/output ports
July – Oct 2021
CSC 457 Lecture Notes 69
 Pipeline Processor Concepts
2. Data Hazard
o In a nonpipelined processor, the instructions are executed one by one, and the execution
of an instruction is completed before the next instruction is started. In this way, the
instructions are executed in the same order as the program. However, this may not be true
in a pipelined processor, where instruction executions are overlapped. An instruction may
be started and completed before the previous instruction is completed. The data hazard,
which is also referred to as the data dependency problem, comes about as a result of
overlapping (or changing the order of) the execution of data-dependent instructions.
o The delaying of execution can be accomplished in two ways. One way is to delay the OF or
IF stages of i2 for two clock cycles. To insert a delay, an extra hardware component called a
pipeline interlock can be added to the pipeline. A pipeline interlock detects the
dependency and delays the dependent instructions until the conflict is resolved. Another
way is to let the compiler solve the dependency problem. During compilation, the compiler
detects the dependency between data and instructions. It then rearranges these
instructions so that the dependency is not hazardous to the system. If it is not possible to
rearrange the instructions, NOP (no operation) instructions are inserted to create delays.
July – Oct 2021
CSC 457 Lecture Notes 70
 Pipeline Processor Concepts
o There are three primary types of data hazards: RAW (read after write),
WAR (write after read), and WAW (write after write). The hazard names
denote the execution ordering of the instructions that must be maintained
to produce a valid result; otherwise, an invalid result might occur (the
three cases are illustrated in the sketch after this list).
o RAW: refers to the situation in which i2 reads a data source before i1 writes to it.
This may produce an invalid result, since the read must be performed after the write
in order to obtain a valid result.
o WAR: This refers to the situation in which i2 writes to a location before i1 reads it.
o WAW: This refers to the situation in which i2 writes to a location before i1 writes to
it.
o Note that the WAR and WAW types of hazards cannot happen when the order of
completion of instructions execution in the program is preserved. However, one way
to enhance the architecture of an instruction pipeline is to increase concurrent
execution of the instructions by dispatching several independent instructions to
different functional units, such as adders/subtractors, multipliers, and dividers. That
is, the instructions can be executed out of order, and so their execution may be
completed out of order too.
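A register-level illustration of the three hazard types, written as C statements standing in for single machine instructions on hypothetical registers r1–r4; the values are arbitrary.

#include <stdio.h>

/* Treat each statement as one machine instruction; i1 precedes i2. */
int main(void)
{
    int r1, r2 = 5, r3 = 7, r4;

    /* RAW: i2 reads r1, which i1 writes.  If i2's read overtakes i1's
       write in the pipeline, i2 uses a stale value. */
    r1 = r2 + r3;          /* i1: write r1                       */
    r4 = r1 * 2;           /* i2: read  r1  (read-after-write)   */

    /* WAR: i2 writes r2, which i1 reads.  If i2's write overtakes i1's
       read, i1 sees the wrong operand. */
    r4 = r2 + r3;          /* i1: read  r2                       */
    r2 = r3 * 2;           /* i2: write r2  (write-after-read)   */

    /* WAW: both instructions write r4.  If they complete out of order,
       the final value of r4 is wrong. */
    r4 = r2 + r3;          /* i1: write r4                       */
    r4 = r3 * 2;           /* i2: write r4  (write-after-write)  */

    printf("r1=%d r2=%d r3=%d r4=%d\n", r1, r2, r3, r4);
    return 0;
}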
July – Oct 2021
CSC 457 Lecture Notes 71
 Pipeline Processor Concepts
o The dependencies between instructions are checked statically by the
compiler and/or dynamically by the hardware at run time. This preserves
the execution order for dependent instructions, which ensures valid results.
o In general, dynamic dependency checking has the advantage of being able
to determine dependencies that are either impossible or hard to detect at
compile time. However, it may not be able to exploit all the parallelism
available in a loop because of the limited lookahead ability that can be
supported by the hardware.
o Two of the most commonly used techniques for dynamic dependency
checking are called Tomasulo's method and the scoreboard method
o Tomasulo's method increases concurrent execution of the instructions with
minimal (or no) effort by the compiler or the programmer.
o The scoreboard method: multiple functional units allow instructions to be
completed out of the original program order.
July – Oct 2021
CSC 457 Lecture Notes 72
 Pipeline Processor Concepts
3. Control Hazard
o In any set of instructions, there is normally a need for some kind of statement that allows
the flow of control to be something other than sequential. Instructions that do this are
included in every programming language and are called branches. In general, about 30% of
all instructions in a program are branches.
o This means that branch instructions in the pipeline can reduce the throughput
tremendously if not handled properly. Whenever a branch is taken, the performance of the
pipeline is seriously affected. Each such branch requires a new address to be loaded into
the program counter, which may invalidate all the instructions that are either already in
the pipeline or prefetched in the buffer. This draining and refilling of the pipeline for each
branch degrade the throughput of the pipeline to that of a sequential processor.
o Note that the presence of a branch statement does not automatically cause the pipeline to
drain and begin refilling. A branch not taken allows the continued sequential flow of
uninterrupted instructions to the pipeline. Only when a branch is taken does the problem
arise.
July – Oct 2021
CSC 457 Lecture Notes 73
 Pipeline Processor Concepts
3. Control Hazard
o Branch instructions can be classified into three groups: (1) unconditional branch,
(2) conditional branch, and (3) loop branch
o An unconditional branch always alters the sequential program flow. It sets a new
target address in the program counter, rather than incrementing it by 1 to point
to the next sequential instruction address, as is normally the case.
o A conditional branch sets a new target address in the program counter only when
a certain condition, usually based on a condition code, is satisfied. Otherwise, the
program counter is incremented by 1 as usual. A conditional branch selects a path
of instructions based on a certain condition. If the condition is satisfied, the path
starts from the target address and is called a target path. If it is not, the path
starts from the next sequential instruction and is called a sequential path.
o A loop branch in a loop statement usually jumps back to the beginning of the loop
and executes it either a fixed or a variable (data-dependent) number of times.
July – Oct 2021
CSC 457 Lecture Notes 74
 Pipeline Processor Concepts
Techniques for Reducing Effect of Branching on Processor Performance
o To reduce the effect of branching on processor performance, several techniques
have been proposed. Some of the better known techniques are branch prediction,
delayed branching, and multiple prefetching
1. Branch Prediction. In this type of design, the outcome of a branch decision is predicted
before the branch is actually executed. Therefore, based on a particular prediction, the
sequential path or the target path is chosen for execution. Although the chosen path often
reduces the branch penalty, it may increase the penalty in case of incorrect prediction.
2. Delayed Branching. The delayed branching scheme eliminates or significantly reduces
the effect of the branch penalty. In this type of design, a certain number of instructions
after the branch instruction are fetched and executed regardless of which path will be
chosen for the branch. For example, a processor with a branch delay of k executes a path
containing the next k sequential instructions and then either continues on the same path
or starts a new path from a new target address. As often as possible, the compiler tries to
fill the next k instruction slots after the branch with instructions that are independent from
the branch instruction. NOP (no operation) instructions are placed in any remaining empty
slots.
July – Oct 2021
CSC 457 Lecture Notes 75
 Pipeline Processor Concepts
Techniques for Reducing Effect of Branching on Processor Performance
3. Multiple Prefetching. In this type of design, the processor fetches both possible
paths. Once the branch decision is made, the unwanted path is thrown away. By
prefetching both possible paths, the fetch penalty is avoided in the case of an
incorrect prediction. To fetch both paths, two buffers are employed to service the
pipeline.
In normal execution, the first buffer is loaded with instructions from the next
sequential address of the branch instruction. If a branch occurs, the contents of the
first buffer are invalidated, and the secondary buffer, which has been loaded with
instructions from the target address of the branch instruction, is used as the primary
buffer.
This double buffering scheme ensures a constant flow of instructions and data to the
pipeline and reduces the time delays caused by the draining and refilling of the
pipeline. Some amount of performance degradation is unavoidable any time the
pipeline is drained, however
July – Oct 2021
CSC 457 Lecture Notes 76
 Pipeline Processor Concepts
Further Throughput Improvement of an Instruction Pipeline
o One way to increase the throughput of an instruction pipeline is
to exploit instruction-level parallelism. The common approaches
to accomplish such parallelism are called superscalar,
superpipeline, and very long instruction word (VLIW)
July – Oct 2021
CSC 457 Lecture Notes 77
 Pipeline Processor Concepts
Further Throughput Improvement of an Instruction Pipeline
1. Superscalar
o The superscalar approach relies on spatial parallelism, that is, multiple
operations running concurrently on separate hardware. This approach
achieves the execution of multiple instructions per clock cycle by issuing
several instructions to different functional units.
o A superscalar processor contains one or more instruction pipelines sharing a
set of functional units. It often contains functional units, such as an add
unit, multiply unit, divide unit, floating-point add unit, and graphic unit.
o A superscalar processor contains a control mechanism to preserve the
execution order of dependent instructions for ensuring a valid result. The
scoreboard method and Tomasulo's method can be used for implementing
such mechanisms.
o In practice, most of the processors are based on the superscalar approach
and employ a scoreboard method to ensure a valid result.
July – Oct 2021
CSC 457 Lecture Notes 78
 Pipeline Processor Concepts
Further Throughput Improvement of an Instruction Pipeline
2. Superpipeline
o The superpipeline approach achieves high performance by overlapping the
execution of multiple instructions on one instruction pipeline.
o A superpipeline processor often has an instruction pipeline with more stages
than a typical instruction pipeline design. In other words, the execution
process of an instruction is broken down into even finer steps. By increasing
the number of stages in the instruction pipeline, each stage has less work to
do. This allows the pipeline clock rate to increase (cycle time decreases),
since the clock rate depends on the delay found in the slowest stage of the
pipeline.
o An example of such an architecture is the MIPS R4000 processor. The R4000
subdivides instruction fetching and data cache access to create an eight-
stage pipeline.
July – Oct 2021
CSC 457 Lecture Notes 79
 Pipeline Processor Concepts
Further Throughput Improvement of an Instruction Pipeline
3. Very Long Instruction Word (VLIW).
o The very long instruction word (VLIW) approach makes extensive use of the
compiler by requiring it to incorporate several small independent operations
into a long instruction word.
o The instruction is large enough to provide, in parallel, enough control bits
over many functional units. In other words, a VLIW architecture provides
many more functional units than a typical processor design, together with a
compiler that finds parallelism across basic operations to keep the functional
units as busy as possible.
o The compiler compacts ordinary sequential codes into long instruction words
that make better use of resources. During execution, the control unit issues
one long instruction per cycle. The issued instruction initiates many
independent operations simultaneously
July – Oct 2021
CSC 457 Lecture Notes 80
 Pipeline Processor Concepts
Further Throughput Improvement of an Instruction Pipeline
o A comparison of the three approaches will show a few interesting
differences.
o For instance, the superscalar and VLIW approaches are more sensitive to
resource conflicts than the superpipelined approach.
o In a superscalar or VLIW processor, a resource must be duplicated to
reduce the chance of conflicts, while the superpipelined design avoids any
resource conflicts.
July – Oct 2021
CSC 457 Lecture Notes 81
Pipeline Datapath Design and Implementation
o The work involved in an instruction can be partitioned into
steps labelled IF (Instruction Fetch), ID (Instruction
Decode and data fetch), EX (ALU operations or R-format
execution), MEM (Memory operations), and WB (Write-
Back to register file)
July – Oct 2021
CSC 457 Lecture Notes 82
Pipeline Datapath Design and Implementation
MIPS Instructions and Pipelining
o MIPS (Microprocessor without Interlocked Pipeline Stages) is a
reduced instruction set computer (RISC) instruction set architecture
(ISA)
o In order to implement MIPS instructions effectively on a
pipeline processor, we must ensure that the instructions
are the same length (simplicity favors regularity) for easy
IF and ID, similar to the multicycle datapath.
o We also need to have few but consistent instruction
formats, to avoid deciphering variable formats during IF
and ID, which would prohibitively increase pipeline
segment complexity for those tasks. Thus, the register
indices should be in the same place in each instruction.
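Because every MIPS instruction is 32 bits wide with the register fields in fixed positions, the ID stage can extract them with simple shifts and masks. The sketch below assumes the standard R-format layout (opcode[31:26], rs[25:21], rt[20:16], rd[15:11], shamt[10:6], funct[5:0]); the example word encodes add $t0, $t1, $t2.

#include <stdio.h>
#include <stdint.h>

/* Field layout of a 32-bit MIPS R-format instruction:
   opcode[31:26] rs[25:21] rt[20:16] rd[15:11] shamt[10:6] funct[5:0] */
struct rformat {
    unsigned opcode, rs, rt, rd, shamt, funct;
};

static struct rformat decode_r(uint32_t word)
{
    struct rformat f;
    f.opcode = (word >> 26) & 0x3F;
    f.rs     = (word >> 21) & 0x1F;
    f.rt     = (word >> 16) & 0x1F;
    f.rd     = (word >> 11) & 0x1F;
    f.shamt  = (word >>  6) & 0x1F;
    f.funct  =  word        & 0x3F;
    return f;
}

int main(void)
{
    uint32_t word = 0x012A4020;   /* add $t0, $t1, $t2  ($8 = $9 + $10) */
    struct rformat f = decode_r(word);
    printf("opcode=%u rs=%u rt=%u rd=%u shamt=%u funct=0x%X\n",
           f.opcode, f.rs, f.rt, f.rd, f.shamt, f.funct);
    return 0;
}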
July – Oct 2021
CSC 457 Lecture Notes 83
Next …....
Memory and I/O Systems
July – Oct 2021
CSC 457 Lecture Notes 84
Levels of Memory
o Level 1 or Registers – the storage locations built into the CPU
itself, in which data is stored and accessed immediately.
Commonly used registers include the accumulator, the program
counter, and address registers.
o Level 2 or Cache memory – a very fast memory with a short
access time, in which data is temporarily stored for faster
access.
o Level 3 or Main Memory – the memory on which the computer
currently works. It is small compared with secondary storage,
and once power is off its data no longer persists.
o Level 4 or Secondary Memory – external memory which is not as
fast as main memory, but in which data stays permanently.
July – Oct 2021
CSC 457 Lecture Notes 85
Cache Memory
o The cache is a smaller and faster memory which stores
copies of the data from frequently used main memory
locations.
o Cache memory is a special very high-speed memory used
to speed up and synchronize with the high-speed CPU.
Cache memory is costlier than main memory or disk
memory but more economical than CPU registers.
o Cache memory is an extremely fast memory type that acts
as a buffer between RAM and the CPU. It holds frequently
requested data and instructions so that they are
immediately available to the CPU when needed.
July – Oct 2021
CSC 457 Lecture Notes 86
Cache Memory
o Cache memory is used to reduce the average time to access data
from the Main memory. The cache is a smaller and faster memory
which stores copies of the data from frequently used main memory
locations.
o There are various different independent caches in a CPU, which
store instructions and data.
July – Oct 2021
CSC 457 Lecture Notes 87
Basic Definitions in Cache Memory
o cache block - The basic unit for cache storage. May
contain multiple bytes/words of data.
o cache line - Same as cache block. Note that this is not the
same thing as a “row” of cache.
o cache set - A “row” in the cache. The number of blocks per
set is determined by the layout of the cache (e.g. direct
mapped, set-associative, or fully associative).
o tag - A unique identifier for a group of data. Because
different regions of memory may be mapped into a block,
the tag is used to differentiate between them.
o valid bit - A bit of information that indicates whether the
data in a block is valid (1) or not (0).
July – Oct 2021
CSC 457 Lecture Notes 88
Types of Cache Memory
o There are three general cache levels:
o L1 cache, or primary cache, is extremely fast but relatively
small, and is usually embedded in the processor chip as
CPU cache.
o L2 cache, or secondary cache, often has higher capacity
than L1. L2 cache may be embedded on the CPU, or it can
be on a separate chip or coprocessor and have a high-
speed alternative system bus connecting the cache and
CPU. That way it doesn't get slowed by traffic on the main
system bus.
July – Oct 2021
CSC 457 Lecture Notes 89
Types of Cache Memory
o Level 3 (L3) cache is specialized memory developed to improve the
performance of L1 and L2. L1 or L2 can be significantly faster than
L3, though L3 is usually double the speed of DRAM. With multicore
processors, each core can have dedicated L1 and L2 cache, but
they can share an L3 cache. If an L3 cache references an
instruction, it is usually elevated to a higher level of cache.
o Contrary to popular belief, implementing flash or more dynamic
RAM (DRAM) on a system won't increase cache memory. This can
be confusing since the terms memory caching (hard disk buffering)
and cache memory are often used interchangeably.
o Memory caching, using DRAM or flash to buffer disk reads, is
meant to improve storage I/O by caching data that is frequently
referenced in a buffer ahead of slower magnetic disk or tape.
Cache memory, on the other hand, provides read buffering for the
CPU.
July – Oct 2021
CSC 457 Lecture Notes 90
Cache Memory Performance
o When the processor needs to read or write a location in main
memory, it first checks for a corresponding entry in the cache.
o If the processor finds that the memory location is in the cache, a
cache hit has occurred and the data is read from the cache
o If the processor does not find the memory location in the cache, a
cache miss has occurred. For a cache miss, the cache allocates a
new entry and copies in data from main memory, then the request
is fulfilled from the contents of the cache.
o The performance of cache memory is frequently measured in terms
of a quantity called Hit ratio.
Hit ratio = hit / (hit + miss) = no. of hits/total accesses
o Cache performance can be improved by using larger cache block sizes
and higher associativity, and by reducing the miss rate, the miss penalty,
and the time to hit in the cache.
July – Oct 2021
CSC 457 Lecture Notes 91
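As a small illustration of the hit-ratio formula above, the following Python fragment is an illustrative sketch written for these notes (the function and variable names are made up); it tallies hits and misses for a stream of accesses and reports the hit ratio.

# Illustrative hit-ratio bookkeeping for a stream of memory accesses.
def hit_ratio(access_results):
    """access_results: list of booleans, True for a cache hit, False for a miss."""
    hits = sum(1 for r in access_results if r)
    total = len(access_results)
    return hits / total if total else 0.0

accesses = [True, True, False, True, False, True, True, True]  # 6 hits, 2 misses
print(f"Hit ratio = {hit_ratio(accesses):.2f}")   # 6 / 8 = 0.75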
Architecture and data flow of a typical
cache memory
July – Oct 2021
CSC 457 Lecture Notes 92
Cache Memory Mapping
o Three types of mapping are used for cache memory: direct mapping,
associative mapping, and set-associative mapping.
o A direct mapped cache has each block mapped to exactly one cache memory
location. Conceptually, a direct mapped cache is like rows in a table with three
columns: the cache block that contains the actual data fetched and stored, a
tag with all or part of the address of the data that was fetched, and a valid bit
that indicates whether the data in the row entry is valid.
o Fully associative cache mapping is similar to direct mapping in structure but
allows a memory block to be mapped to any cache location rather than to a
prespecified cache memory location as is the case with direct mapping.
o Set associative cache mapping can be viewed as a compromise between direct
mapping and fully associative mapping in which each block is mapped to a
subset of cache locations. It is sometimes called N-way set associative
mapping, which provides for a location in main memory to be cached to any of
"N" locations in the L1 cache.
July – Oct 2021
CSC 457 Lecture Notes 93
Locality of Reference
o The ability of cache memory to improve a computer's
performance relies on the concept of locality of reference.
o Locality describes various situations that make a system
more predictable.
o Cache memory takes advantage of these situations to create
a pattern of memory access that it can rely upon.
o There are several types of locality. Two key ones for cache
are:
 Temporal locality. This is when the same resources are
accessed repeatedly in a short amount of time.
 Spatial locality. This refers to accessing various data or
resources that are near each other.
July – Oct 2021
CSC 457 Lecture Notes 94
Importance of Cache Memory
o Cache memory is important because it improves the efficiency of
data retrieval (improve performance). It stores program
instructions and data that are used repeatedly in the operation of
programs or information that the CPU is likely to need next. The
computer processor can access this information more quickly from
the cache than from the main memory. Fast access to these
instructions increases the overall speed of the program.
o Aside from its main function of improving performance, cache
memory is a valuable resource for evaluating a computer's overall
performance. Users can do this by looking at the cache's hit-to-miss
ratio. Cache hits are instances in which the system successfully
retrieves data from the cache. A cache miss is when the system
looks for the data in the cache, can't find it, and looks somewhere
else instead. In some cases, users can improve the hit-miss ratio
by adjusting the cache memory block size i.e. the size of data units
stored.
July – Oct 2021
CSC 457 Lecture Notes 95
Practice Questions
Que-1: A computer has a 256 KByte, 4-way set associative, write-back data
cache with a block size of 32 Bytes. The processor sends 32-bit addresses to the
cache controller. Each cache tag directory entry contains, in addition to the address
tag, 2 valid bits, 1 modified bit and 1 replacement bit. The number of bits in the
tag field of an address is
(A) 11
(B) 14
(C) 16
(D) 27
Answer: (C)
Explanation: https://www.geeksforgeeks.org/gate-gate-cs-2012-question-54/
July – Oct 2021
CSC 457 Lecture Notes 96
Practice Questions
o Explanation:
o A set-associative scheme is a hybrid between a fully associative cache, and
direct mapped cache. It’s considered a reasonable compromise between the
complex hardware needed for fully associative caches (which requires parallel
searches of all slots), and the simplistic direct-mapped scheme, which may
cause collisions of addresses to the same slot (similar to collisions in a hash
table).
• Number of blocks = Cache size / Block size = 256 KB / 32 Bytes = 2^13
• Number of sets = 2^13 / 4 = 2^11
o Tag + Set index bits + Byte offset bits = 32
o Tag + 11 + 5 = 32
o Tag = 16
July – Oct 2021
CSC 457 Lecture Notes 97
Practice Questions
Que-2: Consider the data given in previous question. The size of the cache tag
directory is
(A) 160 Kbits
(B) 136 bits
(C) 40 Kbits
(D) 32 bits
Answer: (A)
Explanation: https://www.geeksforgeeks.org/gate-gate-cs-2012-question-55/
July – Oct 2021
CSC 457 Lecture Notes 98
Practice Questions
Explanation: Each tag directory entry holds a 16-bit tag, 2 valid bits, 1 modified
bit and 1 replacement bit, i.e. 20 bits per entry.
Size of the tag directory = 20 bits × number of blocks = 20 × 2^13
= 160 Kbits.
July – Oct 2021
CSC 457 Lecture Notes 99
Practice Questions
Que-3: An 8KB direct-mapped write-back cache is organized as multiple blocks,
each of size 32 bytes. The processor generates 32-bit addresses. The cache
controller maintains the tag information for each cache block, comprising the
following: 1 valid bit, 1 modified bit, and as many bits as the minimum needed to
identify the memory block mapped in the cache. What is the total size of memory
needed at the cache controller to store meta-data (tags) for the cache?
(A) 4864 bits
(B) 6144 bits
(C) 6656 bits
(D) 5376 bits
Answer: (D)
Explanation: https://www.geeksforgeeks.org/gate-gate-cs-2011-question-43/
July – Oct 2021
CSC 457 Lecture Notes 100
Practice Questions
Explanation
o Cache size = 8 KB
o Block size = 32 bytes
o Number of cache lines = Cache size / Block size = (8 × 1024 bytes) / 32 = 256
o Tag bits per line = 32 − 5 (byte offset) − 8 (line index) = 19
o Total bits required to store the meta-data of 1 line = 1 + 1 + 19 = 21 bits
o Total memory required = 21 × 256 = 5376 bits
July – Oct 2021
CSC 457 Lecture Notes 101
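The three practice questions above follow the same arithmetic, so a short Python sketch can verify all of them. This is illustrative code written for these notes (the function and variable names are not from any standard library); it assumes byte-addressable memory and the bit counts stated in the questions.

import math

def cache_geometry(cache_bytes, block_bytes, ways, addr_bits):
    """Return (offset_bits, index_bits, tag_bits, blocks) for a set-associative cache."""
    blocks = cache_bytes // block_bytes
    sets = blocks // ways
    offset_bits = int(math.log2(block_bytes))
    index_bits = int(math.log2(sets))
    tag_bits = addr_bits - index_bits - offset_bits
    return offset_bits, index_bits, tag_bits, blocks

# Que-1/Que-2: 256 KB, 4-way, 32-byte blocks, 32-bit addresses, 4 extra bits per entry.
off, idx, tag, blocks = cache_geometry(256 * 1024, 32, 4, 32)
print("Q1 tag bits:", tag)                          # 16
print("Q2 directory bits:", (tag + 4) * blocks)     # 20 * 8192 = 163840 bits = 160 Kbits

# Que-3: 8 KB direct-mapped (ways = 1), 32-byte blocks, 2 extra bits per line.
off, idx, tag, lines = cache_geometry(8 * 1024, 32, 1, 32)
print("Q3 meta-data bits:", (tag + 2) * lines)      # 21 * 256 = 5376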
Locating Data in the Cache
o Given an address, we can determine whether the data at
that memory location is in the cache. To do so, we use the
following procedure:
1. Use the set index to determine which cache set the
address should reside in.
2. For each block in the corresponding cache set, compare
the tag associated with that block to the tag from the
memory address. If there is a match, proceed to the next
step. Otherwise, the data is not in the cache.
3. For the block where the data was found, look at valid
bit. If it is 1, the data is in the cache, otherwise it is not.
July – Oct 2021
CSC 457 Lecture Notes 102
Locating Data in the Cache
o If the data at that address is in the cache, then we use the block offset from
that address to find the data within the cache block where it was found.
o All of the information needed to locate the data in the cache is contained in the
address itself. Fig. 1 below shows which parts of the address are used for locating
data in the cache.
o The least significant bits are used to determine the block offset. If the block
size is B then b = log2 B bits will be needed in the address to specify the block
offset. The next highest group of bits is the set index and is used to determine
which cache set we will look at.
o If S is the number of sets in our cache, then the set index has s = log2 S bits.
Note that in a fully-associative cache, there is only 1 set so the set index will
not exist. The remaining bits are used for the tag. If ℓ is the length of the
address (in bits), then the number of tag bits is t = ℓ − b − s.
July – Oct 2021
Fig. 1: | tag (t bits) | set index (s bits) | block offset (b bits) |
CSC 457 Lecture Notes 103
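The address split in Fig. 1 can be expressed directly in code. The sketch below is illustrative only (parameter names are chosen for this example); it extracts the block offset, set index and tag from an address, given the block size B, the number of sets S and the address length ℓ, exactly as defined above.

def split_address(addr, block_size, num_sets, addr_bits):
    """Split an address into (tag, set_index, block_offset) per Fig. 1."""
    b = block_size.bit_length() - 1      # b = log2(B), block size assumed a power of two
    s = num_sets.bit_length() - 1        # s = log2(S)
    t = addr_bits - b - s                # t = l - b - s tag bits
    block_offset = addr & ((1 << b) - 1)
    set_index = (addr >> b) & ((1 << s) - 1)
    tag = (addr >> (b + s)) & ((1 << t) - 1)
    return tag, set_index, block_offset

# Example: 32-byte blocks, 2048 sets, 32-bit address.
print(split_address(0x1234ABCD, 32, 2048, 32))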
Cache Addressing
o (Read Tutorial on “Hardware Organization and Design” – 15 pages)
July – Oct 2021
CSC 457 Lecture Notes 104
Multilevel Cache Organisation
o Multilevel caching is one of the techniques used to improve cache performance by
reducing the “MISS PENALTY”. Miss Penalty refers to the extra time required
to bring the data into cache from the Main memory whenever there is a “miss”
in the cache.
o For clear understanding let us consider an example where the CPU requires 10
Memory References for accessing the desired information and consider this
scenario in the following 3 cases of System design :
Case 1 : System Design without Cache Memory
o Here the CPU directly communicates with the main memory and no caches
are involved. In this case, the CPU needs to access the main memory 10 times
to access the desired information.
July – Oct 2021
CSC 457 Lecture Notes 105
Multilevel Cache Organisation
Case 2 : System Design with Cache Memory
o Here the CPU at first checks whether the desired data is present in the
Cache Memory or not i.e. whether there is a “hit” in cache or “miss” in
the cache.
o Suppose 3 of the 10 references miss in the cache memory; then the main
memory will be accessed only 3 times.
o We can see that here the miss penalty is reduced because the main
memory is accessed fewer times than in the previous case.
July – Oct 2021
CSC 457 Lecture Notes 106
Multilevel Cache Organisation
Case 3 : System Design with Multilevel Cache Memory
o Here the cache performance is optimized further by introducing
multilevel caches; we consider a 2-level cache design.
o Suppose 3 of the references miss in the L1 cache and, out of these 3
misses, 2 also miss in the L2 cache; then the main memory will be
accessed only 2 times.
o It is clear that the miss penalty is reduced considerably compared with
the previous case, thereby improving the performance of the cache
memory.
July – Oct 2021
CSC 457 Lecture Notes 107
Multilevel Cache Organisation
o We can observe from the above 3 cases that we are trying to decrease the
number of main memory references and thus decrease the miss penalty in
order to improve the overall system performance. Also, it is important to note
that in a multilevel cache design, the L1 cache is attached to the CPU and is
small but fast, while the L2 cache is attached to the primary (L1) cache and is
larger and slower, but still faster than the main memory.
o Effective Access Time = Hit rate * Cache access time + Miss rate * Lower
level access time
o Average access Time For Multilevel Cache:(Tavg)
Tavg = H1 * C1 + (1 – H1) * (H2 * C2 +(1 – H2) *M )
where H1 is the Hit rate in the L1 caches; H2 is the Hit rate in the L2
cache; C1 is the Time to access information in the L1 caches; C2 is the Miss
penalty to transfer information from the L2 cache to an L1 cache and M is the
Miss penalty to transfer information from the main memory to the L2 cache.
July – Oct 2021
CSC 457 Lecture Notes 108
Multilevel Cache Organisation
Exercise
o Que 1 - Find the average memory access time for a processor with a 2 ns
clock cycle time, a miss rate of 0.04 misses per instruction, a miss penalty
of 25 clock cycles, and a cache access time (including hit detection) of 1
clock cycle. Also assume that the read and write miss penalties are the same
and ignore other write stalls.
Solution
Average memory access time (AMAT) = Hit time + Miss rate * Miss penalty
Hit time = 1 clock cycle (the cache access time, which is given directly)
Miss rate = 0.04; Miss penalty = 25 clock cycles (the time taken by the
next level of memory after a miss)
So, AMAT = 1 + 0.04 * 25 = 2 clock cycles
Since 1 clock cycle = 2 ns, AMAT = 4 ns
July – Oct 2021
CSC 457 Lecture Notes 109
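Both the single-level AMAT formula used in the exercise and the two-level Tavg formula on the previous slide are easy to check numerically. The Python sketch below is illustrative only; the hit rates and access times in the two-level usage example (H1 = 0.95, C1 = 1, H2 = 0.90, C2 = 10, M = 100 cycles) are assumed values chosen for demonstration, not data from the notes.

def amat(hit_time, miss_rate, miss_penalty):
    """Single-level average memory access time."""
    return hit_time + miss_rate * miss_penalty

def tavg_two_level(h1, c1, h2, c2, m):
    """Tavg = H1*C1 + (1 - H1) * (H2*C2 + (1 - H2)*M)  (formula from the slide above)."""
    return h1 * c1 + (1 - h1) * (h2 * c2 + (1 - h2) * m)

# Exercise check: 1-cycle hit time, 0.04 miss rate, 25-cycle miss penalty, 2 ns clock.
cycles = amat(1, 0.04, 25)
print(cycles, "clock cycles =", cycles * 2, "ns")        # 2.0 clock cycles = 4.0 ns

# Assumed two-level example: H1=0.95, C1=1, H2=0.90, C2=10, M=100 (cycles).
print(tavg_two_level(0.95, 1, 0.90, 10, 100), "cycles")  # 1.9 cycles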
Virtual Memory Terminologies
o Virtual Memory - A storage allocation scheme in which secondary memory
can be addressed as though it were part of main memory. The addresses
a program may use to reference memory are distinguished from the
addresses the memory system uses to identify physical storage sites, and
program-generated addresses are translated automatically to the
corresponding machine addresses. The size of virtual storage is limited by
the addressing scheme of the computer system and by the amount of
secondary memory available and not by the actual number of main
storage locations.
o Virtual Address - The address assigned to a location in virtual memory to
allow that location to be accessed as though it were part of main
memory.
o Virtual address space - The virtual storage assigned to a process.
o Address space - The range of memory addresses available to a process.
o Real address - The address of a storage location in main memory.
July – Oct 2021
CSC 457 Lecture Notes 110
Virtual Memory
o Virtual Memory is a storage allocation scheme in which secondary
memory can be addressed as though it were part of main memory.
o The addresses a program may use to reference memory are
distinguished from the addresses the memory system uses to
identify physical storage sites, and program generated addresses
are translated automatically to the corresponding machine
addresses.
o The size of virtual storage is limited by the addressing scheme of
the computer system and by the amount of secondary memory
available, not by the actual number of main storage locations.
o It is a technique that is implemented using both hardware and
software. It maps memory addresses used by a program, called
virtual addresses, into physical addresses in computer memory.
July – Oct 2021
CSC 457 Lecture Notes 111
Virtual Memory
o Two characteristics fundamental to memory management:
1) all memory references are logical addresses that are
dynamically translated into physical addresses at run time
2) a process may be broken up into a number of pieces
that don’t need to be contiguously located in main
memory during execution
o If these two characteristics are present, it is not necessary
that all of the pages or segments of a process be in main
memory during execution. This means that the required
pages need to be loaded into memory whenever required.
Virtual memory is implemented using Demand Paging or
Demand Segmentation.
July – Oct 2021
CSC 457 Lecture Notes 112
Thrashing
o A state in which the system spends most of its time
swapping process pieces rather than executing instructions
o To avoid this, the operating system tries to guess, based
on recent history, which pieces are least likely to be used
in the near future
July – Oct 2021
CSC 457 Lecture Notes 114
Principle of Locality
o Program and data references within a process
tend to cluster
o Only a few pieces of a process will be needed over
a short period of time
o Therefore it is possible to make intelligent guesses
about which pieces will be needed in the future
o Avoids thrashing
July – Oct 2021
CSC 457 Lecture Notes 115
 Support Needed for Virtual Memory
o For virtual memory to be practical and effective:
1. Hardware must support paging and segmentation
2. Operating system must include software for managing the
movement of pages and/or segments between secondary
memory and main memory
July – Oct 2021
CSC 457 Lecture Notes 116
 Paging
o The term virtual memory is usually associated
with systems that employ paging
o Use of paging to achieve virtual memory was first
reported for the Atlas computer
o Each process has its own page table and each
page table entry contains the frame number of
the corresponding page in main memory
July – Oct 2021
CSC 457 Lecture Notes 117
 Paging
o Paging is a memory management scheme that eliminates the
need for contiguous allocation of physical memory. This scheme
permits the physical address space of a process to be non-contiguous
• Logical Address or Virtual Address (represented in bits): An
address generated by the CPU
• Logical Address Space or Virtual Address Space(
represented in words or bytes): The set of all logical
addresses generated by a program
• Physical Address (represented in bits): An address actually
available on memory unit
• Physical Address Space (represented in words or bytes):
The set of all physical addresses corresponding to the logical
addresses
July – Oct 2021
CSC 457 Lecture Notes 118
 Paging
o Example:
• If Logical Address = 31 bits, then Logical Address
Space = 2^31 words = 2 G words (1 G = 2^30)
• If Logical Address Space = 128 M words = 2^7 *
2^20 words, then Logical Address = log2(2^27) = 27 bits
• If Physical Address = 22 bits, then Physical Address
Space = 2^22 words = 4 M words (1 M = 2^20)
• If Physical Address Space = 16 M words = 2^4 *
2^20 words, then Physical Address = log2(2^24) = 24 bits
July – Oct 2021
CSC 457 Lecture Notes 119
 Paging
o The mapping from virtual to physical address
is done by the memory management unit
(MMU) which is a hardware device and this
mapping is known as paging technique.
o The Physical Address Space is conceptually
divided into a number of fixed-size blocks,
called frames.
o The Logical address Space is also split into
fixed-size blocks, called pages.
o Page Size = Frame Size
July – Oct 2021
CSC 457 Lecture Notes 120
 Paging
• Let us consider an example:
• Physical Address = 12 bits, then Physical Address Space = 4 K words
• Logical Address = 13 bits, then Logical Address Space = 8 K words
• Page size = frame size = 1 K words (assumption)
July – Oct 2021
CSC 457 Lecture Notes 121
 Paging
o The address generated by the CPU is divided into:
• Page number (p): the bits that identify the page within the
Logical Address Space.
• Page offset (d): the bits that identify a particular word within
that page (determined by the page size).
o The Physical Address is divided into:
• Frame number (f): the bits that identify the frame within the
Physical Address Space.
• Frame offset (d): the bits that identify a particular word within
that frame (determined by the frame size).
July – Oct 2021
CSC 457 Lecture Notes 122
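A paged virtual-to-physical translation can be sketched in a few lines. The code below is an illustrative model written for these notes (the page-table dictionary and the sizes used are assumptions, not from the slides): it splits the CPU-generated logical address into page number p and offset d, looks p up in a page table, and recombines the frame number f with the same offset.

PAGE_SIZE = 1024        # 1 K words, as in the example above (assumed word-addressable)

# Toy page table: page number -> frame number (assumed contents for illustration).
page_table = {0: 5, 1: 2, 2: 7}

def translate(logical_addr):
    p, d = divmod(logical_addr, PAGE_SIZE)   # page number and page offset
    if p not in page_table:
        raise KeyError(f"page fault on page {p}")
    f = page_table[p]                        # frame number from the page table
    return f * PAGE_SIZE + d                 # physical address = frame base + offset

print(translate(1 * PAGE_SIZE + 37))   # page 1, offset 37 -> frame 2 -> 2085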
 Paging
o The hardware implementation of the page table can be done using
dedicated registers, but using registers for the page table is
satisfactory only if the page table is small.
o If the page table contains a large number of entries, then we can use a
TLB (Translation Look-aside Buffer), a special, small, fast-lookup
hardware cache.
o The TLB is associative, high-speed memory.
o Each entry in the TLB consists of two parts: a tag and a value.
o When this memory is used, an item is compared with all
tags simultaneously. If the item is found, the corresponding
value is returned.
July – Oct 2021
CSC 457 Lecture Notes 123
 Paging
July – Oct 2021
CSC 457 Lecture Notes 124
 Paging
o Main memory access time = m
o If the page table is kept in main memory,
o Effective access time = m (to access the page table entry) + m (to access the
required word in memory)
July – Oct 2021
CSC 457 Lecture Notes 125
 Paging
Sample Question
o Consider a machine with 64 MB physical memory and a 32-bit
virtual address space. If the page size is 4KB, what is the
approximate size of the page table?
(A) 16 MB
(B) 8 MB
(C) 2 MB
(D) 24 MB
Answer: (C)
o Explanation: See question 1 of
https://www.geeksforgeeks.org/operating-systems-set-2/
July – Oct 2021
CSC 457 Lecture Notes 126
 Paging
Explanation:
A page entry is used to get address of physical memory. Here we assume that single
level of Paging is happening. So the resulting page table will contain entries for all the
pages of the Virtual address space.
Number of entries in page table = (virtual address space size)/(page size)
Using above formula we can say that there will be 2^(32-12) = 2^20 entries in page
table.
No. of bits required to address the 64MB Physical memory = 26.
So there will be 2^(26-12) = 2^14 page frames in the physical memory. And page
table needs to store the address of all these 2^14 page frames. Therefore, each page
table entry will contain 14 bits address of the page frame and 1 bit for valid-invalid
bit.
Since memory is byte addressable, each page table entry is rounded up to 16 bits,
i.e. 2 bytes long.
Size of page table = (total number of page table entries) *(size of a page table entry)
= (2^20 *2) = 2MB
July – Oct 2021
CSC 457 Lecture Notes 127
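The arithmetic in the sample question can be reproduced with a few lines of Python. This is an illustrative sketch under the same assumptions as the explanation (single-level paging, a 14-bit frame number plus one valid-invalid bit, rounded up to 2 bytes per entry); the function name is made up for these notes.

def page_table_size(virtual_addr_bits, phys_mem_bytes, page_bytes, extra_bits=1):
    """Single-level page table size in bytes (each entry rounded up to whole bytes)."""
    offset_bits = (page_bytes - 1).bit_length()                  # 4 KB page -> 12 bits
    entries = 2 ** (virtual_addr_bits - offset_bits)              # 2^20 entries
    frame_bits = (phys_mem_bytes // page_bytes - 1).bit_length()  # 2^14 frames -> 14 bits
    entry_bytes = -(-(frame_bits + extra_bits) // 8)              # ceil(15 / 8) = 2 bytes
    return entries * entry_bytes

print(page_table_size(32, 64 * 2**20, 4 * 2**10) // 2**20, "MB")  # 2 MB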
 Demand Paging
o The process of loading a page into memory on demand
(whenever a page fault occurs) is known as demand paging.
The process includes the following steps:
July – Oct 2021
CSC 457 Lecture Notes 128
 Demand Paging
1. If the CPU tries to refer to a page that is currently not available in the main
memory, it generates an interrupt indicating a memory access fault.
2. The OS puts the interrupted process in a blocked state. For the
execution to proceed, the OS must bring the required page into
memory.
3. The OS will search for the required page in the logical address space.
4. The required page will be brought from the logical address space into the
physical address space. Page replacement algorithms are used to decide
which page to replace in the physical address space.
5. The page table will be updated accordingly.
6. A signal will be sent to the CPU to continue the program execution,
and the process will be placed back into the ready state.
o Hence, whenever a page fault occurs, these steps are followed by the
operating system and the required page is brought into memory.
July – Oct 2021
CSC 457 Lecture Notes 129
 Advantages of Demand Paging
1. More processes may be maintained in the main memory: Because we
are going to load only some of the pages of any particular process,
there is room for more processes. This leads to more efficient
utilization of the processor because it is more likely that at least one of
the more numerous processes will be in the ready state at any
particular time.
2. A process may be larger than all of main memory: One of the most
fundamental restrictions in programming is lifted. A process larger
than the main memory can be executed because of demand paging.
The OS itself loads pages of a process in main memory as required.
3. It allows greater multiprogramming levels by using less of the
available (primary) memory for each process
July – Oct 2021
CSC 457 Lecture Notes 130
 Page Fault Service Time
o The time taken to service a page fault is called the page
fault service time. The page fault service time includes the
time taken to perform all of the above six steps.
Let main memory access time = m
Page fault service time = s
Page fault rate = p
Then, Effective memory access time = (p * s) + (1 - p) * m
July – Oct 2021
CSC 457 Lecture Notes 131
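The effective-access-time formula above is easy to evaluate. In the sketch below (illustrative only; the values m = 200 ns and s = 8 ms are assumed numbers, not from the notes) even a tiny page-fault rate dominates the average, because the service time is so much larger than a memory access.

def effective_access_time(p, s, m):
    """p: page fault rate, s: page fault service time, m: main memory access time."""
    return p * s + (1 - p) * m

m = 200e-9        # 200 ns main memory access time (assumed)
s = 8e-3          # 8 ms page fault service time (assumed)
for p in (0.0, 1e-6, 1e-4):
    print(f"p = {p:g}: {effective_access_time(p, s, m) * 1e9:.1f} ns")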
 Swapping
o Swapping a process out means removing all of its pages
from memory, or marking them so that they will be
removed by the normal page replacement process.
o Suspending a process ensures that it is not runnable while
it is swapped out. At some later time, the system swaps
back the process from the secondary storage to main
memory.
o When a process spends most of its time swapping pages in and
out, the situation is called thrashing.
July – Oct 2021
CSC 457 Lecture Notes 132
 Inverted Page Table
o Page number portion of a virtual address is
mapped into a hash value
 hash value points to inverted page table
o Fixed proportion of real memory is required for
the tables regardless of the number of processes
or virtual pages supported
o Structure is called inverted because it indexes
page table entries by frame number rather than
by virtual page number
July – Oct 2021
CSC 457 Lecture Notes 133
 Inverted Page Table
o Each entry in the inverted page table includes:
• Page number
• Process identifier – the process that owns this page
• Control bits – includes flags and protection and locking info
• Chain pointer – the index value of the next entry in the chain
July – Oct 2021
CSC 457 Lecture Notes 134
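A minimal model of an inverted page table lookup is sketched below. It is illustrative only (the hash function, table size and entry fields are simplifications chosen for these notes): the (process id, page number) pair is hashed, and the chain pointer is followed until a matching entry is found; the index of that entry is the frame number.

from dataclasses import dataclass
from typing import Optional

@dataclass
class IPTEntry:
    page_number: int
    process_id: int
    control_bits: int = 0
    chain: Optional[int] = None   # index of next entry with the same hash, or None

NUM_FRAMES = 8
table = [None] * NUM_FRAMES       # one entry per physical frame
hash_anchor = {}                  # hash value -> first frame index in the chain

def lookup(pid, page):
    """Return the frame number holding (pid, page), or None on a page fault."""
    i = hash_anchor.get(hash((pid, page)) % NUM_FRAMES)
    while i is not None:
        e = table[i]
        if e and e.process_id == pid and e.page_number == page:
            return i              # frame number = table index
        i = e.chain if e else None
    return None

# Pre-filled example: process 1's page 3 resides in frame 2 (assumed contents).
table[2] = IPTEntry(page_number=3, process_id=1)
hash_anchor[hash((1, 3)) % NUM_FRAMES] = 2

print(lookup(1, 3))   # 2
print(lookup(1, 4))   # None (page fault)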
 Translation Lookaside Buffer (TLB)
o Each virtual memory reference can cause two
physical memory accesses:
 one to fetch the page table entry
 one to fetch the data
o To overcome the effect of doubling the memory
access time, most virtual memory schemes make
use of a special high-speed cache called a
translation lookaside buffer (TLB)
July – Oct 2021
CSC 457 Lecture Notes 135
 Translation Lookaside Buffer (TLB)
o The TLB only contains some of the page table
entries so we cannot simply index into the TLB
based on page number
 each TLB entry must include the page number as well
as the complete page table entry (associative mapping)
o The processor is equipped with hardware that
allows it to interrogate simultaneously a number
of TLB entries to determine if there is a match on
page number
July – Oct 2021
CSC 457 Lecture Notes 136
 Translation Lookaside Buffer (TLB)
o Some TLBs store address-space identifiers (ASIDs)
in each TLB entry –
– uniquely identifies each process
– provide address-space protection for that process
– Otherwise need to flush at every context switch
o TLBs typically small (64 to 1,024 entries)
o On a TLB miss, value is loaded into the TLB for
faster access next time
– Replacement policies must be considered
– Some entries can be wired down for permanent fast
access
July – Oct 2021
CSC 457 Lecture Notes 137
 Improving Efficiency of Virtual Address Translation
o The next step towards improving the efficiency of virtual address
translation is the memory management unit (MMU), introduced into
modern microprocessors.
o The functioning of the memory management unit is based on the use
of address translation buffers and other registers, in which current
pointers to all tables used in virtual-to-physical address translation are
stored
July – Oct 2021
CSC 457 Lecture Notes 138
 Improving Efficiency of Virtual Address Translation
o The MMU unit checks if the requested page descriptor is in
the TLB. If so, the MMU generates the physical address for
the main memory.
o If the descriptor is missing in TLB, then MMU brings the
descriptor from the main memory and updates the TLB.
o Next, depending on the presence of the page in the main
memory, the MMU performs address translation or
launches the transmission of the page to the main memory
from the auxiliary store.
July – Oct 2021
CSC 457 Lecture Notes 139
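Putting the TLB and the page table together, the sketch below models the MMU decision sequence described above. It is an illustrative toy written for these notes (the TLB size, page size and page-table contents are assumptions): the TLB is checked first, the page table is consulted on a TLB miss, and the fetched descriptor is written back into the TLB.

from collections import OrderedDict

PAGE_SIZE = 4096
page_table = {0: 9, 1: 4, 7: 2}          # page -> frame (assumed contents)

class TLB:
    """Tiny fully associative TLB with LRU replacement (illustrative)."""
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.entries = OrderedDict()      # page -> frame

    def get(self, page):
        if page in self.entries:
            self.entries.move_to_end(page)    # refresh LRU order on a hit
            return self.entries[page]
        return None

    def put(self, page, frame):
        self.entries[page] = frame
        self.entries.move_to_end(page)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry

tlb = TLB()

def mmu_translate(vaddr):
    page, offset = divmod(vaddr, PAGE_SIZE)
    frame = tlb.get(page)                     # 1) check the TLB first
    if frame is None:
        if page not in page_table:            # 3) page not resident: page fault
            raise KeyError(f"page fault on page {page}")
        frame = page_table[page]              # 2) TLB miss: read descriptor, update TLB
        tlb.put(page, frame)
    return frame * PAGE_SIZE + offset

print(hex(mmu_translate(0x1ABC)))   # page 1 -> frame 4 -> 0x4abc
print(hex(mmu_translate(0x1ABC)))   # second access hits in the TLB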
 Segmentation
o A process is divided into segments. The chunks that a
program is divided into, which are not necessarily all of the
same size, are called segments. Segmentation gives the user's
view of the process, which paging does not; here the user's
view is mapped onto physical memory.
o There are two types of segmentation:
1. Virtual memory segmentation –
Each process is divided into a number of segments, not all
of which are resident at any one point in time.
2. Simple segmentation –
Each process is divided into a number of segments, all of
which are loaded into memory at run time, though not
necessarily contiguously.
July – Oct 2021
CSC 457 Lecture Notes 140
 Segmentation
o There is no simple relationship between logical addresses
and physical addresses in segmentation. A table stores the
information about all such segments and is called Segment
Table.
o Segment Table – maps the two-dimensional logical address
into a one-dimensional physical address. Each table
entry has:
o Base Address: It contains the starting physical address
where the segments reside in memory.
o Limit: It specifies the length of the segment.
July – Oct 2021
CSC 457 Lecture Notes 141
 Segmentation
July – Oct 2021
CSC 457 Lecture Notes 142
 Segmentation
o Translation of Two dimensional Logical Address to one
dimensional Physical Address
o Address generated by the CPU is divided into:
 Segment number (s): Number of bits required to represent the segment.
 Segment offset (d): Number of bits required to represent the size of the
segment.
July – Oct 2021
CSC 457 Lecture Notes 143
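The segment-table translation can be sketched in the same style. The code below is illustrative (the segment table contents are assumed values chosen for the example): the logical address (s, d) is translated by checking d against the segment's limit and adding it to the segment's base address.

# Segment table: segment number -> (base physical address, limit).  Assumed contents.
segment_table = {
    0: (1400, 1000),
    1: (6300,  400),
    2: (4300,  400),
}

def translate_segment(s, d):
    """Translate the 2-D logical address (segment s, offset d) to a physical address."""
    if s not in segment_table:
        raise KeyError(f"invalid segment {s}")
    base, limit = segment_table[s]
    if d >= limit:
        raise ValueError(f"segmentation fault: offset {d} exceeds limit {limit}")
    return base + d

print(translate_segment(2, 53))   # 4300 + 53 = 4353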
 Segmentation
Advantages of Segmentation
1. No Internal fragmentation
2. Segment table consumes less space in comparison to page
table in paging
Disadvantage of Segmentation
1. As processes are loaded and removed from the memory,
the free memory space is broken into little pieces, causing
external fragmentation
July – Oct 2021
CSC 457 Lecture Notes 144
Next …....
Shared Memory Multiprocessors
July – Oct 2021
CSC 457 Lecture Notes 145
 Shared Memory Multiprocessors
o A system with multiple CPUs “sharing” the same main memory is
called a multiprocessor.
o In a multiprocessor system all processes on the various CPUs share
a unique logical address space, which is mapped on a physical
memory that can be distributed among the processors.
o Each process can read and write a data item simply using load and
store operations, and process communication is through shared
memory.
o It is the hardware that makes all CPUs access and use the same
main memory.
o This architectural model is simple and easy to use for
programming; it can be applied to a wide variety of problems that
can be modeled as a set of tasks, to be executed in parallel (at least
partially)
July – Oct 2021
CSC 457 Lecture Notes 146
 Shared Memory Multiprocessors
o Since all CPUs share the address space, only a single
instance of the operating system is required.
o When a process terminates or goes into a wait state for
whatever reason, the OS can look in the process table
(more precisely, in the ready processes queue) for another
process to be dispatched to the idle CPU.
o On the contrary, in systems with no shared memory, each
CPU must have its own copy of the operating system, and
processes can only communicate through message passing.
o The basic issue in shared memory multiprocessor systems is
memory itself, since the larger the number of processors
involved, the more difficult it is to use memory efficiently.
July – Oct 2021
CSC 457 Lecture Notes 147
 Shared Memory Multiprocessors
o All modern OSs (Windows, Solaris, Linux, MacOS) support
symmetric multiprocessing (SMP), with a scheduler running
on every processor (a simplified description, of course).
o “Ready to run” processes can be inserted into a single
queue that can be accessed by every scheduler;
alternatively, there can be a “ready to run” queue for each
processor.
o When a scheduler is activated in a processor, it chooses one
of the “ready to run” processes and dispatches it on its
processor (with a single queue, things are somewhat more
difficult, can you guess why?)
July – Oct 2021
CSC 457 Lecture Notes 148
 Shared Memory Multiprocessors
o A distinct feature in multiprocessor systems is load balancing.
o It is useless having many CPUs in a system, if processes are not
distributed evenly among the cores.
o With a single “ready-to-run” queue, load balancing is usually
automatic: if a processor is idle, its scheduler will pick a process
from the shared queue and will start it on that processor.
o Modern OSs designed for SMP often have a separate queue for
each processor (to avoid the problems associated with a single
queue).
o There is an explicit mechanism for load balancing, by which a
process on the wait list of an overloaded processor is moved to
the queue of another, less loaded processor.
 As an example, SMP Linux activates its load balancing scheme every 200
ms, and whenever a processor queue empties.
July – Oct 2021
CSC 457 Lecture Notes 149
 Shared Memory Multiprocessors
o Migrating a process to a different processor can be costly when
each core has a private cache (can you guess why?).
o This is why some OSs, such as Linux, offer a system call to
specify that a process is tied to a given processor, independently of
the processors' load.
o There are three classes of multiprocessors, according to the way
each CPU sees main memory:
- Uniform Memory Access (UMA),
- Non Uniform Memory Access (NUMA)
- Cache Only Memory Access (COMA)
July – Oct 2021
CSC 457 Lecture Notes 150
 Shared Memory Multiprocessors
1. Uniform Memory Access (UMA):
o The name of this type of architecture hints at the fact that
all processors share a unique centralized primary memory,
so each CPU has the same memory access time.
o Owing to this architecture, these systems are also called
Symmetric Shared-memory Multiprocessors (SMP)
o The simplest multiprocessor system has a single bus to
which at least two CPUs and a memory (shared among all
processors) are connected.
o When a CPU wants to access a memory location, it checks if
the bus is free, then it sends the request to the memory
interface module and waits for the requested data to be
available on the bus.
July – Oct 2021
CSC 457 Lecture Notes 151
 Shared Memory Multiprocessors
1. Uniform Memory Access (UMA):
o Multicore processors are small UMA multiprocessor systems, where the first
shared cache (L2 or L3) is actually the communication channel.
o Shared memory can quickly become a bottleneck for system performances,
since all processors must synchronize on the single bus and memory
access.
o Larger multiprocessor systems (>32 CPUs) cannot use a single bus to
interconnect CPUs to memory modules, because bus contention becomes
unmanageable.
o The CPU–memory connection is instead realized through an interconnection
network (in jargon, a “fabric”).
o Caches local to each CPU alleviate the problem; furthermore, each processor
can be equipped with a private memory to store data of computations that
need not be shared with other processors. Traffic to/from shared memory
can thus be reduced considerably
July – Oct 2021
CSC 457 Lecture Notes 152
 Shared Memory Multiprocessors
2. Non Uniform Memory Access (NUMA):
o Single bus UMA systems are limited in the number of
processors, and costly hardware is necessary to connect
more processors. Current technology prevents building UMA
systems with more than 256 processors.
o To build larger systems, a compromise is mandatory: not
all memory blocks can have the same access time with
respect to each CPU.
o This is the origin of the name NUMA systems: Non Uniform
Memory Access.
July – Oct 2021
CSC 457 Lecture Notes 153
 Shared Memory Multiprocessors
2. Non Uniform Memory Access (NUMA):
o These systems have a shared logical address space, but
physical memory is distributed among CPUs, so that access
time to data depends on data position, in local or in a
remote memory (thus the NUMA denomination). These
systems are also called Distributed Shared Memory (DSM)
architectures
July – Oct 2021
CSC 457 Lecture Notes 154
 Shared Memory Multiprocessors
2. Non Uniform Memory Access (NUMA):
o Since all NUMA systems have a single logical address space
shared by all CPUs, while physical memory is distributed
among processors, there are two types of memories: local
and remote memory.
o Yet, even remote memory is accessed by each CPU with
LOAD and STORE instructions.
o There are two types of NUMA systems:
• Non-Caching NUMA (NC-NUMA)
• Cache-Coherent NUMA (CC-NUMA)
July – Oct 2021
CSC 457 Lecture Notes 155
 Shared Memory Multiprocessors
Non Caching -NUMA
o In a NC-NUMA system, processors have no local cache.
o Each memory access is managed by a modified MMU, which checks whether
the request is for a local or for a remote block; in the latter case, the
request is forwarded to the node containing the requested data.
o Obviously, programs using remote data (with respect to the CPU
requesting them) will run much slower than they would if the
data were stored in the local memory
o In NC-NUMA systems there is no cache coherency problem, because
there is no caching at all: each memory item is in a single location.
o Remote memory access is however very inefficient. For this reason, NC-
NUMA systems can resort to special software that relocates memory
pages from one block to another, just to maximize performance.
o A page scanner daemon activates every few seconds, examines statistics
on memory usage, and moves pages from one block to another, to
increase performance.
July – Oct 2021
CSC 457 Lecture Notes 156
 Shared Memory Multiprocessors
Non Caching -NUMA
o Actually, in NC-NUMA systems, each processor can also have a private memory
and a cache, and only private data (those allocated in the private local memory)
can be held in the cache.
o This solution increases the performances of each processor, and is adopted in
Cray T3D/E.
o Yet, remote data access time remains very high, 400 processor clock cycles in
Cray T3D/E, against 2 for retrieving data from local cache.
July – Oct 2021
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 

Recently uploaded (20)

Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 

  • 9. CSC 457 Lecture Notes 9 1. Computational Models for High Performance Microprocessors July – Oct 2021
  • 10. CSC 457 Lecture Notes 10  Computational Models o High performance RISC-based microprocessors are defining the current history of high performance computing o A Complex Instruction Set Computer (CISC) instruction set is made up of powerful primitives, close in functionality to the primitives of high-level languages o “If RISC is faster, why did people bother with CISC designs in the first place?”  RISC wasn’t always both feasible and affordable July – Oct 2021
  • 11. CSC 457 Lecture Notes 11 Computational Models o High-level language compilers were commonly available, but they didn’t generate the fastest code, and they weren’t terribly thrifty with memory. o When programming, you needed to save both space and time. A good instruction set was both easy to use and powerful. o Computers had very little storage by today’s standards. An instruction that could roll all the steps of a complex operation, such as a do-loop, into a single opcode was a plus, because memory was precious. o Complex instructions saved time, too. Almost every large computer following the IBM 704 had a memory system that was slower than its central processing unit (CPU). When a single instruction can perform several operations, the overall number of instructions retrieved from memory can be reduced. July – Oct 2021
  • 12. CSC 457 Lecture Notes 12 Computational Models o There were several obvious pressures that affected the development of RISC: - The number of transistors that could fit on a single chip was increasing. It was clear that one would eventually be able to fit all the components from a processor board onto a single chip. - Techniques such as pipelining were being explored to improve performance. Variable-length instructions and variable-length instruction execution times (due to varying numbers of microcode steps) made implementing pipelines more difficult. - As compilers improved, they found that well-optimized sequences of streamlined instructions often outperformed the equivalent complicated multi-cycle instructions. July – Oct 2021
  • 13. CSC 457 Lecture Notes 13 Computational Models o The RISC designers sought to create a high performance single-chip processor with a fast clock rate. o When a CPU can fit on a single chip, its cost is decreased, its reliability is increased, and its clock speed can be increased. o While not all RISC processors are single-chip implementations, most use a single chip. o To accomplish this task, it was necessary to discard the existing CISC instruction sets and develop a new minimal instruction set that could fit on a single chip. Hence the term Reduced Instruction Set Computer. July – Oct 2021
  • 14. CSC 457 Lecture Notes 14 Computational Models o The earliest RISC processors had no floating-point support in hardware, and some did not even support integer multiply in hardware. However, these instructions could be implemented using software routines that combined other instructions (a microcode of sorts). o These earliest RISC processors (most severely reduced) were not overwhelming successes, for four reasons:  It took time for compilers, operating systems, and user software to be retuned to take advantage of the new processors.  If an application depended on the performance of one of the software-implemented instructions, its performance suffered dramatically.  Because RISC instructions were simpler, more instructions were needed to accomplish the task.  Because all the RISC instructions were 32 bits long, and commonly used CISC instructions were as short as 8 bits, RISC program executables were often larger. July – Oct 2021
  • 15. CSC 457 Lecture Notes 15 Computational Models o As a result of these last two issues, a RISC program may have to fetch more memory for its instructions than a CISC program. This increased appetite for instructions actually clogged the memory bottleneck until sufficient caches were added to the RISC processors. o RISC processors quickly became known for their affordable high-speed floating-point capability compared to CISC processors. This excellent performance on scientific and engineering applications effectively created a new type of computer system, the workstation. July – Oct 2021
  • 16. CSC 457 Lecture Notes 16  Parallel Architectures o Concurrency and parallelism are related concepts, but they are distinct. Concurrent programming happens when several computations are happening in overlapping time periods. Your laptop, for example, seems like it is doing a lot of things at the same time even though there are only 1, 2, or 4 cores. So, we have concurrency without parallelism. o At the other end of the spectrum, the CPU in your laptop is carrying out pieces of the same computation in parallel to speed up the execution of the instruction stream. o Parallel computing occupies a unique spot in the universe of distributed systems. Parallel computing is centralized—all of the processes are typically under the control of a single entity. Parallel computing is usually hierarchical—parallel architectures are frequently described as grids, trees, or pipelines. Parallel computing is co-located—for efficiency, parallel processes are typically located very close to each other, often in the same chassis or at least the same data center. These choices are driven by the problem space and the need for high performance. July – Oct 2021
  • 17. CSC 457 Lecture Notes 17  Parallel Architectures o Parallel computing occupies a unique spot in the universe of distributed systems. o Parallel computing is centralized—all of the processes are typically under the control of a single entity. o Parallel computing is usually hierarchical—parallel architectures are frequently described as grids, trees, or pipelines. o Parallel computing is co-located—for efficiency, parallel processes are typically located very close to each other, often in the same chassis or at least the same data center. o These choices are driven by the problem space and the need for high performance. July – Oct 2021
  • 18. CSC 457 Lecture Notes 18  Parallel Architectures o Definition of a parallel computer: A set of independent processors that can work cooperatively to solve a problem o A parallel system consists of an algorithm and the parallel architecture on which the algorithm is implemented. o Note that an algorithm may have different performance on different parallel architectures. o For example, an algorithm may perform differently on a linear array of processors and on a hypercube of processors July – Oct 2021
  • 19. CSC 457 Lecture Notes 19  Parallel Architectures o Why Use Parallel Computing?  Single processor speeds are reaching their ultimate limits  Multi-core processors and multiple processors are the most promising paths to performance improvements o Concurrency: The property of a parallel algorithm that a number of operations can be performed by separate processors at the same time. Concurrency is the key concept in the design of parallel algorithms:  Requires a different way of looking at the strategy to solve a problem  May require a very different approach from a serial program to achieve high efficiency July – Oct 2021
  • 20. CSC 457 Lecture Notes 20  Parallel Architectures • July – Oct 2021
  • 21. CSC 457 Lecture Notes 21  Parallel Architectures o Protein folding problems involve a large number of independent calculations that do not depend on data from other calculations o Concurrent calculations with no dependence on the data from other calculations are termed Embarrassingly Parallel o These embarrassingly parallel problems are ideal for solution by HPC methods, and can realize nearly ideal concurrency and scalability o Flexibility in the way a problem is solved is beneficial to finding a parallel algorithm that yields a good parallel scaling. o Often, one has to employ substantial creativity in the way a parallel algorithm is implemented to achieve good scalability. July – Oct 2021
  • 22. CSC 457 Lecture Notes 22  Parallel Architectures o Understand the Dependencies o One must understand all aspects of the problem to be solved, in particular the possible dependencies of the data. o It is important to understand fully all parts of a serial code that you wish to parallelize. Example: Pressure Forces (Local) vs. Gravitational Forces (Global) o When designing a parallel algorithm, always remember:  Computation is FAST  Communication is SLOW  Input/Output (I/O) is INCREDIBLY SLOW o In addition to concurrency and scalability, there are a number of other important factors in the design of parallel algorithms: Locality; Granularity; Modularity; Flexibility; Load balancing July – Oct 2021
  • 23. CSC 457 Lecture Notes 23  Parallel Architectures Parallel Computer Architectures o Virtually all computers follow the basic design of the Von Neumann Architecture, as follows:  Memory stores both instructions and data  Control unit fetches instructions from memory, decodes instructions, and then sequentially performs operations to carry out the programmed task  Arithmetic Unit performs mathematical operations  Input/Output is the interface to the user July – Oct 2021
  • 24. CSC 457 Lecture Notes 24  Parallel Architectures Flynn’s Taxonomy o SISD: This is a standard serial computer: one set of instructions, one data stream o SIMD: All units execute the same instructions on different data streams (vector) - Useful for specialized problems, such as graphics/image processing - Old vector supercomputers worked this way, as do modern GPUs o MISD: Single data stream operated on by different sets of instructions, not generally used for parallel computers o MIMD: Most common parallel computer, each processor can execute different instructions on different data streams - Often constructed of many SIMD subcomponents July – Oct 2021
  • 26. CSC 457 Lecture Notes 26  Parallel Architectures Parallel Computer Memory Architectures o Shared Memory – memory shared among various CPUs o Distributed Memory – each CPU has its own memory o Hybrid Distributed Shared Memory July – Oct 2021
  • 27. CSC 457 Lecture Notes 27  Parallel Architectures Relation to Parallel Programming Models o OpenMP: Multi-threaded calculations occur within shared-memory components of systems, with different threads working on the same data. o MPI: Based on a distributed-memory model; data associated with another processor must be communicated over the network connection. o GPUs: Graphics Processing Units (GPUs) incorporate many (hundreds of) computing cores with a single control unit, so this is a shared-memory model. o Processors vs. Cores: A modern processor package typically contains several cores, each of which can execute its own instruction stream. July – Oct 2021
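To make the OpenMP shared-memory model just described concrete, here is a minimal sketch in C (not from the original notes); the file name, array size, and the use of a reduction clause are illustrative assumptions only.

    /* omp_sum.c - compile with: gcc -fopenmp omp_sum.c
       Minimal OpenMP sketch: several threads cooperate on the same
       shared array, as in the shared-memory programming model above. */
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        enum { N = 1000000 };
        static double a[N];
        double sum = 0.0;

        for (int i = 0; i < N; i++)          /* serial initialisation */
            a[i] = 1.0;

        /* Each thread handles a chunk of the shared array; the
           reduction clause combines the per-thread partial sums. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }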
  • 28. CSC 457 Lecture Notes 28  Parallel Architectures Embarrassingly Parallel o Refers to an approach that involves solving many similar but independent tasks simultaneously o Little to no coordination (and thus no communication) between tasks o Each task can be a simple serial program o This is the “easiest” type of problem to implement in a parallel manner. Essentially requires automatically coordinating many independent calculations and possibly collating the results. o Examples: Computer Graphics and Image Processing; Protein Folding Calculations in Biology; Geographic Land Management Simulations in Geography; Data Mining in numerous fields; Event simulation and reconstruction in Particle Physics July – Oct 2021
  • 29. CSC 457 Lecture Notes 29  Internetworking Performance Issues and Scalability of Parallel Architectures o Performance Limitations of Parallel Architectures o Adding additional resources doesn’t necessarily speed up a computation. There’s a limit defined by Amdahl’s Law. o The basic idea of Amdahl’s law is that a parallel computation’s maximum performance gain is limited by the portion of the computation that has to happen serially, which creates a bottleneck. o The serial portion includes scheduling, resource allocation, communication, synchronization, etc. o For example, if a computation that takes 20 hours on a single CPU has a serial portion that takes 1 hour (5%), then Amdahl’s law shows that no matter how many processors you put on the task, the maximum speedup is 20x. Consequently, after a point, putting additional processors on the job just wastes resources. July – Oct 2021
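The 20-hour example above can be checked with a short sketch of Amdahl's law (not from the original notes); the helper name amdahl_speedup and the processor counts tried are illustrative.

    /* amdahl.c - evaluates Amdahl's law for the slide's example:
       a 20-hour job with a 1-hour (5%) serial portion. */
    #include <stdio.h>

    /* speedup = 1 / (s + (1 - s)/p), s = serial fraction, p = processors */
    static double amdahl_speedup(double s, int p) {
        return 1.0 / (s + (1.0 - s) / p);
    }

    int main(void) {
        double s = 1.0 / 20.0;                     /* 1 of 20 hours is serial */
        int procs[] = { 1, 4, 16, 64, 1024, 1000000 };

        for (int i = 0; i < 6; i++)
            printf("p = %7d  speedup = %6.2f\n", procs[i], amdahl_speedup(s, procs[i]));

        /* As p grows the speedup approaches 1/s = 20x, so extra
           processors beyond a point add almost nothing. */
        return 0;
    }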
  • 30. CSC 457 Lecture Notes 30  Internetworking Performance Issues and Scalability of Parallel Architectures • July – Oct 2021
  • 31. CSC 457 Lecture Notes 31  Internetworking Performance Issues and Scalability of Parallel Architectures Process Interaction o Except for embarrassingly parallel algorithms, the threads in a parallel computation need to communicate with each other. There are two ways they can do this: o Shared memory – the processes can share a storage location that they use for communicating. Shared memory can also be used to synchronize threads by using the shared memory as a semaphore. o Messaging – the processes communicate via messages. This could be over a network or a special-purpose bus. Networks for this use are typically hyper-local and designed for this purpose. July – Oct 2021
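As a minimal sketch of the messaging style of interaction (not from the original notes), the hypothetical program below has MPI rank 0 send one integer to rank 1; it assumes an MPI installation providing mpicc and mpirun.

    /* msg_demo.c - compile: mpicc msg_demo.c   run: mpirun -np 2 ./a.out
       Two processes with separate address spaces communicate only by
       exchanging a message. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, value = 42;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* to rank 1 */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }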
  • 32. CSC 457 Lecture Notes 32  Internetworking Performance Issues and Scalability of Parallel Architectures Consistent Execution o The threads of execution for most parallel algorithms must be coupled to achieve consistent execution. o Parallel threads of execution communicate to transfer values between processes. Parallel algorithms communicate not only to calculate the result, but to achieve deterministic execution. o For any given set of inputs, the parallel version of an algorithm should return the same answer each time it is run, and the same answer that a sequential version of the algorithm would return. o Parallel algorithms achieve this by locking memory or otherwise sequencing operations between threads. This communication, together with the waiting required for sequencing, imposes a performance overhead. o As we saw in our discussion of Amdahl’s Law, these sequential portions of a parallel algorithm are the limiting factor in speeding up execution. July – Oct 2021
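A minimal pthreads sketch of the locking idea above (not part of the notes); the thread count, iteration count, and shared counter are illustrative, and the lock is what makes the parallel result match the sequential one on every run.

    /* lock_demo.c - compile with: gcc -pthread lock_demo.c
       Sequencing shared-memory updates with a mutex so the parallel
       result is deterministic and equals the sequential result. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NITER    100000

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < NITER; i++) {
            pthread_mutex_lock(&lock);    /* serialize the update ...      */
            counter++;                    /* ... so no increments are lost */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        /* Always NTHREADS * NITER, the same answer a sequential loop gives. */
        printf("counter = %ld\n", counter);
        return 0;
    }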
  • 33. CSC 457 Lecture Notes 33  Internetworking Performance Issues and Scalability of Parallel Architectures o (Read tutorial on Performance and Scalability on Parallel Computing attached) July – Oct 2021
  • 34. CSC 457 Lecture Notes 34 Next ….... Performance Evaluation July – Oct 2021
  • 35. CSC 457 Lecture Notes 35  Performance Modelling o The goal of performance modeling is to gain understanding of a computer system’s performance on various applications, by means of measurement and analysis, and then to encapsulate these characteristics in a compact formula. o The resulting model can be used to gain greater understanding of the performance phenomena involved and to project performance to other system/application combinations July – Oct 2021
  • 36. CSC 457 Lecture Notes 36  Performance Modelling o The performance profile of a given system/application combination depends on numerous factors, including: (1) System size;(2) System architecture; (3) Processor speed; (4) Multi-level cache latency and bandwidth; (5) Interprocessor network latency and bandwidth; (6) System software efficiency; (7) Type of application; (8) Algorithms used; (9) Programming language used; (10) Problem size; (11) Amount of I/O; July – Oct 2021
  • 37. CSC 457 Lecture Notes 37  Performance Modelling o Performance models can be used to improve architecture design, inform procurement, and guide application tuning o It has been observed that, due to the difficulty of developing performance models for new applications, as well as the increasing complexity of new systems, our supercomputers have become better at predicting and explaining natural phenomena (such as the weather) than at predicting and explaining their own performance or that of other computers. July – Oct 2021
  • 38. CSC 457 Lecture Notes 38  Performance Modelling Applications of Performance Modelling o Performance modeling can be used in numerous ways. Here is a brief summary of these usages, both present-day and future possibilities; 1. System design. o Performance models are frequently employed by computer vendors in their design of future systems. Typically engineers construct a performance model for one or two key applications, and then compare future technology options based on performance model projections. Once performance modeling techniques are better developed, it may be possible to target many more applications and technology options July – Oct 2021
  • 39. CSC 457 Lecture Notes 39  Performance Modelling Applications of Performance Modelling 2. Runtime estimation o The most common application for a performance model is to enable a scientist to estimate the runtime of a job when the input parameters for the job are changed, or when a different number of processors is used in a parallel computer system. o One can also estimate the largest size of system that can be used to run a given problem before the parallel efficiency drops to an unacceptable level. 3. System tuning o An example of using performance modeling for system tuning is where a performance model is used to diagnose and rectify a misconfigured channel buffer, which yields a doubling of network performance for programs sending short messages July – Oct 2021
  • 40. CSC 457 Lecture Notes 40  Performance Modelling Applications of Performance Modelling 4. Application Tuning o If a memory performance model is combined with application parameters, one can predict how cache hit-rates would change if a different cache blocking factor were used in the application. o Once the optimal cache blocking has been identified, then the code can be permanently changed. o Simple performance models can even be incorporated into an application code, permitting on-the-fly selection of different program options. o Performance models, by providing performance expectations based on the fundamental computational characteristics of algorithms, can also enable algorithmic choice before going to the trouble to implement all the possible choices July – Oct 2021
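To illustrate the cache-blocking idea mentioned above, here is a minimal loop-tiling sketch (not from the original notes); the matrix size N and block size B are arbitrary illustrative values, whereas in a real tuning exercise B would come from the memory performance model.

    /* blocking.c - minimal sketch of cache blocking (loop tiling) for a
       matrix transpose; a model would suggest a B that keeps a BxB tile
       of each array resident in cache. */
    #include <stdio.h>

    #define N 512
    #define B 64                         /* candidate cache blocking factor */

    static double a[N][N], t[N][N];

    int main(void) {
        /* Work on BxB tiles so each tile is reused while still cached. */
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        t[j][i] = a[i][j];

        printf("t[0][0] = %.1f\n", t[0][0]);
        return 0;
    }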
  • 41. CSC 457 Lecture Notes 41  Pipeline Freeze Strategies • July – Oct 2021
  • 42. CSC 457 Lecture Notes 42  Pipeline Freeze Strategies • July – Oct 2021
  • 43. CSC 457 Lecture Notes 43  Branch Prediction Strategies o In a highly parallel (pipelined) system, conditional instructions break the continuous flow of the program and decrease the performance of the pipelined processor, causing delays. o To reduce this delay, prediction of the branch direction is necessary, and the variety of branch behaviour calls for accurate branch prediction strategies. Branch prediction is therefore a vital part of present-day pipelined processors. o Branch prediction is the process of making an educated guess as to whether a branch will be taken or not taken, based on a preset algorithm. o A branch is a category of instruction which causes the code to move to another block to continue execution. Branch prediction can be static or dynamic July – Oct 2021
  • 44. CSC 457 Lecture Notes 44  Prediction Strategies o Static branch prediction means that a given branch will always be predicted as taken or not taken without possibility of change throughout the duration of the program. o Dynamic branch prediction means that the predicted outcome of a branch is dependent on an algorithm, and the prediction may change throughout the course of the program. o Code is able to use a combination of both static and dynamic branch predictors based on the type of branch. o The improvement from branch prediction depends on the number of branches in the code, as well as the type of prediction being used, as different prediction methods have varied rates of success. o Overall, branch prediction provides an increase in performance for code containing branches. The improvement comes from the computational cycles that can be spent on useful work rather than wasted, as they would be in a system that does not use branch prediction. July – Oct 2021
  • 45. CSC 457 Lecture Notes 45  Prediction Strategies o There are three different kinds of branches: forward conditional, backward conditional, and unconditional branches. o Forward conditional branches are when a branch evaluates to a target that is somewhere forward in the instruction stream. o Backward conditional branches are when a branch evaluates to a target that is somewhere backward in the instruction stream. Common instances of backward conditional branches are loops. o Unconditional branches are branches which will always occur. July – Oct 2021
  • 46. CSC 457 Lecture Notes 46  Prediction Strategies o A static or dynamic prediction strategy will determine which algorithms or methods are available for use. o For static branch prediction, the strategy may be predict taken, predict not taken, or some combination that depends on the branch type, such as backward branch predict taken, forward branch predict not taken. The third strategy is advantageous for programs with loops because it will have a higher percentage of correctly predicted branches for backward branches. o Dynamic branch prediction is able to use one-level prediction, two-level adaptive prediction, or a tournament predictor. One-level prediction keeps a counter for each branch and uses that branch’s history to predict its future outcomes. o The address of the branch is used as an index into a table where these counters are stored. When a branch is taken, its counter is incremented; when it is not taken, the counter is decremented, so the counter tracks the recent behaviour of that branch. July – Oct 2021
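A minimal sketch of the one-level (bimodal) scheme described above, using 2-bit saturating counters indexed by the branch address (not from the original notes); the table size, the address hash, and the branch pattern in main() are illustrative assumptions.

    /* bimodal.c - one-level branch predictor with 2-bit saturating
       counters: states 0-1 predict not taken, states 2-3 predict taken. */
    #include <stdio.h>
    #include <stdint.h>

    #define TABLE_BITS 10
    #define TABLE_SIZE (1u << TABLE_BITS)

    static uint8_t counters[TABLE_SIZE];          /* all start at 0 */

    static int predict(uint32_t addr) {
        return counters[addr & (TABLE_SIZE - 1)] >= 2;   /* 1 = predict taken */
    }

    static void update(uint32_t addr, int taken) {
        uint8_t *c = &counters[addr & (TABLE_SIZE - 1)];
        if (taken  && *c < 3) (*c)++;             /* saturate at 3 */
        if (!taken && *c > 0) (*c)--;             /* saturate at 0 */
    }

    int main(void) {
        /* A loop branch taken 9 times out of 10: after warm-up the
           predictor is right about 90% of the time. */
        uint32_t addr = 0x400123;                 /* illustrative address */
        int correct = 0, total = 100;

        for (int i = 0; i < total; i++) {
            int taken = (i % 10) != 9;
            correct += (predict(addr) == taken);
            update(addr, taken);
        }
        printf("accuracy: %d/%d\n", correct, total);
        return 0;
    }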
  • 47. CSC 457 Lecture Notes 47  Prediction Strategies o The two-level adaptive branch prediction strategy is very similar to the one-level strategy. The two-level strategy uses the same counter concept as the one-level, except that it implements this counter while taking input from other branches. This strategy may also be used to predict the direction of the branch based on the direction and outcomes of other branches in the program. This strategy is also called a global history counter. o Hybrid or tournament prediction strategies use a combination of two or more other prediction strategies. For example, any static prediction used in conjunction with a dynamic prediction strategy would be considered a hybrid strategy. o All of the strategies listed here are used in practice. The two-bit counter presented in the one-level branch prediction strategy is used in a number of other branch prediction strategies, including a predictor for choosing which predictor to use. o One disadvantage to each of these strategies is that their level of improvement for a given code will vary depending on what is written into the code July – Oct 2021
  • 48. CSC 457 Lecture Notes 48  Composite Strategies July – Oct 2021 (Blank!!)
  • 49. CSC 457 Lecture Notes 49  Benchmark Performance (Blank!!) July – Oct 2021
  • 50. CSC 457 Lecture Notes 50  Pipeline Processor Concepts o High performance is an important issue in microprocessors, and its importance has been increasing over the years. o To improve performance, two alternative methods exist: (a) improve the hardware by providing faster circuits; (b) arrange the hardware so that more than one operation can be performed at the same time. o Pipelining is an arrangement of the hardware elements of the CPU such that overall performance is increased; simultaneous execution of more than one instruction takes place in a pipelined processor July – Oct 2021
  • 51. CSC 457 Lecture Notes 51  Pipeline Processor Concepts o A pipeline processor is composed of a sequential, linear list of segments, where each segment performs one computational task or group of tasks. o There are three things that one must observe about the pipeline. 1. First, the work (in a computer, the ISA) is divided up into pieces that more or less fit into the segments allotted for them. 2. Second, this implies that in order for the pipeline to work efficiently and smoothly, the work partitions must each take about the same time to complete. Otherwise, the longest partition requiring time T would hold up the pipeline, and every segment would have to take time T to complete its work. For fast segments, this would mean much idle time. 3. Third, in order for the pipeline to work smoothly, there must be few (if any) exceptions or hazards that cause errors or delays within the pipeline. Otherwise, the instruction will have to be reloaded and the pipeline restarted with the same instruction that caused the exception. July – Oct 2021
  • 52. CSC 457 Lecture Notes 52  Pipeline Processor Concepts o Work Partitioning: A multicycle datapath is based on the assumption that computational work associated with the execution of an instruction could be partitioned into a five-step process, as follows: July – Oct 2021
  • 53. CSC 457 Lecture Notes 53  Pipeline Processor Concepts o Pipelining is one way of improving the overall processing performance of a processor. This architectural approach allows the simultaneous execution of several instructions. o Pipelining is transparent to the programmer; it exploits parallelism at the instruction level by overlapping the execution process of instructions. o It is analogous to an assembly line where workers perform a specific task and pass the partially completed product to the next worker o The pipeline design technique decomposes a sequential process into several subprocesses, called stages or segments. A stage performs a particular function and produces an intermediate result. o It consists of an input latch, also called a register or buffer, followed by a processing circuit. (A processing circuit can be a combinational or sequential circuit.) July – Oct 2021
  • 54. CSC 457 Lecture Notes 54  Pipeline Processor Concepts o At each clock pulse, every stage transfers its intermediate result to the input latch of the next stage. In this way, the final result is produced after the input data have passed through the entire pipeline, completing one stage per clock pulse. o The period of the clock pulse should be large enough to provide sufficient time for a signal to traverse through the slowest stage, which is called the bottleneck (i.e., the stage needing the longest amount of time to complete). o In addition, there should be enough time for a latch to store its input signals. o If the clock's period, P, is expressed as P = tb + tl, then tb should be greater than the maximum delay of the bottleneck stage, and tl should be sufficient for storing data into a latch July – Oct 2021
  • 55. CSC 457 Lecture Notes 55  Pipeline Processor Concepts Completion Time for pipelined processor o The ability to overlap stages of a sequential process for different input tasks (data or operations) results in an overall theoretical completion time of Tpipe = m*P + (n-1)*P, where n is the number of input tasks, m is the number of stages in the pipeline, and P is the clock period o The term m*P is the time required for the first input task to get through the pipeline, and the term (n-1)*P is the time required for the remaining tasks. o After the pipeline has been filled, it generates an output on each clock cycle. In other words, after the pipeline is loaded, it will generate output only as fast as its slowest stage. o Even with this limitation, the pipeline will greatly outperform nonpipelined techniques, which require each task to complete before another task’s execution sequence begins. To be more specific, when n is large, a pipelined processor can produce output approximately m times faster than a nonpipelined processor. July – Oct 2021
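A short sketch of the completion-time formula above (not from the notes); the values m = 5 stages, P = 10 ns, and n = 4 tasks are taken from the instruction-pipeline example worked later in this section.

    /* pipe_time.c - pipelined vs. non-pipelined completion time. */
    #include <stdio.h>

    int main(void) {
        int m = 5;            /* pipeline stages   */
        int n = 4;            /* input tasks       */
        double P = 10.0;      /* clock period (ns) */

        double t_pipe = m * P + (n - 1) * P;   /* = (m + n - 1) * P */
        double t_seq  = (double)n * m * P;     /* no overlap at all */

        printf("Tpipe = %.0f ns, Tseq = %.0f ns\n", t_pipe, t_seq);   /* 80 vs 200 */
        return 0;
    }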
  • 56. CSC 457 Lecture Notes 56  Pipeline Processor Concepts • July – Oct 2021
  • 57. CSC 457 Lecture Notes 57  Pipeline Processor Concepts Pipeline Performance Measures 1. Speedup o Now, speedup (S) may be represented as: S = Tseq / Tpipe = n*m / (m + n - 1). The value S approaches m when n → ∞. That is, the maximum speedup, also called ideal speedup, of a pipeline processor with m stages over an equivalent nonpipelined processor is m. In other words, the ideal speedup is equal to the number of pipeline stages. That is, when n is very large, a pipelined processor can produce output approximately m times faster than a nonpipelined processor. When n is small, the speedup decreases; in fact, for n = 1 the pipeline has the minimum speedup of 1. July – Oct 2021
  • 58. CSC 457 Lecture Notes 58  Pipeline Processor Concepts Pipeline Performance Measures 2. Efficiency o The efficiency E of a pipeline with m stages is defined as: E = S/m = [n*m / (m + n - 1)] / m = n / (m + n - 1). The efficiency E, which represents the speedup per stage, approaches its maximum value of 1 when n → ∞. When n = 1, E will have the value 1/m, which is the lowest obtainable value. July – Oct 2021
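The limiting behaviour of the speedup and efficiency formulas above can be seen with a small sketch (not from the notes); the stage count m and the list of task counts are illustrative.

    /* speedup.c - evaluates S = n*m/(m+n-1) and E = n/(m+n-1) for a
       fixed m and growing n, showing S -> m and E -> 1. */
    #include <stdio.h>

    int main(void) {
        int m = 5;                              /* illustrative stage count */
        int ns[] = { 1, 4, 16, 100, 10000 };

        for (int i = 0; i < 5; i++) {
            int n = ns[i];
            double S = (double)n * m / (m + n - 1);
            double E = (double)n / (m + n - 1);
            printf("n = %6d  S = %5.2f  E = %4.2f\n", n, S, E);
        }
        return 0;
    }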
  • 59. CSC 457 Lecture Notes 59  Pipeline Processor Concepts • July – Oct 2021
  • 60. CSC 457 Lecture Notes 60  Pipeline Processor Concepts Pipeline Types o Pipelines are usually divided into two classes: instruction pipelines and arithmetic pipelines. A pipeline in each of these classes can be designed in two ways: static or dynamic. o A static pipeline can perform only one operation (such as addition or multiplication) at a time. The operation of a static pipeline can only be changed after the pipeline has been drained. (A pipeline is said to be drained when the last input data leave the pipeline.) For example, consider a static pipeline that is able to perform addition and multiplication. Each time that the pipeline switches from a multiplication operation to an addition operation, it must be drained and set for the new operation. o The performance of static pipelines is severely degraded when the operations change often, since this requires the pipeline to be drained and refilled each time. o A dynamic pipeline can perform more than one operation at a time. To perform a particular operation on an input data, the data must go through a certain sequence of stages. In dynamic pipelines the mechanism that controls when data should be fed to the pipeline is much more complex than in static pipelines July – Oct 2021
  • 61. CSC 457 Lecture Notes 61  Pipeline Processor Concepts Instruction Pipeline o An instruction pipeline increases the performance of a processor by overlapping the processing of several different instructions. An instruction pipeline often consists of five stages, as follows: 1. Instruction fetch (IF). Retrieval of instructions from cache (or main memory). 2. Instruction decoding (ID). Identification of the operation to be performed. 3. Operand fetch (OF). Decoding and retrieval of any required operands. 4. Execution (EX). Performing the operation on the operands. 5. Write-back (WB). Updating the destination operands. An instruction pipeline overlaps the process of the preceding stages for different instructions to achieve a much lower total completion time, on average, for a series of instructions. July – Oct 2021
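The overlap of these five stages can be visualised with a small sketch (not from the notes) that prints the space-time diagram for four instructions; the instruction count matches the worked example on the next slide, and the stage names follow the list above.

    /* pipe_diagram.c - space-time diagram for n instructions flowing
       through the five-stage pipeline listed above. */
    #include <stdio.h>

    int main(void) {
        const char *stage[] = { "IF", "ID", "OF", "EX", "WB" };
        int m = 5, n = 4;

        for (int i = 0; i < n; i++) {               /* one row per instruction */
            printf("i%d: ", i + 1);
            for (int c = 1; c <= m + n - 1; c++) {  /* cycles 1 .. m+n-1       */
                int s = c - 1 - i;                  /* stage index at cycle c  */
                if (s >= 0 && s < m) printf("%-3s", stage[s]);
                else                 printf("   ");
            }
            printf("\n");
        }
        return 0;
    }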
  • 62. CSC 457 Lecture Notes 62  Pipeline Processor Concepts Instruction Pipeline o During the first cycle, or clock pulse, instruction i1 is fetched from memory. Within the second cycle, instruction i1 is decoded while instruction i2 is fetched. This process continues until all the instructions are executed. The last instruction finishes the write-back stage after the eighth clock cycle. o Therefore, it takes 80 nanoseconds (ns) to complete execution of all four instructions, assuming the clock period to be 10 ns. The total completion time is Tpipe = m*P + (n-1)*P = 5*10 + (4-1)*10 = 80 ns. Note that in a nonpipelined design the completion time will be much higher. July – Oct 2021
  • 63. CSC 457 Lecture Notes 63  Pipeline Processor Concepts Instruction Pipeline o Note that in a nonpipelined design the completion time will be much higher. Tseq = n*m*P = 4*5*10 = 200 ns o It is worth noting that a pipeline simply takes advantage of these naturally occurring stages to improve processing efficiency. o Henry Ford made the same connection when he realized that all cars were built in stages and invented the assembly line in the early 1900s. o Even though pipelining speeds up the execution of instructions, it does pose potential problems. Some of these problems and possible solutions are discussed next July – Oct 2021
  • 64. CSC 457 Lecture Notes 64  Pipeline Processor Concepts Improving the Throughput of an Instruction Pipeline o Three sources of architectural problems may affect the throughput of an instruction pipeline. They are fetching, bottleneck, and issuing problems. Some solutions are given for each. 1. The fetching problem o In general, supplying instructions rapidly through a pipeline is costly in terms of chip area. Buffering the data to be sent to the pipeline is one simple way of improving the overall utilization of a pipeline. The utilization of a pipeline is defined as the percentage of time that the stages of the pipeline are used over a sufficiently long period of time. A pipeline is utilized 100% of the time when every stage is used (utilized) during each clock cycle. July – Oct 2021
  • 65. CSC 457 Lecture Notes 65  Pipeline Processor Concepts Improving the Throughput of an Instruction Pipeline 1. The fetching problem o Occasionally, the pipeline has to be drained and refilled, for example, whenever an interrupt or a branch occurs. The time spent refilling the pipeline can be minimized by having instructions and data loaded ahead of time into various physically close buffers (like on-chip caches) for immediate transfer into the pipeline. If instructions and data for normal execution can be fetched before they are needed and stored in buffers, the pipeline will have a continuous source of information with which to work. Prefetch algorithms are used to make sure potentially needed instructions are available most of the time. Delays from memory access conflicts can thereby be reduced if these algorithms are used, since the time required to transfer data from main memory is far greater than the time required to transfer data from a buffer. July – Oct 2021
  • 66. CSC 457 Lecture Notes 66  Pipeline Processor Concepts Improving the Throughput of an Instruction Pipeline 2. The bottleneck problem o The bottleneck problem relates to the amount of load (work) assigned to a stage in the pipeline. o If too much work is applied to one stage, the time taken to complete an operation at that stage can become unacceptably long. o This relatively long time spent by the instruction at one stage will inevitably create a bottleneck in the pipeline system. o In such a system, it is better to remove the bottleneck that is the source of congestion. One solution to this problem is to further subdivide the stage. Another solution is to build multiple copies of this stage into the pipeline. July – Oct 2021
  • 67. CSC 457 Lecture Notes 67  Pipeline Processor Concepts Improving the Throughput of an Instruction Pipeline 3. The issuing problem o If an instruction is available, but cannot be executed for some reason, a hazard exists for that instruction. These hazards create issuing problems; they prevent issuing an instruction for execution. Three types of hazard are discussed here. They are called structural hazard, data hazard, and control hazard. o A structural hazard refers to a situation in which a required resource is not available (or is busy) for executing an instruction. o A data hazard refers to a situation in which there exists a data dependency (operand conflict) with a prior instruction. o A control hazard refers to a situation in which an instruction, such as branch, causes a change in the program flow. Each of these hazards is explained next. July – Oct 2021
  • 68. CSC 457 Lecture Notes 68  Pipeline Processor Concepts 1. Structural Hazard o A structural hazard occurs as a result of resource conflicts between instructions. One type of structural hazard that may occur is due to the design of execution units. If an execution unit that requires more than one clock cycle (such as multiply) is not fully pipelined or is not replicated, then a sequence of instructions that uses the unit cannot be issued in consecutive clock cycles (one per clock cycle) for execution. Replicating and/or pipelining execution units increases the number of instructions that can be issued simultaneously. o Another type of structural hazard that may occur is due to the design of register files. If a register file does not have multiple write (read) ports, multiple writes (reads) to (from) registers cannot be performed simultaneously. For example, under certain situations the instruction pipeline might want to perform two register writes in a clock cycle. This may not be possible when the register file has only one write port. The effect of a structural hazard can be reduced fairly simply by implementing multiple execution units and using register files with multiple input/output ports July – Oct 2021
  • 69. CSC 457 Lecture Notes 69  Pipeline Processor Concepts 2. Data Hazard o In a nonpipelined processor, the instructions are executed one by one, and the execution of an instruction is completed before the next instruction is started. In this way, the instructions are executed in the same order as the program. However, this may not be true in a pipelined processor, where instruction executions are overlapped. An instruction may be started and completed before the previous instruction is completed. The data hazard, which is also referred to as the data dependency problem, comes about as a result of overlapping (or changing the order of) the execution of data-dependent instructions. o The delaying of execution can be accomplished in two ways. One way is to delay the OF or IF stages of i2 for two clock cycles. To insert a delay, an extra hardware component called a pipeline interlock can be added to the pipeline. A pipeline interlock detects the dependency and delays the dependent instructions until the conflict is resolved. Another way is to let the compiler solve the dependency problem. During compilation, the compiler detects the dependency between data and instructions. It then rearranges these instructions so that the dependency is not hazardous to the system. If it is not possible to rearrange the instructions, NOP (no operation) instructions are inserted to create delays. July – Oct 2021
  • 70. CSC 457 Lecture Notes 70  Pipeline Processor Concepts o There are three primary types of data hazards: RAW (read after write), WAR (write after read), and WAW (write after write). The hazard names denote the execution ordering of the instructions that must be maintained to produce a valid result; otherwise, an invalid result might occur. o RAW: it refers to the situation in which i2 reads a data source before i1 writes to it. This may produce an invalid result since the read must be performed after the write in order to obtain a valid result. o WAR: This refers to the situation in which i2 writes to a location before i1 reads it. o WAW: This refers to the situation in which i2 writes to a location before i1 writes to it. o Note that the WAR and WAW types of hazards cannot happen when the order of completion of instructions execution in the program is preserved. However, one way to enhance the architecture of an instruction pipeline is to increase concurrent execution of the instructions by dispatching several independent instructions to different functional units, such as adders/subtractors, multipliers, and dividers. That is, the instructions can be executed out of order, and so their execution may be completed out of order too. July – Oct 2021
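A minimal sketch (not from the notes) of how the three dependency types above can be detected between two instructions, given only their destination and source registers; the struct layout and the register numbers in main() are illustrative.

    /* hazards.c - classifies RAW, WAR and WAW hazards between two
       instructions i1 (earlier) and i2 (later). */
    #include <stdio.h>

    struct insn {
        int dest;           /* register written by the instruction */
        int src1, src2;     /* registers read by the instruction   */
    };

    static void classify(struct insn i1, struct insn i2) {
        if (i2.src1 == i1.dest || i2.src2 == i1.dest)
            printf("RAW hazard on r%d\n", i1.dest);   /* i2 reads what i1 writes  */
        if (i2.dest == i1.src1 || i2.dest == i1.src2)
            printf("WAR hazard on r%d\n", i2.dest);   /* i2 writes what i1 reads  */
        if (i2.dest == i1.dest)
            printf("WAW hazard on r%d\n", i2.dest);   /* i2 writes what i1 writes */
    }

    int main(void) {
        struct insn i1 = { 3, 1, 2 };   /* i1: r3 <- r1 op r2 */
        struct insn i2 = { 4, 3, 5 };   /* i2: r4 <- r3 op r5, so RAW on r3 */
        classify(i1, i2);
        return 0;
    }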
  • 71. CSC 457 Lecture Notes 71  Pipeline Processor Concepts o The dependencies between instructions are checked statically by the compiler and/or dynamically by the hardware at run time. This preserves the execution order for dependent instructions, which ensures valid results. o In general, dynamic dependency checking has the advantage of being able to determine dependencies that are either impossible or hard to detect at compile time. However, it may not be able to exploit all the parallelism available in a loop because of the limited lookahead ability that can be supported by the hardware. o Two of the most commonly used techniques for dynamic dependency checking are called Tomasulo's method and the scoreboard method o Tomasulo's method increases concurrent execution of the instructions with minimal (or no) effort by the compiler or the programmer. o The scoreboard method: multiple functional units allow instructions to be completed out of the original program order. July – Oct 2021
  • 72. CSC 457 Lecture Notes 72  Pipeline Processor Concepts 3. Control Hazard o In any set of instructions, there is normally a need for some kind of statement that allows the flow of control to be something other than sequential. Instructions that do this are included in every programming language and are called branches. In general, about 30% of all instructions in a program are branches. o This means that branch instructions in the pipeline can reduce the throughput tremendously if not handled properly. Whenever a branch is taken, the performance of the pipeline is seriously affected. Each such branch requires a new address to be loaded into the program counter, which may invalidate all the instructions that are either already in the pipeline or prefetched in the buffer. This draining and refilling of the pipeline for each branch degrade the throughput of the pipeline to that of a sequential processor. o Note that the presence of a branch statement does not automatically cause the pipeline to drain and begin refilling. A branch not taken allows the continued sequential flow of uninterrupted instructions to the pipeline. Only when a branch is taken does the problem arise. July – Oct 2021
  • 73. CSC 457 Lecture Notes 73  Pipeline Processor Concepts 3. Control Hazard o Branch instructions can be classified into three groups: (1) unconditional branch, (2) conditional branch, and (3) loop branch o An unconditional branch always alters the sequential program flow. It sets a new target address in the program counter, rather than incrementing it by 1 to point to the next sequential instruction address, as is normally the case. o A conditional branch sets a new target address in the program counter only when a certain condition, usually based on a condition code, is satisfied. Otherwise, the program counter is incremented by 1 as usual. A conditional branch selects a path of instructions based on a certain condition. If the condition is satisfied, the path starts from the target address and is called a target path. If it is not, the path starts from the next sequential instruction and is called a sequential path. o A loop branch in a loop statement usually jumps back to the beginning of the loop and executes it either a fixed or a variable (data-dependent) number of times. July – Oct 2021
• 74. CSC 457 Lecture Notes 74  Pipeline Processor Concepts Techniques for Reducing Effect of Branching on Processor Performance o To reduce the effect of branching on processor performance, several techniques have been proposed. Some of the better known techniques are branch prediction, delayed branching, and multiple prefetching. 1. Branch Prediction. In this type of design, the outcome of a branch decision is predicted before the branch is actually executed (see the predictor sketch after this list). Therefore, based on a particular prediction, the sequential path or the target path is chosen for execution. Although the chosen path often reduces the branch penalty, it may increase the penalty in case of incorrect prediction. 2. Delayed Branching. The delayed branching scheme eliminates or significantly reduces the effect of the branch penalty. In this type of design, a certain number of instructions after the branch instruction are fetched and executed regardless of which path will be chosen for the branch. For example, a processor with a branch delay of k executes a path containing the next k sequential instructions and then either continues on the same path or starts a new path from a new target address. As often as possible, the compiler tries to fill the next k instruction slots after the branch with instructions that are independent of the branch instruction. NOP (no operation) instructions are placed in any remaining empty slots. July – Oct 2021
  • 75. CSC 457 Lecture Notes 75  Pipeline Processor Concepts Techniques for Reducing Effect of Branching on Processor Performance 3. Multiple Prefetching. In this type of design, the processor fetches both possible paths. Once the branch decision is made, the unwanted path is thrown away. By prefetching both possible paths, the fetch penalty is avoided in the case of an incorrect prediction. To fetch both paths, two buffers are employed to service the pipeline. In normal execution, the first buffer is loaded with instructions from the next sequential address of the branch instruction. If a branch occurs, the contents of the first buffer are invalidated, and the secondary buffer, which has been loaded with instructions from the target address of the branch instruction, is used as the primary buffer. This double buffering scheme ensures a constant flow of instructions and data to the pipeline and reduces the time delays caused by the draining and refilling of the pipeline. Some amount of performance degradation is unavoidable any time the pipeline is drained, however July – Oct 2021
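o To make branch prediction (technique 1 above) concrete, the following is a minimal sketch of a classic 2-bit saturating-counter predictor indexed by branch address. The table size, the indexing scheme, and the state encoding are assumptions chosen for the example, not something prescribed in these notes.

```python
# Minimal sketch of a 2-bit saturating-counter branch predictor.
# Counter values 0-1 predict "not taken", values 2-3 predict "taken".
# Table size and indexing scheme are assumptions for illustration only.

TABLE_SIZE = 16

class TwoBitPredictor:
    def __init__(self):
        self.counters = [1] * TABLE_SIZE   # start in weakly "not taken"

    def predict(self, branch_addr):
        return self.counters[branch_addr % TABLE_SIZE] >= 2

    def update(self, branch_addr, taken):
        i = branch_addr % TABLE_SIZE
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# A loop branch taken 9 times and then not taken once: after warming up,
# the predictor is right on every taken iteration and mispredicts only
# on the first iteration and at loop exit.
p = TwoBitPredictor()
outcomes = [True] * 9 + [False]
correct = 0
for taken in outcomes:
    correct += (p.predict(0x40) == taken)
    p.update(0x40, taken)
print(f"{correct}/{len(outcomes)} predictions correct")   # 8/10
```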
  • 76. CSC 457 Lecture Notes 76  Pipeline Processor Concepts Further Throughput Improvement of an Instruction Pipeline o One way to increase the throughput of an instruction pipeline is to exploit instruction-level parallelism. The common approaches to accomplish such parallelism are called superscalar, superpipeline, and very long instruction word (VLIW) July – Oct 2021
  • 77. CSC 457 Lecture Notes 77  Pipeline Processor Concepts Further Throughput Improvement of an Instruction Pipeline 1. Superscalar o The superscalar approach relies on spatial parallelism, that is, multiple operations running concurrently on separate hardware. This approach achieves the execution of multiple instructions per clock cycle by issuing several instructions to different functional units. o A superscalar processor contains one or more instruction pipelines sharing a set of functional units. It often contains functional units, such as an add unit, multiply unit, divide unit, floating-point add unit, and graphic unit. o A superscalar processor contains a control mechanism to preserve the execution order of dependent instructions for ensuring a valid result. The scoreboard method and Tomasulo's method can be used for implementing such mechanisms. o In practice, most of the processors are based on the superscalar approach and employ a scoreboard method to ensure a valid result. July – Oct 2021
• 78. CSC 457 Lecture Notes 78  Pipeline Processor Concepts Further Throughput Improvement of an Instruction Pipeline 2. Superpipeline o The superpipeline approach achieves high performance by overlapping the execution of multiple instructions on one instruction pipeline. o A superpipeline processor often has an instruction pipeline with more stages than a typical instruction pipeline design. In other words, the execution process of an instruction is broken down into even finer steps. By increasing the number of stages in the instruction pipeline, each stage has less work to do. This allows the pipeline clock rate to increase (cycle time decreases), since the clock rate depends on the delay found in the slowest stage of the pipeline. o An example of such an architecture is the MIPS R4000 processor. The R4000 subdivides instruction fetching and data cache access to create an eight-stage pipeline. July – Oct 2021
  • 79. CSC 457 Lecture Notes 79  Pipeline Processor Concepts Further Throughput Improvement of an Instruction Pipeline 3. Very Long Instruction Word (VLIW). o The very long instruction word (VLIW) approach makes extensive use of the compiler by requiring it to incorporate several small independent operations into a long instruction word. o The instruction is large enough to provide, in parallel, enough control bits over many functional units. In other words, a VLIW architecture provides many more functional units than a typical processor design, together with a compiler that finds parallelism across basic operations to keep the functional units as busy as possible. o The compiler compacts ordinary sequential codes into long instruction words that make better use of resources. During execution, the control unit issues one long instruction per cycle. The issued instruction initiates many independent operations simultaneously July – Oct 2021
  • 80. CSC 457 Lecture Notes 80  Pipeline Processor Concepts Further Throughput Improvement of an Instruction Pipeline o A comparison of the three approaches will show a few interesting differences. o For instance, the superscalar and VLIW approaches are more sensitive to resource conflicts than the superpipelined approach. o In a superscalar or VLIW processor, a resource must be duplicated to reduce the chance of conflicts, while the superpipelined design avoids any resource conflicts. July – Oct 2021
  • 81. CSC 457 Lecture Notes 81 Pipeline Datapath Design and Implementation o The work involved in an instruction can be partitioned into steps labelled IF (Instruction Fetch), ID (Instruction Decode and data fetch), EX (ALU operations or R-format execution), MEM (Memory operations), and WB (Write- Back to register file) July – Oct 2021
• 82. CSC 457 Lecture Notes 82 Pipeline Datapath Design and Implementation MIPS Instructions and Pipelining o MIPS (Microprocessor without Interlocked Pipelined Stages) is a reduced instruction set computer (RISC) instruction set architecture (ISA) o In order to implement MIPS instructions effectively on a pipeline processor, we must ensure that the instructions are the same length (simplicity favors regularity) for easy IF and ID, similar to the multicycle datapath. o We also need to have few but consistent instruction formats, to avoid deciphering variable formats during IF and ID, which would prohibitively increase pipeline segment complexity for those tasks. Thus, the register indices should be in the same place in each instruction. July – Oct 2021
  • 83. CSC 457 Lecture Notes 83 Next ….... Memory and I/O Systems July – Oct 2021
• 84. CSC 457 Lecture Notes 84 Levels of Memory o Level 1 or Registers – Small, very fast storage locations inside the CPU that hold the data and addresses the processor is operating on immediately. Commonly used registers include the accumulator, the program counter, and address registers. o Level 2 or Cache memory – A very fast memory with a short access time in which data is temporarily stored for faster access. o Level 3 or Main Memory – The memory the computer is currently working on. It is limited in size and volatile: once power is off, the data no longer stays in this memory. o Level 4 or Secondary Memory – External memory which is not as fast as main memory, but in which data stays permanently. July – Oct 2021
• 85. CSC 457 Lecture Notes 85 Cache Memory o The cache is a smaller and faster memory which stores copies of the data from frequently used main memory locations. o Cache Memory is a special very high-speed memory used to speed up and synchronize with a high-speed CPU. Cache memory is costlier than main memory or disk memory but more economical than CPU registers. o Cache memory is an extremely fast memory type that acts as a buffer between RAM and the CPU. It holds frequently requested data and instructions so that they are immediately available to the CPU when needed. July – Oct 2021
  • 86. CSC 457 Lecture Notes 86 Cache Memory o Cache memory is used to reduce the average time to access data from the Main memory. The cache is a smaller and faster memory which stores copies of the data from frequently used main memory locations. o There are various different independent caches in a CPU, which store instructions and data. July – Oct 2021
  • 87. CSC 457 Lecture Notes 87 Basic Definitions in Cache Memory o cache block - The basic unit for cache storage. May contain multiple bytes/words of data. o cache line - Same as cache block. Note that this is not the same thing as a “row” of cache. o cache set - A “row” in the cache. The number of blocks per set is determined by the layout of the cache (e.g. direct mapped, set-associative, or fully associative). o tag - A unique identifier for a group of data. Because different regions of memory may be mapped into a block, the tag is used to differentiate between them. o valid bit - A bit of information that indicates whether the data in a block is valid (1) or not (0). July – Oct 2021
  • 88. CSC 457 Lecture Notes 88 Types of Cache Memory o There are three general cache levels: o L1 cache, or primary cache, is extremely fast but relatively small, and is usually embedded in the processor chip as CPU cache. o L2 cache, or secondary cache, often has higher capacity than L1. L2 cache may be embedded on the CPU, or it can be on a separate chip or coprocessor and have a high- speed alternative system bus connecting the cache and CPU. That way it doesn't get slowed by traffic on the main system bus. July – Oct 2021
  • 89. CSC 457 Lecture Notes 89 Types of Cache Memory o Level 3 (L3) cache is specialized memory developed to improve the performance of L1 and L2. L1 or L2 can be significantly faster than L3, though L3 is usually double the speed of DRAM. With multicore processors, each core can have dedicated L1 and L2 cache, but they can share an L3 cache. If an L3 cache references an instruction, it is usually elevated to a higher level of cache. o Contrary to popular belief, implementing flash or more dynamic RAM (DRAM) on a system won't increase cache memory. This can be confusing since the terms memory caching (hard disk buffering) and cache memory are often used interchangeably. o Memory caching, using DRAM or flash to buffer disk reads, is meant to improve storage I/O by caching data that is frequently referenced in a buffer ahead of slower magnetic disk or tape. Cache memory, on the other hand, provides read buffering for the CPU July – Oct 2021
• 90. CSC 457 Lecture Notes 90 Cache Memory Performance o When the processor needs to read or write a location in main memory, it first checks for a corresponding entry in the cache. o If the processor finds that the memory location is in the cache, a cache hit has occurred and the data is read from the cache o If the processor does not find the memory location in the cache, a cache miss has occurred. For a cache miss, the cache allocates a new entry and copies in data from main memory, then the request is fulfilled from the contents of the cache. o The performance of cache memory is frequently measured in terms of a quantity called Hit ratio. Hit ratio = hit / (hit + miss) = no. of hits/total accesses o We can improve cache performance by using a larger cache block size and higher associativity, and by reducing the miss rate, the miss penalty, and the time to hit in the cache. July – Oct 2021
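o A quick back-of-the-envelope calculation makes the hit ratio and its effect on the average access time concrete. The counts and timing figures below are invented purely for illustration.

```python
# Illustrative calculation: hit ratio and average memory access time (AMAT).
# The counts and timing figures below are made-up example values.

hits, misses = 950, 50
hit_time_ns = 2            # time to access the cache
miss_penalty_ns = 100      # extra time to fetch the block from main memory

hit_ratio = hits / (hits + misses)
miss_ratio = 1 - hit_ratio
amat_ns = hit_time_ns + miss_ratio * miss_penalty_ns

print(f"Hit ratio = {hit_ratio:.2f}")   # 0.95
print(f"AMAT      = {amat_ns:.1f} ns")  # 2 + 0.05 * 100 = 7.0 ns
```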
  • 91. CSC 457 Lecture Notes 91 Architecture and data flow of a typical cache memory July – Oct 2021
• 92. CSC 457 Lecture Notes 92 Cache Memory Mapping o There are three different types of mapping used for the purpose of cache memory which are as follows: Direct mapping, Associative mapping, and Set-Associative mapping. o Direct mapped cache has each block mapped to exactly one cache memory location. Conceptually, a direct mapped cache is like rows in a table with three columns: the cache block that contains the actual data fetched and stored, a tag with all or part of the address of the data that was fetched, and a valid flag bit that indicates whether the row entry holds valid data. o Fully associative cache mapping is similar to direct mapping in structure but allows a memory block to be mapped to any cache location rather than to a prespecified cache memory location as is the case with direct mapping. o Set associative cache mapping can be viewed as a compromise between direct mapping and fully associative mapping in which each block is mapped to a subset of cache locations. It is sometimes called N-way set associative mapping, which provides for a location in main memory to be cached to any of "N" locations in the L1 cache. July – Oct 2021
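o To see how the three mapping policies constrain placement, the fragment below lists, for an assumed tiny cache of 8 blocks, which cache slots a given memory block may occupy under each scheme. The cache dimensions are arbitrary example values.

```python
# Illustrative sketch: which cache slots may memory block `block` occupy?
# A tiny cache of 8 blocks; all dimensions are arbitrary example values.

NUM_BLOCKS = 8

def placement(block, ways):
    """Cache slot indices the memory block may occupy in a cache organised
    as `ways`-way set associative (ways=1 is direct mapped)."""
    num_sets = NUM_BLOCKS // ways
    set_index = block % num_sets
    return [set_index * ways + w for w in range(ways)]

block = 13
print("direct mapped    :", placement(block, ways=1))           # one slot only
print("2-way set assoc. :", placement(block, ways=2))           # any of 2 slots
print("fully associative:", placement(block, ways=NUM_BLOCKS))  # any of 8 slots
```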
  • 93. CSC 457 Lecture Notes 93 Locality of Reference o The ability of cache memory to improve a computer's performance relies on the concept of locality of reference. o Locality describes various situations that make a system more predictable. o Cache memory takes advantage of these situations to create a pattern of memory access that it can rely upon. o There are several types of locality. Two key ones for cache are:  Temporal locality. This is when the same resources are accessed repeatedly in a short amount of time.  Spatial locality. This refers to accessing various data or resources that are near each other. July – Oct 2021
  • 94. CSC 457 Lecture Notes 94 Importance of Cache Memory o Cache memory is important because it improves the efficiency of data retrieval (improve performance). It stores program instructions and data that are used repeatedly in the operation of programs or information that the CPU is likely to need next. The computer processor can access this information more quickly from the cache than from the main memory. Fast access to these instructions increases the overall speed of the program. o Aside from its main function of improving performance, cache memory is a valuable resource for evaluating a computer's overall performance. Users can do this by looking at cache's hit-to-miss ratio. Cache hits are instances in which the system successfully retrieves data from the cache. A cache miss is when the system looks for the data in the cache, can't find it, and looks somewhere else instead. In some cases, users can improve the hit-miss ratio by adjusting the cache memory block size i.e. the size of data units stored. July – Oct 2021
• 95. CSC 457 Lecture Notes 95 Practice Questions Que-1: A computer has a 256 KByte, 4-way set associative, write back data cache with a block size of 32 Bytes. The processor sends 32-bit addresses to the cache controller. Each cache tag directory entry contains, in addition to the address tag, 2 valid bits, 1 modified bit and 1 replacement bit. The number of bits in the tag field of an address is (A) 11 (B) 14 (C) 16 (D) 27 Answer: (C) Explanation: https://www.geeksforgeeks.org/gate-gate-cs-2012-question-54/ July – Oct 2021
• 96. CSC 457 Lecture Notes 96 Practice Questions o Explanation: o A set-associative scheme is a hybrid between a fully associative cache and a direct mapped cache. It’s considered a reasonable compromise between the complex hardware needed for fully associative caches (which requires parallel searches of all slots), and the simplistic direct-mapped scheme, which may cause collisions of addresses to the same slot (similar to collisions in a hash table). • Number of blocks = Cache-Size/Block-Size = 256 KB / 32 Bytes = 2^13 • Number of Sets = 2^13 / 4 = 2^11 o Tag + Set index + Byte offset = 32 o Tag + 11 + 5 = 32 o Tag = 16 July – Oct 2021
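o The same arithmetic can be checked in a few lines (values taken from the question above).

```python
# Check of the Que-1 arithmetic: 256 KB, 4-way set associative, 32 B blocks.
from math import log2

cache_size   = 256 * 1024   # bytes
block_size   = 32           # bytes
ways         = 4
address_bits = 32

num_blocks  = cache_size // block_size        # 2^13 = 8192
num_sets    = num_blocks // ways              # 2^11 = 2048
offset_bits = int(log2(block_size))           # 5
set_bits    = int(log2(num_sets))             # 11
tag_bits    = address_bits - set_bits - offset_bits
print(tag_bits)                               # 16
```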
  • 97. CSC 457 Lecture Notes 97 Practice Questions Que-2: Consider the data given in previous question. The size of the cache tag directory is (A) 160 Kbits (B) 136 bits (C) 40 Kbits (D) 32 bits Answer: (A) Explanation: https://www.geeksforgeeks.org/gate-gate-cs-2012-question-55/ July – Oct 2021
• 98. CSC 457 Lecture Notes 98 Practice Questions Explanation: Each tag directory entry holds: 16 tag bits + 2 valid bits + 1 modified bit + 1 replacement bit = 20 bits. Size of tag directory = 20 bits × no. of blocks = 20 × 2^13 = 160 Kbits. July – Oct 2021
• 99. CSC 457 Lecture Notes 99 Practice Questions Que-3: An 8KB direct-mapped write-back cache is organized as multiple blocks, each of size 32 bytes. The processor generates 32-bit addresses. The cache controller maintains the tag information for each cache block comprising the following: 1 valid bit; 1 modified bit; and as many bits as the minimum needed to identify the memory block mapped in the cache. What is the total size of memory needed at the cache controller to store meta-data (tags) for the cache? (A) 4864 bits (B) 6144 bits (C) 6656 bits (D) 5376 bits Answer: (D) Explanation: https://www.geeksforgeeks.org/gate-gate-cs-2011-question-43/ July – Oct 2021
• 100. CSC 457 Lecture Notes 100 Practice Questions Explanation o Cache size = 8 KB o Block size = 32 bytes, so block offset = 5 bits o Number of cache lines = Cache size / Block size = (8 × 1024 bytes)/32 = 256, so line index = 8 bits o Tag bits = 32 − 8 − 5 = 19 o Total bits required to store meta-data of 1 line = 1 + 1 + 19 = 21 bits o Total memory required = 21 × 256 = 5376 bits July – Oct 2021
  • 101. CSC 457 Lecture Notes 101 Locating Data in the Cache o Given an address, we can determine whether the data at that memory location is in the cache. To do so, we use the following procedure: 1. Use the set index to determine which cache set the address should reside in. 2. For each block in the corresponding cache set, compare the tag associated with that block to the tag from the memory address. If there is a match, proceed to the next step. Otherwise, the data is not in the cache. 3. For the block where the data was found, look at valid bit. If it is 1, the data is in the cache, otherwise it is not. July – Oct 2021
• 102. CSC 457 Lecture Notes 102 Locating Data in the Cache o If the data at that address is in the cache, then we use the block offset from that address to find the data within the cache block where the data was found. All of the information needed to locate the data in the cache is given in the address; Fig. 1 below shows which parts of the address are used for locating data in the cache. Fig. 1: | tag (t bits) | set index (s bits) | block offset (b bits) | o The least significant bits are used to determine the block offset. If the block size is B then b = log2 B bits will be needed in the address to specify the block offset. The next highest group of bits is the set index and is used to determine which cache set we will look at. o If S is the number of sets in our cache, then the set index has s = log2 S bits. Note that in a fully-associative cache, there is only 1 set so the set index will not exist. The remaining bits are used for the tag. If ℓ is the length of the address (in bits), then the number of tag bits is t = ℓ − b − s. July – Oct 2021
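o A small helper that splits an address using exactly these b, s and t fields might look like the sketch below; the block size and set count are example values (matching Que-1 above), not requirements.

```python
# Illustrative decomposition of an address into tag / set index / block offset.
from math import log2

def split_address(addr, block_size, num_sets):
    b = int(log2(block_size))            # block offset bits
    s = int(log2(num_sets))              # set index bits
    offset    = addr & (block_size - 1)
    set_index = (addr >> b) & (num_sets - 1)
    tag       = addr >> (b + s)          # the remaining t = l - b - s bits
    return tag, set_index, offset

# Example values: 32-byte blocks and 2048 sets, as in Que-1 above.
tag, set_index, offset = split_address(0x1234ABCD, block_size=32, num_sets=2048)
print(hex(tag), set_index, offset)       # 0x1234 1374 13
```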
  • 103. CSC 457 Lecture Notes 103 Cache Addressing o (Read Tutorial on “Hardware Organization and Design” – 15 pages) July – Oct 2021
  • 104. CSC 457 Lecture Notes 104 Multilevel Cache Organisation o Multilevel Caches is one of the techniques to improve Cache Performance by reducing the “MISS PENALTY”. Miss Penalty refers to the extra time required to bring the data into cache from the Main memory whenever there is a “miss” in the cache. o For clear understanding let us consider an example where the CPU requires 10 Memory References for accessing the desired information and consider this scenario in the following 3 cases of System design : Case 1 : System Design without Cache Memory o Here the CPU directly communicates with the main memory and no caches are involved. In this case, the CPU needs to access the main memory 10 times to access the desired information. July – Oct 2021
• 105. CSC 457 Lecture Notes 105 Multilevel Cache Organisation Case 2 : System Design with Cache Memory o Here the CPU at first checks whether the desired data is present in the Cache Memory or not i.e. whether there is a “hit” in the cache or a “miss” in the cache. o Suppose 3 of the 10 references miss in the cache; then the main memory will be accessed only 3 times. o We can see that here the miss penalty is reduced because the main memory is accessed fewer times than in the previous case. July – Oct 2021
• 106. CSC 457 Lecture Notes 106 Multilevel Cache Organisation Case 3 : System Design with Multilevel Cache Memory o Here the cache performance is optimized further by introducing multilevel caches; we consider a 2-level cache design. o Suppose 3 of the 10 references miss in the L1 cache, and out of these 3 misses, 2 also miss in the L2 cache; then the main memory will be accessed only 2 times. o It is clear that here the miss penalty is reduced considerably compared with the previous case, thereby improving the performance of the cache memory. July – Oct 2021
• 107. CSC 457 Lecture Notes 107 Multilevel Cache Organisation o We can observe from the above 3 cases that we are trying to decrease the number of main memory references and thus the miss penalty, in order to improve the overall system performance. Also, it is important to note that in the multilevel cache design, the L1 cache is attached to the CPU; it is small in size but fast. The L2 cache, in turn, is attached to the L1 cache; it is larger and slower than L1 but still faster than the main memory. o Effective Access Time = Hit rate * Cache access time + Miss rate * Lower level access time o Average access time for a multilevel cache (Tavg): Tavg = H1 * C1 + (1 – H1) * (H2 * C2 + (1 – H2) * M) where H1 is the hit rate in the L1 cache; H2 is the hit rate in the L2 cache; C1 is the time to access information in the L1 cache; C2 is the miss penalty to transfer information from the L2 cache to the L1 cache; and M is the miss penalty to transfer information from the main memory to the L2 cache. July – Oct 2021
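o Plugging assumed numbers into these two formulas shows how the second cache level shrinks the average access time. All hit rates and latencies below are invented example values.

```python
# Illustrative use of the multilevel-cache average access time formula.
# All hit rates and latencies below are invented example values.

H1, H2 = 0.90, 0.80   # hit rates in L1 and L2
C1, C2 = 1, 10        # L1 access time and L2 -> L1 miss penalty (cycles)
M = 100               # main memory -> L2 miss penalty (cycles)

t_single = H1 * C1 + (1 - H1) * M                       # L1 only
t_multi  = H1 * C1 + (1 - H1) * (H2 * C2 + (1 - H2) * M)

print(f"L1 only : {t_single:.1f} cycles")  # 0.9 + 0.1 * 100      = 10.9
print(f"L1 + L2 : {t_multi:.1f} cycles")   # 0.9 + 0.1 * (8 + 20) = 3.7
```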
• 108. CSC 457 Lecture Notes 108 Multilevel Cache Organisation Exercise o Que 1 - Find the average memory access time for a processor with a 2 ns clock cycle time, a miss rate of 0.04 misses per instruction, a miss penalty of 25 clock cycles, and a cache access time (including hit detection) of 1 clock cycle. Also, assume that the read and write miss penalties are the same and ignore other write stalls. Solution: Average Memory Access Time (AMAT) = Hit Time + Miss Rate * Miss Penalty. Hit Time = 1 clock cycle (given directly in the question); Miss Rate = 0.04; Miss Penalty = 25 clock cycles (the time taken by the next level of memory on a miss). So AMAT = 1 + 0.04 * 25 = 2 clock cycles. Since 1 clock cycle = 2 ns, AMAT = 4 ns. July – Oct 2021
  • 109. CSC 457 Lecture Notes 109 Virtual Memory Terminologies o Virtual Memory - A storage allocation scheme in which secondary memory can be addressed as though it were part of main memory. The addresses a program may use to reference memory are distinguished from the addresses the memory system uses to identify physical storage sites, and program-generated addresses are translated automatically to the corresponding machine addresses. The size of virtual storage is limited by the addressing scheme of the computer system and by the amount of secondary memory available and not by the actual number of main storage locations. o Virtual Address - The address assigned to a location in virtual memory to allow that location to be accessed as though it were part of main memory. o Virtual address space - The virtual storage assigned to a process. o Address space - The range of memory addresses available to a process. o Real address - The address of a storage location in main memory. July – Oct 2021
• 110. CSC 457 Lecture Notes 110 Virtual Memory o Virtual Memory is a storage allocation scheme in which secondary memory can be addressed as though it were part of main memory. o The addresses a program may use to reference memory are distinguished from the addresses the memory system uses to identify physical storage sites, and program generated addresses are translated automatically to the corresponding machine addresses. o The size of virtual storage is limited by the addressing scheme of the computer system and by the amount of secondary memory available, not by the actual number of main storage locations. o It is a technique that is implemented using both hardware and software. It maps memory addresses used by a program, called virtual addresses, into physical addresses in computer memory. July – Oct 2021
  • 111. CSC 457 Lecture Notes 111 Virtual Memory o Two characteristics fundamental to memory management: 1) all memory references are logical addresses that are dynamically translated into physical addresses at run time 2) a process may be broken up into a number of pieces that don’t need to be contiguously located in main memory during execution o If these two characteristics are present, it is not necessary that all of the pages or segments of a process be in main memory during execution. This means that the required pages need to be loaded into memory whenever required. Virtual memory is implemented using Demand Paging or Demand Segmentation. July – Oct 2021
  • 113. CSC 457 Lecture Notes 113 Thrashing o A state in which the system spends most of its time swapping process pieces rather than executing instructions o To avoid this, the operating system tries to guess, based on recent history, which pieces are least likely to be used in the near future July – Oct 2021
  • 114. CSC 457 Lecture Notes 114 Principle of Locality o Program and data references within a process tend to cluster o Only a few pieces of a process will be needed over a short period of time o Therefore it is possible to make intelligent guesses about which pieces will be needed in the future o Avoids thrashing July – Oct 2021
  • 115. CSC 457 Lecture Notes 115  Support Needed for Virtual Memory o For virtual memory to be practical and effective: 1. Hardware must support paging and segmentation 2. Operating system must include software for managing the movement of pages and/or segments between secondary memory and main memory July – Oct 2021
  • 116. CSC 457 Lecture Notes 116  Paging o The term virtual memory is usually associated with systems that employ paging o Use of paging to achieve virtual memory was first reported for the Atlas computer o Each process has its own page table and each page table entry contains the frame number of the corresponding page in main memory July – Oct 2021
  • 117. CSC 457 Lecture Notes 117  Paging o Paging is a memory management scheme that eliminates the need for contiguous allocation of physical memory. This scheme permits the physical address space of a process to be non – contiguous • Logical Address or Virtual Address (represented in bits): An address generated by the CPU • Logical Address Space or Virtual Address Space( represented in words or bytes): The set of all logical addresses generated by a program • Physical Address (represented in bits): An address actually available on memory unit • Physical Address Space (represented in words or bytes): The set of all physical addresses corresponding to the logical addresses July – Oct 2021
• 118. CSC 457 Lecture Notes 118  Paging o Example: • If Logical Address = 31 bits, then Logical Address Space = 2^31 words = 2 G words (1 G = 2^30) • If Logical Address Space = 128 M words = 2^7 * 2^20 words, then Logical Address = log2 2^27 = 27 bits • If Physical Address = 22 bits, then Physical Address Space = 2^22 words = 4 M words (1 M = 2^20) • If Physical Address Space = 16 M words = 2^4 * 2^20 words, then Physical Address = log2 2^24 = 24 bits July – Oct 2021
  • 119. CSC 457 Lecture Notes 119  Paging o The mapping from virtual to physical address is done by the memory management unit (MMU) which is a hardware device and this mapping is known as paging technique. o The Physical Address Space is conceptually divided into a number of fixed-size blocks, called frames. o The Logical address Space is also split into fixed-size blocks, called pages. o Page Size = Frame Size July – Oct 2021
  • 120. CSC 457 Lecture Notes 120  Paging • Let us consider an example: • Physical Address = 12 bits, then Physical Address Space = 4 K words • Logical Address = 13 bits, then Logical Address Space = 8 K words • Page size = frame size = 1 K words (assumption) July – Oct 2021
  • 121. CSC 457 Lecture Notes 121  Paging o Address generated by CPU is divided into • Page number(p): Number of bits required to represent the pages in Logical Address Space or Page number • Page offset(d): Number of bits required to represent particular word in a page or page size of Logical Address Space or word number of a page or page offset. o Physical Address is divided into • Frame number(f): Number of bits required to represent the frame of Physical Address Space or Frame number. • Frame offset(d): Number of bits required to represent particular word in a frame or frame size of Physical Address Space or word number of a frame or frame offset July – Oct 2021
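o The page-number/offset split and the frame lookup can be captured in a tiny sketch. The 1 K-word page size matches the example above; the page-table contents are invented for illustration.

```python
# Illustrative paging translation with 1 K-word pages (as in the example above).
# The page table contents are invented for this sketch.

PAGE_SIZE = 1024                      # words per page -> 10 offset bits
page_table = {0: 5, 1: 2, 2: 7}       # page number p -> frame number f (assumed)

def translate(logical_addr):
    p = logical_addr // PAGE_SIZE     # page number
    d = logical_addr % PAGE_SIZE      # page offset
    f = page_table[p]                 # a missing entry would mean a page fault
    return f * PAGE_SIZE + d          # physical address

print(translate(1500))                # page 1, offset 476 -> frame 2 -> 2524
```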
• 122. CSC 457 Lecture Notes 122  Paging o The hardware implementation of the page table can be done by using dedicated registers, but the usage of registers for the page table is satisfactory only if the page table is small. o If the page table contains a large number of entries, then we can use a TLB (translation look-aside buffer), a special, small, fast look-up hardware cache. o The TLB is an associative, high-speed memory. o Each entry in the TLB consists of two parts: a tag and a value. o When this memory is used, an item is compared with all tags simultaneously; if the item is found, the corresponding value is returned. July – Oct 2021
  • 123. CSC 457 Lecture Notes 123  Paging July – Oct 2021
• 124. CSC 457 Lecture Notes 124  Paging o Main memory access time = m o If the page table is kept in main memory, o Effective access time = m (to access the page table) + m (to access the referenced word) = 2m July – Oct 2021
  • 125. CSC 457 Lecture Notes 125  Paging Sample Question o Consider a machine with 64 MB physical memory and a 32-bit virtual address space. If the page size is 4KB, what is the approximate size of the page table? (A) 16 MB (B) 8 MB (C) 2 MB (D) 24 MB Answer: (C) o Explanation: See question 1 of https://www.geeksforgeeks.org/operating-systems-set-2/ July – Oct 2021
  • 126. CSC 457 Lecture Notes 126  Paging Explanation: A page entry is used to get address of physical memory. Here we assume that single level of Paging is happening. So the resulting page table will contain entries for all the pages of the Virtual address space. Number of entries in page table = (virtual address space size)/(page size) Using above formula we can say that there will be 2^(32-12) = 2^20 entries in page table. No. of bits required to address the 64MB Physical memory = 26. So there will be 2^(26-12) = 2^14 page frames in the physical memory. And page table needs to store the address of all these 2^14 page frames. Therefore, each page table entry will contain 14 bits address of the page frame and 1 bit for valid-invalid bit. Since memory is byte addressable. So we take that each page table entry is 16 bits i.e. 2 bytes long. Size of page table = (total number of page table entries) *(size of a page table entry) = (2^20 *2) = 2MB July – Oct 2021
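o The same calculation can be scripted (all values come from the question above; the 2-byte entry size follows the rounding argument in the explanation).

```python
# Check of the page-table size calculation from the question above.
virtual_bits  = 32
page_size     = 4 * 1024            # 4 KB pages -> 12 offset bits
physical_size = 64 * 1024 * 1024    # 64 MB physical memory

offset_bits = page_size.bit_length() - 1                      # 12
entries     = 2 ** (virtual_bits - offset_bits)               # 2^20
frame_bits  = (physical_size // page_size).bit_length() - 1   # 14
entry_bytes = 2        # 14 frame bits + 1 valid bit, rounded up to 2 bytes

print(entries * entry_bytes // (1024 * 1024), "MB")           # 2 MB
```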
  • 127. CSC 457 Lecture Notes 127  Demand Paging o The process of loading the page into memory on demand (whenever page fault occurs) is known as demand paging. The process includes the following steps : July – Oct 2021
• 128. CSC 457 Lecture Notes 128  Demand Paging 1. If the CPU tries to reference a page that is currently not available in the main memory, it generates an interrupt indicating a memory access fault. 2. The OS puts the interrupted process in a blocked state. For execution to proceed, the OS must bring the required page into memory. 3. The OS will locate the required page on secondary storage (the backing store). 4. The required page will be brought into a frame of physical memory; page replacement algorithms are used to decide which resident page to replace when no free frame is available. 5. The page table will be updated accordingly. 6. A signal will be sent to the CPU to continue the program execution, and the process is placed back into the ready state. o Hence, whenever a page fault occurs, these steps are followed by the operating system and the required page is brought into memory. July – Oct 2021
  • 129. CSC 457 Lecture Notes 129  Advantages of Demand Paging 1. More processes may be maintained in the main memory: Because we are going to load only some of the pages of any particular process, there is room for more processes. This leads to more efficient utilization of the processor because it is more likely that at least one of the more numerous processes will be in the ready state at any particular time. 2. A process may be larger than all of main memory: One of the most fundamental restrictions in programming is lifted. A process larger than the main memory can be executed because of demand paging. The OS itself loads pages of a process in main memory as required. 3. It allows greater multiprogramming levels by using less of the available (primary) memory for each process July – Oct 2021
  • 130. CSC 457 Lecture Notes 130  Page Fault Service Time o The time taken to service the page fault is called as page fault service time. The page fault service time includes the time taken to perform all the above six steps. Let Main memory access time is: m Page fault service time is: s Page fault rate is : p Then, Effective memory access time = (p*s) + (1-p)*m July – Oct 2021
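o Even a tiny page fault rate dominates the average access time when the service time is large, as a quick calculation shows. The numbers below are assumed example values.

```python
# Effective memory access time under demand paging (assumed example values).
m = 200e-9    # main memory access time: 200 ns
s = 8e-3      # page fault service time: 8 ms (disk I/O plus OS overhead)
p = 1e-6      # page fault rate: one fault per million accesses

effective = p * s + (1 - p) * m
print(f"Effective access time = {effective * 1e9:.1f} ns")   # about 208 ns
```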
  • 131. CSC 457 Lecture Notes 131  Swapping o Swapping a process out means removing all of its pages from memory, or marking them so that they will be removed by the normal page replacement process. o Suspending a process ensures that it is not runnable while it is swapped out. At some later time, the system swaps back the process from the secondary storage to main memory. o When a process is busy swapping pages in and out then this situation is called thrashing July – Oct 2021
  • 132. CSC 457 Lecture Notes 132  Inverted Page Table o Page number portion of a virtual address is mapped into a hash value  hash value points to inverted page table o Fixed proportion of real memory is required for the tables regardless of the number of processes or virtual pages supported o Structure is called inverted because it indexes page table entries by frame number rather than by virtual page number July – Oct 2021
• 133. CSC 457 Lecture Notes 133  Inverted Page Table o Each entry in the page table includes:
 - Page Number – the virtual page number
 - Process Identifier – the process that owns this page
 - Control Bits – includes flags and protection and locking info
 - Chain Pointer – the index value of the next entry in the chain
July – Oct 2021
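o A minimal sketch of the hashed, chained lookup into an inverted table follows. The hash function, the table size and the contents are assumptions made for this illustration only.

```python
# Illustrative inverted page table: one entry per physical frame, looked up
# by hashing the (process id, virtual page number) pair and walking a chain.

NUM_FRAMES = 8

inverted = [None] * NUM_FRAMES   # frame -> {page, pid, ctrl, chain}
hash_anchor = {}                 # hash value -> first frame in its chain

def h(pid, vpage):
    return (pid * 31 + vpage) % NUM_FRAMES   # placeholder hash function

def insert(pid, vpage, frame):
    key = h(pid, vpage)
    inverted[frame] = {"page": vpage, "pid": pid, "ctrl": 0,
                       "chain": hash_anchor.get(key)}
    hash_anchor[key] = frame

def lookup(pid, vpage):
    frame = hash_anchor.get(h(pid, vpage))
    while frame is not None:                 # walk the collision chain
        entry = inverted[frame]
        if entry["pid"] == pid and entry["page"] == vpage:
            return frame                     # physical frame number
        frame = entry["chain"]
    return None                              # not resident -> page fault

insert(pid=1, vpage=42, frame=3)
insert(pid=2, vpage=42, frame=5)
print(lookup(1, 42), lookup(2, 42), lookup(1, 99))   # 3 5 None
```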
  • 134. CSC 457 Lecture Notes 134  Translation Lookaside Buffer (TLB) o Each virtual memory reference can cause two physical memory accesses:  one to fetch the page table entry  one to fetch the data o To overcome the effect of doubling the memory access time, most virtual memory schemes make use of a special high-speed cache called a translation lookaside buffer (TLB) July – Oct 2021
  • 135. CSC 457 Lecture Notes 135  Translation Lookaside Buffer (TLB) o The TLB only contains some of the page table entries so we cannot simply index into the TLB based on page number  each TLB entry must include the page number as well as the complete page table entry (associative mapping) o The processor is equipped with hardware that allows it to interrogate simultaneously a number of TLB entries to determine if there is a match on page number July – Oct 2021
  • 136. CSC 457 Lecture Notes 136  Translation Lookaside Buffer (TLB) o Some TLBs store address-space identifiers (ASIDs) in each TLB entry – – uniquely identifies each process – provide address-space protection for that process – Otherwise need to flush at every context switch o TLBs typically small (64 to 1,024 entries) o On a TLB miss, value is loaded into the TLB for faster access next time – Replacement policies must be considered – Some entries can be wired down for permanent fast access July – Oct 2021
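o A toy fully-associative TLB keyed by (ASID, virtual page number) illustrates both the associative lookup and the role of the ASID. The capacity, the FIFO replacement policy and the contents are assumptions for this sketch.

```python
# Toy fully-associative TLB keyed by (ASID, virtual page number).
# Capacity, FIFO replacement and contents are assumptions for this sketch.
from collections import OrderedDict

class TLB:
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = OrderedDict()             # (asid, vpage) -> frame

    def lookup(self, asid, vpage):
        return self.entries.get((asid, vpage))   # None signals a TLB miss

    def refill(self, asid, vpage, frame):
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)     # evict the oldest entry
        self.entries[(asid, vpage)] = frame

tlb = TLB(capacity=2)
tlb.refill(asid=1, vpage=0x10, frame=7)
print(tlb.lookup(1, 0x10))   # 7    (hit)
print(tlb.lookup(2, 0x10))   # None (different address space -> miss)
```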
• 137. CSC 457 Lecture Notes 137  Improving Efficiency of Virtual Address Translation o The next step towards improving the efficiency of virtual address translation is the memory management unit (MMU), introduced into modern microprocessors. o The functioning of the memory management unit is based on the use of address translation buffers and other registers, in which current pointers to all tables used in virtual to physical address translation are stored July – Oct 2021
• 138. CSC 457 Lecture Notes 138  Improving Efficiency of Virtual Address Translation o The MMU checks if the requested page descriptor is in the TLB. If so, the MMU generates the physical address for the main memory. o If the descriptor is missing in the TLB, then the MMU brings the descriptor from the main memory and updates the TLB. o Next, depending on the presence of the page in the main memory, the MMU performs address translation or launches the transmission of the page to the main memory from the auxiliary store. July – Oct 2021
• 139. CSC 457 Lecture Notes 139  Segmentation o A process is divided into segments. The chunks that a program is divided into, which are not necessarily all of the same size, are called segments. Segmentation gives the user’s view of the process, which paging does not. Here the user’s view is mapped to physical memory. o There are two types of segmentation: 1. Virtual memory segmentation – Each process is divided into a number of segments, not all of which are resident at any one point in time. 2. Simple segmentation – Each process is divided into a number of segments, all of which are loaded into memory at run time, though not necessarily contiguously. July – Oct 2021
• 140. CSC 457 Lecture Notes 140  Segmentation o There is no simple relationship between logical addresses and physical addresses in segmentation. A table stores the information about all such segments and is called the Segment Table. o Segment Table – It maps the two-dimensional Logical address into a one-dimensional Physical address. Each table entry has: o Base Address: It contains the starting physical address where the segment resides in memory. o Limit: It specifies the length of the segment. July – Oct 2021
  • 141. CSC 457 Lecture Notes 141  Segmentation July – Oct 2021
  • 142. CSC 457 Lecture Notes 142  Segmentation o Translation of Two dimensional Logical Address to one dimensional Physical Address o Address generated by the CPU is divided into:  Segment number (s): Number of bits required to represent the segment.  Segment offset (d): Number of bits required to represent the size of the segment. July – Oct 2021
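o The two-dimensional to one-dimensional translation reduces to a bounds check plus an addition, as the sketch below shows; the segment table contents are invented example values.

```python
# Illustrative segment-table translation: (segment number, offset) -> physical.
# The segment table contents are invented example values.

segment_table = {
    0: {"base": 1000, "limit": 400},
    1: {"base": 5000, "limit": 1200},
}

def translate(segment, offset):
    entry = segment_table[segment]
    if offset >= entry["limit"]:
        raise MemoryError("segmentation fault: offset exceeds segment limit")
    return entry["base"] + offset

print(translate(1, 100))     # 5100
# translate(0, 500) would raise: offset 500 >= limit 400
```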
  • 143. CSC 457 Lecture Notes 143  Segmentation Advantages of Segmentation 1. No Internal fragmentation 2. Segment table consumes less space in comparison to page table in paging Disadvantage of Segmentation 1. As processes are loaded and removed from the memory, the free memory space is broken into little pieces, causing external fragmentation July – Oct 2021
  • 144. CSC 457 Lecture Notes 144 Next ….... Shared Memory Multiprocessors July – Oct 2021
• 145. CSC 457 Lecture Notes 145  Shared Memory Multiprocessors o A system with multiple CPUs “sharing” the same main memory is called a multiprocessor. o In a multiprocessor system all processes on the various CPUs share a unique logical address space, which is mapped on a physical memory that can be distributed among the processors. o Each process can read and write a data item simply using load and store operations, and process communication is through shared memory. o It is the hardware that makes all CPUs access and use the same main memory. o This architectural model is simple and easy to program; it can be applied to a wide variety of problems that can be modeled as a set of tasks, to be executed in parallel (at least partially) July – Oct 2021
  • 146. CSC 457 Lecture Notes 146  Shared Memory Multiprocessors o Since all CPUs share the address space, only a single instance of the operating system is required. o When a process terminates or goes into a wait state for whichever reason, the OS can look in the process table (more precisely, in the ready processes queue) for another process to be dispatched to the idle CPU. o On the contrary, in systems with no shared memory, each CPU must have its own copy of the operating system, and processes can only communicate through message passing. o The basic issue in shared memory multiprocessor systems is memory itself, since the larger the number of processors involved, the more difficult to work on memory efficiently. July – Oct 2021
  • 147. CSC 457 Lecture Notes 147  Shared Memory Multiprocessors o All modern OS (Windows, Solaris, Linux, MacOS) support symmetric multiprocessing, (SMP), with a scheduler running on every processor (a simplified description, of course). o “ready to run” processes can be inserted into a single queue, that can be accessed by every scheduler, alternatively there can be a “ready to run” queue for each processor. o When a scheduler is activated in a processor, it chooses one of the “ready to run” processes and dispatches it on its processor (with a single queue, things are somewhat more difficult, can you guess why?) July – Oct 2021
  • 148. CSC 457 Lecture Notes 148  Shared Memory Multiprocessors o A distinct feature in multiprocessor systems is load balancing. o It is useless having many CPUs in a system, if processes are not distributed evenly among the cores. o With a single “ready-to-run” queue, load balancing is usually automatic: if a processor is idle, its scheduler will pick a process from the shared queue and will start it on that processor. o Modern OSs designed for SMP often have a separate queue for each processor (to avoid the problems associated with a single queue). o There is an explicit mechanism for load balancing, by which a process on the wait list of an overloaded processor is moved to the queue of another, less loaded processor.  As an example, SMP Linux activates its load balancing scheme every 200 ms, and whenever a processor queue empties. July – Oct 2021
• 149. CSC 457 Lecture Notes 149  Shared Memory Multiprocessors o Migrating a process to a different processor can be costly when each core has a private cache (can you guess why?). o This is why some OSs, such as Linux, offer a system call to specify that a process is tied to a particular processor (processor affinity), independently of the processors' load. o There are three classes of multiprocessors, according to the way each CPU sees main memory: - Uniform Memory Access (UMA), - Non Uniform Memory Access (NUMA) - Cache Only Memory Access (COMA) July – Oct 2021
• 150. CSC 457 Lecture Notes 150  Shared Memory Multiprocessors 1. Uniform Memory Access (UMA): o The name of this type of architecture hints at the fact that all processors share a unique centralized primary memory, so each CPU has the same memory access time. o Owing to this architecture, these systems are also called Symmetric Shared-memory Multiprocessors (SMP) o The simplest multiprocessor system has a single bus to which at least two CPUs and a memory (shared among all processors) are connected. o When a CPU wants to access a memory location, it checks if the bus is free, then it sends the request to the memory interface module and waits for the requested data to be available on the bus. July – Oct 2021
• 151. CSC 457 Lecture Notes 151  Shared Memory Multiprocessors 1. Uniform Memory Access (UMA): o Multicore processors are small UMA multiprocessor systems, where the first shared cache (L2 or L3) is actually the communication channel. o Shared memory can quickly become a bottleneck for system performance, since all processors must synchronize on the single bus and memory access. o Larger multiprocessor systems (>32 CPUs) cannot use a single bus to interconnect CPUs to memory modules, because bus contention becomes unmanageable. o The CPU–memory interconnection is instead realized through an interconnection network (in jargon, a “fabric”). o Caches local to each CPU alleviate the problem; furthermore, each processor can be equipped with a private memory to store data of computations that need not be shared by other processors. Traffic to/from shared memory can thus reduce considerably July – Oct 2021
  • 152. CSC 457 Lecture Notes 152  Shared Memory Multiprocessors 2. Non Uniform Memory Access (NUMA): o Single bus UMA systems are limited in the number of processors, and costly hardware is necessary to connect more processors. Current technology prevents building UMA systems with more than 256 processors. o To build larger processors, a compromise is mandatory: not all memory blocks can have the same access time with respect to each CPU. o This is the origin of the name NUMA systems: Non Uniform Memory Access. July – Oct 2021
  • 153. CSC 457 Lecture Notes 153  Shared Memory Multiprocessors 2. Non Uniform Memory Access (NUMA): o These systems have a shared logical address space, but physical memory is distributed among CPUs, so that access time to data depends on data position, in local or in a remote memory (thus the NUMA denomination). These systems are also called Distributed Shared Memory (DSM) architectures July – Oct 2021
  • 154. CSC 457 Lecture Notes 154  Shared Memory Multiprocessors 2. Non Uniform Memory Access (NUMA): o Since all NUMA systems have a single logical address space shared by all CPUs, while physical memory is distributed among processors, there are two types of memories: local and remote memory. o Yet, even remote memory is accessed by each CPU with LOAD and STORE instructions. o There are two types of NUMA systems: • Non-Caching NUMA (NC-NUMA) • Cache-Coherent NUMA (CC-NUMA) July – Oct 2021
• 155. CSC 457 Lecture Notes 155  Shared Memory Multiprocessors Non Caching -NUMA o In a NC-NUMA system, processors have no local cache. o Each memory access is managed with a modified MMU, which controls if the request is for a local or for a remote block; in the latter case, the request is forwarded to the node containing the requested data. o Obviously, programs using remote data (with respect to the CPU requesting them) will run much slower than they would if the data were stored in the local memory o In NC-NUMA systems there is no cache coherency problem, because there is no caching at all: each memory item is in a single location. o Remote memory access is however very inefficient. For this reason, NC-NUMA systems can resort to special software that relocates memory pages from one block to another, just to maximize performance. o A page scanner daemon activates every few seconds, examines statistics on memory usage, and moves pages from one block to another, to increase performance. July – Oct 2021
• 156. CSC 457 Lecture Notes 156  Shared Memory Multiprocessors Non Caching -NUMA o Actually, in NC-NUMA systems, each processor can also have a private memory and a cache, and only private data (data allocated in the private local memory) can be in the cache. o This solution increases the performance of each processor, and is adopted in the Cray T3D/E. o Yet, remote data access time remains very high: 400 processor clock cycles in the Cray T3D/E, against 2 for retrieving data from the local cache. July – Oct 2021