July – Oct 2021 CSC 457 Lecture Notes 1
(Knowledge for development)
KIBABII UNIVERSITY(KIBU)
SCHOOL OF COMPUTING AND INFORMATICS
CSC 457E: ADVANCED MICROPROCESSOR ARCHITECTURE
COURSE OUTLINE
TIME: Tuesdays 11am – 1PM Room: ABB302
Lecturer: Eric Sifuna, BSc EEE ; MSc IS; R.Eng; MIEEE
Cellphone: 0707327418 Email: sifunaes@gmail.com
July – Oct 2021 CSC 457 Lecture Notes 2
Aim/Purpose:
 The purpose of this course is to teach students the
fundamentals of microprocessor and microcontroller
systems.
July – Oct 2021 CSC 457 Lecture Notes 3
 Learning outcomes:
At the end of this course, successful students should be able to:
 Describe how the hardware and software components of a microprocessor-based
system work together to implement system-level features
 Integrate both hardware and software aspects of digital devices (such as memory
and I/O interfaces) into microprocessor-based systems
 Gain hands-on experience with common microprocessor peripherals such as
UARTs, timers, and analog-to-digital and digital-to-analog converters
 Get practical experience in applied digital logic design and assembly-language
programming
 Use the tools and techniques used by practicing engineers to design, implement,
and debug microprocessor-based systems
July – Oct 2021 CSC 457 Lecture Notes 4
 High Performance microprocessor design:
 Computational Models
 An Argument for Parallel Architectures
 Internetworking Performance Issues and Scalability of Parallel Architectures
 Performance Evaluation:
 Performance Modelling Methods
 Pipeline Freeze Strategies, Prediction Strategies, Composite Strategies,
Benchmark Performance
 Pipelined processors and superpipeline concepts, Solutions to pipeline
hazards (e.g. branch prediction and delayed branching).
Course Topics
July – Oct 2021 CSC 457 Lecture Notes 5
 Memory and I/O systems:
 Cache Memory, Cache addressing, Multilevel caches, Virtual Memory,
Paged, Segmented, and Paged-Segmented Organizations;
 Address Translation:
 Direct Page Table Translation, Inverted Page Table, Translation Lookaside Buffer,
Virtual Memory Accessing rules, Shared Memory Multiprocessors,
Partitioning, Scheduling, Communication and Synchronization, Memory
Coherency.
 Superscalar Processor Design:
 Superscalar Concepts, Execution Model, Exception Recovery
 Register DataFlow, Out-of-Order Issue and Basic Software Scheduling.
Course Topics (cont..)
July – Oct 2021 CSC 457 Lecture Notes 6
 Instruction Level Parallelism Exploration:
 VLIW, simultaneous multithreading, processor coupling.
 Advanced Speculation Techniques:
 Speculation Techniques for Improving Load Related Instruction
Scheduling
 Performance Analysis for OpenMP Applications
 Fine-Grain Distributed Shared Memory on Clusters
 Future Processor Architectures:
 MAJC, Raw Network Computing, Quantum Computing
Course Topics (cont..)
July – Oct 2021 CSC 457 Lecture Notes 7
Delivery: Blended Learning, small group discussion, case
studies, individual projects and tutorials
Instructional Material and/or Equipment: Computers,
Learning Management System, writing boards, writing materials,
projectors etc.
Recommended Core Reading:
1. Microprocessors and Programmed Logic, Short, K., Prentice Hall.
Other references
1. Texts, Audio and video cassettes, computer software, other
resources
CSC 457 Lecture Notes 8
 High Performance microprocessor design:
 Computational Models
 An Argument for Parallel Architectures
 Internetworking Performance Issues and Scalability of
Parallel Architectures
In this introduction …
July – Oct 2021
CSC 457 Lecture Notes 9
1. Computational Models for High Performance
Microprocessors
July – Oct 2021
CSC 457 Lecture Notes 10
 Computational Models
o High-performance RISC-based microprocessors have defined
much of the recent history of high-performance computing
o A Complex Instruction Set Computer (CISC)
instruction set is made up of powerful primitives,
close in functionality to the primitives of high-level
languages
o “If RISC is faster, why did people bother with
CISC designs in the first place?”
 RISC wasn’t always both feasible and affordable
July – Oct 2021
CSC 457 Lecture Notes 11
Computational Models
o High-level language compilers were commonly available, but
they didn’t generate the fastest code, and they weren’t terribly
thrifty with memory.
o When programming, you needed to save both space and time.
A good instruction set was both easy to use and powerful
o Computers had very little storage by today's standards. An
instruction that could roll all the steps of a complex operation,
such as a do-loop, into a single opcode was a plus, because
memory was precious.
o Complex instructions saved time, too. Almost every large computer
following the IBM 704 had a memory system that was slower than its
central processing unit (CPU). When a single instruction can perform
several operations, the overall number of instructions retrieved from
memory can be reduced.
July – Oct 2021
CSC 457 Lecture Notes 12
Computational Models
o There were several obvious pressures that
affected the development of RISC:
- The number of transistors that could fit on a single chip was
increasing. It was clear that one would eventually be able to fit all
the components from a processor board onto a single chip.
- Techniques such as pipelining were being explored to improve
performance. Variable-length instructions and variable-length
instruction execution times (due to varying numbers of microcode
steps) made implementing pipelines more difficult.
- As compilers improved, they found that well-optimized sequences
of streamlined instructions often outperformed the equivalent
complicated multi-cycle instructions.
July – Oct 2021
CSC 457 Lecture Notes 13
Computational Models
o The RISC designers sought to create a high performance
single-chip processor with a fast clock rate.
o When a CPU can fit on a single chip, its cost is decreased,
its reliability is increased, and its clock speed can be
increased.
o While not all RISC processors are single-chip
implementations, most use a single chip.
o To accomplish this task, it was necessary to discard the
existing CISC instruction sets and develop a new minimal
instruction set that could fit on a single chip. Hence the
term Reduced Instruction Set Computer.
July – Oct 2021
CSC 457 Lecture Notes 14
Computational Models
o The earliest RISC processors had no floating-point support in
hardware, and some did not even support integer multiply in
hardware. However, these instructions could be implemented using
software routines that combined other instructions (a microcode of
sorts).
o These earliest RISC processors (most severely reduced) were not
overwhelming successes for four reasons:
 It took time for compilers, operating systems, and user software to be retuned to
take advantage of the new processors.
 If an application depended on the performance of one of the software-implemented
instructions, its performance suffered dramatically.
 Because RISC instructions were simpler, more instructions were needed to
accomplish the task.
 Because all the RISC instructions were 32 bits long, and commonly used CISC
instructions were as short as 8 bits, RISC program executables were often larger.
July – Oct 2021
CSC 457 Lecture Notes 15
Computational Models
o As a result of these last two issues, a RISC program may have to fetch
more memory for its instructions than a CISC program. This increased
appetite for instructions actually clogged the memory bottleneck until
sufficient caches were added to the RISC processors.
o RISC processors quickly became known for their affordable high-speed
floating-point capability compared to CISC processors. This excellent
performance on scientific and engineering applications effectively
created a new type of computer system, the workstation.
July – Oct 2021
CSC 457 Lecture Notes 16
 Parallel Architectures
o Concurrency and parallelism are related concepts, but they are distinct.
Concurrent programming happens when several computations are happening
in overlapping time periods. Your laptop, for example, seems like it is doing a
lot of things at the same time even though there are only 1, 2, or 4 cores. So,
we have concurrency without parallelism.
o At the other end of the spectrum, the CPU in your laptop is carrying out pieces
of the same computation in parallel to speed up the execution of the
instruction stream.
July – Oct 2021
CSC 457 Lecture Notes 17
 Parallel Architectures
o Parallel computing occupies a unique spot in the universe
of distributed systems.
o Parallel computing is centralized—all of the processes are typically under
the control of a single entity.
o Parallel computing is usually hierarchical—parallel architectures are
frequently described as grids, trees, or pipelines.
o Parallel computing is co-located—for efficiency, parallel processes are
typically located very close to each other, often in the same chassis or at
least the same data center.
o These choices are driven by the problem space and the
need for high performance.
July – Oct 2021
CSC 457 Lecture Notes 18
 Parallel Architectures
o Definition of a parallel computer: A set of independent
processors that can work cooperatively to solve a problem
o A parallel system consists of an algorithm and the parallel
architecture on which the algorithm is implemented.
o Note that an algorithm may have different performance on
different parallel architectures.
o For example, an algorithm may perform differently on a
linear array of processors and on a hypercube of
processors
July – Oct 2021
CSC 457 Lecture Notes 19
 Parallel Architectures
o Why Use Parallel Computing?
 Single processor speeds are reaching their ultimate limits
 Multi-core processors and multiple processors are the most
promising paths to performance improvements
o Concurrency: The property of a parallel algorithm that a number of
operations can be performed by separate processors at the same time.
Concurrency is the key concept in the design of parallel algorithms:
 Requires a different way of looking at the strategy to solve a
problem
 May require a very different approach from a serial program to
achieve high efficiency
July – Oct 2021
CSC 457 Lecture Notes 20
 Parallel Architectures
July – Oct 2021
CSC 457 Lecture Notes 21
 Parallel Architectures
o Protein folding problems involve a large number of independent
calculations that do not depend on data from other calculations
o Concurrent calculations with no dependence on the data from
other calculations are termed Embarrassingly Parallel
o These embarrassingly parallel problems are ideal for solution by
HPC methods, and can realize nearly ideal concurrency and
scalability
o Flexibility in the way a problem is solved is beneficial to finding
a parallel algorithm that yields a good parallel scaling.
o Often, one has to employ substantial creativity in the way a
parallel algorithm is implemented to achieve good scalability.
July – Oct 2021
CSC 457 Lecture Notes 22
 Parallel Architectures
o Understand the Dependencies
o One must understand all aspects of the problem to be solved, in
particular the possible dependencies of the data.
o It is important to understand fully all parts of a serial code that
you wish to parallelize. Example: pressure forces (local) vs.
gravitational forces (global)
o When designing a parallel algorithm, always remember:
 Computation is FAST
 Communication is SLOW
 Input/Output (I/O) is INCREDIBLY SLOW
o In addition to concurrency and scalability, there are a number of
other important factors in the design of parallel algorithms:
Locality; Granularity; Modularity; Flexibility; Load balancing
July – Oct 2021
CSC 457 Lecture Notes 23
 Parallel Architectures
Parallel Computer Architectures
o Virtually all computers follow the basic design of the von
Neumann architecture, as follows:
 Memory stores both instructions and data
 Control unit fetches instructions from memory, decodes
instructions, and then sequentially performs operations to
perform programmed task
 Arithmetic Unit performs mathematical operations
 Input/Output is the interface to the user
July – Oct 2021
CSC 457 Lecture Notes 24
 Parallel Architectures
Flynn’s Taxonomy
o SISD: This is a standard serial computer: one set of instructions, one
data stream
o SIMD: All units execute same instructions on different data streams
(vector)
- Useful for specialized problems, such as graphics/image processing
- Old Vector Supercomputers worked this way, as do modern GPUs
o MISD: Single data stream operated on by different sets of instructions,
not generally used for parallel computers
o MIMD: Most common parallel computer, each processor can execute
different instructions on different data streams
-Often constructed of many SIMD subcomponents
July – Oct 2021
CSC 457 Lecture Notes 26
 Parallel Architectures
Parallel Computer Memory Architectures
o Shared Memory – memory shared among various CPUs
o Distributed Memory - each CPU has its own memory
o Hybrid Distributed Shared Memory
July – Oct 2021
CSC 457 Lecture Notes 27
 Parallel Architectures
Relation to Parallel Programming Models
o OpenMP: Multi-threaded calculations occur within shared-
memory components of systems, with different threads working
on the same data (a short OpenMP sketch in C follows this list).
o MPI: Based on a distributed-memory model; data associated with
another processor must be communicated over the network
connection.
o GPUs: Graphics Processing Units (GPUs) incorporate many
(hundreds of) computing cores with a single control unit, so this
is a shared-memory model.
o Processors vs. Cores: most modern processors contain several
cores on one chip; each core can execute its own instruction
stream, so a single multicore chip behaves like a small
shared-memory parallel computer.
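To make the OpenMP bullet above concrete, here is a minimal sketch in C (assuming a compiler with OpenMP support, e.g. gcc -fopenmp); the array, its size N, and the computation are illustrative only, not part of the course material.

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N];
    double sum = 0.0;

    /* All threads share the array a[]; the loop iterations are divided
       among them, matching the shared-memory model described above. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = (double)i * 0.5;   /* each thread works on its own chunk  */
        sum += a[i];              /* partial sums combined by reduction  */
    }

    printf("threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}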
July – Oct 2021
CSC 457 Lecture Notes 28
 Parallel Architectures
Embarrassingly Parallel
o Refers to an approach that involves solving many similar but
independent tasks simultaneously
o Little to no coordination (and thus no communication) between
tasks
o Each task can be a simple serial program
o This is the “easiest” type of problem to implement in a parallel
manner. Essentially requires automatically coordinating many
independent calculations and possibly collating the results.
o Examples: Computer Graphics and Image Processing; Protein
Folding Calculations in Biology; Geographic Land Management
Simulations in Geography; Data Mining in numerous fields;
Event simulation and reconstruction in Particle Physics
July – Oct 2021
CSC 457 Lecture Notes 29
 Internetworking Performance Issues
and Scalability of Parallel Architectures
o Performance Limitations of Parallel Architectures
o Adding additional resources doesn’t necessarily speed up a
computation. There’s a limit defined by Amdahl’s Law.
o The basic idea of Amdahl’s law is that a parallel computation’s
maximum performance gain is limited by the portion of the
computation that has to happen serially, which creates a
bottleneck.
o The serial portion includes scheduling, resource allocation,
communication, synchronization, etc.
o For example, if a computation that takes 20 hours on a single CPU has a serial
portion that takes 1 hour (5%), then Amdahl’s law shows that no matter how
many processors you put on the task, the maximum speed up is 20x.
Consequently, after a point, putting additional processors on the job is just
wasted resource.
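A small sketch of the calculation above, using the standard form of Amdahl's law, S(N) = 1 / ((1 - p) + p/N), where p is the parallelizable fraction and N the number of processors. The 20-hour/1-hour figures come from the example above; everything else is illustrative.

#include <stdio.h>

/* Amdahl's law: speedup with N processors when a fraction p of the
   work can be parallelized. */
static double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    double serial_hours = 1.0, total_hours = 20.0;
    double p = (total_hours - serial_hours) / total_hours;   /* 0.95 */

    for (int n = 1; n <= 4096; n *= 4)
        printf("N = %4d  speedup = %.2f\n", n, amdahl_speedup(p, n));

    /* As N grows, the speedup approaches 1/(1-p) = 20x, so adding
       processors beyond a point is just wasted resource. */
    printf("limit = %.1fx\n", 1.0 / (1.0 - p));
    return 0;
}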
July – Oct 2021
CSC 457 Lecture Notes 30
 Internetworking Performance Issues
and Scalability of Parallel Architectures
July – Oct 2021
CSC 457 Lecture Notes 31
 Internetworking Performance Issues
and Scalability of Parallel Architectures
Process Interaction
o Except for embarrassingly parallel algorithms, the threads in a
parallel computation need to communicate with each other.
There are two ways they can do this;
o Shared memory – the processes can share a storage location
that they use for communicating. Shared memory can also be
used to synchronize threads, by using the shared location as a
semaphore (a minimal pthreads sketch follows this list).
o Messaging – the processes communicate via messages. This
could be over a network or a special-purpose bus. Networks for
this use are typically hyper-local and designed for this purpose.
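A minimal sketch of the shared-memory style of interaction described above, assuming POSIX threads (compile with -lpthread); the shared counter and the thread count are arbitrary illustrations, with a mutex standing in for the semaphore mentioned above.

#include <pthread.h>
#include <stdio.h>

static long shared_counter = 0;                   /* shared storage location */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);                /* synchronize access      */
        shared_counter++;                         /* communicate via memory  */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("final counter = %ld\n", shared_counter);   /* expect 200000 */
    return 0;
}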
July – Oct 2021
CSC 457 Lecture Notes 32
 Internetworking Performance Issues
and Scalability of Parallel Architectures
Consistent Execution
o The threads of execution for most parallel algorithms must be coupled
to achieve consistent execution.
o Parallel threads of execution communicate to transfer values between
processes. Parallel algorithms communicate not only to calculate the result,
but to achieve deterministic execution.
o For any given set of inputs, the parallel version of an algorithm should return
the same answer each time it is run, as well as the same answer that a
sequential version of the algorithm would return.
o Parallel algorithms achieve this by locking memory or otherwise sequencing
operations between threads. This communication, together with the
waiting required for sequencing, imposes a performance overhead.
o As we saw in our discussion of Amdahl’s Law, these sequential portions of a
parallel algorithm are the limiting factor in speeding up execution.
July – Oct 2021
CSC 457 Lecture Notes 33
 Internetworking Performance Issues
and Scalability of Parallel Architectures
o (Read tutorial on Performance and Scalability on Parallel Computing
attached)
July – Oct 2021
CSC 457 Lecture Notes 34
Next …....
Performance Evaluation
July – Oct 2021
CSC 457 Lecture Notes 35
 Performance Modelling
o The goal of performance modeling is to gain
understanding of a computer system’s performance on
various applications, by means of measurement and
analysis, and then to encapsulate these characteristics in a
compact formula.
o The resulting model can be used to gain greater
understanding of the performance phenomena involved
and to project performance to other system/application
combinations
July – Oct 2021
CSC 457 Lecture Notes 36
 Performance Modelling
o The performance profile of a given system/application
combination depends on numerous factors, including:
(1) System size; (2) System architecture; (3) Processor speed;
(4) Multi-level cache latency and bandwidth;
(5) Interprocessor network latency and bandwidth;
(6) System software efficiency; (7) Type of application;
(8) Algorithms used; (9) Programming language used;
(10) Problem size; (11) Amount of I/O.
July – Oct 2021
CSC 457 Lecture Notes 37
 Performance Modelling
o Performance models can be used to improve architecture
design, inform procurement, and guide application tuning
o It has been observed that, due to the difficulty of
developing performance models for new applications, as
well as the increasing complexity of new systems, our
supercomputers have become better at predicting and
explaining natural phenomena (such as the weather) than
at predicting and explaining their own performance or
that of other computers.
July – Oct 2021
CSC 457 Lecture Notes 38
 Performance Modelling
Applications of Performance Modelling
o Performance modeling can be used in numerous ways.
Here is a brief summary of these usages, both present-day
and future possibilities;
1. System design.
o Performance models are frequently employed by computer vendors in
their design of future systems. Typically engineers construct a
performance model for one or two key applications, and then compare
future technology options based on performance model projections.
Once performance modeling techniques are better developed, it may
be possible to target many more applications and technology options
July – Oct 2021
CSC 457 Lecture Notes 39
 Performance Modelling
Applications of Performance Modelling
2. Runtime estimation
o The most common application for a performance model is to enable a
scientist to estimate the runtime of a job when the input parameters
for the job are changed, or when a different number of processors is
used in a parallel computer system.
o One can also estimate the largest size of system that can be used to
run a given problem before the parallel efficiency drops to an
unacceptable level.
3. System tuning
o An example of using performance modeling for system tuning is where
a performance model is used to diagnose and rectify a misconfigured
channel buffer, yielding a doubling of network performance for
programs sending short messages
July – Oct 2021
CSC 457 Lecture Notes 40
 Performance Modelling
Applications of Performance Modelling
4. Application Tuning
o If a memory performance model is combined with application
parameters, one can predict how cache hit rates would change if a
different cache blocking factor were used in the application (a
loop-blocking sketch in C follows this list).
o Once the optimal cache blocking factor has been identified, the code
can be permanently changed.
o Simple performance models can even be incorporated into an
application code, permitting on-the-fly selection of different program
options.
o Performance models, by providing performance expectations based on
the fundamental computational characteristics of algorithms, can also
enable algorithmic choice before going to the trouble to implement all
the possible choices
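A hedged sketch of the cache-blocking (loop tiling) idea referred to above: a matrix transpose written with a tunable blocking factor B. The matrix size N and the value of B are illustrative; B is exactly the kind of parameter a memory performance model would help choose.

#include <stdio.h>

#define N 1024
#define B 64                       /* cache blocking factor (tunable) */

static double a[N][N], t[N][N];

/* Blocked (tiled) transpose: each BxB tile is small enough to stay in
   cache, raising the hit rate compared with a plain row/column sweep. */
static void transpose_blocked(void)
{
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    t[j][i] = a[i][j];
}

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = i + j * 0.001;

    transpose_blocked();
    printf("t[1][2] = %f\n", t[1][2]);   /* equals a[2][1] */
    return 0;
}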
July – Oct 2021
CSC 457 Lecture Notes 41
 Pipeline Freeze Strategies
July – Oct 2021
CSC 457 Lecture Notes 42
 Pipeline Freeze Strategies
July – Oct 2021
CSC 457 Lecture Notes 43
 Branch prediction Strategies
o In a highly parallel system, conditional instructions break the
continuous flow of the program and decrease the performance of the
pipelined processor by introducing delays.
o To decrease this delay, prediction of the branch direction is necessary.
The variety of branch behaviour demands accurate branch prediction
strategies, so branch prediction is a vital part of present-day pipelined
processors.
o Branch prediction is the process of making an educated guess as to
whether a branch will be taken or not taken, based on a preset
algorithm.
o A branch is a category of instruction that causes the code to move to
another block to continue execution. Branch prediction can be either
static or dynamic
July – Oct 2021
CSC 457 Lecture Notes 44
 Prediction Strategies
o Static branch prediction means that a given branch will always be
predicted as taken or not taken without possibility of change
throughout the duration of the program.
o Dynamic branch prediction means that the predicted outcome of a
branch is dependent on an algorithm, and the prediction may change
throughout the course of the program.
o Code is able to use a combination of both static and dynamic branch
predictors based on the type of branch.
o The improvement from branch prediction depends on the number of
branches in the code, as well as on the type of prediction being used,
since different prediction methods have different rates of success.
o Overall, branch prediction provides an increase in performance for
code containing branches. The improvement comes from the
computational cycles that can be spent on useful work rather than
wasted, as they would be in a system that does not use branch prediction.
July – Oct 2021
CSC 457 Lecture Notes 45
 Prediction Strategies
o There are three different kinds of branches: forward conditional,
backward conditional, and unconditional branches.
o Forward conditional branches are when a branch evaluates to a target
that is somewhere forward in the instruction stream.
o Backward conditional branches are when a branch evaluates to a
target that is somewhere backwards in the instruction stream.
Common instances of backward conditional branches are loops.
o Unconditional branches are branches which will always occur.
July – Oct 2021
CSC 457 Lecture Notes 46
 Prediction Strategies
o A static or dynamic prediction strategy will determine which different
algorithms or methods are available for use.
o For static branch prediction, the strategy may either be predict taken, predict not
taken, or some combination that depends on the branch type, such as backward branch
predict taken, forward branch predict not taken. The third strategy is advantageous
for programs with loops because it will have a higher percentage of correctly
predicted branches for backward branches.
o Dynamic branch prediction is able to use one-level prediction, two-
level adaptive prediction, or a tournament predictor. One-level
prediction uses a counter associated with a specific branch, using that
branch’s history to predict its future outcomes.
o The address of the branch is used as an index into a table where these counters are
stored. In effect, the counter is incremented each time the branch is actually taken and
decremented each time it is not taken, so its value summarizes the branch’s recent
behaviour (a 2-bit saturating-counter sketch in C follows below).
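A minimal sketch of the one-level scheme just described, assuming a small table of 2-bit saturating counters indexed by the low bits of the branch address; the table size, address, and branch pattern are illustrative only.

#include <stdio.h>
#include <stdint.h>

#define TABLE_SIZE 1024                  /* number of 2-bit counters */

static uint8_t counters[TABLE_SIZE];     /* 0,1 = predict not taken; 2,3 = predict taken */

static int predict(uint32_t branch_addr)
{
    return counters[branch_addr % TABLE_SIZE] >= 2;    /* 1 = taken */
}

static void update(uint32_t branch_addr, int actually_taken)
{
    uint8_t *c = &counters[branch_addr % TABLE_SIZE];
    if (actually_taken && *c < 3) (*c)++;        /* saturate at 3 */
    else if (!actually_taken && *c > 0) (*c)--;  /* saturate at 0 */
}

int main(void)
{
    /* A loop branch at one fixed address, taken 9 times then not taken. */
    uint32_t addr = 0x400123;
    int correct = 0;
    for (int i = 0; i < 10; i++) {
        int taken = (i < 9);
        correct += (predict(addr) == taken);
        update(addr, taken);
    }
    printf("correct predictions: %d / 10\n", correct);
    return 0;
}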
July – Oct 2021
CSC 457 Lecture Notes 47
 Prediction Strategies
o The two-level adaptive branch prediction is very similar to the one-level
branch prediction strategy. The two-level strategy uses the same counter
concept as the one-level, except the two-level implements this counter while
taking input from other branches. This strategy may also be used to predict
the direction of the branch based on the direction and outcomes of other
branches in the program. This strategy is also called a global history counter.
o Hybrid or tournament prediction strategies use a combination of two or more
other prediction strategies. For example, any static prediction used in
conjunction with a dynamic prediction strategy would be considered a hybrid
strategy.
o All of the strategies listed here are used in practice. The two-bit counter
presented in the one-level branch prediction strategy is used in a number of
other branch prediction strategies, including a predictor for choosing which
predictor to use.
o One disadvantage to each of these strategies is that their level of
improvement for a given code will vary depending on what is written into the
code
July – Oct 2021
CSC 457 Lecture Notes 48
 Composite Strategies
July – Oct 2021
(Blank!!)
CSC 457 Lecture Notes 49
 Benchmark Performance
(Blank!!)
July – Oct 2021
CSC 457 Lecture Notes 50
 Pipeline Processor Concepts
o High performance is an important issue in microprocessor
design, and its importance has been increasing over the
years.
o To improve performance, two alternative methods exist:
(a) improve the hardware by providing faster circuits;
(b) arrange the hardware so that multiple operations can be
performed simultaneously.
o Pipelining is a way of arranging the hardware elements of
the CPU so that its overall performance is increased:
simultaneous execution of more than one instruction takes
place in a pipelined processor
July – Oct 2021
CSC 457 Lecture Notes 51
 Pipeline Processor Concepts
o A pipeline processor is composed of a sequential, linear list
of segments, where each segment performs one computational task or
group of tasks.
o There are three things that one must observe about the pipeline.
1. First, the work (in a computer, the ISA) is divided up into pieces that
more or less fit into the segments allotted for them.
2. Second, this implies that in order for the pipeline to work efficiently
and smoothly, the work partitions must each take about the same time
to complete. Otherwise, the longest partition requiring time T would
hold up the pipeline, and every segment would have to take time T to
complete its work. For fast segments, this would mean much idle time.
3. Third, in order for the pipeline to work smoothly, there must be few (if
any) exceptions or hazards that cause errors or delays within the
pipeline. Otherwise, the instruction will have to be reloaded and the
pipeline restarted with the same instruction that causes the exception.
July – Oct 2021
CSC 457 Lecture Notes 52
 Pipeline Processor Concepts
o Work Partitioning: A multicycle datapath is based on the
assumption that the computational work associated with the
execution of an instruction can be partitioned into a five-
step process: instruction fetch (IF), instruction decode and
register fetch (ID), execution (EX), memory access (MEM),
and write-back (WB).
July – Oct 2021
CSC 457 Lecture Notes 53
 Pipeline Processor Concepts
o Pipelining is one way of improving the overall processing performance of
a processor. This architectural approach allows the simultaneous
execution of several instructions.
o Pipelining is transparent to the programmer; it exploits parallelism at the
instruction level by overlapping the execution process of instructions.
o It is analogous to an assembly line where workers perform a specific
task and pass the partially completed product to the next worker
o The pipeline design technique decomposes a sequential process into
several subprocesses, called stages or segments. A stage performs a
particular function and produces an intermediate result.
o It consists of an input latch, also called a register or buffer, followed by
a processing circuit. (A processing circuit can be a combinational or
sequential circuit.)
July – Oct 2021
CSC 457 Lecture Notes 54
 Pipeline Processor Concepts
o At each clock pulse, every stage transfers its intermediate result to the
input latch of the next stage. In this way, the final result is produced
after the input data have passed through the entire pipeline, completing
one stage per clock pulse.
o The period of the clock pulse should be large enough to provide
sufficient time for a signal to traverse through the slowest stage, which
is called the bottleneck (i.e., the stage needing the longest amount of
time to complete).
o In addition, there should be enough time for a latch to store its input
signals.
o If the clock's period, P, is expressed as P = tb + tl, then tb should be
at least as large as the maximum delay of the bottleneck stage, and tl
should be sufficient for storing data into a latch
July – Oct 2021
CSC 457 Lecture Notes 55
 Pipeline Processor Concepts
Completion Time for pipelined processor
o The ability to overlap stages of a sequential process for different input
tasks (data or operations) results in an overall theoretical completion
time of Tpipe = m*P + (n-1)*P, where n is the number of input tasks, m is
the number of stages in the pipeline, and P is the clock period
o The term m*P is the time required for the first input task to get through
the pipeline, and the term (n-1)*P is the time required for the remaining
tasks.
o After the pipeline has been filled, it generates an output on each clock
cycle. In other words, after the pipeline is loaded, it will generate output
only as fast as its slowest stage.
o Even with this limitation, the pipeline will greatly outperform nonpipelined techniques,
which require each task to complete before another task’s execution sequence begins.
To be more specific, when n is large, a pipelined processor can produce output
approximately m times faster than a nonpipelined processor.
July – Oct 2021
CSC 457 Lecture Notes 56
 Pipeline Processor Concepts
July – Oct 2021
CSC 457 Lecture Notes 57
 Pipeline Processor Concepts
Pipeline Performance Measures
1. Speedup
o Now, speedup (S) may be represented as:
S = Tseq / Tpipe = n*m / (m+n -1)
The value S approaches m when n → ∞. That is, the maximum
speedup, also called ideal speedup, of a pipeline processor
with m stages over an equivalent nonpipelined processor is m.
In other words, the ideal speedup is equal to the number of
pipeline stages. That is, when n is very large, a pipelined
processor can produce output approximately m times faster
than a nonpipelined processor. When n is small, the speedup
decreases; in fact, for n=1 the pipeline has the minimum
speedup of 1.
July – Oct 2021
CSC 457 Lecture Notes 58
 Pipeline Processor Concepts
Pipeline Performance Measures
2. Efficiency
o The efficiency E of a pipeline with m stages is defined as:
E = S/m = [n*m / (m+n -1)] / m = n / (m+n -1).
The efficiency E, which represents the speedup per stage,
approaches its maximum value of 1 when n → ∞. When n = 1, E will
have the value 1/m, which is the lowest obtainable value.
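The formulas above can be checked with a short calculation. The values m = 5 stages, n = 4 instructions, and P = 10 ns anticipate the worked instruction-pipeline example a few slides further on; nothing else is assumed.

#include <stdio.h>

int main(void)
{
    double m = 5, n = 4, P = 10;             /* stages, tasks, clock period (ns) */

    double t_pipe = m * P + (n - 1) * P;     /* pipelined completion time        */
    double t_seq  = n * m * P;               /* nonpipelined completion time     */
    double S = t_seq / t_pipe;               /* speedup    = n*m / (m+n-1)       */
    double E = S / m;                        /* efficiency = n / (m+n-1)         */

    printf("Tpipe = %.0f ns, Tseq = %.0f ns\n", t_pipe, t_seq);
    printf("speedup S = %.2f (ideal %g), efficiency E = %.2f\n", S, m, E);
    return 0;
}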
July – Oct 2021
CSC 457 Lecture Notes 59
 Pipeline Processor Concepts
July – Oct 2021
CSC 457 Lecture Notes 60
 Pipeline Processor Concepts
Pipeline Types
o Pipelines are usually divided into two classes: instruction pipelines and arithmetic
pipelines. A pipeline in each of these classes can be designed in two ways: static
or dynamic.
o A static pipeline can perform only one operation (such as addition or
multiplication) at a time. The operation of a static pipeline can only be changed
after the pipeline has been drained. (A pipeline is said to be drained when the
last input data leave the pipeline.) For example, consider a static pipeline that is
able to perform addition and multiplication. Each time that the pipeline switches
from a multiplication operation to an addition operation, it must be drained and
set for the new operation.
o The performance of static pipelines is severely degraded when the operations
change often, since this requires the pipeline to be drained and refilled each
time.
o A dynamic pipeline can perform more than one operation at a time. To perform
a particular operation on an input data, the data must go through a certain
sequence of stages. In dynamic pipelines the mechanism that controls when data should be fed to the pipeline is much
more complex than in static pipelines
July – Oct 2021
CSC 457 Lecture Notes 61
 Pipeline Processor Concepts
Instruction Pipeline
o An instruction pipeline increases the performance of a processor by
overlapping the processing of several different instructions. An
instruction pipeline often consists of five stages, as follows:
1. Instruction fetch (IF). Retrieval of instructions from cache (or main memory).
2. Instruction decoding (ID). Identification of the operation to be performed.
3. Operand fetch (OF). Decoding and retrieval of any required operands.
4. Execution (EX). Performing the operation on the operands.
5. Write-back (WB). Updating the destination operands.
An instruction pipeline overlaps the process of the preceding stages for
different instructions to achieve a much lower total completion time,
on average, for a series of instructions.
July – Oct 2021
CSC 457 Lecture Notes 62
 Pipeline Processor Concepts
Instruction Pipeline
o During the first cycle, or clock pulse, instruction i1 is fetched from
memory. Within the second cycle, instruction i1 is decoded while
instruction i2 is fetched. This process continues until all the
instructions are executed. The last instruction finishes the write-
back stage after the eighth clock cycle.
o Therefore, it takes 80 nanoseconds (ns) to complete execution of all
the four instructions when assuming the clock period to be 10 ns.
The total completion time is,
Tpipe = m*P+(n-1)*P
=5*10+(4-1)*10=80 ns.
Note that in a nonpipelined design the completion time will be much
higher.
July – Oct 2021
CSC 457 Lecture Notes 63
 Pipeline Processor Concepts
Instruction Pipeline
o Note that in a nonpipelined design the completion time will be much
higher.
Tseq = n*m*P = 4*5*10 = 200 ns
o It is worth noting that a pipeline simply takes advantage of these
naturally occurring stages to improve processing efficiency.
o Henry Ford made the same connection when he realized that all cars
were built in stages and invented the assembly line in the early
1900s.
o Even though pipelining speeds up the execution of instructions, it
does pose potential problems. Some of these problems and possible
solutions are discussed next
July – Oct 2021
CSC 457 Lecture Notes 64
 Pipeline Processor Concepts
Improving the Throughput of an Instruction Pipeline
o Three sources of architectural problems may affect the throughput
of an instruction pipeline. They are fetching, bottleneck, and issuing
problems. Some solutions are given for each.
1. The fetching problem
o In general, supplying instructions rapidly through a pipeline is costly
in terms of chip area. Buffering the data to be sent to the pipeline is
one simple way of improving the overall utilization of a pipeline. The
utilization of a pipeline is defined as the percentage of time that the
stages of the pipeline are used over a sufficiently long period of
time. A pipeline is utilized 100% of the time when every stage is
used (utilized) during each clock cycle.
July – Oct 2021
CSC 457 Lecture Notes 65
 Pipeline Processor Concepts
Improving the Throughput of an Instruction Pipeline
1. The fetching problem
o Occasionally, the pipeline has to be drained and refilled, for example, whenever
an interrupt or a branch occurs. The time spent refilling the pipeline can be
minimized by having instructions and data loaded ahead of time into various
geographically close buffers (like on-chip caches) for immediate transfer into the
pipeline. If instructions and data for normal execution can be fetched before they
are needed and stored in buffers, the pipeline will have a continuous source of
information with which to work. Prefetch algorithms are used to make sure
potentially needed instructions are available most of the time. Delays from
memory access conflicts can thereby be reduced if these algorithms are used,
since the time required to transfer data from main memory is far greater than the
time required to transfer data from a buffer.
July – Oct 2021
CSC 457 Lecture Notes 66
 Pipeline Processor Concepts
Improving the Throughput of an Instruction Pipeline
2. The bottleneck problem
o The bottleneck problem relates to the amount of load (work) assigned to a stage
in the pipeline.
o If too much work is applied to one stage, the time taken to complete an operation
at that stage can become unacceptably long.
o This relatively long time spent by the instruction at one stage will inevitably create
a bottleneck in the pipeline system.
o In such a system, it is better to remove the bottleneck that is the source of
congestion. One solution to this problem is to further subdivide the stage.
Another solution is to build multiple copies of this stage into the pipeline.
July – Oct 2021
CSC 457 Lecture Notes 67
 Pipeline Processor Concepts
Improving the Throughput of an Instruction Pipeline
3. The issuing problem
o If an instruction is available, but cannot be executed for some reason, a hazard
exists for that instruction. These hazards create issuing problems; they prevent
issuing an instruction for execution. Three types of hazard are discussed here.
They are called structural hazard, data hazard, and control hazard.
o A structural hazard refers to a situation in which a required resource is not
available (or is busy) for executing an instruction.
o A data hazard refers to a situation in which there exists a data dependency
(operand conflict) with a prior instruction.
o A control hazard refers to a situation in which an instruction, such as branch,
causes a change in the program flow. Each of these hazards is explained next.
July – Oct 2021
CSC 457 Lecture Notes 68
 Pipeline Processor Concepts
1. Structural Hazard
o A structural hazard occurs as a result of resource conflicts between instructions.
One type of structural hazard that may occur is due to the design of execution
units. If an execution unit that requires more than one clock cycle (such as
multiply) is not fully pipelined or is not replicated, then a sequence of instructions
that uses the unit cannot be subsequently (one per clock cycle) issued for
execution. Replicating and/or pipelining execution units increases the number of
instructions that can be issued simultaneously.
o Another type of structural hazard that may occur is due to the design of register
files. If a register file does not have multiple write (read) ports, multiple writes
(reads) to (from) registers cannot be performed simultaneously. For example,
under certain situations the instruction pipeline might want to perform two
register writes in a clock cycle. This may not be possible when the register file has
only one write port. The effect of a structural hazard can be reduced fairly simply
by implementing multiple execution units and using register files with multiple
input/output ports
July – Oct 2021
CSC 457 Lecture Notes 69
 Pipeline Processor Concepts
2. Data Hazard
o In a nonpipelined processor, the instructions are executed one by one, and the execution
of an instruction is completed before the next instruction is started. In this way, the
instructions are executed in the same order as the program. However, this may not be true
in a pipelined processor, where instruction executions are overlapped. An instruction may
be started and completed before the previous instruction is completed. The data hazard,
which is also referred to as the data dependency problem, comes about as a result of
overlapping (or changing the order of) the execution of data-dependent instructions.
o The delaying of execution can be accomplished in two ways. One way is to delay the OF or
IF stages of i2 for two clock cycles. To insert a delay, an extra hardware component called a
pipeline interlock can be added to the pipeline. A pipeline interlock detects the
dependency and delays the dependent instructions until the conflict is resolved. Another
way is to let the compiler solve the dependency problem. During compilation, the compiler
detects the dependency between data and instructions. It then rearranges these
instructions so that the dependency is not hazardous to the system. If it is not possible to
rearrange the instructions, NOP (no operation) instructions are inserted to create delays.
July – Oct 2021
CSC 457 Lecture Notes 70
 Pipeline Processor Concepts
o There are three primary types of data hazards: RAW (read after write),
WAR (write after read), and WAW (write after write). The hazard names
denote the execution ordering of the instructions that must be maintained
to produce a valid result; otherwise, an invalid result might occur (the
three cases are illustrated in the sketch after this list).
o RAW: refers to the situation in which i2 reads a data source before i1 writes to it.
This may produce an invalid result, since the read must be performed after the write
in order to obtain a valid result.
o WAR: This refers to the situation in which i2 writes to a location before i1 reads it.
o WAW: This refers to the situation in which i2 writes to a location before i1 writes to
it.
o Note that the WAR and WAW types of hazards cannot happen when the order of
completion of instructions execution in the program is preserved. However, one way
to enhance the architecture of an instruction pipeline is to increase concurrent
execution of the instructions by dispatching several independent instructions to
different functional units, such as adders/subtractors, multipliers, and dividers. That
is, the instructions can be executed out of order, and so their execution may be
completed out of order too.
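A register-level illustration of the three hazard types, written as C statements standing in for single machine instructions on hypothetical registers r1–r4; the values are arbitrary.

#include <stdio.h>

/* Treat each statement as one machine instruction; i1 precedes i2. */
int main(void)
{
    int r1, r2 = 5, r3 = 7, r4;

    /* RAW: i2 reads r1, which i1 writes.  If i2's read overtakes i1's
       write in the pipeline, i2 uses a stale value. */
    r1 = r2 + r3;          /* i1: write r1                       */
    r4 = r1 * 2;           /* i2: read  r1  (read-after-write)   */

    /* WAR: i2 writes r2, which i1 reads.  If i2's write overtakes i1's
       read, i1 sees the wrong operand. */
    r4 = r2 + r3;          /* i1: read  r2                       */
    r2 = r3 * 2;           /* i2: write r2  (write-after-read)   */

    /* WAW: both instructions write r4.  If they complete out of order,
       the final value of r4 is wrong. */
    r4 = r2 + r3;          /* i1: write r4                       */
    r4 = r3 * 2;           /* i2: write r4  (write-after-write)  */

    printf("r1=%d r2=%d r3=%d r4=%d\n", r1, r2, r3, r4);
    return 0;
}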
July – Oct 2021
CSC 457 Lecture Notes 71
 Pipeline Processor Concepts
o The dependencies between instructions are checked statically by the
compiler and/or dynamically by the hardware at run time. This preserves
the execution order for dependent instructions, which ensures valid results.
o In general, dynamic dependency checking has the advantage of being able
to determine dependencies that are either impossible or hard to detect at
compile time. However, it may not be able to exploit all the parallelism
available in a loop because of the limited lookahead ability that can be
supported by the hardware.
o Two of the most commonly used techniques for dynamic dependency
checking are called Tomasulo's method and the scoreboard method
o Tomasulo's method increases concurrent execution of the instructions with
minimal (or no) effort by the compiler or the programmer.
o The scoreboard method: multiple functional units allow instructions to be
completed out of the original program order.
July – Oct 2021
CSC 457 Lecture Notes 72
 Pipeline Processor Concepts
3. Control Hazard
o In any set of instructions, there is normally a need for some kind of statement that allows
the flow of control to be something other than sequential. Instructions that do this are
included in every programming language and are called branches. In general, about 30% of
all instructions in a program are branches.
o This means that branch instructions in the pipeline can reduce the throughput
tremendously if not handled properly. Whenever a branch is taken, the performance of the
pipeline is seriously affected. Each such branch requires a new address to be loaded into
the program counter, which may invalidate all the instructions that are either already in
the pipeline or prefetched in the buffer. This draining and refilling of the pipeline for each
branch degrade the throughput of the pipeline to that of a sequential processor.
o Note that the presence of a branch statement does not automatically cause the pipeline to
drain and begin refilling. A branch not taken allows the continued sequential flow of
uninterrupted instructions to the pipeline. Only when a branch is taken does the problem
arise.
July – Oct 2021
CSC 457 Lecture Notes 73
 Pipeline Processor Concepts
3. Control Hazard
o Branch instructions can be classified into three groups: (1) unconditional branch,
(2) conditional branch, and (3) loop branch
o An unconditional branch always alters the sequential program flow. It sets a new
target address in the program counter, rather than incrementing it by 1 to point
to the next sequential instruction address, as is normally the case.
o A conditional branch sets a new target address in the program counter only when
a certain condition, usually based on a condition code, is satisfied. Otherwise, the
program counter is incremented by 1 as usual. A conditional branch selects a path
of instructions based on a certain condition. If the condition is satisfied, the path
starts from the target address and is called a target path. If it is not, the path
starts from the next sequential instruction and is called a sequential path.
o A loop branch in a loop statement usually jumps back to the beginning of the loop
and executes it either a fixed or a variable (data-dependent) number of times.
July – Oct 2021
CSC 457 Lecture Notes 74
 Pipeline Processor Concepts
Techniques for Reducing Effect of Branching on Processor Performance
o To reduce the effect of branching on processor performance, several techniques
have been proposed. Some of the better known techniques are branch prediction,
delayed branching, and multiple prefetching
1. Branch Prediction. In this type of design, the outcome of a branch decision is predicted
before the branch is actually executed. Therefore, based on a particular prediction, the
sequential path or the target path is chosen for execution. Although the chosen path often
reduces the branch penalty, it may increase the penalty in case of incorrect prediction.
2. Delayed Branching. The delayed branching scheme eliminates or significantly reduces
the effect of the branch penalty. In this type of design, a certain number of instructions
after the branch instruction are fetched and executed regardless of which path will be
chosen for the branch. For example, a processor with a branch delay of k executes a path
containing the next k sequential instructions and then either continues on the same path
or starts a new path from a new target address. As often as possible, the compiler tries to
fill the next k instruction slots after the branch with instructions that are independent from
the branch instruction. NOP (no operation) instructions are placed in any remaining empty
slots.
July – Oct 2021
CSC 457 Lecture Notes 75
 Pipeline Processor Concepts
Techniques for Reducing Effect of Branching on Processor Performance
3. Multiple Prefetching. In this type of design, the processor fetches both possible
paths. Once the branch decision is made, the unwanted path is thrown away. By
prefetching both possible paths, the fetch penalty is avoided in the case of an
incorrect prediction. To fetch both paths, two buffers are employed to service the
pipeline.
In normal execution, the first buffer is loaded with instructions from the next
sequential address of the branch instruction. If a branch occurs, the contents of the
first buffer are invalidated, and the secondary buffer, which has been loaded with
instructions from the target address of the branch instruction, is used as the primary
buffer.
This double buffering scheme ensures a constant flow of instructions and data to the
pipeline and reduces the time delays caused by the draining and refilling of the
pipeline. Some amount of performance degradation is unavoidable any time the
pipeline is drained, however
July – Oct 2021
CSC 457 Lecture Notes 76
 Pipeline Processor Concepts
Further Throughput Improvement of an Instruction Pipeline
o One way to increase the throughput of an instruction pipeline is
to exploit instruction-level parallelism. The common approaches
to accomplish such parallelism are called superscalar,
superpipeline, and very long instruction word (VLIW)
July – Oct 2021
CSC 457 Lecture Notes 77
 Pipeline Processor Concepts
Further Throughput Improvement of an Instruction Pipeline
1. Superscalar
o The superscalar approach relies on spatial parallelism, that is, multiple
operations running concurrently on separate hardware. This approach
achieves the execution of multiple instructions per clock cycle by issuing
several instructions to different functional units.
o A superscalar processor contains one or more instruction pipelines sharing a
set of functional units. It often contains functional units, such as an add
unit, multiply unit, divide unit, floating-point add unit, and graphic unit.
o A superscalar processor contains a control mechanism to preserve the
execution order of dependent instructions for ensuring a valid result. The
scoreboard method and Tomasulo's method can be used for implementing
such mechanisms.
o In practice, most of the processors are based on the superscalar approach
and employ a scoreboard method to ensure a valid result.
July – Oct 2021
CSC 457 Lecture Notes 78
 Pipeline Processor Concepts
Further Throughput Improvement of an Instruction Pipeline
2. Superpipeline
o The superpipeline approach achieves high performance by overlapping the
execution of multiple instructions on one instruction pipeline.
o A superpipeline processor often has an instruction pipeline with more stages
than a typical instruction pipeline design. In other words, the execution
process of an instruction is broken down into even finer steps. By increasing
the number of stages in the instruction pipeline, each stage has less work to
do. This allows the pipeline clock rate to increase (cycle time decreases),
since the clock rate depends on the delay found in the slowest stage of the
pipeline.
o An example of such an architecture is the MIPS R4000 processor. The R4000
subdivides instruction fetching and data cache access to create an eight-
stage pipeline.
July – Oct 2021
CSC 457 Lecture Notes 79
 Pipeline Processor Concepts
Further Throughput Improvement of an Instruction Pipeline
3. Very Long Instruction Word (VLIW).
o The very long instruction word (VLIW) approach makes extensive use of the
compiler by requiring it to incorporate several small independent operations
into a long instruction word.
o The instruction is large enough to provide, in parallel, enough control bits
over many functional units. In other words, a VLIW architecture provides
many more functional units than a typical processor design, together with a
compiler that finds parallelism across basic operations to keep the functional
units as busy as possible.
o The compiler compacts ordinary sequential codes into long instruction words
that make better use of resources. During execution, the control unit issues
one long instruction per cycle. The issued instruction initiates many
independent operations simultaneously
July – Oct 2021
CSC 457 Lecture Notes 80
 Pipeline Processor Concepts
Further Throughput Improvement of an Instruction Pipeline
o A comparison of the three approaches will show a few interesting
differences.
o For instance, the superscalar and VLIW approaches are more sensitive to
resource conflicts than the superpipelined approach.
o In a superscalar or VLIW processor, a resource must be duplicated to
reduce the chance of conflicts, while the superpipelined design avoids any
resource conflicts.
July – Oct 2021
CSC 457 Lecture Notes 81
Pipeline Datapath Design and Implementation
o The work involved in an instruction can be partitioned into
steps labelled IF (Instruction Fetch), ID (Instruction
Decode and data fetch), EX (ALU operations or R-format
execution), MEM (Memory operations), and WB (Write-
Back to register file)
July – Oct 2021
CSC 457 Lecture Notes 82
Pipeline Datapath Design and Implementation
MIPS Instructions and Pipelining
o MIPS (Microprocessor without Interlocked Pipeline Stages) is a
reduced instruction set computer (RISC) instruction set architecture
(ISA)
o In order to implement MIPS instructions effectively on a
pipeline processor, we must ensure that the instructions
are the same length (simplicity favors regularity) for easy
IF and ID, similar to the multicycle datapath.
o We also need to have few but consistent instruction
formats, to avoid deciphering variable formats during IF
and ID, which would prohibitively increase pipeline
segment complexity for those tasks. Thus, the register
indices should be in the same place in each instruction.
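Because every MIPS instruction is 32 bits wide with the register fields in fixed positions, the ID stage can extract them with simple shifts and masks. The sketch below assumes the standard R-format layout (opcode[31:26], rs[25:21], rt[20:16], rd[15:11], shamt[10:6], funct[5:0]); the example word encodes add $t0, $t1, $t2.

#include <stdio.h>
#include <stdint.h>

/* Field layout of a 32-bit MIPS R-format instruction:
   opcode[31:26] rs[25:21] rt[20:16] rd[15:11] shamt[10:6] funct[5:0] */
struct rformat {
    unsigned opcode, rs, rt, rd, shamt, funct;
};

static struct rformat decode_r(uint32_t word)
{
    struct rformat f;
    f.opcode = (word >> 26) & 0x3F;
    f.rs     = (word >> 21) & 0x1F;
    f.rt     = (word >> 16) & 0x1F;
    f.rd     = (word >> 11) & 0x1F;
    f.shamt  = (word >>  6) & 0x1F;
    f.funct  =  word        & 0x3F;
    return f;
}

int main(void)
{
    uint32_t word = 0x012A4020;   /* add $t0, $t1, $t2  ($8 = $9 + $10) */
    struct rformat f = decode_r(word);
    printf("opcode=%u rs=%u rt=%u rd=%u shamt=%u funct=0x%X\n",
           f.opcode, f.rs, f.rt, f.rd, f.shamt, f.funct);
    return 0;
}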
July – Oct 2021
CSC 457 Lecture Notes 83
Next …....
Memory and I/O Systems
July – Oct 2021
CSC 457 Lecture Notes 84
Levels of Memory
o Level 1 or Registers – the storage locations built into the CPU
itself, in which data is stored and accessed immediately.
Commonly used registers include the accumulator, the program
counter, and address registers.
o Level 2 or Cache memory – a very fast memory with a short
access time, in which data is temporarily stored for faster
access.
o Level 3 or Main Memory – the memory on which the computer
currently works. It is small compared with secondary storage,
and once power is off its data no longer persists.
o Level 4 or Secondary Memory – external memory which is not as
fast as main memory, but in which data stays permanently.
July – Oct 2021
CSC 457 Lecture Notes 85
Cache Memory
o The cache is a smaller and faster memory which stores
copies of the data from frequently used main memory
locations.
o Cache memory is a special very high-speed memory used
to speed up and synchronize with the high-speed CPU.
Cache memory is costlier than main memory or disk
memory but more economical than CPU registers.
o Cache memory is an extremely fast memory type that acts
as a buffer between RAM and the CPU. It holds frequently
requested data and instructions so that they are
immediately available to the CPU when needed.
July – Oct 2021
CSC 457 Lecture Notes 86
Cache Memory
o Cache memory is used to reduce the average time to access data
from the Main memory. The cache is a smaller and faster memory
which stores copies of the data from frequently used main memory
locations.
o There are various different independent caches in a CPU, which
store instructions and data.
July – Oct 2021
CSC 457 Lecture Notes 87
Basic Definitions in Cache Memory
o cache block - The basic unit for cache storage. May
contain multiple bytes/words of data.
o cache line - Same as cache block. Note that this is not the
same thing as a “row” of cache.
o cache set - A “row” in the cache. The number of blocks per
set is determined by the layout of the cache (e.g. direct
mapped, set-associative, or fully associative).
o tag - A unique identifier for a group of data. Because
different regions of memory may be mapped into a block,
the tag is used to differentiate between them.
o valid bit - A bit of information that indicates whether the
data in a block is valid (1) or not (0).
July – Oct 2021
CSC 457 Lecture Notes 88
Types of Cache Memory
o There are three general cache levels:
o L1 cache, or primary cache, is extremely fast but relatively
small, and is usually embedded in the processor chip as
CPU cache.
o L2 cache, or secondary cache, often has higher capacity
than L1. L2 cache may be embedded on the CPU, or it can
be on a separate chip or coprocessor and have a high-
speed alternative system bus connecting the cache and
CPU. That way it doesn't get slowed by traffic on the main
system bus.
July – Oct 2021
CSC 457 Lecture Notes 89
Types of Cache Memory
o Level 3 (L3) cache is specialized memory developed to improve the
performance of L1 and L2. L1 or L2 can be significantly faster than
L3, though L3 is usually double the speed of DRAM. With multicore
processors, each core can have dedicated L1 and L2 cache, but
they can share an L3 cache. If an L3 cache references an
instruction, it is usually elevated to a higher level of cache.
o Contrary to popular belief, implementing flash or more dynamic
RAM (DRAM) on a system won't increase cache memory. This can
be confusing since the terms memory caching (hard disk buffering)
and cache memory are often used interchangeably.
o Memory caching, using DRAM or flash to buffer disk reads, is
meant to improve storage I/O by caching data that is frequently
referenced in a buffer ahead of slower magnetic disk or tape.
Cache memory, on the other hand, provides read buffering for the
CPU.
July – Oct 2021
CSC 457 Lecture Notes 90
Cache Memory Performance
o When the processor needs to read or write a location in main
memory, it first checks for a corresponding entry in the cache.
o If the processor finds that the memory location is in the cache, a
cache hit has occurred and the data is read from the cache
o If the processor does not find the memory location in the cache, a
cache miss has occurred. For a cache miss, the cache allocates a
new entry and copies in data from main memory, then the request
is fulfilled from the contents of the cache.
o The performance of cache memory is frequently measured in terms
of a quantity called Hit ratio.
Hit ratio = hit / (hit + miss) = no. of hits/total accesses
o Cache performance can be improved by using larger cache block sizes
and higher associativity, and by reducing the miss rate, the miss penalty,
and the time to hit in the cache.
July – Oct 2021
CSC 457 Lecture Notes 91
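As a small illustration of the hit-ratio formula above, the following Python fragment is an illustrative sketch written for these notes (the function and variable names are made up); it tallies hits and misses for a stream of accesses and reports the hit ratio.

# Illustrative hit-ratio bookkeeping for a stream of memory accesses.
def hit_ratio(access_results):
    """access_results: list of booleans, True for a cache hit, False for a miss."""
    hits = sum(1 for r in access_results if r)
    total = len(access_results)
    return hits / total if total else 0.0

accesses = [True, True, False, True, False, True, True, True]  # 6 hits, 2 misses
print(f"Hit ratio = {hit_ratio(accesses):.2f}")   # 6 / 8 = 0.75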
Architecture and data flow of a typical
cache memory
July – Oct 2021
CSC 457 Lecture Notes 92
Cache Memory Mapping
o Three types of mapping are used for cache memory: direct mapping,
associative mapping, and set-associative mapping.
o A direct mapped cache has each block mapped to exactly one cache memory
location. Conceptually, a direct mapped cache is like rows in a table with three
columns: the cache block that contains the actual data fetched and stored, a
tag with all or part of the address of the data that was fetched, and a valid bit
that indicates whether the data in the row entry is valid.
o Fully associative cache mapping is similar to direct mapping in structure but
allows a memory block to be mapped to any cache location rather than to a
prespecified cache memory location as is the case with direct mapping.
o Set associative cache mapping can be viewed as a compromise between direct
mapping and fully associative mapping in which each block is mapped to a
subset of cache locations. It is sometimes called N-way set associative
mapping, which provides for a location in main memory to be cached to any of
"N" locations in the L1 cache.
July – Oct 2021
CSC 457 Lecture Notes 93
Locality of Reference
o The ability of cache memory to improve a computer's
performance relies on the concept of locality of reference.
o Locality describes various situations that make a system
more predictable.
o Cache memory takes advantage of these situations to create
a pattern of memory access that it can rely upon.
o There are several types of locality. Two key ones for cache
are:
 Temporal locality. This is when the same resources are
accessed repeatedly in a short amount of time.
 Spatial locality. This refers to accessing various data or
resources that are near each other.
July – Oct 2021
CSC 457 Lecture Notes 94
Importance of Cache Memory
o Cache memory is important because it improves the efficiency of
data retrieval (improve performance). It stores program
instructions and data that are used repeatedly in the operation of
programs or information that the CPU is likely to need next. The
computer processor can access this information more quickly from
the cache than from the main memory. Fast access to these
instructions increases the overall speed of the program.
o Aside from its main function of improving performance, cache
memory is a valuable resource for evaluating a computer's overall
performance. Users can do this by looking at the cache's hit-to-miss
ratio. Cache hits are instances in which the system successfully
retrieves data from the cache. A cache miss is when the system
looks for the data in the cache, can't find it, and looks somewhere
else instead. In some cases, users can improve the hit-miss ratio
by adjusting the cache memory block size i.e. the size of data units
stored.
July – Oct 2021
CSC 457 Lecture Notes 95
Practice Questions
Que-1: A computer has a 256 KByte, 4-way set associative, write-back data
cache with a block size of 32 Bytes. The processor sends 32-bit addresses to the
cache controller. Each cache tag directory entry contains, in addition to the address
tag, 2 valid bits, 1 modified bit and 1 replacement bit. The number of bits in the
tag field of an address is
(A) 11
(B) 14
(C) 16
(D) 27
Answer: (C)
Explanation: https://www.geeksforgeeks.org/gate-gate-cs-2012-question-54/
July – Oct 2021
CSC 457 Lecture Notes 96
Practice Questions
o Explanation:
o A set-associative scheme is a hybrid between a fully associative cache, and
direct mapped cache. It’s considered a reasonable compromise between the
complex hardware needed for fully associative caches (which requires parallel
searches of all slots), and the simplistic direct-mapped scheme, which may
cause collisions of addresses to the same slot (similar to collisions in a hash
table).
• Number of blocks = Cache size / Block size = 256 KB / 32 Bytes = 2^13
• Number of sets = 2^13 / 4 = 2^11
o Tag + Set index bits + Byte offset bits = 32
o Tag + 11 + 5 = 32
o Tag = 16
July – Oct 2021
CSC 457 Lecture Notes 97
Practice Questions
Que-2: Consider the data given in previous question. The size of the cache tag
directory is
(A) 160 Kbits
(B) 136 bits
(C) 40 Kbits
(D) 32 bits
Answer: (A)
Explanation: https://www.geeksforgeeks.org/gate-gate-cs-2012-question-55/
July – Oct 2021
CSC 457 Lecture Notes 98
Practice Questions
Explanation: Each tag directory entry holds a 16-bit tag, 2 valid bits, 1 modified
bit and 1 replacement bit, i.e. 20 bits per entry.
Size of the tag directory = 20 bits × number of blocks = 20 × 2^13
= 160 Kbits.
July – Oct 2021
CSC 457 Lecture Notes 99
Practice Questions
Que-3: An 8KB direct-mapped write-back cache is organized as multiple blocks,
each of size 32 bytes. The processor generates 32-bit addresses. The cache
controller maintains the tag information for each cache block, comprising the
following: 1 valid bit, 1 modified bit, and as many bits as the minimum needed to
identify the memory block mapped in the cache. What is the total size of memory
needed at the cache controller to store meta-data (tags) for the cache?
(A) 4864 bits
(B) 6144 bits
(C) 6656 bits
(D) 5376 bits
Answer: (D)
Explanation: https://www.geeksforgeeks.org/gate-gate-cs-2011-question-43/
July – Oct 2021
CSC 457 Lecture Notes 100
Practice Questions
Explanation
o Cache size = 8 KB
o Block size = 32 bytes
o Number of cache lines = Cache size / Block size = (8 × 1024 bytes) / 32 = 256
o Tag bits per line = 32 − 5 (byte offset) − 8 (line index) = 19
o Total bits required to store the meta-data of 1 line = 1 + 1 + 19 = 21 bits
o Total memory required = 21 × 256 = 5376 bits
July – Oct 2021
CSC 457 Lecture Notes 101
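The three practice questions above follow the same arithmetic, so a short Python sketch can verify all of them. This is illustrative code written for these notes (the function and variable names are not from any standard library); it assumes byte-addressable memory and the bit counts stated in the questions.

import math

def cache_geometry(cache_bytes, block_bytes, ways, addr_bits):
    """Return (offset_bits, index_bits, tag_bits, blocks) for a set-associative cache."""
    blocks = cache_bytes // block_bytes
    sets = blocks // ways
    offset_bits = int(math.log2(block_bytes))
    index_bits = int(math.log2(sets))
    tag_bits = addr_bits - index_bits - offset_bits
    return offset_bits, index_bits, tag_bits, blocks

# Que-1/Que-2: 256 KB, 4-way, 32-byte blocks, 32-bit addresses, 4 extra bits per entry.
off, idx, tag, blocks = cache_geometry(256 * 1024, 32, 4, 32)
print("Q1 tag bits:", tag)                          # 16
print("Q2 directory bits:", (tag + 4) * blocks)     # 20 * 8192 = 163840 bits = 160 Kbits

# Que-3: 8 KB direct-mapped (ways = 1), 32-byte blocks, 2 extra bits per line.
off, idx, tag, lines = cache_geometry(8 * 1024, 32, 1, 32)
print("Q3 meta-data bits:", (tag + 2) * lines)      # 21 * 256 = 5376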
Locating Data in the Cache
o Given an address, we can determine whether the data at
that memory location is in the cache. To do so, we use the
following procedure:
1. Use the set index to determine which cache set the
address should reside in.
2. For each block in the corresponding cache set, compare
the tag associated with that block to the tag from the
memory address. If there is a match, proceed to the next
step. Otherwise, the data is not in the cache.
3. For the block where the data was found, look at valid
bit. If it is 1, the data is in the cache, otherwise it is not.
July – Oct 2021
CSC 457 Lecture Notes 102
Locating Data in the Cache
o If the data at that address is in the cache, then we use the block offset from
that address to find the data within the cache block where it was found.
o All of the information needed to locate the data in the cache is contained in the
address itself. Fig. 1 below shows which parts of the address are used for locating
data in the cache.
o The least significant bits are used to determine the block offset. If the block
size is B then b = log2 B bits will be needed in the address to specify the block
offset. The next highest group of bits is the set index and is used to determine
which cache set we will look at.
o If S is the number of sets in our cache, then the set index has s = log2 S bits.
Note that in a fully-associative cache, there is only 1 set so the set index will
not exist. The remaining bits are used for the tag. If ℓ is the length of the
address (in bits), then the number of tag bits is t = ℓ − b − s.
July – Oct 2021
Fig. 1: | tag (t bits) | set index (s bits) | block offset (b bits) |
CSC 457 Lecture Notes 103
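The address split in Fig. 1 can be expressed directly in code. The sketch below is illustrative only (parameter names are chosen for this example); it extracts the block offset, set index and tag from an address, given the block size B, the number of sets S and the address length ℓ, exactly as defined above.

def split_address(addr, block_size, num_sets, addr_bits):
    """Split an address into (tag, set_index, block_offset) per Fig. 1."""
    b = block_size.bit_length() - 1      # b = log2(B), block size assumed a power of two
    s = num_sets.bit_length() - 1        # s = log2(S)
    t = addr_bits - b - s                # t = l - b - s tag bits
    block_offset = addr & ((1 << b) - 1)
    set_index = (addr >> b) & ((1 << s) - 1)
    tag = (addr >> (b + s)) & ((1 << t) - 1)
    return tag, set_index, block_offset

# Example: 32-byte blocks, 2048 sets, 32-bit address.
print(split_address(0x1234ABCD, 32, 2048, 32))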
Cache Addressing
o (Read Tutorial on “Hardware Organization and Design” – 15 pages)
July – Oct 2021
CSC 457 Lecture Notes 104
Multilevel Cache Organisation
o Multilevel caching is one of the techniques used to improve cache performance by
reducing the “MISS PENALTY”. Miss Penalty refers to the extra time required
to bring the data into cache from the Main memory whenever there is a “miss”
in the cache.
o For clear understanding let us consider an example where the CPU requires 10
Memory References for accessing the desired information and consider this
scenario in the following 3 cases of System design :
Case 1 : System Design without Cache Memory
o Here the CPU directly communicates with the main memory and no caches
are involved. In this case, the CPU needs to access the main memory 10 times
to access the desired information.
July – Oct 2021
CSC 457 Lecture Notes 105
Multilevel Cache Organisation
Case 2 : System Design with Cache Memory
o Here the CPU at first checks whether the desired data is present in the
Cache Memory or not i.e. whether there is a “hit” in cache or “miss” in
the cache.
o Suppose 3 of the 10 references miss in the cache memory; then the main
memory will be accessed only 3 times.
o We can see that here the miss penalty is reduced because the main
memory is accessed fewer times than in the previous case.
July – Oct 2021
CSC 457 Lecture Notes 106
Multilevel Cache Organisation
Case 3 : System Design with Multilevel Cache Memory
o Here the cache performance is optimized further by introducing
multilevel caches; we consider a 2-level cache design.
o Suppose 3 of the references miss in the L1 cache and, out of these 3
misses, 2 also miss in the L2 cache; then the main memory will be
accessed only 2 times.
o It is clear that the miss penalty is reduced considerably compared with
the previous case, thereby improving the performance of the cache
memory.
July – Oct 2021
CSC 457 Lecture Notes 107
Multilevel Cache Organisation
o We can observe from the above 3 cases that we are trying to decrease the
number of main memory references and thus decrease the miss penalty in
order to improve the overall system performance. Also, it is important to note
that in a multilevel cache design, the L1 cache is attached to the CPU and is
small but fast, while the L2 cache is attached to the primary (L1) cache and is
larger and slower, but still faster than the main memory.
o Effective Access Time = Hit rate * Cache access time + Miss rate * Lower
level access time
o Average access Time For Multilevel Cache:(Tavg)
Tavg = H1 * C1 + (1 – H1) * (H2 * C2 +(1 – H2) *M )
where H1 is the Hit rate in the L1 caches; H2 is the Hit rate in the L2
cache; C1 is the Time to access information in the L1 caches; C2 is the Miss
penalty to transfer information from the L2 cache to an L1 cache and M is the
Miss penalty to transfer information from the main memory to the L2 cache.
July – Oct 2021
CSC 457 Lecture Notes 108
Multilevel Cache Organisation
Exercise
o Que 1 - Find the average memory access time for a processor with a 2 ns
clock cycle time, a miss rate of 0.04 misses per instruction, a miss penalty
of 25 clock cycles, and a cache access time (including hit detection) of 1
clock cycle. Also assume that the read and write miss penalties are the same
and ignore other write stalls.
Solution
Average memory access time (AMAT) = Hit time + Miss rate * Miss penalty
Hit time = 1 clock cycle (the cache access time, which is given directly)
Miss rate = 0.04; Miss penalty = 25 clock cycles (the time taken by the
next level of memory after a miss)
So, AMAT = 1 + 0.04 * 25 = 2 clock cycles
Since 1 clock cycle = 2 ns, AMAT = 4 ns
July – Oct 2021
CSC 457 Lecture Notes 109
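Both the single-level AMAT formula used in the exercise and the two-level Tavg formula on the previous slide are easy to check numerically. The Python sketch below is illustrative only; the hit rates and access times in the two-level usage example (H1 = 0.95, C1 = 1, H2 = 0.90, C2 = 10, M = 100 cycles) are assumed values chosen for demonstration, not data from the notes.

def amat(hit_time, miss_rate, miss_penalty):
    """Single-level average memory access time."""
    return hit_time + miss_rate * miss_penalty

def tavg_two_level(h1, c1, h2, c2, m):
    """Tavg = H1*C1 + (1 - H1) * (H2*C2 + (1 - H2)*M)  (formula from the slide above)."""
    return h1 * c1 + (1 - h1) * (h2 * c2 + (1 - h2) * m)

# Exercise check: 1-cycle hit time, 0.04 miss rate, 25-cycle miss penalty, 2 ns clock.
cycles = amat(1, 0.04, 25)
print(cycles, "clock cycles =", cycles * 2, "ns")        # 2.0 clock cycles = 4.0 ns

# Assumed two-level example: H1=0.95, C1=1, H2=0.90, C2=10, M=100 (cycles).
print(tavg_two_level(0.95, 1, 0.90, 10, 100), "cycles")  # 1.9 cycles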
Virtual Memory Terminologies
o Virtual Memory - A storage allocation scheme in which secondary memory
can be addressed as though it were part of main memory. The addresses
a program may use to reference memory are distinguished from the
addresses the memory system uses to identify physical storage sites, and
program-generated addresses are translated automatically to the
corresponding machine addresses. The size of virtual storage is limited by
the addressing scheme of the computer system and by the amount of
secondary memory available and not by the actual number of main
storage locations.
o Virtual Address - The address assigned to a location in virtual memory to
allow that location to be accessed as though it were part of main
memory.
o Virtual address space - The virtual storage assigned to a process.
o Address space - The range of memory addresses available to a process.
o Real address - The address of a storage location in main memory.
July – Oct 2021
CSC 457 Lecture Notes 110
Virtual Memory
o Virtual Memory is a storage allocation scheme in which secondary
memory can be addressed as though it were part of main memory.
o The addresses a program may use to reference memory are
distinguished from the addresses the memory system uses to
identify physical storage sites, and program generated addresses
are translated automatically to the corresponding machine
addresses.
o The size of virtual storage is limited by the addressing scheme of
the computer system and by the amount of secondary memory
available, not by the actual number of main storage locations.
o It is a technique that is implemented using both hardware and
software. It maps memory addresses used by a program, called
virtual addresses, into physical addresses in computer memory.
July – Oct 2021
CSC 457 Lecture Notes 111
Virtual Memory
o Two characteristics fundamental to memory management:
1) all memory references are logical addresses that are
dynamically translated into physical addresses at run time
2) a process may be broken up into a number of pieces
that don’t need to be contiguously located in main
memory during execution
o If these two characteristics are present, it is not necessary
that all of the pages or segments of a process be in main
memory during execution. This means that the required
pages need to be loaded into memory whenever required.
Virtual memory is implemented using Demand Paging or
Demand Segmentation.
July – Oct 2021
CSC 457 Lecture Notes 112
Thrashing
o A state in which the system spends most of its time
swapping process pieces rather than executing instructions
o To avoid this, the operating system tries to guess, based
on recent history, which pieces are least likely to be used
in the near future
July – Oct 2021
CSC 457 Lecture Notes 114
Principle of Locality
o Program and data references within a process
tend to cluster
o Only a few pieces of a process will be needed over
a short period of time
o Therefore it is possible to make intelligent guesses
about which pieces will be needed in the future
o Avoids thrashing
July – Oct 2021
CSC 457 Lecture Notes 115
 Support Needed for Virtual Memory
o For virtual memory to be practical and effective:
1. Hardware must support paging and segmentation
2. Operating system must include software for managing the
movement of pages and/or segments between secondary
memory and main memory
July – Oct 2021
CSC 457 Lecture Notes 116
 Paging
o The term virtual memory is usually associated
with systems that employ paging
o Use of paging to achieve virtual memory was first
reported for the Atlas computer
o Each process has its own page table and each
page table entry contains the frame number of
the corresponding page in main memory
July – Oct 2021
CSC 457 Lecture Notes 117
 Paging
o Paging is a memory management scheme that eliminates the
need for contiguous allocation of physical memory. This scheme
permits the physical address space of a process to be non-contiguous
• Logical Address or Virtual Address (represented in bits): An
address generated by the CPU
• Logical Address Space or Virtual Address Space(
represented in words or bytes): The set of all logical
addresses generated by a program
• Physical Address (represented in bits): An address actually
available on memory unit
• Physical Address Space (represented in words or bytes):
The set of all physical addresses corresponding to the logical
addresses
July – Oct 2021
CSC 457 Lecture Notes 118
 Paging
o Example:
• If Logical Address = 31 bits, then Logical Address
Space = 2^31 words = 2 G words (1 G = 2^30)
• If Logical Address Space = 128 M words = 2^7 *
2^20 words, then Logical Address = log2(2^27) = 27 bits
• If Physical Address = 22 bits, then Physical Address
Space = 2^22 words = 4 M words (1 M = 2^20)
• If Physical Address Space = 16 M words = 2^4 *
2^20 words, then Physical Address = log2(2^24) = 24 bits
July – Oct 2021
CSC 457 Lecture Notes 119
 Paging
o The mapping from virtual to physical address
is done by the memory management unit
(MMU) which is a hardware device and this
mapping is known as paging technique.
o The Physical Address Space is conceptually
divided into a number of fixed-size blocks,
called frames.
o The Logical address Space is also split into
fixed-size blocks, called pages.
o Page Size = Frame Size
July – Oct 2021
CSC 457 Lecture Notes 120
 Paging
• Let us consider an example:
• Physical Address = 12 bits, then Physical Address Space = 4 K words
• Logical Address = 13 bits, then Logical Address Space = 8 K words
• Page size = frame size = 1 K words (assumption)
July – Oct 2021
CSC 457 Lecture Notes 121
 Paging
o The address generated by the CPU is divided into:
• Page number (p): the bits that identify the page within the
Logical Address Space.
• Page offset (d): the bits that identify a particular word within
that page (determined by the page size).
o The Physical Address is divided into:
• Frame number (f): the bits that identify the frame within the
Physical Address Space.
• Frame offset (d): the bits that identify a particular word within
that frame (determined by the frame size).
July – Oct 2021
CSC 457 Lecture Notes 122
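A paged virtual-to-physical translation can be sketched in a few lines. The code below is an illustrative model written for these notes (the page-table dictionary and the sizes used are assumptions, not from the slides): it splits the CPU-generated logical address into page number p and offset d, looks p up in a page table, and recombines the frame number f with the same offset.

PAGE_SIZE = 1024        # 1 K words, as in the example above (assumed word-addressable)

# Toy page table: page number -> frame number (assumed contents for illustration).
page_table = {0: 5, 1: 2, 2: 7}

def translate(logical_addr):
    p, d = divmod(logical_addr, PAGE_SIZE)   # page number and page offset
    if p not in page_table:
        raise KeyError(f"page fault on page {p}")
    f = page_table[p]                        # frame number from the page table
    return f * PAGE_SIZE + d                 # physical address = frame base + offset

print(translate(1 * PAGE_SIZE + 37))   # page 1, offset 37 -> frame 2 -> 2085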
 Paging
o The hardware implementation of the page table can be done using
dedicated registers, but using registers for the page table is
satisfactory only if the page table is small.
o If the page table contains a large number of entries, then we can use a
TLB (Translation Look-aside Buffer), a special, small, fast-lookup
hardware cache.
o The TLB is associative, high-speed memory.
o Each entry in the TLB consists of two parts: a tag and a value.
o When this memory is used, an item is compared with all
tags simultaneously. If the item is found, the corresponding
value is returned.
July – Oct 2021
CSC 457 Lecture Notes 123
 Paging
July – Oct 2021
CSC 457 Lecture Notes 124
 Paging
o Main memory access time = m
o If the page table is kept in main memory,
o Effective access time = m (to access the page table entry) + m (to access the
required word in memory)
July – Oct 2021
CSC 457 Lecture Notes 125
 Paging
Sample Question
o Consider a machine with 64 MB physical memory and a 32-bit
virtual address space. If the page size is 4KB, what is the
approximate size of the page table?
(A) 16 MB
(B) 8 MB
(C) 2 MB
(D) 24 MB
Answer: (C)
o Explanation: See question 1 of
https://www.geeksforgeeks.org/operating-systems-set-2/
July – Oct 2021
CSC 457 Lecture Notes 126
 Paging
Explanation:
A page entry is used to get address of physical memory. Here we assume that single
level of Paging is happening. So the resulting page table will contain entries for all the
pages of the Virtual address space.
Number of entries in page table = (virtual address space size)/(page size)
Using above formula we can say that there will be 2^(32-12) = 2^20 entries in page
table.
No. of bits required to address the 64MB Physical memory = 26.
So there will be 2^(26-12) = 2^14 page frames in the physical memory. And page
table needs to store the address of all these 2^14 page frames. Therefore, each page
table entry will contain 14 bits address of the page frame and 1 bit for valid-invalid
bit.
Since memory is byte addressable, each page table entry is rounded up to 16 bits,
i.e. 2 bytes long.
Size of page table = (total number of page table entries) *(size of a page table entry)
= (2^20 *2) = 2MB
July – Oct 2021
CSC 457 Lecture Notes 127
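The arithmetic in the sample question can be reproduced with a few lines of Python. This is an illustrative sketch under the same assumptions as the explanation (single-level paging, a 14-bit frame number plus one valid-invalid bit, rounded up to 2 bytes per entry); the function name is made up for these notes.

def page_table_size(virtual_addr_bits, phys_mem_bytes, page_bytes, extra_bits=1):
    """Single-level page table size in bytes (each entry rounded up to whole bytes)."""
    offset_bits = (page_bytes - 1).bit_length()                  # 4 KB page -> 12 bits
    entries = 2 ** (virtual_addr_bits - offset_bits)              # 2^20 entries
    frame_bits = (phys_mem_bytes // page_bytes - 1).bit_length()  # 2^14 frames -> 14 bits
    entry_bytes = -(-(frame_bits + extra_bits) // 8)              # ceil(15 / 8) = 2 bytes
    return entries * entry_bytes

print(page_table_size(32, 64 * 2**20, 4 * 2**10) // 2**20, "MB")  # 2 MB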
 Demand Paging
o The process of loading a page into memory on demand
(whenever a page fault occurs) is known as demand paging.
The process includes the following steps:
July – Oct 2021
CSC 457 Lecture Notes 128
 Demand Paging
1. If the CPU tries to refer to a page that is currently not available in the main
memory, it generates an interrupt indicating a memory access fault.
2. The OS puts the interrupted process in a blocked state. For the
execution to proceed, the OS must bring the required page into
memory.
3. The OS will search for the required page in the logical address space.
4. The required page will be brought from the logical address space into the
physical address space. Page replacement algorithms are used to decide
which page to replace in the physical address space.
5. The page table will be updated accordingly.
6. A signal will be sent to the CPU to continue the program execution,
and the process will be placed back into the ready state.
o Hence, whenever a page fault occurs, these steps are followed by the
operating system and the required page is brought into memory.
July – Oct 2021
CSC 457 Lecture Notes 129
 Advantages of Demand Paging
1. More processes may be maintained in the main memory: Because we
are going to load only some of the pages of any particular process,
there is room for more processes. This leads to more efficient
utilization of the processor because it is more likely that at least one of
the more numerous processes will be in the ready state at any
particular time.
2. A process may be larger than all of main memory: One of the most
fundamental restrictions in programming is lifted. A process larger
than the main memory can be executed because of demand paging.
The OS itself loads pages of a process in main memory as required.
3. It allows greater multiprogramming levels by using less of the
available (primary) memory for each process
July – Oct 2021
CSC 457 Lecture Notes 130
 Page Fault Service Time
o The time taken to service a page fault is called the page
fault service time. The page fault service time includes the
time taken to perform all of the above six steps.
Let main memory access time = m
Page fault service time = s
Page fault rate = p
Then, Effective memory access time = (p * s) + (1 - p) * m
July – Oct 2021
CSC 457 Lecture Notes 131
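The effective-access-time formula above is easy to evaluate. In the sketch below (illustrative only; the values m = 200 ns and s = 8 ms are assumed numbers, not from the notes) even a tiny page-fault rate dominates the average, because the service time is so much larger than a memory access.

def effective_access_time(p, s, m):
    """p: page fault rate, s: page fault service time, m: main memory access time."""
    return p * s + (1 - p) * m

m = 200e-9        # 200 ns main memory access time (assumed)
s = 8e-3          # 8 ms page fault service time (assumed)
for p in (0.0, 1e-6, 1e-4):
    print(f"p = {p:g}: {effective_access_time(p, s, m) * 1e9:.1f} ns")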
 Swapping
o Swapping a process out means removing all of its pages
from memory, or marking them so that they will be
removed by the normal page replacement process.
o Suspending a process ensures that it is not runnable while
it is swapped out. At some later time, the system swaps
back the process from the secondary storage to main
memory.
o When a process spends most of its time swapping pages in and
out, the situation is called thrashing.
July – Oct 2021
CSC 457 Lecture Notes 132
 Inverted Page Table
o Page number portion of a virtual address is
mapped into a hash value
 hash value points to inverted page table
o Fixed proportion of real memory is required for
the tables regardless of the number of processes
or virtual pages supported
o Structure is called inverted because it indexes
page table entries by frame number rather than
by virtual page number
July – Oct 2021
CSC 457 Lecture Notes 133
 Inverted Page Table
o Each entry in the inverted page table includes:
• Page number
• Process identifier – the process that owns this page
• Control bits – includes flags and protection and locking info
• Chain pointer – the index value of the next entry in the chain
July – Oct 2021
CSC 457 Lecture Notes 134
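A minimal model of an inverted page table lookup is sketched below. It is illustrative only (the hash function, table size and entry fields are simplifications chosen for these notes): the (process id, page number) pair is hashed, and the chain pointer is followed until a matching entry is found; the index of that entry is the frame number.

from dataclasses import dataclass
from typing import Optional

@dataclass
class IPTEntry:
    page_number: int
    process_id: int
    control_bits: int = 0
    chain: Optional[int] = None   # index of next entry with the same hash, or None

NUM_FRAMES = 8
table = [None] * NUM_FRAMES       # one entry per physical frame
hash_anchor = {}                  # hash value -> first frame index in the chain

def lookup(pid, page):
    """Return the frame number holding (pid, page), or None on a page fault."""
    i = hash_anchor.get(hash((pid, page)) % NUM_FRAMES)
    while i is not None:
        e = table[i]
        if e and e.process_id == pid and e.page_number == page:
            return i              # frame number = table index
        i = e.chain if e else None
    return None

# Pre-filled example: process 1's page 3 resides in frame 2 (assumed contents).
table[2] = IPTEntry(page_number=3, process_id=1)
hash_anchor[hash((1, 3)) % NUM_FRAMES] = 2

print(lookup(1, 3))   # 2
print(lookup(1, 4))   # None (page fault)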
 Translation Lookaside Buffer (TLB)
o Each virtual memory reference can cause two
physical memory accesses:
 one to fetch the page table entry
 one to fetch the data
o To overcome the effect of doubling the memory
access time, most virtual memory schemes make
use of a special high-speed cache called a
translation lookaside buffer (TLB)
July – Oct 2021
CSC 457 Lecture Notes 135
 Translation Lookaside Buffer (TLB)
o The TLB only contains some of the page table
entries so we cannot simply index into the TLB
based on page number
 each TLB entry must include the page number as well
as the complete page table entry (associative mapping)
o The processor is equipped with hardware that
allows it to interrogate simultaneously a number
of TLB entries to determine if there is a match on
page number
July – Oct 2021
CSC 457 Lecture Notes 136
 Translation Lookaside Buffer (TLB)
o Some TLBs store address-space identifiers (ASIDs)
in each TLB entry –
– uniquely identifies each process
– provide address-space protection for that process
– Otherwise need to flush at every context switch
o TLBs typically small (64 to 1,024 entries)
o On a TLB miss, value is loaded into the TLB for
faster access next time
– Replacement policies must be considered
– Some entries can be wired down for permanent fast
access
July – Oct 2021
CSC 457 Lecture Notes 137
 Improving Efficiency of Virtual Address Translation
o The next step towards improving the efficiency of virtual address
translation is the memory management unit (MMU), introduced into
modern microprocessors.
o The functioning of the memory management unit is based on the use
of address translation buffers and other registers, in which current
pointers to all tables used in virtual-to-physical address translation are
stored
July – Oct 2021
CSC 457 Lecture Notes 138
 Improving Efficiency of Virtual Address Translation
o The MMU unit checks if the requested page descriptor is in
the TLB. If so, the MMU generates the physical address for
the main memory.
o If the descriptor is missing in TLB, then MMU brings the
descriptor from the main memory and updates the TLB.
o Next, depending on the presence of the page in the main
memory, the MMU performs address translation or
launches the transmission of the page to the main memory
from the auxiliary store.
July – Oct 2021
CSC 457 Lecture Notes 139
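Putting the TLB and the page table together, the sketch below models the MMU decision sequence described above. It is an illustrative toy written for these notes (the TLB size, page size and page-table contents are assumptions): the TLB is checked first, the page table is consulted on a TLB miss, and the fetched descriptor is written back into the TLB.

from collections import OrderedDict

PAGE_SIZE = 4096
page_table = {0: 9, 1: 4, 7: 2}          # page -> frame (assumed contents)

class TLB:
    """Tiny fully associative TLB with LRU replacement (illustrative)."""
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.entries = OrderedDict()      # page -> frame

    def get(self, page):
        if page in self.entries:
            self.entries.move_to_end(page)    # refresh LRU order on a hit
            return self.entries[page]
        return None

    def put(self, page, frame):
        self.entries[page] = frame
        self.entries.move_to_end(page)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry

tlb = TLB()

def mmu_translate(vaddr):
    page, offset = divmod(vaddr, PAGE_SIZE)
    frame = tlb.get(page)                     # 1) check the TLB first
    if frame is None:
        if page not in page_table:            # 3) page not resident: page fault
            raise KeyError(f"page fault on page {page}")
        frame = page_table[page]              # 2) TLB miss: read descriptor, update TLB
        tlb.put(page, frame)
    return frame * PAGE_SIZE + offset

print(hex(mmu_translate(0x1ABC)))   # page 1 -> frame 4 -> 0x4abc
print(hex(mmu_translate(0x1ABC)))   # second access hits in the TLB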
 Segmentation
o A process is divided into segments. The chunks that a
program is divided into, which are not necessarily all of the
same size, are called segments. Segmentation gives the user's
view of the process, which paging does not; here the user's
view is mapped onto physical memory.
o There are two types of segmentation:
1. Virtual memory segmentation –
Each process is divided into a number of segments, not all
of which are resident at any one point in time.
2. Simple segmentation –
Each process is divided into a number of segments, all of
which are loaded into memory at run time, though not
necessarily contiguously.
July – Oct 2021
CSC 457 Lecture Notes 140
 Segmentation
o There is no simple relationship between logical addresses
and physical addresses in segmentation. A table stores the
information about all such segments and is called Segment
Table.
o Segment Table – maps the two-dimensional logical address
into a one-dimensional physical address. Each table
entry has:
o Base Address: It contains the starting physical address
where the segments reside in memory.
o Limit: It specifies the length of the segment.
July – Oct 2021
CSC 457 Lecture Notes 141
 Segmentation
July – Oct 2021
CSC 457 Lecture Notes 142
 Segmentation
o Translation of Two dimensional Logical Address to one
dimensional Physical Address
o Address generated by the CPU is divided into:
 Segment number (s): Number of bits required to represent the segment.
 Segment offset (d): Number of bits required to represent the size of the
segment.
July – Oct 2021
CSC 457 Lecture Notes 143
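The segment-table translation can be sketched in the same style. The code below is illustrative (the segment table contents are assumed values chosen for the example): the logical address (s, d) is translated by checking d against the segment's limit and adding it to the segment's base address.

# Segment table: segment number -> (base physical address, limit).  Assumed contents.
segment_table = {
    0: (1400, 1000),
    1: (6300,  400),
    2: (4300,  400),
}

def translate_segment(s, d):
    """Translate the 2-D logical address (segment s, offset d) to a physical address."""
    if s not in segment_table:
        raise KeyError(f"invalid segment {s}")
    base, limit = segment_table[s]
    if d >= limit:
        raise ValueError(f"segmentation fault: offset {d} exceeds limit {limit}")
    return base + d

print(translate_segment(2, 53))   # 4300 + 53 = 4353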
 Segmentation
Advantages of Segmentation
1. No Internal fragmentation
2. Segment table consumes less space in comparison to page
table in paging
Disadvantage of Segmentation
1. As processes are loaded and removed from the memory,
the free memory space is broken into little pieces, causing
external fragmentation
July – Oct 2021
CSC 457 Lecture Notes 144
Next …....
Shared Memory Multiprocessors
July – Oct 2021
CSC 457 Lecture Notes 145
 Shared Memory Multiprocessors
o A system with multiple CPUs “sharing” the same main memory is
called a multiprocessor.
o In a multiprocessor system all processes on the various CPUs share
a unique logical address space, which is mapped on a physical
memory that can be distributed among the processors.
o Each process can read and write a data item simply using load and
store operations, and process communication is through shared
memory.
o It is the hardware that makes all CPUs access and use the same
main memory.
o This architectural model is simple and easy to use for
programming; it can be applied to a wide variety of problems that
can be modeled as a set of tasks, to be executed in parallel (at least
partially)
July – Oct 2021
CSC 457 Lecture Notes 146
 Shared Memory Multiprocessors
o Since all CPUs share the address space, only a single
instance of the operating system is required.
o When a process terminates or goes into a wait state for
whatever reason, the OS can look in the process table
(more precisely, in the ready processes queue) for another
process to be dispatched to the idle CPU.
o On the contrary, in systems with no shared memory, each
CPU must have its own copy of the operating system, and
processes can only communicate through message passing.
o The basic issue in shared memory multiprocessor systems is
memory itself, since the larger the number of processors
involved, the more difficult it is to use memory efficiently.
July – Oct 2021
CSC 457 Lecture Notes 147
 Shared Memory Multiprocessors
o All modern OSs (Windows, Solaris, Linux, MacOS) support
symmetric multiprocessing (SMP), with a scheduler running
on every processor (a simplified description, of course).
o “Ready to run” processes can be inserted into a single
queue that can be accessed by every scheduler;
alternatively, there can be a “ready to run” queue for each
processor.
o When a scheduler is activated in a processor, it chooses one
of the “ready to run” processes and dispatches it on its
processor (with a single queue, things are somewhat more
difficult, can you guess why?)
July – Oct 2021
CSC 457 Lecture Notes 148
 Shared Memory Multiprocessors
o A distinct feature in multiprocessor systems is load balancing.
o It is useless having many CPUs in a system, if processes are not
distributed evenly among the cores.
o With a single “ready-to-run” queue, load balancing is usually
automatic: if a processor is idle, its scheduler will pick a process
from the shared queue and will start it on that processor.
o Modern OSs designed for SMP often have a separate queue for
each processor (to avoid the problems associated with a single
queue).
o There is an explicit mechanism for load balancing, by which a
process on the wait list of an overloaded processor is moved to
the queue of another, less loaded processor.
 As an example, SMP Linux activates its load balancing scheme every 200
ms, and whenever a processor queue empties.
July – Oct 2021
CSC 457 Lecture Notes 149
 Shared Memory Multiprocessors
o Migrating a process to a different processor can be costly when
each core has a private cache (can you guess why?).
o This is why some OSs, such as Linux, offer a system call to
specify that a process is tied to a given processor, independently of
the processors' load.
o There are three classes of multiprocessors, according to the way
each CPU sees main memory:
- Uniform Memory Access (UMA),
- Non Uniform Memory Access (NUMA)
- Cache Only Memory Access (COMA)
July – Oct 2021
CSC 457 Lecture Notes 150
 Shared Memory Multiprocessors
1. Uniform Memory Access (UMA):
o The name of this type of architecture hints at the fact that
all processors share a unique centralized primary memory,
so each CPU has the same memory access time.
o Owing to this architecture, these systems are also called
Symmetric Shared-memory Multiprocessors (SMP)
o The simplest multiprocessor system has a single bus to
which at least two CPUs and a memory (shared among all
processors) are connected.
o When a CPU wants to access a memory location, it checks if
the bus is free, then it sends the request to the memory
interface module and waits for the requested data to be
available on the bus.
July – Oct 2021
CSC 457 Lecture Notes 151
 Shared Memory Multiprocessors
1. Uniform Memory Access (UMA):
o Multicore processors are small UMA multiprocessor systems, where the first
shared cache (L2 or L3) is actually the communication channel.
o Shared memory can quickly become a bottleneck for system performances,
since all processors must synchronize on the single bus and memory
access.
o Larger multiprocessor systems (>32 CPUs) cannot use a single bus to
interconnect CPUs to memory modules, because bus contention becomes
unmanageable.
o The CPU–memory connection is instead realized through an interconnection
network (in jargon, a “fabric”).
o Caches local to each CPU alleviate the problem; furthermore, each processor
can be equipped with a private memory to store data of computations that
need not be shared with other processors. Traffic to/from shared memory
can thus be reduced considerably
July – Oct 2021
CSC 457 Lecture Notes 152
 Shared Memory Multiprocessors
2. Non Uniform Memory Access (NUMA):
o Single bus UMA systems are limited in the number of
processors, and costly hardware is necessary to connect
more processors. Current technology prevents building UMA
systems with more than 256 processors.
o To build larger systems, a compromise is mandatory: not
all memory blocks can have the same access time with
respect to each CPU.
o This is the origin of the name NUMA systems: Non Uniform
Memory Access.
July – Oct 2021
CSC 457 Lecture Notes 153
 Shared Memory Multiprocessors
2. Non Uniform Memory Access (NUMA):
o These systems have a shared logical address space, but
physical memory is distributed among CPUs, so that access
time to data depends on data position, in local or in a
remote memory (thus the NUMA denomination). These
systems are also called Distributed Shared Memory (DSM)
architectures
July – Oct 2021
CSC 457 Lecture Notes 154
 Shared Memory Multiprocessors
2. Non Uniform Memory Access (NUMA):
o Since all NUMA systems have a single logical address space
shared by all CPUs, while physical memory is distributed
among processors, there are two types of memories: local
and remote memory.
o Yet, even remote memory is accessed by each CPU with
LOAD and STORE instructions.
o There are two types of NUMA systems:
• Non-Caching NUMA (NC-NUMA)
• Cache-Coherent NUMA (CC-NUMA)
July – Oct 2021
CSC 457 Lecture Notes 155
 Shared Memory Multiprocessors
Non Caching -NUMA
o In a NC-NUMA system, processors have no local cache.
o Each memory access is managed by a modified MMU, which checks whether
the request is for a local or for a remote block; in the latter case, the
request is forwarded to the node containing the requested data.
o Obviously, programs using remote data (with respect to the CPU
requesting them) will run much slower than they would if the
data were stored in the local memory
o In NC-NUMA systems there is no cache coherency problem, because
there is no caching at all: each memory item is in a single location.
o Remote memory access is however very inefficient. For this reason, NC-
NUMA systems can resort to special software that relocates memory
pages from one block to another, just to maximize performance.
o A page scanner daemon activates every few seconds, examines statistics
on memory usage, and moves pages from one block to another, to
increase performance.
July – Oct 2021
CSC 457 Lecture Notes 156
 Shared Memory Multiprocessors
Non Caching -NUMA
o Actually, in NC-NUMA systems, each processor can also have a private memory
and a cache, and only private data (those allocated in the private local memory)
can be held in the cache.
o This solution increases the performances of each processor, and is adopted in
Cray T3D/E.
o Yet, remote data access time remains very high, 400 processor clock cycles in
Cray T3D/E, against 2 for retrieving data from local cache.
July – Oct 2021
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 

Recently uploaded (20)

Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 

  • 9. CSC 457 Lecture Notes 9 1. Computational Models for High Performance Microprocessors July – Oct 2021
  • 10. CSC 457 Lecture Notes 10  Computational Models o High performance RISC-based microprocessors are defining the current history of high performance computing o A Complex Instruction Set Computer (CISC) instruction set is made up of powerful primitives, close in functionality to the primitives of high-level languages o “If RISC is faster, why did people bother with CISC designs in the first place?”  RISC wasn’t always both feasible and affordable July – Oct 2021
  • 11. CSC 457 Lecture Notes 11 Computational Models o High-level language compilers were commonly available, but they didn’t generate the fastest code, and they weren’t terribly thrifty with memory. o When programming, you needed to save both space and time. A good instruction set was both easy to use and powerful. o Computers had very little storage by today’s standards. An instruction that could roll all the steps of a complex operation, such as a do-loop, into a single opcode was a plus, because memory was precious. o Complex instructions saved time, too. Almost every large computer following the IBM 704 had a memory system that was slower than its central processing unit (CPU). When a single instruction can perform several operations, the overall number of instructions retrieved from memory can be reduced. July – Oct 2021
  • 12. CSC 457 Lecture Notes 12 Computational Models o There were several obvious pressures that affected the development of RISC: - The number of transistors that could fit on a single chip was increasing. It was clear that one would eventually be able to fit all the components from a processor board onto a single chip. - Techniques such as pipelining were being explored to improve performance. Variable-length instructions and variable-length instruction execution times (due to varying numbers of microcode steps) made implementing pipelines more difficult. - As compilers improved, they found that well-optimized sequences of streamlined instructions often outperformed the equivalent complicated multi-cycle instructions. July – Oct 2021
  • 13. CSC 457 Lecture Notes 13 Computational Models o The RISC designers sought to create a high performance single-chip processor with a fast clock rate. o When a CPU can fit on a single chip, its cost is decreased, its reliability is increased, and its clock speed can be increased. o While not all RISC processors are single-chip implementations, most use a single chip. o To accomplish this task, it was necessary to discard the existing CISC instruction sets and develop a new minimal instruction set that could fit on a single chip. Hence the term Reduced Instruction Set Computer. July – Oct 2021
  • 14. CSC 457 Lecture Notes 14 Computational Models o The earliest RISC processors had no floating-point support in hardware, and some did not even support integer multiply in hardware. However, these instructions could be implemented using software routines that combined other instructions (a microcode of sorts). o These earliest RISC processors (most severely reduced) were not overwhelming successes, for four reasons:  It took time for compilers, operating systems, and user software to be retuned to take advantage of the new processors.  If an application depended on the performance of one of the software-implemented instructions, its performance suffered dramatically.  Because RISC instructions were simpler, more instructions were needed to accomplish the task.  Because all the RISC instructions were 32 bits long, and commonly used CISC instructions were as short as 8 bits, RISC program executables were often larger. July – Oct 2021
  • 15. CSC 457 Lecture Notes 15 Computational Models o As a result of these last two issues, a RISC program may have to fetch more memory for its instructions than a CISC program. This increased appetite for instructions actually clogged the memory bottleneck until sufficient caches were added to the RISC processors. o RISC processors quickly became known for their affordable high-speed floating-point capability compared to CISC processors. This excellent performance on scientific and engineering applications effectively created a new type of computer system, the workstation. July – Oct 2021
  • 16. CSC 457 Lecture Notes 16  Parallel Architectures o Concurrency and parallelism are related concepts, but they are distinct. Concurrent programming happens when several computations are happening in overlapping time periods. Your laptop, for example, seems like it is doing a lot of things at the same time even though there are only 1, 2, or 4 cores. So, we have concurrency without parallelism. o At the other end of the spectrum, the CPU in your laptop is carrying out pieces of the same computation in parallel to speed up the execution of the instruction stream. o Parallel computing occupies a unique spot in the universe of distributed systems. Parallel computing is centralized—all of the processes are typically under the control of a single entity. Parallel computing is usually hierarchical—parallel architectures are frequently described as grids, trees, or pipelines. Parallel computing is co-located—for efficiency, parallel processes are typically located very close to each other, often in the same chassis or at least the same data center. These choices are driven by the problem space and the need for high performance. July – Oct 2021
  • 17. CSC 457 Lecture Notes 17  Parallel Architectures o Parallel computing occupies a unique spot in the universe of distributed systems. o Parallel computing is centralized—all of the processes are typically under the control of a single entity. o Parallel computing is usually hierarchical—parallel architectures are frequently described as grids, trees, or pipelines. o Parallel computing is co-located—for efficiency, parallel processes are typically located very close to each other, often in the same chassis or at least the same data center. o These choices are driven by the problem space and the need for high performance. July – Oct 2021
  • 18. CSC 457 Lecture Notes 18  Parallel Architectures o Definition of a parallel computer: A set of independent processors that can work cooperatively to solve a problem o A parallel system consists of an algorithm and the parallel architecture on which the algorithm is implemented. o Note that an algorithm may have different performance on different parallel architectures. o For example, an algorithm may perform differently on a linear array of processors and on a hypercube of processors July – Oct 2021
  • 19. CSC 457 Lecture Notes 19  Parallel Architectures o Why Use Parallel Computing?  Single processor speeds are reaching their ultimate limits  Multi-core processors and multiple processors are the most promising paths to performance improvements o Concurrency: The property of a parallel algorithm that a number of operations can be performed by separate processors at the same time. Concurrency is the key concept in the design of parallel algorithms:  Requires a different way of looking at the strategy to solve a problem  May require a very different approach from a serial program to achieve high efficiency July – Oct 2021
  • 20. CSC 457 Lecture Notes 20  Parallel Architectures • July – Oct 2021
  • 21. CSC 457 Lecture Notes 21  Parallel Architectures o Protein folding problems involve a large number of independent calculations that do not depend on data from other calculations o Concurrent calculations with no dependence on the data from other calculations are termed Embarrassingly Parallel o These embarrassingly parallel problems are ideal for solution by HPC methods, and can realize nearly ideal concurrency and scalability o Flexibility in the way a problem is solved is beneficial to finding a parallel algorithm that yields a good parallel scaling. o Often, one has to employ substantial creativity in the way a parallel algorithm is implemented to achieve good scalability. July – Oct 2021
  • 22. CSC 457 Lecture Notes 22  Parallel Architectures o Understand the Dependencies o One must understand all aspects of the problem to be solved, in particular the possible dependencies of the data. o It is important to understand fully all parts of a serial code that you wish to parallelize. Example: Pressure Forces (Local) vs. Gravitational Forces (Global) o When designing a parallel algorithm, always remember:  Computation is FAST  Communication is SLOW  Input/Output (I/O) is INCREDIBLY SLOW o In addition to concurrency and scalability, there are a number of other important factors in the design of parallel algorithms: Locality; Granularity; Modularity; Flexibility; Load balancing July – Oct 2021
  • 23. CSC 457 Lecture Notes 23  Parallel Architectures Parallel Computer Architectures o Virtually all computers follow the basic design of the Von Neumann Architecture, as follows:  Memory stores both instructions and data  Control unit fetches instructions from memory, decodes instructions, and then sequentially performs operations to carry out the programmed task  Arithmetic Unit performs mathematical operations  Input/Output is the interface to the user July – Oct 2021
  • 24. CSC 457 Lecture Notes 24  Parallel Architectures Flynn’s Taxonomy o SISD: This is a standard serial computer: one set of instructions, one data stream o SIMD: All units execute the same instructions on different data streams (vector) - Useful for specialized problems, such as graphics/image processing - Old vector supercomputers worked this way, as do modern GPUs o MISD: Single data stream operated on by different sets of instructions, not generally used for parallel computers o MIMD: Most common parallel computer, each processor can execute different instructions on different data streams - Often constructed of many SIMD subcomponents July – Oct 2021
  • 26. CSC 457 Lecture Notes 26  Parallel Architectures Parallel Computer Memory Architectures o Shared Memory – memory shared among various CPUs o Distributed Memory – each CPU has its own memory o Hybrid Distributed Shared Memory July – Oct 2021
  • 27. CSC 457 Lecture Notes 27  Parallel Architectures Relation to Parallel Programming Models o OpenMP: Multi-threaded calculations occur within shared-memory components of systems, with different threads working on the same data. o MPI: Based on a distributed-memory model; data associated with another processor must be communicated over the network connection. o GPUs: Graphics Processing Units (GPUs) incorporate many (hundreds of) computing cores with a single control unit, so this is a shared-memory model. o Processors vs. Cores: A modern processor package typically contains several cores, each of which can execute its own instruction stream. July – Oct 2021
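To make the OpenMP shared-memory model just described concrete, here is a minimal sketch in C (not from the original notes); the file name, array size, and the use of a reduction clause are illustrative assumptions only.

    /* omp_sum.c - compile with: gcc -fopenmp omp_sum.c
       Minimal OpenMP sketch: several threads cooperate on the same
       shared array, as in the shared-memory programming model above. */
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        enum { N = 1000000 };
        static double a[N];
        double sum = 0.0;

        for (int i = 0; i < N; i++)          /* serial initialisation */
            a[i] = 1.0;

        /* Each thread handles a chunk of the shared array; the
           reduction clause combines the per-thread partial sums. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %.0f using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }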
  • 28. CSC 457 Lecture Notes 28  Parallel Architectures Embarrassingly Parallel o Refers to an approach that involves solving many similar but independent tasks simultaneously o Little to no coordination (and thus no communication) between tasks o Each task can be a simple serial program o This is the “easiest” type of problem to implement in a parallel manner. Essentially requires automatically coordinating many independent calculations and possibly collating the results. o Examples: Computer Graphics and Image Processing; Protein Folding Calculations in Biology; Geographic Land Management Simulations in Geography; Data Mining in numerous fields; Event simulation and reconstruction in Particle Physics July – Oct 2021
  • 29. CSC 457 Lecture Notes 29  Internetworking Performance Issues and Scalability of Parallel Architectures o Performance Limitations of Parallel Architectures o Adding additional resources doesn’t necessarily speed up a computation. There’s a limit defined by Amdahl’s Law. o The basic idea of Amdahl’s law is that a parallel computation’s maximum performance gain is limited by the portion of the computation that has to happen serially, which creates a bottleneck. o The serial portion includes scheduling, resource allocation, communication, synchronization, etc. o For example, if a computation that takes 20 hours on a single CPU has a serial portion that takes 1 hour (5%), then Amdahl’s law shows that no matter how many processors you put on the task, the maximum speedup is 20x. Consequently, after a point, putting additional processors on the job just wastes resources. July – Oct 2021
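The 20-hour example above can be checked with a short sketch of Amdahl's law (not from the original notes); the helper name amdahl_speedup and the processor counts tried are illustrative.

    /* amdahl.c - evaluates Amdahl's law for the slide's example:
       a 20-hour job with a 1-hour (5%) serial portion. */
    #include <stdio.h>

    /* speedup = 1 / (s + (1 - s)/p), s = serial fraction, p = processors */
    static double amdahl_speedup(double s, int p) {
        return 1.0 / (s + (1.0 - s) / p);
    }

    int main(void) {
        double s = 1.0 / 20.0;                     /* 1 of 20 hours is serial */
        int procs[] = { 1, 4, 16, 64, 1024, 1000000 };

        for (int i = 0; i < 6; i++)
            printf("p = %7d  speedup = %6.2f\n", procs[i], amdahl_speedup(s, procs[i]));

        /* As p grows the speedup approaches 1/s = 20x, so extra
           processors beyond a point add almost nothing. */
        return 0;
    }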
  • 30. CSC 457 Lecture Notes 30  Internetworking Performance Issues and Scalability of Parallel Architectures • July – Oct 2021
  • 31. CSC 457 Lecture Notes 31  Internetworking Performance Issues and Scalability of Parallel Architectures Process Interaction o Except for embarrassingly parallel algorithms, the threads in a parallel computation need to communicate with each other. There are two ways they can do this: o Shared memory – the processes can share a storage location that they use for communicating. Shared memory can also be used to synchronize threads by using the shared memory as a semaphore. o Messaging – the processes communicate via messages. This could be over a network or a special-purpose bus. Networks for this use are typically hyper-local and designed for this purpose. July – Oct 2021
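As a minimal sketch of the messaging style of interaction (not from the original notes), the hypothetical program below has MPI rank 0 send one integer to rank 1; it assumes an MPI installation providing mpicc and mpirun.

    /* msg_demo.c - compile: mpicc msg_demo.c   run: mpirun -np 2 ./a.out
       Two processes with separate address spaces communicate only by
       exchanging a message. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, value = 42;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* to rank 1 */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }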
  • 32. CSC 457 Lecture Notes 32  Internetworking Performance Issues and Scalability of Parallel Architectures Consistent Execution o The threads of execution for most parallel algorithms must be coupled to achieve consistent execution. o Parallel threads of execution communicate to transfer values between processes. Parallel algorithms communicate not only to calculate the result, but to achieve deterministic execution. o For any given set of inputs, the parallel version of an algorithm should return the same answer each time it is run, and the same answer that a sequential version of the algorithm would return. o Parallel algorithms achieve this by locking memory or otherwise sequencing operations between threads. This communication, together with the waiting required for sequencing, imposes a performance overhead. o As we saw in our discussion of Amdahl’s Law, these sequential portions of a parallel algorithm are the limiting factor in speeding up execution. July – Oct 2021
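A minimal pthreads sketch of the locking idea above (not part of the notes); the thread count, iteration count, and shared counter are illustrative, and the lock is what makes the parallel result match the sequential one on every run.

    /* lock_demo.c - compile with: gcc -pthread lock_demo.c
       Sequencing shared-memory updates with a mutex so the parallel
       result is deterministic and equals the sequential result. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NITER    100000

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < NITER; i++) {
            pthread_mutex_lock(&lock);    /* serialize the update ...      */
            counter++;                    /* ... so no increments are lost */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        /* Always NTHREADS * NITER, the same answer a sequential loop gives. */
        printf("counter = %ld\n", counter);
        return 0;
    }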
  • 33. CSC 457 Lecture Notes 33  Internetworking Performance Issues and Scalability of Parallel Architectures o (Read tutorial on Performance and Scalability on Parallel Computing attached) July – Oct 2021
  • 34. CSC 457 Lecture Notes 34 Next ….... Performance Evaluation July – Oct 2021
  • 35. CSC 457 Lecture Notes 35  Performance Modelling o The goal of performance modeling is to gain understanding of a computer system’s performance on various applications, by means of measurement and analysis, and then to encapsulate these characteristics in a compact formula. o The resulting model can be used to gain greater understanding of the performance phenomena involved and to project performance to other system/application combinations July – Oct 2021
  • 36. CSC 457 Lecture Notes 36  Performance Modelling o The performance profile of a given system/application combination depends on numerous factors, including: (1) System size;(2) System architecture; (3) Processor speed; (4) Multi-level cache latency and bandwidth; (5) Interprocessor network latency and bandwidth; (6) System software efficiency; (7) Type of application; (8) Algorithms used; (9) Programming language used; (10) Problem size; (11) Amount of I/O; July – Oct 2021
  • 37. CSC 457 Lecture Notes 37  Performance Modelling o Performance models can be used to improve architecture design, inform procurement, and guide application tuning o It has been observed that, due to the difficulty of developing performance models for new applications, as well as the increasing complexity of new systems, our supercomputers have become better at predicting and explaining natural phenomena (such as the weather) than at predicting and explaining their own performance or that of other computers. July – Oct 2021
  • 38. CSC 457 Lecture Notes 38  Performance Modelling Applications of Performance Modelling o Performance modeling can be used in numerous ways. Here is a brief summary of these usages, both present-day and future possibilities; 1. System design. o Performance models are frequently employed by computer vendors in their design of future systems. Typically engineers construct a performance model for one or two key applications, and then compare future technology options based on performance model projections. Once performance modeling techniques are better developed, it may be possible to target many more applications and technology options July – Oct 2021
  • 39. CSC 457 Lecture Notes 39  Performance Modelling Applications of Performance Modelling 2. Runtime estimation o The most common application for a performance model is to enable a scientist to estimate the runtime of a job when the input parameters for the job are changed, or when a different number of processors is used in a parallel computer system. o One can also estimate the largest size of system that can be used to run a given problem before the parallel efficiency drops to an unacceptable level. 3. System tuning o An example of using performance modeling for system tuning is where a performance model is used to diagnose and rectify a misconfigured channel buffer, which yields a doubling of network performance for programs sending short messages July – Oct 2021
  • 40. CSC 457 Lecture Notes 40  Performance Modelling Applications of Performance Modelling 4. Application Tuning o If a memory performance model is combined with application parameters, one can predict how cache hit-rates would change if a different cache blocking factor were used in the application. o Once the optimal cache blocking has been identified, then the code can be permanently changed. o Simple performance models can even be incorporated into an application code, permitting on-the-fly selection of different program options. o Performance models, by providing performance expectations based on the fundamental computational characteristics of algorithms, can also enable algorithmic choice before going to the trouble to implement all the possible choices July – Oct 2021
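To illustrate the cache-blocking idea mentioned above, here is a minimal loop-tiling sketch (not from the original notes); the matrix size N and block size B are arbitrary illustrative values, whereas in a real tuning exercise B would come from the memory performance model.

    /* blocking.c - minimal sketch of cache blocking (loop tiling) for a
       matrix transpose; a model would suggest a B that keeps a BxB tile
       of each array resident in cache. */
    #include <stdio.h>

    #define N 512
    #define B 64                         /* candidate cache blocking factor */

    static double a[N][N], t[N][N];

    int main(void) {
        /* Work on BxB tiles so each tile is reused while still cached. */
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        t[j][i] = a[i][j];

        printf("t[0][0] = %.1f\n", t[0][0]);
        return 0;
    }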
  • 41. CSC 457 Lecture Notes 41  Pipeline Freeze Strategies • July – Oct 2021
  • 42. CSC 457 Lecture Notes 42  Pipeline Freeze Strategies • July – Oct 2021
  • 43. CSC 457 Lecture Notes 43  Branch Prediction Strategies o In a highly parallel (pipelined) system, conditional instructions break the continuous flow of the program and decrease the performance of the pipelined processor, causing delays. o To reduce this delay, prediction of the branch direction is necessary, and the variety of branch behaviour calls for accurate branch prediction strategies. Branch prediction is therefore a vital part of present-day pipelined processors. o Branch prediction is the process of making an educated guess as to whether a branch will be taken or not taken, based on a preset algorithm. o A branch is a category of instruction which causes the code to move to another block to continue execution. Branch prediction can be static or dynamic July – Oct 2021
  • 44. CSC 457 Lecture Notes 44  Prediction Strategies o Static branch prediction means that a given branch will always be predicted as taken or not taken without possibility of change throughout the duration of the program. o Dynamic branch prediction means that the predicted outcome of a branch is dependent on an algorithm, and the prediction may change throughout the course of the program. o Code is able to use a combination of both static and dynamic branch predictors based on the type of branch. o The improvement from branch prediction depends on the number of branches in the code, as well as the type of prediction being used, as different prediction methods have varied rates of success. o Overall, branch prediction provides an increase in performance for code containing branches. The improvement comes from the computational cycles that can be spent on useful work rather than wasted, as they would be in a system that does not use branch prediction. July – Oct 2021
  • 45. CSC 457 Lecture Notes 45  Prediction Strategies o There are three different kinds of branches: forward conditional, backward conditional, and unconditional branches. o Forward conditional branches are when a branch evaluates to a target that is somewhere forward in the instruction stream. o Backward conditional branches are when a branch evaluates to a target that is somewhere backward in the instruction stream. Common instances of backward conditional branches are loops. o Unconditional branches are branches which will always occur. July – Oct 2021
  • 46. CSC 457 Lecture Notes 46  Prediction Strategies o A static or dynamic prediction strategy will determine which algorithms or methods are available for use. o For static branch prediction, the strategy may be predict taken, predict not taken, or some combination that depends on the branch type, such as backward branch predict taken, forward branch predict not taken. The third strategy is advantageous for programs with loops because it will have a higher percentage of correctly predicted branches for backward branches. o Dynamic branch prediction is able to use one-level prediction, two-level adaptive prediction, or a tournament predictor. One-level prediction keeps a counter for each branch and uses that branch’s history to predict its future outcomes. o The address of the branch is used as an index into a table where these counters are stored. When a branch is taken, its counter is incremented; when it is not taken, the counter is decremented, so the counter tracks the recent behaviour of that branch. July – Oct 2021
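A minimal sketch of the one-level (bimodal) scheme described above, using 2-bit saturating counters indexed by the branch address (not from the original notes); the table size, the address hash, and the branch pattern in main() are illustrative assumptions.

    /* bimodal.c - one-level branch predictor with 2-bit saturating
       counters: states 0-1 predict not taken, states 2-3 predict taken. */
    #include <stdio.h>
    #include <stdint.h>

    #define TABLE_BITS 10
    #define TABLE_SIZE (1u << TABLE_BITS)

    static uint8_t counters[TABLE_SIZE];          /* all start at 0 */

    static int predict(uint32_t addr) {
        return counters[addr & (TABLE_SIZE - 1)] >= 2;   /* 1 = predict taken */
    }

    static void update(uint32_t addr, int taken) {
        uint8_t *c = &counters[addr & (TABLE_SIZE - 1)];
        if (taken  && *c < 3) (*c)++;             /* saturate at 3 */
        if (!taken && *c > 0) (*c)--;             /* saturate at 0 */
    }

    int main(void) {
        /* A loop branch taken 9 times out of 10: after warm-up the
           predictor is right about 90% of the time. */
        uint32_t addr = 0x400123;                 /* illustrative address */
        int correct = 0, total = 100;

        for (int i = 0; i < total; i++) {
            int taken = (i % 10) != 9;
            correct += (predict(addr) == taken);
            update(addr, taken);
        }
        printf("accuracy: %d/%d\n", correct, total);
        return 0;
    }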
  • 47. CSC 457 Lecture Notes 47  Prediction Strategies o The two-level adaptive branch prediction strategy is very similar to the one-level strategy. The two-level strategy uses the same counter concept as the one-level, except that it implements this counter while taking input from other branches. This strategy may also be used to predict the direction of the branch based on the direction and outcomes of other branches in the program. This strategy is also called a global history counter. o Hybrid or tournament prediction strategies use a combination of two or more other prediction strategies. For example, any static prediction used in conjunction with a dynamic prediction strategy would be considered a hybrid strategy. o All of the strategies listed here are used in practice. The two-bit counter presented in the one-level branch prediction strategy is used in a number of other branch prediction strategies, including a predictor for choosing which predictor to use. o One disadvantage to each of these strategies is that their level of improvement for a given code will vary depending on what is written into the code July – Oct 2021
  • 48. CSC 457 Lecture Notes 48  Composite Strategies July – Oct 2021 (Blank!!)
  • 49. CSC 457 Lecture Notes 49  Benchmark Performance (Blank!!) July – Oct 2021
  • 50. CSC 457 Lecture Notes 50  Pipeline Processor Concepts o High performance is an important issue in microprocessors, and its importance has been increasing over the years. o To improve performance, two alternative methods exist: (a) improve the hardware by providing faster circuits; (b) arrange the hardware so that more than one operation can be performed at the same time. o Pipelining is an arrangement of the hardware elements of the CPU such that overall performance is increased; simultaneous execution of more than one instruction takes place in a pipelined processor July – Oct 2021
  • 51. CSC 457 Lecture Notes 51  Pipeline Processor Concepts o A pipeline processor is composed of a sequential, linear list of segments, where each segment performs one computational task or group of tasks. o There are three things that one must observe about the pipeline. 1. First, the work (in a computer, the ISA) is divided up into pieces that more or less fit into the segments allotted for them. 2. Second, this implies that in order for the pipeline to work efficiently and smoothly, the work partitions must each take about the same time to complete. Otherwise, the longest partition requiring time T would hold up the pipeline, and every segment would have to take time T to complete its work. For fast segments, this would mean much idle time. 3. Third, in order for the pipeline to work smoothly, there must be few (if any) exceptions or hazards that cause errors or delays within the pipeline. Otherwise, the instruction will have to be reloaded and the pipeline restarted with the same instruction that caused the exception. July – Oct 2021
  • 52. CSC 457 Lecture Notes 52  Pipeline Processor Concepts o Work Partitioning: A multicycle datapath is based on the assumption that computational work associated with the execution of an instruction could be partitioned into a five-step process, as follows: July – Oct 2021
  • 53. CSC 457 Lecture Notes 53  Pipeline Processor Concepts o Pipelining is one way of improving the overall processing performance of a processor. This architectural approach allows the simultaneous execution of several instructions. o Pipelining is transparent to the programmer; it exploits parallelism at the instruction level by overlapping the execution process of instructions. o It is analogous to an assembly line where workers perform a specific task and pass the partially completed product to the next worker o The pipeline design technique decomposes a sequential process into several subprocesses, called stages or segments. A stage performs a particular function and produces an intermediate result. o It consists of an input latch, also called a register or buffer, followed by a processing circuit. (A processing circuit can be a combinational or sequential circuit.) July – Oct 2021
  • 54. CSC 457 Lecture Notes 54  Pipeline Processor Concepts o At each clock pulse, every stage transfers its intermediate result to the input latch of the next stage. In this way, the final result is produced after the input data have passed through the entire pipeline, completing one stage per clock pulse. o The period of the clock pulse should be large enough to provide sufficient time for a signal to traverse through the slowest stage, which is called the bottleneck (i.e., the stage needing the longest amount of time to complete). o In addition, there should be enough time for a latch to store its input signals. o If the clock's period, P, is expressed as P = tb + tl, then tb should be greater than the maximum delay of the bottleneck stage, and tl should be sufficient for storing data into a latch July – Oct 2021
  • 55. CSC 457 Lecture Notes 55  Pipeline Processor Concepts Completion Time for pipelined processor o The ability to overlap stages of a sequential process for different input tasks (data or operations) results in an overall theoretical completion time of Tpipe = m*P + (n-1)*P, where n is the number of input tasks, m is the number of stages in the pipeline, and P is the clock period o The term m*P is the time required for the first input task to get through the pipeline, and the term (n-1)*P is the time required for the remaining tasks. o After the pipeline has been filled, it generates an output on each clock cycle. In other words, after the pipeline is loaded, it will generate output only as fast as its slowest stage. o Even with this limitation, the pipeline will greatly outperform nonpipelined techniques, which require each task to complete before another task’s execution sequence begins. To be more specific, when n is large, a pipelined processor can produce output approximately m times faster than a nonpipelined processor. July – Oct 2021
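A short sketch of the completion-time formula above (not from the notes); the values m = 5 stages, P = 10 ns, and n = 4 tasks are taken from the instruction-pipeline example worked later in this section.

    /* pipe_time.c - pipelined vs. non-pipelined completion time. */
    #include <stdio.h>

    int main(void) {
        int m = 5;            /* pipeline stages   */
        int n = 4;            /* input tasks       */
        double P = 10.0;      /* clock period (ns) */

        double t_pipe = m * P + (n - 1) * P;   /* = (m + n - 1) * P */
        double t_seq  = (double)n * m * P;     /* no overlap at all */

        printf("Tpipe = %.0f ns, Tseq = %.0f ns\n", t_pipe, t_seq);   /* 80 vs 200 */
        return 0;
    }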
  • 56. CSC 457 Lecture Notes 56  Pipeline Processor Concepts • July – Oct 2021
  • 57. CSC 457 Lecture Notes 57  Pipeline Processor Concepts Pipeline Performance Measures 1. Speedup o Now, speedup (S) may be represented as: S = Tseq / Tpipe = n*m / (m + n - 1). The value S approaches m when n → ∞. That is, the maximum speedup, also called ideal speedup, of a pipeline processor with m stages over an equivalent nonpipelined processor is m. In other words, the ideal speedup is equal to the number of pipeline stages. That is, when n is very large, a pipelined processor can produce output approximately m times faster than a nonpipelined processor. When n is small, the speedup decreases; in fact, for n = 1 the pipeline has the minimum speedup of 1. July – Oct 2021
  • 58. CSC 457 Lecture Notes 58  Pipeline Processor Concepts Pipeline Performance Measures 2. Efficiency o The efficiency E of a pipeline with m stages is defined as: E = S/m = [n*m / (m + n - 1)] / m = n / (m + n - 1). The efficiency E, which represents the speedup per stage, approaches its maximum value of 1 when n → ∞. When n = 1, E will have the value 1/m, which is the lowest obtainable value. July – Oct 2021
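The limiting behaviour of the speedup and efficiency formulas above can be seen with a small sketch (not from the notes); the stage count m and the list of task counts are illustrative.

    /* speedup.c - evaluates S = n*m/(m+n-1) and E = n/(m+n-1) for a
       fixed m and growing n, showing S -> m and E -> 1. */
    #include <stdio.h>

    int main(void) {
        int m = 5;                              /* illustrative stage count */
        int ns[] = { 1, 4, 16, 100, 10000 };

        for (int i = 0; i < 5; i++) {
            int n = ns[i];
            double S = (double)n * m / (m + n - 1);
            double E = (double)n / (m + n - 1);
            printf("n = %6d  S = %5.2f  E = %4.2f\n", n, S, E);
        }
        return 0;
    }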
  • 59. CSC 457 Lecture Notes 59  Pipeline Processor Concepts • July – Oct 2021
  • 60. CSC 457 Lecture Notes 60  Pipeline Processor Concepts Pipeline Types o Pipelines are usually divided into two classes: instruction pipelines and arithmetic pipelines. A pipeline in each of these classes can be designed in two ways: static or dynamic. o A static pipeline can perform only one operation (such as addition or multiplication) at a time. The operation of a static pipeline can only be changed after the pipeline has been drained. (A pipeline is said to be drained when the last input data leave the pipeline.) For example, consider a static pipeline that is able to perform addition and multiplication. Each time that the pipeline switches from a multiplication operation to an addition operation, it must be drained and set for the new operation. o The performance of static pipelines is severely degraded when the operations change often, since this requires the pipeline to be drained and refilled each time. o A dynamic pipeline can perform more than one operation at a time. To perform a particular operation on an input data, the data must go through a certain sequence of stages. In dynamic pipelines the mechanism that controls when data should be fed to the pipeline is much more complex than in static pipelines July – Oct 2021
  • 61. CSC 457 Lecture Notes 61  Pipeline Processor Concepts Instruction Pipeline o An instruction pipeline increases the performance of a processor by overlapping the processing of several different instructions. An instruction pipeline often consists of five stages, as follows: 1. Instruction fetch (IF). Retrieval of instructions from cache (or main memory). 2. Instruction decoding (ID). Identification of the operation to be performed. 3. Operand fetch (OF). Decoding and retrieval of any required operands. 4. Execution (EX). Performing the operation on the operands. 5. Write-back (WB). Updating the destination operands. An instruction pipeline overlaps the process of the preceding stages for different instructions to achieve a much lower total completion time, on average, for a series of instructions. July – Oct 2021
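The overlap of these five stages can be visualised with a small sketch (not from the notes) that prints the space-time diagram for four instructions; the instruction count matches the worked example on the next slide, and the stage names follow the list above.

    /* pipe_diagram.c - space-time diagram for n instructions flowing
       through the five-stage pipeline listed above. */
    #include <stdio.h>

    int main(void) {
        const char *stage[] = { "IF", "ID", "OF", "EX", "WB" };
        int m = 5, n = 4;

        for (int i = 0; i < n; i++) {               /* one row per instruction */
            printf("i%d: ", i + 1);
            for (int c = 1; c <= m + n - 1; c++) {  /* cycles 1 .. m+n-1       */
                int s = c - 1 - i;                  /* stage index at cycle c  */
                if (s >= 0 && s < m) printf("%-3s", stage[s]);
                else                 printf("   ");
            }
            printf("\n");
        }
        return 0;
    }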
  • 62. CSC 457 Lecture Notes 62  Pipeline Processor Concepts Instruction Pipeline o During the first cycle, or clock pulse, instruction i1 is fetched from memory. Within the second cycle, instruction i1 is decoded while instruction i2 is fetched. This process continues until all the instructions are executed. The last instruction finishes the write-back stage after the eighth clock cycle. o Therefore, it takes 80 nanoseconds (ns) to complete execution of all four instructions, assuming the clock period to be 10 ns. The total completion time is Tpipe = m*P + (n-1)*P = 5*10 + (4-1)*10 = 80 ns. Note that in a nonpipelined design the completion time will be much higher. July – Oct 2021
  • 63. CSC 457 Lecture Notes 63  Pipeline Processor Concepts Instruction Pipeline o Note that in a nonpipelined design the completion time will be much higher. Tseq = n*m*P = 4*5*10 = 200 ns o It is worth noting that a pipeline simply takes advantage of these naturally occurring stages to improve processing efficiency. o Henry Ford made the same connection when he realized that all cars were built in stages and invented the assembly line in the early 1900s. o Even though pipelining speeds up the execution of instructions, it does pose potential problems. Some of these problems and possible solutions are discussed next July – Oct 2021
  • 64. CSC 457 Lecture Notes 64  Pipeline Processor Concepts Improving the Throughput of an Instruction Pipeline o Three sources of architectural problems may affect the throughput of an instruction pipeline. They are fetching, bottleneck, and issuing problems. Some solutions are given for each. 1. The fetching problem o In general, supplying instructions rapidly through a pipeline is costly in terms of chip area. Buffering the data to be sent to the pipeline is one simple way of improving the overall utilization of a pipeline. The utilization of a pipeline is defined as the percentage of time that the stages of the pipeline are used over a sufficiently long period of time. A pipeline is utilized 100% of the time when every stage is used (utilized) during each clock cycle. July – Oct 2021
  • 65. CSC 457 Lecture Notes 65  Pipeline Processor Concepts Improving the Throughput of an Instruction Pipeline 1. The fetching problem o Occasionally, the pipeline has to be drained and refilled, for example, whenever an interrupt or a branch occurs. The time spent refilling the pipeline can be minimized by having instructions and data loaded ahead of time into various physically close buffers (like on-chip caches) for immediate transfer into the pipeline. If instructions and data for normal execution can be fetched before they are needed and stored in buffers, the pipeline will have a continuous source of information with which to work. Prefetch algorithms are used to make sure potentially needed instructions are available most of the time. Delays from memory access conflicts can thereby be reduced if these algorithms are used, since the time required to transfer data from main memory is far greater than the time required to transfer data from a buffer. July – Oct 2021
  • 66. CSC 457 Lecture Notes 66  Pipeline Processor Concepts Improving the Throughput of an Instruction Pipeline 2. The bottleneck problem o The bottleneck problem relates to the amount of load (work) assigned to a stage in the pipeline. o If too much work is applied to one stage, the time taken to complete an operation at that stage can become unacceptably long. o This relatively long time spent by the instruction at one stage will inevitably create a bottleneck in the pipeline system. o In such a system, it is better to remove the bottleneck that is the source of congestion. One solution to this problem is to further subdivide the stage. Another solution is to build multiple copies of this stage into the pipeline. July – Oct 2021
  • 67. CSC 457 Lecture Notes 67  Pipeline Processor Concepts Improving the Throughput of an Instruction Pipeline 3. The issuing problem o If an instruction is available, but cannot be executed for some reason, a hazard exists for that instruction. These hazards create issuing problems; they prevent issuing an instruction for execution. Three types of hazard are discussed here. They are called structural hazard, data hazard, and control hazard. o A structural hazard refers to a situation in which a required resource is not available (or is busy) for executing an instruction. o A data hazard refers to a situation in which there exists a data dependency (operand conflict) with a prior instruction. o A control hazard refers to a situation in which an instruction, such as branch, causes a change in the program flow. Each of these hazards is explained next. July – Oct 2021
  • 68. CSC 457 Lecture Notes 68  Pipeline Processor Concepts 1. Structural Hazard o A structural hazard occurs as a result of resource conflicts between instructions. One type of structural hazard that may occur is due to the design of execution units. If an execution unit that requires more than one clock cycle (such as multiply) is not fully pipelined or is not replicated, then a sequence of instructions that uses the unit cannot be issued in consecutive clock cycles (one per clock cycle) for execution. Replicating and/or pipelining execution units increases the number of instructions that can be issued simultaneously. o Another type of structural hazard that may occur is due to the design of register files. If a register file does not have multiple write (read) ports, multiple writes (reads) to (from) registers cannot be performed simultaneously. For example, under certain situations the instruction pipeline might want to perform two register writes in a clock cycle. This may not be possible when the register file has only one write port. The effect of a structural hazard can be reduced fairly simply by implementing multiple execution units and using register files with multiple input/output ports July – Oct 2021
  • 69. CSC 457 Lecture Notes 69  Pipeline Processor Concepts 2. Data Hazard o In a nonpipelined processor, the instructions are executed one by one, and the execution of an instruction is completed before the next instruction is started. In this way, the instructions are executed in the same order as the program. However, this may not be true in a pipelined processor, where instruction executions are overlapped. An instruction may be started and completed before the previous instruction is completed. The data hazard, which is also referred to as the data dependency problem, comes about as a result of overlapping (or changing the order of) the execution of data-dependent instructions. o The delaying of execution can be accomplished in two ways. One way is to delay the OF or IF stages of i2 for two clock cycles. To insert a delay, an extra hardware component called a pipeline interlock can be added to the pipeline. A pipeline interlock detects the dependency and delays the dependent instructions until the conflict is resolved. Another way is to let the compiler solve the dependency problem. During compilation, the compiler detects the dependency between data and instructions. It then rearranges these instructions so that the dependency is not hazardous to the system. If it is not possible to rearrange the instructions, NOP (no operation) instructions are inserted to create delays. July – Oct 2021
  • 70. CSC 457 Lecture Notes 70  Pipeline Processor Concepts o There are three primary types of data hazards: RAW (read after write), WAR (write after read), and WAW (write after write). The hazard names denote the execution ordering of the instructions that must be maintained to produce a valid result; otherwise, an invalid result might occur. o RAW: it refers to the situation in which i2 reads a data source before i1 writes to it. This may produce an invalid result since the read must be performed after the write in order to obtain a valid result. o WAR: This refers to the situation in which i2 writes to a location before i1 reads it. o WAW: This refers to the situation in which i2 writes to a location before i1 writes to it. o Note that the WAR and WAW types of hazards cannot happen when the order of completion of instructions execution in the program is preserved. However, one way to enhance the architecture of an instruction pipeline is to increase concurrent execution of the instructions by dispatching several independent instructions to different functional units, such as adders/subtractors, multipliers, and dividers. That is, the instructions can be executed out of order, and so their execution may be completed out of order too. July – Oct 2021
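A minimal sketch (not from the notes) of how the three dependency types above can be detected between two instructions, given only their destination and source registers; the struct layout and the register numbers in main() are illustrative.

    /* hazards.c - classifies RAW, WAR and WAW hazards between two
       instructions i1 (earlier) and i2 (later). */
    #include <stdio.h>

    struct insn {
        int dest;           /* register written by the instruction */
        int src1, src2;     /* registers read by the instruction   */
    };

    static void classify(struct insn i1, struct insn i2) {
        if (i2.src1 == i1.dest || i2.src2 == i1.dest)
            printf("RAW hazard on r%d\n", i1.dest);   /* i2 reads what i1 writes  */
        if (i2.dest == i1.src1 || i2.dest == i1.src2)
            printf("WAR hazard on r%d\n", i2.dest);   /* i2 writes what i1 reads  */
        if (i2.dest == i1.dest)
            printf("WAW hazard on r%d\n", i2.dest);   /* i2 writes what i1 writes */
    }

    int main(void) {
        struct insn i1 = { 3, 1, 2 };   /* i1: r3 <- r1 op r2 */
        struct insn i2 = { 4, 3, 5 };   /* i2: r4 <- r3 op r5, so RAW on r3 */
        classify(i1, i2);
        return 0;
    }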
  • 71. CSC 457 Lecture Notes 71  Pipeline Processor Concepts o The dependencies between instructions are checked statically by the compiler and/or dynamically by the hardware at run time. This preserves the execution order for dependent instructions, which ensures valid results. o In general, dynamic dependency checking has the advantage of being able to determine dependencies that are either impossible or hard to detect at compile time. However, it may not be able to exploit all the parallelism available in a loop because of the limited lookahead ability that can be supported by the hardware. o Two of the most commonly used techniques for dynamic dependency checking are called Tomasulo's method and the scoreboard method o Tomasulo's method increases concurrent execution of the instructions with minimal (or no) effort by the compiler or the programmer. o The scoreboard method: multiple functional units allow instructions to be completed out of the original program order. July – Oct 2021
  • 72. CSC 457 Lecture Notes 72  Pipeline Processor Concepts 3. Control Hazard o In any set of instructions, there is normally a need for some kind of statement that allows the flow of control to be something other than sequential. Instructions that do this are included in every programming language and are called branches. In general, about 30% of all instructions in a program are branches. o This means that branch instructions in the pipeline can reduce the throughput tremendously if not handled properly. Whenever a branch is taken, the performance of the pipeline is seriously affected. Each such branch requires a new address to be loaded into the program counter, which may invalidate all the instructions that are either already in the pipeline or prefetched in the buffer. This draining and refilling of the pipeline for each branch degrade the throughput of the pipeline to that of a sequential processor. o Note that the presence of a branch statement does not automatically cause the pipeline to drain and begin refilling. A branch not taken allows the continued sequential flow of uninterrupted instructions to the pipeline. Only when a branch is taken does the problem arise. July – Oct 2021
  • 73. CSC 457 Lecture Notes 73  Pipeline Processor Concepts 3. Control Hazard o Branch instructions can be classified into three groups: (1) unconditional branch, (2) conditional branch, and (3) loop branch o An unconditional branch always alters the sequential program flow. It sets a new target address in the program counter, rather than incrementing it by 1 to point to the next sequential instruction address, as is normally the case. o A conditional branch sets a new target address in the program counter only when a certain condition, usually based on a condition code, is satisfied. Otherwise, the program counter is incremented by 1 as usual. A conditional branch selects a path of instructions based on a certain condition. If the condition is satisfied, the path starts from the target address and is called a target path. If it is not, the path starts from the next sequential instruction and is called a sequential path. o A loop branch in a loop statement usually jumps back to the beginning of the loop and executes it either a fixed or a variable (data-dependent) number of times. July – Oct 2021
• 74. CSC 457 Lecture Notes 74  Pipeline Processor Concepts Techniques for Reducing Effect of Branching on Processor Performance o To reduce the effect of branching on processor performance, several techniques have been proposed. Some of the better known techniques are branch prediction, delayed branching, and multiple prefetching. 1. Branch Prediction. In this type of design, the outcome of a branch decision is predicted before the branch is actually executed (see the predictor sketch after this list). Therefore, based on a particular prediction, the sequential path or the target path is chosen for execution. Although the chosen path often reduces the branch penalty, it may increase the penalty in case of incorrect prediction. 2. Delayed Branching. The delayed branching scheme eliminates or significantly reduces the effect of the branch penalty. In this type of design, a certain number of instructions after the branch instruction are fetched and executed regardless of which path will be chosen for the branch. For example, a processor with a branch delay of k executes a path containing the next k sequential instructions and then either continues on the same path or starts a new path from a new target address. As often as possible, the compiler tries to fill the next k instruction slots after the branch with instructions that are independent of the branch instruction. NOP (no operation) instructions are placed in any remaining empty slots. July – Oct 2021
  • 75. CSC 457 Lecture Notes 75  Pipeline Processor Concepts Techniques for Reducing Effect of Branching on Processor Performance 3. Multiple Prefetching. In this type of design, the processor fetches both possible paths. Once the branch decision is made, the unwanted path is thrown away. By prefetching both possible paths, the fetch penalty is avoided in the case of an incorrect prediction. To fetch both paths, two buffers are employed to service the pipeline. In normal execution, the first buffer is loaded with instructions from the next sequential address of the branch instruction. If a branch occurs, the contents of the first buffer are invalidated, and the secondary buffer, which has been loaded with instructions from the target address of the branch instruction, is used as the primary buffer. This double buffering scheme ensures a constant flow of instructions and data to the pipeline and reduces the time delays caused by the draining and refilling of the pipeline. Some amount of performance degradation is unavoidable any time the pipeline is drained, however July – Oct 2021
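o To make branch prediction (technique 1 above) concrete, the following is a minimal sketch of a classic 2-bit saturating-counter predictor indexed by branch address. The table size, the indexing scheme, and the state encoding are assumptions chosen for the example, not something prescribed in these notes.

```python
# Minimal sketch of a 2-bit saturating-counter branch predictor.
# Counter values 0-1 predict "not taken", values 2-3 predict "taken".
# Table size and indexing scheme are assumptions for illustration only.

TABLE_SIZE = 16

class TwoBitPredictor:
    def __init__(self):
        self.counters = [1] * TABLE_SIZE   # start in weakly "not taken"

    def predict(self, branch_addr):
        return self.counters[branch_addr % TABLE_SIZE] >= 2

    def update(self, branch_addr, taken):
        i = branch_addr % TABLE_SIZE
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# A loop branch taken 9 times and then not taken once: after warming up,
# the predictor is right on every taken iteration and mispredicts only
# on the first iteration and at loop exit.
p = TwoBitPredictor()
outcomes = [True] * 9 + [False]
correct = 0
for taken in outcomes:
    correct += (p.predict(0x40) == taken)
    p.update(0x40, taken)
print(f"{correct}/{len(outcomes)} predictions correct")   # 8/10
```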
  • 76. CSC 457 Lecture Notes 76  Pipeline Processor Concepts Further Throughput Improvement of an Instruction Pipeline o One way to increase the throughput of an instruction pipeline is to exploit instruction-level parallelism. The common approaches to accomplish such parallelism are called superscalar, superpipeline, and very long instruction word (VLIW) July – Oct 2021
  • 77. CSC 457 Lecture Notes 77  Pipeline Processor Concepts Further Throughput Improvement of an Instruction Pipeline 1. Superscalar o The superscalar approach relies on spatial parallelism, that is, multiple operations running concurrently on separate hardware. This approach achieves the execution of multiple instructions per clock cycle by issuing several instructions to different functional units. o A superscalar processor contains one or more instruction pipelines sharing a set of functional units. It often contains functional units, such as an add unit, multiply unit, divide unit, floating-point add unit, and graphic unit. o A superscalar processor contains a control mechanism to preserve the execution order of dependent instructions for ensuring a valid result. The scoreboard method and Tomasulo's method can be used for implementing such mechanisms. o In practice, most of the processors are based on the superscalar approach and employ a scoreboard method to ensure a valid result. July – Oct 2021
• 78. CSC 457 Lecture Notes 78  Pipeline Processor Concepts Further Throughput Improvement of an Instruction Pipeline 2. Superpipeline o The superpipeline approach achieves high performance by overlapping the execution of multiple instructions on one instruction pipeline. o A superpipeline processor often has an instruction pipeline with more stages than a typical instruction pipeline design. In other words, the execution process of an instruction is broken down into even finer steps. By increasing the number of stages in the instruction pipeline, each stage has less work to do. This allows the pipeline clock rate to increase (cycle time decreases), since the clock rate depends on the delay found in the slowest stage of the pipeline. o An example of such an architecture is the MIPS R4000 processor. The R4000 subdivides instruction fetching and data cache access to create an eight-stage pipeline. July – Oct 2021
  • 79. CSC 457 Lecture Notes 79  Pipeline Processor Concepts Further Throughput Improvement of an Instruction Pipeline 3. Very Long Instruction Word (VLIW). o The very long instruction word (VLIW) approach makes extensive use of the compiler by requiring it to incorporate several small independent operations into a long instruction word. o The instruction is large enough to provide, in parallel, enough control bits over many functional units. In other words, a VLIW architecture provides many more functional units than a typical processor design, together with a compiler that finds parallelism across basic operations to keep the functional units as busy as possible. o The compiler compacts ordinary sequential codes into long instruction words that make better use of resources. During execution, the control unit issues one long instruction per cycle. The issued instruction initiates many independent operations simultaneously July – Oct 2021
  • 80. CSC 457 Lecture Notes 80  Pipeline Processor Concepts Further Throughput Improvement of an Instruction Pipeline o A comparison of the three approaches will show a few interesting differences. o For instance, the superscalar and VLIW approaches are more sensitive to resource conflicts than the superpipelined approach. o In a superscalar or VLIW processor, a resource must be duplicated to reduce the chance of conflicts, while the superpipelined design avoids any resource conflicts. July – Oct 2021
  • 81. CSC 457 Lecture Notes 81 Pipeline Datapath Design and Implementation o The work involved in an instruction can be partitioned into steps labelled IF (Instruction Fetch), ID (Instruction Decode and data fetch), EX (ALU operations or R-format execution), MEM (Memory operations), and WB (Write- Back to register file) July – Oct 2021
• 82. CSC 457 Lecture Notes 82 Pipeline Datapath Design and Implementation MIPS Instructions and Pipelining o MIPS (Microprocessor without Interlocked Pipelined Stages) is a reduced instruction set computer (RISC) instruction set architecture (ISA) o In order to implement MIPS instructions effectively on a pipeline processor, we must ensure that the instructions are the same length (simplicity favors regularity) for easy IF and ID, similar to the multicycle datapath. o We also need to have few but consistent instruction formats, to avoid deciphering variable formats during IF and ID, which would prohibitively increase pipeline segment complexity for those tasks. Thus, the register indices should be in the same place in each instruction. July – Oct 2021
  • 83. CSC 457 Lecture Notes 83 Next ….... Memory and I/O Systems July – Oct 2021
• 84. CSC 457 Lecture Notes 84 Levels of Memory o Level 1 or Registers – Small, very fast storage locations inside the CPU that hold the data and addresses the processor is operating on immediately. Commonly used registers include the accumulator, the program counter, and address registers. o Level 2 or Cache memory – A very fast memory with a short access time in which data is temporarily stored for faster access. o Level 3 or Main Memory – The memory the computer is currently working on. It is limited in size and volatile: once power is off, the data no longer stays in this memory. o Level 4 or Secondary Memory – External memory which is not as fast as main memory, but in which data stays permanently. July – Oct 2021
• 85. CSC 457 Lecture Notes 85 Cache Memory o The cache is a smaller and faster memory which stores copies of the data from frequently used main memory locations. o Cache Memory is a special very high-speed memory used to speed up and synchronize with a high-speed CPU. Cache memory is costlier than main memory or disk memory but more economical than CPU registers. o Cache memory is an extremely fast memory type that acts as a buffer between RAM and the CPU. It holds frequently requested data and instructions so that they are immediately available to the CPU when needed. July – Oct 2021
  • 86. CSC 457 Lecture Notes 86 Cache Memory o Cache memory is used to reduce the average time to access data from the Main memory. The cache is a smaller and faster memory which stores copies of the data from frequently used main memory locations. o There are various different independent caches in a CPU, which store instructions and data. July – Oct 2021
  • 87. CSC 457 Lecture Notes 87 Basic Definitions in Cache Memory o cache block - The basic unit for cache storage. May contain multiple bytes/words of data. o cache line - Same as cache block. Note that this is not the same thing as a “row” of cache. o cache set - A “row” in the cache. The number of blocks per set is determined by the layout of the cache (e.g. direct mapped, set-associative, or fully associative). o tag - A unique identifier for a group of data. Because different regions of memory may be mapped into a block, the tag is used to differentiate between them. o valid bit - A bit of information that indicates whether the data in a block is valid (1) or not (0). July – Oct 2021
  • 88. CSC 457 Lecture Notes 88 Types of Cache Memory o There are three general cache levels: o L1 cache, or primary cache, is extremely fast but relatively small, and is usually embedded in the processor chip as CPU cache. o L2 cache, or secondary cache, often has higher capacity than L1. L2 cache may be embedded on the CPU, or it can be on a separate chip or coprocessor and have a high- speed alternative system bus connecting the cache and CPU. That way it doesn't get slowed by traffic on the main system bus. July – Oct 2021
  • 89. CSC 457 Lecture Notes 89 Types of Cache Memory o Level 3 (L3) cache is specialized memory developed to improve the performance of L1 and L2. L1 or L2 can be significantly faster than L3, though L3 is usually double the speed of DRAM. With multicore processors, each core can have dedicated L1 and L2 cache, but they can share an L3 cache. If an L3 cache references an instruction, it is usually elevated to a higher level of cache. o Contrary to popular belief, implementing flash or more dynamic RAM (DRAM) on a system won't increase cache memory. This can be confusing since the terms memory caching (hard disk buffering) and cache memory are often used interchangeably. o Memory caching, using DRAM or flash to buffer disk reads, is meant to improve storage I/O by caching data that is frequently referenced in a buffer ahead of slower magnetic disk or tape. Cache memory, on the other hand, provides read buffering for the CPU July – Oct 2021
• 90. CSC 457 Lecture Notes 90 Cache Memory Performance o When the processor needs to read or write a location in main memory, it first checks for a corresponding entry in the cache. o If the processor finds that the memory location is in the cache, a cache hit has occurred and the data is read from the cache o If the processor does not find the memory location in the cache, a cache miss has occurred. For a cache miss, the cache allocates a new entry and copies in data from main memory, then the request is fulfilled from the contents of the cache. o The performance of cache memory is frequently measured in terms of a quantity called Hit ratio. Hit ratio = hit / (hit + miss) = no. of hits/total accesses o We can improve cache performance by using a larger cache block size and higher associativity, and by reducing the miss rate, the miss penalty, and the time to hit in the cache. July – Oct 2021
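o A quick back-of-the-envelope calculation makes the hit ratio and its effect on the average access time concrete. The counts and timing figures below are invented purely for illustration.

```python
# Illustrative calculation: hit ratio and average memory access time (AMAT).
# The counts and timing figures below are made-up example values.

hits, misses = 950, 50
hit_time_ns = 2            # time to access the cache
miss_penalty_ns = 100      # extra time to fetch the block from main memory

hit_ratio = hits / (hits + misses)
miss_ratio = 1 - hit_ratio
amat_ns = hit_time_ns + miss_ratio * miss_penalty_ns

print(f"Hit ratio = {hit_ratio:.2f}")   # 0.95
print(f"AMAT      = {amat_ns:.1f} ns")  # 2 + 0.05 * 100 = 7.0 ns
```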
  • 91. CSC 457 Lecture Notes 91 Architecture and data flow of a typical cache memory July – Oct 2021
• 92. CSC 457 Lecture Notes 92 Cache Memory Mapping o There are three different types of mapping used for the purpose of cache memory which are as follows: Direct mapping, Associative mapping, and Set-Associative mapping. o Direct mapped cache has each block mapped to exactly one cache memory location. Conceptually, a direct mapped cache is like rows in a table with three columns: the cache block that contains the actual data fetched and stored, a tag with all or part of the address of the data that was fetched, and a valid flag bit that indicates whether the row entry holds valid data. o Fully associative cache mapping is similar to direct mapping in structure but allows a memory block to be mapped to any cache location rather than to a prespecified cache memory location as is the case with direct mapping. o Set associative cache mapping can be viewed as a compromise between direct mapping and fully associative mapping in which each block is mapped to a subset of cache locations. It is sometimes called N-way set associative mapping, which provides for a location in main memory to be cached to any of "N" locations in the L1 cache. July – Oct 2021
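o To see how the three mapping policies constrain placement, the fragment below lists, for an assumed tiny cache of 8 blocks, which cache slots a given memory block may occupy under each scheme. The cache dimensions are arbitrary example values.

```python
# Illustrative sketch: which cache slots may memory block `block` occupy?
# A tiny cache of 8 blocks; all dimensions are arbitrary example values.

NUM_BLOCKS = 8

def placement(block, ways):
    """Cache slot indices the memory block may occupy in a cache organised
    as `ways`-way set associative (ways=1 is direct mapped)."""
    num_sets = NUM_BLOCKS // ways
    set_index = block % num_sets
    return [set_index * ways + w for w in range(ways)]

block = 13
print("direct mapped    :", placement(block, ways=1))           # one slot only
print("2-way set assoc. :", placement(block, ways=2))           # any of 2 slots
print("fully associative:", placement(block, ways=NUM_BLOCKS))  # any of 8 slots
```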
  • 93. CSC 457 Lecture Notes 93 Locality of Reference o The ability of cache memory to improve a computer's performance relies on the concept of locality of reference. o Locality describes various situations that make a system more predictable. o Cache memory takes advantage of these situations to create a pattern of memory access that it can rely upon. o There are several types of locality. Two key ones for cache are:  Temporal locality. This is when the same resources are accessed repeatedly in a short amount of time.  Spatial locality. This refers to accessing various data or resources that are near each other. July – Oct 2021
  • 94. CSC 457 Lecture Notes 94 Importance of Cache Memory o Cache memory is important because it improves the efficiency of data retrieval (improve performance). It stores program instructions and data that are used repeatedly in the operation of programs or information that the CPU is likely to need next. The computer processor can access this information more quickly from the cache than from the main memory. Fast access to these instructions increases the overall speed of the program. o Aside from its main function of improving performance, cache memory is a valuable resource for evaluating a computer's overall performance. Users can do this by looking at cache's hit-to-miss ratio. Cache hits are instances in which the system successfully retrieves data from the cache. A cache miss is when the system looks for the data in the cache, can't find it, and looks somewhere else instead. In some cases, users can improve the hit-miss ratio by adjusting the cache memory block size i.e. the size of data units stored. July – Oct 2021
• 95. CSC 457 Lecture Notes 95 Practice Questions Que-1: A computer has a 256 KByte, 4-way set associative, write back data cache with a block size of 32 Bytes. The processor sends 32-bit addresses to the cache controller. Each cache tag directory entry contains, in addition to the address tag, 2 valid bits, 1 modified bit and 1 replacement bit. The number of bits in the tag field of an address is (A) 11 (B) 14 (C) 16 (D) 27 Answer: (C) Explanation: https://www.geeksforgeeks.org/gate-gate-cs-2012-question-54/ July – Oct 2021
• 96. CSC 457 Lecture Notes 96 Practice Questions o Explanation: o A set-associative scheme is a hybrid between a fully associative cache and a direct mapped cache. It’s considered a reasonable compromise between the complex hardware needed for fully associative caches (which requires parallel searches of all slots), and the simplistic direct-mapped scheme, which may cause collisions of addresses to the same slot (similar to collisions in a hash table). • Number of blocks = Cache-Size/Block-Size = 256 KB / 32 Bytes = 2^13 • Number of Sets = 2^13 / 4 = 2^11 o Tag + Set index + Byte offset = 32 o Tag + 11 + 5 = 32 o Tag = 16 July – Oct 2021
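o The same arithmetic can be checked in a few lines (values taken from the question above).

```python
# Check of the Que-1 arithmetic: 256 KB, 4-way set associative, 32 B blocks.
from math import log2

cache_size   = 256 * 1024   # bytes
block_size   = 32           # bytes
ways         = 4
address_bits = 32

num_blocks  = cache_size // block_size        # 2^13 = 8192
num_sets    = num_blocks // ways              # 2^11 = 2048
offset_bits = int(log2(block_size))           # 5
set_bits    = int(log2(num_sets))             # 11
tag_bits    = address_bits - set_bits - offset_bits
print(tag_bits)                               # 16
```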
  • 97. CSC 457 Lecture Notes 97 Practice Questions Que-2: Consider the data given in previous question. The size of the cache tag directory is (A) 160 Kbits (B) 136 bits (C) 40 Kbits (D) 32 bits Answer: (A) Explanation: https://www.geeksforgeeks.org/gate-gate-cs-2012-question-55/ July – Oct 2021
• 98. CSC 457 Lecture Notes 98 Practice Questions Explanation: Each tag directory entry holds: 16 tag bits + 2 valid bits + 1 modified bit + 1 replacement bit = 20 bits. Size of tag directory = 20 bits × no. of blocks = 20 × 2^13 = 160 Kbits. July – Oct 2021
• 99. CSC 457 Lecture Notes 99 Practice Questions Que-3: An 8KB direct-mapped write-back cache is organized as multiple blocks, each of size 32 bytes. The processor generates 32-bit addresses. The cache controller maintains the tag information for each cache block comprising the following: 1 valid bit; 1 modified bit; and as many bits as the minimum needed to identify the memory block mapped in the cache. What is the total size of memory needed at the cache controller to store meta-data (tags) for the cache? (A) 4864 bits (B) 6144 bits (C) 6656 bits (D) 5376 bits Answer: (D) Explanation: https://www.geeksforgeeks.org/gate-gate-cs-2011-question-43/ July – Oct 2021
• 100. CSC 457 Lecture Notes 100 Practice Questions Explanation o Cache size = 8 KB o Block size = 32 bytes, so block offset = 5 bits o Number of cache lines = Cache size / Block size = (8 × 1024 bytes)/32 = 256, so line index = 8 bits o Tag bits = 32 − 8 − 5 = 19 o Total bits required to store meta-data of 1 line = 1 + 1 + 19 = 21 bits o Total memory required = 21 × 256 = 5376 bits July – Oct 2021
  • 101. CSC 457 Lecture Notes 101 Locating Data in the Cache o Given an address, we can determine whether the data at that memory location is in the cache. To do so, we use the following procedure: 1. Use the set index to determine which cache set the address should reside in. 2. For each block in the corresponding cache set, compare the tag associated with that block to the tag from the memory address. If there is a match, proceed to the next step. Otherwise, the data is not in the cache. 3. For the block where the data was found, look at valid bit. If it is 1, the data is in the cache, otherwise it is not. July – Oct 2021
• 102. CSC 457 Lecture Notes 102 Locating Data in the Cache o If the data at that address is in the cache, then we use the block offset from that address to find the data within the cache block where the data was found. All of the information needed to locate the data in the cache is given in the address; Fig. 1 below shows which parts of the address are used for locating data in the cache. Fig. 1: | tag (t bits) | set index (s bits) | block offset (b bits) | o The least significant bits are used to determine the block offset. If the block size is B then b = log2 B bits will be needed in the address to specify the block offset. The next highest group of bits is the set index and is used to determine which cache set we will look at. o If S is the number of sets in our cache, then the set index has s = log2 S bits. Note that in a fully-associative cache, there is only 1 set so the set index will not exist. The remaining bits are used for the tag. If ℓ is the length of the address (in bits), then the number of tag bits is t = ℓ − b − s. July – Oct 2021
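o A small helper that splits an address using exactly these b, s and t fields might look like the sketch below; the block size and set count are example values (matching Que-1 above), not requirements.

```python
# Illustrative decomposition of an address into tag / set index / block offset.
from math import log2

def split_address(addr, block_size, num_sets):
    b = int(log2(block_size))            # block offset bits
    s = int(log2(num_sets))              # set index bits
    offset    = addr & (block_size - 1)
    set_index = (addr >> b) & (num_sets - 1)
    tag       = addr >> (b + s)          # the remaining t = l - b - s bits
    return tag, set_index, offset

# Example values: 32-byte blocks and 2048 sets, as in Que-1 above.
tag, set_index, offset = split_address(0x1234ABCD, block_size=32, num_sets=2048)
print(hex(tag), set_index, offset)       # 0x1234 1374 13
```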
  • 103. CSC 457 Lecture Notes 103 Cache Addressing o (Read Tutorial on “Hardware Organization and Design” – 15 pages) July – Oct 2021
  • 104. CSC 457 Lecture Notes 104 Multilevel Cache Organisation o Multilevel Caches is one of the techniques to improve Cache Performance by reducing the “MISS PENALTY”. Miss Penalty refers to the extra time required to bring the data into cache from the Main memory whenever there is a “miss” in the cache. o For clear understanding let us consider an example where the CPU requires 10 Memory References for accessing the desired information and consider this scenario in the following 3 cases of System design : Case 1 : System Design without Cache Memory o Here the CPU directly communicates with the main memory and no caches are involved. In this case, the CPU needs to access the main memory 10 times to access the desired information. July – Oct 2021
• 105. CSC 457 Lecture Notes 105 Multilevel Cache Organisation Case 2 : System Design with Cache Memory o Here the CPU at first checks whether the desired data is present in the Cache Memory or not i.e. whether there is a “hit” in the cache or a “miss” in the cache. o Suppose 3 of the 10 references miss in the cache; then the main memory will be accessed only 3 times. o We can see that here the miss penalty is reduced because the main memory is accessed fewer times than in the previous case. July – Oct 2021
• 106. CSC 457 Lecture Notes 106 Multilevel Cache Organisation Case 3 : System Design with Multilevel Cache Memory o Here the cache performance is optimized further by introducing multilevel caches; we consider a 2-level cache design. o Suppose 3 of the 10 references miss in the L1 cache, and out of these 3 misses, 2 also miss in the L2 cache; then the main memory will be accessed only 2 times. o It is clear that here the miss penalty is reduced considerably compared with the previous case, thereby improving the performance of the cache memory. July – Oct 2021
• 107. CSC 457 Lecture Notes 107 Multilevel Cache Organisation o We can observe from the above 3 cases that we are trying to decrease the number of main memory references and thus the miss penalty, in order to improve the overall system performance. Also, it is important to note that in the multilevel cache design, the L1 cache is attached to the CPU; it is small in size but fast. The L2 cache, in turn, is attached to the L1 cache; it is larger and slower than L1 but still faster than the main memory. o Effective Access Time = Hit rate * Cache access time + Miss rate * Lower level access time o Average access time for a multilevel cache (Tavg): Tavg = H1 * C1 + (1 – H1) * (H2 * C2 + (1 – H2) * M) where H1 is the hit rate in the L1 cache; H2 is the hit rate in the L2 cache; C1 is the time to access information in the L1 cache; C2 is the miss penalty to transfer information from the L2 cache to the L1 cache; and M is the miss penalty to transfer information from the main memory to the L2 cache. July – Oct 2021
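o Plugging assumed numbers into these two formulas shows how the second cache level shrinks the average access time. All hit rates and latencies below are invented example values.

```python
# Illustrative use of the multilevel-cache average access time formula.
# All hit rates and latencies below are invented example values.

H1, H2 = 0.90, 0.80   # hit rates in L1 and L2
C1, C2 = 1, 10        # L1 access time and L2 -> L1 miss penalty (cycles)
M = 100               # main memory -> L2 miss penalty (cycles)

t_single = H1 * C1 + (1 - H1) * M                       # L1 only
t_multi  = H1 * C1 + (1 - H1) * (H2 * C2 + (1 - H2) * M)

print(f"L1 only : {t_single:.1f} cycles")  # 0.9 + 0.1 * 100      = 10.9
print(f"L1 + L2 : {t_multi:.1f} cycles")   # 0.9 + 0.1 * (8 + 20) = 3.7
```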
• 108. CSC 457 Lecture Notes 108 Multilevel Cache Organisation Exercise o Que 1 - Find the average memory access time for a processor with a 2 ns clock cycle time, a miss rate of 0.04 misses per instruction, a miss penalty of 25 clock cycles, and a cache access time (including hit detection) of 1 clock cycle. Also, assume that the read and write miss penalties are the same and ignore other write stalls. Solution: Average Memory Access Time (AMAT) = Hit Time + Miss Rate * Miss Penalty. Hit Time = 1 clock cycle (given directly in the question); Miss Rate = 0.04; Miss Penalty = 25 clock cycles (the time taken by the next level of memory on a miss). So AMAT = 1 + 0.04 * 25 = 2 clock cycles. Since 1 clock cycle = 2 ns, AMAT = 4 ns. July – Oct 2021
  • 109. CSC 457 Lecture Notes 109 Virtual Memory Terminologies o Virtual Memory - A storage allocation scheme in which secondary memory can be addressed as though it were part of main memory. The addresses a program may use to reference memory are distinguished from the addresses the memory system uses to identify physical storage sites, and program-generated addresses are translated automatically to the corresponding machine addresses. The size of virtual storage is limited by the addressing scheme of the computer system and by the amount of secondary memory available and not by the actual number of main storage locations. o Virtual Address - The address assigned to a location in virtual memory to allow that location to be accessed as though it were part of main memory. o Virtual address space - The virtual storage assigned to a process. o Address space - The range of memory addresses available to a process. o Real address - The address of a storage location in main memory. July – Oct 2021
• 110. CSC 457 Lecture Notes 110 Virtual Memory o Virtual Memory is a storage allocation scheme in which secondary memory can be addressed as though it were part of main memory. o The addresses a program may use to reference memory are distinguished from the addresses the memory system uses to identify physical storage sites, and program generated addresses are translated automatically to the corresponding machine addresses. o The size of virtual storage is limited by the addressing scheme of the computer system and by the amount of secondary memory available, not by the actual number of main storage locations. o It is a technique that is implemented using both hardware and software. It maps memory addresses used by a program, called virtual addresses, into physical addresses in computer memory. July – Oct 2021
  • 111. CSC 457 Lecture Notes 111 Virtual Memory o Two characteristics fundamental to memory management: 1) all memory references are logical addresses that are dynamically translated into physical addresses at run time 2) a process may be broken up into a number of pieces that don’t need to be contiguously located in main memory during execution o If these two characteristics are present, it is not necessary that all of the pages or segments of a process be in main memory during execution. This means that the required pages need to be loaded into memory whenever required. Virtual memory is implemented using Demand Paging or Demand Segmentation. July – Oct 2021
  • 113. CSC 457 Lecture Notes 113 Thrashing o A state in which the system spends most of its time swapping process pieces rather than executing instructions o To avoid this, the operating system tries to guess, based on recent history, which pieces are least likely to be used in the near future July – Oct 2021
  • 114. CSC 457 Lecture Notes 114 Principle of Locality o Program and data references within a process tend to cluster o Only a few pieces of a process will be needed over a short period of time o Therefore it is possible to make intelligent guesses about which pieces will be needed in the future o Avoids thrashing July – Oct 2021
  • 115. CSC 457 Lecture Notes 115  Support Needed for Virtual Memory o For virtual memory to be practical and effective: 1. Hardware must support paging and segmentation 2. Operating system must include software for managing the movement of pages and/or segments between secondary memory and main memory July – Oct 2021
  • 116. CSC 457 Lecture Notes 116  Paging o The term virtual memory is usually associated with systems that employ paging o Use of paging to achieve virtual memory was first reported for the Atlas computer o Each process has its own page table and each page table entry contains the frame number of the corresponding page in main memory July – Oct 2021
  • 117. CSC 457 Lecture Notes 117  Paging o Paging is a memory management scheme that eliminates the need for contiguous allocation of physical memory. This scheme permits the physical address space of a process to be non – contiguous • Logical Address or Virtual Address (represented in bits): An address generated by the CPU • Logical Address Space or Virtual Address Space( represented in words or bytes): The set of all logical addresses generated by a program • Physical Address (represented in bits): An address actually available on memory unit • Physical Address Space (represented in words or bytes): The set of all physical addresses corresponding to the logical addresses July – Oct 2021
• 118. CSC 457 Lecture Notes 118  Paging o Example: • If Logical Address = 31 bits, then Logical Address Space = 2^31 words = 2 G words (1 G = 2^30) • If Logical Address Space = 128 M words = 2^7 * 2^20 words, then Logical Address = log2 2^27 = 27 bits • If Physical Address = 22 bits, then Physical Address Space = 2^22 words = 4 M words (1 M = 2^20) • If Physical Address Space = 16 M words = 2^4 * 2^20 words, then Physical Address = log2 2^24 = 24 bits July – Oct 2021
  • 119. CSC 457 Lecture Notes 119  Paging o The mapping from virtual to physical address is done by the memory management unit (MMU) which is a hardware device and this mapping is known as paging technique. o The Physical Address Space is conceptually divided into a number of fixed-size blocks, called frames. o The Logical address Space is also split into fixed-size blocks, called pages. o Page Size = Frame Size July – Oct 2021
  • 120. CSC 457 Lecture Notes 120  Paging • Let us consider an example: • Physical Address = 12 bits, then Physical Address Space = 4 K words • Logical Address = 13 bits, then Logical Address Space = 8 K words • Page size = frame size = 1 K words (assumption) July – Oct 2021
  • 121. CSC 457 Lecture Notes 121  Paging o Address generated by CPU is divided into • Page number(p): Number of bits required to represent the pages in Logical Address Space or Page number • Page offset(d): Number of bits required to represent particular word in a page or page size of Logical Address Space or word number of a page or page offset. o Physical Address is divided into • Frame number(f): Number of bits required to represent the frame of Physical Address Space or Frame number. • Frame offset(d): Number of bits required to represent particular word in a frame or frame size of Physical Address Space or word number of a frame or frame offset July – Oct 2021
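o The page-number/offset split and the frame lookup can be captured in a tiny sketch. The 1 K-word page size matches the example above; the page-table contents are invented for illustration.

```python
# Illustrative paging translation with 1 K-word pages (as in the example above).
# The page table contents are invented for this sketch.

PAGE_SIZE = 1024                      # words per page -> 10 offset bits
page_table = {0: 5, 1: 2, 2: 7}       # page number p -> frame number f (assumed)

def translate(logical_addr):
    p = logical_addr // PAGE_SIZE     # page number
    d = logical_addr % PAGE_SIZE      # page offset
    f = page_table[p]                 # a missing entry would mean a page fault
    return f * PAGE_SIZE + d          # physical address

print(translate(1500))                # page 1, offset 476 -> frame 2 -> 2524
```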
• 122. CSC 457 Lecture Notes 122  Paging o The hardware implementation of the page table can be done by using dedicated registers, but the usage of registers for the page table is satisfactory only if the page table is small. o If the page table contains a large number of entries, then we can use a TLB (translation look-aside buffer), a special, small, fast look-up hardware cache. o The TLB is an associative, high-speed memory. o Each entry in the TLB consists of two parts: a tag and a value. o When this memory is used, an item is compared with all tags simultaneously; if the item is found, the corresponding value is returned. July – Oct 2021
  • 123. CSC 457 Lecture Notes 123  Paging July – Oct 2021
• 124. CSC 457 Lecture Notes 124  Paging o Main memory access time = m o If the page table is kept in main memory, o Effective access time = m (to access the page table) + m (to access the referenced word) = 2m July – Oct 2021
  • 125. CSC 457 Lecture Notes 125  Paging Sample Question o Consider a machine with 64 MB physical memory and a 32-bit virtual address space. If the page size is 4KB, what is the approximate size of the page table? (A) 16 MB (B) 8 MB (C) 2 MB (D) 24 MB Answer: (C) o Explanation: See question 1 of https://www.geeksforgeeks.org/operating-systems-set-2/ July – Oct 2021
  • 126. CSC 457 Lecture Notes 126  Paging Explanation: A page entry is used to get address of physical memory. Here we assume that single level of Paging is happening. So the resulting page table will contain entries for all the pages of the Virtual address space. Number of entries in page table = (virtual address space size)/(page size) Using above formula we can say that there will be 2^(32-12) = 2^20 entries in page table. No. of bits required to address the 64MB Physical memory = 26. So there will be 2^(26-12) = 2^14 page frames in the physical memory. And page table needs to store the address of all these 2^14 page frames. Therefore, each page table entry will contain 14 bits address of the page frame and 1 bit for valid-invalid bit. Since memory is byte addressable. So we take that each page table entry is 16 bits i.e. 2 bytes long. Size of page table = (total number of page table entries) *(size of a page table entry) = (2^20 *2) = 2MB July – Oct 2021
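o The same calculation can be scripted (all values come from the question above; the 2-byte entry size follows the rounding argument in the explanation).

```python
# Check of the page-table size calculation from the question above.
virtual_bits  = 32
page_size     = 4 * 1024            # 4 KB pages -> 12 offset bits
physical_size = 64 * 1024 * 1024    # 64 MB physical memory

offset_bits = page_size.bit_length() - 1                      # 12
entries     = 2 ** (virtual_bits - offset_bits)               # 2^20
frame_bits  = (physical_size // page_size).bit_length() - 1   # 14
entry_bytes = 2        # 14 frame bits + 1 valid bit, rounded up to 2 bytes

print(entries * entry_bytes // (1024 * 1024), "MB")           # 2 MB
```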
  • 127. CSC 457 Lecture Notes 127  Demand Paging o The process of loading the page into memory on demand (whenever page fault occurs) is known as demand paging. The process includes the following steps : July – Oct 2021
• 128. CSC 457 Lecture Notes 128  Demand Paging 1. If the CPU tries to reference a page that is currently not available in the main memory, it generates an interrupt indicating a memory access fault. 2. The OS puts the interrupted process in a blocked state. For execution to proceed, the OS must bring the required page into memory. 3. The OS will locate the required page on secondary storage (the backing store). 4. The required page will be brought into a frame of physical memory; page replacement algorithms are used to decide which resident page to replace when no free frame is available. 5. The page table will be updated accordingly. 6. A signal will be sent to the CPU to continue the program execution, and the process is placed back into the ready state. o Hence, whenever a page fault occurs, these steps are followed by the operating system and the required page is brought into memory. July – Oct 2021
  • 129. CSC 457 Lecture Notes 129  Advantages of Demand Paging 1. More processes may be maintained in the main memory: Because we are going to load only some of the pages of any particular process, there is room for more processes. This leads to more efficient utilization of the processor because it is more likely that at least one of the more numerous processes will be in the ready state at any particular time. 2. A process may be larger than all of main memory: One of the most fundamental restrictions in programming is lifted. A process larger than the main memory can be executed because of demand paging. The OS itself loads pages of a process in main memory as required. 3. It allows greater multiprogramming levels by using less of the available (primary) memory for each process July – Oct 2021
  • 130. CSC 457 Lecture Notes 130  Page Fault Service Time o The time taken to service the page fault is called as page fault service time. The page fault service time includes the time taken to perform all the above six steps. Let Main memory access time is: m Page fault service time is: s Page fault rate is : p Then, Effective memory access time = (p*s) + (1-p)*m July – Oct 2021
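o Even a tiny page fault rate dominates the average access time when the service time is large, as a quick calculation shows. The numbers below are assumed example values.

```python
# Effective memory access time under demand paging (assumed example values).
m = 200e-9    # main memory access time: 200 ns
s = 8e-3      # page fault service time: 8 ms (disk I/O plus OS overhead)
p = 1e-6      # page fault rate: one fault per million accesses

effective = p * s + (1 - p) * m
print(f"Effective access time = {effective * 1e9:.1f} ns")   # about 208 ns
```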
  • 131. CSC 457 Lecture Notes 131  Swapping o Swapping a process out means removing all of its pages from memory, or marking them so that they will be removed by the normal page replacement process. o Suspending a process ensures that it is not runnable while it is swapped out. At some later time, the system swaps back the process from the secondary storage to main memory. o When a process is busy swapping pages in and out then this situation is called thrashing July – Oct 2021
  • 132. CSC 457 Lecture Notes 132  Inverted Page Table o Page number portion of a virtual address is mapped into a hash value  hash value points to inverted page table o Fixed proportion of real memory is required for the tables regardless of the number of processes or virtual pages supported o Structure is called inverted because it indexes page table entries by frame number rather than by virtual page number July – Oct 2021
• 133. CSC 457 Lecture Notes 133  Inverted Page Table o Each entry in the page table includes:
 - Page Number – the virtual page number
 - Process Identifier – the process that owns this page
 - Control Bits – includes flags and protection and locking info
 - Chain Pointer – the index value of the next entry in the chain
July – Oct 2021
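o A minimal sketch of the hashed, chained lookup into an inverted table follows. The hash function, the table size and the contents are assumptions made for this illustration only.

```python
# Illustrative inverted page table: one entry per physical frame, looked up
# by hashing the (process id, virtual page number) pair and walking a chain.

NUM_FRAMES = 8

inverted = [None] * NUM_FRAMES   # frame -> {page, pid, ctrl, chain}
hash_anchor = {}                 # hash value -> first frame in its chain

def h(pid, vpage):
    return (pid * 31 + vpage) % NUM_FRAMES   # placeholder hash function

def insert(pid, vpage, frame):
    key = h(pid, vpage)
    inverted[frame] = {"page": vpage, "pid": pid, "ctrl": 0,
                       "chain": hash_anchor.get(key)}
    hash_anchor[key] = frame

def lookup(pid, vpage):
    frame = hash_anchor.get(h(pid, vpage))
    while frame is not None:                 # walk the collision chain
        entry = inverted[frame]
        if entry["pid"] == pid and entry["page"] == vpage:
            return frame                     # physical frame number
        frame = entry["chain"]
    return None                              # not resident -> page fault

insert(pid=1, vpage=42, frame=3)
insert(pid=2, vpage=42, frame=5)
print(lookup(1, 42), lookup(2, 42), lookup(1, 99))   # 3 5 None
```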
  • 134. CSC 457 Lecture Notes 134  Translation Lookaside Buffer (TLB) o Each virtual memory reference can cause two physical memory accesses:  one to fetch the page table entry  one to fetch the data o To overcome the effect of doubling the memory access time, most virtual memory schemes make use of a special high-speed cache called a translation lookaside buffer (TLB) July – Oct 2021
  • 135. CSC 457 Lecture Notes 135  Translation Lookaside Buffer (TLB) o The TLB only contains some of the page table entries so we cannot simply index into the TLB based on page number  each TLB entry must include the page number as well as the complete page table entry (associative mapping) o The processor is equipped with hardware that allows it to interrogate simultaneously a number of TLB entries to determine if there is a match on page number July – Oct 2021
  • 136. CSC 457 Lecture Notes 136  Translation Lookaside Buffer (TLB) o Some TLBs store address-space identifiers (ASIDs) in each TLB entry – – uniquely identifies each process – provide address-space protection for that process – Otherwise need to flush at every context switch o TLBs typically small (64 to 1,024 entries) o On a TLB miss, value is loaded into the TLB for faster access next time – Replacement policies must be considered – Some entries can be wired down for permanent fast access July – Oct 2021
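o A toy fully-associative TLB keyed by (ASID, virtual page number) illustrates both the associative lookup and the role of the ASID. The capacity, the FIFO replacement policy and the contents are assumptions for this sketch.

```python
# Toy fully-associative TLB keyed by (ASID, virtual page number).
# Capacity, FIFO replacement and contents are assumptions for this sketch.
from collections import OrderedDict

class TLB:
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = OrderedDict()             # (asid, vpage) -> frame

    def lookup(self, asid, vpage):
        return self.entries.get((asid, vpage))   # None signals a TLB miss

    def refill(self, asid, vpage, frame):
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)     # evict the oldest entry
        self.entries[(asid, vpage)] = frame

tlb = TLB(capacity=2)
tlb.refill(asid=1, vpage=0x10, frame=7)
print(tlb.lookup(1, 0x10))   # 7    (hit)
print(tlb.lookup(2, 0x10))   # None (different address space -> miss)
```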
• 137. CSC 457 Lecture Notes 137  Improving Efficiency of Virtual Address Translation o The next step towards improving the efficiency of virtual address translation is the memory management unit (MMU), introduced into modern microprocessors. o The functioning of the memory management unit is based on the use of address translation buffers and other registers, in which current pointers to all tables used in virtual to physical address translation are stored July – Oct 2021
• 138. CSC 457 Lecture Notes 138  Improving Efficiency of Virtual Address Translation o The MMU checks if the requested page descriptor is in the TLB. If so, the MMU generates the physical address for the main memory. o If the descriptor is missing in the TLB, then the MMU brings the descriptor from the main memory and updates the TLB. o Next, depending on the presence of the page in the main memory, the MMU performs address translation or launches the transmission of the page to the main memory from the auxiliary store. July – Oct 2021
• 139. CSC 457 Lecture Notes 139  Segmentation o A process is divided into segments. The chunks that a program is divided into, which are not necessarily all of the same size, are called segments. Segmentation gives the user’s view of the process, which paging does not. Here the user’s view is mapped to physical memory. o There are two types of segmentation: 1. Virtual memory segmentation – Each process is divided into a number of segments, not all of which are resident at any one point in time. 2. Simple segmentation – Each process is divided into a number of segments, all of which are loaded into memory at run time, though not necessarily contiguously. July – Oct 2021
• 140. CSC 457 Lecture Notes 140  Segmentation o There is no simple relationship between logical addresses and physical addresses in segmentation. A table stores the information about all such segments and is called the Segment Table. o Segment Table – It maps the two-dimensional Logical address into a one-dimensional Physical address. Each table entry has: o Base Address: It contains the starting physical address where the segment resides in memory. o Limit: It specifies the length of the segment. July – Oct 2021
  • 141. CSC 457 Lecture Notes 141  Segmentation July – Oct 2021
  • 142. CSC 457 Lecture Notes 142  Segmentation o Translation of Two dimensional Logical Address to one dimensional Physical Address o Address generated by the CPU is divided into:  Segment number (s): Number of bits required to represent the segment.  Segment offset (d): Number of bits required to represent the size of the segment. July – Oct 2021
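o The two-dimensional to one-dimensional translation reduces to a bounds check plus an addition, as the sketch below shows; the segment table contents are invented example values.

```python
# Illustrative segment-table translation: (segment number, offset) -> physical.
# The segment table contents are invented example values.

segment_table = {
    0: {"base": 1000, "limit": 400},
    1: {"base": 5000, "limit": 1200},
}

def translate(segment, offset):
    entry = segment_table[segment]
    if offset >= entry["limit"]:
        raise MemoryError("segmentation fault: offset exceeds segment limit")
    return entry["base"] + offset

print(translate(1, 100))     # 5100
# translate(0, 500) would raise: offset 500 >= limit 400
```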
  • 143. CSC 457 Lecture Notes 143  Segmentation Advantages of Segmentation 1. No Internal fragmentation 2. Segment table consumes less space in comparison to page table in paging Disadvantage of Segmentation 1. As processes are loaded and removed from the memory, the free memory space is broken into little pieces, causing external fragmentation July – Oct 2021
  • 144. CSC 457 Lecture Notes 144 Next ….... Shared Memory Multiprocessors July – Oct 2021
• 145. CSC 457 Lecture Notes 145  Shared Memory Multiprocessors o A system with multiple CPUs “sharing” the same main memory is called a multiprocessor. o In a multiprocessor system all processes on the various CPUs share a unique logical address space, which is mapped on a physical memory that can be distributed among the processors. o Each process can read and write a data item simply using load and store operations, and process communication is through shared memory. o It is the hardware that makes all CPUs access and use the same main memory. o This architectural model is simple and easy to program; it can be applied to a wide variety of problems that can be modeled as a set of tasks, to be executed in parallel (at least partially) July – Oct 2021
  • 146. CSC 457 Lecture Notes 146  Shared Memory Multiprocessors o Since all CPUs share the address space, only a single instance of the operating system is required. o When a process terminates or goes into a wait state for whichever reason, the OS can look in the process table (more precisely, in the ready processes queue) for another process to be dispatched to the idle CPU. o On the contrary, in systems with no shared memory, each CPU must have its own copy of the operating system, and processes can only communicate through message passing. o The basic issue in shared memory multiprocessor systems is memory itself, since the larger the number of processors involved, the more difficult to work on memory efficiently. July – Oct 2021
  • 147. CSC 457 Lecture Notes 147  Shared Memory Multiprocessors o All modern OS (Windows, Solaris, Linux, MacOS) support symmetric multiprocessing, (SMP), with a scheduler running on every processor (a simplified description, of course). o “ready to run” processes can be inserted into a single queue, that can be accessed by every scheduler, alternatively there can be a “ready to run” queue for each processor. o When a scheduler is activated in a processor, it chooses one of the “ready to run” processes and dispatches it on its processor (with a single queue, things are somewhat more difficult, can you guess why?) July – Oct 2021
  • 148. CSC 457 Lecture Notes 148  Shared Memory Multiprocessors o A distinct feature in multiprocessor systems is load balancing. o It is useless having many CPUs in a system, if processes are not distributed evenly among the cores. o With a single “ready-to-run” queue, load balancing is usually automatic: if a processor is idle, its scheduler will pick a process from the shared queue and will start it on that processor. o Modern OSs designed for SMP often have a separate queue for each processor (to avoid the problems associated with a single queue). o There is an explicit mechanism for load balancing, by which a process on the wait list of an overloaded processor is moved to the queue of another, less loaded processor.  As an example, SMP Linux activates its load balancing scheme every 200 ms, and whenever a processor queue empties. July – Oct 2021
• 149. CSC 457 Lecture Notes 149  Shared Memory Multiprocessors o Migrating a process to a different processor can be costly when each core has a private cache (can you guess why?). o This is why some OSs, such as Linux, offer a system call to specify that a process is tied to a particular processor (processor affinity), independently of the processors' load. o There are three classes of multiprocessors, according to the way each CPU sees main memory: - Uniform Memory Access (UMA), - Non Uniform Memory Access (NUMA) - Cache Only Memory Access (COMA) July – Oct 2021
• 150. CSC 457 Lecture Notes 150  Shared Memory Multiprocessors 1. Uniform Memory Access (UMA): o The name of this type of architecture hints at the fact that all processors share a unique centralized primary memory, so each CPU has the same memory access time. o Owing to this architecture, these systems are also called Symmetric Shared-memory Multiprocessors (SMP) o The simplest multiprocessor system has a single bus to which at least two CPUs and a memory (shared among all processors) are connected. o When a CPU wants to access a memory location, it checks if the bus is free, then it sends the request to the memory interface module and waits for the requested data to be available on the bus. July – Oct 2021
• 151. CSC 457 Lecture Notes 151  Shared Memory Multiprocessors 1. Uniform Memory Access (UMA): o Multicore processors are small UMA multiprocessor systems, where the first shared cache (L2 or L3) is actually the communication channel. o Shared memory can quickly become a bottleneck for system performance, since all processors must synchronize on the single bus and memory access. o Larger multiprocessor systems (>32 CPUs) cannot use a single bus to interconnect CPUs to memory modules, because bus contention becomes unmanageable. o The CPU–memory interconnection is instead realized through an interconnection network (in jargon, a “fabric”). o Caches local to each CPU alleviate the problem; furthermore, each processor can be equipped with a private memory to store data of computations that need not be shared by other processors. Traffic to/from shared memory can thus reduce considerably July – Oct 2021
  • 152. CSC 457 Lecture Notes 152  Shared Memory Multiprocessors 2. Non Uniform Memory Access (NUMA): o Single bus UMA systems are limited in the number of processors, and costly hardware is necessary to connect more processors. Current technology prevents building UMA systems with more than 256 processors. o To build larger processors, a compromise is mandatory: not all memory blocks can have the same access time with respect to each CPU. o This is the origin of the name NUMA systems: Non Uniform Memory Access. July – Oct 2021
  • 153. CSC 457 Lecture Notes 153  Shared Memory Multiprocessors 2. Non Uniform Memory Access (NUMA): o These systems have a shared logical address space, but physical memory is distributed among CPUs, so that access time to data depends on data position, in local or in a remote memory (thus the NUMA denomination). These systems are also called Distributed Shared Memory (DSM) architectures July – Oct 2021
  • 154. CSC 457 Lecture Notes 154  Shared Memory Multiprocessors 2. Non Uniform Memory Access (NUMA): o Since all NUMA systems have a single logical address space shared by all CPUs, while physical memory is distributed among processors, there are two types of memories: local and remote memory. o Yet, even remote memory is accessed by each CPU with LOAD and STORE instructions. o There are two types of NUMA systems: • Non-Caching NUMA (NC-NUMA) • Cache-Coherent NUMA (CC-NUMA) July – Oct 2021
• 155. CSC 457 Lecture Notes 155  Shared Memory Multiprocessors Non Caching -NUMA o In a NC-NUMA system, processors have no local cache. o Each memory access is managed with a modified MMU, which controls if the request is for a local or for a remote block; in the latter case, the request is forwarded to the node containing the requested data. o Obviously, programs using remote data (with respect to the CPU requesting them) will run much slower than they would if the data were stored in the local memory o In NC-NUMA systems there is no cache coherency problem, because there is no caching at all: each memory item is in a single location. o Remote memory access is however very inefficient. For this reason, NC-NUMA systems can resort to special software that relocates memory pages from one block to another, just to maximize performance. o A page scanner daemon activates every few seconds, examines statistics on memory usage, and moves pages from one block to another, to increase performance. July – Oct 2021
• 156. CSC 457 Lecture Notes 156  Shared Memory Multiprocessors Non Caching -NUMA o Actually, in NC-NUMA systems, each processor can also have a private memory and a cache, and only private data (data allocated in the private local memory) can be in the cache. o This solution increases the performance of each processor, and is adopted in the Cray T3D/E. o Yet, remote data access time remains very high: 400 processor clock cycles in the Cray T3D/E, against 2 for retrieving data from the local cache. July – Oct 2021