MEMORY SYNTHESIS USING AI METHODS
Gabriel Mateescu
August 18, 1993
Research Project Report
Universität Dortmund
European Economic Community Individual Fellowship
Contract Number: CIPA-3510-CT-925978
Contents
1 RESEARCH PROJECT GOALS
2 INTELLIGENT DESIGN ASSISTANT FOR MEMORY SYNTHESIS
2.1 High Level Organization of IDAMS
2.2 Knowledge Acquisition
3 PERFORMANCE AND COST
3.1 Performance Measurement
3.2 Improving Performance: Amdahl's Law
3.3 CPU Performance
4 COMPUTER ARCHITECTURE OVERVIEW
4.1 An Architecture Classification
4.2 Multiprocessors
4.3 Multiprocessing performance
4.4 Interprocess communication and synchronization
4.5 Coherence, Consistency, and Event Ordering
5 MEMORY HIERARCHY DESIGN
5.1 General Principles of Memory Hierarchy
5.2 Performance Impact of Memory Hierarchy
5.3 Aspects that Classify a Memory Hierarchy
5.4 Cache Organization
5.5 Line Placement and Identification
5.6 Line Replacement
5.7 Write Strategy
5.8 The Sources of Cache Misses
5.9 Line Size Impact on Average Memory-access Time
5.10 Operating System and Task Switch Impact on Miss Rate
5.11 An Example Cache
5.12 Multiprocessor Caches
5.13 The Cache-Coherence Problem
5.13.1 Cache Coherence for I/O
5.13.2 Cache-Coherence for Shared-Memory Multiprocessors
5.14 Cache Flushing
6 IMPROVING CACHE PERFORMANCE
6.1 Cache Organization and CPU Performance
6.2 Reducing Read Hit Time
6.3 Reducing Read Miss Penalty
6.4 Reducing Conflict Misses in a Direct-Mapped Cache
6.4.1 Victim Cache
6.4.2 Column-Associative Cache
6.5 Reducing Read Miss Rate
6.6 Reducing Write Hit Time
6.6.1 Pipelined Writes
6.6.2 Subblock Placement
6.7 Reducing Write Stalls
6.8 Two-level Caches
6.8.1 Reducing Miss Penalty
6.8.2 Second-level Cache Design
6.9 Increasing Main Memory Bandwidth
6.9.1 Wider Main Memory
6.9.2 Interleaved Memory
7 SYNCHRONIZATION PROTOCOLS
7.1 Performance Impact of Synchronization
7.2 Hardware Synchronization Primitives
7.2.1 TEST&SET(lock) and RESET(lock)
7.2.2 FETCH&ADD
7.2.3 Full/Empty bit primitive
7.3 Synchronization Methods
7.3.1 LOCK and UNLOCK operations
7.3.2 Semaphores
7.3.3 Mutual Exclusion
7.3.4 Barriers
7.4 Hot Spots in Memory
7.4.1 Combining Networks
7.4.2 Software Combining Trees
7.5 Performance Evaluation
8 SYSTEM CONSISTENCY MODELS
8.1 Event Ordering Aspects
8.2 Categorization of Shared Memory Accesses
8.3 Memory Access Labeling and Properly-Labeled Programs
8.4 Sequential Consistency Model
8.4.1 Conditions for Sequential Consistency
8.4.2 Consistency and Shared-Memory Architecture
8.4.3 Performance of Sequential Consistency
8.5 Processor Consistency Model
8.6 Weak Consistency Model
8.7 Release Consistency
8.8 Correctness of Operation and Performance Issues
9 CACHE COHERENCE PROTOCOLS
9.1 Types of Protocols
9.2 Rules enforcing Cache Coherence
9.3 Cache Invalidation Patterns
9.4 Snooping Protocols
9.4.1 Implementation Issues
9.4.2 Snooping Protocol Example
9.4.3 Improving Performance of Snooping Protocol
9.5 Directory-based Cache Coherence
9.5.1 Classification of Directory Schemes
9.5.2 Full-Map Centralized-Directory Protocol
9.5.3 Limited-Directory Protocol
9.5.4 Distributed Directory and Memory
9.6 Compiler-directed Cache Coherence Protocols
9.7 Line Size Effect on Coherence Protocol Performance
10 MEMORY SYSTEM DESIGN AS A SYNERGY
10.1 Computer design requirements
10.2 General Memory Design Rules
10.3 Dependences Between System Components
10.4 Optimizing cache design
10.4.1 Cache Size
10.4.2 Associativity
10.4.3 Line Size and Cache Fetch Algorithm
10.4.4 Line Replacement Strategy
10.4.5 Write Strategy
10.4.6 Cache Coherence Protocol
10.4.7 Design Alternatives
10.5 Design Cycle
11 CONCLUSIONS
1 RESEARCH PROJECT GOALS
This report presents the results of three months of work carried out at the Universität Dortmund within the framework of the European Economic Community individual fellowship contract CIPA-3510-CT-925978. The purpose of the work has been to provide domain knowledge for a knowledge-based memory synthesis tool that is now under development at Lehrstuhl Informatik XII, Universität Dortmund.
The increasing gap between processor and main memory speeds has led computer architects to the concept of a memory hierarchy. The gap reflects the trend that CPUs are getting faster while main memories are getting larger but slower relative to those CPUs; the rate of performance improvement of the CPU has been, and still is, higher than that of the memory. The memory hierarchy concept recognizes that smaller memories are faster, and it is based on organizing the memory in levels, each smaller, faster, and more expensive per byte than the level below. Synthesizing a memory hierarchy for a System Under Design (SUD) is a complex design task that cannot be approached separately from the design of the entire system, beginning with the architecture features and going on to the compiler technology and the application characteristics.
Two main problems arise in designing a memory hierarchy: first, the need for knowledge about the design rules and the available design choices that reflect the state of the art in the domain, and second, a way to evaluate the performance of the memory hierarchy for the system under design, taking into account aspects such as the architecture, the number of processors, the typical application programs for which the machine is targeted, the compiler technology, and the manufacturing technology (e.g., silicon technology, packaging). Solving the first problem requires extensive specific knowledge of design parameters and their relationships, and unfortunately, some relationships cannot be expressed exactly for the general case (e.g., which cache line size to choose for a given cache size). Similarly, the impact of many design decisions on performance is not quantifiable for the general case (e.g., which miss rate will occur for a given cache size). Several analytical models have been proposed for evaluating the performance impact of the design parameters. Generally, a combination of simulation and analytical methods is used to evaluate the performance of a design alternative. Analytical models for performance evaluation are limited to a given architecture or to a range of similar architectures; when a new architecture is developed, analytical models may not be available. To overcome these problems, a knowledge-based memory synthesis tool is proposed. The tool is an expert system that designs a (part of the) memory hierarchy for a specified SUD; it is currently being developed at the Universität Dortmund by Renate Beckmann, who has dubbed it the Intelligent Design Assistant for Memory Synthesis (IDAMS).
Because of the limited duration of the research stay, I have chosen to focus my attention on the design of the upper level of the memory hierarchy, that is, the cache. Cache design is tackled both for uniprocessor and for multiprocessor architectures. The domain knowledge provided in this report will be incorporated into IDAMS. Efforts have been made to cover the state of the art in cache design, but it is likely that some aspects have been overlooked. However, the flexibility of IDAMS allows future knowledge to be incorporated as it becomes available.
The organization of the report is as follows. First, the high-level organization of the design assistant for memory synthesis (IDAMS) is explained in Chapter 2. Since cost/performance is a crucial design evaluation criterion, the measurement of computer performance is described in Chapter 3. The architecture of a system affects the cache organization and the coherence protocol; architectural aspects that should be considered when designing the memory system are discussed briefly in Chapter 4. Chapter 5 presents the basic design issues for the memory hierarchy, with emphasis on caches. A great number of techniques exist for improving the performance of the basic cache design, and the most important ones are presented in Chapter 6. Synchronization is imperative for parallel programming, and the efficiency of synchronization operations has a great impact on multiprocessor performance; synchronization issues are discussed in Chapter 7. The memory-consistency model of a system has a direct effect on the complexity of the programming model, on the achievable implementation efficiency, and on the amount of overhead associated with cache coherence protocols, and thus on performance. The major memory consistency models are presented in Chapter 8, and cache-coherence protocols are analyzed in Chapter 9. Finally, everything is put together in Chapter 10, where the steps involved in cache memory design are shown, based on the knowledge incorporated in the previous chapters.
I would like to thank Renate Beckmann, assistant at the Department of Computer Science, University of Dortmund. We had many useful discussions about this project, and her ideas, comments, and suggestions helped me a great deal in clarifying several design aspects. Much credit goes to Renate, especially for her contributions to Chapter 2. I am particularly grateful to Professor Peter Marwedel, Chair of the Department of Computer Science, for giving me the opportunity to work at the University of Dortmund and for providing me with generous logistical support.
2 INTELLIGENT DESIGN ASSISTANT FOR MEMORY
SYNTHESIS
2.1 High Level Organization of IDAMS
The Intelligent Design Assistant for Memory Synthesis (IDAMS) is a knowledge-based tool for memory synthesis that configures (part of) a memory hierarchy for a specified system under design (SUD).
Its input information contains some details of the architecture of the SUD and the application domain for which the machine is designed. The architectural information will influence some of the design decisions. For example, if a cache is used in the memory hierarchy, we have to deal with the coherence problem to guarantee the consistency of shared data.
Another required piece of information is the application domain. The more that is known about the applications the SUD is designed for, the more is known about the typical characteristics of its memory accesses, and the better the memory hierarchy can be configured so that memory accesses are fast enough in the typical cases. For example, in the domain of digital signal processing (DSP) large amounts of data are processed, but the accesses exhibit a sequential pattern. This kind of application favors memory organizations in which, once a datum has been fetched, the data that follow it can be found easily.
Because of the strong dependences between the memory structure and the architectural features of the SUD, it should be possible for IDAMS to interact with its environment if additional information about the architecture is needed; if there are design alternatives on which IDAMS cannot decide, the designer (the user of IDAMS) may be asked.
IDAMS deals with the memory design alternatives by maintaining a generic model of the memory hierarchy. Every useful design possibility for the memory is expressed by parameters in this model. For example, the memory hierarchy can be expressed as consisting of a main memory of size M, with or without interleaving, and an optional cache of size C, with line size L, associativity n, and so on. To configure a memory hierarchy for a specific SUD means to adjust these parameters so that all requirements on the memory are met. There are many parameters to adjust, and some of them have a large number of possible values (e.g., the memory size). This leads to a great number of possible choices when searching for the right combination of parameter adjustments.
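As an illustration only, the following sketch shows one possible way such a generic, parameterized memory model could be represented; the structure and the parameter names (size_bytes, line_size, associativity, interleaved, and so on) are hypothetical and are not the actual IDAMS representation.

from dataclasses import dataclass
from typing import Optional

@dataclass
class CacheModel:
    size_bytes: Optional[int] = None        # C: total cache size; None = not yet decided
    line_size: Optional[int] = None         # L: bytes per line
    associativity: Optional[int] = None     # n: 1 = direct mapped
    write_policy: Optional[str] = None      # e.g. "write-through" or "write-back"

@dataclass
class MemoryHierarchyModel:
    main_memory_size: Optional[int] = None  # M: bytes of main memory
    interleaved: Optional[bool] = None      # with or without interleaving
    cache: Optional[CacheModel] = None      # None = no cache in the hierarchy

    def unresolved(self):
        """Return the names of parameters that still have to be adjusted."""
        todo = [k for k, v in vars(self).items() if v is None]
        if self.cache is not None:
            todo += ["cache." + k for k, v in vars(self.cache).items() if v is None]
        return todo

# The design task is to turn a fully "open" model into one with no unresolved parameters:
model = MemoryHierarchyModel(cache=CacheModel())
print(model.unresolved())   # every entry printed here is a design decision still to be made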
The output of IDAMS will be a model of the memory hierarchy that specifies the design
parameters and that meets the requirements imposed by the designer. This model may be
transmitted to a module generator that generates the components of the memory hierarchy
at a lower level of abstraction.
There is no complete theory about how to design a memory hierarchy. That makes it
difficult to write an algorithm for this problem. Therefore, IDAMS is organized as an
expert system. This approach has the advantage that knowledge about memory design
(domain knowledge) can be separated from the knowledge about the organization and control of the design process (the inference component). This makes it simple to extend the knowledge, a necessary feature in problems where the theory and the knowledge about the problem-solving process are incomplete. Another advantage of the expert system approach is that it supports modeling the rules of thumb (heuristics) of an expert designer.
The well-known architecture of an expert system is illustrated in Figure 1.

Figure 1: Expert System Architecture (the problem-solving component operates on the domain-specific knowledge, the problem-specific knowledge, and the intermediate states and problem solution; the user and the expert interact with the system through the interview, explanation, and knowledge-acquisition components)
An expert system contains several kinds of knowledge: The domain specific knowledge
consists of rules about the domain; for IDAMS, these are rules about how to design a
memory hierarchy. The problem specific knowledge holds the information about the actual
problem to be solved; for IDAMS, this is the information and the requirements about the
SUD for which the memory hierarchy is designed.
The intermediate states are the descriptions of partial solutions. For this system, the
partial solution is initially the generic model of the memory hierarchy. During the problem
solving step the parameters of the model are adjusted by IDAMS. The solution of the
problem is the model in which all parameters have been adjusted.
An important part of an expert system is the expert system shell, which contains the problem-solving component (also called the inference unit). The problem-solving component searches the knowledge base for a rule that is applicable to the current intermediate state. If a rule is found, it is applied to the current state and a new state is reached. If more than one rule is found, the problem solver has to select one. This can be done by several strategies: use the newest rule, use the most specific one (the one with the most specific IF-part), take the one with the highest priority (given in the rule), or select a rule at random.
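To make this control loop concrete, here is a minimal, hypothetical sketch of such a forward-chaining step with a priority-based conflict-resolution strategy; the rule format and all rule contents are illustrative assumptions, not the actual IDAMS rules.

# A rule is a (name, priority, condition, action) tuple; condition and action
# both operate on the current intermediate state (here: a dict of parameters).
RULES = [
    ("default-line-size", 1,
     lambda s: s.get("cache_size") is not None and s.get("line_size") is None,
     lambda s: s.update(line_size=32)),
    ("small-cache", 2,
     lambda s: s.get("cache_size") is None and s.get("chip_area") == "small",
     lambda s: s.update(cache_size=4 * 1024)),
]

def inference_step(state):
    """Apply one applicable rule; resolve conflicts by taking the highest priority."""
    applicable = [r for r in RULES if r[2](state)]
    if not applicable:
        return False                          # no rule fired: a fixpoint is reached
    name, _, _, action = max(applicable, key=lambda r: r[1])
    action(state)                             # reach the next intermediate state
    return True

state = {"chip_area": "small"}
while inference_step(state):
    pass
print(state)   # {'chip_area': 'small', 'cache_size': 4096, 'line_size': 32}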
The interaction between the designer and the expert system is managed by the interview
component.
The knowledge acquisition component has to insert new knowledge given by the expert into
the knowledge base. New knowledge about the memory design process can be inserted
into IDAMS through this component. This may be an editor used to write new rules into
specified files.
The explanation component gives information about the problem solving process to make
it transparent to the user and to the expert. This may display the last rule applied (how
the next state is reached) or the state before applying the last rule (why the rule has been
selected). The explanation component may be emulated by a trace mode.
From the IDAMS point of view, the user is the designer of the memory hierarchy for
a specific machine. The expert inserts the rules about memory synthesis into IDAMS.
He/she may be an expert in designing memories, or a knowledge engineer who acquires the knowledge from the literature or from a memory design expert. There is a large body of knowledge in the area of memory synthesis, and some of it addresses specific parts of the memory design process.
To keep an overview of the knowledge base and to handle changes and extensions of the knowledge, the knowledge base should be modular. This can be achieved by building IDAMS on a blackboard architecture, as shown in Figure 2.
In an expert system with a blackboard architecture there are several agents, and each agent acts as an expert for a special subtask (containing the rules needed to solve that subtask). An agenda contains the (sub)tasks still to be done. An agent that solves a subtask erases it from the agenda and possibly creates new subtasks, which are inserted there. The agents communicate through a blackboard, from which every agent can read information and onto which every agent can write information.
The knowledge can be structured in IDAMS with respect to the architecture of the SUD,
the components of the memory hierarchy, or the problem to deal with. Each module is
handled by some agents:
• architecture agents:
The architecture agents handle the knowledge about the architecture of the SUD.
These agents know the requirements that must be met by the system and the com-
ponents. Architecture agents may exist for uniprocessors, multiprocessors, etc.
• memory component agents:
These agents have knowledge about the parameters of a specific component. Memory
component agents may exist for different levels of the memory hierarchy: cache, main
memory, secondary memory. For example, the cache agent knows which parameters
of the cache to adjust and the constraints on the allowable choices.
• special domain agents:
The knowledge about the domain for which the system is designed is maintained by
these agents. They analyze the special requirements of the domain for which the
SUD is designed. Some decisions on the design of the memory hierarchy are made
here, taking into account the characteristics of the domain. Special domain agents
may exist for general-purpose processors, digital signal processing, AI machines, etc.
Figure 2: Blackboard Architecture of IDAMS (architecture agents such as a multiprocessor agent, domain agents such as a DSP agent, and memory component agents such as a cache agent form the knowledge base; they cooperate through a blackboard holding the generic memory model with its design parameters, such as the cache size, and their effects on the requirements; the system receives the architecture information and requirements, interacts with the designer, and passes the resulting memory model to a module generator)
The agents have to work together to adjust the parameters because the parameters are
interdependent and are influenced by the domain for which the machine is targeted. For
example, if the line size of a cache has to be chosen, the cache agent and the special
domain agent are needed. If the special domain agent is the DSP agent, he will perhaps
favor large line sizes, because for this domain data is often accessed sequentially — data
next to the currently accessed data is likely to be needed soon. On the other hand, the
cache agent knows that increasing the line size may have the negative effect of increasing
the average memory-access time (as shown in Section 5.9). Generally, the dependences
between the parameters of the memory hierarchy are manifold, they are hard to express
exactly, and cooperation between agents is absolutely necessary.
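The following toy sketch, written purely as an illustration of the blackboard idea and not as IDAMS code, shows how a hypothetical cache agent and DSP domain agent could negotiate the line size through a shared blackboard; all numbers and heuristics are invented.

# Toy blackboard: a shared dictionary of design parameters plus agent "proposals".
blackboard = {"cache_size": 8 * 1024, "line_size": None, "proposals": {}}

def dsp_domain_agent(bb):
    # Sequential DSP access patterns favor long lines (spatial locality).
    bb["proposals"]["dsp"] = 64

def cache_agent(bb):
    # The cache agent caps the line size so that the transfer time on a miss
    # does not inflate the average memory-access time too much (made-up heuristic).
    proposed = bb["proposals"].get("dsp", 16)
    bb["line_size"] = min(proposed, bb["cache_size"] // 256)

agenda = [dsp_domain_agent, cache_agent]
for agent in agenda:          # each agent handles its subtask and updates the blackboard
    agent(blackboard)

print(blackboard["line_size"])   # 32: the DSP proposal of 64 bytes, capped by the cache agent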
The modular structure of the expert system makes it possible to use expert system building tools. Such tools provide an expert system shell, so that only the knowledge has to be inserted into the system.
This report deals mainly with cache synthesis for uniprocessors and multiprocessors. The cache, the level of the memory hierarchy closest to the processor, is crucial for achieving high performance, and cache design will be an important part of the IDAMS system.
2.2 Knowledge Acquisition
The availability of expert system tools helps in building IDAMS. These tools usually consist of an expert system shell, hence the main work left to do is to acquire the necessary knowledge, to formalize it, and to insert it into the system using an adequate representation. This process is called knowledge acquisition, and it is mentioned in the literature as the bottleneck in constructing expert systems, because it is hard work: designers with knowledge and expertise are usually busy and expensive. They get their knowledge from working in the special domain, and that knowledge is often unstructured and not formalized. Another problem is a lack of motivation: an expert with special knowledge holds a kind of power, and making the knowledge publicly available means losing some of that power.
To solve the knowledge acquisition problem, two main strategies have been developed:
• Direct methods — experts are asked about their knowledge. Interviews, questionnaires, introspection (observing an expert solving a specific problem and explaining the steps), self-report (asking an expert to explain an existing solution), and protocol checking (asking an expert to check the protocol of a former knowledge acquisition session) are direct methods.
• Indirect methods — these attempt to acquire knowledge that is not explicit in the expert's mind. Their goal is not to find out how to solve the problem, but to discover how to select an adequate form of knowledge representation.
This report follows the direct methods of knowledge acquisition. The sources of knowledge include information extracted from books and scientific papers published by leading scientists in the field, and my own experience in designing computers. The basic knowledge necessary for synthesizing a cache with IDAMS is presented, and, as shown in Section 10.5, the acquired knowledge influences the steps followed by IDAMS in the design cycle.
3 PERFORMANCE AND COST
3.1 Performance Measurement
The designer of a computer system may have a single design goal: designing only for high performance or only for low cost. In the first case, no cost is spared in achieving performance, while in the second case performance is sacrificed to achieve the lowest cost. Between these extremes lies cost/performance design, in which the designer balances cost against performance.
We need a measure of the performance. Time is the measure of computer performance,
but time can be measured in different ways, depending on what we count. Generally,
performance may be viewed from two perspectives:
• the computer user is interested in reducing response time — the time between the start and the completion of the job — also referred to as execution time, elapsed time, or latency.
• the computer center manager is interested in increasing throughput — the total
amount of work done in a given time — sometimes called bandwidth.
Typically, the terms “response time”, “execution time”, and “throughput” are used when an entire computing task is discussed, while “latency” and “bandwidth” are often the terms of choice when discussing a memory system.
Therefore, the two aspects of performance are performing a given amount of work in
the least time —reducing response time—, or maximizing the number of jobs that are
performed in a given amount of time —increasing throughput.
Performance is frequently measured as a rate of some number of events per second, so
that lower time means higher performance. Given two machines, say X and Y, the phrase
“X is n% faster than Y” means that:
n = 100 ∗ (PerformanceX − PerformanceY) / PerformanceY (1)

where

PerformanceX / PerformanceY = Execution timeY / Execution timeX (2)
Because performance and execution time are reciprocals, increasing performance decreases execution time. To help avoid confusion between the terms “increasing” and “decreasing”, we usually say “improve performance” or “improve execution time” when we mean increase performance and decrease execution time.
The program response time is measured in seconds per program, and includes disk accesses,
memory accesses, input/output activities, operating system overhead — everything. Even though the response time seen by the user is the elapsed time of the program, one must take into account that under multiprogramming the CPU works on another program while waiting for I/O, and so does not necessarily minimize the response time of any one program. The measure of CPU performance is the CPU time (Section 3.3), that is, the time the CPU spends executing a given program, not including time spent running other programs or waiting for I/O.
3.2 Improving Performance: Amdahl's Law
An important principle of computer design is to make the common case fast and to favor the frequent case over the infrequent case. That is, the performance impact of improving an event is higher if the event occurs frequently. A fundamental law, called Amdahl's Law, quantifies this principle:
The performance improvement to be gained from using some faster mode of
execution is limited by the fraction of the time the faster mode can be used.
In other words, Amdahl's Law addresses the speedup that can be gained by using a particular feature. Speedup is defined as:
Speedup = Performance for entire job using the enhancement when possible / Performance for entire job without using the enhancement (3)
Speedup tells us how much faster a task will run using the machine with the enhance-
ments as opposed to the original machine. Amdahl’s Law expresses the law of diminishing
returns:
The incremental improvement in speedup gained by an improvement in per-
formance of just a portion of the computation diminishes as improvements are
added.
In other words, an enhancement that provides a speedup denoted by Speedupenhanced, but that can be used only for a fraction of the time, denoted by Fractionenhanced, will provide an overall speedup of:

Speedupoverall = 1 / ((1 − Fractionenhanced) + Fractionenhanced / Speedupenhanced) (4)
An important corollary of Amdahl's Law is that if an enhancement is only usable for a fraction of a task, we cannot speed up the task by more than the reciprocal of 1 minus that fraction. Amdahl's Law can serve as a guide to how much an enhancement will improve performance and how to distribute resources to improve cost/performance. The idea is to spend resources in proportion to where time is spent.
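As a small illustration of equation (4), the following sketch restates the formula in code and evaluates it for made-up example numbers; the figures are illustrative only.

def amdahl_overall_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup per equation (4)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Example: an enhancement that is 10x faster but applies to only 40% of the execution time.
print(round(amdahl_overall_speedup(0.4, 10.0), 2))   # 1.56

# The corollary: even an infinitely fast enhancement is bounded by 1 / (1 - fraction).
print(round(amdahl_overall_speedup(0.4, 1e9), 2))    # 1.67, i.e. 1 / 0.6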
3.3 CPU Performance
The response time includes the time of waiting for I/O, the time the CPU may work on
another program (in multiprogramming systems), and operating system overhead. If we
are interested in the CPU performance, its measure is the time the CPU is computing.
CPU time is the time the CPU is computing, not including the time waiting for I/O or
running other programs. The CPU time can be further divided into the CPU time spent
in the program, called user CPU time, and the CPU time spent in the operating system,
called system CPU time.
In this report, we shall use the response time as the system performance measure
and the user CPU time as the CPU performance measure.
CPU time for a program can be expressed with the following formula:
CPU time = CPU clock cycles for a program ∗ Clock cycle period (5)
If we denote the number of instructions to execute a program by Instruction count (IC),
then we can calculate the average number of clock cycles per instruction (CPI):
CPI = CPU clock cycles for a program / IC (6)
Expanding equation (5), we get:
CPU time = IC ∗ CPI ∗ Clock cycle time (7)
CPU time can be divided into the clock cycles the CPU spends executing the program and the clock cycles the CPU spends waiting for the memory system (in which case we say that “the CPU is stalled”):
CPU time = IC ∗ (CPIExecution + Memory-stall cycles / Instruction) ∗ Clock cycle time (8)
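The sketch below simply evaluates equations (7) and (8) for made-up parameter values, to show how memory stalls enter the CPU time; the numbers are illustrative, not measurements.

def cpu_time(ic, cpi_execution, stall_cycles_per_instr, clock_cycle_s):
    """CPU time per equation (8): execution cycles plus memory-stall cycles."""
    return ic * (cpi_execution + stall_cycles_per_instr) * clock_cycle_s

# 100 million instructions, base CPI of 1.5, 0.5 memory-stall cycles per
# instruction, 10 ns clock cycle (100 MHz, a plausible figure for the period).
print(cpu_time(100e6, 1.5, 0.5, 10e-9))   # 2.0 seconds, instead of 1.5 s without stalls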
Example
What effect do the following performance enhancements have on throughput
and response time?
1. Faster clock cycle time;
2. Parallel processing of a job;
3. Multiple processors for separate jobs.
Answer
Since decreasing response time usually increases throughput, both 1 and 2
improve response time and throughput. In 3, no one job gets work done faster,
so only throughput increases.
4 COMPUTER ARCHITECTURE OVERVIEW
4.1 An Architecture Classification
Designing a computer system is a task having many aspects, including instruction set de-
sign, functional organization, logic design, and implementation. The design of the machine
should be optimized across these levels. The term instruction set architecture refers to the
actual programmer-visible instruction set. The instruction set architecture serves as the
boundary between the hardware and the software. The term organization includes the
high-level aspects of a computer’s design, such as the memory system, the bus structure,
and the internal CPU design. Hardware is used to refer to the specifics of a machine, such
as the detailed logic design and the packaging technology of the machine. Hardware is
actually the main aspect of implementation, which encompasses integrated circuit design,
packaging, power, and cooling. We shall use the term architecture to cover all three aspects
of computer design.
According to the number of processors included, computer systems fall into two broad categories: uniprocessor computers, which contain only one CPU, and multiprocessor computers, which contain more than one CPU. The CPU can be partitioned into three functional units: the instruction unit (also called the control unit) (I-unit), the execution unit (E-unit), and the storage unit (S-unit):
• the I-unit is responsible for instruction fetch and decode and generates the control
signals; it may contain some local buffers for instruction prefetch (lookahead);
• the E-unit executes the instructions and contains the logic for arithmetic and logical
operations (the data path blocks);
• the S-unit provides the memory interface between the I-unit and the E-unit; it
provides memory management (protection, virtual address translation) and may
contain some additional components whose goal is to reduce the access time to data.
A finer classification of computer architectures may be made taking into account not only
the number of CPU’s but also the instruction and data flow. Under this approach, there
are three architectures ([7]):
• Single Instruction Stream Single Data Stream (SISD) architecture, in which a single
instruction stream is executed on a single stream of operands;
• Single Instruction Stream Multiple Data Stream (SIMD) architecture, in which a
single instruction stream is executed on several streams of operands;
• Multiple Instruction Stream Multiple Data Stream (MIMD) architecture, in which
different instruction streams are executed on different operand streams in parallel.
In this report we shall use the term multiprocessor as a synonym for the MIMD architecture. Multiprocessing is a way to increase system computing power. A distinctive feature of an MIMD architecture is that the multiple instruction streams must communicate or synchronize, either by passing messages or by sharing memory.

Figure 3: Shared-memory system: all memory and I/O are remote and shared (processors 1 through n access a set of shared memory modules and I/O units through an interconnection network)
4.2 Multiprocessors
Among multiprocessors, we can distinguish two broad classes, according to the logical
architecture of their memory systems [7]:
• shared memory systems (also called tightly coupled systems): all processors access
a single address space (Figure 3).
• distributed systems (also called loosely coupled systems or multicomputers): each processor can access only its own memory; each processor's memory is logically disjoint from the other processors' memories (Figure 4). In order to communicate, the processors send messages to each other.
A memory unit is called a memory module. Each processor has registers, arithmetic and logic units, and access to memory and input/output modules. The distinction between the two architectures stems from the way memory and input/output units are accessed by the processors:
• in the shared memory model, memory and I/O systems are separate subsystems
shared among all of the processors;
• in the distributed memory model, memory and I/O units are attached to individual
processors; no sharing of memory and input/output is permitted.
Figure 4: Distributed system: all memory and I/O are local and private (each of the processors 1 through n has its own memory and I/O units and communicates with the other processors through an interconnection network)
In the shared memory model the address space of all processors is the same and is distributed among the memory modules, while in the distributed memory model each processor has its own address space mapped to its local memory. In both cases the system contains multiple processors, each capable of executing an independent program, so the system fits the MIMD model.
Actually, these two architectures represent the extremes in the design space, and practical
designs may lie at the extremes or anywhere in between. Therefore, multiprocessors can
have any reasonable combination of shared global memory and private local memory.
The goal of multiprocessing may be either to maximize the throughput of many jobs (throughput-oriented multiprocessors) or to speed up the execution of a single job (speedup-oriented multiprocessors). In systems of the first type, jobs are distinct from each other and execute as if they were running on different uniprocessors. In systems of the second type an application is partitioned into a set of cooperating processes, and these processes interact while executing concurrently on different processors. The partitioning of a job into cooperating processes is called multithreading or parallelization.
The shared memory model provides a convenient means for information interchange and
synchronization since any pair of processors can communicate through a shared location.
Shared memory systems present to the programmer a single address space, enhancing
the programmability of a parallel machine by reducing the problems of data partition-
ing and dynamic load distribution. The shared address space also improves support for
automatically parallelizing compilers, standard operating systems, and multiprogramming.
The distributed system (i.e., local and private memory and I/O) supports communication
through point-to-point exchange of information, usually by message passing.
Depending on the structure of the interconnection network, there are two types of shared-
memory architectures:
• bus-based memory systems: the memory and all processors (with optional private
caches) are connected to a common bus. Communication on a bus is of the broadcast type: any memory access made by one CPU can be “seen” by all the CPUs.
• general interconnection networks: they provide several simultaneous connections
between pairs of nodes, that is, only two nodes are involved in any connection: the
sender and the receiver. These interconnection networks adhere to the point-to-point
communication model and may be direct or multistage networks.
4.3 Multiprocessing performance
The main purpose of a multiprocessor is either to increase the throughput or to decrease
the execution time, and this is done by using several machines concurrently instead of
a single copy of the same machine. In some applications, the main purpose of using multiple processors is reliability rather than high performance; the idea is that if any single processor fails, its workload can be performed by the other processors in the system (fault-tolerant computing).
(fault-tolerant computing). The design principles for fault-tolerant computers are quite
different from the principles that guide the design of high-performance systems. We shall
focus our attention on performance.
As mentioned in Section 3.2, the amount of overall Speedup is dependent on the fraction of
time that a given enhancement is actually used. In the case of improving performance by
using multiprocessing instead of uniprocessing, the overall efficiency is maximum when all
processors are engaged in useful work, no processor is idle, and no processor is executing
an instruction that would not be executed if the same algorithm were executing on a
single processor. This is the state of peak performance, when all N processors of a
multiprocessor are contributing to effective performance, and in this case the Speedup is
equal to N. Peak performance is rarely achievable because there are several factors ([7])
that introduce inefficiency, such as:
• the delays introduced by interprocessor communications;
• the overhead in synchronizing the work of one processor with another;
• lost efficiency when one or more processors run out of tasks;
• lost efficiency due to wasted effort by one or more processors;
• the processing costs for controlling the system and scheduling operations.
Even though both scheduling and synchronization are sources of overhead on uniprocessors as well, we cite them here because they degrade multiprocessor performance beyond the effects that may already be present on individual processors.
The sources of inefficiency must be examined carefully because they may compromise the performance gain of multiprocessing over serial processing. For example, if the combined inefficiencies produce an effective processing rate of only 10 percent of the peak rate, then ten processors are required in a multiprocessor system just to do the work of a single processor. The inefficiency tends to grow as the number of processors increases.
For a small number of processors (tens), careful design can hold the inefficiency to a low
figure. Moreover, the complexity of programming a machine with many (hundreds of)
processors far exceeds the complexity of programming a single processor or a computer
with a few processors. Therefore, the higher performance benefit of parallelism should be
compared with the increase in cost and complexity to find out the degree of parallelism
that can be used effectively.
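To make the 10-percent example above concrete, here is a trivial sketch (with illustrative numbers only) relating efficiency, peak speedup, and effective speedup.

def effective_speedup(num_processors, efficiency):
    """Effective speedup: N processors scaled by the fraction of peak actually achieved."""
    return num_processors * efficiency

# At 10% efficiency, ten processors deliver only the throughput of a single processor:
print(effective_speedup(10, 0.10))   # 1.0

# With careful design holding efficiency at 90%, ten processors deliver a 9x speedup:
print(effective_speedup(10, 0.90))   # 9.0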
4.4 Interprocess communication and synchronization
MIMD computers execute parallel programs. A parallel program is a set of concurrently executing sequential processes. These processes cooperate and/or compete while
executing, either by explicitly exchanging information or by sharing variables. To enforce
correct sequencing of processes and data consistency, some methods of communication and
synchronization of processes must be used. The notions of communication and synchro-
nization are tightly related.
In general, communication refers to the exchange of data between different processes.
Usually, one or several sender processes transmit data to one or several receiver processes.
Interprocess communication is mostly the result of explicit directives in the program. For
example, parameters passed to a coroutine and results returned by such a coroutine con-
stitute interprocess communication.
Synchronization is a special form of communication, in which the data are control infor-
mation. Synchronization serves a dual purpose:
• enforcing the correct sequencing of processes (e.g., controlling a producer process and a consumer process such that the consumer never reads stale data and the producer never overwrites data that have not yet been read by the consumer);
• ensuring data consistency through mutually exclusive access to certain shared writable data (e.g., protecting the data in a database such that concurrent write accesses to the same record are not allowed).
Communication and synchronization can be implemented in two ways:
• through controlled sharing of data in memory. This method can be used in shared memory systems. Synchronization is achieved through a hierarchy of mechanisms:
1. hardware level synchronization primitives such as TEST&SET(lock),
RESET(lock), FETCH&ADD(x,a), Empty/Full bit;
2. software-level synchronization mechanisms such as semaphores and barriers;
• message passing. This method can be used both for shared memory systems and for distributed systems.
Synchronization mechanisms are used to provide mutually exclusive access to shared variables and to coordinate the execution of several processes:
Mutually exclusive access
Access is mutually exclusive if no two processes access a shared variable simultaneously. A critical section is an instruction sequence that has mutually exclusive access to shared variables. Locks and semaphores can be used to guarantee mutually exclusive access. On a uniprocessor, mutual exclusion can be guaranteed by disabling interrupts.
Conditional synchronization
Conditional synchronization is a method of process coordination which ensures that a set
of variables are in a specific state (condition) before any process requiring that condition
can proceed. Mechanisms such as Empty/Full bit, Fetch&Add, and Barrier can be used
to synchronize processes.
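As a hedged illustration of the hardware-level primitives mentioned above, the following sketch emulates a TEST&SET/RESET spin lock in software; a real implementation relies on an atomic hardware instruction, which is emulated here by a host mutex purely so that the example runs.

import threading

class EmulatedTestAndSet:
    """Software emulation of a TEST&SET/RESET lock word (atomicity faked with a host mutex)."""
    def __init__(self):
        self._flag = 0
        self._atomic = threading.Lock()   # stands in for the hardware's atomicity guarantee

    def test_and_set(self):
        with self._atomic:
            old = self._flag              # return the old value and set the flag in one step
            self._flag = 1
            return old

    def reset(self):
        self._flag = 0                    # RESET(lock): release the lock

lock, counter = EmulatedTestAndSet(), [0]

def critical_increment():
    while lock.test_and_set() == 1:       # spin (busy-wait) until the lock is acquired
        pass
    counter[0] += 1                       # critical section: mutually exclusive access
    lock.reset()

threads = [threading.Thread(target=critical_increment) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(counter[0])                         # 8: each increment executed under mutual exclusion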
4.5 Coherence, Consistency, and Event Ordering
Memory coherence is a system's ability to execute memory operations correctly. We need a
precise definition of correct execution. Censier and Feautrier define [1] a coherent memory
system as follows:
A memory scheme is coherent if the value returned on a LOAD opera-
tion is always the value given by the latest STORE operation with the
same address.
This definition, while very concise and intuitive, is difficult to interpret and too ambiguous
in the context of a multiprocessor, in which data accesses may be buffered and may not
be atomic.
An access by processor i on a variable X is called atomic if no other processor is allowed
to access any copy of X while the access by processor i is in progress.
A LOAD of a variable X is said to be performed at a point in time when the issuing of a STORE from any processor to the same address cannot affect the value returned by the LOAD.
A STORE on a variable X by processor i is said to be performed at a point in time when an
issued LOAD from any processor to the same address cannot return a value of X preceding
the STORE.
Accesses are buffered if multiple accesses can be queued before reaching their destination,
such as main memory or caches.
Serial computers present a simple and intuitive model that adheres to the memory coher-
ence as defined by Censier and Feautrier: a LOAD operation returns the last value written
to a given memory location and a STORE operation binds the value that will be returned
by subsequent LOADs until the next STORE to the same location. For multiprocessors, the
memory system model is more complex, because the definitions of “last value written”,
“subsequent LOADs”, and “next STORE” become unclear when there are multiple proces-
sors reading and writing a location. Furthermore, the order in which shared memory
operations are done by one process may be used by other processes to achieve implicit
synchronization. For example, a process may set a flag variable to indicate that a data
structure it was manipulating earlier is now in a consistent state.
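The flag example in the preceding paragraph can be written as the following hypothetical two-process sketch; whether the reader is guaranteed to see the complete data structure once it observes flag == 1 is exactly what the consistency models of Chapter 8 define. (Under CPython's global interpreter lock the writes happen to be observed in program order, so the sketch terminates and prints 6.)

import threading

data = []          # the shared data structure being built by the producer
flag = 0           # flag == 1 signals "data is in a consistent state"

def producer():
    global flag
    data.extend([1, 2, 3])   # STOREs to the shared data structure ...
    flag = 1                 # ... must be observed before this STORE to the flag

def consumer(result):
    while flag == 0:         # implicit synchronization: wait for the flag
        pass
    result.append(sum(data)) # relies on the ordering of the producer's writes

result = []
threads = [threading.Thread(target=consumer, args=(result,)),
           threading.Thread(target=producer)]
for t in threads: t.start()
for t in threads: t.join()
print(result[0])             # 6, provided the writes are observed in program order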
The behavior of the machine with respect to memory accesses is defined by the mem-
ory consistency model. Consistency models place specific requirements on the order that
shared memory accesses (events) from one process may be observed by other processes in
the machine. The consistency model specifies what event orderings are legal when several
processes are accessing a common set of locations.
Two major classes of machine behavior with respect to memory consistency have been
defined: sequential consistency and weak consistency models of behavior (Chapter 8).
Because the only way that two concurent processes can affect each other’s execution is
through sharing of writable data and sending of interrupts, it is the order of these events
that really matters. The machine must enforce these models by proper ordering of
storage accesses and execution of synchronization and communication primitives. Thus,
the ordering of events in a multiprocessor is an important issue and it is related to memory
consistency.
Coherence problems may exist at various levels of a memory hierarchy. Inconsistencies,
that is, contradictory information, can occur between adjacent levels or within the same
level of a memory hierarchy. For example, in a shared memory multiprocessor with private
caches (Section 5.12), caches and main memory may contain inconsistent copies of data,
or multiple caches could possess different copies of the same memory word because one of
the processes has modified its data. The former inconsistency may not affect the correct
execution of the program, while the latter condition is shown to lead to an incorrect
behavior. Multiprocessor caches must be provided with mechanisms that make them behave correctly.
We can conclude that synchronization, coherence, and ordering of events are closely related
issues in the design of multiprocessors.
5 MEMORY HIERARCHY DESIGN
5.1 General Principles of Memory Hierarchy
As programmers tend to ask for ever larger amounts of ever faster memory, fortunately a rule of thumb applies. This rule, called the “90/10 Rule”, states that:
A program spends 90% of its execution time in only 10% of the code.
This rule holds that all programs favor a portion of their address space at any instant of
time. Thus it can be restated as the “principle of locality”. An implication of locality is
that based on the program’s recent past, one can predict with reasonable accuracy what
instructions and data a program will use in the near future. This locality of reference
applies both to data and code accesses, but it is stronger for code accesses.
The property of locality has two dimensions (i.e., there are two types of locality); a short code sketch after the following list illustrates both:
• Temporal locality (locality in time) - If an item is referenced, it will tend to be
referenced again soon.
• Spatial locality (locality in space) - If an item is referenced, nearby items will tend
to be referenced soon.
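The following fragment (an illustrative example, not taken from any particular benchmark) exhibits both kinds of locality: the repeated use of total is temporal locality, and walking the array element by element is spatial locality.

# Summing an array: a pattern with strong temporal and spatial locality.
values = list(range(10000))

total = 0                 # 'total' is re-referenced on every iteration (temporal locality)
for v in values:          # consecutive elements are referenced one after another
    total += v            # (spatial locality: neighbors of the current item come next)
print(total)              # 49995000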
Locality can be exploited to increase memory bandwidth and decrease the latency of
memory access, which are both crucial to system performance. The principle of locality of
reference says that recently used data, and data near it, is likely to be accessed again in the future. According to Amdahl's Law, favoring accesses to such data will improve performance.
Thus, recently addressed items should be kept in the fastest memory. Because smaller
memories are faster, smaller memories are used to hold the most recently accessed items
close to the CPU, and successively larger (and slower) memories as we move further away
from the CPU are used to hold less recently accessed items. This type of organization
is called a memory hierarchy. A memory hierarchy is a natural reaction to locality and
technology. The principle of locality and the guideline that smaller hardware is faster
yield the concept of a hierarchy based on different speeds and sizes. Since slower memory
is cheaper, a memory hierarchy is organized into several levels - each smaller, faster, and
more expensive per byte than the level below. The levels of the hierarchy subset one
another: all data in one level is also found in the level below, and all data in that lower
level is also found in the one below it, and so on until we reach the bottom of the hierarchy.
Taking advantage of the principle of locality can improve performance; the address map-
ping from a larger memory to a smaller but faster memory is intended to provide access
to all levels of the memory hierarchy by letting the processor look for data in the levels in order of decreasing speed (that is, first in the fastest level).
A memory hierarchy normally consists of many levels, but it is managed between two
adjacent levels at a time. The upper level —the one closer to the processor— is smaller
and faster than the lower level (Figure 5).
Figure 5: The Levels of a Typical Memory Hierarchy (CPU registers, cache, main memory reached over the memory bus, and I/O devices reached over the I/O bus)
A cache is a small, fast memory located close to the CPU that holds the most recently
accessed code or data. Cache represents the level of memory hierarchy between the CPU
and main memory.
The minimum unit of information that can be either present or not present in the two-level
hierarchy is called a block. The size of the block may be either fixed or variable. If it is
fixed, the memory size is a multiple of that block size. Sometimes, the term line is used
instead of block to refer to the unit of information that can be either present or absent in
a cache.
Success or failure of an access to the upper level is designated as a hit or a miss: a hit
is a memory access found in the upper level, while a miss means it is not found in that
level. Hit rate, or hit ratio, is the fraction of memory accesses found in the upper level,
and it is usually represented as a percentage. Miss rate, or miss ratio, is the fraction of
memory accesses not found in the upper level and is equal to (1−Hit rate). For example,
when the CPU finds the needed data item in the cache we say that a cache hit occurs, and
when the CPU does not find it we say that a cache miss occurs. Likewise, for a computer
with virtual memory where the address space is broken into fixed-size units called pages,
a page may reside either in main memory or on disk. When the CPU references an item
within a page that is not present in the cache or main memory, a page fault occurs, and
the entire page is moved from the disk to main memory. The cache and main memory
have the same relationship as the main memory and disk. In this report we shall focus on
the relationship between cache and main memory.
Since performance is the major reason for having a memory hierarchy, the speed of hits
and misses is important. Hit time is the time to access the upper level of the memory
hierarchy, which includes the time to determine whether the access is a hit or a miss. Miss
penalty is the time to replace a block in the upper level with the corresponding block from
the lower level, plus the time to deliver this block to the requesting device (normally the
CPU). The miss penalty is further divided into two components:
• access time or access latency —the time to access the first word of a block on a miss;
• transfer time—the additional time to transfer the remaining words in the block.
Access time is related to the latency of the lower-level memory, while transfer time is
related to the bandwidth between the lower-level and upper-level memories. Defining the
lower memory bandwidth, B, as the number of bytes transferred between the lower- and
upper-level in a clock cycle, and denoting by L the number of bytes per block and by b
the number of bytes per word, we can write:
Miss penalty = Access latency + (L − b) / B (9)
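Plugging illustrative numbers into equation (9), the sketch below computes the miss penalty in clock cycles for a hypothetical cache line; the latency, bus width, and line size are made-up values, not recommendations.

def miss_penalty_cycles(access_latency, line_bytes, word_bytes, bandwidth_bytes_per_cycle):
    """Equation (9): latency of the first word plus transfer time of the remaining bytes."""
    return access_latency + (line_bytes - word_bytes) / bandwidth_bytes_per_cycle

# 32-byte line, 4-byte words, 10-cycle access latency, 4 bytes transferred per cycle:
print(miss_penalty_cycles(10, 32, 4, 4))   # 17.0 cycles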
5.2 Performance Impact of Memory Hierarchy
The impact of memory hierarchy on the CPU performance is dependent on the relative
weight of the number of Memory stall clock cycles in the CPU time. Let us recall equation
(8), Section 3.3, which shows the following aspects:
• the effect of memory stalls is to increase the total CPI;
• the lower the CPIExecution, the more pronounced the impact of memory stalls on performance;
• because main memories have similar memory-access times independent of the CPU, and the memory stall is measured in the number of CPU cycles needed for a miss, it follows that a higher CPU clock rate leads to a larger miss penalty even if the main memories are the same speed.
The importance of the cache for CPUs with low CPI and high clock rates is thus greater.
The number of memory-stall cycles per instruction may be expressed as:
Memory-stall clock cycles / Instruction =
(Reads / Instruction) ∗ Read miss rate ∗ Read miss penalty + (Writes / Instruction) ∗ Write miss rate ∗ Write miss penalty (10)
We can use a simplified formula by combining the read and write misses:

Memory-stall clock cycles / Instruction = (Memory accesses / Instruction) ∗ Miss rate ∗ Miss penalty (11)
Therefore, we obtain for the CPU time:
CPU time = IC ∗ (CPIExecution + (Memory accesses / Instruction) ∗ Miss rate ∗ Miss penalty) ∗ Clock cycle time (12)
The Miss penalty, just as the value of Memory stall clock cycles, is measured in CPU cy-
cles; therefore, the same main memory will produce different miss penalties with different
values of the Clock cycle time.
Another form of formula (12) may be obtained if we measure the number of misses per
instruction instead of the number of misses per memory reference (i.e., instead of Miss
rate):
Misses/Instruction = (Memory accesses/Instruction) ∗ Miss rate    (13)
The advantage of the Misses/Instruction measure over the Miss rate measure is that
it is independent of the hardware implementation, which can artificially reduce the Miss
rate (e.g., when a single instruction makes repeated references to a single byte). Although
the number of misses per instruction reflects the real number of misses, independent of
the hardware implementation, it has the drawback of being architecture dependent, that
is, different architectures will have different values of this parameter. However, within a
single computer family one can use the following CPU-time formula:
CPU time = IC ∗ (CPIExecution + (Misses/Instruction) ∗ Miss penalty) ∗ Clock cycle time    (14)
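To make the relation between equations (11)-(14) concrete, the following Python sketch evaluates them for a hypothetical machine; every parameter value is assumed for illustration only:

```python
def cpu_time_from_miss_rate(ic, cpi_exec, accesses_per_instr,
                            miss_rate, miss_penalty_cycles, clock_cycle):
    """Equation (12): CPU time with memory stalls expressed via the Miss rate."""
    stall_cycles_per_instr = accesses_per_instr * miss_rate * miss_penalty_cycles
    return ic * (cpi_exec + stall_cycles_per_instr) * clock_cycle

def cpu_time_from_misses_per_instr(ic, cpi_exec, misses_per_instr,
                                   miss_penalty_cycles, clock_cycle):
    """Equation (14): the same CPU time expressed via Misses/Instruction."""
    return ic * (cpi_exec + misses_per_instr * miss_penalty_cycles) * clock_cycle

# Hypothetical machine: 10**6 instructions, CPIExecution = 1.5,
# 1.3 memory accesses per instruction, 2% miss rate,
# 10-cycle miss penalty, 20 ns clock cycle.
t12 = cpu_time_from_miss_rate(1e6, 1.5, 1.3, 0.02, 10, 20e-9)
t14 = cpu_time_from_misses_per_instr(1e6, 1.5, 1.3 * 0.02, 10, 20e-9)
assert abs(t12 - t14) < 1e-12      # equation (13) links the two forms
print(f"{t12:.4f} s")              # -> 0.0352 s
```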
When we evaluate the performance of the memory hierarchy, it is not enough to look
only at the Miss rate. While this parameter is independent of the speed of the hard-
ware, one should be aware that, as equation (12) shows, the effect of the Miss rate on
the CPU performance is dependent on the value of other parameters, i.e., CPIExecution,
Memory accesses/Instruction, and Miss penalty —which, as explained, is influenced
by the Clock cycle time. A better measure of the memory-hierarchy performance is the
average time to access memory:
Average memory-access time = Hit time + Miss rate ∗ Miss penalty    (15)
In equation (15) the Miss penalty is measured in nanoseconds; therefore, it no longer
depends on the Clock cycle time, unlike the Miss penalty in equations (10), (11), (12), and (14).
While minimizing the Average memory-access time is a reasonable goal, the final goal is
to improve CPU performance, that is, to decrease the CPU execution time, and one must
be aware that CPU performance does not depend linearly on the Average memory-access time.
Misses in a memory hierarchy mean that the computer must have a mechanism to transfer
blocks between upper- and lower-level memory. If the block transfer is tens of clock
cycles, it is controlled by hardware; this is the case for cache misses. If the block transfer
is thousands of clock cycles, it is usually controlled by software; this is the case for page
faults. For a cache miss, the processor normally waits for the memory transfer to complete.
For a page fault it would be too wasteful to let the CPU sit idle; therefore, the CPU is
interrupted and used for another process during the miss handling. Thus, avoiding a long
miss penalty for page faults means any memory access can result in a CPU interrupt. This
also means the CPU must be able to recover any memory address that can cause such
an interrupt, so that the system can know what to transfer to satisfy the miss. When
the memory transfer is complete, the original process is restored, and the instruction that
missed is retried.
The processor must also have some mechanism to determine whether or not information is
in the top level of the memory hierarchy. This check happens on every memory access and
affects hit time; maintaining acceptable performance requires the check to be implemented
in hardware.
5.3 Aspects that Classify a Memory Hierarchy
The fundamental principles that drive all memory hierarchies allow us to use terms that
transcend the levels we are talking about. The same principles allow us to pose four ques-
tions about any level of the hierarchy:
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification or Block lookup)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)
The answers to these questions induce a classification of a level of the memory hierar-
chy.
The memory address is divided into pieces that access each part of the hierarchy. Based
on this address, there are two pieces of data that must be determined:
1. what is the number of the block to which the memory address corresponds;
2. what data item within the block is addressed.
We shall consider a byte-addressable machine, in which the data item is a byte. For the
sake of simplicity we shall consider that the number of bytes in a block is a power of 2.
More specifically, we shall denote by m the number of bits in the memory address, and
by j the number of bits that identify a byte in a block (therefore, the size of a block is 2^j
bytes). With this notation, we can partition the memory address into two fields:
• the block-offset address identifies the byte within the block and is composed of bits
0 ... j − 1 of the memory address;
• the block-frame address identifies the block in that level of the hierarchy and is composed
of bits j ... m − 1 of the memory address.
The Block-frame address is the higher-order piece of the memory address that identifies
a block. The Block-offset address is the lower-order piece of the address and it identifies
an item within the line. An item is the quantum of information that can be accessed by
the processor’s instructions. It may be a byte, or a word, but as mentioned we consider
byte-addressable machines.
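A minimal Python sketch of this address partition, for a hypothetical block size of 2^5 = 32 bytes, is given below; the function name and values are illustrative only:

```python
def split_address(addr, j):
    """Split a memory address into the block-frame address (bits j .. m-1)
    and the block-offset address (bits 0 .. j-1), for 2**j-byte blocks."""
    block_offset = addr & ((1 << j) - 1)
    block_frame = addr >> j
    return block_frame, block_offset

# Hypothetical example: 32-byte blocks (j = 5).
frame, offset = split_address(0x1234, j=5)
print(hex(frame), offset)   # -> 0x91 20
```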
5.4 Cache Organization
Cache is the name chosen to represent the level of memory hierarchy between the CPU
and main memory. Caches may also be used as an upper level for a disk or tape memory
at the lower level, but we shall restrict our discussion to caches of main memory, called
CPU caches. Therefore, the term cache is used in this report instead of CPU cache.
The organization of the cache determines the block placement and identification. As mentioned,
data is stored in the cache in lines (also called blocks), which represent the minimum
unit of information that can be present in the cache (and the minimum unit transferred
between cache and main memory).
In general, n lines are grouped into a set. A set is a collection of elements
forming an associative memory. Data in a set are content addressable: every line of
the set has an address Tag associated with its Data that identifies the data. Therefore,
finding the cache line that holds a block is a two-step process:
• first, the block-frame address is mapped onto a set number, and the block can be
placed anywhere within this set. If there are n blocks in a set, the cache placement
is called n-way set associative.
• second, an associative search is performed within the set to find the line, by com-
paring the tags of the lines in the set with the tag of the memory address currently
accessed.
The number of lines in a set, n, is also referred to as the set size (synonyms: degree of
associativity, or associativity).
The method used to map a block-frame address onto a set is called the set-selection
algorithm. There are two basic set-selection algorithms:
• bit selection: the number of the selected set, denoted by Index, is the remainder of dividing
the block-frame address by the number of sets in the cache:
Index = (Block-frame address) modulo (Number of sets in cache)    (16)
Generally, the number of sets is chosen to be a power of 2, say 2^k. In this case,
the index is computed simply by selecting the lower-order k bits of the Block-frame
address.
• hashing: the block-frame address bits, that is, bits j ... m − 1 of the memory address,
are grouped into k groups (where 2^k is the number of sets), and within each group
an EXCLUSIVE OR is performed. The resulting k bits then designate a set. Both set-selection algorithms are sketched below.
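The following Python sketch shows one possible form of each set-selection algorithm; the cache geometry is hypothetical, and for hashing the bits are grouped by position modulo k, which is only one of the possible groupings:

```python
def index_bit_selection(addr, j, k):
    """Bit selection: the set index is the block-frame address modulo 2**k,
    i.e., the lower-order k bits of the block-frame address."""
    block_frame = addr >> j
    return block_frame & ((1 << k) - 1)

def index_hashing(addr, j, m, k):
    """Hashing: group the (m - j) block-frame bits by position modulo k and
    XOR within each group; equivalently, XOR-fold the k-bit chunks."""
    block_frame = addr >> j
    index = 0
    for _ in range((m - j + k - 1) // k):     # number of k-bit chunks
        index ^= block_frame & ((1 << k) - 1)
        block_frame >>= k
    return index

# Hypothetical cache: 16-byte lines (j = 4), 256 sets (k = 8), 32-bit addresses.
print(index_bit_selection(0x0001A3C7, j=4, k=8))       # -> 60
print(index_hashing(0x0001A3C7, j=4, m=32, k=8))       # -> 38
```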
Bit selection is very simple and is the usual set-selection algorithm for cache memories.
The address-mapping scheme is such that a block-frame address selects a set of lines;
only if n = 1 does a block-frame address identify a unique line.
A method is needed to identify the memory address to which a given line in the cache
corresponds. To do this we need to store with each line an address Tag. Because, with bit
selection, all lines in set number Index correspond to memory addresses whose block-frame
address has the value Index in its lower-order k bits, storing the Index is redundant. Thus,
the address Tag consists of the bits j + k ... m − 1 of the memory address. Therefore, to
each line of the cache two additional pieces of information are attached:
• the Address Tag contains the address bits that identify the line;
• the Valid bit marks information in the line as either valid or invalid.
One can view this as the cache consisting of elements, each element of the cache having
three fields : the Data field, that is the cache block, the Tag field, and the Valid field. If the
Valid bit is viewed as another part of the Tag, then a cache element may be considered as
formed from two fields: Data and Tag. One Tag is required for each line. If the total size
of the cache and the line size are kept constant, then increasing associativity increases the
number of blocks per set, thereby decreasing the number of sets, which means decreasing
the number of Index bits and increasing the number of Address Tag bits and therefore the
cost. The degree of associativity also determines the number of comparators needed to
check the Tag against a given memory address and the complexity of the multiplexer required
to select a line from the matching set, and it increases the hit time. These aspects should be
considered because the size of the Tag memory (the product between the size of the Tag
field and the number of lines) affects the total cost of the cache. Physically, the Data and
Tag fields may be stored in the same storage, or in separate “data array” and “address
array” respectively.
To summarize, the cache structure is described by three parameters:
• line size: the number of bytes stored in one line and also the number of bytes
transferred between the cache and main memory at one memory reference;
• associativity (synonym: set size): represents the number of lines in a set;
• number of sets in the cache.
There are two boundary conditions of the cache organization:
• if the associativity is equal to the total number of lines, then one line of main memory
may be found in any line of cache. This organization is called a fully associative
cache. This organization provides maximum flexibility for data placement, but it
incurs increased complexity;
• if n = 1, that is, one-way set associative, then there is only one line per set so that one
main memory line may be placed only into one line of the cache. This organization
is known as direct mapping.
5.5 Line Placement and Identification
In this section we shall answer the first two questions of Section 5.3. Given a memory
address, which we call the Target Address, we must find in which set of the cache it can
reside and identify if it is actually in the cache. First, if the memory system employs
virtual memory, then the Virtual Address generated by the CPU is translated into a Real
Address or Physical Address. Depending on whether the Tag field contains the Virtual
or the Real Address, a cache is called a Virtual address cache or a Real address cache,
respectively. We shall consider in this report Real Address caches. A cache access begins
with presenting the CPU-generated Virtual Address to the cache.
(1) First, the Virtual Address is translated into a Real Address. For this purpose, the virtual
address is passed to the Translator (which is part of the S-unit) and to an associative
memory called the Translation Lookaside Buffer (TLB), which holds (“caches”) the most recent
translations. The TLB is a small associative memory, each of its elements consisting of a pair
(Virtual Address, Real Address). The TLB receives as input the Virtual Address, randomizes
(hashes) it, and uses the hashed number to select a certain set. That set is then searched
associatively for a match to the Virtual Address. If a match is found, the corresponding
Real Address is passed along to the cache itself. If the TLB does not contain the required
translation, the cache must wait for the Real Address provided by the Translator.
(2) Second, using the Target address, the set to which the Target maps is found: employing
the bit selection method, the set Index is extracted from the Target address, that is, the
set in which the block can be present is found. The Data and Tag fields of the blocks in
the selected set are accessed.
(3) Third, the set is searched associatively over its n elements to check whether there is an
Address Tag matching the Tag portion of the Target. Because speed is of the essence,
all possible Tags are searched in parallel. If a match is found —a hit— then the data field
of the element containing the matching Tag is presented at the cache output. Otherwise,
there is a miss.
(4) Fourth, because a line usually contains several words, the desired word is selected using
the block-offset portion of the Target and presented to the processor.
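A minimal Python sketch of steps (2)-(4) of the lookup follows; it is a software model with a hypothetical geometry, the TLB translation of step (1) is omitted, and the data of a line is modeled as a byte sequence:

```python
class CacheModel:
    """Software model of an n-way set-associative cache lookup (steps 2-4)."""

    def __init__(self, num_sets, assoc, line_bytes):
        self.k = num_sets.bit_length() - 1      # index bits (num_sets = 2**k)
        self.j = line_bytes.bit_length() - 1    # block-offset bits
        self.sets = [[{"valid": False, "tag": None, "data": None}
                      for _ in range(assoc)] for _ in range(num_sets)]

    def lookup(self, real_addr):
        offset = real_addr & ((1 << self.j) - 1)
        index = (real_addr >> self.j) & ((1 << self.k) - 1)   # step (2): select the set
        tag = real_addr >> (self.j + self.k)
        for line in self.sets[index]:                          # step (3): associative search
            if line["valid"] and line["tag"] == tag:
                return line["data"][offset]                    # step (4): hit, return the byte
        return None                                            # miss
```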
The first three steps are called cache lookup. The cache lookup time can be reduced if
the steps (1) and (2) are done in parallel. This is possible if the number of bits in the
Page-offset portion of the Virtual Address is greater than or equal to the sum of the
number of Block-offset bits and the number of Index bits. The reason for this is that if
this condition holds, and taking into account that the Page-Offset bits are not translated,
the Index can be extracted directly from the Virtual address.
5.6 Line Replacement
When a cache miss occurs and a new line is brought into cache from main memory, then,
using bit selection, a set into which the line is to be placed is found. The lines of the
target set may be either in the Valid (i.e., already contain a line) or Invalid state. There
are two possibilities:
— there is an Invalid line in the set; then the newly brought line replaces an invalid
line;
— all lines in the set are valid; in this case one of the lines containing valid information
must be selected as the victim to be replaced with the newly brought line.
A method to select a line for replacement, also called an allocation method, is therefore
necessary when a new line is brought into the cache and all the lines in the target set are
valid. For a direct mapped cache, there is only one line in a set and there is no choice:
that line must be replaced by the new line. With set-associative and fully associative
organizations, there are several lines to choose from on a miss. There are three strategies
employed for selecting which block to replace:
• First-in-first-out (FIFO) — The block that has been used n unique accesses before
(where n is the associativity) is discarded, independent of its reference pattern in
the last n − 1 references. This method is simple, but it does not exploit temporal
locality.
• Random — To spread allocation uniformly, candidate blocks are randomly selected.
Usually a pseudorandomizing scheme is used for spreading data across a set of blocks.
• Least-recently used (LRU) — To reduce the chance of throwing out information that
will be needed soon, accesses to blocks are recorded. The block replaced is the one
that has been unused for the longest time. This makes use of a corollary of temporal
locality: If recently used blocks are likely to be used again, then the best candidate
for disposal is the least recently used.
Random replacement generally outperforms FIFO and is easier to implement in hardware.
LRU outperforms random but, as the number of blocks to keep track of increases, LRU
becomes increasingly expensive. Frequently, for high associativity, LRU is only approximated.
LRU implementation
For a set size of two, only a hot/cold (toggle) bit is required. For a set size n, n ≥ 4, one
creates an n × n upper-triangular matrix, whose elements are denoted by R(i, j),
with the diagonal and the elements below the diagonal equal to zero. When a line i,
1 ≤ i ≤ n, is referenced, row i of R is set to 1 and column i of R is set to 0. The LRU line is
the one whose row is entirely 0 and whose column is entirely 1. This
algorithm can be easily implemented in hardware and executes rapidly. The number of
storage bits required by matrix R is n(n − 1)/2, that is, the storage requirement increases
with the square of the set size. For n ≥ 8 this may be unacceptable. If this is considered
too expensive, then an approximation to LRU is implemented in the following way:
1. lines are grouped into p = n/2 pairs (i.e., one pair has two lines);
2. if p > 4, then the pairs are repeatedly grouped into other pairs until the number of
groups is equal to 4.
For example, if n = 8, the 8 lines of a set form a group of 4 pairs, and if n = 16, a
set is made up of 4 groups of two pairs each. The LRU approximation is based on LRU
management at the level of each group. All but the upper group contain 2 elements,
therefore only a hot/cold LRU bit is required per group. The upper group contains 4
elements, and it uses 6 bits for LRU. The LRU approximation works as follows:
1. the LRU group is selected from the upper group;
2. the LRU of the selected group (which contains two elements) is repeatedly selected
until a line is selected.
For example, if n = 8, first the LRU pair is selected from the 4 pairs, then the LRU line
of the pair is selected. If n = 16, first, the LRU group of two pairs is selected from the
4 groups, second the LRU pair is selected from the two pair group, third the LRU line of
that pair is selected. This algorithm requires only 10 bits for n = 8 rather than the 28 bits
needed for full LRU, and 18 bits for n = 16, as opposed to 120 bits needed for full LRU.
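Before turning to the FIFO and Random implementations, here is a Python sketch of the full (exact) LRU triangular-matrix method described above; it is a software model of the hardware mechanism, not an implementation taken from any particular machine:

```python
class TriangularLRU:
    """Exact LRU for one set of n lines, using the upper-triangular
    reference matrix R(i, j); only the n*(n-1)/2 bits above the diagonal
    carry information."""

    def __init__(self, n):
        self.n = n
        self.R = [[0] * n for _ in range(n)]   # R[i][j] used only for i < j

    def touch(self, i):
        """Record a reference to line i: set row i to 1 and column i to 0."""
        for j in range(i + 1, self.n):
            self.R[i][j] = 1
        for j in range(i):
            self.R[j][i] = 0

    def lru(self):
        """The LRU line is the one whose row is all 0 and whose column is all 1."""
        for i in range(self.n):
            row_zero = all(self.R[i][j] == 0 for j in range(i + 1, self.n))
            col_one = all(self.R[j][i] == 1 for j in range(i))
            if row_zero and col_one:
                return i

lru = TriangularLRU(4)
for line in (0, 1, 2, 3, 1):
    lru.touch(line)
print(lru.lru())   # -> 0, the least recently referenced line
```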
FIFO implementation
FIFO is implemented by keeping a modulo n counter (n is the associativity) for each
set; the counter is incremented with each replacement and points to the next line for
replacement.
Random implementation
One Random implementation is to keep a single modulo n counter, incremented in a
variety of ways: by each clock cycle, each memory reference, or each replacement anywhere
in the cache. Whenever a replacement is to occur, the value of the counter is used to
indicate the replaceable line within the set.
As is apparent from the implementation methods presented, Random provides the simplest
implementation and LRU requires the most complex one. Because random
replacement generally outperforms FIFO, the choice is to be made only between LRU and
Random.
5.7 Write Strategy
Reads are more frequent cache accesses than writes, because all instruction accesses are
reads and not every instruction writes to memory. Making the common case fast
(Amdahl’s Law) means optimizing caches for reads, but high-performance designs cannot
neglect the speed of writes.
The common case, Read, is made fast by reading the line at the same time that the tag is
read and compared, so the line read begins as soon as the block-frame address is available.
If the read is a hit, the block is passed on to the CPU immediately. If it is a miss, there
is no benefit — but also no harm. Write accesses pose several problems. First, the
processor specifies the size of the write and only that portion of a line can be changed. In
general, this means a Read-Modify-Write (RMW) sequence of operations on the line: read
the original line, modify one portion, and write the new block value. Moreover, modifying
a line cannot begin until the tag is checked to see whether the access is a hit. Because the
write cannot proceed in parallel with the tag check, writes normally take longer than reads. There are two basic
write policies:
• Write through (or store through) — The information is written to both the block in
the cache and the block in main memory.
• Write back (also called copy back or store in) — The information is written only to
the block in cache. The modified cache block (also called dirty block) is written into
main memory only when it is replaced.
Another categorization of writes is made with respect to whether a line is fetched when a
write miss occurs:
• Write allocate (also called fetch on write) — The line is loaded (this is similar to
a read miss), then the write-hit actions are performed with either write through or
write back.
• No write allocate (also called write around) — The line is modified in the lower level
memory and not loaded into the cache.
When the CPU is waiting for writes to complete during write throughs, or when a read
miss requires a modified line to be replaced for write back strategy, the CPU is said to
write stall. A common optimization to reduce write stalls is a write buffer. A write buffer
allows the CPU to continue while the memory is updated, and it can be used for both
write back and write through strategies:
• In a write-through cache, both for a hit and a miss, data must be written to lower-level
memory. When a write buffer is used, the CPU has to wait only until the buffer is
not full; then the data and its address are written into the buffer, and the CPU continues
working while the write buffer writes the data to memory.
• In a write back cache, the write buffer is used to store the dirty block (i.e., the
modified block) that must be replaced with another block brought from memory.
After the new data is loaded into the line, the CPU continues execution and in
parallel the buffer writes the dirty block in memory.
The problem with write buffers is that they complicate the handling of read misses, as
discussed in Section 6.7.
For the write back strategy it is necessary to keep track of whether a block in cache is
modified but not yet written into main memory (i.e., the block is dirty). For this purpose,
a feature called dirty bit is commonly used. This is a status bit associated with each cache
line that indicates whether or not the line was modified while in the cache. If it was not,
the line is said to be clean and it is not written back when replaced, since the lower level has
the same information as the cache. The dirty bit is also useful for the memory coherence
protocol (Section 9.4).
Both write back and write through have their advantages. With write through, read misses
don’t result in writes to the lower level as may happen with write back when a dirty line
is replaced. Write through keeps the cache and main memory consistent, that is, the main
memory has the current copy of the data. This is important for I/O and multiprocessors,
because it supports memory-coherence (Section 5.13). On the other hand, with write
back, write-hits occur at the speed of the cache memory, and multiple writes within a line
require only one write to main memory. Since not every write is going to memory, write
back uses less memory bandwidth, which is an important aspect in multiprocessors.
Even though either fetch on write or write around could be used with write through or
write back, generally write-back caches use fetch on write (hoping that subsequent writes
to that block will be captured by the cache) and write-through caches often use write
around (since subsequent writes to that block will still have to go to memory).
5.8 The Sources of Cache Misses
An intuitive model of cache behavior attributes all misses to one of three sources:
• Compulsory —The first access to a line is not in the cache, so the line must be
brought into the cache. These are also called cold start misses or first reference
misses.
• Capacity —If the cache cannot contain all the blocks needed during execution of a
program, capacity misses will occur due to blocks being discarded and later retrieved.
• Conflict —If the block placement strategy is set associative or direct mapped, conflict
misses (in addition to compulsory and capacity misses) will occur because a block
can be discarded and later retrieved if too many blocks map to its set. These are
also called collision misses or interference misses.
Having identified the three sources of misses, what can a computer designer do about
them?
For a given line size, the compulsory misses are independent of cache size. Compulsory
misses may be reduced by increasing the line size, but this can increase conflict misses.
There is little to be done about capacity misses, except to use a larger cache. When the
cache is much smaller than is needed for a program, and a significant percentage of the
time is spent moving data between the two levels of the hierarchy (i.e., cache and main
memory), the memory hierarchy is said to thrash. Thrashing means that, because so many
replacements are required, the machine runs close to the speed of the lower-level memory,
or maybe even slower, due to the miss overhead.
Conflict misses could be conceptually eliminated: fully associative placement avoids all
conflict misses. However, associativity is expensive in hardware and may slow access time
leading to lower overall performance. Conflict misses may also be decreased by increasing
the cache size.
This simple model of miss causes has some limits. For example, increasing cache size re-
duces capacity misses as well as conflict misses, since a larger cache spreads out references.
Thus, a miss might move from one category to the other as parameters change.
5.9 Line Size Impact on Average Memory-access Time
Let us analyze the effect of the block size on the Average memory-access time (equation
(15), Section 5.2) by examining the effect of the line size on the Miss rate and Miss penalty.
We assume that the size of the cache is constant.
Larger block sizes reduce compulsory misses, as the principle of spatial locality suggests.
At the same time, larger block sizes increase conflict misses, because they reduce the
number of blocks in the cache. Reasoning in terms of the two aspects of the principle of
locality, we say that increasing the line size lowers the miss rate until the reduction in misses
from larger blocks (spatial locality) is outweighed by the increase in misses as the number of
blocks shrinks (temporal locality), because larger block sizes mean fewer blocks in the cache.
Let us examine the effect of line size on the Miss penalty. The Miss penalty is the sum of the
access latency and the transfer time. The access-latency portion of the miss penalty is not
affected by the block size, but the transfer time does increase linearly with the block size.
If the access latency is large, initially there will be little additional miss penalty relative
to the access time as the block size increases. However, increasing the line size will eventually
make the transfer time an important part of the miss penalty.
Since a memory hierarchy must reduce the Average memory-access time, we are interested
not in the lowest Miss rate, but in the lowest Average access time. This is related to
the product of Miss rate by the Miss penalty, according to equation (15), Section 5.2.
Therefore, the “best” line size is not the one that minimizes the Miss rate, but the one that
minimizes the product of the Miss rate and the Miss penalty. Measurements on
different cache organizations and computer architectures have indicated that the lowest
average memory-access time is obtained for line sizes ranging from 8 to 64 bytes ([6],[20]).
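The trade-off can be illustrated with the short Python sketch below, which evaluates equation (15) for several line sizes; all miss rates and timing values are hypothetical and only reproduce the qualitative shape of the curve:

```python
# Hypothetical miss rates (fractions) measured for each line size in bytes.
miss_rate = {16: 0.040, 32: 0.031, 64: 0.027, 128: 0.026, 256: 0.028}

hit_time = 2          # clock cycles, assumed
access_latency = 40   # clock cycles to the first word, assumed
bandwidth = 8         # bytes transferred per clock cycle, assumed
word_bytes = 8        # assumed word size

for line, rate in miss_rate.items():
    penalty = access_latency + (line - word_bytes) / bandwidth   # equation (9)
    amat = hit_time + rate * penalty                             # equation (15)
    print(f"{line:4d}-byte line: average memory-access time = {amat:.2f} cycles")
# The minimum falls at an intermediate line size (64 bytes with these numbers),
# even though the miss rate alone keeps improving up to 128 bytes.
```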
Of course, overall CPU performance is the ultimate performance test, so care must be taken
when reducing Average memory-access time to be sure that changes to Clock cycle time
and CPI improve overall performance as well as average memory-access time.
5.10 Operating System and Task Switch Impact on Miss Rate
When the Miss rate for a user program is analyzed (for example, by using a trace-driven
simulation), one should take into account that the real miss rate for a running program,
including the operating system code invoked by the program, is higher. The miss rate can
be broken into three components:
• the miss rate caused by the user program;
• the miss rate caused by the operating system code;
• the miss rate caused by the conflicts between the user code and the system code.
In fact, the operating system has a greater impact on the actual miss rate. Due to task
switching, the miss rate of a program increases. Using the model of miss sources from
Section 5.8, the rationale is that task-switching increases the compulsory misses.
A possible solution to the miss rate due to task-switching is to use a cache that has been
split into two parts, one of which is used only by the supervisor and the other of which
is used primarily by the user state programs – this organization is called User/Supervisor
Cache. If the scheduler were programmed to restart, when possible, the same user program
running before an interrupt, then the user state miss rate would drop appreciably. Further,
if the same interrupts recur frequently, the supervisor state miss rate may also drop. The
supervisor cache may have a high miss rate due to its large working set. However, if the
total cache size is split evenly between the user and supervisor caches, then the miss rate in
the supervisor state is likely to be worse than with a unified cache, since the maximum
capacity is no longer available to the supervisor. Moreover, the information used by the
user and the supervisor is not entirely distinct, and cross-access must be permitted. This
introduces the coherence problem (Section 5.13) between the user and supervisor caches.
5.11 An Example Cache
We shall consider the organization of the VAX-11/780 cache as an example. The cache
contains 8-KB of data, is two-way set-associative with 8-byte blocks, uses random replace-
ment, write through with a one-word write buffer, and no write allocate on a write miss.
Figure 6 shows the organization of this cache.
A cache hit is traced through the steps labeled in Figure 6, the five steps being
shown as circled numbers. The address coming into the cache is divided into two fields:
the 29-bit block-frame address and 3-bit block-offset. The block-frame address is further
divided into an address tag and set index. Step 1 shows this division.
The set index selects the set to be tested to see if the block is in the cache. A set is one
block from each bank. The size of the index depends on cache size, block size, and set
associativity. In this case, a 9-bit index results:
Blocks/Bank = Cache size / (Block size ∗ Associativity) = 8192 / (8 ∗ 2) = 512 = 2^9
The index is sent to both banks (because of the 2-way set-associative organization) and
the address tags are read — step 2.
After reading an address tag from each bank, the tag portion of the block frame address
is compared to the tags. This is step 3 in the figure. To be sure the tag contains valid
information, the valid bit must be set, or the results of the comparison are ignored.
Assuming one of the tags does match, a 2:1 multiplexer (step 4) is set to select the block
from the matching set. It is not possible that both tags match because the replacement
algorithm makes sure that an address appears in only one block. To reduce the hit time,
the data is read at the same time as the address tags; thus by the time the block multiplexer
is ready, the data is also ready. This step is needed in set-associative caches but it can be
omitted from direct-mapped caches since there is no selection to be made. The multiplexer
used in step 4 is on the critical timing path, affecting the hit time.
Figure 6: Organization of a 2-way set-associative cache (the CPU address is split into Tag<20>, Index<9>, and Offset<3>; each of the two banks holds lines with Valid<1>, Tag<20>, and Data<64> fields; a 2:1 multiplexer selects the matching bank, and a write buffer sits between the cache and memory)
In step 5, the word is sent to the CPU. All five steps occur within a single CPU cycle.
On a miss, the cache sends a stall signal to the CPU telling it to wait, and two words (eight
bytes) are read from memory. That takes 6 clock cycles on the VAX 11/780, ignoring bus
interference. When the data arrives, the cache must pick a block to replace, and one block
is selected at random. Replacing a block means updating the data, the address, and the
valid bit. Once this is done, the cache goes through a regular read hit cycle and returns
the data to the CPU.
Writes involve additional steps. When the word to be written is in the cache, the
first four steps are the same. The next step is to write the data in the block, then write
the changed-data portion into the cache. Because no write allocate is used, on a write
miss the CPU writes “around” the cache to main memory and does not affect the cache.
Because write-through is used, the word is also sent to a one-word write-buffer. If the
write buffer is empty, the word and its address are written in the buffer and the cycle is
finished — the CPU continues working while the write buffer writes the word to memory.
If the buffer is full, the cache and CPU must wait until the buffer is empty.
5.12 Multiprocessor Caches
We have seen that increasing memory bandwidth and decreasing the access latency has
a great impact on system performance. For shared-memory multiprocessors, bandwidth
should be analyzed in a special context: several processors may try to access simultane-
ously a level of the memory hierarchy. This gives rise to the contention problem, that is,
the conflict between accesses from different processors.
The access latency is related to the existence of a gap between processor and memory
speeds; this aspect is present both in uniprocessor and multiprocessor architectures. Due
to this gap, the memory access time introduces memory-stalls in CPU time. When the
memory can’t keep up with the processor’s speed, it becomes a “bottleneck”.
A common approach to solving both access latency and contention problems that occur
in shared-memory multiprocessors is to use cache memories. Caches moderate a multiprocessor's
memory traffic by holding copies of recently used data and provide a low-latency
access path to the processor. Caches may be attached to each CPU —private caches— or
to the shared memory —shared cache. Private caches alleviate the contention problem:
each processor has a high-speed cache connected to it that maintains a local copy of a
memory block and is able to supply instructions and operands at the rate required by
each processor. Because of locality in the memory access patterns of multiprocessors, the
cache satisfies a large fraction of the processor accesses, thereby reducing both the average
memory latency and the communication bandwidth requirements imposed on the system’s
interconnection network. The architecture of a shared-memory system with private caches
is shown in Figure 7.
Figure 7: Shared-memory system with private caches (each of the n processors has a private cache; the caches are connected through an interconnection network to m shared-memory modules)
The key to using interconnection networks in multiprocessors is to send data over the
networks rather rarely. This is because reduced network traffic tends to reduce contention,
and, as the use of the network per processor diminishes, the number of processors that can
be served increases. A cache memory provides an effective means for maintaining local
copies of data and reduces the need to traverse a network for remote data. For example,
if a cache misses only 10 percent of the time, and remote fetches occur only on misses,
then the number of processors supportable on the interconnection network is ten times
greater than for a cacheless processor. The smaller the miss ratio, the greater the number
of supported processors.
Unfortunately, private caches give rise to the cache-coherence problem: multiple copies of
data may exist in different private caches. This represents the coherence problem among
private caches. That is, multiple copies of the same memory word must be kept consistent
in different caches in the context of sharing of writable data and of process migration from
processor to processor. Cache coherence schemes must be employed to maintain a uniform
state for each cached block of data: a store to a data word present in a different cache
must be reflected in all other caches containing the word either in the form of invalidation
or update.
5.13 The Cache-Coherence Problem
Because of caches, data can be found in memory or in the cache. As long as there is only
one CPU and it is the sole device changing or reading the data, there is little danger in
the CPU seeing the old or stale copy.
However, due to input/output, and to the existence of several private caches in multipro-
cessors, the opportunity exists for other devices to cause copies to be inconsistent (i.e.,
different values of the same data item) or for other devices to read the stale copies.
Preventing any device from accessing stale data is referred to as the cache-coherence
problem: a processor must have exclusive access when writing an object and must obtain
the most recent copy when reading an object.
This problem applies to I/O as well as to shared-memory multiprocessors. However,
unlike I/O, where multiple data copies are a rare event and can be avoided as shown in the
next subsection, a process running on multiple processors will want to have copies of the
same data in several caches. Performance of a multiprocessor program depends on the
performance of the system when sharing data.
5.13.1 Cache Coherence for I/O
Let A and B be two data items in memory, and A′ and B′ their cached copies. Let us
assume an initially coherent state, say:
A′ = A = 100 & B′ = B = 200
Inconsistency can occur in two cases. In one case, if the write strategy is write back and
the CPU writes, say, the value 133 into A′, then A′ will have the updated value, but the
value in memory is the old, stale value of 100. If an output to I/O is issued, it uses the
value of A from memory, and therefore it gets the stale data:
A′ = 133 & A = 100; A′ ≠ A (A stale)
In the other case, if the I/O system inputs, say, the value 331 into the memory copy of
B, then B′ in the cache will have the old, stale data:
B′ = 200 & B = 331; B′ ≠ B (B′ stale)
In both cases the memory coherence condition as defined by Censier and Feautrier (Section
4.5) is not met, i.e., the value returned on a READ or INPUT instruction is not the value
given by the latest WRITE or OUTPUT instruction with the same address.
An architectural solution to the cache-coherence problem caused by I/O is to make I/O
occur between the I/O device and the cache, instead of main memory. If input puts data
into the cache and output reads data from the cache, both I/O and the CPU see the same
data, and there is no problem. The difficulty with this approach is that it interferes with
the CPU: I/O competing with the CPU for cache access will cause the CPU to wait for the
I/O. Moreover, when the I/O device inputs data, it brings into the cache new information
that is unlikely to be accessed by the CPU soon, while replacing information in the
cache that may be needed soon by the CPU. For example, on a page fault, the I/O inputs
a whole page, while the CPU may need to access only a portion of the page.
The problem with the I/O system is to prevent stale-data while interfering with the CPU
as little as possible. Many systems, therefore, prefer that I/O occur directly to main
memory, acting as an I/O buffer. If a write-through cache is used, then memory has an
up-to-date copy of the information, and there is no stale-data issue for output to I/O. This
is the reason many machines use write-through. Input from I/O requires some overhead
in order to prevent the I/O from inputting data to a memory location that is cached. The software
solution is to guarantee that no blocks of the I/O buffer designated for input from I/O
are in the cache. This can be done in two ways. In one approach, a buffer page is marked
as noncacheable; the operating system always inputs to such a page. In another approach,
the operating system flushes, i.e., invalidates, the buffer addresses from the cache after the
input occurs. The hardware solution is to check the I/O addresses on input to see if they
are in the cache, using for example a snooping protocol (Section 9.4) and to invalidate the
cache lines whose addresses match I/O addresses. All these approaches can also be
used for I/O output with write-back caches.
5.13.2 Cache-Coherence for Shared-Memory Multiprocessors
Caches in a multiprocessor must operate consistently or coherently, that is, they must
obey the memory coherence condition for all copies of any data item. The coherence
problem is related to two types of events: sharing writable data among several processors,
or program migration between processors. In both cases, access to a stale copy of data
must be prevented.
The first type of coherence problem occurs when two or more processors try to update a
datum simultaneously. The datum must then be treated in a special way so that its value can be
updated successfully regardless of the instantaneous location of the most recent version of
the datum. To illustrate this, let’s examine two examples.
When a processor, let’s call it P1, updates a shared variable, the current value of the
variable moves from memory to P1. While P1 holds this value and updates it, another
processor, let’s call it P2, accesses shared memory. But the current value of the variable
is no longer in the shared memory, because it has moved to P1. However, P2’s request is
not redirected and it erroneously goes to the normal place for storing the shared variable.
This example assumes that P1 updates the shared variable and immediately returns it
to memory, but in a cache-based system, P1 may hold the variable indefinitely in the
cache, so that the failure exhibited in the example becomes much more likely. The failure
interval is not limited to a very brief update period, but it can happen for any access to
the variable in shared memory while that variable is held in P1’s cache.
There is a second failure mode for shared writable data that has to be considered too.
If P2 copies a shared variable to its cache and updates that variable both in cache and
in shared memory, then problems can arise if the values in cache and in shared memory
do not track each other identically. Suppose, for example, that after P2 has updated the
variable both in its cache and in shared memory, processor P1 requests the value of the
variable. If P1 has already a copy of the variable in its cache, it ignores altogether the
change in the variable from the update performed by P2. Thus, processor P1 accesses a
stale copy of the data held in cache, instead of accessing the fresh data held in shared
memory.
With respect to the second type of failure —associated with program migration—, let us
suppose that processor P1 is running a program that leaves in the cache the value 0 for
variable X. Then the program shifts to a different processor P2 and writes a new value
of 1 for variable X in the cache of that processor. Finally, the program shifts back to
processor P1 and attempts to read the current value of X. It obtains the old, stale value
of 0 when it should have obtained the new, fresh value 1 for X. Note that X does not
have to be a shared variable for this type of error to occur. The cause of this mode of
failure ([7]) is that the program's footprint —that is, data associated with the program—
was not completely flushed from the cache when the program moved from P1 to P2, so
that when the program returned to P1 it found stale data there.
The protocols that maintain cache coherence for multiple processors are called cache-
coherence protocols. This subject has been studied by many authors, among them being
Censier and Feautrier ([1]), Dubois and Briggs ([3],[4]), Archibald and Baer ([5]), Agarwal
([10]), and Lenoski, Laudon, Gharachorloo, Gupta, and Hennessy ([17]), who have explored a
variety of cache-coherence protocols and examined their performance impact. Chapter 9
covers issues related to cache coherence. The implementation of cache in multiprocessors
may enforce coherence either totally by the hardware, or may enforce coherence only at
explicit synchronization points.
5.14 Cache Flushing
When a processor invalidates data in its cache, this is called flushing or purging. Sometimes
(Sections 5.13.1 and 9.6), it is necessary to invalidate the contents of many lines in the cache,
i.e., to set the invalid bit for those lines. If this is done one line at a time, the required time
would become excessive. Therefore, an INVALIDATE instruction should be available in the
processor if a coherence scheme based on flushing is used. If one chooses to flush the entire
cache, then resettable static random-access memories for Valid bits can be used allowing
the INVALIDATE to be accomplished in one or two clocks.
6 IMPROVING CACHE PERFORMANCE
6.1 Cache Organization and CPU Performance
The goal of the cache memory designer is to improve performance by decreasing the
CPU execution time. As equations (14) and (15) (Section 5.2) show, the CPU time is
not linearly dependent on the Average memory access time, but it depends on the two
components of the Average memory access time:
Hit time; this must be small enough not to affect the CPU clock rate and CPIExecution;
Miss rate ∗ Miss penalty; this product affects the number of memory-stall clock cycles
and therefore increases the CPI.
After making some easy decisions in the beginning, the architect faces a threefold dilemma
when attempting to further reduce average access time by changing the cache organization
or size:
Increasing the line size further does not improve the average access time, because the lower
miss rate no longer offsets the higher miss penalty;
Making the cache bigger would make it slower, jeopardizing the CPU clock rate;
Making the cache more associative would also make it slower, again jeopardizing the
CPU clock rate.
Example
This example shows that a two-way set-associative cache may decrease the
average memory-access time as compared to a direct mapped cache of the
same capacity, but this does not mean better performance, because the CPU
time is larger for the two-way set-associative memory. We assume that the
clock cycle time is 20ns, the average CPI is 1.5 and there are 1.3 memory
references per instruction. The cache size is assumed to be 64 KB, the Miss
rate of the direct mapped cache is 3.9% and the Miss rate of the two-way set-
associative cache is 3.0%. The hit time of the two-way set-associative cache
is larger and this causes an 8.5% increase of the clock cycle time. The Miss
penalty is considered to be 200 ns for either cache organization. Let us first
compute the average memory access time for the two cache organizations using
equation (15), Section 5.2:
Average memory-access time (1-way) = 20 + 0.039 ∗ 200 = 27.8 ns    (17)
Average memory-access time (2-way) = 20 ∗ 1.085 + 0.030 ∗ 200 = 27.7 ns    (18)
Let us compute also the performance of each organization, as given by equation
(12), Section 5.2. We substitute 200ns for (Miss penalty ∗ Clock cycle time)
for either cache organization, even though in practice it must be rounded to
an integer number of clock cycles. Because the Clock cycle time corresponding
to a two-way set-associative cache is 20 ∗ 1.085 ns, we obtain:
CPU time (1-way) = IC ∗ (1.5 ∗ 20 + 1.3 ∗ 0.039 ∗ 200) = 40.1 ∗ IC    (19)
CPU time (2-way) = IC ∗ (1.5 ∗ 20 ∗ 1.085 + 1.3 ∗ 0.030 ∗ 200) = 40.4 ∗ IC    (20)
The result shows that even though the direct-mapped cache has a greater miss rate and
a greater average access time than the 2-way set-associative cache, it leads to slightly
better overall performance.
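The arithmetic of equations (17)-(20) can be checked with the following short Python sketch; the values are exactly those assumed in the example (the report rounds 40.14 to 40.1 and 40.35 to 40.4):

```python
# All times in nanoseconds; the 200-ns Miss penalty already absorbs the
# (Miss penalty * Clock cycle time) product, as in the substitution above.
clock_cycle = 20.0
cpi_exec = 1.5
refs_per_instr = 1.3
miss_penalty = 200.0

for name, miss_rate, clock_stretch in (("1-way", 0.039, 1.0),
                                       ("2-way", 0.030, 1.085)):
    amat = clock_cycle * clock_stretch + miss_rate * miss_penalty
    cpu_time = (cpi_exec * clock_cycle * clock_stretch
                + refs_per_instr * miss_rate * miss_penalty)
    print(f"{name}: AMAT = {amat:.2f} ns, CPU time = {cpu_time:.2f} ns per instruction")
# -> 1-way: AMAT = 27.80 ns, CPU time = 40.14 ns per instruction
# -> 2-way: AMAT = 27.70 ns, CPU time = 40.35 ns per instruction
```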
There are some other methods to improve Hit time, Miss rate, and Miss penalty. The fol-
lowing sections in this chapter will present the most important performance improvement
methods.
6.2 Reducing Read Hit Time
As mentioned in Section 5.5, the read hit time can be reduced if the cache lookup performs
the virtual-to-real address translation through the TLB in parallel with the set selection.
However, this limits the size of the cache. Let p be the number of bits in the memory
address that represent the page offset, j the number of bits for the byte offset within
a line, and k the number of set-index bits (i.e., there are 2^k sets). For a TLB lookup to be made in
parallel with set selection, the following condition must be met:
j + k ≤ p    (21)
This limits the cache size, C, to the value:
C = n ∗ 2^(j+k) ≤ n ∗ 2^p    (22)
where n is the degree of associativity. For a direct mapped cache the limitation is that
its size can be no bigger than the page size. Increasing the associativity is a solution, but
higher associativity slows the cache.
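A small Python sketch of the constraint expressed by equations (21) and (22); the page size used is an assumption for illustration:

```python
def max_cache_size(page_bytes, associativity):
    """Largest cache (in bytes) for which the TLB lookup can proceed in
    parallel with set selection: j + k <= p implies C = n * 2**(j+k) <= n * 2**p."""
    return associativity * page_bytes

# Hypothetical 4-KB pages.
print(max_cache_size(4096, associativity=1))   # -> 4096: a direct-mapped cache
                                               #    can be no larger than one page
print(max_cache_size(4096, associativity=4))   # -> 16384
```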
One scheme for fast cache hits without this size restriction is to use a more deeply pipelined
memory access in which the TLB lookup is one step of the pipeline. The TLB can easily be pipelined
because it is a distinct unit that is smaller than the cache. Pipelining the TLB doesn't
change memory latency but achieves higher memory bandwidth based on the efficiency of
the CPU pipeline.
An alternative would be to eliminate the TLB and its associated translation time from the
cache access path by storing in the Tag memory the virtual addresses. Such caches are
called virtual address caches or virtual caches. There are three major problems with virtual
caches that, in our opinion, make virtual caches not a very good choice for multiprocessors.
The first is that every time a process is switched, the virtual addresses refer to different
physical addresses, requiring the cache to be flushed (or purged). But purging the cache
causes an increase in miss rate. A solution to this problem is to extend the width of the
address Tag with a process-identifier tag (PID), to have the operating system assign PIDs
to processes and to flush the cache only when a PID is reused. Another problem is that
the user programs and the operating system may use two different virtual addresses for the
same physical address, that is, a data item may have different virtual addresses that are
called synonyms or aliases. The effect of synonyms in a virtual cache is that two (or more)
copies of the same data are present in the cache and thus a coherence problem occurs:
if one copy is modified, the other will have the wrong value. Hardware schemes, called
anti-aliasing, that guarantee every cache line a unique physical address may be employed
to solve this problem, but software solutions are less expensive. The idea of the software
solution is to force aliases to share a number of address bits so that the cache cannot
accommodate duplicates of aliases. For example, for a direct-mapped cache of 256 KB,
that is, 2^18 bytes, if the operating system enforces that all aliases are identical in the last
18 bits of their addresses, then no two aliases can be in the cache simultaneously. The third
problem is that I/O typically uses physical addresses and thus requires mapping to virtual
addresses to interact with a virtual cache in order to maintain coherence.
6.3 Reducing Read Miss Penalty
Because reads dominate cache accesses, it is important to make read misses fast. There
are several methods to reduce the read miss penalty. In the first method, called fetch
bypass or out-of-order fetch, the missed word is requested first, regardless of its position
in the line; data requested from memory is transmitted in parallel to the CPU and the cache,
and the CPU waits only for the requested data. In the second method, called early
restart, the line that contains the requested data is brought from memory starting with
the left-most byte, but the CPU continues execution as soon as the requested data arrives.
With fetch bypass, the missed word is requested first from memory and sent to the CPU as
soon as it arrives, bypassing the cache; the CPU continues execution while filling the rest
of the words in the block. Because the first word requested by the CPU may not be the
first word of a line, this strategy is also called out-of-order fetch or wrapped fetch. Usually,
the cache is loaded in parallel when the processor reads data from main memory (i.e., fetch
bypass with simultaneous cache fetch) in order to overlap fetching of the specified data for
CPU and for cache. When the transfer begins with a byte that is not the left-most byte
of the line, the transfer should wrap around the right-most byte of the line and transfer
the left-most bytes of the line that have been skipped in the first place.
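A Python sketch of the wrap-around transfer order used by out-of-order (wrapped) fetch; the line and word sizes are hypothetical:

```python
def wrapped_fetch_order(miss_offset, line_bytes, word_bytes):
    """Return the byte offsets of the words of a line in the order transferred:
    start at the word containing the missed byte, run to the right-most word,
    then wrap around to pick up the skipped left-most words."""
    words_per_line = line_bytes // word_bytes
    first = miss_offset // word_bytes
    return [((first + i) % words_per_line) * word_bytes
            for i in range(words_per_line)]

# Hypothetical 32-byte line, 4-byte words, miss on the byte at offset 20.
print(wrapped_fetch_order(20, 32, 4))
# -> [20, 24, 28, 0, 4, 8, 12, 16]
```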
These methods reduce the read miss penalty by obviating the need for the
processor to wait for the cache to load the entire line. Unfortunately, not all the words of
a line have an equal likelihood of being accessed first. If that were true, with a line size
of L bytes, the average line entry point would be L/2. However, due to sequential access,
the left side of the line is more likely to be accessed first. For example, Hennessy and
Patterson have determined [6] for some architecture that the average line entry point for
instruction fetch is at 5.6 bytes from the left-most byte in a 32-byte line. The left-most word
of a block is most likely to be accessed first, due to sequential accesses from prior blocks
on instruction fetches and sequentially stepping through arrays for data accesses. This
effect of spatial locality limits the performance improvement obtained with out-of-order
fetch. Spatial locality also affects the efficiency of early restart, because it is likely that the
next cache request will be to the same line. The reduction in the read miss penalty obtained
with these methods should be compared to the increased complexity incurred by handling
another request while the rest of one line is being filled.
6.4 Reducing Conflict Misses in a Direct-Mapped Cache
As described in Section 5.8, conflict misses may appear when two addresses map into the
same cache set. Consider referencing a cache with two addresses, ai and aj. Using the bit
selection method described in Section 5.4, these two addresses will map into the same set
if and only if they have identical Index fields. Denoting by b the bit selection operation
performed on the addresses to obtain the index, then the two addresses will map into the
same set iff:
b[ai] = b[aj] (23)
Two addresses that satisfy this equation are called conflicting addresses because they may
potentially cause conflicts. Assume the following access pattern:
ai aj ai aj ai aj ai aj . . .
where addresses ai and aj are conflicting addresses. A 2-way set-associative cache will not
suffer a miss if the processor issues this address pattern, because the data referenced by ai
and aj can co-reside in a set. In contrast, in a direct-mapped cache, the reference to aj
will result in an interference (or conflict) miss because the data from ai occupies the same
selected line.
The percentage of misses that are due to conflicts varies widely among different applica-
tions, but it is often a substantial portion of the overall miss rate.
6.4.1 Victim Cache
The victim cache scheme has been proposed by Jouppi [12]. A victim cache is a small,
fully associative cache that provides some extra cache lines for data removed from the
direct-mapped cache due to misses. Thus, for a reference stream of conflicting addresses,
such as
ai aj ai aj ai aj ai aj . . .,
the second reference, aj, will miss and force the data indexed by ai out of the set. The
data that is forced out is placed in the victim cache. Consequently, the third reference, ai,
will not require accessing the main memory because the data can be found in the victim
cache. Fetching a conflicting datum with this scheme requires two or three clock cycles:
1. the first clock cycle is needed to check the primary cache;
2. the second cycle is needed to check the victim cache;
3. a third cycle may be needed to swap the data in the primary cache and victim cache
so that the next access will likely find data in the primary cache;
This scheme has several disadvantages: it requires a separate, fully-associative cache to
store the conflicting data. Not only does the victim cache consume extra area, but it can
also be quite slow due to the need for an associative search and for the logic to maintain a
least-recently-used replacement policy. For adequate performance, a victim cache sizeable
enough to store all conflicting data blocks is required. If
the size of the victim cache is fixed relative to the primary direct-mapped cache, then it
is not very effective at resolving conflicts for large primary caches.
6.4.2 Column-Associative Cache
The challenge is to find a scheme that minimizes the conflicts that arise in direct-mapped
accesses by allowing conflicting addresses to dynamically choose alternate mapping func-
tions, so that most of the conflicting data can reside in the cache. At the same time,
however, the critical hit access path (which is an advantage of direct-mapped organiza-
tion) must remain unchanged. The method presented is called Column-Associativity and
was invented by A. Agarwal and S.D. Pudar [11].
The idea is to emulate a 2-way set-associative cache with a direct-mapped cache by map-
ping two conflicting addresses to different sets instead of referencing another line in the
same set as the 2-way set-associativity does. Therefore, conflicts are not resolved within
a set but within the entire cache, which can be thought of as a column of sets —thus the
name column associativity. The method uses two mapping functions (also called hashing
functions) to access the cache. The first hashing function is the common bit selection,
that is, an address ai is mapped into the set with the number:
b[ai] (24)
The second hashing function is a modified bit-selection, which gives the same value as
the bit-selection function except for the highest-order bit, which is inverted. We call this
hashing function bit flipping and denote it by f. For example, if b[a] = 010, then applying
the bit flipping function to the address a yields f[a] = 110. Therefore, the function f
applied to an address aj will always give a set number which is different from that given
by the function b:
b[aj] ≠ f[aj] (25)
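As an illustration, a minimal C sketch of the two hashing functions, assuming a 3-bit index
(so that, as in the example above, b[a] = 010 yields f[a] = 110); the line offset is omitted
for brevity.

#include <stdint.h>

#define INDEX_BITS  3u                        /* hypothetical 3-bit index, 8 sets */
#define NUM_SETS    (1u << INDEX_BITS)

/* Bit selection (the address is treated as a line address for brevity). */
static uint32_t b(uint32_t addr) { return addr % NUM_SETS; }
/* Bit flipping: invert the highest-order index bit. */
static uint32_t f(uint32_t addr) { return b(addr) ^ (NUM_SETS >> 1); }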
The scheme works as follows:
1. the bit selection function b is applied to a memory address ai. If b[ai] indexes to
valid data, a first-time hit occurs, and there is no time penalty;
2. if the first access has missed, then the bit flipping function f is used to access the
cache. If f[ai] indexes to valid data, then a second-time hit occurs and data is
retrieved.
3. if a second-time hit has occurred, then the two cache lines are swapped so that the
next access will likely result in a first-time hit.
4. if the second access misses, then data is retrieved from main memory, placed in the
cache set indexed by f[ai], then it is swapped with the data indexed by b[ai] with
the goal of making the next access likely to be a first-time hit.
The first and second steps each require one clock cycle, while swapping requires two clock
cycles. The second-time hit, including swapping, is then four cycles, but it can be reduced
to only three cycles using an extra buffer for the cache: given this buffer, the swap need
not involve the processor, which may be able to do other useful work while waiting for the
cache to become available again. If this is the case half of the time, then the time wasted
by a swap is only one cycle. Therefore, it can be considered that a swap adds only one
cycle to the execution time, and hence the second-time hit is 3 clock cycles.
Using two hashing functions mimics 2-way set-associativity because for two conflicting
addresses, ai and aj, rehashing aj with f resolves the conflict with a high probability: from
equations (23) and (25) it follows that the function f applied to aj will give a set different
from b[ai]:
b[ai] = b[aj] ≠ f[aj] (26)
The difference is that a second-time hit takes three clock cycles, while in a 2-way set-
associative cache the two lines of a set can be retrieved in a clock cycle. However, the
clock cycle of the two-way set-associative cache is longer than the clock cycle of the direct-
mapped cache.
A problem that must be solved for column-associative caches is the problem of possible
incorrect hits. Consider two addresses, ai and ak, that map with bit-selection to indexes
that differ only in the highest-order bit. In this case, the index obtained by applying bit-
selection mapping to one address is the same as the index obtained by applying bit-flipping
mapping to the other address:
b[ak] = f[ai] and b[ai] = f[ak] (27)
These two addresses are distinct, but they may have identical tag fields. If this is the case,
when a rehash occurs for the address ai and data addressed by ak is already in cache at
location b[ak], then the bit-flipping mapping f[ai] results in a hit with a data block that
should only be accessed by b[ak]. For example, if
b[ak] = 110, b[ai] = 010, and Tag[ak] = Tag[ai] (28)
and assuming that data line addressed by ak is cached in the set with the index b[ak], then
when the address ai is presented to the cache, this address will be rehashed to the same
set as ak (i.e., f[ai] = 110) and will cause a second-time hit (a false hit) because the two
addresses have the same Tag. This is incorrect, because a data-line must have a one-to-one
correspondence with a unique memory address. The solution to this problem is to extend
the Tag with the highest-order bit of the index field. In this case, the rehash with f[ai]
will correctly fail because information about ai and ak having different indexes is present
in the Tag. In this way, the data line stored in the set with the number b[ak] = f[ai] is
put into correspondence with a unique index, and hence a unique address.
Another problem is that storing conflicting data in another set is likely to result in the
loss of useful data, and this is referred to as clobbering. The source of this problem
is that a rehash is attempted after every first-time miss, which can replace potentially
useful data in the rehashed location, even when the primary location had an inactive
line. Clobbering may lead to an effect called secondary thrashing that is presented in the
following paragraph.
Consider the following reference pattern:
ai aj ak aj ak aj ak . . . ,
where the addresses ai and aj map into the same cache location with bit selection, and ak
is an address which maps into the same location with bit-flipping, that is:
b[ai] = b[aj], b[ak] = f[ai] and f[ak] = b[ai] (29)
After the first two references, the data referenced by aj (which will be called j for brevity)
and the data i will be in the non-rehashed and rehashed locations, respectively (because of
swapping). When the next address, ak, is encountered, the algorithm attempts to access
b[ak] (bit selection is tried first), which contains the rehashed data i; when the first-time
miss occurs, the algorithm tries to access f[ak] (bit flipping is tried second), which results
in a second-time miss and the clobbering of the data j. This pattern continues as long
as aj and ak alternate: the data referenced by one of them is clobbered as the inactive
data block i is swapped back and forth but never replaced. This effect is referred to as
secondary thrashing.
The solution to this problem is finding a method to inhibit a rehash access if the location
reached by the first-time access itself contains a rehashed data block, that is, with the
previous notation, when the location referenced by ak with bit-selection (b[ak]) already
contains rehashed data (data i is rehashed to f[ai]). This condition can be satisfied by
adding to each cache set an extra bit that indicates whether the set is a rehashed location,
that is, whether the data in that set is indexed by f[a]. This bit that indicates a rehashed
location is called the rehash bit, denoted by Rbit, and it makes it possible to test whether
a first-time miss occurs on rehashed data and thus to avoid rehashing a first-time miss to a
set that contains rehashed data. Therefore, the scheme for column associativity is the
following (step 2 of the basic scheme is modified to avoid clobbering):
1. the bit-selection hashing function b is applied to a memory address a. If b[a] indexes
to valid data, a first-time hit occurs, and there is no time penalty;
2. if the first access is a miss, then the action taken depends on the value of the rehash
bit of the set indexed by b[a]:
(a) if the rehash bit has been set to one, then no rehash access will be attempted,
but the data retrieved from memory will be placed in the location obtained by
bit-selection. Then the rehash bit for that set will be reset to zero to indicate
that the data in this set is indexed by bit-selection and the access is completed.
(b) if the rehash bit is already a zero, then the bit-flipping function f is used to
access the cache. If f[a] indexes to valid data, then a second-time hit occurs
and data is retrieved;
3. if a second-time hit has occurred, then the two cache lines are swapped so that the
next access will likely result in a first-time hit.
4. if the second access misses, then data is retrieved from main memory, placed in the
cache set indexed by f[a], then it is swapped with the data indexed by b[a] with the
goal of making the next access likely to be a first-time hit.
Note that if a second-time miss occurs, then the set whose data will be replaced is again a
rehashed location, as desired. At start-up (or after a cache flush), all of the empty cache
locations should have their rehash bits set to one. The reason that this scheme correctly
replaces a location that has the Rbit set to one immediately after a first-time miss is based
on the relationship between bit selection and bit-flipping mapping: given two addresses
ai and ak, if f[ai] = b[ak] then f[ak] = b[ai]. Therefore, if ai accesses a location using b[ai]
whose rehash bit is set to one, then there are only two possibilities:
1. The accessed location is an empty location from start-up, or
2. there exists a non-rehashed location at f[ai] (that is, b[ak]) which previously encoun-
tered a conflict and placed the data in its rehashed location, f[ak].
In both cases replacing the location reached during first-time access that has the Rbit set
to one is a good action, because data at location b[ai] is less useful than data at location
f[ai] = b[ak].
The rehash bits limit the rehash accesses and the clobbering effect, and lower the proba-
bility of secondary thrashing. For the mentioned reference stream:
ai aj ak aj ak aj ak . . . ,
the third reference accesses b[ak], but it finds the rehash bit set to one, because this loca-
tion contains the data referenced by ai. Therefore, the data i is replaced immediately by
k, the desired action. Even though the column-associative cache can present secondary
thrashing if three or more conflicting addresses alternate, as in the pattern:
ai aj ak ai aj ak ai aj . . . ,
this case is much less probable than two alternating addresses.
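The rehash-bit algorithm can be summarized in the following C sketch; the cache geometry,
the extended-tag computation, and the memory-fetch and swap helpers are hypothetical
placeholders, and the timing and swap buffer are not modeled.

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 256u                       /* hypothetical geometry: 256 sets, 32-byte lines */

typedef struct {
    bool     valid;
    bool     rbit;                          /* rehash bit: data here is indexed by f[] */
    uint32_t tag;                           /* tag extended with the high-order index bit */
    uint8_t  data[32];
} Line;

static Line cache[NUM_SETS];                /* at start-up all rehash bits should be one (not shown) */

static uint32_t b(uint32_t a)    { return (a / 32u) % NUM_SETS; }
static uint32_t f(uint32_t a)    { return b(a) ^ (NUM_SETS >> 1); }
static uint32_t xtag(uint32_t a) { return a / (32u * (NUM_SETS / 2u)); }   /* extended tag */

static void fetch_from_memory(uint32_t a, Line *l) { (void)a; (void)l; }   /* placeholder */
static void swap_lines(Line *x, Line *y) { Line t = *x; *x = *y; *y = t; }

/* Returns the set that finally holds the requested line. */
static uint32_t access_line(uint32_t a) {
    uint32_t i1 = b(a), i2 = f(a);
    if (cache[i1].valid && cache[i1].tag == xtag(a))
        return i1;                                      /* first-time hit */
    if (cache[i1].rbit) {                               /* first-time miss on a rehashed location: */
        fetch_from_memory(a, &cache[i1]);               /* replace it directly, no rehash access   */
        cache[i1].valid = true; cache[i1].rbit = false; cache[i1].tag = xtag(a);
        return i1;
    }
    if (cache[i2].valid && cache[i2].tag == xtag(a)) {  /* second-time hit: swap the lines */
        swap_lines(&cache[i1], &cache[i2]);
        cache[i1].rbit = false; cache[i2].rbit = true;
        return i1;
    }
    fetch_from_memory(a, &cache[i2]);                   /* second-time miss */
    cache[i2].valid = true; cache[i2].tag = xtag(a);
    swap_lines(&cache[i1], &cache[i2]);
    cache[i1].rbit = false; cache[i2].rbit = true;
    return i1;
}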
6.5 Reducing Read Miss Rate
When the processor makes a memory reference that misses in the cache, the line
corresponding to that memory address is fetched from memory. If a line is fetched only
when it is referenced by the processor, this is called demand fetching; that is, no line is
fetched in advance from memory.
When a line is fetched from memory and brought into cache before it is requested by
the processor, one calls this a prefetch operation. The purpose of prefetch is to bring in
advance information that will soon be needed by the processor, and in this way to decrease
the miss rate. A prefetch algorithm guesses what information will soon be needed and
fetches it. When a prefetch algorithm decides to fetch a line from memory, it should
interrogate the cache to see if that line is already resident in cache. This is called prefetch
lookup and may interfere with the actual cache lookups generated by the processor. Given
that a prefetch may require replacing an existing line, this interference consists not only
in cycles lost by the CPU when waiting for the prefetch lookup cache accesses, or in cache
cycles used to bring in the prefetched line and perhaps to move a line out of the cache, but
also in a potential increase in miss ratio when lines that are more likely to be referenced
are expelled by a prefetch. This problem is called memory pollution, and its impact depends
on the line size. Small line sizes generally result in a benefit from prefetching, while large
line sizes lead to the ineffectiveness of prefetch. The reason for this is that when the line
is large, a prefetch brings in a great deal of information, much or all of which may not be
needed, and removes an equally large amount of information, some of which may still be
in use.
The fastest hardware implementation (which is a major design criterion) is provided by
prefetching the line that immediately follows a referenced line. That is, if line
i is referenced, only line i + 1 is considered for prefetching. This method is known as one
block lookahead (OBL).
A prefetch may potentially be initiated for every memory reference and there are two
strategies to decide when to do prefetching:
1. always prefetch — means that on every memory reference, an access for line i (for all
i) implies a prefetch access for line i + 1.
2. prefetch on misses — implies that a reference to a line i causes a prefetch to line
i + 1 if and only if the reference to line i was a miss.
Prefetching has several effects: it (presumably) reduces the miss ratio, increases the mem-
ory traffic and introduces cache lookup accesses. Always prefetch provides a greater de-
crease in miss ratio than prefetch on misses, but it also introduces greater memory and
cache overhead. The advantage of prefetching depends very strongly on the effectiveness
of the implementation. Prefetching should not use too many cache cycles if interference
with normal program accesses to the cache is to remain acceptable. This can be
accomplished in several ways:
1. by instituting a second, parallel port to the cache;
2. by deferring prefetches until spare cache cycles are available;
3. by not repeating recent prefetches: this can be done by remembering the addresses
of the last n prefetches in a small auxiliary cache, testing a potential prefetch against
this buffer and not issuing the prefetch if the address is found.
Another scheme that may help is buffering the transfers between the cache and the main
memory required by a prefetch and making them during otherwise idle cache cycles. The
memory traffic caused by prefetch seems unavoidable, but it is tolerable for one-block
lookahead.
A prefetch operation may be thought of as a nonblocking read of two lines, that is, when
a read miss occurs, the processor does not have to wait until both lines are fetched from
memory, but it can proceed immediately after the requested line has been brought into
the cache. Thereafter, while the processor proceeds, the cache is fetching the adjacent
line from memory. One block lookahead prefetch with a line size L outperforms demand
fetching with a line size 2L, due to the overlapping of the line fetch and CPU execution.
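The two prefetch policies described above may be sketched as follows in C; cache_lookup,
cache_fill, and issue_prefetch are hypothetical placeholders for the cache interface and
are stubbed out here only so that the fragment is self-contained.

#include <stdbool.h>
#include <stdint.h>

typedef enum { PREFETCH_ALWAYS, PREFETCH_ON_MISS } PrefetchPolicy;

/* Hypothetical cache interface, stubbed out for illustration. */
static bool cache_lookup(uint32_t line)   { (void)line; return false; }
static void cache_fill(uint32_t line)     { (void)line; }
static void issue_prefetch(uint32_t line) { (void)line; }   /* queued, served in idle cache cycles */

static void reference_line(uint32_t line, PrefetchPolicy policy) {
    bool hit = cache_lookup(line);
    if (!hit)
        cache_fill(line);                          /* demand fetch on a miss */
    /* One-block lookahead: only line i + 1 is ever considered. */
    if (policy == PREFETCH_ALWAYS || (policy == PREFETCH_ON_MISS && !hit)) {
        if (!cache_lookup(line + 1))               /* prefetch lookup: skip if already resident */
            issue_prefetch(line + 1);
    }
}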
6.6 Reducing Write Hit Time
Write hits usually take more than one cycle because the Tag must be checked before
writing the data, and because when the processor modifies only a portion of a line, that
line must first be read from cache in order to get the unmodified portion. There are two
ways to do faster writes: pipelining the writes, and subblock placement for write-through
caches.
6.6.1 Pipelined Writes
This technique pipelines the writes to cache. The two steps of a cache write operation
—tag comparison and write data— are pipelined in a two-stage scheme:
• the first stage compares the Target Address and the Tags;
• the second stage makes a write to the cache using the address and data from the
previous write hit.
The idea is that when the first stage compares the Tag with the Target address, the
second stage accesses the cache using the address and data from the previous write. This
scheme requires that the Tag and Data can be addressed independently, that is, they must
be stored in separate memory arrays. Therefore, when the CPU issues a write and the
first stage produces a hit, the CPU does not have to wait for the write to the cache that
will be made in the second stage. In this way, a write to the cache takes only one clock
cycle. Moreover, this technique does not affect read hits: the second stage of a write hit
occurs during the first stage of the next write hit or during a cache miss.
6.6.2 Subblock Placement
This scheme may be applied to direct mapped caches with write-through policy. The
scheme maintains a valid bit on units smaller than the full block, called subblocks. The
valid bits specify some parts of the block as valid and some parts as invalid. A match of
the tag doesn’t mean the word is necessarily in the cache, as the valid bits for that word
must also be on. For caches with subblock placement a block can no longer be defined
as the minimum unit transferred between cache and memory, but rather as the unit of
information associated with an address tag.
Subblock placement was invented with a twofold goal: to reduce the long miss penalty of
large blocks (since only a part of a large block needs to be read on a miss) and to reduce
the tag storage for small caches. The discussion below demonstrates the usefulness of this
method for writes. Subblock placement may be used for writes by extending it in the
following way: a word is always written into the cache no matter what happens with the
tag match, the valid bit is turned on, and then the word is sent to memory. This trick
improves both write hits and misses and works in all cases, as shown below:
1. Tag match and valid bit already set. Writing the block was the proper action, and
nothing was lost by setting the valid bit on again.
2. Tag match and valid bit not set. The tag match means that this is the proper block;
writing the data into the block makes it appropriate to turn the valid bit on.
3. Tag mismatch. This is a miss and will modify the data portion of the block. However,
as this is a write-through cache, no harm was done; memory still has an up-to-date
copy of the old value. Only the tag to the address of the write need be changed
because the valid bit has already been set. If the block size is one word and the
STORE instruction is writing one word, then the write is complete. When the block
is larger than a word or if the instruction is a byte or halfword store, then either the
rest of the valid bits are turned off (allocating the subblock without fetching the
rest of the block) or memory is requested to send the missing part of the block (i.e.,
write allocate).
This scheme can’t be used with a write-back cache because the only valid copy of the data
may be in the block, and it could be overwritten before checking the tag.
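The three cases above may be summarized in the following C sketch of a write with
subblock placement in a write-through cache; the block layout and the write_through
helper are hypothetical, and the no-fetch-on-write option is chosen for case 3.

#include <stdbool.h>
#include <stdint.h>

#define WORDS_PER_BLOCK 4u                 /* hypothetical: one valid bit per word */

typedef struct {
    uint32_t tag;
    bool     valid[WORDS_PER_BLOCK];       /* one valid bit per subblock (word) */
    uint32_t data[WORDS_PER_BLOCK];
} Block;

static void write_through(uint32_t addr, uint32_t value) { (void)addr; (void)value; }  /* placeholder */

/* Write of one word into a write-through cache with subblock placement. */
static void cache_write(Block *blk, uint32_t tag, unsigned word, uint32_t value, uint32_t addr) {
    blk->data[word] = value;               /* write regardless of the tag comparison  */
    if (blk->tag != tag) {                 /* case 3: tag mismatch                    */
        blk->tag = tag;                    /* re-tag the block for the new address    */
        for (unsigned i = 0; i < WORDS_PER_BLOCK; i++)
            blk->valid[i] = false;         /* no-fetch-on-write: invalidate the rest  */
    }
    blk->valid[word] = true;               /* cases 1 and 2: (re)set the valid bit    */
    write_through(addr, value);            /* memory always keeps an up-to-date copy  */
}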
6.7 Reducing Write Stalls
Write stalls may occur on every write with a write-through strategy, or when a dirty line
is replaced with a write-back strategy; they can be avoided using a write buffer (described in Section
5.7) of a proper size. Write buffers, however, introduce additional complexity for handling
misses because they might have the updated value of a location needed on a read miss.
For write through, the simplest solution to this problem is to delay the read until
all the information in the write buffer has been transmitted to memory, that is, until the
write buffer is empty. But, since a write buffer usually has room for a few words, it will
almost always have data not yet transferred, that is, it will not be empty, which incurs
an increase in the Read Miss Penalty. This increase may reach as much as 50% for a
four-word buffer as stated by Hennessy and Patterson in [6]. An alternative approach is
to check the contents of the write buffer on a read miss, and if there are no conflicts and
the memory system is available, let the read miss continue.
For write back, the buffer (whose size is one line in this case) may contain a dirty line that
has been purged from the cache to make room for a new line but has not yet been written
back to memory, which therefore still contains the old data. When a read miss occurs, there are also
two approaches: either to wait until the buffer is empty, or to check if it contains the
referenced line and to continue with the memory access if there is no conflict.
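For the write-through case, the conflict check on a read miss may be sketched as follows;
the four-entry buffer layout is a hypothetical illustration.

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4u                              /* a four-word write buffer */

typedef struct { bool valid; uint32_t addr; } WbEntry;
static WbEntry write_buffer[WB_ENTRIES];

/* Returns true if the buffer holds a pending write to this address (a conflict);
   in that case the read miss must wait for the buffer to drain (or forward the
   buffered value). If false, the read miss may proceed to memory immediately. */
static bool wb_conflicts(uint32_t addr) {
    for (unsigned i = 0; i < WB_ENTRIES; i++)
        if (write_buffer[i].valid && write_buffer[i].addr == addr)
            return true;
    return false;
}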
6.8 Two-level Caches
6.8.1 Reducing Miss Penalty
The gap between the CPU and main memory speeds is increasing due to CPUs getting
faster and main memories getting larger, but slower relative to the faster CPUs. The
question arising is whether the cache should be made faster to keep pace with the speed
of CPU or larger to reduce the miss rate. These two conflicting choices can be reconciled
by adding another level of cache between the original cache and main memory:
• the first-level cache is small enough to match the clock cycle time of the CPU;
• the second-level cache is large enough to capture many accesses that would go to
main memory, that is, misses in the first-level cache.
This is a two-level cache, with the first-level cache closer to the CPU. The average memory-
access time for a two-level cache may be computed through the steps below, where the
subscripts L1 and L2 refer to the first-level and the second-level cache respectively:
Average memory-access time = Hit time_L1 + Miss rate_L1 ∗ Miss penalty_L1    (30)
and
Miss penalty_L1 = Hit time_L2 + Miss rate_L2 ∗ Miss penalty_L2    (31)
therefore,
Average memory-access time =
    Hit time_L1 + Miss rate_L1 ∗ (Hit time_L2 + Miss rate_L2 ∗ Miss penalty_L2)    (32)
For a two-level cache, one should make a distinction between the local miss rate and the
global miss rate:
• Local miss rate — is the number of misses in the cache divided by the total number
of memory accesses to that cache; these are Miss rate_L1 and Miss rate_L2;
• Global miss rate — is the number of misses in the cache divided by the total number
of memory accesses generated by the CPU; this is Miss rate_L1 ∗ Miss rate_L2.
The second-level cache reduces the miss penalty of the first-level cache (equation (31)) and
allows the designer to optimize the second-level cache for lowering this parameter, while
the first-level cache is optimized for low hit time.
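As a numeric illustration of equations (30)-(32), the following C program computes the
average memory-access time and the global miss rate for a set of hypothetical parameter
values (they are assumptions for the example, not measurements from this report).

#include <stdio.h>

int main(void) {
    /* Hypothetical parameters, in clock cycles and miss fractions. */
    double hit_l1 = 1.0,  miss_rate_l1 = 0.05;
    double hit_l2 = 10.0, miss_rate_l2 = 0.20;   /* local miss rate of L2 */
    double miss_penalty_l2 = 100.0;

    double miss_penalty_l1 = hit_l2 + miss_rate_l2 * miss_penalty_l2;   /* eq. (31) */
    double amat = hit_l1 + miss_rate_l1 * miss_penalty_l1;              /* eq. (30)/(32) */
    double global_miss_rate = miss_rate_l1 * miss_rate_l2;

    printf("Miss penalty L1 = %.1f cycles\n", miss_penalty_l1);   /* 30.0 */
    printf("Average memory-access time = %.2f cycles\n", amat);   /* 2.50 */
    printf("Global miss rate = %.3f\n", global_miss_rate);        /* 0.010 */
    return 0;
}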
6.8.2 Second-level Cache Design
The most important difference between the two levels of the cache is that the speed of
the first-level cache affects the clock rate of the CPU, while the speed of the second level
cache only affects the miss penalty of the first-level cache (equation (31)). Hence, for
the second-level cache there is a clear design goal: lower the miss penalty (equation (12),
Section 5.2), where for a two-level cache the miss penalty is Miss penaltyL1 (equation
(31)).
Capacity of second-level cache
The size of the second-level cache is chosen to be bigger than that of the first-level
cache, because the information in the first-level cache should also be present in the
second-level cache. If the second-level cache is just a little bigger than the first-level cache,
then its local miss rate will be high; if it is much larger than the first-level cache (this
usually means above 256 KB), then the global miss rate is about the same as for a single-
level cache of the same size. Typical values for second-level cache sizes are from 256 KB
to 4 MB.
Associativity of second-level cache
Unlike first-level cache, where the associativity is limited by the impact on clock cycle
time (Section 6.1), high associativity for second-level cache may be helpful because here
the sum expressed by equation (31) matters. Therefore, as long as increasing associativity
has a small impact on the second-level hit time, Hit time_L2, but a great impact on
Miss rate_L2, it is worthwhile to increase it. However, for very large caches the benefits
of associativity diminish because the larger size has eliminated many conflict misses, that
is, the decrease in Miss rate_L2 no longer outweighs the increase in Hit time_L2.
Line size of second-level cache
As shown in Section 5.9, increasing block size reduces the compulsory misses as long as
spatial locality holds, but does not preserve temporal locality, leading to an increase in
conflict misses. Because second-level caches are large, increasing the line size has a small
effect on conflict misses, which favors larger line size. Moreover, if the access time of the
main memory is relatively long, then the effect of large line size on the Miss penalty (i.e.,
increased transfer time) is tolerable. Therefore, second-level caches have larger line sizes,
usually from 32 to 256 bytes.
6.9 Increasing Main Memory Bandwidth
The Miss penalty is the sum of the Access latency and Transfer time. As shown in Section
5.9, increasing the line size may decrease the Miss ratio, but the increase of line size is
limited by the associated increase in Miss penalty. The organization of main memory
has a direct impact on the Miss penalty because an improvement of the main memory
bandwidth (i.e., decrease in transfer time) allows cache line size to increase without a
corresponding increase in the Miss penalty.
6.9.1 Wider Main Memory
Let us consider a basic main memory organization in which a word has b bytes, described
by the following parameters:
m1 — is the number of clock cycles to send the address to main memory;
m2 — is the number of clock cycles for the access time per word;
m3 — is the number of clock cycles to send a word of data;
w — is the width of the memory (and of the bus) in words;
We can compute the memory bandwidth Bw (i.e., the number of data bytes transferred
in a clock cycle) corresponding to a bus width of w words:
B_w = (b ∗ w) / (m1 + m2 + m3)    (33)
We can compute the access latency for one word (assuming that a line contains a multiple
of w words):
Access latency_w = b / B_w = (m1 + m2 + m3) / w    (34)
and the Miss penalty time is then computed using equation (9), Section 5.1:
Miss penalty_w = L ∗ (m1 + m2 + m3) / (b ∗ w)    (35)
If the memory width were one word (i.e., w = 1), then the Miss penalty would be:
Miss penalty_1 = L ∗ (m1 + m2 + m3) / b    (36)
Therefore, increasing the width of main memory w times increases the Memory bandwidth
(equation (33)) and decreases the Miss penalty (equations (35) and (36)) by the factor w,
allowing larger line sizes. A wider bus poses, however, some problems. First, because
the CPU accesses the cache one word at a time, a multiplexer is needed between the cache
and the CPU —and the multiplexer is on the critical timing path. Another problem is
that, because usually memories have error correction, writing only a portion of a word
imposes a Read-Modify-Write (RMW) sequence in order to compute the error correction
code. When error correction is done over the full width, the frequency of partial block
writes will increase as compared to a one-word width, and hence the frequency of RMW
sequences will increase. This can be remedied if the error correction codes are associated
with every 32 bits of the bus width, because most writes are that size.
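The effect of widening the memory may be illustrated numerically with equations (33),
(35) and (36); the cycle counts m1, m2, m3 and the 32-byte line are assumed values chosen
only for the example.

#include <stdio.h>

int main(void) {
    double m1 = 4, m2 = 24, m3 = 4;      /* hypothetical cycle counts */
    double b = 4;                        /* bytes per word            */
    double L = 32;                       /* line size in bytes        */

    for (int w = 1; w <= 4; w *= 2) {    /* memory/bus width in words */
        double bandwidth    = (b * w) / (m1 + m2 + m3);          /* eq. (33) */
        double miss_penalty = L * (m1 + m2 + m3) / (b * w);      /* eq. (35); w=1 gives eq. (36) */
        printf("w=%d: bandwidth=%.2f bytes/cycle, miss penalty=%.0f cycles\n",
               w, bandwidth, miss_penalty);
    }
    return 0;
}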
6.9.2 Interleaved Memory
Another way to increase the memory bandwidth is to organize the memory chips in banks
that are one word wide. In this way the width of the bus to the cache is still one word, but
sending addresses to the banks simultaneously permits them all to read simultaneously.
If L is the line size in bytes, and b the number of bytes per word, then the number of
memory banks, Nbanks is:
Nbanks = L/b (37)
The mapping of addresses to banks determines the interleaving factor. Interleaved memory
normally means that word interleaving is used, that is, the following mapping function:
Bank number = (Memory address) modulo (L/b)    (38)
Word interleaving optimizes sequential memory accesses and is ideal for read miss penalty
reduction because when a word misses in the cache, the line that must be fetched from
memory is built up from words with sequential addresses which are hence located in
different banks. Write-back caches make writes as well as reads sequential, getting even
more efficiency from interleaved memory.
With the notation from Subsection 6.9.1, and taking into account that the memory is
one word wide, that all the L/b words in a line are accessed in parallel, and that only the
data sending is serial, the Memory bandwidth of interleaved memory, B_i, is:
B_i = L / (m1 + m2 + m3 ∗ (L/b))    (39)
The access latency has the expression:
Access latency_i = b / B_i = (b/L) ∗ (m1 + m2 + m3 ∗ (L/b))    (40)
The Miss penalty for interleaved memory is computed using equation (9), Section 5.1:
Miss penalty_i = m1 + m2 + m3 ∗ (L/b)    (41)
Interleaved memory provides a reduction of the Miss penalty as compared to the
Miss penalty_1 (equation (36)) of the basic memory organization:
Miss penalty_1 − Miss penalty_i = (L/b − 1) ∗ (m1 + m2)    (42)
In an interleaved memory the maximum number of banks is limited by memory-chip cost
constraints. For example, consider a main memory of capacity 16-MB with 4-byte words
(i.e., the main memory is 4 mega words). With a one-word-wide memory organization,
and using 4-Mbit DRAM chips, the number of memory chips needed is 32. If the line
size is chosen to be 16 words (i.e., 64 bytes) then, using interleaved memory, the number
of banks in main memory must be 16 (equation (37)), and the capacity of a bank is 256
Kwords (1MB). Therefore, a bank contains 32 chips of 256-Kbit DRAMs, and a total of
512 DRAM chips are necessary, as opposed to only 32 chips of 4-Mbit DRAM needed in
the simplest design.
In conclusion, the number of banks increases linearly with the line size (equation (37)) and
the maximum number of banks is limited by the cost of the memory. The availability of
high-capacity DRAM chips causes an interleaved memory to be built from more memory
chips than a one-word memory (because the banks use smaller-capacity DRAM chips).
This leads to an increase in the cost of the interleaved memory and limits the maximum
number of memory banks that can be economically used, and thus limits the line size.
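The following C fragment reproduces the chip counts of the example above and compares
the miss penalties of equations (36) and (41); the cycle counts are again assumed values
used only for illustration.

#include <stdio.h>

int main(void) {
    double m1 = 4, m2 = 24, m3 = 4;          /* hypothetical cycle counts      */
    int    b  = 4;                           /* bytes per word                 */
    int    L  = 64;                          /* line size: 16 words = 64 bytes */
    int    nbanks = L / b;                                        /* eq. (37): 16 banks   */

    double mp_simple      = (double)L / b * (m1 + m2 + m3);       /* eq. (36): 512 cycles */
    double mp_interleaved = m1 + m2 + m3 * (L / b);               /* eq. (41): 92 cycles  */
    double saving         = (L / b - 1) * (m1 + m2);              /* eq. (42): 420 cycles */

    /* Chip counts for the 16-MB main memory of the example above. */
    long mem_bits          = 16L * 1024 * 1024 * 8;
    long chips_one_word    = mem_bits / (4L * 1024 * 1024);       /* 4-Mbit DRAMs: 32 chips      */
    long chips_per_bank    = (mem_bits / nbanks) / (256L * 1024); /* 256-Kbit DRAMs: 32 per bank */
    long chips_interleaved = nbanks * chips_per_bank;             /* 512 chips in total          */

    printf("banks=%d, miss penalty: simple=%.0f, interleaved=%.0f, saving=%.0f cycles\n",
           nbanks, mp_simple, mp_interleaved, saving);
    printf("chips: one-word-wide=%ld, interleaved=%ld\n", chips_one_word, chips_interleaved);
    return 0;
}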
7 SYNCHRONIZATION PROTOCOLS
7.1 Performance Impact of Synchronization
In parallel applications, synchronization points are used for interprocess synchronization
and mutually exclusive access to shared data. According to the frequency of synchroniza-
tion points, applications fall into three broad categories:
1. coarse-grained applications —are applications in which parallel processes synchronize
infrequently;
2. medium-grained applications —are applications in which parallel processes synchro-
nize with a moderate frequency;
3. fine-grained applications —are applications in which parallel processes synchronize
frequently.
This aspect of the application behavior is referred to as the granularity of the parallelism.
Synchronization involves accesses to synchronization variables. These variables are prone
to becoming hot spots — variables frequently accessed by many processors. This in turn
causes memory traffic and may degrade the system performance up to the point of satu-
ration.
The inefficiency caused by synchronization is twofold: waiting times at synchronization
points and the intrinsic overhead of the synchronization operations. Reducing waiting time
is the responsibility of programmers (the characteristics of a parallel application determine
the amount of synchronization points and the waiting time), and reducing synchronization
overhead is a task for the computer architect.
The memory-consistency model (Chapter 8) influences the amount of synchronization
activity: in machines exhibiting the weak or release consistency models of behavior the
frequency of synchronization points is greater than in machines supporting sequential
consistency. The memory-consistency model and the cache-coherence protocol should be
taken into account when selecting how to implement a synchronization method.
7.2 Hardware Synchronization Primitives
7.2.1 TEST&SET(lock) and RESET(lock)
A lock is a variable on which two atomic operations can be performed:
1. lock (also called acquire) — a process locks a lock when the lock is free, that is,
a zero, and the process sets the lock to locked, that is, to one. A lock operation
is accomplished by reading the lock variable (which is a shared variable) until the
value zero is read, and then setting the lock to one (using for example an atomic
RMW operation).
2. unlock (also called release) — a process unlocks a lock when it frees the lock, that
is, it sets the lock variable to zero. An unlock operation is always associated with a
write to the lock variable.
Locks are useful in providing mutually exclusive access to shared variables. Two synchro-
nization primitives, called TEST&SET and RESET, are a common means of implementing a
lock. The TEST&SET primitive (which performs an RMW operation) provides an atomic test of
a variable and sets the variable to a specified value. The semantics of TEST&SET is:
TEST&SET(lock)
{ temp = lock; lock = 1;
return temp; }
The value returned by the operation is the value before setting the variable. An acquire
operation is done usually by having the software repeat the TEST&SET until the returned
value is zero. This repeated check of a variable until it reaches a desired state is called busy
waiting or continual retry. Busy waiting ties up the processor in an idle loop, increases
the memory traffic and may lead to contention problems on the interconnection network.
This type of lock that forces the process to “spin” on the CPU while waiting for the lock
to be released is called a spin-lock. The RESET primitive is used to unlock (release) a lock
and has the semantics:
RESET(lock)
{ lock = 0; }
To avoid spinning, interprocessor interrupts are used. A lock that relies on interrupts
instead of spinning is called a sleep-lock or suspend-lock. A sleep-lock is implemented as
follows: whenever a process fails to acquire the lock, it records its status in one field
of the lock and disables all interrupts except interprocessor interrupts. When a process
releases the lock, it signals all waiting processes through an interprocessor interrupt. This
mechanism prevents the excessive interconnection traffic caused by busy-waiting but still
consumes processor cycles: the processor is no longer spinning, but it is sleeping!
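As a software analogue, the TEST&SET and RESET pair corresponds to the C11 atomic_flag
operations; the sketch below is an illustration in terms of standard C atomics, not of the
hardware primitive itself.

#include <stdatomic.h>

static atomic_flag lock_var = ATOMIC_FLAG_INIT;

/* Acquire: spin on the atomic test-and-set until the previous value was 0 (free). */
static void acquire(void) {
    while (atomic_flag_test_and_set(&lock_var))
        ;   /* busy waiting: each retry is a Read-Modify-Write on the lock */
}

/* Release: the RESET primitive, i.e. a plain write of 0 to the lock. */
static void release(void) {
    atomic_flag_clear(&lock_var);
}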
7.2.2 FETCH&ADD
The FETCH&ADD primitive provides an atomic increment (or decrement) operation on
uncached memory locations. Let x be a shared-memory word and a its increment (or decre-
ment, if negative). When a single processor executes the FETCH&ADD on x, the semantics
are:
FETCH&ADD(x, a)
{ temp = x; x = temp + a;
return temp; }
When N processes attempt to execute FETCH&ADD on the same memory word x simultane-
ously, the memory is updated only once, by adding the sum of the N increments, and each
of the N processes receives a returned value that corresponds to an arbitrary serialization
of the N requests. From the processor point of view, the result is similar to a sequential
execution of N FETCH&ADD instructions, but it is performed in one memory operation. The
success of this primitive is based on the fact that its execution is distributed in the inter-
connection network using a combining interconnection network (Subsection 7.4.1) that is
able to combine more accesses to a memory location into a single access. In this way, the
complexity of an N-way synchronization on the same memory word is independent of N.
This method for incrementing and decrementing has benefits compared with using a
normal variable protected by a lock to achieve the atomic increment or decrement, because
it involves less traffic, smaller latency, and decreased serialization. The serialization of this
primitive is small because it is done directly at the memory site. This low serialization is
important when many processors want to increment a location, as happens when getting
the next index in a parallel loop. A multiprocessor using a combining network and this
primitive is the IBM RP3 computer ([3]). FETCH&ADD is useful for implementing several
synchronization methods such as barriers, parallel loops, and work queues.
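On the software side, the semantics of FETCH&ADD correspond to the C11 atomic_fetch_add
operation; the sketch below shows its use for obtaining the next index of a parallel loop
(the combining network itself is a hardware property and is not modeled).

#include <stdatomic.h>

static atomic_int next_index = 0;   /* shared loop counter */

/* Each process atomically fetches its loop index and advances the counter,
   mirroring temp = x; x = temp + a; return temp; with a = 1. */
static int get_next_index(void) {
    return atomic_fetch_add(&next_index, 1);
}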
7.2.3 Full/Empty bit primitive
Under this primitive, a memory location is tagged as empty or full.
LOADs of such words succeed only after the word is updated and tagged as full. After a
successful LOAD, the tag is reset to empty. Similarly, the STORE on a full memory word can
be prevented until the word has been read and the tag cleared.
This primitive relies on busy-waiting, and memory cycles are wasted on each trial: when
a process attempts to execute a LOAD on an empty-tagged location, the process will spin on
the CPU while waiting for the location to be tagged as full. By analogy with the locks,
one says that the process spin-locks on the Full/Empty bit.
This mechanism can be used to synchronize processes, since a process can be made to wait
on an empty memory word until some other process fills it.
7.3 Synchronization Methods
In this section we present methods for achieving mutual exclusion and conditional synchro-
nization.
7.3.1 LOCK and UNLOCK operations
A LOCK operation on a lock variable changes the value of the lock variable from zero to one.
If several processes attempt to execute a LOCK, only one process is allowed to successfully
execute this operation and to proceed. All other processes that attempt to execute a
LOCK will be waiting until the process that has acquired the lock releases it via an UNLOCK
operation. An UNLOCK operation sets the lock variable to zero, signaling that the lock
is free. An UNLOCK operation may be implemented using the RESET(lock) primitive. A
LOCK operation may be implemented using the TEST&SET primitive, as shown in the code
segment:
LOCK(lock)
{ repeat
while(LOAD(lock)==1) ; // spin-lock with read cycle //
until (TEST&SET(lock)==0) ;} // test free lock and LOCK it//
A LOCK operation may be used to gain exclusive access to a set of data (Subsection 7.3.3).
7.3.2 Semaphores
A semaphore is a nonnegative integer variable, denoted by s, that can be accessed by two
atomic operations, denoted by P and V . The semantics of the P and V operations are:
P(s)
{ if (s > 0) then s = (s − 1);
else
{ Block the process and append it to the waiting list for s;
Resume the highest priority process in the READY LIST; }
}
V (s)
{ if (waiting list for s empty) then s = (s + 1);
else
{ Remove the highest priority process blocked for s;
Append it to the READY LIST; }
}
In these two algorithms shared lists are consulted and modified, namely, the READY LIST
and the waiting list for s. The READY LIST is a data structure containing the descriptors
of processes that are runnable. These accesses, as well as the test and modification of s, have to
be protected by locks or FETCH&ADDs associated with semaphores and with the lists.
Semaphores that have possible values 0 and 1 are called binary semaphores. Those that
can take the values 0 to n are called general or counting semaphores. When a semaphore
has a value greater than 0, it is defined open; otherwise, it is closed. Counting semaphores
are useful when there are multiple instances of a shared resource.
In practice, P and V are processor instructions or microcoded routines, or they are oper-
ating system calls to the process manager. The process manager is the part of the system
kernel controlling process creation, activation, and deletion, as well as management of
the locks. Because the process manager can be called from different processors at the
same time, its associated data structures must be protected. Semaphores are particularly
well adapted for synchronization. Unlike spin-locks and sleep-locks, semaphores are not
wasteful of processor cycles while a process is waiting, but their invocation requires more
overhead. Note that locks are still necessary to implement semaphores. A drawback of
semaphore-based synchronization is that it puts the responsibility for controlling access
on the programmer or the parallelizing compiler, who must decide when to synchronize
and on what conditions.
7.3.3 Mutual Exclusion
Mutually exclusive access to shared variables is achieved by enforcing sequential execution
of critical sections of different processes. The common methods used to control access to
the critical section are the locks and the semaphores.
Mutually exclusive access using locks
If the machine supports an atomic TEST&SET primitive, mutually exclusive access can be
implemented as follows:
while (TEST&SET(lock)==1) ; // spin–lock with Read-Modify-Write cycles //
. . . . . . . . . // execute critical section //
RESET(lock); // unlock the lock and exit critical section //
This segment of code protects access to a critical section via a spin-lock. A variant of
implementation uses the LOCK operation (Section 7.3.1) to control access to the critical
section:
repeat
while(LOAD(lock)==1) ; // spin-lock with read cycle //
until (TEST&SET(lock)==0) ; // LOCK the lock//
. . . . . . // critical section//
RESET(lock); // unlock the lock and exit critical section//
The performance of these two approaches is examined in Section 7.5.
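The second variant corresponds to what is often called a test-and-test-and-set lock; a
minimal C11 rendering, given here only as an illustration of the access pattern, is:

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool lock_var = false;

/* Second variant above: spin with plain reads on the (cached) lock and attempt
   the atomic exchange only when the lock appears to be free. */
static void lock_acquire(void) {
    for (;;) {
        while (atomic_load(&lock_var))          /* spin-lock with read cycles only */
            ;
        if (!atomic_exchange(&lock_var, true))  /* TEST&SET: returns the old value */
            return;                             /* old value was 0: lock acquired  */
    }
}

static void lock_release(void) {
    atomic_store(&lock_var, false);             /* RESET */
}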
Mutually exclusive access using semaphores
To provide mutual exclusion, a binary semaphore s (Subsection 7.3.2) is associated with a
critical section and is used to guarantee sequential access to it. Before entering the critical
section, each process must execute P(s); upon exiting, it must execute V (s).
7.3.4 Barriers
Barriers are used for a conditional synchronization that requires all synchronizing
processes to reach a synchronization point called a barrier before any process is allowed to
continue. The BARRIER operation “joins” a number of parallel processes: all processes
synchronizing at a barrier must reach the barrier before any one of them can continue.
If there are N processes that must reach the barrier, then a barrier variable, denoted by
count —that is used as a process counter and has been initialized to zero— is used. The
BARRIER operation is defined as follows:
BARRIER(N)
{ count = count + 1;
if (count ≥ N) then
{ Resume all processes on barrier queue;
Reset count; }
else Block task and place in barrier queue;
}
The first N − 1 processes that execute the BARRIER operation are blocked and are put
in a barrier queue that can be implemented with a Full/Empty tagged word in which
identifiers of blocked processes are written. The processes that are blocked spin-lock on
the Full/Empty bit. Upon execution of BARRIER by the N-th process, all N processes are
ready to resume; consequently, this process writes into the tagged memory location and
wakes up all blocked processes. A variant implementation of the BARRIER operation also uses
a barrier variable that is incremented by each process when it reaches the synchronization
point but, instead of using the Empty/Full bit, tests a barrier flag. After incrementing
the barrier variable, each processor spin-locks on a barrier flag. The Nth processor that
reaches the barrier increments the barrier variable to its final value, N, and writes into the
barrier flag, thereby releasing the spinning processors. This method has the disadvantage
of relying on busy waiting. Another variant is to use a sleep-lock for the barrier flag. The
atomicity of the increment and test operations on the barrier variable must be enforced
by some hardware synchronization mechanism, such as a lock. With regard to the barrier
variable, the FETCH&ADD primitive is a good choice —if available—, because it provides
the least contention thanks to the combining property of FETCH&ADD.
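A minimal C11 sketch of the barrier-flag variant is given below; it uses atomic_fetch_add
in the role of FETCH&ADD and adds a per-process sense flag (an assumption beyond the
description above) so that the barrier can be reused across phases.

#include <stdatomic.h>
#include <stdbool.h>

static atomic_int  count = 0;        /* barrier variable, incremented with fetch-and-add */
static atomic_bool flag  = false;    /* barrier flag the processes spin on               */

/* Sense-reversing barrier: each process passes a pointer to its own local sense,
   initialized to false, so that consecutive barrier episodes do not interfere. */
static void barrier(int n, bool *local_sense) {
    *local_sense = !*local_sense;                     /* phase for this episode          */
    if (atomic_fetch_add(&count, 1) == n - 1) {       /* last arrival                    */
        atomic_store(&count, 0);                      /* reset the barrier variable      */
        atomic_store(&flag, *local_sense);            /* release the spinning processes  */
    } else {
        while (atomic_load(&flag) != *local_sense)    /* spin-lock on the barrier flag   */
            ;
    }
}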
7.4 Hot Spots in Memory
When accesses from several processors are concentrated on data from a single memory
module over a short duration of time, the access pattern is likely to cause hot spots in
memory. A hot spot is a memory location repeatedly accessed by several processors. Syn-
chronization objects such as locks and barriers, and loop index variables for parallel loops
are examples of shared variables that can become hot spots. Hot spots can significantly
reduce the memory and network throughput because they do not allow the parallelism of
the machine architecture to be exploited as fully as is possible under uniform memory access
patterns. Hot spots can cause severe congestion in the interconnection network, which
degrades the bandwidth of the shared-memory system.
7.4.1 Combining Networks
A widespread scheme to avoid memory contention is the combining network. The idea
is to incorporate some hardware in the interconnection network to trap and combine data
accesses when they are fanning in to a memory module that contains the shared variable.
By combining data accesses in the interconnection network the number of accesses to the
shared variable is decreased. The extra hardware required for this scheme is estimated in
[15] to increase the switch size and/or cost by a factor between 6 and 32 for combining
networks consisting of 2 × 2 switches. The extra hardware also tends to add extra net-
work delay which will penalize most of the ordinary data accesses that do not need these
facilities, unless the combining network is built separately.
7.4.2 Software Combining Trees
A software tree can be used to eliminate memory contention due to the hot-spot variable.
The idea is similar to the concept of a combining network, but it is implemented in
software instead of hardware. A software combining tree is used to do the combining of
data accesses. This technique, which was proposed by Yew et al. in [16], does not
require expensive hardware combining, while providing comparable performance. The
principle of a software combining tree is first illustrated for a barrier variable.
Let us assume a multiprocessor architecture with N processors and N memory modules.
We define the fan-in of the accesses to a memory location as the number of accesses to that
location. We assume that a hot-spot with a fan-in of N exists in the system, for example
when a barrier variable (Section 7.3.4) is used to make sure that all processors are finished
with a given task before proceeding. Therefore, the barrier variable is addressed by the
N processors, causing a hot-spot with a fan-in of N. The barrier variable initially has the
value zero, and each of the N processors has to increment this variable so that when all
processors are finished, the value will be N. Assuming that N = 1000, we have a hot-spot
with 1000 accesses. The software combining tree replaces the single variable with a
tree of variables, with each variable in a different memory module. For the example given,
if we decide to reduce the fan-in to 10, then for each group of 10 processors, a variable
is created in a different memory module, for a total of 100 variables (level 1 of the tree
in Figure 8).
[Figure 8: Software Combining Tree. A tree of variables spread over different memory
modules; level-1 nodes fan in to level-2 nodes, which fan in to the root at the former
hot-spot location.]
Therefore, we partition the processes into N/10 = 100 groups of 10, with
each group sharing one variable at level 1 of the tree. Then we partition the 100 variables
into 10 groups of 10, with each group sharing a variable at level 2 in the tree; thus
another 10 variables are created in different memory modules. Finally, with the 10 variables in
level 2 we associate a variable in another memory module that is the root of the tree and
corresponds to the old hot-spot. When the last process in each group increments its
variable, it then increments the variable in the parent node. We have therefore increased
the number of variables from one to:
100 + 10 + 1 = 111
and the number of memory accesses from 1000 to:
1000 + 100 + 10 = 1110
but instead of having one hot spot with 1000 accesses we have 111 hot spots with only 10
accesses each. This technique results in a significant improvement in throughput rate and
bandwidth even if we account for the increase in total accesses.
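The bookkeeping above can be checked with a small C program; for N = 1000 and a fan-in
of 10 it reproduces the 111 variables and 1110 accesses computed above (the fan-in is a
parameter of the illustration).

#include <stdio.h>

int main(void) {
    int n = 1000;   /* number of processes incrementing the barrier variable */
    int k = 10;     /* chosen fan-in per tree node                           */
    int vars = 0, accesses = 0;
    /* Walk the tree level by level until a single root variable remains. */
    for (int width = n; width > 1; width = (width + k - 1) / k) {
        int nodes = (width + k - 1) / k;  /* variables created at this level        */
        vars += nodes;
        accesses += width;                /* each member increments its group node  */
    }
    printf("variables: %d, accesses: %d\n", vars, accesses);   /* 111 and 1110 */
    return 0;
}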
The idea of a combining tree can be applied to implementing conditional synchronization
as well. Because this operation involves processors that are waiting for a shared variable to
change in some way, and the variable will be changed by another processor, the combining
tree is built here by assigning one processor to each node in the tree. Thus, each processor
monitors the state of its node by continually reading the node. When the processor
monitoring the root node detects the change in its node, it in turn changes the state of its
children’s nodes, and so on until all processors have detected the change and are able to
proceed with the next task.
7.5 Performance Evaluation
The access patterns to locations used for synchronization may cause a great performance
penalty. Spin-locks (which rely on busy waiting) should ideally meet the following criteria:
• minimum amount of traffic generated while waiting,
• low latency release of a waiting process, and
• low latency acquisition of a free lock.
We shall examine how locks satisfy these criteria with reference to the use of
locks for achieving mutually exclusive access (Subsection 7.3.3) in a shared-memory system
with caches, and show that the repeated execution of the TEST&SET operation by a process
while the lock is held by another process causes the ping-pong effect.
Let us see what happens when mutual exclusion is achieved using locks (Subsection 7.3.3).
Assume there are N processors trying to enter the critical section and waiting for another
process that has already acquired the lock to release it. In the first variant of the code segment in
Subsection 7.3.3, the TEST&SET instruction repeatedly tests the value of the lock variable
and also writes the lock variable in the SET part of the operation. With this scheme, each
process contending for the lock continuously generates invalidations (due to the WRITEs to
the lock variable) to the other caches. As a result of the invalidation of the lock variable
by one process spinning on the lock, when the other N − 1 spinning processors read the
lock they do not have a copy of the lock in cache and they must get it via interconnection.
Consequently, each execution of TEST&SET by one processor causes N − 1 data transfers
— the ping-pong effect that introduces a significant amount of traffic. The second variant
of implementation of mutual exclusion with locks in Subsection 7.3.3 avoids the ping-pong
effect by repeatedly executing a LOAD on the lock variable until the lock is seen to be
free. After the first execution of a LOAD by a processor, the subsequent accesses to the
lock variable will be made to the cached copy from the private cache of each processor,
until this copy will be invalidated by the processor that releases the lock. In this way,
waiting for the lock to be released is done on the cached copy of the lock and no traffic
occurs because the waiting process does not write the lock anymore (thus, the criterion
“minimum amount of traffic while waiting” is met).
Let us examine now what happens when the lock is released by the processor that had
acquired it. When the lock is released, all the N cache copies of the lock variable are
invalidated. Therefore, the processors spinning on the lock must each generate a read
to the memory system to read the lock. Depending on the timing, it is possible that all
processors go to do the TEST&SET on the lock once they realize the lock is free, resulting in
further invalidations and rereads. Therefore, even with the second variant,
spin-locks may cause intense coherence activity if multiple processors are spinning when
the lock is released by another processor.
We can conclude that cached TEST&SET schemes are moderately successful in satisfying
the efficiency criteria for low-contention locks, but fail for highly-contended locks because
all the waiting processors rush to grab the lock when the lock is released.
8 SYSTEM CONSISTENCY MODELS
The memory consistency model is the set of allowable orderings of memory accesses (events).
Consistency models place requirements on the order that events from one process may be
observed by other processes in the machine. The memory consistency model defines the
logical behavior of the machine on the basis of the allowable order (sequence) of execution
within the same process and among different processes.
For the memory to be consistent, two necessary conditions must be met:
1. the memory must be kept coherent, that is, all writes to the same location are
serialized in some order and are performed in that order with respect to any
processor;
2. uniprocessor data and control dependences must be respected (this is the responsi-
bility of the hardware and the compiler) — this is required for local-level consistency.
Several memory consistency models have been proposed ([13]), such as sequential consis-
tency, processor consistency, weak consistency, and release consistency. Sequential consis-
tency is the strictest model and it requires the execution of a parallel program to appear as
some interleaving of the execution of the parallel processes on a sequential machine. This
model offers a simple conceptual programming model but limits the amount of hardware
optimization that could increase performance. The other models attempt to relax the con-
straints on the allowable event orderings, while still providing a reasonable programming
model for the programmer.
The architectural organization of a system may or may not inherently support atomicity
of memory accesses. While memory accesses are atomic in systems with a single copy of
data (a new value becomes visible to all processors at the same time), such atomicity may
not be present in cache-based systems. The lack of atomicity introduces extra complexity
in implementing consistency models. Caching of data also complicates the ordering of
accesses by introducing multiple copies of the same location. When the bus is not the
interconnection network, but general interconnection networks are used instead (non bus-
based systems), the invalidations sent by a cache may reach the other caches in the system
at different moments. Waiting for all nodes connected to the network to receive the
invalidation and to send the acknowledgement would cause a serious performance penalty. As
a result of the distributed memory system and general interconnection networks being
used by scalable multiprocessor architectures (for example, the DASH architecture [9,17]
presented in Section 9.5.4), requests issued by a processor to distinct memory modules may
execute out of order. Consequently, when distributed memory and general interconnection
networks are used, performance would benefit if the consistency model allows accesses to
perform out of order, as long as local data and control dependences are observed.
The expected logical behavior of the machine must be known by the programmer in order
to write correct programs. The memory consistency has a direct effect on the complexity
of the programming model presented by a machine for the programmer.
8.1 Event Ordering Aspects
Because a memory consistency model specifies what event orderings are legal when several
processes are accessing a common set of locations, we first examine the stages a memory
request goes through, and present some formal definitions.
Access ordering has several aspects. Program order is defined as the order in which accesses
occur in the execution of a single process given that no reordering takes place. All ac-
cesses in the program order that are before the current access are called previous accesses.
Execution of a memory access has several stages: the access is issued by a processor, it is
performed with respect to some processor and is observed by other processors. We refine
the definition of performing a memory request given in Section 4.5, as follows, where Pi
refers to processor i:
Definition: Performing a Memory Request
• A LOAD by Pi is considered performed with respect to Pk at a point in time when
issuing a STORE to the same address by Pk cannot affect the value returned by the
LOAD.
• A STORE by Pi is considered performed with respect to Pk at a point in time when
an issued LOAD to the same address by Pk returns the value defined by this STORE
(or a subsequent STORE to the same location).
• An access is performed when it is performed with respect to all processors, and is
performing with respect to a processor when it has been issued but has not yet been
performed with respect to that processor.
• A LOAD is globally performed if it is performed and if the STORE that is the source of
the returned value has been performed.
The distinction between performed and globally performed LOAD accesses is only present
in architectures with non-atomic STOREs, because an atomic STORE becomes readable to
all processors at the same time and it is not allowed to perform while another access to the
same memory location is still performing. In architectures with caches and general inter-
connection networks a STORE operation is inherently non-atomic unless special hardware
mechanisms are employed to assure atomicity (for example, a cache-coherence protocol).
Total order means the order in which accesses occur as the result of executing the accesses
of all processes. A total order exists only if each processor observes the same order of occurrence for the accesses issued by all other processors.
8.2 Categorization of Shared Memory Accesses
The concepts from Chapter 7 allow a general categorization of memory accesses, which provides a global picture of shared-memory accesses and forms the basis for the formulation of the consistency conditions of the different consistency models.
Conflicting accesses and Competing accesses
Two accesses are called conflicting if they are to the same memory location and at least
one of the accesses is a STORE (a Read-Modify-Write operation is treated as an atomic
access consisting of both a LOAD and a STORE). Consider a pair of conflicting accesses a1
and a2 on different processors. If no ordering is guaranteed for the two accesses, then
they may execute simultaneously thus causing a race condition. Such accesses a1 and a2
are said to form a competing pair. If an access is involved in a competing pair under any
execution, then the access is called a competing access.
A parallel program consisting of individual processes specifies the actions for each process
and the interaction among processes. These interactions are coordinated through accesses
to shared memory. For example, a producer process may set a flag variable to indicate
to the consumer process that a data record is ready. Similarly, processes may enclose all
updates to a shared data structure within LOCK and UNLOCK operations to prevent simul-
taneous access. All such accesses used to enforce an ordering among processes are
called synchronization accesses. Synchronization accesses have two distinctive characteristics:
• they are competing accesses, with one process writing a variable and the other reading it; and
• they are frequently used to order conflicting accesses (i.e., make them noncompeting).
For example, the LOCK and UNLOCK synchronization operations (defined in Subsection 7.3.1)
are used to order the non-competing accesses made inside a critical section.
Synchronization accesses can be further partitioned into acquire and release accesses. An
acquire synchronization access is performed to gain access to a set of shared locations. A
release synchronization access grants this permission. An acquire is usually accomplished
by reading a shared location until an appropriate value is read. Thus, an acquire is always
associated with a read synchronization access. Similarly, a release is always associated with
a write synchronization access. Examples of acquire synchronization accesses are: a LOCK
operation or a process spinning for a flag to be set. Examples of release synchronization
accesses are an UNLOCK operation or a process setting a flag. In fact, a LOCK operation
requires a Read-Modify-Write (RMW) access. Most architectures provide atomic RMWs (such
as the TEST&SET operation used to gain exclusive access to a set of data) for efficiently
dealing with competing accesses. An atomic RMW can be associated with a pair consisting
of an acquire access (for the read part of the operation) and of a release access (for the
write part of the operation), but other categorizations of the RMW operation are possible, as shown in the next Section.
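As a concrete illustration (a sketch added here, not taken from the report), the fragment below builds a LOCK/UNLOCK pair on an atomic test-and-set; C11 atomics merely stand in for the hardware TEST&SET primitive. The read part of the RMW plays the role of the acquire access, while the UNLOCK store is the release access discussed above.

#include <stdatomic.h>

atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void LOCK(void)
{
    /* TEST&SET loop: the read part of the RMW is the acquire access; the
     * write part only sets the flag and acts as a non-synchronization
     * competing access, since it grants nothing to other processes. */
    while (atomic_flag_test_and_set_explicit(&lock_flag, memory_order_acquire))
        ;  /* spin until the flag was previously clear */
}

void UNLOCK(void)
{
    /* The clearing store is the release access: it signals that the accesses
     * made inside the critical section have completed. */
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}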
A competing access is not necessarily a synchronization access. A competing access that
is not a synchronization access is called a non-synchronization competing access. The
categorization of shared writable memory accesses is depicted in Figure 9(a).
The categorization of shared accesses into these groups allows more efficient implemen-
tation by using this information to relax the event ordering restrictions. The tradeoff is
how easily that extra information can be obtained from the compiler or the programmer
and what incremental performance benefits it can provide.
Figure 9: Shared writable memory accesses: (a) Categorization into competing/non-competing, synchronization/non-synchronization, and acquire/release accesses; (b) the corresponding labels sharedL, ordinaryL, specialL, nsyncL, syncL, acqL, and relL.
For example, the purpose of a
release access is to inform other processes that accesses that appear before it in program
order have completed. On the other hand, the purpose of an acquire access is to delay
future access to data until informed by another process. These two remarks are used in
the definition of the release consistency model (Section 8.7).
8.3 Memory Access Labeling and Properly-Labeled Programs
While categorization of shared writable accesses refers to the intrinsic properties of an
access, the programmer or the compiler can only assert some categorization of the accesses.
The categorization of an access, as provided by the programmer or the compiler, is called
the labeling of that access. The labelings for the memory accesses in a program are shown
in Figure 9(b). The subscript L denotes that these are labels. Access labeling is usually
done so that a label corresponds to the category of access that has the same position in
the categorization tree of Figure 9(a). The labels acqL and relL refer to the acquire and
release accesses respectively. The labels at the same level are disjoint, and a label at a
leaf implies all its parent labels (e.g., an access labeled as acqL is also labeled
as syncL, specialL, and sharedL).
For consistency models that use the information conveyed by the labels, the labels need
to have a proper relationship to the actual category of accesses to ensure correctness.
Although labeling normally corresponds to the categorization, sometimes it can be more
conservative than categorization. For example, the ordinaryL label asserts that an access
is non-competing. Since hardware may exploit the ordinaryL label to use less strict event
ordering, it is important that the ordinaryL label be used only for non-competing accesses.
However, a non-competing access can be labeled conservatively as specialL (thus, in some
cases a distinction between the categorization and the labeling of an access is made).
To ensure that accesses labeled ordinaryL are indeed non-competing, it is important that
enough competing accesses (i.e., accesses labeled as specialL) be labeled as acqL and relL.
The difficulty of ensuring enough syncL labels for a program depends on the amount of
information about the category of accesses, but it is shown further in this section that the
problem can be solved by following a conservative labeling strategy. Because labels at the
same level are disjoint and a label at a leaf implies all its parent labels, it follows that:
(1) an acqL or relL label implies the syncL label, and
(2) any specialL access that is not labeled as syncL is labeled as nsyncL, and
(3) any sharedL access that is not labeled as specialL is labeled as ordinaryL.
The LOAD and STORE accesses in a program are labeled based on their categorization. The
atomic read-modify-write (such as the TEST&SET primitive) provided by most architectures
is labeled by seeing it as a combination of a LOAD access and a STORE access and by labeling
separately each access based on its categorization. The common labeling for a TEST&SET
primitive is an acqL for the LOAD access and a nsyncL for the STORE access, because the
STORE access does not function as a release. If the programmer or the compiler cannot
categorize an RMW appropriately, then the conservative label for guaranteeing correctness
is: acqL for the LOAD and relL for the STORE part of the RMW.
When all accesses in a program are appropriately labeled, the program is called a properly-
labeled (PL) program. The conditions that ensure that a program is properly labeled are
given in [13]:
Condition for Properly-Labeled (PL) Programs
A program is properly-labeled if the following hold:
(shared access) ⊆ sharedL, competing ⊆ specialL,
and enough special accesses are labeled as acqL and relL.
There is no unique labeling to make a program a PL program, that is, there are several
labelings that respect the previous subset properties. Given perfect information about the
category of an access, the access can be easily labeled by making the labels (Figure 9(b)) correspond to the categorization of accesses (Figure 9(a)). When perfect information for labeling is not available, proper labeling can still be provided by being conservative.
The three possible labeling strategies (from conservative to aggressive) are:
1. If competing and non-competing accesses can not be distinguished, then all reads
can be labeled as acqL and all writes can be labeled as relL.
2. If competing accesses can be distinguished from non-competing accesses but syn-
chronization and non-synchronization accesses can not be distinguished, then all
accesses distinguished as non-competing can be labeled as ordinaryL and all com-
peting accesses are labeled as acqL and relL (as in strategy (1)).
3. If competing and non-competing accesses are distinguished and synchronization and
non-synchronization accesses are distinguished, then all non-competing accesses can
be labeled as ordinaryL, all non-synchronization accesses are labeled as nsyncL and
all synchronization accesses are labeled as acqL and relL (as in strategy (1)).
There are two practical ways for labeling accesses to provide properly-labeled (PL) pro-
grams. The first involves parallelizing compilers that generate parallel code from sequential
programs. Since the compiler does the parallelization, the information about which ac-
cesses are competing and which accesses are used for synchronization is known to the
compiler and can be used to label the accesses properly. The second way of producing PL
programs is to use a programming methodology that lends itself to proper labeling. For
example, a large class of programs are written such that accesses to shared data are pro-
tected within critical sections. Such programs are called synchronized programs, whereby
writes to shared locations are done in a mutually exclusive manner. In a synchronized pro-
gram, all accesses (except accesses that are part of the synchronization constructs) can be
labeled as ordinaryL. In addition, since synchronization constructs are predefined, the ac-
cesses within them can be labeled properly when the constructs are first implemented. For
this labeling to be proper, the programmer must ensure that the program is synchronized.
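As a hedged sketch (not taken from the report) of such a synchronized program, the C fragment below protects every access to a shared counter with a POSIX mutex; the accesses hidden inside pthread_mutex_lock and pthread_mutex_unlock play the role of the predefined synchronization constructs, and the counter accesses can then be labeled ordinaryL.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter = 0;          /* shared, but only touched under m */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&m);          /* acquire access                   */
        shared_counter++;                /* ordinary accesses (ordinaryL)    */
        pthread_mutex_unlock(&m);        /* release access                   */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", shared_counter);   /* always 200000 */
    return 0;
}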
8.4 Sequential Consistency Model
The strictest consistency model is called Sequential Consistency (SC) and has been defined
by Lamport [2] as follows:
A system is sequentially consistent if the result of any execution
of a program is the same as if the operations of all the processors
were executed in some sequential order, and the operations of each
individual processor appear in this sequence in the order specified
by its program.
In other words, the sequential consistency model requires execution of the parallel program
to appear as some interleaving of the execution of the parallel processes on a sequential
machine. An interleaving that is consistent with the program order is called legal inter-
leaving. Application of the above definition requires a specific interpretation of the terms
operations and result. Operations are memory accesses (reads, writes, and read-modify-
writes) and result refers to the union of values returned by all the read operations in the
execution and the final state of memory. The definition of sequential consistency can be
translated into the following two conditions:
(1) all memory accesses appear to execute atomically in some total order, and
(2) all memory accesses of each processor appear to execute in an order specified by its
program, that is, in program order.
Speaking in event-ordering terms, a sequentially consistent memory ensures that the execu-
tion of processes is such that there is a total order of memory accesses that is consistent
with the program order of each process. Under sequential consistency, identification of
accesses that form a competing pair can be achieved with the following criterion: Two
conflicting accesses a1 and a2 on different processes form a competing pair if there exists
at least one legal interleaving where a1 and a2 are adjacent. Assuming the SC model, the
following criterion (given in [13]) may be used for determining whether enough accesses
are labeled as syncL (i.e., as acqL and relL):
Condition for enough syncL labels
Pick any two accesses u on processor Pu and v on processor Pv (Pu not the same as Pv),
such that the two accesses conflict and at least one is labeled as ordinaryL. If v appears
after (before) u under any interleaving consistent with the program order, then there needs
to be at least one relL (acqL) access on Pu and one acqL (relL) on Pv separating u and v,
such that the relL appears before the acqL. There are enough accesses labeled as syncL
—that is, relL and acqL labeled accesses— if the above condition holds for all possible
pairs u and v.
The SC model ignores all access labelings past sharedL. In systems that are sequentially
consistent we say that events are strongly ordered: the order in which events are generated
by a processor is the same as the order in which all the other processors observe the
events, and events generated by two different processors are observed in the same order
by all other processors.
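The following small C program (a sketch added here, not part of the original report) is the classic two-process test that separates strongly ordered systems from weaker ones: under sequential consistency no legal interleaving leaves both r1 and r2 equal to 0, whereas a model that lets a read bypass an earlier write can produce exactly that outcome. C11 sequentially consistent atomics are used to request SC behavior from a modern compiler and processor.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int X = 0, Y = 0;
int r1, r2;

void *p1(void *arg)
{
    (void)arg;
    atomic_store(&X, 1);   /* STORE X  (sequentially consistent) */
    r1 = atomic_load(&Y);  /* LOAD  Y  (kept after the STORE)    */
    return NULL;
}

void *p2(void *arg)
{
    (void)arg;
    atomic_store(&Y, 1);
    r2 = atomic_load(&X);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("r1 = %d, r2 = %d\n", r1, r2);  /* (0, 0) is impossible under SC */
    return 0;
}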
8.4.1 Conditions for Sequential Consistency
Necessary and Sufficient Conditions for SC in Systems with Atomic Accesses
It has been shown by Dubois et al. ([3]) that the necessary and sufficient condition for a
system with atomic memory accesses to be sequentially consistent is that memory accesses
be performed in program order.
In architectures with caches and general interconnection networks, where accesses are
inherently non-atomic, special hardware and software mechanisms must be employed to
assure sequential consistency. Here are the sufficient conditions for sequential consistency
(as given in [13]) in systems with non-atomic accesses:
Sufficient Conditions for SC in Systems with Non-Atomic Accesses
(1) before a LOAD is allowed to perform with respect to any other processor, all previous
LOAD accesses must be globally performed and all previous STORE accesses must be per-
formed, and
(2) before a STORE is allowed to perform with respect to any other processor, all previ-
ous LOAD accesses must be globally performed and all previous STORE accesses must be
performed.
8.4.2 Consistency and Shared-Memory Architecture
Let us examine the consistency model for the common shared-memory architectures:
shared-bus systems without caches, shared-bus systems with caches, systems with gen-
eral interconnection networks without caches, and systems with general interconnection
networks with caches.
Shared-bus systems without caches
In this case, if memory-access cycles (i.e., LOAD, STORE, and RMW cycles) are atomic (i.e., data
elements are accessed and modified in indivisible operations), then each access to an
element applies to the latest copy. Simultaneous accesses to the same element of data are
serialized by the hardware. Sequential consistency is thus guaranteed for this architecture
if the hardware assures that accesses of a processor are issued in program order and if
reads are not allowed to bypass writes in write buffers (to maintain atomicity of accesses).
Shared-bus systems with caches
In these systems, accesses are inherently non-atomic. To guarantee sequential consistency,
in addition to the requirements that accesses be issued in program order and reads do not
bypass writes, special hardware mechanisms must be employed to make STOREs and LOADs
appear to be atomic. A snooping cache protocol (as presented in Subsection 9.4.2) can
be used for this purpose. The protocol exploits the simultaneous broadcast capability of
buses: when a processor executes a STORE, it generates an invalidation signal on the bus;
all cache controllers (and possibly the memory controller) simultaneously latch the invalidation generated by a STORE request. As soon as each controller has taken the proper
action on the invalidation, the access can be considered performed.
Systems with general interconnection networks without caches
In these systems, accesses are inherently non-atomic because the time taken for an access
to reach the target memory module depends on the path of the access and is generally
unpredictable. To guarantee sequential consistency, in addition to the requirements that
accesses be issued in program order and reads do not bypass writes, special hardware
mechanisms must be employed to assure that accesses perform in program order.
Systems with general interconnection networks with caches
In these systems, accesses are inherently non-atomic because of the caches and
the general interconnection network properties. To guarantee sequential consistency, in
addition to the requirements that accesses be issued in program order and reads do not
bypass writes, special hardware mechanisms must be employed to assure that accesses
perform in program order and appear to execute atomically.
8.4.3 Performance of Sequential Consistency
Sequential consistency, while conceptually offering a simple programming model, imposes
severe restrictions on the outstanding accesses that a process may have and prohibits
many hardware optimizations that could increase performance. As will be apparent from
the cache-coherence protocols guaranteeing sequential consistency (for example, the Full-
map directory protocol, Subsection 9.5.2) this strict model limits the performance. For
many applications, such a model is too strict, and one can make do with a weaker notion of
consistency. As an example, consider the case of a processor updating a data structure
within a critical section. If the computation requires several STOREs and the system is
sequentially consistent, then each STORE will have to be delayed until the previous STORE
is complete. But such delays are unnecessary because the programmer has already made
sure that no other process can rely on that data structure being consistent until the critical
section is exited. Given that all synchronization points are identified, the memory needs
only be consistent at those points.
Several memory consistency models that attempt to relax the constraints on the allowable
event orderings have been proposed and they are called relaxed consistency models. The
most prominent relaxed consistency models are the processor consistency, weak consis-
tency, and release consistency models. The larger latencies found in a distributed system,
as compared to a shared-bus system, favor the relaxed consistency models because they
allow higher-performance implementations than those allowed by the sequential consistency
model.
8.5 Processor Consistency Model
The processor consistency (PC) model requires that all writes issued from a processor be observed only in the order in which they were issued, but allows that the order in which
writes from two processors occur, as observed by themselves or a third processor, may not
be identical. The conditions for processor consistency are defined in [13] as follows:
Conditions for Processor Consistency
(1) before a LOAD is allowed to perform with respect to any other processor, all previous
LOAD accesses must be performed, and
(2) before a STORE is allowed to perform with respect to any other processor, all previous
accesses (LOADs and STOREs) must be performed.
The above conditions allow reads following a write to bypass the write. To avoid deadlock,
the implementation should guarantee that a write that appears previously in program order
will eventually perform. The PC model ignores all access labelings aside from sharedL.
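As a hedged illustration (not from the report), the producer/consumer flag idiom below is exactly the kind of program that processor consistency is still strong enough to support: because the producer's two writes are observed in the order they were issued, a consumer that has seen flag = 1 is guaranteed to read the data value 42. The C11 release/acquire annotations are added so that the example is also correct on today's even weaker memory models.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int payload = 0;
atomic_int flag = 0;

void *producer(void *arg)
{
    (void)arg;
    atomic_store_explicit(&payload, 42, memory_order_relaxed); /* write 1 */
    atomic_store_explicit(&flag, 1, memory_order_release);     /* write 2 */
    return NULL;
}

void *consumer(void *arg)
{
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                                      /* spin on flag */
    printf("%d\n", atomic_load_explicit(&payload, memory_order_relaxed));
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}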
8.6 Weak Consistency Model
The weak consistency model and the release consistency model (next Section) employ the
categorization and labeling of memory accesses (Sections 8.2 and 8.3) to relax the event
ordering restrictions on the basis of extra information (provided by the programmer or
the compiler) on the type of memory access.
The weak consistency model proposed by Dubois et al. ([3]) is based on the idea that the
interaction between parallel processes manifests itself through synchronization accesses
that are used to order events and through ordinary shared accesses. If synchronization ac-
cesses can be recognized, and sequential consistency is guaranteed only for synchronization
accesses, then the ordinary accesses might proceed faster because they need to be ordered
only with respect to synchronization accesses. This improves performance because ordi-
nary accesses are more frequent than synchronization accesses. As an example, consider a
processor updating a data structure within a critical section. If updating the structure re-
quires several writes, each write in a sequentially consistent system will stall the processor
until all other cached copies of that location have been invalidated. But these stalls are
unnecessary, as the programmer has already made sure that no other process can rely on
the consistency of that data structure until the critical section is exited. If the synchro-
nization points can be identified, then the memory need only be consistent at those
points. The weak consistency model exploits this idea and guarantees that the memory
is consistent only following a synchronization operation. The conditions that ensure weak
consistency are given in [13]:
Conditions for Weak Consistency
(1) before an ordinary LOAD or STORE access is allowed to perform with respect to any
other processor, all previous synchronization accesses must be performed,
(2) before a synchronization access is allowed to perform with respect to any other pro-
cessor, all previous ordinary (LOADs and STOREs) accesses must be performed, and
(3) synchronization accesses are sequentially consistent with respect to one another.
Speaking in terms of access labeling, under the weak consistency model only the la-
bels sharedL, ordinaryL, and specialL are taken into account, with an access labeled as
specialL being treated as a synchronization access and as both an acquire and a release.
In a machine supporting weak consistency (also called weak ordering of events [3,13,17]) the
programmer should make no assumption about the order in which the events that a process
generates are observed by other processes between two explicit synchronization points.
Accesses to shared writable data should be executed in a mutually exclusive manner,
controlled by synchronization operations, such as LOCK and UNLOCK. Only synchronization
accesses are guaranteed to be sequentially consistent. Before a synchronization access
can proceed, all previous ordinary accesses must be allowed to “settle down” (i.e., all
shared memory accesses made before the synchronization point was encountered must be
completed before the synchronization access can proceed). In such systems we say that
events are weakly ordered.
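A minimal sketch of this "settle down" rule, using C11 primitives as stand-ins for the hardware mechanisms (the report itself predates C11): the ordinary updates inside the critical section may be buffered and pipelined freely, while the release issued as part of the UNLOCK must wait until all of them have performed.

#include <stdatomic.h>

atomic_int lock_word = 0;   /* 0 = free, 1 = held: a synchronization variable */
int record[4];              /* ordinary shared data protected by lock_word    */

void update_record(void)
{
    /* Synchronization access (LOCK): spin on the lock variable. */
    while (atomic_exchange_explicit(&lock_word, 1, memory_order_acquire) != 0)
        ;

    record[0] = 1;          /* ordinary accesses: may be pipelined and     */
    record[1] = 2;          /* reordered freely between synchronization    */
    record[2] = 3;          /* points                                      */
    record[3] = 4;

    /* Synchronization access (UNLOCK): the release fence forces all the
     * ordinary accesses above to be performed before the lock is freed. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&lock_word, 0, memory_order_relaxed);
}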
The advantage of the weak consistency model is that it provides the user with a reasonable
programming model, while permitting multiple memory accesses to be pipelined, and thus
allowing high performance. For example, consider a multiprocessor with a buffered, mul-
tistage, and packet-switched interconnection network. If strong ordering is to be enforced,
then the interface between the processor and the network can send global memory requests
only one at a time. The reason for this is that in such a network the access time is vari-
able and unpredictable because of conflicts; in many cases waiting for an acknowledgement
from the memory controller is the only way to ensure that global accesses are performed
in program order. In the case of weak ordering the interface can send the next global
access directly after the current global access has been latched in the first stage of the
interconnection network, resulting in better processor efficiency. However, the frequency
of synchronization operations (such as LOCKs) will be higher in a program designed for a
weakly ordered system. Therefore, weak consistency is expected to perform better
than sequential consistency in systems that do not synchronize frequently.
The disadvantage of the weak consistency model is that the programmer or the com-
piler must identify all synchronization accesses in order to support mutually exclusive
access to shared writable data. Moreover, the synchronization accesses must be hardware-
recognizable to enforce that they are sequentially consistent.
8.7 Release Consistency
The release consistency model (RC) is an extension of the weak consistency model, in
which the requirements on synchronization accesses and ordinary accesses ordering are
relaxed. The release consistency model exploits the information conveyed by the labels at
the leaves of the labeling tree, that is, the labelings ordinaryL, nsyncL, acqL, and relL
are considered by the model. Basically, RC guarantees that memory is consistent only
when a critical section is exited. The conditions for ensuring release consistency are given
in [13] as follows:
Conditions for Release Consistency
(1) before an ordinary LOAD or STORE access is allowed to perform with respect to any
other processor, all previous acquire accesses must be performed,
(2) before a release access is allowed to perform with respect to any other processor, all
previous ordinary (LOADs and STOREs) accesses must be performed, and
(3) special accesses are processor consistent with respect to one another.
The ordering condition stated by the weak consistency model for synchronization accesses
is extended under the release consistency model to special accesses, which include all compet-
ing accesses, both synchronization and non-synchronization accesses. On the other hand,
four of the ordering restrictions in weak consistency are not present in release consistency:
1. First, ordinary LOAD and STORE accesses following a release access do not have to wait
for the release access to be performed. Because the release synchronization access
is intended to signal that previous LOAD and STORE accesses in a critical section are
complete, it is not related to the ordering of the future accesses. Of course, the local
dependences within a processor must still be respected by LOADs and STOREs.
2. Second, an acquire synchronization access need not be delayed for previous ordinary
LOAD and STORE accesses to be performed. Because an acquire access is intended
to prevent future accesses by other processors to a set of shared locations, and is
not giving permission to any other process to access the previous pending locations,
there is no reason for the acquire to wait for the pending accesses to complete.
3. Third, a non-synchronization special access does not wait for previous ordinary ac-
cesses and does not delay future ordinary accesses; therefore, a non-synchronization
access does not interact with ordinary accesses.
4. Fourth, the special accesses are only required to be processor consistent and not
sequentially consistent. The reason for this is that, provided that the applications
meet some restrictions, sequential consistency and processor consistency for special
accesses give the same results. The restrictions that allow this relaxed requirement
on special accesses are given in [13] and have been verified there to hold for the
parallel applications available at the time the study was conducted.
Essentially, RC guarantees that the memory is consistent when a critical section is exited,
by requiring that all ordinary memory operations be performed before the critical section
is released. The reason that this requirement suffices is that when a processor is in its
critical section modifying some shared data, no other process can access that data until
this section is exited.
The RC model provides the user with a reasonable programming model, since the pro-
grammer is assured that when the critical section is exited, all other processors will have a
consistent view of the modified data. The relaxed requirements on access ordering allow RC implementations to hide or mask the effects of memory-access latency; that is, the effects of the memory-access latency are deferred until the selected synchronization access
occurs.
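The fragment below is a sketch added here (not from the report) that annotates a task using two independent locks; the LOCK/UNLOCK helpers are assumed to be built as in the earlier TEST&SET sketch, but taking the lock variable as a parameter. The comments mark which of the weak-consistency delays release consistency removes.

#include <stdatomic.h>

void LOCK(atomic_flag *l);     /* acquire access, as in the earlier sketch */
void UNLOCK(atomic_flag *l);   /* release access                           */

atomic_flag LA = ATOMIC_FLAG_INIT, LB = ATOMIC_FLAG_INIT;
int a, b, scratch;             /* shared data */

void task(void)
{
    LOCK(&LA);
    a = 1;                     /* ordinary accesses of the first section      */
    UNLOCK(&LA);               /* must wait only for the ordinary accesses    */
                               /* above to perform (condition 2)              */

    scratch = a + 1;           /* relaxation 1: this ordinary access need not */
                               /* wait for the release UNLOCK(&LA) to perform */

    LOCK(&LB);                 /* relaxation 2: the acquire need not wait for */
                               /* the previous ordinary store to scratch      */
    b = scratch;               /* ordinary accesses of the second section     */
    UNLOCK(&LB);
}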
8.8 Correctness of Operation and Performance Issues
The programmer must know the consistency model to be able to write correct programs
because memory consistency determines the programming model presented by a machine
for the programmer. In addition, the method for identifying an access as a competing
access depends on the consistency model and is generally difficult. For example, it is
possible for an access to be competing under processor consistency and non-competing
under sequential consistency.
Consistency models differ from the point of view of exploiting the information conveyed
by the access labels. The sequential and processor consistency models ignore all labels
aside from sharedL. The weak consistency model ignores all labelings past ordinaryL and
specialL; in weak consistency an access labeled as specialL is treated as a synchronization
access and as both an acquire and a release. In contrast, the release consistency model
exploits the information conveyed by the labels at the leaves of the labeling tree (i.e.,
ordinaryL, nsyncL, acqL, and relL). Labeling the accesses to provide a properly labeled
program may be done either using a parallelizing compiler or requiring the programmer
to design synchronized programs (as shown in Section 8.3).
The conditions for satisfying each consistency model have been formulated in Sections 8.4
– 8.7 such that a process needs to keep track of only requests initiated by itself. Thus, the
compiler and hardware can enforce ordering on a per process(or) basis.
The memory consistency model supported by an architecture directly affects the efficiency
of the implementation (e.g., the amount of buffering and pipelining that can take place
among the memory requests). Sequential consistency presents a simple programming
paradigm but it reduces potential performance, especially in a machine with a large number
of processors or long delays in the interconnection network. While weak consistency and
release consistency models allow potentially greater performance, they require that a proper labeling of memory accesses be provided (that is, extra information about the labeling of memory accesses is required from the programmer or the compiler) to exploit that potential.
The correctness of a multiprocessor operation is related to the expected model of behavior
for the machine. A programmer who expects a system to behave in a sequentially consis-
tent manner will perceive the system to behave incorrectly if the system allows its processes
to execute accesses out of program order. For machines that are not sequentially consistent to produce the same results as SC, the program must include synchronization operations that
order competing accesses. Synchronization allows a program to give results independent
of the execution rates of processors.
Thus, the consistency model has a direct effect on the complexity of the programming
model presented to the programmer and on performance. The challenge is to find the bal-
ance between providing a reasonable programming model to the programmer and achieving
high performance by providing freedom in the ordering among memory requests.
9 CACHE COHERENCE PROTOCOLS
9.1 Types of Protocols
Caches in multiprocessors must operate coherently. The coherence problem is related to
two types of events: two (or more) processors trying to update the value of a shared
variable, or process migration between processors.
Caches will operate consistently if each processor's memory accesses are directed to the currently active location of any variable whose true physical location can change.
Solutions of different complexity are possible, but in general the simpler the solution,
the greater will be the performance penalty incurred. A simple architectural solution is
to disallow private caches and have only shared caches that are associated with the main
memory modules. A network interconnects the processors to the shared cache modules and
every data access is made to the shared cache. Because with this solution the advantage
of caches in reducing memory traffic is lost, it is not considered an efficient method.
A cache-coherence protocol consists of the set of possible states in the local caches, the
states in the shared memory, and the state transitions caused by the messages transported
through the interconnection network to keep memory coherent. There are three classes of
protocols followed to maintain cache coherence:
• Snooping — Every cache that has a copy of the data from a block of physical memory
also has information about it, but this information does not specify where other copies of
that block are. Accesses to caches are broadcast on the interconnection network, so that
all other caches can check the block address and determine whether or not they have a
copy of the shared block. The caches are usually on a shared-memory bus, and all cache
controllers monitor or snoop on the bus. Depending on what happens on a write, snooping
protocols are of two types:
1. Write invalidate — the writing processor causes all copies in other caches to be
invalidated (by broadcasting the address of data) before changing its local copy.
This scheme allows multiple caches to read a datum, but only one cache can
write it: this type of sharing is called Multiple Readers Single Writer (MRSW).
2. Write update (also called write broadcast) — the writing processor broadcasts the
new data over the bus so that all copies are updated with the new value. This type
of sharing is called Multiple Readers Multiple Writers (MRMW).
• Directory based — Information about the state of every block in physical memory is kept
partially in a directory entry and partially in every cache:
1. A directory entry is associated with every memory block and it is composed of a
state bit together with a vector of pointers. The state bit indicates whether the line
is not cached by any cache (uncached), shared in an unmodified state in one or more
caches, or modified in a single cache (dirty). The pointers give the location of the
caches that have a copy of the line.
2. An additional status bit, called the private bit, is appended to every cache line and,
together with the valid bit, indicates the state of the cache block in this cache. A
cache block in a processor’s cache, just as a memory block, may also be in one
of three states: invalid, shared, or dirty. The shared state implies that there may
be other processors caching that location. The dirty state implies that this cache
contains an exclusive copy of the memory block, and the block has been modified in
this cache and nowhere else.
• Compiler-directed — Compile-time analysis is used to obtain information on accesses to
a given line by multiple processors. Such information can allow each processor to manage
its own cache without interprocessor runtime communication.
Depending on which action is taken on a write to shared data — invalidate or update —
cache-coherence protocols are categorized as write-invalidate or write-update protocols.
The correctness of a coherence protocol is a function of the memory consis-
tency model adopted by the architecture. The selection of a cache coherence protocol
is related to the type of interconnection network. For a shared-memory bus architecture,
snooping can be easily implemented because buses support the basic mechanism for broad-
cast: a bus transaction automatically ensures that all receivers are listening to the bus when
the transmitting processor gains access to the bus. Thus, any memory access made by
one device connected to the bus can be “seen” by all other devices connected to the bus.
Buses, although suited for broadcast, have the flaw that they can not support the heavy broadcast traffic that is likely to appear as the number of processors increases (an example is given in the next Section). For general scalable interconnection networks, such as Omega networks or k-ary n-cubes, neither an efficient broadcast capability nor a convenient snooping mechanism is provided. To achieve high performance, the coherence commands
should be sent to only those caches that have a copy of that block. Because directory pro-
tocols maintain for each memory block information about which caches have copies of the
block, they are suited for general interconnection networks systems. However, directory
coherence protocols may be used in bus-based systems as well, usually for multiprocessors
with a large number of processors (about 100), where snooping protocols can not scale.
An architecture is scalable if it achieves linear or near-linear performance growth as the
number of processors increases. Since snooping schemes distribute information about
which processors are caching which data items among the caches, they require that all
caches see every memory request from every processor. This inherently limits the scal-
ability of these machines because the individual processor caches and the common bus
eventually saturate. With today’s high-performance RISC processors this saturation can
occur with just a few processors. Directory structures avoid the scalability problems by
removing the need to broadcast every memory request to all processor caches. This is
because the directory maintains pointers to the processor caches holding a copy of each
memory block, and since only the caches with copies can be affected by an access to the
memory block, only those caches need to be notified of the access. Thus, the processor
caches and interconnection network will not saturate due to coherence requests. Further-
more, directory-based coherence is not dependent on any specific interconnection network
like the bus used by most snooping schemes.
9.2 Rules enforcing Cache Coherence
When a processor is writing a shared datum in its cache, the coherence protocol must
locate all the caches that share the datum. The consequence of a write to shared data is
either to invalidate all other copies or to broadcast the write to the shared copies in order
to update them. When write-back strategy is used, the coherence protocol must also help
read misses determine who has the most up-to-date value, because for this strategy the
shared memory may not have the current copy of the data; the current value of a data item may instead be in any of the caches. The two basic conditions that must be met to maintain
cache coherence are:
1. If a read operation for a shared datum misses in the cache, then a means must exist
to identify whether another cache (or other caches) holds the valid copy of the datum.
2. All write operations to a shared datum for which the processor does not have exclu-
sive access must force copies of that datum in all other caches to be invalidated or
updated.
Observing these rules may introduce significant performance penalties, due to increased coherence-related cache accesses and network contention. For example, in a snooping protocol,
all other caches in the system are checked using bus-broadcast interrogation both for read
misses and for writes to shared data. The first rule requires a broadcast of the interrogation
over the interconnection network to all caches followed by a cache read in every cache
in the system. That tends to increase network contention and reduce available cache
bandwidth. Since this operation takes place only on misses to shared data, its frequency
should be just a few percent of the reads on any single processor. As the number of
processors increases, however, the load on the communication network and cache traffic
quickly approaches saturation. For example, a 1 percent miss ratio on shared data in each
of 100 processors of a multiprocessor can generate 100 x 0.01 = 1 broadcast request and
one cache read per clock cycle. This broadcast will saturate the communication system
and the individual caches of all processors. The second rule can cause potentially greater
degradation for a write-update snooping protocol, given that it requires a communication
overhead on every write to a shared datum. Directory protocols try to avoid this by
keeping information about which line is shared in which cache and avoid communication
with caches that do not share the line, but hot-spot accesses can still appear. If two
or more processes attempt to access and modify the same shared variable several times
over a brief period of time, and if the requests by each processor are interleaved in some
order, then the cache coherence protocol generally causes heavy traffic due to the access
pattern that progressively moves the datum from one cache to another as it is read and
modified repeatedly. This behavior appears in multiprocessor systems for barrier and lock
variables (Sections 7.4 and 7.5).
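The saturation argument above can be captured in a one-line formula: with P processors, each issuing roughly one memory reference per clock cycle, and a shared-data miss ratio m, a broadcast-based protocol injects about P x m interrogations per cycle, and the bus saturates as that product approaches 1. The small C helper below is an illustration added here, using the report's own figures as input.

#include <stdio.h>

double broadcasts_per_cycle(int processors, double shared_miss_ratio)
{
    return processors * shared_miss_ratio;
}

int main(void)
{
    /* The example above: 100 processors, 1 percent miss ratio on shared data. */
    printf("%.2f broadcast(s) per cycle\n", broadcasts_per_cycle(100, 0.01));
    return 0;
}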
9.3 Cache Invalidation Patterns
Knowledge of the access pattern to shared variables enables keeping memory system la-
tency and contention as low as possible. The two write policies that can be used for
coherence protocols —write invalidate and write update— exhibit performance dependent
on the sharing pattern. Snooping protocols may use either write invalidate or write update,
while directory-based protocols use write invalidate. Write-invalidate schemes maintain
cache coherence by invalidating copies of a memory block when the block is modified by a
processor. For snooping-based protocols, the invalidation is broadcast and all caches are
checking if they have a copy of the line that must be invalidated, while for directory-based
protocols only the caches that actually share the line receive the invalidation message.
The sharing pattern is characterized by several parameters, of particular importance be-
ing the number of caches sharing a data object and the write-run. The write-run has
been defined by Eggers and Katz [18] as the length of the uninterrupted sequence of write
requests, interspersed with reads, to a shared cache line by one processor. A write-run is
terminated when another processor reads or writes the same cache line. The length of
the write-run is the number of writes in that write-run. Every new write-run requires an
invalidation and data transfer. When write-runs are short, the write-invalidate scheme
generates frequent invalidations and the write update scheme generates equally frequent
updates. Since the total time cost for invalidations and data transfer is higher than the
cost of updating one word, write-invalidate schemes are inferior for this sharing pattern.
On the other hand, for long write-runs, the write update scheme generates many updates
that are redundant, given the length of the write-run. Therefore, write invalidate performs
better for long write-runs because only the first write in a write-run causes invalidation
of the shared copies of the written line. Furthermore, a write invalidate scheme in a
directory-based protocol sends one invalidation request per write-run only to the caches
that actually share the line. A study conducted on a simulated 32-processor machine [9]
shows that, for a large number of applications, most writes cause invalidations to only a
few caches, with only about 2% of all shared writes causing invalidation of more than 3
caches. Write invalidate protocols perform fairly well for a broad range of sharing pat-
terns. However, there exist some sharing patterns for which unnecessary invalidations are
generated. A notable example is the invalidation overhead associated with data structures
that are accessed within critical sections. Typically, processors read and modify such data
structures one at a time. Processors that access data this way cause a cache miss followed
by an invalidation request being sent to the cache attached to the processor that most
recently exited the critical section. This sharing behavior, denoted migratory sharing, has
been previously shown to be the major source of single invalidations (i.e., invalidation of
one cache) by Gupta and Weber in [14]. An extension of the write-invalidate protocol
that effectively eliminates most single invalidations caused by migratory sharing has been
proposed by Stenström et al. in [19]. This scheme improves performance by reducing the
shared access penalty and the network traffic.
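As a hedged sketch of how write-runs can be measured, the routine below scans an access trace for a single cache line and reports, for each writer, the number of writes it issued before any other processor touched the line, following the Eggers and Katz definition; the trace format and field names are assumptions made for this illustration.

#include <stdio.h>

struct access { int proc; int is_write; };

/* Print the length of every write-run found in trace[0..n-1]. */
void write_runs(const struct access *trace, int n)
{
    int owner = -1;      /* processor of the current write-run, -1 = none */
    int length = 0;      /* number of writes in the current write-run     */

    for (int i = 0; i < n; i++) {
        if (trace[i].proc != owner && owner != -1 && length > 0) {
            /* another processor reads or writes the line: run terminates */
            printf("write-run of length %d by P%d\n", length, owner);
            length = 0;
        }
        if (trace[i].is_write) {
            owner = trace[i].proc;
            length++;
        }
    }
    if (length > 0)
        printf("write-run of length %d by P%d\n", length, owner);
}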
9.4 Snooping Protocols
9.4.1 Implementation Issues
A bus is a convenient device for ensuring cache coherence because it allows all processors
in the system to observe ongoing memory transactions. In a snooping protocol each cache
snoops on the transactions of other caches. When the cache controller sees an invalidation
or update message broadcast over the bus, it takes the appropriate action on the local copy
of the line. Snooping protocols allow all data to be cached and coherence is maintained by
hardware. Sharing information is added to the valid bit in a cache line. This information
is used in monitoring bus activities. Snooping protocols have the advantage that, because the sharing information about a memory block is kept in the caches that have a copy of the block, the amount of memory required to keep this information is proportional to the number of blocks in the cache, as opposed to directory protocols, where the directory
memory is proportional to the number of blocks in main memory. A write update protocol
broadcasts writes to shared data while write invalidate deletes all other copies so that there
is only one local copy for subsequent writes.
On a read miss on the bus, all caches check to see if they have a copy of the requested
line and take the appropriate action, such as supplying the data to the cache that missed.
Similarly, on a write miss on the bus, all caches check to see if they have a copy and if they
find out that they have a copy of the written data they invalidate their copy or change it
to the new value (depending on whether write invalidate or write broadcast is used).
Write update protocols usually allow cache lines to be tagged as shared or private. Only
shared data need to be updated on a write. If this information about data sharing is
available, a write update protocol acts like a write-through cache for shared data (broad-
casting to other caches) and a write-back cache for private data (the modified data leaves
the cache only on a miss).
Write-invalidate protocols maintain a state bit for each cache block that, in conjunction with
the valid bit defines the state of the block. A block can be in one of the following three
states:
1. clean (also called Read Only) —the copy of the block in cache is also in the main
memory; that is, the block has not been modified in cache, or the modification has
been updated in main memory;
2. dirty (also called Read/Write) —the block has been modified in cache, but not in
main memory;
3. invalid —the block does not contain valid data.
Most cache-based multiprocessors use write-back caches in order to reduce the bus traffic
and allow more processors on a single bus. The dirty bit used by the write-back policy
is also used by the cache-coherence protocol to define the state of the cache block, as
described above. There is no obvious choice on which snooping protocol (write-invalidate
or write-broadcast) is superior, because the performance of both variants is dependent on
the sharing pattern of the application, as shown in Section 9.3.
9.4.2 Snooping Protocol Example
Let us build the finite-state machine that implements a write-invalidate protocol based on a write-back policy. The finite-state transition diagram is depicted in Figure 10.
Figure 10: A Write-Invalidate Snooping Cache-Coherence Protocol. The states are Invalid (not a valid cache block), Read only (clean), and Read/Write (dirty); one half of the diagram shows cache state transitions using signals from the CPU, the other half the transitions using signals from the bus.
There is only one state-machine in a cache, with stimuli coming either from the attached
CPU or from the bus, but the figure shows the three states of the protocol in duplicate
in order to distinguish the transitions based on CPU actions, as opposed to transitions
based on bus operations. Transitions happen on read misses, write misses, or write hits;
read hits do not change cache state.
When the CPU has a read miss, it will change the state of that block to Read only and
write back the old block if it was in the Read/Write state (dirty). All the caches snoop
on the read miss to see if this block is in their cache. If one cache has a copy and it is
in the Read/Write state, then the block is written to memory and is then changed to the
Invalid state (as shown in this protocol) or Read only.
When a CPU writes into a block, that block goes to the Read/Write state. If the write
was a hit, an invalidate signal goes out over the bus. Because caches monitor the bus, all
check to see if they have a copy of that block; if they do, they invalidate it. If the write
was a miss, all caches with copies go to the invalid state. For simplicity, a write to clean data may be treated as a “write miss”, so that there is no separate invalidation signal; the same bus signal as for a write miss is used.
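The state diagram of Figure 10 can be written down as two small transition functions, one driven by CPU signals and one driven by bus signals. The C sketch below is an illustration added here (the report gives only the diagram); actions such as writing back a dirty block or sending the invalidate signal are indicated as comments.

enum line_state { INVALID, READ_ONLY /* clean */, READ_WRITE /* dirty */ };

enum cpu_event { CPU_READ_MISS, CPU_WRITE_HIT, CPU_WRITE_MISS };
enum bus_event { BUS_READ_MISS, BUS_WRITE_MISS, BUS_INVALIDATE };

enum line_state cpu_transition(enum line_state s, enum cpu_event e)
{
    switch (e) {
    case CPU_READ_MISS:
        /* if s == READ_WRITE: write back the dirty block first */
        return READ_ONLY;
    case CPU_WRITE_HIT:
        /* send an invalidate signal on the bus, then write locally */
        return READ_WRITE;
    case CPU_WRITE_MISS:
        /* treated as a write miss on the bus; fetch the block with ownership */
        return READ_WRITE;
    }
    return s;   /* read hits do not change the cache state */
}

enum line_state bus_transition(enum line_state s, enum bus_event e)
{
    switch (e) {
    case BUS_READ_MISS:
        /* if s == READ_WRITE: write the block back to memory, then give it up */
        return (s == READ_WRITE) ? INVALID : s;
    case BUS_WRITE_MISS:
    case BUS_INVALIDATE:
        /* another cache is writing this block: invalidate the local copy */
        return INVALID;
    }
    return s;
}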
9.4.3 Improving Performance of Snooping Protocol
Reducing Interference Between Broadcasts and CPU Operation
Since every bus transaction checks cache-address tags, snooping would interfere with the CPU's accesses to the cache if a single copy of the address tags were shared by the CPU and the snooping logic.
To remove this problem, the address tag portion of the cache is duplicated so that an extra
read port is available for snooping; these two identical copies of the address tag are called
snoop tag and normal tag respectively. In this way, snooping interferes with the CPU’s
accesses to the cache only when the tags must be changed, that is, when the CPU has a
miss or when a coherence operation occurs. On a miss, the CPU arbitrates for the bus
to change the snoop tags as well as the normal tags (to keep the address tags coherent).
When a coherence operation occurs in the cache, the CPU will likely stall, since the cache
is unavailable.
Reducing Invalidation Interference
Some designs [20] queue the invalidation requests. A list of the addresses to be
invalidated is maintained in a small hardware-implemented queue called Buffer Invalida-
tion Address Stack (BIAS). The BIAS has a high priority for cache cycles, and if the
target line is found in the cache, it is invalidated. To reduce the interference between the
invalidation accesses to the cache and the normal CPU accesses, a BIAS filter memory (BFM) [20] may be used. A BFM is associated with each cache and works by filtering out repeated
requests to invalidate the same block in a cache.
Snooping protocols are fairly simple and inexpensive. For multiprocessors with a small
number of processors they perform well. The disadvantage is that the snooping protocols
are not scalable. Buses do not have the bandwidth to support a large number of processors.
The coherence traffic quickly increases with the number of processors because snooping
protocols require that all caches see every memory request from every processor. The
shared bus and the need to broadcast every memory request to all processor caches inher-
ently limit the scalability of snooping protocol-based machines, because the common bus
and the individual processor caches eventually saturate.
9.5 Directory-based Cache Coherence
9.5.1 Classification of Directory Schemes
A directory is a list of the locations of the cached copies for each line of shared data. A
directory entry is associated with each memory block and contains a number of pointers
to specify the locations of copies of the block and a state bit to specify whether or not a
unique cache has permission to write that line.
Depending on the amount of information stored in a directory entry, directory protocols
fall into two categories:
• full-map directories —the directory stores for each block in global memory informa-
tion about all caches in the system, so that every cache can simultaneously have a
copy of any block of data. In this case, the pointers in a directory entry are simply
presence bits associated with each cache. This type of protocol has the advantage
of allowing full-sharing of any memory block, but is not scalable with respect to
memory overhead. Indeed, assume that the amount of shared memory increases
linearly with the number of processors N. Because the size of a directory entry
is proportional to the number of processors, and the number of entries equals the
number of blocks, which is proportional to the memory size, the size of the
directory is Θ(N) × Θ(N) = Θ(N²).
• limited directories —each directory entry has a fixed number of pointers, regardless
of the number of caches in the system; they have the disadvantage of restricting the
number of simultaneously cached copies of a memory block, but have the advan-
tage of limiting the growth of the directory to a constant factor of the number of
processors.
When a single directory is accessed by all the caches in the system, the directory structure is called a centralized directory. The architecture of a system using a full-map
centralized-directory coherence protocol is shown in Figure 11. The main memory (and
the directory) can be made up from several memory modules.
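To make the Θ(N²) overhead of the full-map scheme concrete, the helper below (an illustration added here, with hypothetical memory and block sizes) counts the directory bits needed when each block carries N presence bits plus one state bit and the shared memory grows linearly with N.

#include <stdio.h>

double full_map_directory_bits(int n_procs, long long mem_bytes_per_proc,
                               int block_bytes)
{
    long long total_mem  = (long long)n_procs * mem_bytes_per_proc; /* grows as Θ(N) */
    long long num_blocks = total_mem / block_bytes;
    return (double)num_blocks * (n_procs + 1);      /* (N + 1) bits per block */
}

int main(void)
{
    /* Hypothetical figures: 16 MB of shared memory per processor, 32-byte blocks. */
    for (int n = 16; n <= 256; n *= 2)
        printf("N = %3d: %.0f Mbit of directory\n",
               n, full_map_directory_bits(n, 16LL << 20, 32) / 1e6);
    return 0;
}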
9.5.2 Full-Map Centralized-Directory Protocol
The classical centralized-directory protocol is a full-map protocol and was first proposed by Censier and Feautrier [1]. For each block of shared memory there is a directory
entry that contains:
• one presence bit per processor cache. The set of presence bits is a bit vector called
the presence vector;
Figure 11: Full-Map Centralized-Directory architecture (processors with private caches connected through an interconnection network to the memory and its directory of state bits and presence bits)
• one state bit (also called dirty bit) that indicates whether the block is uncached (not
cached by any cache; all presence bits are equal to zero), shared in multiple caches,
or held exclusively by one cache. In the latter case, the block is called dirty. When
the state bit shows a dirty block, only one presence bit is set, which indicates the
cache that holds the current copy of data, that is, the owner of the block. Otherwise,
the block is said to be clean.
Every cache in the system maintains two bits of state per block. One bit indicates whether
a block is valid; the other state bit indicates whether a valid block may be written, and
is called the private bit. If the private bit is set in a cache, then that cache has the only
valid copy of that line, that is, the line is dirty; the corresponding directory entry has the
dirty bit and the presence bit for that cache set. The cache that has the private bit set
for a line is said to own the line. The cache-coherence protocol must keep the state bits in
the directory (i.e., presence and dirty bits) and those in the caches (i.e., valid and private
bits) consistent.
Using the state and presence bits, the memory can tell which caches need to be invalidated
when a location is written. Likewise, the directory indicates whether memory’s copy of
the block is up-to-date or which cache holds the most recent copy.
The full-map centralized-directory protocol presented below can be applied to systems
with general interconnection networks and guarantees sequential consistency. A write-back strategy is assumed. In the initial state of the directory entry associated with a line X, none of the caches in the system has a copy of that line: the valid and private bits for line X are reset to zero in all caches, and the directory entry for line X has all the presence bits and the dirty bit reset to zero—that is, the line is clean and not cached.
When a cache, denoted by C1, has a read miss on line X, it requests the line from main
memory. The main memory sends the line to the cache and sets the presence bit for C1
in the directory entry to indicate that C1 has a copy of line X. The cache C1 fetches the
line and sets the corresponding valid bit. Similarly, when another cache, C2, requests a
copy of line X, the presence bit for C2 is set in the directory entry, C2 fetches the line,
and sets the valid bit.
Let us examine what happens when the processor P2 issues a WRITE to a word belonging
to line X:
1. Cache C2 detects that the word belongs to line X, which is valid, but it does not have permission to write the line, because the private bit in the cache is not set;
2. Cache C2 issues a write request to the main memory and stalls processor P2;
3. The main memory issues an invalidate request to cache C1 that contains a copy of
line X;
4. Cache C1 receives the invalidate request, resets the valid bit for line X to indicate
that the cached information is no longer valid, and sends acknowledgement back to
the main memory;
5. The main memory receives the acknowledgement, sets the dirty bit, clears the presence
bit for cache C1, and sends write permission to cache C2;
6. Cache C2 receives the write permission message, updates line X, sets the private
bit, and reactivates processor P2.
If processor P2 issues another write to a word in line X and cache C2 still owns line X,
then the write takes place immediately in the cache.
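On the memory side, steps 3 through 6 of the write sequence above can be summarized by the following C sketch; the message helpers are hypothetical stubs standing in for whatever transport the interconnection network provides.

#include <stdio.h>

#define NUM_CACHES 16                    /* assumed number of processor caches */

typedef struct {
    unsigned long presence;              /* one presence bit per cache */
    int           dirty;                 /* state (dirty) bit          */
} directory_entry;

/* hypothetical message helpers, stubbed so the sketch is self-contained */
static void send_invalidate(int c)       { printf("invalidate sent to cache %d\n", c); }
static void wait_acknowledgement(int c)  { printf("acknowledgement from cache %d\n", c); }
static void send_write_permission(int c) { printf("write permission to cache %d\n", c); }

/* Memory-side handling of a write request issued by cache `writer'. */
static void memory_handle_write_request(directory_entry *e, int writer) {
    for (int c = 0; c < NUM_CACHES; c++) {
        if (c != writer && (e->presence & (1UL << c))) {
            send_invalidate(c);          /* step 3                       */
            wait_acknowledgement(c);     /* step 4                       */
            e->presence &= ~(1UL << c);  /* step 5: clear presence bit   */
        }
    }
    e->dirty = 1;                        /* step 5: the block is dirty   */
    e->presence |= 1UL << writer;        /* the writer is now the owner  */
    send_write_permission(writer);       /* steps 5-6                    */
}

int main(void) {
    directory_entry x = { (1UL << 0) | (1UL << 1), 0 };  /* C1 and C2 share line X */
    memory_handle_write_request(&x, 1);                  /* C2 (index 1) writes    */
    return 0;
}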
If processor P1 attempts to read a word in line X after P2 has obtained ownership of line X,
then the following events will occur:
1. Cache C1 detects that the line X containing the word is in invalid state;
2. Cache C1 issues a read request to the main memory and stalls processor P1;
3. The main memory checks the dirty and presence bits in the directory entry for line
X and finds out that line X is dirty and cache C2 has the only valid copy of line X.
4. The main memory issues a read request for line X to cache C2;
5. Cache C2 receives the read request for line X from main memory, clears the private
bit, and sends the line to the main memory;
6. The main memory receives the line X from cache C2, clears the dirty bit, sends line
X to cache C1, and sets the presence bit for C1;
7. Cache C1 fetches the line X, sets the valid bit, and reactivates processor P1.
The disadvantage of this directory scheme is that it is not scalable with respect to the
directory overhead because directory size is Θ(N^2), where N is the number of processors, that is, the memory overhead scales as the square of the number of processors.
9.5.3 Limited-Directory Protocol
The limited-directory protocol, as proposed by Agarwal et al. [10], is designed to solve the directory size problem by allowing a constant number of caches to share any block, so
that the directory entry size does not change as the number of processors in the system
increases. A directory entry in a limited-directory protocol contains a fixed number of
pointers —denoted by i— which indicate the caches holding a copy of the line, and a
dirty bit (with the same meaning as for the full-map directory) for the line. The limited-
directory protocol is similar to the full-map directory, except in case when more than i
caches request read copies of a particular line of data.
An i-pointer directory may be viewed as an i-way set-associative cache of pointers to
shared copies. When the (i+1)-th cache requests a copy of line X, the main memory must
invalidate one copy in one of the i caches currently sharing the line X, and replace its
pointer with the pointer to the cache that will share the line. This process of pointer
replacement is called eviction. Since the directory acts as a set-associative cache, it must
have a pointer replacement strategy. Pseudorandom eviction requires no extra memory
overhead and is a good choice for the replacement policy. A pointer in a directory entry encodes a binary processor (and cache) identifier, so that for a system with N processors a pointer requires log2 N bits of memory. Therefore a directory entry requires i ∗ log2 N bits and, under the assumption that the amount of memory (and hence the number of memory lines) increases linearly with the number of processors, the memory overhead of the limited-directory protocol is:
Θ(i ∗ log2 N) ∗ Θ(N) = Θ(N ∗ log2 N)
Because the memory overhead grows approximately linearly with the number of proces-
sors, this protocol is scalable with respect to memory overhead. This protocol works well
for data that are not massively shared, but for highly shared data, such as a barrier synchronization variable, pointer thrashing occurs because many processors spin on the barrier variable.
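The difference in growth rate between the two directory organizations can be illustrated with a short calculation; the values of N and i below are arbitrary examples.

/* Bits of directory storage per memory block: N presence bits for a full-map
 * directory versus i pointers of ceil(log2 N) bits each for a limited
 * directory.  Example values only. */
#include <stdio.h>
#include <math.h>

int main(void) {
    int i = 4;                                   /* pointers per limited-directory entry */
    for (int N = 16; N <= 1024; N *= 4) {
        int full_map = N;                        /* one presence bit per cache           */
        int limited  = i * (int)ceil(log2(N));   /* i pointers of log2 N bits each       */
        printf("N = %4d: full-map %4d bits per entry, limited %3d bits per entry\n",
               N, full_map, limited);
    }
    return 0;
}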
Both limited-directory and full-map directory protocols present the drawback of using
a centralized directory, which may become a bottleneck when the number of processors increases. If the memory and directory are partitioned into independent units and connected to the processors by a scalable interconnect, the memory system can provide scalable memory bandwidth — this is the idea of distributed directory and memory, presented in the next subsection.
9.5.4 Distributed Directory and Memory
The idea is to achieve scalability by partitioning and distributing both the directory and
main memory, using a scalable interconnection network and a coherence protocol that
can suitably exploit distributed directories. The architecture of such a system is shown
in Figure 12, that depicts the Stanford DASH Multiprocessor [9] architecture. The name
DASH is an abbreviation for Directory Architecture for Shared Memory. The architecture
provides both the ease of programming of single-address-space machines (with caching
to reduce memory latency) and the scalability that was previously achievable only with
message-passing machines but not with cache-coherent shared-address machines.
The DASH architecture consists of a set of clusters (also called processing nodes) connected
by a general interconnection network. Each cluster consists of a small number (e.g., eight)
of high-performance processors and a portion of the shared memory interconnected by a
bus. Multiprocessing within the cluster may be viewed either as increasing the power of
each processing node or as reducing the cost of the directory and network interface by
amortizing it over a larger number of processors. The Dash architecture removes the scal-
ability limitation of centralized-directory architectures by partitioning and distributing
the directory and main memory, and by using a new coherence protocol that can suitably
exploit distributed directories. Distributing memory with the processors is essential be-
cause it allows the system to exploit locality. All private data and code references, along
with some of the shared references, can be made local to the cluster. These references
avoid the longer latency of remote references and reduce the bandwidth demands on the
global interconnection.
Figure 12: Distributed-Directory architecture (clusters, each containing processors with private caches, a portion of the main memory, and a directory, connected internally by a snooping bus and to the other clusters by an interconnection network)
The DASH architecture is scalable in that it achieves linear or near-linear performance
growth as the number of processors increases from a few to a few thousand. The memory
bandwidth scales linearly with the number of processors because the physical memory is
distributed and the interconnection network is scalable. Distributing the physical memory
among the clusters provides scalable memory bandwidth to data objects residing in local
memory, while using a scalable interconnection network provides scalable bandwidth to
remote data. The scalability of the network is not compromised by the cache coherence
traffic because the use of distributed directories removes the need for broadcasts and the
coherence traffic consists only of point-to-point messages between the processing nodes
that are caching that location. Since these nodes must have originally fetched the data,
the coherence traffic will be within some small constant factor of the original data traf-
fic. The scalability may be potentially disrupted due to the nonuniform distribution of
accesses across the machine. This happens when accesses are concentrated to data from
the memory of a sinlge cluster over a short duration of time — this access patterns cause
hot spots (Section 7.4) in memory. Hot spots can significantly reduce the memory and
network throughput because the distribution of resources provided by the architecture is
not exploited as it is under uniform access patterns. Many of the data hot spots can be
avoided through caching of shared writable data and Dash allows caching of these data.
Other hot spots are removed by software techniques; for example, the hot spot generated
by the access to a barrier variable may be removed by using a software combining tree
(Section 7.4.2).
The issue of memory access latency becomes more prominent as an architecture is scaled
to a larger number of nodes. There are two complementary approaches to reduce latency:
1. caching shared data —this significantly reduces the average latency for remote ac-
cesses because of the spatial and temporal locality of memory accesses. Hardware-
coherent caches provide this latency reduction mechanism. For references not sat-
isfied by the cache, the protocol attempts to minimize latency using a memory
hierarchy, as shown further below;
2. latency-hiding mechanisms —these mechanisms are intended to manage the inher-
ent latencies of a large machine corresponding to interprocess communication; tech-
niques used range from support of a relaxed memory consistency model — release
consistency (Section 8.7) — to support of nonblocking prefetch operations.
Regarding the scalability of the DASH machine with respect to the amount of directory
memory required, assume that the physical memory in the machine grows proportionally
with the number of processing nodes:
M = N × Mc Mbit
where N is the number of clusters, Mc is the megabits of memory per cluster, and M
is the total physical memory (expressed in megabits) of the machine. Using a full-map
directory (that is, a presence-bit vector to keep track of all clusters caching a memory block) requires a total amount of directory memory, denoted by D:
D = N × M/L = N^2 × Mc/L Mbit
where L is the cache line-size in bits. Thus, the directory overhead is growing as N^2/L
with the cluster memory size, or as N/L with the amount of total memory. For small and
medium N, this growth is tolerable. For example, consider a machine in which a cluster
contains 8 processors and has a cache line-size of 32 bytes. For N = 32, that is, 256
processors, the overhead for directory memory is only 12.5 percent of physical memory,
which is comparable with the overhead of supporting an error-correcting code on memory.
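This figure can be verified directly from the expression for D above; with N = 32 clusters and a line size of L = 32 bytes = 256 bits, the directory overhead relative to total memory is

D/M = N/L = 32/256 = 0.125, that is, 12.5 percent of the physical memory.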
For larger machines, where the overhead would become intolerable with the full-map directory, one can use the following approach to achieve scalability: the full-map directory is replaced with a limited directory that has a small number of pointers (Subsection 9.5.3) and, in the unusual case when the number of pointers is smaller than the number of clusters caching a line, invalidations are broadcast on the interconnection network. The reason that a limited directory can be used lies in the data sharing and write-invalidate patterns
exhibited by most applications. For example, it is shown in [9] that most writes cause
invalidations to only a few caches, with only about 2% of shared writes causing invalidation
of more than 3 caches.
A directory memory is associated with the memory present within each cluster. Each directory memory is contained in a Directory Controller (DC). The DC is responsible for maintaining cache coherence across the clusters and for serving as the interface to the interconnection network. The clusters and their associated portions of main memory are categorized into three types, according to the role played in a given transaction:
1. the local cluster — is the cluster that contains the processor originating a given
request; local memory refers to the main memory associated with the local cluster.
2. the home cluster — is the cluster that contains the main memory and directory for
a given physical memory address.
3. a remote cluster — is any other cluster; remote memory is any memory whose home
is not the local cluster.
Therefore, the Dash memory system can be logically broken into four levels of hierarchy,
as shown in Figure 13.
States in Directory and in Caches
The directory memory is organized as an array of directory entries. There is one entry for
each memory block of the corresponding memory module. A directory entry contains the
following pieces of information:
1. a state bit that indicates whether the clusters have a read (shared) or read/write
(dirty) copy of the data.
2. a presence bit-vector, which contains a bit for each of the clusters in the system.
If the state bit indicates a read copy and none of the presence bits is set to one, then the
block is said to be uncached. A memory block can be in one of three states, as indicated
by the associated directory entry:
1. uncached-remote, that is, not cached by any remote cluster;
2. shared-remote, that is, cached in an unmodified state by one or more remote clusters;
3. dirty-remote, that is, cached in a modified state by a single remote cluster.
As with memory blocks, a cache block in a processor’s cache may also be in one of three
states: invalid, shared, and dirty. The shared state implies that there may be other
processors caching that location. The dirty state implies that this cache contains an
exclusive copy of the memory block, and the block has been modified.
The Dash coherence protocol is an invalidation-based ownership protocol that uses the
information about the state of the memory block, as indicated by the directory entry
associated with each block. The protocol maintains the notion of owning cluster for each
memory block. The owning cluster is nominally the home cluster. However, in the case
that the memory block is present in the dirty state in a remote cluster, that cluster is the
owner. Only the owning cluster can complete a remote reference for a given block and
update the directory state. While the directory entry is always maintained in the home
cluster, a dirty cluster initiates all changes to the directory state of a block when it is the
owner (such update messages also indicate that the dirty cluster is giving up ownership).
The order in which operations reach the owning cluster determines their global order.
The directory does not maintain information concerning whether the home cluster itself is
caching a memory block because all transactions that change the state of a memory block
are issued on the bus of the home cluster, and the snoopy bus protocol keeps the home
cluster coherent. Issuing all transactions on the home cluster’s bus does not significantly
degrade performance since most requests to the home cluster also require an access to
main memory to retrieve the actual data.
To illustrate the directory protocol, we shall consider in turn how read requests and write
requests issued by a processor traverse the memory hierarchy.
Read request servicing
• Processor level — If the requested location is present in the processor’s cache, the
cache simply supplies the data. Otherwise, the request goes to the local cluster.
• Local cluster level — If the data resides within one of the other caches within the
local cluster, the data is supplied by that cache and no state change is required at
the directory level. If the request must be sent beyond the local cluster level, it goes
first to the home cluster corresponding to that address.
• Home cluster level — The home cluster examines the directory state of the memory
location while simultaneously fetching the block from main memory. If the block is
clean, the data is sent to the requester and the directory is updated to show sharing
by the requester. If the location is dirty, the request is forwarded to the remote
cluster indicated by the directory.
• Remote cluster level — The dirty cluster replies with a shared copy of the data, which is sent directly to the requester. In addition, a sharing write-back message is sent to the home level to update main memory and change the directory state to indicate that the requesting and remote cluster now have shared copies of the data. Having the dirty cluster respond directly to the requester, as opposed to routing it through the home cluster, reduces the latency seen by the requesting processor.

Figure 13: Memory Hierarchy of Dash (processor level: the processor's own cache; local cluster level: other processor caches within the local cluster; home cluster level: the directory and main memory associated with a given address; remote cluster level: processor caches in remote clusters)
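A minimal C sketch of the home-cluster decision in read servicing is given below; the directory-state names follow the text, while the message helpers and the cluster-identifier encoding are assumptions.

#include <stdio.h>

typedef enum { UNCACHED_REMOTE, SHARED_REMOTE, DIRTY_REMOTE } dir_state;

typedef struct {
    dir_state     state;
    unsigned long presence;              /* one presence bit per cluster */
} dir_entry;

/* hypothetical message helpers, stubbed so the sketch is self-contained */
static void send_data(int cluster, unsigned long addr) {
    printf("data for address 0x%lx sent to cluster %d\n", addr, cluster);
}
static void forward_to_owner(int owner, int requester, unsigned long addr) {
    printf("read of 0x%lx forwarded to dirty cluster %d for requester %d\n",
           addr, owner, requester);
}
static int first_sharer(unsigned long presence) {
    int c = 0;
    while (!(presence & (1UL << c)))
        c++;
    return c;
}

/* Home-cluster handling of a read request that was not satisfied locally. */
static void home_handle_read(dir_entry *e, int requester, unsigned long addr) {
    if (e->state != DIRTY_REMOTE) {
        send_data(requester, addr);      /* memory copy is up to date */
        e->presence |= 1UL << requester; /* record the new sharer     */
        e->state = SHARED_REMOTE;
    } else {
        /* the dirty cluster replies directly to the requester and sends a
         * sharing write-back to the home, which then updates this entry */
        forward_to_owner(first_sharer(e->presence), requester, addr);
    }
}

int main(void) {
    dir_entry x = { DIRTY_REMOTE, 1UL << 3 };  /* cluster 3 holds the line dirty */
    home_handle_read(&x, 1, 0x1000UL);
    return 0;
}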
Write request servicing
• Processor level — If the location is dirty in the writing processor’s cache, the write
can complete immediately. Otherwise, a Read-exclusive request is issued on the local
cluster’s bus to obtain exclusive ownership of the line and retrieve the remaining
portion of the cache line.
• Local cluster level — If one of the caches within the cluster already owns the cache
line, then the read-exclusive request is serviced at the local level by a cache-to-cache
transfer. This allows processors within a cluster to alternately modify the same
memory block without any intercluster interaction. If no local cache owns the block,
then a read-exclusive request is sent to the home cluster.
• Home cluster level — The home cluster can immediately satisfy an ownership request
for a location that is in the uncached or shared state. In addition, if a block is in the
shared state, then all cached copies must be invalidated. The directory indicates the
clusters that have the block cached. Invalidation requests are sent to these clusters
while the home concurrently sends an exclusive data reply to the requesting cluster.
If the directory indicates that the block is dirty, then the read-exclusive request must
be forwarded to the dirty cluster, as in the case of a read.
• Remote cluster level — If the directory had indicated that the memory block was
shared, then the remote clusters receive an invalidation request to eliminate their
shared copy. Upon receiving the invalidation, the remote clusters send an acknowl-
edgement to the requesting cluster. If the directory had indicated a dirty state, then
the dirty cluster receives a read-exclusive request. As in the case of the read, the
remote cluster responds directly to the requesting cluster and sends a dirty-transfer
message to the home cluster indicating that the requesting cluster now holds the
block exclusively.
When the writing cluster receives all invalidation acknowledgements or the reply from the
home or dirty cluster, it is guaranteed that all copies of the old data have been purged from
the system. If the processor delays completing the write until all acknowledgements are
received, then the new write value will become available to all other processors at the same
time. However, invalidations involve round-trip messages to multiple clusters, resulting in
potentially long delays. Higher processor utilization can be obtained by allowing the write
to proceed immediately after the ownership reply is received from the home. This leads
to the memory model of release consistency.
9.6 Compiler-directed Cache Coherence Protocols
Software-based coherence protocols require compiler assistance. Compile-time analysis is
used to obtain information on accesses to a given line by multiple processors. Such infor-
mation can allow each processor to manage its own cache without interprocessor runtime
communication. Compiler-directed management of caches implies that a processor has
to issue explicit instructions to invalidate cache lines.
To eliminate the stale-data problem caused by process migration, the following method can
be used:
Cache-flushing
To eliminate the stale-data problem for cacheable, nonshared data, the processor can flush
its cache each time a program leaves a processor. This guarantees that main memory
becomes the current active location for each variable formerly held in cache. The cache
flush approcah can also be used for the I/O cache-coherence problem (Section 5.13.1).
While this solution prevents the stale-data from being used, the cache invalidations caused
by flushes may have as effect an increase in Miss Rate.
Regarding the coherence problem caused by accesses to shared data, there are two simple
coherence schemes:
Not caching shared data
Each shared datum can be made noncacheable to eliminate the difficulty in finding its
current location among caches and main memory. Data can be made noncacheable by
several methods, for example, by providing a special range of addresses for noncacheable
data, or by using special LOAD and STORE instructions that do not access cache at all.
Not caching shared writable data
This is an improvement of the previous method. Since for performance reasons it is
desirable to attach a private cache to each CPU, one can prevent data inconsistency by
not caching shared writable data, that is, by making such data noncacheable. Examples
of shared writable data are locks, shared data structures such as process queues, and any
other data protected by critical sections. When shared writable data are not cached, no
coherence problem can occur. This solution is implemented using programmer’s directives
that instruct the compiler to allocate shared writable data to noncacheable regions of
memory. The drawback of this scheme is that large data structures cannot be cached,
although most of the time it would be safe to do so.
While these solutions have the advantage of simple implementation, they have a negative
effect on performance because they reduce the effective use of cache. As pointed out in
many works (for example, in [8]), shared-data accesses account for a large portion of global
memory accesses. Therefore, allowing shared writable data to be cached when it is safe to
do so is crucial for performance.
Efficient Compiler-directed Cache Coherence
We shall present an efficient compiler-directed scheme following the ideas from Cheong
and Veidenbaum [8]. Basically, caching is allowed when it is safe, and the cache is flushed
when coherence problems occur. The operating environment of the coherence algorithm
is:
• Parallel task-execution model
The execution of a parallel program is represented by tasks, each executed by a single pro-
cessor. Task migration is not allowed. Tasks independent of each other can be scheduled
for parallel execution. Dependent tasks will be executed in the order defined by program
semantics. The execution order of dependent tasks is enforced through synchronization.
The execution order is described by the dependence relationship among tasks, which can
be modeled by a directed graph, G = {E, T}, where T is a set of nodes and E is a set of
edges. A node, Ti ∈ T, represents a task, and a directed edge, eij ∈ E, represents that
some statements in Tj depend on other statements in Ti. Ti is called a parent node and
Tj is called a child node. Task nodes are combined into a single node using the following
criterion: two nodes Ti and Tj connected by an edge eij can be combined into one node if
Ti is the only parent of Tj, and Tj is the only child of Ti. The task graph can be divided
into levels L = {L0, . . ., Ln}, where each Li is a set of tasks such that the longest directed
path from T0, the starting node, to each of the tasks in the set has i edges and tasks on
each level are not connected by any directed edges. Therefore, tasks on the same level
perform no write accesses or read-write accesses to the same data by different processors.
Such tasks can be executed in parallel without interprocess synchronization.
• Program Model
Parallelism in a program is assumed to be expressed in terms of parallel loops. A parallel
loop specifies starting execution of iterations of the loop by multiple processors. In a Doall
type of parallel loop, all such iterations are independent and can be executed in any order.
In a Doacross type loop, there is a dependence between iterations. In terms of tasks, one
or more iterations of a Doall loop are bundled into a task, while in a Doacross loop, one
iteration is a task and synchronization exists between tasks.
• System Model
A weakly ordered system model is assumed; while it does not guarantee sequential consistency, the program model is quite simple and allows higher performance than for strongly
ordered systems. In terms of the task-execution model, this implies that the values written
in a task level must be deposited in the shared memory before the task boundary can be
crossed. Parallel execution without intertask synchronization is assumed. The memory
references of a program consist of instruction fetches, private-data accesses, and shared-
data accesses. Private data may only become a problem if task migration is considered.
It is assumed that instructions, private data, and shared read-only data accesses can be
recognized at runtime and will not be affected by the cache coherence mechanism. The
value in the shared memory is assumed to be always current. Incoherence is defined as the
condition when a processor performs a memory fetch of a value X, and a cache hit occurs,
but the cache has a value different from that in main memory; otherwise, the fact that
the memory and the cache have different values is not an error. The following instructions
are assumed to be available for cache management:
• Invalidate. This instruction invalidates the entire contents of a cache. Using reset-
table static random-access memories for valid bits, this can be accomplished in one
or two cycles with low hardware cost.
• Cache-on. This instruction causes all global memory references to be routed through
the cache.
• Cache-off. This instruction causes all global memory references to bypass the cache
and go directly to memory.
The cache state, on or off, must be part of the processor state and must be saved/restored
on a context switch. Processes are created in a cache-off state.
• Cache Management Algorithm
The necessary conditions for the cache incoherence to occur on a fetch of X require that:
(1) a value of X is present in the cache of processor Pj, and
(2) a new value has been stored in the shared memory by another processor after the
access by Pj that brought X into the cache.
The above conditions can be formulated in terms of data dependences, and a compiler can
then check for a dependence structure that might result in coherence violations. However,
this would be complex because first, the test will have to be performed for every read
reference, and second, data dependence information does not specify whether the references
involved are executed by different processors. Therefore, the compiler performs data
dependence analysis to determine the loop type, and processor assignment is part of the
loop execution model. By definition, any dependence between two statements inside a
Doall loop is not across iterations. It follows that a statement Si in a Doall dependent on
a statement Sj in the same loop is executed on the same processor as Sj. On the other
hand, cross-iteration dependences are present in a Doacross loop. In a Doacross loop, two
statements with a cross-iteration dependence are executed on different processors, whereas
statements with a dependence on the same iteration are executed on the same processor.
The algorithm uses loop types for its analysis as follows:
(1) A Doall loop has no dependences between statements executed on different pro-
cessors. Therefore, any shared-memory access in such a loop can be cached. Caching is
turned on.
(2) A serial loop is executed by a single processor, and shared-memory accesses can be
cached. Caching is turned on.
(3) Doacross or recurrence loops do have cross-iteration dependences. Therefore, condi-
tions for incoherence can be true. Caching is turned off.
(4) An Invalidate instruction is executed by each processor entering a Doall or Doacross.
The processor continuing execution after a Doall also executes an Invalidate instruction.
In terms of a task graph, these points are equivalent to task-level boundaries.
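The four rules above can be condensed into a short sketch of the code-generation step; the loop_type enumeration and the emit helpers are hypothetical stand-ins for the compiler's internal interfaces, and the per-loop placement of Cache-on/Cache-off is one possible rendering of rules (1)-(3).

#include <stdio.h>

typedef enum { LOOP_DOALL, LOOP_SERIAL, LOOP_DOACROSS } loop_type;

/* hypothetical code-generation helpers */
static void emit(const char *instruction)    { printf("%s\n", instruction); }
static void emit_loop_body(const char *name) { printf("  <body of %s>\n", name); }

/* Insert cache-management directives around one loop, following rules (1)-(4). */
static void instrument_loop(loop_type t, const char *name) {
    if (t == LOOP_DOALL || t == LOOP_DOACROSS)
        emit("Invalidate");      /* rule (4): executed once by each
                                    participating processor on entry      */
    if (t == LOOP_DOACROSS)
        emit("Cache-off");       /* rule (3): cross-iteration dependences */
    else
        emit("Cache-on");        /* rules (1) and (2): caching is safe    */

    emit_loop_body(name);

    if (t == LOOP_DOALL)
        emit("Invalidate");      /* rule (4): the processor continuing
                                    after a Doall also invalidates        */
}

int main(void) {
    emit("Cache-on");            /* processes are created in a cache-off state,
                                    so caching is turned on at the start  */
    instrument_loop(LOOP_DOALL,    "Doall i");
    instrument_loop(LOOP_DOACROSS, "Doacross j");
    instrument_loop(LOOP_SERIAL,   "serial loop k");
    return 0;
}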
Consider the program example in Figure 14. At the beginning, every processor executes
the Cache-on instruction. Cache management instructions inserted in parallel loops are
executed once by every participating processor, not on every iteration of such a loop.
The correctness of the algorithm is proven by showing that the conditions necessary for
an incoherence to occur are not satisfied in programs processed by the algorithm. The al-
gorithm preserves temporal locality at each task level. This algorithm eliminates runtime
Cache-on
Doall i=1,n
    Invalidate
    Y(i) = ...
    = W(i)...Y(i)
    = ...X(i)
    ...
enddo
Invalidate
...
Doall j=1,n
    Invalidate
    ...
    = W(j)...Y(j)
    X(j) = ...
    ...
enddo
Invalidate
...
Doall k=1,n
    Invalidate
    = W(k)
    = ...X(k)
    ...
    = ...Y(k)
    ...
enddo
Invalidate
Doserial i=1,n
    ...
    = X(i)
    ...
    = X(f(i))
    ...
enddo

Figure 14: Program Example
communication for coherence maintenance, and keeps the time cost of invalidate indepen-
dent of the number of invalidated lines. However, it does not allow caching for Doacross
loops and requires each processor to execute an Invalidate instruction when entering a
Doall or a Doacross, thereby increasing the Miss rate.
9.7 Line Size Effect on Coherence Protocol Performance
Line size plays an important role in cache coherence. For example, consider that a single
word is alternately written and read by two processors. Whether a snooping protocol or
a directory protocol is used, a line size of only a word has an advantage over a larger line
size because it involves invalidation only for the data really changing. Therefore, smaller
line sizes can decrease the coherence overhead.
Another problem with large line size is the effect called false sharing. This effect appears
when two different shared variables are located in the same cache block. This situation
causes the block to be exchanged between the processors even though the processors are
in fact accessing different variables. Compiler technology is important in allocating data
with high processor locality to the same blocks and thereby reducing cache miss rate and
avoiding false sharing. Success in this field could increase the desirability of larger
blocks for multiprocessors. Measurements to date indicate that shared data has lower
spatial and temporal locality than observed for other types of data, independent of the
coherence protocol.
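The effect is easy to reproduce at the source level; in the C sketch below, two counters written by different processors share a cache line unless they are padded apart (the 64-byte line size is an assumption used only for the illustration).

#include <stdio.h>

/* count0 and count1 are written by different processors; in this layout they
 * typically fall in the same cache line, so each write by one processor
 * invalidates the other processor's copy of the line (false sharing). */
struct counters_shared_line {
    long count0;     /* written by processor 0 */
    long count1;     /* written by processor 1 */
};

/* Padding to an assumed 64-byte line keeps the counters in separate lines. */
struct counters_padded {
    long count0;
    char pad[64 - sizeof(long)];
    long count1;
};

int main(void) {
    printf("unpadded: %zu bytes, padded: %zu bytes\n",
           sizeof(struct counters_shared_line), sizeof(struct counters_padded));
    return 0;
}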
10 MEMORY SYSTEM DESIGN AS A SYNERGY
10.1 Computer design requirements
Computer architects must design a computer to meet functional as well as price and
performance goals. Often, they also have to determine what the functional requirements
are, and this can be a major task. The requirements may be specific features, inspired
by the market. For example, the presence of a large market for a particular class of
applications might encourage the designers to incorporate requirements that would make
the machine competitive in that market. A classification of the functional requirements
that need to be considered when a machine is designed is given below ([6]):
Application area —this aspect refers to the target of the computer:
• Special purpose : typical feature is higher performance for specific applications;
• General purpose : typical feature is balanced performance for a range of tasks;
• Scientific : typical feature is high-performance for floating point;
• Commercial : typical features are support for data bases, transaction processing and
decimal arithmetic.
Level of software compatibility —this aspect determines the amount of existing soft-
ware for the machine:
• At programming language : this type of compatibility is the most flexible for the designer but requires new compilers;
• Object code or binary compatible: in this type of compatibility, the architecture is completely defined —little flexibility— but no investment is needed in software or in porting programs.
Operating system requirements — the necessary features to support a particular
operating system (OS) include:
• Size of address space : this is a very important feature; it may limit applications;
• Memory management : it may be flat, paged, segmented;
• Protection: the OS and applications may have different needs: page versus seg-
mented protection;
• Context switch : this feature supports process interrupt and restart;
• Interrupts and traps : the type of support for these features has an impact on hardware design and the OS.
Standards —the machine may be required to comply with certain standards in the mar-
ketplace:
• Floating point: pertains to format and arithmetic; there are several standards such
as IEEE, DEC, IBM;
• I/O Bus: pertains to I/O devices, for which standards such as SCSI, VME, Futurebus are defined;
• Network : support for different networks such as Ethernet, FDDI;
• Programming languages : this is related to the support of standards such as ANSI C, and affects the instruction set.
Once a set of functional requirements has been established, the architect must try to
optimize the design. The optimal design may be considered the one that meets one of the
three criteria:
1. high-performance design — no cost is spared in achieving performance;
2. low-cost design — performance is sacrificed to achieve lowest cost;
3. cost/performance — balancing cost against performance. Optimizing cost/performance is largely a question of where the best place is to implement some required functionality: hardware or software? Balancing hardware and software will lead to the best machine for the application domain.
The performance of the machine can be quantified by using a set of programs that are
chosen to represent that application domain. The measures of performance are:
CPU Execution time, which indicates the combined performance of the CPU and the memory hierarchy;
Response time, which is a measure of the entire system performance, taking into account
the operating system and Input/Output.
The design of the memory system involves three components: Main Memory, Cache
Memory, and Interconnection network. The performance improvement of the CPU has
been and still is faster than that of the main memory. The CPU performance has improved
25% to 50% per year after 1985, while the DRAM memory performance has improved only
7% per year. Cache memories are supposed to bridge the gap between CPU and main
memory speeds, and thereby to decrease the CPU execution time. The elements that
must be considered when evaluating the impact of caches on CPU execution time include:
hit time, miss penalty, miss rate —and the effect of I/O and multiprocessing on miss rate—, and memory system latency and contention —which depend on the system architecture, the memory consistency model supported by the machine, the cache-coherence protocol, the synchronization methods, and the application behavior.
10.2 General Memory Design Rules
In designing the memory hierarchy, one should take into account the pertinent rules of
thumb. These are [6]:
1. Amdahl/Case Rule: A balanced computer system needs about 1 megabyte of main
memory capacity and 1 megabit per second of I/O bandwidth per MIPS of CPU perfor-
mance.
2. 90/10 Locality Rule: A program executes about 90% of its instructions in 10% of its
code.
3. Address-Consumption Rule: The memory needed by the average program grows by
about a factor of 1.5 to 2 per year; thus, it consumes between 1/2 and 1 address bit per
year.
4. 90/50 Branch-Taken Rule: About 90% of backward-going branches are taken while
about 50% of forward-going branches are taken.
5. 2:1 Cache Rule: The miss ratio of a direct-mapped cache of size X is about the same
as a 2-way set associative cache of size X/2.
6. DRAM-Growth Rule: Density increases by about 60% per year, quadrupling in 3 years.
7. Disk-Growth Rule: Density increases by about 25% per year, doubling in 3 years.
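As an illustration of how rules 1 and 3 above might be applied, the short C program below sizes a balanced system and projects its memory demand; the 50-MIPS starting point and the 1.75 growth factor (the midpoint of the 1.5 to 2 range) are arbitrary example values.

#include <stdio.h>

int main(void) {
    double mips      = 50.0;          /* assumed CPU performance                   */
    double mem_mb    = mips * 1.0;    /* rule 1: about 1 MB of memory per MIPS     */
    double io_mbit_s = mips * 1.0;    /* rule 1: about 1 Mbit/s of I/O per MIPS    */

    printf("Balanced system at %.0f MIPS: %.0f MB memory, %.0f Mbit/s I/O\n",
           mips, mem_mb, io_mbit_s);

    /* rule 3: program memory demand grows by a factor of 1.5 to 2 per year */
    for (int year = 1; year <= 3; year++) {
        mem_mb *= 1.75;
        printf("after %d year(s): about %.0f MB needed\n", year, mem_mb);
    }
    return 0;
}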
Two important remarks are worthwhile. The first is that the improvement in DRAM
capacity (approximately 60% per year) has been faster in the recent past than the im-
provement in CPU performance (which has been from 25% to 50% per year). The second
fact is that the increase in DRAM speed is much slower than the increase in DRAM
capacity, reaching about 7% per year (these data are from Hennessy and Patterson [6]).
10.3 Dependences Between System Components
The optimum solution to the memory system design results from a synergy between a
multiprocessor’s software and hardware components. Some dependences that should be
considered when designing the memory hierarchy are:
1. The memory-consistency model supported by an architecture directly affects the
complexity of the programming model and the performance;
2. The correctness of a coherence protocol is a function of the memory consistency
model adopted by the architecture.
3. The implementation of a synchronization method influences the memory and coher-
ence traffic;
4. The cache organization, the replacement strategy, the write policy, and the coherence
protocol influence the memory traffic.
5. The number of processors influences the application behavior; as the number of pro-
cessors concurrently executing an application increases, each processor is expected
to use a smaller amount of the address space, but on the other hand the interprocess communication overhead increases. Therefore, the synchronization activity is expected to increase the memory traffic as the number of processors increases.
6. The application behavior influences the memory traffic and therefore the latency of
memory accesses. The application behavior can be characterized by several proper-
ties:
(a) locality of accesses;
(b) size and number of shared data locations among processes;
(c) the ratio between READs and WRITEs of shared data by a process;
(d) the length of the write-run;
(e) the number of processes that access each shared data: data may be widely
shared or shared by a small number of processes;
(f) the frequency of accesses to shared data by processes;
(g) the granularity of parallelism: coarse-grained, medium-grained, or fine-grained
applications;
7. Compilers for parallel applications are important in achieving high performance. A
parallelizing compiler can improve access locality and it is also important for the
support of synchronization (Section 8.3).
The software support is of particular importance. The parallelizing compiler extracts par-
allelism from programs written for sequential machines and tries to improve data locality.
Locality may be enhanced by increasing cache utilization through blocking.
Therefore, the entire multiprocessor system must be studied when designing its compo-
nents, and within each component the dependences should be considered. It is worthwhile
to emphasize the importance of software and of the parallelism exhibited by applications
for achieving good performance on a highly parallel machine.
10.4 Optimizing cache design
Cache design involves choosing the cache size, organization (associativity, line size, number
of sets), write strategy, replacement algorithm, coherence protocol, and perhaps employing several performance improvement schemes — such as those presented in Chapter 6. The performance parameters depend on several design choices:
• cache cost — it is affected by all design choices;
• hit time — it is affected by the cache size, cache associativity, and write strategy;
• miss rate — it is affected by the cache size, associativity, line size, replacement
algorithm;
• miss penalty — it is affected by the line size, number of cache levels, cache coherence
protocol, and memory latency and bandwidth;
Subsections 10.4.1 to 10.4.6 show what criteria are taken into account when design choices
are made. Subsection 10.4.7 presents some design alternatives to improve performance.
The steps followed in the synthesis of the cache are then described in Section 10.5.
10.4.1 Cache Size
Increasing the cache size has the positive effect of reducing the miss rate, more specifically
the capacity misses and the conflict misses. However, the increase of cache size is limited
by several factors:
1. the hit time of the cache must be at most one CPU clock cycle;
2. the page size and the degree of associativity (equation (22), Section 6.2);
3. the silicon area (or, alternatively, the number of chips) required to implement the
cache;
As shown in section 6.2, the requirement to perform a cache hit in one CPU clock cycle
imposes on the cache size, C, the restriction:
C = n ∗ 2^(j+k) ≤ n ∗ 2^p    (43)
where n is the degree of associativity, 2^j is the line size, 2^k is the number of sets, and 2^p is the page size. This is because the number of page-offset bits must be at least the sum of the number of bits that select the set, k, and the number of block-offset bits, j:
j + k ≤ p    (44)
Another limitation is the cost of the Tag memory. The number of address-tag bits, tb, required for each cache line is:
tb = m − (j + k)    (45)
where m is the number of bits in the memory address. The memory required by the address tags, T, is:
T = n ∗ 2^k ∗ (m − (j + k)) = C ∗ 2^(−j) ∗ (m − (j + k))    (46)
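The following short C program evaluates equations (43) through (46) for one set of parameters; the numeric values are illustrative assumptions, not recommendations.

/* Relations (43)-(46): cache size, the hit-time condition, and tag-memory
 * size, with n = associativity, 2^j = line size, 2^k = number of sets,
 * 2^p = page size, m = address bits.  Example values only. */
#include <stdio.h>

int main(void) {
    int n = 2, j = 5, k = 7, p = 12, m = 32;

    long C = (long)n << (j + k);          /* eq. (43): cache size in bytes      */
    int  condition_44 = (j + k <= p);     /* eq. (44): one-cycle hit possible   */
    int  tb = m - (j + k);                /* eq. (45): tag bits per line        */
    long T = ((long)n << k) * tb;         /* eq. (46): total tag bits           */

    printf("C = %ld bytes, (44) %s, tb = %d bits per line, T = %ld tag bits\n",
           C, condition_44 ? "holds" : "is violated", tb, T);
    return 0;
}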
Because the cache must be fast, it is built from SRAM chips, while main memory is
built from DRAM chips because it must provide large capacity. For memory designed
in comparable technologies, the cycle time (i.e., the minimum time between requests to
memory) of SRAMs is 8 to 16 times faster than DRAMs, while the capacity of DRAMs is
roughly 16 times that of SRAMs. The cache hit time is a lower bound for the CPU clock cycle, while the Miss penalty —which appears in the expression of CPU time (equation (12), Section 5.2) in CPU clock cycles— is determined by the main memory access latency and bandwidth; it follows that the ratio of the SRAM cycle time to the DRAM cycle time directly affects the Miss penalty.
10.4.2 Associativity
The choice of associativity influences many performance parameters such as the miss rate,
the cache access time and the silicon area (or, alternatively, the number of chips). The
positive effect of increased associativity is the decrease in miss rate.
The degree of associativity affects the size of the Tag memory (equation (46), Subsection
10.4.1) which in turn affects the total cost of the cache. If the total size of the cache,
C, and the line size, 2j
, are kept constant, then increasing the associativity n, increases
the number of blocks per set, thereby decreasing the number of sets, 2k (as can be seen
from equation (43), Subsection 10.4.1). But if k decreases, the number of address-tag
bits per line increases (equation (45), Subsection 10.4.1) and the total size of the tag
memory increases (equation (46)), thus the cost increases. The degree of associativity also
determines the number of comparators needed to check the Tag against a given memory
address, the complexity of the multiplexer required to select the line from the matching
set, and thereby increases the hit time. Associativity is therefore expensive in hardware and may slow the access time, leading to lower overall performance.
Therefore, increasing associativity —although it has the beneficial effect of reducing the miss rate— is limited by several constraints:
1. increasing associativity makes the cache slower, affecting the hit time of the cache
which must be kept lower than the CPU clock cycle;
2. increasing the associativity increases the cost of the cache and of the tag memory;
3. increasing associativity requires larger silicon area (or, alternatively, more chips) to
implement the cache;
Because direct-mapped caches allow only one data block to reside in the cache set specified
by the Index portion of the memory address, they have a miss rate worse than that of a
set-associative cache of the same total size. However, the higher miss rate is mitigated by
the smaller hit time: a set-associative cache of the same total size always displays a higher
hit time because an associative search of a set is required during each reference, followed by
the multiplexing of the appropriate line to the processor. As shown in Section 6.1, a direct-
mapped cache can sometimes provide a better performance than a 2-way set associative
cache of the same size. Furthermore, direct-mapped caches are simpler and easier to
design, do not need logic to maintain a least-recently-used replacement policy, and require
less area than a set-associative cache of the same size. Overall, direct-mapped caches are
often the most economical choice for use in workstations, where cost-performance is the
most important criterion.
Results obtained for many architectures and applications indicate that an associativity greater than 8 gives little or no decrease in miss ratio. Because greater associativity means a slower cache and higher cost, the associativity chosen is usually not greater than 8.
10.4.3 Line Size and Cache Fetch Algorithm
The minimum line size is determined by the machine word length. The maximum line
size is limited by several factors such as: the miss rate, the memory bandwidth (which
determines the transfer time, and thereby the Miss penalty), and the impact of line size on
the performance of cache-coherent multiprocessors (Section 9.7). A larger line size reduces
the cost of the tag memory, T, as equation (46), Subsection 10.4.1, shows, where the line
size is 2^j.
The line size influences other parameters in the following way:
1. increasing the line size decreases the compulsory misses (exploits spatial locality)
and increases the conflict misses (does not preserve temporal locality). Therefore,
for small line sizes increasing the line size is expected to decrease the miss rate —due
to spatial locality—, while for large line sizes increasing the line size is expected to
increase the miss ratio —because it does not preserve the temporal locality.
2. increasing the line size increases the miss penalty because the transfer time increases.
3. increasing the line size decreases the cost of the Tag memory; the Tag overhead
becomes a smaller fraction of the total cost of the cache.
4. increasing the line size in multiprocessor architectures may increase the invalidation
overhead and may cause false data sharing if the compiler does not enforce that
different shared variables are located in different cache blocks (Section 9.7). On the
other hand, because more information is invalidated at once for larger line sizes, the frequency of invalidations may decrease when the line size increases, but this depends to a large extent on the data sharing patterns (Section 9.3) and on the compiler.
The first two effects of the line size must be considered together because they affect the
Average memory-access time (equation (15), Section 5.2). For example, Smith has found
([20]) that for the IBM 3033 (64-byte line, cache size 64 Kbyte) the line size that gives
the minimum miss rate lies in the range 128 to 256 bytes. The reason for which IBM
has chosen a line size of 64 bytes (which is not optimum with respect to the miss rate)
is almost certainly that the transmission time for longer lines increases the miss penalty,
and the main memory data path width required would be too large and therefore too
expensive. Measurements on different cache organizations and computer architectures
indicate ([6],[20]) that the lowest Average memory-access time is for line sizes ranging
from 8 to 64 bytes.
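The trade-off between the first two effects can be illustrated numerically; the miss rates and miss penalties in the C sketch below are invented for the illustration and are not measurements.

/* Average memory-access time, AMAT = Hit time + Miss rate * Miss penalty
 * (equation (15)), evaluated for several line sizes with assumed miss rates
 * and miss penalties: the miss rate falls and the penalty rises with the
 * line size, so AMAT has a minimum at an intermediate line size. */
#include <stdio.h>

int main(void) {
    double hit_time = 1.0;                              /* CPU clock cycles */
    int    line_size[]    = {16, 32, 64, 128};          /* bytes            */
    double miss_rate[]    = {0.100, 0.060, 0.045, 0.040};
    double miss_penalty[] = {10.0, 14.0, 22.0, 38.0};   /* cycles           */

    for (int i = 0; i < 4; i++) {
        double amat = hit_time + miss_rate[i] * miss_penalty[i];
        printf("line size %3d bytes: AMAT = %.2f cycles\n", line_size[i], amat);
    }
    return 0;
}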
When choosing the line size for a multiprocessor architecture, the effect of the line size on
the overall traffic (that is, both data and coherence traffic) should be considered as well.
For example, A. Gupta and W. Weber have found ([14]) that for the DASH multiprocessor
architecture the line size that gives the minimum overall traffic is 32 bytes.
When some criteria (e.g., miss rate) point to using a larger line size and the transfer time
significantly affects the miss penalty, then the methods explained in Section 6.9 may be
used to improve main memory bandwidth.
The two available choices for cache fetch are demand fetching and prefetching. The purpose
of prefetching (Section 6.5) is to reduce the miss rate for reads. However, prefetching
introduces some penalties: it increases the memory traffic and introduces cache lookup
accesses. The major factor in determining whether prefetching is useful is the line size.
Small line sizes generally result in a benefit from prefetching, while large line sizes lead
to the ineffectiveness of prefetching. The reason for this is that when the line is large, a
prefetch brings in a great deal of information, much or all of which may not be needed,
and removes an equally large amount of information, some of which may still be in use.
Smith ([20]) has found that for line sizes greater than 256 bytes prefetching brings no
improvement. The fastest hardware implementation of prefetching is provided by one
block lookahead (OBL) prefetch. Always prefetch provides a greater decrease in miss
ratio than prefetch on misses, but also introduces a greater memory and cache overhead.
As shown in Section 6.5, special steps must be taken to keep prefetching interference with
normal program accesses at an acceptable level.
As a general rule, one-block-lookahead prefetch with a line size of L bytes is a better
choice than demand fetch with a line size of 2L bytes, because the former choice allows the
processor to proceed while the bytes L + 1 . . . 2L are fetched.
10.4.4 Line Replacement Strategy
Cache-line replacement strategy affects the miss rate and the memory traffic. The two
candidate choices are random replacement and LRU (or an approximation of it). FIFO replacement is not considered a good choice because it has been shown to generally perform worse than random, and its hardware cost is greater than that of random. The replacement policy plays a greater role in smaller caches than in larger caches, where there are more choices of what to replace. Although LRU performs better than random, its edge over random in terms of miss rate is less significant for large cache sizes, and the virtue of random —being simple to build in hardware— may become more important.
LRU is the best choice for small and medium size caches because it preserves temporal
locality.
10.4.5 Write Strategy
For uniprocessor architectures, either write through or write back can be used. Write back
usually uses fetch-on-write —with the hope that the written line will be referenced again soon, either by write or read accesses—, while write through usually does not use write allocate —with the purpose of keeping room in the cache for data that is read, and because
the subsequent writes to that block still have to go to memory. Using write through for
uniprocessors simplifies the cache-coherence problem for I/O because with this policy the
memory has an up-to-date copy of information and special schemes must be used to prevent
inconsistency only for I/O input but not for output (as shown in Subsection 5.13.1).
For multiprocessor architectures the typical write strategy is write-back with fetch-on-
write, in order to reduce the interconnection network traffic. With write back, write-hits
occur at the speed of the cache memory, and multiple writes within a line require only
one write to main memory. Since not every write is going to memory, write back uses less
memory bandwidth, which is an important aspect in multiprocessors.
10.4.6 Cache Coherence Protocol
The cache coherence protocol is selected depending on the type of interconnection network,
the number of processors present in the system, and the sharing pattern of the applica-
tions. In shared-bus systems, either snooping or directory protocols may be selected, and
generally the choice is related to cost and to the number of processors: snooping protocols
are less costly, but they are not able to scale to many processors. For general interconnection networks and scalable multiprocessor architectures, directory protocols are the appropriate choice.
A design aspect of the coherence protocol is the write policy: write invalidate or write up-
date. Directory-based protocols are based on the write-invalidation strategy. For snooping
protocols, there is no clear hint whether write-invalidate or write-update is better. Some
applications perform better with write invalidate, others with write-update. This behavior
is due to the fact that the performance of both schemes is sensitive to the sharing pattern,
of particular importance being the write-run (Section 9.3). The choice can be made on the
basis of the sharing pattern of the applications for which the system is mainly targeted.
The length of the write-run points to the following choice for the write policy:
— for long write-runs, write-invalidate is the better choice;
— for short write-runs, write-update is the better choice.
10.4.7 Design Alternatives
When the desired performance cannot be achieved by adjusting the design parameters alone, performance improvement techniques (described in Chapter 6) can be applied, depending on the problem:
1. Miss penalty — The read miss penalty can be reduced by employing early restart or out-of-order fetch (Section 6.3). For write-back caches, a write buffer (Section 6.7) can
be used. The write buffer is also useful for write-through caches because it reduces
the write stalls. Two-level caches (Section 6.8) also provide a reduction in the miss
penalty. Another approach is to reduce the transfer time by increasing the main
memory bandwidth (Section 6.9).
2. Miss rate — The miss rate can be reduced using prefetching (Section 6.5). For
direct-mapped caches a victim cache or a column associative scheme may be used
(Section 6.4).
3. Hit time — The read hit time for cache organizations that do not satisfy condition
(44) can be reduced by pipelining the TLB (Section 6.2). The write hit time can
be reduced by pipelining the writes or using subblock placement for write-through
direct-mapped caches (Section 6.6).
10.5 Design Cycle
The synthesis of the memory hierarchy is achieved by dividing the global design goal into
subgoals (i.e., subtasks) and achieving these subtasks. This process is called goal reduction
and is imposed by the complexity of the initial task. Knowledge is maintained by IDAMS
in a modular form, to reflect the knowledge base involved in solving the subtasks that
are carried out as part of the goal tree. Goal reduction is achieved using an agenda that
specifies the subtasks. Initially, the subtasks in the agenda may look like this:
1. Main Memory,
2. Cache Memory, and
3. Interconnection network
The subtasks are solved by agents; as the agents solve subtasks, they erase them from the
agenda and possibly replace them with other, simpler tasks. A clear definition of goals is an important requirement on the input to IDAMS in order to obtain relevant results. The steps
followed in designing the cache level of the memory hierarchy are:
1. input information analysis — this step collects information about the System under
Design (SUD) that affects the memory synthesis:
(a) uni/multiprocessor architecture;
(b) shared or distributed memory architecture;
(c) memory consistency model;
(d) type of interconnection network;
(e) instruction set;
(f) application programs;
(g) compiler technology.
2. extraction — this step translates the information collected in step 1 into more detailed
parameters:
(a) abstract model of the SUD;
(b) application behavior model — working-set size, frequency of shared accesses,
ratio of read to write accesses by one processor, length of the write-runs;
(c) constraints on the memory, such as the available hit time and a coherence protocol
consistent with the type of interconnection network and with the consistency model.
3. main design step — adjustment of the parameters, using the information from step 2
and the design rules.
4. performance check — when analytical models are not available, simulation is used to
determine the performance; the achieved performance is checked against the required
performance. If the requirements are met, the design is complete; otherwise step 5
is taken, or the designer is asked to provide additional information.
5. performance improvement — more specific parameters are analyzed to determine
where improvements should be made; the parameters examined when choosing an
improvement include compulsory misses, capacity misses, conflict misses, miss rate,
transfer time, and miss penalty. Using the insight provided by this analysis, some
design parameters are changed (for example, if the conflict miss rate is high, then
increase the associativity) or specific techniques (such as a column-associative cache,
memory interleaving, prefetching, write pipelining, or two-level caches) are employed.
6. repeat step 4.
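Steps 3 through 6 form a generate-and-test loop. The following self-contained sketch
illustrates one possible shape of that loop; the miss-rate model, the starting parameters,
and the improvement heuristic are invented for the example and would, in IDAMS, come
from the knowledge base, analytical models, or simulation.

    def estimate_miss_rate(cache_size_kb, associativity):
        # Toy stand-in for step 4; a simulator or analytical model would be
        # used in practice.
        return 0.20 / (cache_size_kb ** 0.5) + 0.02 / associativity

    def design_cache(required_miss_rate, max_iterations=10):
        # Step 3: start from an initial parameter adjustment.
        design = {"cache_size_kb": 8, "associativity": 1}
        for _ in range(max_iterations):
            # Step 4: performance check.
            miss_rate = estimate_miss_rate(**design)
            if miss_rate <= required_miss_rate:
                return design, miss_rate        # design complete
            # Step 5: a real system would analyze compulsory, capacity, and
            # conflict misses and pick a specific technique (victim cache,
            # prefetching, interleaving, ...); here we simply raise the
            # associativity first and then grow the cache.
            if design["associativity"] < 4:
                design["associativity"] *= 2
            else:
                design["cache_size_kb"] *= 2
            # Step 6: repeat the performance check on the next iteration.
        raise RuntimeError("requirements not met; additional designer input needed")

    print(design_cache(required_miss_rate=0.025))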
11 CONCLUSIONS
The report has addressed issues related to the synthesis of the upper level of the memory
hierarchy, the cache, covering cache design choices, their internal and external
dependences, and performance improvement strategies. The specialized knowledge involved
in cache design, for both uniprocessor and multiprocessor architectures, has been provided,
and the performance impact of the design choices has been pointed out.
The domain-specific knowledge presented for cache synthesis is to be applied in building
the IDAMS tool. The steps followed by IDAMS in the synthesis of the cache have
been described. The results point out the importance of the generate-and-test strategy
for evaluating design alternatives. The rule-based system paradigm is used as the
problem-solving strategy in building IDAMS: checking design rules and constraints and
selecting design parameters and alternatives follow this strategy. Future work on IDAMS
will address structuring the acquired knowledge and finding an adequate representation
for it.
REFERENCES
1. L.M. Censier and P. Feautrier, “A New Approach to Coherence Problems in Multicache
Systems.” IEEE Trans. Computers, Vol. C-27, No. 12, Dec. 1978, pp. 1112–1118.
2. L. Lamport, “How to Make a Multiprocessor Computer That Correctly Executes
Multiprocess Programs.” IEEE Trans. Computers, No. 9, Sept. 1979, pp. 690–691.
3. M. Dubois, C. Scheurich, and F.A. Briggs, “Synchronization, Coherence, and Ordering
of Events in Multiprocessors.” Computer, Vol. 21, No. 2, Feb. 1988, pp. 9–21.
4. M. Dubois and S. Thakkar, “Cache Architectures in Tightly Coupled Multiprocessors.”
Computer, Vol. 23, No. 6, June 1990, pp. 9–11.
5. J. Archibald and J.L. Baer, “Cache Coherence Protocols: Evaluation Using a Multiprocessor
Simulation Model.” ACM Transactions on Computer Systems, Vol. 4, No. 4, Nov. 1986,
pp. 273–298.
6. D.A. Patterson and J.L. Hennessy, “Computer Architecture: A Quantitative Approach.”
Morgan Kaufmann Publishers, Inc., San Mateo, Calif., 1990.
7. H.S. Stone, “High-Performance Computer Architecture.” Second edition, Addison-Wesley,
Reading, Mass., 1990.
8. H. Cheong and A.V. Veidenbaum, “Compiler-Directed Cache Management in Multiprocessors.”
Computer, Vol. 23, No. 6, June 1990, pp. 39–47.
9. D. Lenoski, J. Laudon, W. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam,
“The Stanford DASH Multiprocessor.” Computer, Vol. 25, No. 3, March 1992, pp. 63–79.
10. A. Agarwal et al., “An Evaluation of Directory Schemes for Cache Coherence.” Proc.
15th Annual Intl. Symp. on Computer Architecture, IEEE Computer Society Press,
Los Alamitos, Calif., June 1988, pp. 280–289.
11. A. Agarwal and S.D. Pudar, “A Technique for Reducing the Miss Rate of Direct-Mapped
Caches.” Proc. 20th Annual Intl. Symp. on Computer Architecture, IEEE Computer
Society Press, Los Alamitos, Calif., May 1993, pp. 179–189.
12. N.P. Jouppi, “Improving Direct-Mapped Cache Performance by the Addition of a Small
Fully-Associative Cache and Prefetch Buffers.” Proc. 17th Annual Intl. Symp. on
Computer Architecture, IEEE Computer Society Press, Los Alamitos, Calif., May 1990,
pp. 364–373.
13. K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy,
“Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors.”
Proc. 17th Annual Intl. Symp. on Computer Architecture, IEEE Computer Society Press,
Los Alamitos, Calif., May 1990, pp. 15–26.
14. A. Gupta and W. Weber, “Cache Invalidation Patterns in Shared-Memory Multiprocessors.”
IEEE Trans. Computers, Vol. 41, No. 7, July 1992, pp. 794–810.
15. G. Pfister and V. Norton, “Hot Spot Contention and Combining in Multistage
Interconnection Networks.” IEEE Trans. Computers, Vol. C-34, Oct. 1985, pp. 943–948.
16. P. Yew, N. Tzeng, and D. Lawrie, “Distributing Hot-Spot Addressing in Large-Scale
Multiprocessors.” IEEE Trans. Computers, Vol. C-36, No. 4, April 1987, pp. 388–395.
17. D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy, “The Directory-Based
Cache Coherence Protocol for the DASH Multiprocessor.” Proc. 17th Annual Intl. Symp.
on Computer Architecture, IEEE Computer Society Press, Los Alamitos, Calif., May 1990,
pp. 148–159.
18. S.J. Eggers and R.H. Katz, “A Characterization of Sharing in Parallel Programs and
its Application to Coherency Protocol Evaluation.” Proc. 15th Annual Intl. Symp. on
Computer Architecture, IEEE Computer Society Press, Los Alamitos, Calif., June 1988,
pp. 373–382.
19. P. Stenström, M. Brorsson, and L. Sandberg, “An Adaptive Cache Coherence Protocol
Optimized for Migratory Sharing.” Proc. 20th Annual Intl. Symp. on Computer
Architecture, IEEE Computer Society Press, Los Alamitos, Calif., May 1993, pp. 109–118.
20. A.J. Smith, “Cache Memories.” ACM Computing Surveys, Vol. 14, No. 3, Sept. 1982,
pp. 473–530.
View publication statsView publication stats

Memory synthesis using_ai_methods

  • 1.
    MEMORY SYNTHESIS USINGAI METHODS Gabriel Mateescu August 18, 1993 Research Project Report Universit¨at Dortmund European Economic Community Individual Fellowship Contract Number: CIPA-3510-CT-925978 i
  • 2.
    ii Memory SynthesisUsing AI Methods
  • 3.
    Memory Synthesis UsingAI Methods iii Contents 1 RESEARCH PROJECT GOALS 1 2 INTELLIGENT DESIGN ASSISTANT FOR MEMORY SYNTHESIS 3 2.1 High Level Organization of IDAMS . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 Knowledge Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3 PERFORMANCE AND COST 9 3.1 Performance Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 3.2 Performance improving: Amdahl’s Law . . . . . . . . . . . . . . . . . . . . 10 3.3 CPU Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 4 COMPUTER ARCHITECTURE OVERVIEW 12 4.1 An Architecure Classification . . . . . . . . . . . . . . . . . . . . . . . . . . 12 4.2 Multiprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 4.3 Multiprocessing performance . . . . . . . . . . . . . . . . . . . . . . . . . . 15 4.4 Interprocess communication and synchronization . . . . . . . . . . . . . . . 16 4.5 Coherence, Consistency, and Event Ordering . . . . . . . . . . . . . . . . . 17 5 MEMORY HIERARCHY DESIGN 19 5.1 General Principles of Memory Hierarchy . . . . . . . . . . . . . . . . . . . . 19 5.2 Performance Impact of Memory Hierarchy . . . . . . . . . . . . . . . . . . . 21 5.3 Aspects that Classify a Memory Hierarchy . . . . . . . . . . . . . . . . . . . 23 5.4 Cache Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 5.5 Line Placement and Identification . . . . . . . . . . . . . . . . . . . . . . . . 26 5.6 Line Replacement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 5.7 Write Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 5.8 The Sources of Cache Misses . . . . . . . . . . . . . . . . . . . . . . . . . . 30 5.9 Line Size Impact on Average Memory-access Time . . . . . . . . . . . . . . 31 5.10 Operating System and Task Switch Impact on Miss Rate . . . . . . . . . . 31 5.11 An Example Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 5.12 Multiprocessor Caches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
  • 4.
    iv Memory SynthesisUsing AI Methods 5.13 The Cache-Coherence Problem . . . . . . . . . . . . . . . . . . . . . . . . . 35 5.13.1 Cache Coherence for I/O . . . . . . . . . . . . . . . . . . . . . . . . 36 5.13.2 Cache-Coherence for Shared-Memory Multiprocessors . . . . . . . . 37 5.14 Cache Flushing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 6 IMPROVING CACHE PERFORMANCE 39 6.1 Cache Organization and CPU Performance . . . . . . . . . . . . . . . . . . 39 6.2 Reducing Read Hit Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 6.3 Reducing Read Miss Penalty . . . . . . . . . . . . . . . . . . . . . . . . . . 41 6.4 Reducing Conflict Misses in a Direct-Mapped Cache . . . . . . . . . . . . . 42 6.4.1 Victim Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 6.4.2 Column-Associative Cache . . . . . . . . . . . . . . . . . . . . . . . . 43 6.5 Reducing Read Miss Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 6.6 Reducing Write Hit Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 6.6.1 Pipelined Writes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 6.6.2 Subblock Placement . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 6.7 Reducing Write Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 6.8 Two-level Caches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 6.8.1 Reducing Miss Penalty . . . . . . . . . . . . . . . . . . . . . . . . . . 49 6.8.2 Second-level Cache Design . . . . . . . . . . . . . . . . . . . . . . . . 50 6.9 Increasing Main Memory Bandwidth . . . . . . . . . . . . . . . . . . . . . . 51 6.9.1 Wider Main Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 6.9.2 Interleaved Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 7 SYNCHRONIZATION PROTOCOLS 54 7.1 Performance Impact of Synchronization . . . . . . . . . . . . . . . . . . . . 54 7.2 Hardware Synchronization Primitives . . . . . . . . . . . . . . . . . . . . . . 54 7.2.1 TEST&SET(lock) and RESET(lock) . . . . . . . . . . . . . . . . . . . . 54 7.2.2 FETCH&ADD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 7.2.3 Full/Empty bit primitive . . . . . . . . . . . . . . . . . . . . . . . . 56 7.3 Synchronization Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
  • 5.
    Memory Synthesis UsingAI Methods v 7.3.1 LOCK and UNLOCK operations . . . . . . . . . . . . . . . . . . . . . . . 57 7.3.2 Semaphores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 7.3.3 Mutual Exclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 7.3.4 Barriers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 7.4 Hot Spots in Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 7.4.1 Combining Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 7.4.2 Software Combining Trees . . . . . . . . . . . . . . . . . . . . . . . . 60 7.5 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 8 SYSTEM CONSISTENCY MODELS 63 8.1 Event Ordering Aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 8.2 Categorization of Shared Memory Accesses . . . . . . . . . . . . . . . . . . 64 8.3 Memory Access Labeling and Properly-Labeled Programs . . . . . . . . . . 66 8.4 Sequential Consistency Model . . . . . . . . . . . . . . . . . . . . . . . . . . 68 8.4.1 Conditions for Sequential Consistency . . . . . . . . . . . . . . . . . 69 8.4.2 Consistency and Shared-Memory Architecture . . . . . . . . . . . . . 69 8.4.3 Performance of Sequential Consistency . . . . . . . . . . . . . . . . . 70 8.5 Processor Consistency Model . . . . . . . . . . . . . . . . . . . . . . . . . . 71 8.6 Weak Consistency Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 8.7 Release Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 8.8 Correctness of Operation and Performance Issues . . . . . . . . . . . . . . . 74 9 CACHE COHERENCE PROTOCOLS 76 9.1 Types of Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 9.2 Rules enforcing Cache Coherence . . . . . . . . . . . . . . . . . . . . . . . . 78 9.3 Cache Invalidation Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 9.4 Snooping Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 9.4.1 Implementation Issues . . . . . . . . . . . . . . . . . . . . . . . . . . 79 9.4.2 Snooping Protocol Example . . . . . . . . . . . . . . . . . . . . . . . 80 9.4.3 Improving Performance of Snooping Protocol . . . . . . . . . . . . . 82 9.5 Directory-based Cache Coherence . . . . . . . . . . . . . . . . . . . . . . . . 83
  • 6.
    vi Memory SynthesisUsing AI Methods 9.5.1 Classification of Directory Schemes . . . . . . . . . . . . . . . . . . . 83 9.5.2 Full-Map Centralized-Directory Protocol . . . . . . . . . . . . . . . . 83 9.5.3 Limited-Directory Protocol . . . . . . . . . . . . . . . . . . . . . . . 86 9.5.4 Distributed Directory and Memory . . . . . . . . . . . . . . . . . . . 87 9.6 Compiler-directed Cache Coherence Protocols . . . . . . . . . . . . . . . . . 94 9.7 Line Size Effect on Coherence Protocol Performance . . . . . . . . . . . . . 98 10 MEMORY SYSTEM DESIGN AS A SYNERGY 99 10.1 Computer design requirements . . . . . . . . . . . . . . . . . . . . . . . . . 99 10.2 General Memory Design Rules . . . . . . . . . . . . . . . . . . . . . . . . . 101 10.3 Dependences Between System Components . . . . . . . . . . . . . . . . . . 101 10.4 Optimizing cache design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 10.4.1 Cache Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 10.4.2 Associativity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 10.4.3 Line Size and Cache Fetch Algorithm . . . . . . . . . . . . . . . . . 105 10.4.4 Line Replacement Strategy . . . . . . . . . . . . . . . . . . . . . . . 106 10.4.5 Write Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 10.4.6 Cache Coherence Protocol . . . . . . . . . . . . . . . . . . . . . . . . 107 10.4.7 Design Alternatives . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 10.5 Design Cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 11 CONCLUSIONS 109
  • 7.
    Memory Synthesis UsingAI Methods 1 1 RESEARCH PROJECT GOALS This report presents the results of a three-month work carried out at University Dort- mund, within the framework of the European Economic Community individual fellowship contract CIPA-3510-CT-925978. The purpose of the work has been to provide domain knowledge for a knowledge based memory synthesis tool that is now under development at Lehrstuhl Informatik XII, Universit¨at Dortmund. The increasing gap between processor and main memory speeds has led computer archi- tects to the concept of memory hierarchy. The gap expresses the trend that CPUs are getting faster and main memories are getting larger, but slower relative to the faster CPUs. The performance improvement of the CPU has been and still is faster than that of the memory. The memory hierarchy concept recognizes that smaller memory is faster, and is based on organizing the memory in levels, each smaller, faster, and more expensive per byte than the level below. Synthesizing a memory hierarchy for a System Under Design (SUD) is a complex design task that cannot be approached separatedly from the entire system design, beginning with the architecture features and going on to the compiler technology and to application characteristics. Two main problems arise in designing a memory hierarchy: first, the need of knowledge about the design rules and the available design choices that are reflecting the state of the art in the domain, and second, a way to evaluate the performance of the memory hierarchy for the system under design, taking into account aspects such as the architecture, number of processors, typical application programs for which the machine is targeted, compiler technology, and manufacturing technology (e.g., silicon technology, packaging). Solving the first problem requires extensive specific knowledge of design parameters and their relationships, and unfortunately, some relationships can not be expressed exaclty for the general case (e.g., which cache line-size to choose for a given cache size). Similarly, the impact of many design decisions on performance is not quantifiable for the general case (e.g., which miss rate will occur for a given cache size). Several analytical models have been proposed for evaluating the performance impact of the design parameters. Generally, a combination of simulation and analytical methods is used to evaluate the performance of a design alternative. Analytical models for performance evaluation are limited to a given architecture or to a range of similar architectures. However, when a new architecture is developed, analytical models may not be available. To overcome these problems, a knowledge-based memory synthesis tool is proposed. The tool is an expert system that designs a (part of) memory hierarchy for a specified SUD, and it is currently developed at Universit¨at Dortmund by Renate Beckmann who has dubbed it: Intelligent Design Assistant for Memory Synthesis (IDAMS). Because of the limited duration of the research stay, I have chosen to focus my attention on the design of the upper level of memory hierarchy, that is, the cache. Design of cache is tackled both for uniprocessor and multiprocessor architectures. The domain knowledge provided in this report will be incorporated in the IDAMS. Efforts have been made to
  • 8.
    2 Memory SynthesisUsing AI Methods cover the state of the art in cache design, but it is likely that some aspects have been overlooked. However, the flexibility of IDAMS allows incorporation of future knowledge as it will be available. The organization of the report is as follows. First, the high-level organization of the design assistant for memory synthesis (IDAMS) is explained in Chapter 2. Since cost-performance is a crucial design evaluation criterion, the measure of computer performance is described in Chapter 3. The architecture of a system affects the cache organization and coherence protocol. Architectural aspects that should be considered when designing the memory system are discussed briefly in Chapter 4. Chapter 5 presents the basic design issues for the memory hierarchy, with emphasis on caches. There exists a great number of techiques for improving the performance of the basic cache design and the most important ones are presented in Chapter 6. Synchronization is imperative for parallel programming and the efficiency of synchronization operations has a great impact on multiprocessor performance. Synchronization issues are discussed in Chapter 7. The memory-consistency model of a system has a direct effect on the complexity of the programming model, on the achievable implementation efficiency and on the amount of overhead associated with cache coherence protocols —thus on performance. The major memory consistency models are presented in Chapter 8, and cache-coherence protocols are analized in Chapter 9. Finally, all is put together in Chapter 10, and the steps that are involved in cache memory design are shown, based on the knowledge incorporated in the previous chapters. I would like to thank Renate Beckmann, assistant at the Department of Computer Science, University of Dortmund. We had many useful discussions about this project and her ideas, comments, and suggestions helped me a lot in clarifying several design aspects. Much credit goes to Renate especially for her contributions to Chapter 2. I am particularly grateful to Professor Peter Marwedel, Chair of the Department of Computer Science, for giving me the opportunity to work at the University of Dortmund and for providing me with a generous logistic support.
  • 9.
    Memory Synthesis UsingAI Methods 3 2 INTELLIGENT DESIGN ASSISTANT FOR MEMORY SYNTHESIS 2.1 High Level Organization of IDAMS The Intelligent Design Assistant for Memory Synthesis (IDAMS) is a knowledge-based tool for memory synthesis that configurates (part of) a memory hierarchy for a specified system under design (SUD). Its input information contains some details of the architecture of the SUD and the ap- plication domain for which the machine is designed. The architectural information will influence some of the design decisions. For example, if cache is used in the memory hier- archy we have to deal with the coherence problem to guarantee the consistency of shared data. Another required piece of information is the application domain. The more is known about the applications the SUD is designed for, the more is known about the typical char- acteristics of memory accesses, and the greater is the possibility to configurate the memory hierarchy such that for typical cases memory accesses are fast enough. For example, in the domain of digital signal processing (DSP) large amount of data are processed, but the ac- cesses exhibit a sequential pattern. This kind of application favors memory organizations for which once a datum is fetched, the next data can be found easily. Because of the high dependences between the memory structure and the architectural features of the SUD, it should be possible for IDAMS to interact with the environment if additional information about the architecture is needed; if there are some design al- ternatives on which IDAMS can not decide, then the designer (user of IDAMS) may be asked. IDAMS deals with the memory design alternatives by maintaining a generic model of the memory hierachy. Every useful design possibility for the memory is expressed by param- eters in this model. For example, the memory hierarchy can be expressed as consisting of a main memory of size M, with interleaving yes/no, and a cache yes/no of size C, with line size L, associativity n, and so on. To configurate a memory hierarchy for a specific SUD means to adjust these parameters so that all requirements on the memory are met. There are many parameters to adjust and some of them have a lot of possible values (e.g., memory size). This leads to a great number of possible choices when searching for the right combination of parameter adjustments. The output of IDAMS will be a model of the memory hierarchy that specifies the design parameters and that meets the requirements imposed by the designer. This model may be transmitted to a module generator that generates the components of the memory hierarchy at a lower level of abstraction. There is no complete theory about how to design a memory hierarchy. That makes it difficult to write an algorithm for this problem. Therefore, IDAMS is organized as an expert system. This approach has the advantage that knowledge about memory design (domain knowledge) can be separated from that about the organization and control of the
  • 10.
    4 Memory SynthesisUsing AI Methods . expert knowl. specific problem user interview component explanation component knowledge acquisition component problem solving component domain specific knowl. intermediate states and problem solution Figure 1: Expert System Architecture the design process (inference component). This makes it simple to extend the knowledge — a necessary feature in problems with incomplete theory and knowledge about the problem solving step. Another advantage of the expert system approach is that it supports the process of modeling the rules of thumb (heuristics) of an expert designer. The well known architecture of an expert system is illustrated in Figure 1. An expert system contains several kinds of knowledge: The domain specific knowledge consists of rules about the domain; for IDAMS, these are rules about how to design a memory hierarchy. The problem specific knowledge holds the information about the actual problem to be solved; for IDAMS, this is the information and the requirements about the SUD for which the memory hierarchy is designed. The intermediate states are the descriptions of partial solutions. For this system, the partial solution is initially the generic model of the memory hierarchy. During the problem solving step the parameters of the model are adjusted by IDAMS. The solution of the problem is the model in which all parameters have been adjusted. An important part of an expert system is the expert system shell. It contains the problem solving component (also called the inference unit). The problem solving unit searches in the knowledge base for a rule that is applicable to the current intermediate state. If a rule is found, it is applied to the current state and a new state is reached. If more than one rule are found then the problem solver has to select one rule. This can be done by several strategies: use the newest rule, use the most specific one (that one with the most specific IF-part), take the one with the highest priority (given in the rule), or randomly select a rule. The interaction between the designer and the expert system is managed by the interview component. The knowledge acquisition component has to insert new knowledge given by the expert into
  • 11.
    Memory Synthesis UsingAI Methods 5 the knowledge base. New knowledge about the memory design process can be inserted into IDAMS through this component. This may be an editor used to write new rules into specified files. The explanation component gives information about the problem solving process to make it transparent to the user and to the expert. This may display the last rule applied (how the next state is reached) or the state before applying the last rule (why the rule has been selected). The explanation component may be emulated by trace modus. From the IDAMS point of view, the user is the designer of the memory hierarchy for a specific machine. The expert inserts the rules about memory synthesis into IDAMS. He/she may be an expert in designing memories or he/she may be a knowledge engineer who acquires his/her knowledge from literature or from a memory design expert. There is a big amount of knowledge in the area of memory synthesis and some of it adresses specific parts of the memory design process. To keep an overview over the knowledge base and to handle changes and extensions of the knowledge, it should be modular. This can be done by building the IDAMS with a blackboard architecture, as shown in Figure 2. In an expert system with a blackboard architecture there are several agents, and each agent acts as an expert for a special subtask (containing the rules to solve the subtasks). An agenda contains the (sub)tasks to be done. Each agent that is solving a subtask erases it from the agenda and possibly creates new subtasks that are inserted there. The agents can communicate through a blackboard from/on which every agent can read/write information. The knowledge can be structured in IDAMS with respect to the architecture of the SUD, the components of the memory hierarchy, or the problem to deal with. Each module is handled by some agents: • architecture agents: The architecture agents handle the knowledge about the architecture of the SUD. These agents know the requirements that must be met by the system and the com- ponents. Architecture agents may exist for uniprocessors, multiprocessors, etc. • memory component agents: These agents have knowledge about the parameters of a specific component. Memory component agents may exist for different levels of the memory hierarchy: cache, main memory, secondary memory. For example, the cache agent knows which parameters of the cache to adjust and the constraints on the allowable choices. • special domain agents: The knowledge about the domain for which the system is designed is maintained by these agents. They analyze the special requirements of the domain for which the SUD is designed. Some decisions on the design of the memory hierarchy are made here, taking into account the characteristics of the domain. Special domain agents may exist for general-purpose processors, digital signal processing, AI machines, etc.
  • 12.
    6 Memory SynthesisUsing AI Methods architect. agent multiprocessor blackboard ... ...... domain agent DSP component agent cache agent i ... generic memory model design parameters ... cache size . effects to requirements ... interferencing architecture informations and requirements module generator memory model interaction knowledge base Figure 2: Blackboard Architecture of IDAMS
  • 13.
    Memory Synthesis UsingAI Methods 7 The agents have to work together to adjust the parameters because the parameters are interdependent and are influenced by the domain for which the machine is targeted. For example, if the line size of a cache has to be chosen, the cache agent and the special domain agent are needed. If the special domain agent is the DSP agent, he will perhaps favor large line sizes, because for this domain data is often accessed sequentially — data next to the currently accessed data is likely to be needed soon. On the other hand, the cache agent knows that increasing the line size may have the negative effect of increasing the average memory-access time (as shown in Section 5.9). Generally, the dependences between the parameters of the memory hierarchy are manifold, they are hard to express exactly, and cooperation between agents is absolutely necessary. The modular structure of the expert system allows using expert system building tools. Expert system building tools are tools providing an expert system shell, so that only the knowledge has to be inserted into the system. The report deals mainly with the cache synthesis for uni- and multiprocessors. Cache, the level closer to the processor in the memory hierarchy, is crucial for achieving high performance. The cache design will be an important part of the IDAMS system. 2.2 Knowledge Acquisition The availabiltiy of expert system tools helps building the IDAMS. The tools usually con- sist of an expert system shell, hence the main work left to do is to acquire the necessary knowledge, to formalize it, and to insert it into the system using an adequate represen- tation. The process is called knowledge acquisition and is mentioned in the literature as the bottleneck in constructing expert systems, because it is hard work: designers with knowledge and expertise are usually busy and expensive. They get their knowledge from working in the special domain. The knowledge is sometimes unstructured and unformal- ized. Another problem is the lack of motivation: an expert with special knowledge has some kind of power and to make the knowledge public domain means to loose power. To solve the knowledge acquisition problem, two main strategies have been developed: • Direct methods — experts are asked about their knowledge. Interviews, questionar- ies, introspection (observing an expert solving a special problem and explaining the steps), self report (asking an expert to explain an existing solution), and protocol checking (asking an expert to check a protocol of a former knowledge acquisition session) are direct methods. • Indirect methods — they attempt to acquire knowledge that is not explicit in the brain of the expert. Their goal is not to find out how to solve the problem, but to discover how to select an adequate form of knowledge representation. This report addresses the direct methods for knowledge acquisition. The sources of knowl- dge include information extracted from books and scientific papers published by leading scientists in the field, and my own experience in designing computers. The basic knowledge
  • 14.
    8 Memory SynthesisUsing AI Methods necessary for synthesizing a cache with the IDAMS is presented and, as shown in Section 10.5, the acquired knowledge influences the steps followed by IDAMS in the design cycle.
  • 15.
    Memory Synthesis UsingAI Methods 9 3 PERFORMANCE AND COST 3.1 Performance Measurement The designer of a computer system may have a single design goal: designing only for high- performance, or for low-cost. In the first case, no cost is spared in achieving performance, while in the second case performance is sacrificed to achieve the lowest cost. In between these extremes is the cost/performance design, where the designer balances cost versus performance. We need a measure of the performance. Time is the measure of computer performance, but time can be measured in different ways, depending on what we count. Generally, performance may be viewed from two perspectives: • the computer user is interested in reducing response time — the time between the start an the completion of the job — also referred to as execution time, elapsed time, or latency. • the computer center manager is interested in increasing throughput — the total amount of work done in a given time — sometimes called bandwidth. Typically, the terms “response time”. “execution time”, and “throughput” are used when an entire computing task is discussed. The terms “latency” and “bandwidth” are often the terms of choice when discussing a memory system. Therefore, the two aspects of performance are performing a given amount of work in the least time —reducing response time—, or maximizing the number of jobs that are performed in a given amount of time —increasing throughput. Performance is frequently measured as a rate of some number of events per second, so that lower time means higher performance. Given two machines, say X and Y, the phrase “X is n% faster than Y” means that: n = 100 PerformanceX − PerformanceY PerformanceY (1) where PerformanceX PerformanceY = Execution timeY Execution timeX (2) Because performance and execution times are reciprocicals, increasing performance de- creases execution time. To help avoid confusion between the terms “increasing” and “decreasing”, we usually say “improve performance”or “improve execution time” when we mean increase performance and decrease execution time. The program response time is measured in seconds per program, and includes disk accesses, memory accesses, input/output activities, operating system overhead — everything. Even if the response time seen by the user is the elapsed time of the program, one must take into account that with multiprogramming the CPU works on another program while waiting
  • 16.
    10 Memory SynthesisUsing AI Methods for I/O and may not necessarily minimize the response time of one program. The measure of the CPU performance is the CPU time (Section 3.3) which means the time CPU is executing a given program, not including the time running other programs or waiting for I/O. 3.2 Performance improving: Amdahl’s Law An important principle of computer design is to make the common case fast and favor the frequent case over the infrequent case. That is, the performance impact of an improvement of an event is higher if the occurence of the event is frequent. A fundamental law, called Amdahl’s Law, is quantifying this principle: The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. In other words, Amdahl’l law addresses the speedup that can be gained by using a particular feature. Speedup is defined as: Speedup = Performance for entire job using the enhancement when possible Performance for entire job without using the enhancement (3) Speedup tells us how much faster a task will run using the machine with the enhance- ments as opposed to the original machine. Amdahl’s Law expresses the law of diminishing returns: The incremental improvement in speedup gained by an improvement in per- formance of just a portion of the computation diminishes as improvements are added. In other words, an enhancement that provides a speedup denoted by Speedupenhanced but is used only a fraction of time, denoted by Fractionenhanced, will provide an overall Speedup: Speedupoverall = 1 (1 − Fractionenhanced) + Fractionenhanced Speedupenhanced (4) An important corollary of Amdahl’s Law is that if an enhancement is only usable for a fraction of a task, we can’t speed up the task by by more than the reciprocal of 1 minus that fraction. Amdahl’s Law can serve as a guide to how much an enhancement will improve performance and how to distribute resources to inprove cost/performance. The idea is to spend resources proportional to where time is spent. 3.3 CPU Performance The response time includes the time of waiting for I/O, the time the CPU may work on another program (in multiprogramming systems), and operating system overhead. If we
  • 17.
    Memory Synthesis UsingAI Methods 11 are interested in the CPU performance, its measure is the time the CPU is computing. CPU time is the time the CPU is computing, not including the time waiting for I/O or running other programs. The CPU time can be further divided into the CPU time spent in the program, called user CPU time, and the CPU time spent in the operating system, called system CPU time. In this report, we shall use the response time as the system performance measure and the user CPU time as the CPU performance measure. CPU time for a program can be expressed with the following formula: CPU time = CPU clock cycles for a program ∗ Clock cycle period (5) If we denote the number of instructions to execute a program by Instruction count (IC), then we can calculate the average number of clock cycles per instruction (CPI): CPI = CPU clock cycles for a program IC (6) Expanding equation (5), we get: CPU time = IC ∗ CPI ∗ Clock cycle time (7) CPU time can be divided into the clock cycles the CPU spends executing the program and the clock cycles the CPU spends waiting for the memory system (when we say that “the CPU is stalled”): CPU time = IC ∗ CPIExecution + Memory − stall cycles Instruction ∗ Clock cycle time (8) Example What effect do the following performance enhancements have on throughput and response time? 1. Faster clock cycle time; 2. Parallel processing of a job; 3. Multiple processors for separate jobs. Answer Since decreasing response time usually increases throughput, both 1 and 2 improve response time and throughput. In 3, no one job gets work done faster, so only throughput increases.
  • 18.
    12 Memory SynthesisUsing AI Methods 4 COMPUTER ARCHITECTURE OVERVIEW 4.1 An Architecure Classification Designing a computer system is a task having many aspects, including instruction set de- sign, functional organization, logic design, and implementation. The design of the machine should be optimized across these levels. The term instruction set architecture refers to the actual programmer-visible instruction set. The instruction set architecture serves as the boundary between the hardware and the software. The term organization includes the high-level aspects of a computer’s design, such as the memory system, the bus structure, and the internal CPU design. Hardware is used to refer to the specifics of a machine, such as the detailed logic design and the packaging technology of the machine. Hardware is actually the main aspect of implementation, which encompasses integrated circuit design, packaging, power, and cooling. We shall use the term architecture to cover all three aspects of computer design. According to the number of processors included, computer systems fall into two broad categories: uniprocessor computers —which contain only one CPU—, and multiprocessor computers —which contain more CPU’s. The CPU can be partitioned into into three functional units, the instruction unit (also called the control unit) (I-unit), the execution unit (E-unit), and the storage unit (S-unit); • the I-unit is responsible for instruction fetch and decode and generates the control signals; it may contain some local buffers for instruction prefetch (lookahead); • the E-unit executes the instructions and contains the logic for arithmetic and logical operations (the data path blocks); • the S-unit provides the memory interface between the I-unit and the E-unit; it provides memory management (protection, virtual address translation) and may contain some additional components whose goal is to reduce the access time to data. A finer classification of computer architectures may be made taking into account not only the number of CPU’s but also the instruction and data flow. Under this approach, there are three architectures ([7]): • Single Instruction Stream Single Data Stream (SISD) architecture, in which a single instruction stream is executed on a single stream of operands; • Single Instruction Stream Multiple Data Stream (SIMD) architecture, in which a single instruction stream is executed on several streams of operands; • Multiple Instruction Stream Multiple Data Stream (MIMD) architecture, in which different instruction streams are executed on different operand streams in parallel. In this report we shall use as synonym for the MIMD architecture the term multiprocessor. Multiprocesing is a way to increase system computing power. A distinctive feature of an
  • 19.
    Memory Synthesis UsingAI Methods 13 Processor 1 Processor 2 ... Processor n Interconn Network Memory Memory ... Memory I/O I/O Figure 3: Shared-memory system: all memory and I/O are remote and shared MIMD architecture is that multiple instruction streams must communicate or synchronize by passing message or sharing memory. 4.2 Multiprocessors Among multiprocessors, we can distinguish two broad classes, according to the logical architecture of their memory systems [7]: • shared memory systems (also called tightly coupled systems): all processors access a single address space (Figure 3). • distributed systems (also called loosely coupled systems or multicomputers) : each processor can access only its own memory; each processor’s memory is logically disjoint from other processor’s memory (Figure 4). In order to communicate, the processors are sending messages to each other. A memory unit is called a memoru module. Each processor has registers, arithmetic and logic units, and access to memory and input/output modules. The distinction between the architectures stems from way memory and input/output units are accessed by the processor: • in the shared memory model, memory and I/O systems are separate subsystems shared among all of the processors; • in the distributed memory model, memory and I/O units are attached to individual processors; no sharing of memory and input/output is permitted.
  • 20.
    14 Memory SynthesisUsing AI Methods Processor 1 Processor 2 ... Processor n Memory Memory Memory I/O I/O I/O Interconn Network Figure 4: Distributed system: all memory and I/O are local and private In the shared memory model the address space of all processors is the same and is dis- tributed among the memory modules, while in the distributed memory model each pro- cessor has its own address space mapped to the local memory. In both cases, the systems contains multiple processors each capable of executing an independent program, therefore the system fits the MIMD model. Actually, these two architectures represent the extremes in the design space, and practical designs may lie at the extremes or anywhere in between. Therefore, multiprocessors can have any reasonable combination of shared global memory and private local memory. The goal of multiprocessing may be either to maximize throughput of many jobs (these are called throughput-oriented multiprocessors) or to speed up the execution of a single job (these are called speed-up - oriented multiprocessors). In the first type of systems, jobs are distinct from each other and execute as if they were running on different unipro- cessors. In the second type an application is partitioned into a set of cooperating precesses and these processes interact while executing concurrently on different processors. The partitioning of a job into cooperating processes is called mutlithreading or parallelization. The shared memory model provides a convenient means for information interchange and synchronization since any pair of processors can communicate through a shared location. Shared memory systems present to the programmer a single address space, enhancing the programmability of a parallel machine by reducing the problems of data partition- ing and dynamic load distribution. The shared address space also improves support for automatically parallelizing compilers, standard operating systems, multiprogramming.
  • 21.
    Memory Synthesis UsingAI Methods 15 The distributed system (i.e., local and private memory and I/O) supports communication through point-to-point exchange of information, usually by message passing. Depending on the structure of the interconnection network, there are two types of shared- memory architectures: • bus-based memory systems: the memory and all processors (with optional private caches) are connected to a common bus. In other words, communication on a bus is of broadcast type: any memory access made by one CPU can be “seen” by all CPU’s. • general interconnection networks: they provide several simultaneous connections between pairs of nodes, that is, only two nodes are involved in any connection: the sender and the receiver. These interconnection networks adhere to the point-to-point communication model and may be direct or multistage networks. 4.3 Multiprocessing performance The main purpose of a multiprocessor is either to increase the throughput or to decrease the execution time, and this is done by using several machines concurrently instead of a single copy of the same machine. In some applications, the main purpose for using multiple processors is for reliability rather than high performance; the idea is that if any single processor fails, its workload can be performed by other processors in the system (fault-tolerant computing). The design principles for fault-tolerant computers are quite different from the principles that guide the design of high-performance systems. We shall focus our attention on performance. As mentioned in Section 3.2, the amount of overall Speedup is dependent on the fraction of time that a given enhancement is actually used. In the case of improving performance by using multiprocessing instead of uniprocessing, the overall efficiency is maximum when all processors are engaged in useful work, no processor is idle, and no processor is executing an instruction that would not be executed if the same algorithm were executing on a single processor. This is the state of peak performance, when all N processors of a multiprocessor are contributing to effective performance, and in this case the Speedup is equal to N. Peak performance is rarely achievable because there are several factors ([7]) that introduce inefficiency, such as: • the delays introduced by interprocessor communications; • the overhead in synchronizing the work of one processor with another; • lost efficiency when one or more processors run out of tasks; • lost efficiency due to wasted effort by one or more processors; • the processing costs for controlling the system and scheduling operations.
  • 22.
    16 Memory SynthesisUsing AI Methods Even though both scheduling and synchronization are sources of overhead on uniproces- sors, we cite them here because they degrade multiprocessoor performance beyond the effects that may already be present on individual processors. The sources of inefficinecy must be carefully examined because the increase in performance of multiprocessing compared to serial processing may be compromised. For example, if the combined ineficiencies produce an effective processing rate of only 10 percent of the peak rate, then ten processors are required in a multiprocessor system just to do the work of a single processor. The inefficiency tends to grow as the number of processors increases. For a small number of processors (tens), careful design can hold the inefficiency to a low figure. Moreover, the complexity of programming a machine with many (hundreds of) processors far exceeds the complexity of programming a single processor or a computer with a few processors. Therefore, the higher performance benefit of parallelism should be compared with the increase in cost and complexity to find out the degree of parallelism that can be used effectively. 4.4 Interprocess communication and synchronization In MIMD computers parallel programs are executing. A parallel program is a set of con- currently executing sequential processes. These processes cooperate and/or compete while executing, either by explicitly exchanging information or by sharing variables. To enforce correct sequencing of processes and data consistency, some methods of communication and synchronization of processes must be used. The notions of communication and synchro- nization are tightly related. In general, communication refers to the exchange of data between different processes. Usually, one or several sender processes transmit data to one or several receiver processes. Interprocess communication is mostly the result of explicit directives in the program. For example, parameters passed to a coroutine and results returned by such a coroutine con- stitute interprocess communication. Synchronization is a special form of communication, in which the data are control infor- mation. Synchronization serves a dual purpose: enforcing the correct sequencing of processes (e.g., control of a producer process and consumer process, such that the consumer never reads stale data and the producer never overwrites data that have not yet been read by the consumer); ensuring data consistency through mutual exclusive access to certain shared writable data. (e.g., protect the data in a database such that concurrent write accesses to the same record in the database are not allowed). Communication and synchronization can be implemented in two ways: • through controlled sharing of data in memory; This method can be used in shared memory systems. Synchronization is achieved through a hierarchy of mechanisms:
  • 23.
    Memory Synthesis UsingAI Methods 17 1. hardware level synchronization primitives such as TEST&SET(lock), RESET(lock), FETCH&ADD(x,a), Empty/Full bit; 2. software-level synchronization mechanisms such as semaphores and barriers; • message passing; This method can be used both for shared memory and distributed systems. Synchronization mechanisms are used to provide mutual exclusive access to shared variable and to coordonate the execution of several processes: Mutual exclusive access Acces is mutually exclusive if no two processes access a shared variable simultaneously. A critical section is an instruction sequence that has mutually exclusive access to shared variables. Locks and semaphores can be used to guarantee mutual exclusive access. On a uniprocessor, mutual exclusion can be guaranteed by disabling interrupts. Conditional synchronization Conditional synchronization is a method of process coordination which ensures that a set of variables are in a specific state (condition) before any process requiring that condition can proceed. Mechanisms such as Empty/Full bit, Fetch&Add, and Barrier can be used to synchronize processes. 4.5 Coherence, Consistency, and Event Ordering Memory coherence is a system’s ability to execute memory operation corectly. We need a precise definition of correct execution. Censier and Feautrier define [1] a coherent memory system as follows: A memory scheme is coherent if the value returned on a LOAD opera- tion is always the value given by the latest STORE operation with the same address. This definition, while very concise and intuitive, is difficult to interpret and too ambiguous in the context of a multiprocessor, in which data accesses may be buffered and may not be atomic. An access by processor i on a variable X is called atomic if no other processor is allowed to access any copy of X while the access by processor i is in progress. A LOAD of a variable X s said to be performed at a point in time when issuing of a STORE from any processor to the same address cannot affect the value returned by the LOAD. A STORE on a variable X by processor i is said to be performed at a point in time when an issued LOAD from any processor to the same address cannot return a value of X preceding the STORE. Accesses are buffered if multiple accesses can be queued before reaching their destination, such as main memory or caches.
  • 24.
    18 Memory SynthesisUsing AI Methods Serial computers present a simple and intuitive model that adheres to the memory coher- ence as defined by Censier and Feautrier: a LOAD operation returns the last value written to a given memory location and a STORE operation binds the value that will be returned by subsequent LOADs until the next STORE to the same location. For multiprocessors, the memory system model is more complex, because the definitions of “last value written”, “subsequent LOADs”, and “next STORE” become unclear when there are multiple proces- sors reading and writing a location. Furthermore, the order in which shared memory operations are done by one process may be used by other processes to achieve implicit synchronization. For example, a process may set a flag variable to indicate that a data structure it was manipulating earlier is now in a consistent state. The behavior of the machine with respect to memory accesses is defined by the mem- ory consistency model. Consistency models place specific requirements on the order that shared memory accesses (events) from one process may be observed by other processes in the machine. The consistency model specifies what event orderings are legal when several processes are accessing a common set of locations. Two major classes of machine behavior with respect to memory consistency have been defined: sequential consistency and weak consistency models of behavior (Chapter 8). Because the only way that two concurent processes can affect each other’s execution is through sharing of writable data and sending of interrupts, it is the order of these events that really matters. The machine must enforce these models by proper ordering of storage accesses and execution of synchronization and communication primitives. Thus, the ordering of events in a multiprocessor is an important issue and it is related to memory consistency. Coherence problems may exist at various levels of a memory hierarchy. Inconsistencies, that is, contradictory information, can occur between adjacent levels or within the same level of a memory hierarchy. For example, in a shared memory multiprocessor with private caches (Section 5.12), caches and main memory may contain inconsistent copies of data, or multiple caches could possess different copies of the same memory word because one of the processes has modified its data. The former inconsistency may not affect the correct execution of the program, while the latter condition is shown to lead to an incorrect behavior. Multiprocessor caches must be provided with mechanisms that make them to behave correctly. We can conclude that synchronization, coherence, and ordering of events are closely related issues in the design of multiprocessors.
  • 25.
    Memory Synthesis UsingAI Methods 19 5 MEMORY HIERARCHY DESIGN 5.1 General Principles of Memory Hierarchy As programmers tend to ask more amount of faster mmemory, fortunately a rule of thumb applies. This rule, called “the 90/10 Rule”, states that: A program spends 90% of its execution time in only 10% of the code. This rule holds that all programs favor a portion of their address space at any instant of time. Thus it can be restated as the “principle of locality”. An implication of locality is that based on the program’s recent past, one can predict with reasonable accuracy what instructions and data a program will use in the near future. This locality of reference applies both to data and code accesses, but it is stronger for code accesses. The propriety of locality has two dimensions (i.e., there are two types of locality): • Temporal locality (locality in time) - If an item is referenced, it will tend to be referenced again soon. • Spatial locality (locality in space) - If an item is referenced, nearby items will tend to be referenced soon. Locality can be exploited to increase memory bandwidth and decrease the latency of memory access, which are both crucial to system performance. The principle of locality of reference says that data (near that) recently used is likely to be accessed again in the future. According to the Amdahl’s Law, favoring accesses to such data will improve performance. Thus, recently addressed items should be kept in the fastest memory. Because smaller memories are faster, smaller memories are used to hold the most recently accessed items close to the CPU, and successively larger (and slower) memories as we move further away from the CPU are used to hold less recently accessed items. This type of organization is called a memory hierarchy. A memory hierarchy is a natural reaction to locality and technology. The principle of locality and the guideline that smaller hardware is faster yield the concept of a hierarchy based on different speeds and sizes. Since slower memory is cheaper, a memory hierarchy is organized into several levels - each smaller, faster, and more expensive per byte than the level below. The levels of the hierarchy subset one another: all data in one level is also found in the level below, and all data in that lower level is also found in the one below it, and so on until we reach the bottom of the hierarchy. Taking advantage of the principle of locality can improve performance; the address map- ping from a larger memory to a smaller but faster memory is intended to provide access to all levels of the memory hierarchy by allowing the processor to look for data in levels with decreasing speed (that is, first in the fastest level). A memory hierarchy normally consists of many levels, but it is managed between two adjacent levels at a time. The upper level —the one closer to the processor— is smaller and faster than the lower level (Figure 5).
[Figure 5: The Levels of a Typical Memory Hierarchy: CPU registers, cache, memory bus, main memory, I/O bus, I/O devices.]

A cache is a small, fast memory located close to the CPU that holds the most recently accessed code or data. The cache represents the level of the memory hierarchy between the CPU and main memory.

The minimum unit of information that can be either present or not present in the two-level hierarchy is called a block. The size of the block may be either fixed or variable. If it is fixed, the memory size is a multiple of that block size. Sometimes the term line is used instead of block to refer to the unit of information that can be either present or absent in a cache.

Success or failure of an access to the upper level is designated as a hit or a miss: a hit is a memory access found in the upper level, while a miss means it is not found in that level. Hit rate, or hit ratio, is the fraction of memory accesses found in the upper level, and it is usually represented as a percentage. Miss rate, or miss ratio, is the fraction of memory accesses not found in the upper level and is equal to (1 − Hit rate). For example, when the CPU finds the needed data item in the cache we say that a cache hit occurs, and when the CPU does not find it we say that a cache miss occurs. Likewise, for a computer with virtual memory, where the address space is broken into fixed-size units called pages, a page may reside either in main memory or on disk. When the CPU references an item within a page that is present neither in the cache nor in main memory, a page fault occurs, and the entire page is moved from disk to main memory. The cache and main memory have the same relationship as main memory and disk. In this report we shall focus on the relationship between cache and main memory.

Since performance is the major reason for having a memory hierarchy, the speed of hits and misses is important. Hit time is the time to access the upper level of the memory hierarchy, which includes the time to determine whether the access is a hit or a miss. Miss penalty is the time to replace a block in the upper level with the corresponding block from
the lower level, plus the time to deliver this block to the requesting device (normally the CPU). The miss penalty is further divided into two components:

• access time or access latency - the time to access the first word of a block on a miss;
• transfer time - the additional time to transfer the remaining words in the block.

Access time is related to the latency of the lower-level memory, while transfer time is related to the bandwidth between the lower-level and upper-level memories. Defining the lower-level memory bandwidth, B, as the number of bytes transferred between the lower and upper level in a clock cycle, and denoting by L the number of bytes per block and by b the number of bytes per word, we can write:

    Miss penalty = Access latency + (L − b) / B        (9)

5.2 Performance Impact of Memory Hierarchy

The impact of the memory hierarchy on CPU performance depends on the relative weight of the number of Memory-stall clock cycles in the CPU time. Let us recall equation (8), Section 3.3, which shows the following aspects:

• the effect of memory stalls is to increase the total CPI;
• the lower the CPI_Execution, the more pronounced the impact of memory stalls on performance;
• because main memories have similar memory-access times, independent of the CPU, and the memory stall is measured in CPU cycles needed for a miss, a higher CPU clock rate leads to a larger miss penalty even if the main memories have the same speed.

The importance of the cache is thus greater for CPUs with low CPI and high clock rates. The number of memory-stall cycles per instruction may be expressed as:

    Memory-stall clock cycles / Instruction =
        (Reads / Instruction) * Read miss rate * Read miss penalty +
        (Writes / Instruction) * Write miss rate * Write miss penalty        (10)

We can use a simplified formula by combining the read and write misses together:

    Memory-stall clock cycles / Instruction =
        (Memory accesses / Instruction) * Miss rate * Miss penalty        (11)

Therefore, we obtain for the CPU time:
    CPU time = IC * (CPI_Execution + (Memory accesses / Instruction) * Miss rate * Miss penalty) * Clock cycle time        (12)

The Miss penalty, just as the value of Memory-stall clock cycles, is measured in CPU cycles; therefore, the same main memory will produce different miss penalties for different values of the Clock cycle time.

Another form of formula (12) may be obtained if we measure the number of misses per instruction instead of the number of misses per memory reference (i.e., instead of Miss rate):

    Misses / Instruction = (Memory accesses / Instruction) * Miss rate        (13)

The advantage of the Misses/Instruction measure over the Miss rate measure is that it is independent of the hardware implementation, which can artificially reduce the Miss rate (e.g., when a single instruction makes repeated references to a single byte). Even though the number of misses per instruction takes into account the real number of misses, independent of the hardware implementation, it has the drawback of being architecture dependent; that is, different architectures will have different values of this parameter. However, within a single computer family one can use the following CPU-time formula:

    CPU time = IC * (CPI_Execution + (Misses / Instruction) * Miss penalty) * Clock cycle time        (14)

When we evaluate the performance of the memory hierarchy, it is not enough to look only at the Miss rate. While this parameter is independent of the speed of the hardware, one should be aware that, as equation (12) shows, the effect of the Miss rate on CPU performance depends on the values of the other parameters, i.e., CPI_Execution, Memory accesses/Instruction, and Miss penalty (which, as explained, is influenced by the Clock cycle time). A better measure of memory-hierarchy performance is the average time to access memory:

    Average memory-access time = Hit time + Miss rate * Miss penalty        (15)

In equation (15) the Miss penalty is measured in nanoseconds; therefore, it is no longer dependent on the Clock cycle, as is the Miss penalty in equations (10), (11), (12), and (14). While minimizing the Average memory-access time is a reasonable goal, the final goal is to improve CPU performance, that is, to decrease the CPU execution time, and one must be aware that CPU performance is not linearly dependent on the Average memory-access time.
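To show how equations (12), (14), and (15) combine, the following small C program (an illustrative sketch; the numeric values are made up, not measurements from the report) computes the memory-stall contribution to CPI, the resulting CPU time per instruction, and the average memory-access time for a hypothetical cache.

#include <stdio.h>

int main(void)
{
    /* Hypothetical machine parameters (illustrative values only). */
    double cpi_execution     = 1.5;    /* CPI without memory stalls           */
    double accesses_per_inst = 1.3;    /* memory accesses per instruction     */
    double miss_rate         = 0.04;   /* fraction of accesses that miss      */
    double miss_penalty_cyc  = 10.0;   /* miss penalty in clock cycles        */
    double clock_cycle_ns    = 20.0;   /* clock cycle time in nanoseconds     */
    double hit_time_ns       = 20.0;   /* cache hit time in nanoseconds       */

    /* Equation (11): memory-stall cycles per instruction. */
    double stall_cycles = accesses_per_inst * miss_rate * miss_penalty_cyc;

    /* Equation (12): CPU time per instruction (IC = 1). */
    double cpu_time_ns = (cpi_execution + stall_cycles) * clock_cycle_ns;

    /* Equation (15): average memory-access time, with the miss penalty
     * expressed in nanoseconds rather than in cycles. */
    double amat_ns = hit_time_ns + miss_rate * miss_penalty_cyc * clock_cycle_ns;

    printf("memory-stall cycles per instruction: %.3f\n", stall_cycles);
    printf("CPU time per instruction:            %.1f ns\n", cpu_time_ns);
    printf("average memory-access time:          %.1f ns\n", amat_ns);
    return 0;
}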
Misses in a memory hierarchy mean that the computer must have a mechanism to transfer blocks between upper- and lower-level memory. If the block transfer takes tens of clock cycles, it is controlled by hardware; this is the case for cache misses. If the block transfer takes thousands of clock cycles, it is usually controlled by software; this is the case for page faults. For a cache miss, the processor normally waits for the memory transfer to complete. For a page fault it would be too wasteful to let the CPU sit idle; therefore, the CPU is interrupted and used for another process during the miss handling. Thus, avoiding a long miss penalty for page faults means that any memory access can result in a CPU interrupt. This also means the CPU must be able to recover any memory address that can cause such an interrupt, so that the system can know what to transfer to satisfy the miss. When the memory transfer is complete, the original process is restored, and the instruction that missed is retried.

The processor must also have some mechanism to determine whether or not information is in the top level of the memory hierarchy. This check happens on every memory access and affects hit time; maintaining acceptable performance requires the check to be implemented in hardware.

5.3 Aspects that Classify a Memory Hierarchy

The fundamental principles that drive all memory hierarchies allow us to use terms that transcend the levels we are talking about. The same principles allow us to pose four questions about any level of the hierarchy:

Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification or Block lookup)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)

The answers to these questions induce a classification of a level of the memory hierarchy.

The memory address is divided into pieces that access each part of the hierarchy. Based on this address, two pieces of information must be determined:

1. the number of the block to which the memory address corresponds;
2. the data item within the block that is addressed.

We shall consider a byte-addressable machine, in which the data item is a byte. For the sake of simplicity we shall assume that the number of bytes in a block is a power of 2. More specifically, we shall denote by m the number of bits in the memory address, and by j the number of bits that identify a byte in a block (therefore, the size of a block is 2^j bytes). With this notation, we can partition the memory address into two fields:

• the block-offset address identifies the byte in the block and is composed of bits 0 ... j − 1 of the memory address;
• the block-frame address identifies the block in that level of the hierarchy and is composed of bits j ... m − 1 of the memory address.

The Block-frame address is the higher-order piece of the memory address that identifies a block. The Block-offset address is the lower-order piece of the address, and it identifies an item within the line. An item is the quantum of information that can be accessed by the processor's instructions. It may be a byte or a word, but as mentioned we consider byte-addressable machines.
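The address partition just described can be expressed directly in code. The following C sketch (illustrative only; the field widths are arbitrary example values) extracts the block-offset and block-frame fields from a memory address, using the bit positions j and m defined above.

#include <stdio.h>
#include <stdint.h>

/* Example parameters: a 32-bit address (m = 32) and 16-byte blocks (j = 4). */
#define ADDR_BITS   32
#define OFFSET_BITS 4                      /* j: block size is 2^j = 16 bytes */

int main(void)
{
    uint32_t address = 0x12345678u;        /* arbitrary example address */

    uint32_t block_offset = address & ((1u << OFFSET_BITS) - 1);  /* bits 0 .. j-1 */
    uint32_t block_frame  = address >> OFFSET_BITS;               /* bits j .. m-1 */

    printf("address      = 0x%08x\n", address);
    printf("block offset = 0x%x\n", block_offset);
    printf("block frame  = 0x%07x\n", block_frame);
    return 0;
}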
5.4 Cache Organization

Cache is the name chosen to represent the level of the memory hierarchy between the CPU and main memory. Caches may also be used as an upper level for a disk or tape memory at the lower level, but we shall restrict our discussion to caches of main memory, called CPU caches. Therefore, the term cache is used in this report instead of CPU cache.

The organization of the cache determines the block placement and identification. As mentioned, data is stored in the cache in lines (also called blocks), which represent the minimum unit of information that can be present in the cache (and the minimum unit transferred between cache and main memory). In general, a number n of lines are grouped into a set. A set is a collection of elements forming an associative memory. Data in a set are content-addressable: every line of the set has an address Tag associated with its Data that identifies the data. Therefore, finding in which line of the cache a block resides is a two-step process:

• first, the block-frame address is mapped onto a set number, and the block can be placed anywhere within this set. If there are n blocks in a set, the cache placement is called n-way set associative.
• second, an associative search is performed within the set to find the line, by comparing the tags of the lines in the set with the tag of the memory address currently accessed.

The number of lines in a set, n, is also referred to as the set size (synonyms: degree of associativity, or associativity). The method used to map a block-frame address onto a set is called the set-selection algorithm. There are two basic set-selection algorithms (a small sketch of both follows this list):

• bit selection: the number of the selected set, denoted by Index, is the remainder of dividing the block-frame address by the number of sets in the cache:

    Index = (Block-frame address) modulo (Number of sets in cache)        (16)

Generally, the number of sets is chosen to be a power of 2, say 2^k. In this case, the index is computed simply by selecting the lower-order k bits of the Block-frame address.

• hashing: the block-frame address bits, that is, bits j ... m − 1 of the memory address, are grouped into k groups (where 2^k is the number of sets), and within each group an EXCLUSIVE OR is performed. The resulting k bits then designate a set.
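The two set-selection algorithms can be sketched in C as follows. This is an illustrative sketch, not the report's own notation: k is the number of index bits, and the XOR-folding loop is one simple way to realize the hashing variant described above.

#include <stdint.h>

#define INDEX_BITS 9                       /* k: the cache has 2^k = 512 sets */
#define NUM_SETS   (1u << INDEX_BITS)

/* Bit selection: take the block-frame address modulo the number of sets,
 * i.e., its lower-order k bits (equation (16)). */
uint32_t index_bit_selection(uint32_t block_frame)
{
    return block_frame & (NUM_SETS - 1);
}

/* Hashing: fold the block-frame address into k-bit groups and XOR them. */
uint32_t index_hashing(uint32_t block_frame)
{
    uint32_t index = 0;
    while (block_frame != 0) {
        index ^= block_frame & (NUM_SETS - 1);   /* XOR the next k-bit group */
        block_frame >>= INDEX_BITS;
    }
    return index;
}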
Bit selection is very simple and is the usual set-selection algorithm for cache memories. The address mapping scheme is such that a Block-frame address selects a set of lines, and only if n = 1 does a block-frame address identify a unique line.

A method is needed to identify to which memory address a given line in the cache corresponds. To do this we store with each line an address Tag. Because, with bit selection, all lines in set number Index correspond to memory addresses having the value Index in the lower-order k bits of the block-frame address, storing the Index is redundant. Thus, the address Tag consists of bits j + k ... m − 1 of the memory address. Therefore, two additional pieces of information are attached to each line of the cache:

• the Address Tag contains the address bits that identify the line;
• the Valid bit marks the information in the line as either valid or invalid.

One can view the cache as consisting of elements, each element having three fields: the Data field (that is, the cache block), the Tag field, and the Valid field. If the Valid bit is viewed as another part of the Tag, then a cache element may be considered as formed from two fields: Data and Tag. One Tag is required for each line.

If the total size of the cache and the line size are kept constant, then increasing associativity increases the number of blocks per set, thereby decreasing the number of sets, which means decreasing the number of Index bits and increasing the number of Address Tag bits, and therefore the cost. The degree of associativity also determines the number of comparators needed to check the Tag against a given memory address and the complexity of the multiplexer required to select a line from the matching set, and it increases the hit time. These aspects should be considered because the size of the Tag memory (the product of the size of the Tag field and the number of lines) affects the total cost of the cache. Physically, the Data and Tag fields may be stored in the same storage, or in a separate "data array" and "address array", respectively.

To summarize, the cache structure is described by three parameters:

• line size: the number of bytes stored in one line, and also the number of bytes transferred between the cache and main memory at one memory reference;
• associativity (synonym: set size): the number of lines in a set;
• number of sets in the cache.

There are two boundary conditions of the cache organization:

• if the associativity is equal to the total number of lines, then a line of main memory may be found in any line of the cache. This organization is called a fully associative cache. It provides maximum flexibility for data placement, but it incurs increased complexity;
• if n = 1, that is, one-way set associative, then there is only one line per set, so that a main memory line may be placed only into one line of the cache. This organization is known as direct mapping.
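The three fields of a cache element and the three structural parameters can be captured in a few C declarations. This is only a sketch with hypothetical names and example sizes; a real design would fix these values in hardware.

#include <stdint.h>

#define LINE_SIZE     16                 /* bytes per line (2^j, j = 4)        */
#define ASSOCIATIVITY 2                  /* n: lines per set                   */
#define NUM_SETS      512                /* 2^k, k = 9                         */

/* One cache element: Data field, Tag field, and Valid bit. */
struct cache_line {
    uint32_t tag;                        /* bits j+k .. m-1 of the address     */
    uint8_t  valid;                      /* 1 if the line holds valid data     */
    uint8_t  data[LINE_SIZE];            /* the cached block itself            */
};

/* A set groups ASSOCIATIVITY lines; the cache is an array of sets. */
struct cache_set {
    struct cache_line line[ASSOCIATIVITY];
};

struct cache {
    struct cache_set set[NUM_SETS];
};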
5.5 Line Placement and Identification

In this section we shall answer the first two questions of Section 5.3. Given a memory address, which we call the Target Address, we must find in which set of the cache it can reside and determine whether it is actually in the cache. First, if the memory system employs virtual memory, then the Virtual Address generated by the CPU is translated into a Real Address or Physical Address. Depending on whether the Tag field contains the Virtual or the Real Address, a cache is called a Virtual address cache or a Real address cache, respectively. In this report we shall consider Real Address caches.

A cache access begins with presenting the CPU-generated Virtual Address to the cache.

(1) First, the Virtual Address is translated into a Real Address. For this purpose, the virtual address is passed to the Translator (which is part of the S-unit) and to an associative memory called the Translation Lookaside Buffer (TLB), which holds ("caches") the most recent translations. A TLB is a small associative memory, each of its elements consisting of a pair (Virtual Address, Real Address). The TLB receives as input the Virtual Address, randomizes (hashes) it, and uses the hashed number to select a certain set. That set is then searched associatively for a match to the Virtual Address. If a match is found, the corresponding Real Address is passed along to the cache itself. If the TLB does not contain the required translation, then the cache must wait for the Real Address provided by the Translator.

(2) Second, using the Target address, the set to which the Target maps is found: employing the bit-selection method, the set Index is extracted from the Target address, that is, the set in which the block can be present is determined. The Data and Tag fields of the blocks in the selected set are accessed.

(3) Third, the set is searched associatively over its n elements to check whether there is an Address Tag matching the Tag portion of the Target. Because speed is of the essence, all possible Tags are searched in parallel. If a match is found (a hit), the Data field of the element containing the matching Tag is presented at the cache output. Otherwise, there is a miss.

(4) Because a line usually contains more than one word, the desired word is selected using the block-offset portion of the Target and presented to the processor.

The first three steps are called cache lookup. The cache lookup time can be reduced if steps (1) and (2) are done in parallel. This is possible if the number of bits in the Page-offset portion of the Virtual Address is greater than or equal to the sum of the number of Block-offset bits and the number of Index bits. The reason is that, if this condition holds, and taking into account that the Page-offset bits are not translated, the Index can be extracted directly from the Virtual Address.
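Steps (2) to (4) of the lookup can be sketched in C as follows. This is a simplified software model (hypothetical names, the address assumed already translated, and the parallel tag search of a real cache replaced by a loop), intended only to make the index/tag/offset flow explicit.

#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE     16                 /* 2^j bytes, j = 4        */
#define ASSOCIATIVITY 2
#define NUM_SETS      512                /* 2^k sets, k = 9         */
#define OFFSET_BITS   4
#define INDEX_BITS    9

struct cache_line { uint32_t tag; bool valid; uint8_t data[LINE_SIZE]; };
struct cache      { struct cache_line set[NUM_SETS][ASSOCIATIVITY]; };

/* Look up one (already translated) address; return true on a hit and
 * deliver the addressed byte through *out. */
bool cache_lookup(const struct cache *c, uint32_t addr, uint8_t *out)
{
    uint32_t offset = addr & (LINE_SIZE - 1);                 /* block offset */
    uint32_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1); /* step (2): set */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);     /* tag portion  */

    /* Step (3): associative search; hardware compares all tags in parallel. */
    for (int way = 0; way < ASSOCIATIVITY; way++) {
        const struct cache_line *l = &c->set[index][way];
        if (l->valid && l->tag == tag) {
            *out = l->data[offset];                           /* step (4): select word */
            return true;                                      /* hit  */
        }
    }
    return false;                                             /* miss */
}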
5.6 Line Replacement

When a cache miss occurs and a new line is brought into the cache from main memory, then, using bit selection, the set into which the line is to be placed is found. The lines of the target set may be either in the Valid state (i.e., they already contain a line) or in the Invalid state. There are two possibilities:

- if there is an Invalid line in the set, the newly brought line replaces an invalid line;
- if all lines of the set are valid, one of the lines containing valid information must be selected as the victim that will be replaced by the newly brought line.

A method to select a line for replacement, also called an allocation method, is therefore necessary when a new line is brought into the cache and all the lines in the target set are valid. For a direct-mapped cache, there is only one line in a set and there is no choice: that line must be replaced by the new line. With set-associative and fully associative organizations, there are several lines to choose from on a miss. There are three strategies for selecting which block to replace:

• First-in-first-out (FIFO) - The block that was brought in n unique accesses before (where n is the associativity) is discarded, independent of its reference pattern in the last n − 1 references. This method is simple, but it does not exploit temporal locality.

• Random - To spread allocation uniformly, candidate blocks are randomly selected. Usually a pseudorandomizing scheme is used for spreading data across a set of blocks.

• Least-recently used (LRU) - To reduce the chance of throwing out information that will be needed soon, accesses to blocks are recorded. The block replaced is the one that has been unused for the longest time. This makes use of a corollary of temporal locality: if recently used blocks are likely to be used again, then the best candidate for disposal is the least recently used.

Random replacement generally outperforms FIFO and is easier to implement in hardware. LRU outperforms Random but, as the number of blocks to keep track of increases, LRU becomes increasingly expensive. Frequently, for high associativity, LRU is only approximated.

LRU implementation. For a set size of two, only a hot/cold (toggle) bit is required. For a set size n, n ≥ 4, one creates an (n × n) upper triangular matrix, whose elements are denoted by R(i, j), with the diagonal and the elements below the diagonal equal to zero. When a line i, 1 ≤ i ≤ n, is referenced, row i of R is set to 1 and column i of R is set to 0. The LRU line is the one whose row is entirely 0 and whose column is entirely 1. This algorithm can easily be implemented in hardware and executes rapidly. The number of storage bits required by the matrix R is n(n − 1)/2; that is, the storage requirement increases with the square of the set size. For n ≥ 8 this may be unacceptable.
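The reference-matrix scheme just described can be sketched in C as follows (an illustrative software model with hypothetical names; hardware would update the whole row and column in a single cycle). On a reference to line i, row i is set and column i is cleared; the LRU line is the one whose row is all 0 and whose column is all 1.

#include <stdbool.h>

#define N 4                                 /* set size (associativity) */

/* R[i][j] for i < j are the stored bits; entries on and below the
 * diagonal are kept at 0 and never consulted. */
static bool R[N][N];

/* Record a reference to line i: set row i to 1, clear column i. */
void lru_touch(int i)
{
    for (int j = 0; j < N; j++) {
        if (j > i) R[i][j] = true;          /* row i (stored part)    */
        if (j < i) R[j][i] = false;         /* column i (stored part) */
    }
}

/* Return the least-recently-used line: row all 0 and column all 1. */
int lru_victim(void)
{
    for (int i = 0; i < N; i++) {
        bool is_lru = true;
        for (int j = 0; j < N; j++) {
            if (j > i && R[i][j])  is_lru = false;   /* row i must be all 0    */
            if (j < i && !R[j][i]) is_lru = false;   /* column i must be all 1 */
        }
        if (is_lru) return i;
    }
    return 0;                               /* not reached if updates are consistent */
}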
If this is considered too expensive, then an approximation to LRU is implemented in the following way:

1. the lines are grouped into p = n/2 pairs (i.e., one pair has two lines);
2. if p > 4, the pairs are repeatedly grouped into further pairs until the number of groups is equal to 4.

For example, if n = 8, the 8 lines of a set form one group of 4 pairs, and if n = 16, a set is made up of 4 groups of two pairs each. The LRU approximation is based on LRU management at the level of each group. All but the upper group contain 2 elements; therefore only a hot/cold LRU bit is required per group. The upper group contains 4 elements, and it uses 6 bits for LRU. The LRU approximation works as follows:

1. the LRU group is selected from the upper group;
2. the LRU of the selected group (which contains two elements) is repeatedly selected until a line is selected.

For example, if n = 8, first the LRU pair is selected from the 4 pairs, then the LRU line of that pair is selected. If n = 16, first the LRU group of two pairs is selected from the 4 groups, second the LRU pair is selected from that two-pair group, and third the LRU line of that pair is selected. This algorithm requires only 10 bits for n = 8, rather than the 28 bits needed for full LRU, and 18 bits for n = 16, as opposed to the 120 bits needed for full LRU.

FIFO implementation. FIFO is implemented by keeping a modulo-n counter (n is the associativity) for each set; the counter is incremented with each replacement and points to the next line to be replaced.

Random implementation. One Random implementation is to keep a single modulo-n counter, incremented in a variety of ways: on each clock cycle, each memory reference, or each replacement anywhere in the cache. Whenever a replacement is to occur, the value of the counter indicates the replaceable line within the set.

As is apparent from the implementation methods presented, Random provides the simplest implementation and LRU requires the most complex one. Because Random replacement generally outperforms FIFO, the choice is to be made only between LRU and Random.

5.7 Write Strategy

Reads are more frequent cache accesses than writes, because all instruction accesses are reads and not every instruction writes to memory. Making the common case fast (Amdahl's Law) means optimizing caches for reads, but high-performance designs cannot neglect the speed of writes.

The common case, the read, is made fast by reading the line at the same time that the tag is read and compared, so the line read begins as soon as the block-frame address is available. If the read is a hit, the block is passed on to the CPU immediately. If it is a miss, there is no benefit, but also no harm.

Write accesses pose several problems. First, the processor specifies the size of the write, and only that portion of a line can be changed. In general, this means a Read-Modify-Write (RMW) sequence of operations on the line: read the original line, modify one portion, and write the new block value. Moreover, modifying
a line cannot begin until the tag has been checked to see whether the access is a hit. Because tag checking cannot occur in parallel with the modification, writes normally take longer than reads.

There are two basic write policies:

• Write through (or store through) - The information is written to both the block in the cache and the block in main memory.
• Write back (also called copy back or store in) - The information is written only to the block in the cache. The modified cache block (also called a dirty block) is written to main memory only when it is replaced.

Another categorization of writes is made with respect to whether a line is fetched when a write miss occurs:

• Write allocate (also called fetch on write) - The line is loaded (this is similar to a read miss), then the write-hit actions are performed with either write through or write back.
• No write allocate (also called write around) - The line is modified in the lower-level memory and is not loaded into the cache.

When the CPU must wait for writes to complete during write through, or when a read miss requires a modified line to be replaced under the write-back strategy, the CPU is said to write stall. A common optimization to reduce write stalls is a write buffer. A write buffer allows the CPU to continue while the memory is updated, and it can be used with both write-back and write-through strategies:

• In a write-through cache, data must be written to the lower-level memory on both hits and misses. When a write buffer is used, the CPU has to wait only until the buffer is not full; the data and its address are then written into the buffer, and the CPU continues working while the write buffer writes the data to memory.
• In a write-back cache, the write buffer is used to hold the dirty block (i.e., the modified block) that must be replaced by a block brought from memory. After the new data is loaded into the line, the CPU continues execution while, in parallel, the buffer writes the dirty block to memory.

The problem with write buffers is that they complicate the handling of read misses, as discussed in Section 6.7.

For the write-back strategy it is necessary to keep track of whether a block in the cache has been modified but not yet written to main memory (i.e., whether the block is dirty). For this purpose, a feature called the dirty bit is commonly used. This is a status bit associated with each cache line that indicates whether or not the line was modified while in the cache. If it was not, the line is said to be clean and is not written back when replaced, since the lower level has the same information as the cache. The dirty bit is also useful for the memory-coherence protocol (Section 9.4).
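The interplay between the write policy, write allocation, and the dirty bit can be sketched in C. This is a simplified, illustrative model (hypothetical names, a direct-mapped cache, and memory represented by a plain array), not a description of any particular machine; it shows where a write-through design updates memory immediately and where a write-back design merely sets the dirty bit.

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define NUM_SETS  256
#define LINE_SIZE 16
#define MEM_SIZE  (1u << 20)

struct line { uint32_t tag; bool valid; bool dirty; uint8_t data[LINE_SIZE]; };

static struct line cache[NUM_SETS];          /* direct-mapped for simplicity */
static uint8_t     memory[MEM_SIZE];
static bool        write_back = true;        /* false selects write through  */

static uint32_t set_of(uint32_t a) { return (a / LINE_SIZE) % NUM_SETS; }
static uint32_t tag_of(uint32_t a) { return  a / LINE_SIZE / NUM_SETS;  }

void cache_write(uint32_t addr, uint8_t value)
{
    struct line *l = &cache[set_of(addr)];
    bool hit = l->valid && l->tag == tag_of(addr);

    if (hit) {
        l->data[addr % LINE_SIZE] = value;   /* update the cached copy        */
        if (write_back)
            l->dirty = true;                 /* write back: mark line dirty   */
        else
            memory[addr] = value;            /* write through: update memory  */
    } else if (write_back) {
        /* Write allocate: write back the victim if dirty, then fetch the line. */
        if (l->valid && l->dirty) {
            uint32_t victim = (l->tag * NUM_SETS + set_of(addr)) * LINE_SIZE;
            memcpy(&memory[victim], l->data, LINE_SIZE);
        }
        memcpy(l->data, &memory[addr - addr % LINE_SIZE], LINE_SIZE);
        l->tag = tag_of(addr);
        l->valid = true;
        l->data[addr % LINE_SIZE] = value;
        l->dirty = true;
    } else {
        memory[addr] = value;                /* no write allocate: write around */
    }
}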
Both write back and write through have their advantages. With write through, read misses do not result in writes to the lower level, as may happen with write back when a dirty line is replaced. Write through keeps the cache and main memory consistent; that is, main memory always has the current copy of the data. This is important for I/O and for multiprocessors, because it supports memory coherence (Section 5.13). On the other hand, with write back, write hits occur at the speed of the cache memory, and multiple writes within a line require only one write to main memory. Since not every write goes to memory, write back uses less memory bandwidth, which is an important aspect in multiprocessors.

Even though either fetch on write or write around could be used with write through or write back, generally write-back caches use fetch on write (hoping that subsequent writes to that block will be captured by the cache) and write-through caches often use write around (since subsequent writes to that block will still have to go to memory).

5.8 The Sources of Cache Misses

An intuitive model of cache behavior attributes all misses to one of three sources:

• Compulsory - The first access to a line is never in the cache, so the line must be brought into the cache. These are also called cold-start misses or first-reference misses.
• Capacity - If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.
• Conflict - If the block-placement strategy is set-associative or direct-mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. These are also called collision misses or interference misses.

Having identified the three sources of misses, what can a computer designer do about them? For a given line size, the compulsory misses are independent of the cache size. Compulsory misses may be reduced by increasing the line size, but this can increase conflict misses. There is little to be done about capacity misses, except to use a larger cache. When the cache is much smaller than what a program needs, and a significant percentage of the time is spent moving data between the two levels of the hierarchy (i.e., cache and main memory), the memory hierarchy is said to thrash. Thrashing means that, because so many replacements are required, the machine runs close to the speed of the lower-level memory, or perhaps even slower because of the miss overhead. Conflict misses could conceptually be eliminated: fully associative placement avoids all conflict misses. However, associativity is expensive in hardware and may slow the access time, leading to lower overall performance. Conflict misses may also be decreased by increasing the cache size.
This simple model of miss causes has its limits. For example, increasing the cache size reduces capacity misses as well as conflict misses, since a larger cache spreads out references. Thus, a miss might move from one category to another as parameters change.

5.9 Line Size Impact on Average Memory-access Time

Let us analyze the effect of the block size on the Average memory-access time (equation (15), Section 5.2) by examining the effect of the line size on the Miss rate and the Miss penalty. We assume that the size of the cache is constant.

Larger block sizes reduce compulsory misses, as the principle of spatial locality suggests. At the same time, larger block sizes increase conflict misses, because they reduce the number of blocks in the cache. Reasoning in terms of the two aspects of the principle of locality, we can say that increasing the line size lowers the miss rate until the reduced misses of larger blocks (spatial locality) are outweighed by the increased misses as the number of blocks shrinks (temporal locality), because larger block sizes mean fewer blocks in the cache.

Let us now examine the effect of the line size on the Miss penalty. The Miss penalty is the sum of the access latency and the transfer time. The access-latency portion of the miss penalty is not affected by the block size, but the transfer time does increase linearly with the block size. If the access latency is large, there will initially be little additional miss penalty relative to the access time as the block size increases. However, increasing the line size will eventually make the transfer time an important part of the miss penalty.

Since a memory hierarchy must reduce the Average memory-access time, we are interested not in the lowest Miss rate, but in the lowest Average memory-access time. This is related to the product of the Miss rate and the Miss penalty, according to equation (15), Section 5.2. Therefore, the "best" line size is not the one that minimizes the Miss rate, but the one that minimizes the product of the Miss rate and the Miss penalty (the sketch below illustrates this trade-off). Measurements on different cache organizations and computer architectures have indicated that the lowest average memory-access time is obtained for line sizes ranging from 8 to 64 bytes ([6], [20]). Of course, overall CPU performance is the ultimate performance test, so care must be taken when reducing the Average memory-access time to be sure that changes to the Clock cycle time and the CPI improve overall performance as well as the average memory-access time.
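The trade-off between miss rate and miss penalty as the line size grows can be illustrated numerically. The following C sketch sweeps a few line sizes using equation (15), with Miss penalty = Access latency + (L − b)/B from equation (9); the miss-rate values are made-up illustrative numbers (not measurements from the report), chosen only to show the characteristic minimum at an intermediate line size.

#include <stdio.h>

int main(void)
{
    /* Hypothetical parameters. */
    double hit_time  = 1.0;        /* cycles                           */
    double latency   = 10.0;       /* access latency, cycles           */
    double bandwidth = 4.0;        /* bytes transferred per cycle (B)  */
    double word      = 4.0;        /* bytes per word (b)               */

    /* Illustrative miss rates for line sizes 8..128 bytes: they fall with
     * spatial locality at first, then rise as the number of lines shrinks. */
    double line_size[] = { 8, 16, 32, 64, 128 };
    double miss_rate[] = { 0.070, 0.050, 0.040, 0.038, 0.045 };

    for (int i = 0; i < 5; i++) {
        double penalty = latency + (line_size[i] - word) / bandwidth;  /* eq. (9)  */
        double amat    = hit_time + miss_rate[i] * penalty;            /* eq. (15) */
        printf("line %3.0f B: miss rate %.3f, penalty %5.1f, AMAT %.2f cycles\n",
               line_size[i], miss_rate[i], penalty, amat);
    }
    return 0;
}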
5.10 Operating System and Task Switch Impact on Miss Rate

When the Miss rate of a user program is analyzed (for example, by using a trace-driven simulation), one should take into account that the real miss rate of the running program, including the operating-system code invoked by the program, is higher. The miss rate can be broken into three components:

• the miss rate caused by the user program;
• the miss rate caused by the operating-system code;
• the miss rate caused by the conflicts between the user code and the system code.

In fact, the operating system has an even greater impact on the actual miss rate: due to task switching, the miss rate of a program increases. Using the model of miss sources from Section 5.8, the rationale is that task switching increases the compulsory misses.

A possible solution to the miss rate due to task switching is to use a cache that has been split into two parts, one of which is used only by the supervisor and the other of which is used primarily by user-state programs; this organization is called a User/Supervisor Cache. If the scheduler were programmed to restart, when possible, the same user program that was running before an interrupt, then the user-state miss rate would drop appreciably. Further, if the same interrupts recur frequently, the supervisor-state miss rate may also drop. The supervisor cache may have a high miss rate due to its large working set. However, if the total cache size is split evenly between the user and supervisor caches, then the miss rate in the supervisor state is likely to be worse than with a unified cache, since the maximum capacity is no longer available to the supervisor. Moreover, the information used by the user and by the supervisor is not entirely distinct, and cross-access must be permitted. This introduces the coherence problem (Section 5.13) between the user and supervisor caches.

5.11 An Example Cache

We shall consider the organization of the VAX-11/780 cache as an example. The cache contains 8 KB of data, is two-way set-associative with 8-byte blocks, uses random replacement, write through with a one-word write buffer, and no write allocate on a write miss. Figure 6 shows the organization of this cache. A cache hit is traced through the steps labeled in Figure 6, the five steps being shown as circled numbers.

The address coming into the cache is divided into two fields: the 29-bit block-frame address and the 3-bit block offset. The block-frame address is further divided into an address tag and a set index. Step 1 shows this division. The set index selects the set to be tested to see whether the block is in the cache. A set is one block from each bank. The size of the index depends on the cache size, the block size, and the set associativity. In this case, a 9-bit index results:

    Blocks/Bank = Cache size / (Block size * Associativity) = 8192 / (8 * 2) = 512 = 2^9

The index is sent to both banks (because of the two-way set-associative organization) and the address tags are read (step 2). After an address tag has been read from each bank, the tag portion of the block-frame address is compared with both tags. This is step 3 in the figure. To be sure the tag contains valid information, the valid bit must be set; otherwise, the result of the comparison is ignored.

Assuming one of the tags does match, a 2:1 multiplexer (step 4) is set to select the block from the matching set. It is not possible for both tags to match, because the replacement algorithm makes sure that an address appears in only one block. To reduce the hit time, the data is read at the same time as the address tags; thus, by the time the block multiplexer is ready, the data is also ready. This step is needed in set-associative caches, but it can be omitted from direct-mapped caches since there is no selection to be made.
[Figure 6: Organization of a 2-way set-associative cache. The CPU address is split into Tag<20>, Index<9>, and Offset<3> fields; each of the two banks holds Valid<1>, Tag<20>, and Data<64> per line, with tag comparators, a 2:1 multiplexer, and a write buffer in front of memory; the circled numbers 1-5 mark the steps of a hit.]
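The field widths used in this example can be rederived with a few lines of C. This illustrative sketch recomputes the number of blocks per bank, the index width, and the tag width for the parameters given above (8-KB cache, 8-byte blocks, two-way set-associative, 32-bit addresses assumed).

#include <stdio.h>

/* Integer log2 for exact powers of two. */
static int log2i(unsigned x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

int main(void)
{
    unsigned cache_size    = 8192;   /* bytes                 */
    unsigned block_size    = 8;      /* bytes                 */
    unsigned associativity = 2;
    unsigned addr_bits     = 32;     /* assumed address width */

    unsigned blocks_per_bank = cache_size / (block_size * associativity); /* 512 */
    unsigned offset_bits     = log2i(block_size);                         /* 3   */
    unsigned index_bits      = log2i(blocks_per_bank);                    /* 9   */
    unsigned tag_bits        = addr_bits - index_bits - offset_bits;      /* 20  */

    printf("blocks per bank: %u\n", blocks_per_bank);
    printf("offset bits: %u, index bits: %u, tag bits: %u\n",
           offset_bits, index_bits, tag_bits);
    return 0;
}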
The multiplexer used in step 4 is on the critical timing path and affects the hit time. In step 5, the word is sent to the CPU. All five steps occur within a single CPU cycle.

On a miss, the cache sends a stall signal to the CPU telling it to wait, and two words (eight bytes) are read from memory. That takes 6 clock cycles on the VAX-11/780, ignoring bus interference. When the data arrives, the cache must pick a block to replace, and one block is selected at random. Replacing a block means updating the data, the address tag, and the valid bit. Once this is done, the cache goes through a regular read-hit cycle and returns the data to the CPU.

Writes involve additional steps. When the word to be written is in the cache, the first four steps are the same. The next step is to write the data into the block, that is, to write the changed-data portion into the cache. Because no write allocate is used, on a write miss the CPU writes "around" the cache to main memory and does not affect the cache. Because write through is used, the word is also sent to a one-word write buffer. If the write buffer is empty, the word and its address are written into the buffer and the cycle is finished; the CPU continues working while the write buffer writes the word to memory. If the buffer is full, the cache and the CPU must wait until the buffer is empty.

5.12 Multiprocessor Caches

We have seen that increasing memory bandwidth and decreasing the access latency have a great impact on system performance. For shared-memory multiprocessors, bandwidth should be analyzed in a special context: several processors may try to access a level of the memory hierarchy simultaneously. This gives rise to the contention problem, that is, conflicts between accesses from different processors. The access latency is related to the existence of a gap between processor and memory speeds; this aspect is present in both uniprocessor and multiprocessor architectures. Because of this gap, the memory access time introduces memory stalls into the CPU time. When the memory cannot keep up with the processor's speed, it becomes a bottleneck.

A common approach to solving both the access-latency and the contention problems that occur in shared-memory multiprocessors is to use cache memories. Caches moderate a multiprocessor's memory traffic by holding copies of recently used data, and provide a low-latency access path to the processor. Caches may be attached to each CPU (private caches) or to the shared memory (shared cache). Private caches alleviate the contention problem: each processor has a high-speed cache connected to it that maintains a local copy of a memory block and is able to supply instructions and operands at the rate required by that processor. Because of locality in the memory access patterns of multiprocessors, the cache satisfies a large fraction of the processor accesses, thereby reducing both the average memory latency and the communication bandwidth requirements imposed on the system's interconnection network. The architecture of a shared-memory system with private caches is shown in Figure 7.

The key to using interconnection networks in multiprocessors is to send data over the network rather rarely. This is because reduced network traffic tends to reduce contention,
and, as the use of the network per processor diminishes, the number of processors that can be served increases. A cache memory provides an effective means of maintaining local copies of data and reduces the need to traverse a network for remote data. For example, if a cache misses only 10 percent of the time, and remote fetches occur only on misses, then the number of processors supportable on the interconnection network is ten times greater than for a cacheless processor. The smaller the miss ratio, the greater the number of supported processors.

[Figure 7: Shared-memory system with private caches. Processors 1 ... n, each with a private cache, are connected through an interconnection network to memory modules 1 ... m.]

Unfortunately, private caches give rise to the cache-coherence problem: multiple copies of data may exist in different private caches. This is the coherence problem among private caches: multiple copies of the same memory word must be kept consistent across different caches in the presence of sharing of writable data and of process migration from processor to processor. Cache-coherence schemes must be employed to maintain a uniform state for each cached block of data: a store to a data word present in a different cache must be reflected in all other caches containing the word, either in the form of an invalidation or of an update.

5.13 The Cache-Coherence Problem

Because of caches, data can be found either in memory or in the cache. As long as there is only one CPU and it is the sole device changing or reading the data, there is little danger of the CPU seeing an old or stale copy.
However, because of input/output and of the existence of several private caches in multiprocessors, the opportunity exists for other devices to cause copies to become inconsistent (i.e., to hold different values of the same data item) or for other devices to read stale copies. This problem of preventing any device from accessing stale data is referred to as the cache-coherence problem. The coherence requirement is for a processor to have exclusive access to an object when writing it and to obtain the most recent copy when reading it. This problem applies to I/O as well as to shared-memory multiprocessors. However, unlike I/O, where multiple data copies are a rare event and can be avoided as shown in the next subsection, a process running on multiple processors will want to have copies of the same data in several caches. The performance of a multiprocessor program depends on the performance of the system when sharing data.

5.13.1 Cache Coherence for I/O

Let A and B be two data items in main memory, and A′ and B′ their cached copies. Let us assume an initially coherent state, say:

    A = A′ = 100 and B = B′ = 200

Inconsistency can occur in two cases. In the first case, if the write strategy is write back and the CPU writes, say, the value 133 into A′, then A′ will have the updated value, but the value in memory is the old, stale value of 100. If an output to I/O is issued, it uses the value of A from memory and therefore gets the stale data:

    A′ = 133 and A = 100;  A ≠ A′  (A stale)

In the second case, if the I/O system inputs, say, the value 331 into the memory copy of B, then B′ in the cache will hold the old, stale data:

    B′ = 200 and B = 331;  B′ ≠ B  (B′ stale)

In both cases the memory-coherence condition as defined by Censier and Feautrier (Section 4.5) is not met; i.e., the value returned on a READ or INPUT instruction is not the value written by the latest WRITE or OUTPUT instruction with the same address.

An architectural solution to the cache-coherence problem caused by I/O is to make I/O occur between the I/O device and the cache, instead of main memory. If input puts data into the cache and output reads data from the cache, both I/O and the CPU see the same data, and there is no problem. The difficulty with this approach is that it interferes with the CPU: I/O competing with the CPU for cache access will cause the CPU to wait for the I/O. Moreover, when the I/O device inputs data, it brings into the cache new information that is unlikely to be accessed by the CPU soon, and in doing so it replaces information that may be needed soon by the CPU. For example, on a page fault, the I/O inputs a whole page, while the CPU may need to access only a portion of that page.
The problem for the I/O system is to prevent stale data while interfering with the CPU as little as possible. Many systems therefore prefer that I/O occur directly to main memory, with main memory acting as an I/O buffer. If a write-through cache is used, then memory has an up-to-date copy of the information, and there is no stale-data issue for output to I/O. This is one reason many machines use write through. Input from I/O requires some overhead in order to prevent the I/O from writing data to a memory location that is cached.

The software solution is to guarantee that no blocks of the I/O buffer designated for input from I/O are in the cache. This can be done in two ways. In one approach, a buffer page is marked as noncacheable, and the operating system always inputs to such a page. In the other approach, the operating system flushes, i.e., invalidates, the buffer addresses from the cache after the input occurs. The hardware solution is to check the I/O addresses on input to see whether they are in the cache, using for example a snooping protocol (Section 9.4), and to invalidate the cache lines whose addresses match I/O addresses. All these approaches can also be used for I/O output with write-back caches.

5.13.2 Cache-Coherence for Shared-Memory Multiprocessors

Caches in a multiprocessor must operate consistently, or coherently; that is, they must obey the memory-coherence condition for all copies of any data item. The coherence problem is related to two types of events: sharing of writable data among several processors, and program migration between processors. In both cases, access to a stale copy of data must be prevented.

The first type of coherence problem occurs when two or more processors try to update a datum simultaneously. The datum must then be treated in a special way so that its value can be updated successfully regardless of the instantaneous location of its most recent version. To illustrate this, let us examine two examples. When a processor, call it P1, updates the variable, the current value of the shared variable moves from memory to P1. While P1 holds this value and updates it, another processor, call it P2, accesses shared memory. But the current value of the variable is no longer in shared memory, because it has moved to P1. However, P2's request is not redirected, and it erroneously goes to the normal place for storing the shared variable. This example assumes that P1 updates the shared variable and immediately returns it to memory; but in a cache-based system, P1 may hold the variable in its cache indefinitely, so that the failure exhibited in the example becomes much more likely. The failure interval is not limited to a very brief update period: it can happen for any access to the variable in shared memory while that variable is held in P1's cache.

There is a second failure mode for shared writable data that has to be considered, too. If P2 copies a shared variable to its cache and updates that variable both in its cache and in shared memory, then problems can arise if the values in the cache and in shared memory do not track each other identically. Suppose, for example, that after P2 has updated the variable both in its cache and in shared memory, processor P1 requests the value of the variable. If P1 already has a copy of the variable in its cache, it altogether ignores the change in the variable resulting from the update performed by P2.
Thus, processor P1 accesses a stale copy of the data held in cache, instead of accessing the fresh data held in shared
memory.

With respect to the second type of failure, the one associated with program migration, let us suppose that processor P1 is running a program that leaves in its cache the value 0 for a variable X. The program then shifts to a different processor, P2, and writes a new value of 1 for X into the cache of that processor. Finally, the program shifts back to processor P1 and attempts to read the current value of X. It obtains the old, stale value 0 when it should have obtained the new, fresh value 1. Note that X does not have to be a shared variable for this type of error to occur. The cause of this mode of failure is ([7]) that the program's footprint (that is, the data associated with the program) was not flushed completely from the cache when the program moved from P1 to P2, so that when it came back to P1 it found stale data there.

The protocols that maintain cache coherence for multiple processors are called cache-coherence protocols. This subject has been studied by many authors, among them Censier and Feautrier ([1]), Dubois and Briggs ([3], [4]), Archibald and Baer ([5]), Agarwal ([10]), and Lenoski, Laudon, Gharachorloo, Gupta, and Hennessy ([17]), who have explored a variety of cache-coherence protocols and examined their performance impact. Chapter 9 covers issues related to cache coherence. The implementation of caches in multiprocessors may enforce coherence entirely in hardware, or may enforce coherence only at explicit synchronization points.

5.14 Cache Flushing

When a processor invalidates data in its cache, this is called flushing or purging. Sometimes (Sections 5.13.1 and 9.6) it is necessary to invalidate the contents of several lines in the cache, i.e., to set the invalid bit for several lines. If this were done one line at a time, the required time would become excessive. Therefore, an INVALIDATE instruction should be available in the processor if a coherence scheme based on flushing is used. If one chooses to flush the entire cache, then resettable static random-access memories can be used for the Valid bits, allowing the INVALIDATE to be accomplished in one or two clocks.
6 IMPROVING CACHE PERFORMANCE

6.1 Cache Organization and CPU Performance

The goal of the cache-memory designer is to improve performance by decreasing the CPU execution time. As equations (14) and (15) (Section 5.2) show, the CPU time is not linearly dependent on the Average memory-access time, but it depends on the two components of the Average memory-access time:

• the Hit time, which must be small enough not to affect the CPU clock rate or CPI_Execution;
• the product Miss rate * Miss penalty, which determines the number of memory-stall clock cycles and therefore increases the CPI.

After making some easy decisions at the beginning, the architect faces a threefold dilemma when attempting to further reduce the average access time by changing the cache organization or size:

• increasing the line size does not improve the average access time, because the lower miss rate does not offset the higher miss penalty;
• making the cache bigger would make it slower, jeopardizing the CPU clock rate;
• making the cache more associative would also make it slower, again jeopardizing the CPU clock rate.

Example. This example shows that a two-way set-associative cache may decrease the average memory-access time compared with a direct-mapped cache of the same capacity, yet this does not mean better performance, because the CPU time is larger for the two-way set-associative organization. We assume that the clock cycle time is 20 ns, the average CPI is 1.5, and there are 1.3 memory references per instruction. The cache size is assumed to be 64 KB, the miss rate of the direct-mapped cache is 3.9%, and the miss rate of the two-way set-associative cache is 3.0%. The hit time of the two-way set-associative cache is larger, and this causes an 8.5% increase in the clock cycle time. The Miss penalty is taken to be 200 ns for either cache organization.

Let us first compute the average memory-access time for the two cache organizations using equation (15), Section 5.2:

    Average memory-access time(1-way) = 20 + 0.039 * 200 = 27.8 ns        (17)

    Average memory-access time(2-way) = 20 * 1.085 + 0.030 * 200 = 27.7 ns        (18)

Let us also compute the performance of each organization, as given by equation (12), Section 5.2. We substitute 200 ns for (Miss penalty * Clock cycle time) for either cache organization, even though in practice it should be rounded to an integer number of clock cycles. Because the clock cycle time corresponding to the two-way set-associative cache is 20 * 1.085 ns, we obtain:

    CPU time(1-way) = IC * (1.5 * 20 + 1.3 * 0.039 * 200) = 40.1 * IC        (19)
    CPU time(2-way) = IC * (1.5 * 20 * 1.085 + 1.3 * 0.030 * 200) = 40.4 * IC        (20)

The result shows that even though the direct-mapped cache has a greater miss rate and a greater average access time than the two-way set-associative cache, it leads to slightly better performance.

There are other methods to improve the Hit time, the Miss rate, and the Miss penalty. The following sections of this chapter present the most important performance-improvement methods.

6.2 Reducing Read Hit Time

As mentioned in Section 5.5, the read hit time can be reduced if the cache lookup performs the virtual-to-real address translation through the TLB in parallel with the set selection. However, this limits the size of the cache. Let p be the number of bits in the memory address that represent the page offset, j the number of bits of the byte offset within a line, and k the number of set-index bits (i.e., there are 2^k sets). For the TLB lookup to be done in parallel with the set selection, the following condition must be met:

    j + k ≤ p        (21)

This limits the cache size, C, to the value:

    C = n * 2^(j+k) ≤ n * 2^p        (22)

where n is the degree of associativity. For a direct-mapped cache the limitation is that its size can be no bigger than the page size. Increasing the associativity is a solution, but increasing the associativity slows the cache.

One scheme for fast cache hits without this size restriction is to use a more deeply pipelined memory access in which the TLB is one stage of the pipeline. The TLB can easily be pipelined because it is a distinct unit that is smaller than the cache. Pipelining the TLB does not change the memory latency, but it achieves higher memory bandwidth based on the efficiency of the CPU pipeline.

An alternative is to eliminate the TLB and its associated translation time from the cache-access path by storing virtual addresses in the Tag memory. Such caches are called virtual address caches, or virtual caches. There are three major problems with virtual caches that, in our opinion, make them a poor choice for multiprocessors. The first is that every time a process is switched, the virtual addresses refer to different physical addresses, requiring the cache to be flushed (or purged). But purging the cache causes an increase in the miss rate. A solution to this problem is to extend the width of the address Tag with a process-identifier tag (PID), to have the operating system assign PIDs to processes, and to flush the cache only when a PID is reused. Another problem is that user programs and the operating system may use two different virtual addresses for the same physical address; that is, a data item may have different virtual addresses, called synonyms or aliases. The effect of synonyms in a virtual cache is that two (or more) copies of the same data may be present in the cache, and thus a coherence problem occurs:
if one copy is modified, the other will have the wrong value. Hardware schemes, called anti-aliasing, that guarantee every cache line a unique physical address may be employed to solve this problem, but software solutions are less expensive. The idea of the software solution is to force aliases to share a number of address bits so that the cache cannot accommodate duplicates of an alias. For example, for a direct-mapped cache of 256 KB (that is, 2^18 bytes), if the operating system enforces that all aliases are identical in the last 18 bits of their addresses, then no two aliases can be in the cache simultaneously. The third problem is that I/O typically uses physical addresses and thus requires a mapping to virtual addresses in order to interact with a virtual cache and maintain coherence.

6.3 Reducing Read Miss Penalty

Because reads dominate cache accesses, it is important to make read misses fast. There are several methods to reduce the read miss penalty. In the first method, called fetch bypass or out-of-order fetch, the missed word is requested first, regardless of its position in the line; the data requested from memory is transmitted in parallel to the CPU and the cache, and the CPU waits only for the requested data. In the second method, called early restart, the line that contains the requested data is brought from memory starting with the left-most byte, but the CPU continues execution as soon as the requested data arrives.

With fetch bypass, the missed word is requested first from memory and sent to the CPU as soon as it arrives, bypassing the cache; the CPU continues execution while the rest of the words in the block are filled in. Because the first word requested by the CPU may not be the first word of a line, this strategy is also called out-of-order fetch or wrapped fetch. Usually, the cache is loaded in parallel while the processor reads data from main memory (i.e., fetch bypass with simultaneous cache fetch), in order to overlap the fetching of the specified data for the CPU and for the cache. When the transfer begins with a byte that is not the left-most byte of the line, the transfer should wrap around after the right-most byte of the line and then transfer the left-most bytes of the line that were skipped at first. These methods reduce the read miss penalty by obviating the need for the processor to wait for the cache to load the entire line.

Unfortunately, not all the words of a line have an equal likelihood of being accessed first. If that were true, with a line size of L bytes, the average line entry point would be L/2. However, due to sequential access, the left side of the line is more likely to be accessed first. For example, Hennessy and Patterson have determined [6], for one architecture, that the average line entry point for instruction fetch is at 5.6 bytes from the left-most byte of a 32-byte line. The left-most word of a block is most likely to be accessed first because of sequential accesses from prior blocks on instruction fetches and because of sequentially stepping through arrays for data accesses. This effect of spatial locality limits the performance improvement obtained with out-of-order fetch. Spatial locality also affects the efficiency of early restart, because the next cache request is likely to be to the same line. The reduction in the read miss penalty obtained with these methods should be weighed against the increased complexity incurred by handling another request while the rest of a line is being filled.
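The wrap-around transfer order used by fetch bypass can be made concrete with a few lines of C. This illustrative sketch (hypothetical names; 4-byte words and an 8-word line assumed) prints the order in which the words of a line are transferred when the miss falls on a given word: the missed word first, then the remaining words, wrapping from the end of the line back to its beginning.

#include <stdio.h>

#define WORDS_PER_LINE 8      /* assumed: 32-byte line, 4-byte words */

/* Print the wrapped (out-of-order) fetch sequence for a miss on word
 * 'missed' of a line: missed word first, then wrap around the line. */
void wrapped_fetch_order(int missed)
{
    printf("miss on word %d:", missed);
    for (int i = 0; i < WORDS_PER_LINE; i++)
        printf(" %d", (missed + i) % WORDS_PER_LINE);
    printf("\n");
}

int main(void)
{
    wrapped_fetch_order(0);   /* left-most word: same as a normal fetch */
    wrapped_fetch_order(5);   /* words 0..4 are transferred last        */
    return 0;
}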
6.4 Reducing Conflict Misses in a Direct-Mapped Cache

As described in Section 5.8, conflict misses may appear when two addresses map into the same cache set. Consider referencing a cache with two addresses, a_i and a_j. Using the bit selection method described in Section 5.4, these two addresses will map into the same set if and only if they have identical Index fields. Denoting by b the bit selection operation performed on an address to obtain the index, the two addresses will map into the same set iff:

    b[a_i] = b[a_j]     (23)

Two addresses that satisfy this equation are called conflicting addresses because they may potentially cause conflicts. Assume the following access pattern:

    a_i a_j a_i a_j a_i a_j a_i a_j . . .

where a_i and a_j are conflicting addresses. A 2-way set-associative cache will not suffer a miss if the processor issues this address pattern, because the data referenced by a_i and a_j can co-reside in a set. In contrast, in a direct-mapped cache, the reference to a_j will result in an interference (or conflict) miss because the data from a_i occupies the selected line. The percentage of misses that are due to conflicts varies widely among different applications, but it is often a substantial portion of the overall miss rate.

6.4.1 Victim Cache

The victim cache scheme has been proposed by Jouppi [12]. A victim cache is a small, fully-associative cache that provides some extra cache lines for data removed from the direct-mapped cache due to misses. Thus, for a reference stream of conflicting addresses, such as a_i a_j a_i a_j a_i a_j . . ., the second reference, a_j, will miss and force the data indexed by a_i out of the set. The data that is forced out is placed in the victim cache. Consequently, the third reference, a_i, will not require accessing main memory because the data can be found in the victim cache. Fetching a conflicting datum with this scheme requires two or three clock cycles:
1. the first clock cycle is needed to check the primary cache;
2. the second cycle is needed to check the victim cache;
3. a third cycle may be needed to swap the data in the primary cache and victim cache so that the next access will likely find the data in the primary cache.
This scheme has several disadvantages: it requires a separate, fully-associative cache to store the conflicting data. Not only does the victim cache consume extra area, but it can also be quite slow due to the need for an associative search and for the logic to maintain a least-recently-used replacement policy. For adequate performance, a sizeable victim cache is required so that it can store all conflicting data blocks. If the size of the victim cache is kept fixed while the primary direct-mapped cache grows, it is not very effective at resolving the conflicts of large primary caches.
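As a rough illustration of the lookup path just described (primary direct-mapped cache first, victim cache second, swap on a victim hit), the C sketch below models only the tag arrays; the sizes, the use of the full line address as tag, and the round-robin replacement are illustrative simplifications, not details of Jouppi's proposal.

    #include <stdbool.h>
    #include <stdint.h>

    #define SETS        1024      /* illustrative: 1024 sets of 32-byte lines */
    #define VICTIM_WAYS 4         /* illustrative: a 4-entry victim cache     */

    struct line { uint32_t ltag; bool valid; };   /* full line address as tag */

    static struct line primary[SETS];
    static struct line victim[VICTIM_WAYS];
    static unsigned    victim_next;   /* round-robin fill, a stand-in for LRU */

    /* Returns 1 on a primary (first-cycle) hit, 2 on a victim hit (after
     * swapping the two lines so the next access hits in the primary cache),
     * and 0 on a miss that goes to main memory. */
    int lookup(uint32_t addr)
    {
        uint32_t ltag = addr / 32;       /* line address                      */
        uint32_t set  = ltag % SETS;     /* bit selection                     */

        if (primary[set].valid && primary[set].ltag == ltag)
            return 1;

        for (unsigned i = 0; i < VICTIM_WAYS; i++)
            if (victim[i].valid && victim[i].ltag == ltag) {
                struct line tmp = primary[set];   /* swap primary and victim  */
                primary[set] = victim[i];
                victim[i]    = tmp;
                return 2;
            }

        victim[victim_next] = primary[set];       /* evicted line becomes a victim */
        victim_next = (victim_next + 1) % VICTIM_WAYS;
        primary[set].ltag  = ltag;                /* fetch from memory (not shown) */
        primary[set].valid = true;
        return 0;
    }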
6.4.2 Column-Associative Cache

The challenge is to find a scheme that minimizes the conflicts that arise in direct-mapped accesses by allowing conflicting addresses to dynamically choose alternate mapping functions, so that most of the conflicting data can reside in the cache. At the same time, however, the critical hit access path (which is an advantage of the direct-mapped organization) must remain unchanged. The method presented here is called column associativity and was proposed by A. Agarwal and S. D. Pudar [11].
The idea is to emulate a 2-way set-associative cache with a direct-mapped cache by mapping two conflicting addresses to different sets, instead of referencing another line in the same set as 2-way set associativity does. Therefore, conflicts are resolved not within a set but within the entire cache, which can be thought of as a column of sets; hence the name column associativity.
The method uses two mapping functions (also called hashing functions) to access the cache. The first hashing function is the common bit selection, that is, an address a_i is mapped into the set with the number:

    b[a_i]     (24)

The second hashing function is a modified bit selection, which gives the same value as the bit-selection function except for the highest-order bit, which is inverted. We call this hashing function bit flipping and denote it by f. For example, if b[a] = 010, then applying the bit-flipping function to the address a yields f[a] = 110. Therefore, the function f applied to an address a_j always gives a set number different from that given by the function b:

    b[a_j] != f[a_j]     (25)

The scheme works as follows:
1. the bit-selection function b is applied to a memory address a_i. If b[a_i] indexes to valid data, a first-time hit occurs, and there is no time penalty;
2. if the first access has missed, then the bit-flipping function f is used to access the cache. If f[a_i] indexes to valid data, then a second-time hit occurs and the data is retrieved;
3. if a second-time hit has occurred, then the two cache lines are swapped so that the next access will likely result in a first-time hit;
4. if the second access misses, then the data is retrieved from main memory, placed in the cache set indexed by f[a_i], and then swapped with the data indexed by b[a_i], with the goal of making the next access likely to be a first-time hit.
The first and second steps each require one clock cycle, while swapping requires two clock cycles. The second-time hit, including swapping, is then four cycles, but it can be reduced to only three cycles by using an extra buffer for the cache. Given this buffer, the swap need not involve the processor, which may be able to do other useful work while waiting for the cache to become available again. If this is the case half of the time, then the time wasted
by a swap is only one cycle. Therefore, it can be considered that a swap adds only one cycle to the execution time, and hence the second-time hit is 3 clock cycles.
Using two hashing functions mimics 2-way set associativity because, for two conflicting addresses a_i and a_j, rehashing a_j with f resolves the conflict with high probability: from equations (23) and (25) it follows that the function f applied to a_j gives a set different from b[a_i]:

    b[a_i] = b[a_j] != f[a_j]     (26)

The difference is that a second-time hit takes three clock cycles, while in a 2-way set-associative cache the two lines of a set can be retrieved in one clock cycle. However, the clock cycle of the 2-way set-associative cache is longer than the clock cycle of the direct-mapped cache.
A problem that must be solved for column-associative caches is the problem of possible incorrect hits. Consider two addresses, a_i and a_k, that map with bit selection to indexes that differ only in the highest-order bit. In this case, the index obtained by applying the bit-selection mapping to one address is the same as the index obtained by applying the bit-flipping mapping to the other address:

    b[a_k] = f[a_i]   and   b[a_i] = f[a_k]     (27)

These two addresses are distinct, but they may have identical tag fields. If this is the case, when a rehash occurs for the address a_i and the data addressed by a_k is already in the cache at location b[a_k], then the bit-flipping mapping f[a_i] results in a hit on a data block that should only be accessed through b[a_k]. For example, if

    b[a_k] = 110,   b[a_i] = 010,   and   Tag[a_k] = Tag[a_i]     (28)

and assuming that the data line addressed by a_k is cached in the set with index b[a_k], then when the address a_i is presented to the cache, this address will be rehashed to the same set as a_k (i.e., f[a_i] = 110) and will cause a second-time hit (a false hit), because the two addresses have the same Tag. This is incorrect, because a data line must have a one-to-one correspondence with a unique memory address. The solution to this problem is to extend the Tag with the highest-order bit of the index field. In this case, the rehash with f[a_i] will correctly fail, because the information that a_i and a_k have different indexes is present in the Tag. In this way, the data line stored in the set with number b[a_k] = f[a_i] is put into correspondence with a unique index, and hence a unique address.
Another problem is that storing conflicting data in another set is likely to result in the loss of useful data; this is referred to as clobbering. The source of this problem is that a rehash is attempted after every first-time miss, which can replace potentially useful data in the rehashed location, even when the primary location held an inactive line. Clobbering may lead to an effect called secondary thrashing, presented in the following paragraph. Consider the following reference pattern:

    a_i a_j a_k a_j a_k a_j a_k . . . ,

where the addresses a_i and a_j map into the same cache location with bit selection, and a_k
is an address that maps into the same location with bit flipping, that is:

    b[a_i] = b[a_j],   b[a_k] = f[a_i]   and   f[a_k] = b[a_i]     (29)

After the first two references, the data referenced by a_j (which will be called j for brevity) and the data i will be in the non-hashed and rehashed locations, respectively (because of swapping). When the next address, a_k, is encountered, the algorithm attempts to access b[a_k] (bit selection is tried first), which contains the rehashed data i; when the first-time miss occurs, the algorithm tries to access f[a_k] (bit flipping is tried second), which results in a second-time miss and the clobbering of the data j. This pattern continues as long as a_j and a_k alternate: the data referenced by one of them is clobbered, while the inactive data block i is swapped back and forth but never replaced. This effect is referred to as secondary thrashing.
The solution to this problem is to find a method to inhibit a rehash access if the location reached by the first-time access itself contains a rehashed data block, that is, with the previous notation, when the location referenced by a_k with bit selection (b[a_k]) already contains rehashed data (the data i is rehashed to f[a_i]). This condition can be satisfied by adding to each cache set an extra bit that indicates whether the set is a rehashed location, that is, whether the data in that set is indexed by f[a]. This bit is called the rehash bit, denoted Rbit, and it makes it possible to test whether a first-time miss occurs on rehashed data, and thus to avoid rehashing a first-time miss to a set that contains rehashed data. Therefore, the scheme for column associativity is the following (step 2 of the basic scheme is modified to avoid clobbering):
1. the bit-selection hashing function b is applied to a memory address a. If b[a] indexes to valid data, a first-time hit occurs, and there is no time penalty;
2. if the first access is a miss, then the action taken depends on the value of the rehash bit of the set indexed by b[a]:
(a) if the rehash bit is set to one, then no rehash access is attempted, but the data retrieved from memory is placed in the location obtained by bit selection. The rehash bit for that set is then reset to zero, to indicate that the data in this set is indexed by bit selection, and the access is completed;
(b) if the rehash bit is already zero, then the bit-flipping function f is used to access the cache. If f[a] indexes to valid data, then a second-time hit occurs and the data is retrieved;
3. if a second-time hit has occurred, then the two cache lines are swapped so that the next access will likely result in a first-time hit;
4. if the second access misses, then the data is retrieved from main memory, placed in the cache set indexed by f[a], and then swapped with the data indexed by b[a], with the goal of making the next access likely to be a first-time hit.
Note that if a second-time miss occurs, then the set whose data will be replaced is again a rehashed location, as desired. At start-up (or after a cache flush), all of the empty cache locations should have their rehash bits set to one.
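The final scheme just listed can be summarized by the following C sketch (a minimal illustration under assumed parameters, not code from Agarwal and Pudar); the number of sets, the cycle accounting, and the use of the full line address as the tag are simplifying assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define SETS 1024                    /* illustrative number of sets        */

    struct cset { uint32_t ltag; bool valid; bool rbit; };  /* rbit = rehash bit */
    static struct cset cache[SETS];

    static uint32_t b(uint32_t laddr) { return laddr % SETS; }           /* bit selection          */
    static uint32_t f(uint32_t laddr) { return b(laddr) ^ (SETS / 2); }  /* flip highest index bit */

    /* Returns the cycles charged to the access: 1 for a first-time hit, 3 for
     * a second-time hit (swap partly overlapped), and a miss otherwise.  The
     * full line address is kept as the tag, which subsumes extending the Tag
     * with the highest-order index bit, as discussed in the text. */
    int ca_access(uint32_t laddr, int miss_penalty)
    {
        uint32_t i1 = b(laddr), i2 = f(laddr);
        struct cset tmp;

        if (cache[i1].valid && cache[i1].ltag == laddr)
            return 1;                                        /* first-time hit */

        if (cache[i1].rbit) {                                /* rehashed data here: */
            cache[i1] = (struct cset){ laddr, true, false }; /* do not rehash, replace in place */
            return 1 + miss_penalty;
        }

        if (cache[i2].valid && cache[i2].ltag == laddr) {    /* second-time hit */
            tmp = cache[i1];  cache[i1] = cache[i2];  cache[i2] = tmp;
            cache[i1].rbit = false;  cache[i2].rbit = true;  /* mark the rehashed copy */
            return 3;
        }

        cache[i2] = (struct cset){ laddr, true, true };      /* fill the rehashed location ... */
        tmp = cache[i1];  cache[i1] = cache[i2];  cache[i2] = tmp;   /* ... then swap (step 4)  */
        cache[i1].rbit = false;  cache[i2].rbit = true;
        return 2 + miss_penalty;
    }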
The reason that this scheme correctly replaces a location whose Rbit is set to one immediately after a first-time miss is based on the relationship between the bit-selection and bit-flipping mappings: given two addresses a_i and a_k, if f[a_i] = b[a_k] then f[a_k] = b[a_i]. Therefore, if a_i accesses a location through b[a_i] whose rehash bit is set to one, there are only two possibilities:
1. the accessed location is an empty location left over from start-up, or
2. there exists a non-rehashed location at f[a_i] (that is, b[a_k]) which previously encountered a conflict and placed its data in its rehashed location, f[a_k].
In both cases, replacing the location reached during the first-time access that has the Rbit set to one is a good action, because the data at location b[a_i] is less useful than the data at location f[a_i] = b[a_k].
The rehash bits limit the rehash accesses and the clobbering effect, and they lower the probability of secondary thrashing. For the reference stream mentioned above, a_i a_j a_k a_j a_k a_j a_k . . ., the third reference accesses b[a_k] but finds the rehash bit set to one, because this location contains the data referenced by a_i. Therefore, the data i is replaced immediately by k, the desired action. The column-associative cache can still exhibit secondary thrashing if three or more conflicting addresses alternate, as in the pattern a_i a_j a_k a_i a_j a_k a_i a_j . . ., but this case is much less probable than two alternating addresses.

6.5 Reducing Read Miss Rate

When the processor makes a memory reference that misses in the cache, the line corresponding to that memory address is fetched from memory. If no line is fetched until it is referenced by the processor, this is called demand fetching; that is, no line is fetched from memory in advance. When a line is fetched from memory and brought into the cache before it is requested by the processor, this is called a prefetch operation. The purpose of prefetching is to bring in, ahead of time, information that will soon be needed by the processor, and in this way to decrease the miss rate. A prefetch algorithm guesses what information will soon be needed and fetches it.
When a prefetch algorithm decides to fetch a line from memory, it should interrogate the cache to see whether that line is already resident. This is called a prefetch lookup and may interfere with the actual cache lookups generated by the processor. Given that a prefetch may require replacing an existing line, this interference consists not only of cycles lost by the CPU while waiting for the prefetch lookup cache accesses, or of cache cycles used to bring in the prefetched line and perhaps move out a line from the cache, but also of a potential increase in miss ratio when lines that are more likely to be referenced are expelled by a prefetch. This problem is called memory pollution, and its impact depends on the line size. Small line sizes generally result in a benefit from prefetching, while large line sizes make prefetching ineffective. The reason for this is that when the line
is large, a prefetch brings in a great deal of information, much or all of which may not be needed, and removes an equally large amount of information, some of which may still be in use.
The fastest hardware implementation (which is a major design criterion) is provided by prefetching the line that immediately follows a referenced line. That is, if line i is referenced, only line i + 1 is considered for prefetching. This method is known as one-block lookahead (OBL). A prefetch may potentially be initiated for every memory reference, and there are two strategies for deciding when to prefetch:
1. always prefetch: on every memory reference, an access to line i (for all i) implies a prefetch access for line i + 1;
2. prefetch on misses: a reference to a line i causes a prefetch of line i + 1 if and only if the reference to line i was a miss.
Prefetching has several effects: it (presumably) reduces the miss ratio, increases the memory traffic, and introduces cache lookup accesses. Always prefetch provides a greater decrease in miss ratio than prefetch on misses, but it also introduces greater memory and cache overhead. The advantage of prefetching depends very strongly on the effectiveness of the implementation. Prefetching should not use too many cache cycles if an acceptable level of interference with normal program accesses to the cache is to be maintained. This can be accomplished in several ways:
1. by instituting a second, parallel port to the cache;
2. by deferring prefetches until spare cache cycles are available;
3. by not repeating recent prefetches: this can be done by remembering the addresses of the last n prefetches in a small auxiliary buffer, testing a potential prefetch against this buffer, and not issuing the prefetch if the address is found.
Another scheme that may help is to buffer the transfers between the cache and main memory required by a prefetch and to perform them during otherwise idle cache cycles. The memory traffic caused by prefetching seems unavoidable, but it is tolerable for one-block lookahead.
A prefetch operation may be thought of as a nonblocking read of two lines; that is, when a read miss occurs, the processor does not have to wait until both lines are fetched from memory, but can proceed immediately after the requested line has been brought into the cache. Thereafter, while the processor proceeds, the cache fetches the adjacent line from memory. One-block lookahead prefetch with a line size L outperforms demand fetching with a line size 2L, due to the overlapping of the line fetch and CPU execution.
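The decision logic for one-block lookahead under the two policies can be sketched as follows; this is only an illustration under assumptions, and the cache_lookup, cache_fetch, and cache_prefetch hooks, as well as the small buffer of recent prefetch addresses, are assumed interfaces rather than anything defined in the report.

    #include <stdbool.h>
    #include <stdint.h>

    #define RECENT 8                          /* remember the last few prefetches */

    extern bool cache_lookup(uint32_t line);      /* assumed: is the line cached?  */
    extern void cache_fetch(uint32_t line);       /* assumed: demand fetch a line  */
    extern void cache_prefetch(uint32_t line);    /* assumed: queue for idle cycles */

    static uint32_t recent[RECENT];
    static unsigned recent_next;

    static bool recently_prefetched(uint32_t line)
    {
        for (unsigned i = 0; i < RECENT; i++)
            if (recent[i] == line) return true;
        return false;
    }

    /* One-block lookahead: on a reference to line i, consider prefetching i + 1.
     * If always_prefetch is false, prefetch only when the reference to i missed. */
    void reference(uint32_t line, bool always_prefetch)
    {
        bool hit = cache_lookup(line);
        if (!hit)
            cache_fetch(line);                            /* demand fetch        */

        if ((always_prefetch || !hit) && !recently_prefetched(line + 1)) {
            if (!cache_lookup(line + 1)) {                /* prefetch lookup     */
                cache_prefetch(line + 1);
                recent[recent_next] = line + 1;
                recent_next = (recent_next + 1) % RECENT;
            }
        }
    }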
6.6 Reducing Write Hit Time

Write hits usually take more than one cycle because the Tag must be checked before writing the data, and because when the processor modifies only a portion of a line, that line must first be read from the cache in order to get the unmodified portion. There are two ways to make writes faster: pipelining the writes, and subblock placement for write-through caches.

6.6.1 Pipelined Writes

This technique pipelines the writes to the cache. The two steps of a cache write operation (tag comparison and data write) are pipelined in a two-stage scheme:
• the first stage compares the Target Address with the Tags;
• the second stage writes to the cache using the address and data from the previous write hit.
The idea is that while the first stage compares the Tag with the Target Address, the second stage accesses the cache using the address and data from the previous write. This scheme requires that the Tag and Data can be addressed independently, that is, they must be stored in separate memory arrays. Therefore, when the CPU issues a write and the first stage produces a hit, the CPU does not have to wait for the write to the cache, which will be made in the second stage. In this way, a write to the cache takes only one clock cycle. Moreover, this technique does not affect read hits: the second stage of a write hit occurs during the first stage of the next write hit, or during a cache miss.

6.6.2 Subblock Placement

This scheme may be applied to direct-mapped caches with a write-through policy. The scheme maintains a valid bit on units smaller than the full block, called subblocks. The valid bits mark some parts of the block as valid and some parts as invalid. A match of the tag does not mean the word is necessarily in the cache, as the valid bits for that word must also be on. For caches with subblock placement, a block can no longer be defined as the minimum unit transferred between cache and memory, but rather as the unit of information associated with an address tag.
Subblock placement was invented with a twofold goal: to reduce the long miss penalty of large blocks (since only part of a large block needs to be read on a miss) and to reduce the tag storage for small caches. The discussion below demonstrates the usefulness of this method for writes.
Subblock placement may be used for writes by extending it in the following way: a word is always written into the cache no matter what happens with the tag match, the valid bit is turned on, and then the word is sent to memory. This trick improves both write hits and misses and works in all cases, as shown below:
1. Tag match and valid bit already set. Writing the block was the proper action, and nothing was lost by setting the valid bit on again.
2. Tag match and valid bit not set. The tag match means that this is the proper block; writing the data into the block makes it appropriate to turn the valid bit on.
3. Tag mismatch. This is a miss and will modify the data portion of the block. However, as this is a write-through cache, no harm was done; memory still has an up-to-date copy of the old value. Only the tag needs to be changed, to the address of the write, because the valid bit has already been set.
If the block size is one word and the STORE instruction is writing one word, then the write is complete. When the block is larger than a word, or if the instruction is a byte or halfword store, then either the rest of the valid bits are turned off (allocating the subblock without fetching the rest of the block) or memory is asked to send the missing part of the block (i.e., write allocate). This scheme cannot be used with a write-back cache, because the only valid copy of the data may be in the block, and it could be overwritten before checking the tag.

6.7 Reducing Write Stalls

Write stalls may occur on every write under the write-through strategy, or when a dirty line is replaced under the write-back strategy; they can be avoided by using a write buffer (described in Section 5.7) of a suitable size. Write buffers, however, introduce additional complexity for handling misses, because they might hold the updated value of a location needed on a read miss.
For write through, the simplest solution to this problem is to delay the read until all the information in the write buffer has been transmitted to memory, that is, until the write buffer is empty. But since a write buffer usually has room for only a few words, it will almost always contain data not yet transferred, that is, it will not be empty, and this incurs an increase in the Read Miss Penalty. This increase may reach as much as 50% for a four-word buffer, as stated by Hennessy and Patterson in [6]. An alternative approach is to check the contents of the write buffer on a read miss and, if there are no conflicts and the memory system is available, let the read miss continue.
For write back, the buffer (whose size is one line in this case) may contain a dirty line that has been purged from the cache to make room for a new line but has not yet been written into memory, which still contains the old data. When a read miss occurs, there are again two approaches: either wait until the buffer is empty, or check whether it contains the referenced line and continue with the memory access if there is no conflict.

6.8 Two-level Caches

6.8.1 Reducing Miss Penalty

The gap between CPU and main memory speeds is increasing, due to CPUs getting faster and main memories getting larger, but slower relative to the faster CPUs. The
question arising is whether the cache should be made faster, to keep pace with the speed of the CPU, or larger, to reduce the miss rate. These two conflicting choices can be reconciled by adding another level of cache between the original cache and main memory:
• the first-level cache is small enough to match the clock cycle time of the CPU;
• the second-level cache is large enough to capture many accesses that would otherwise go to main memory, that is, misses in the first-level cache.
This is a two-level cache, with the first-level cache closer to the CPU. The average memory-access time for a two-level cache may be computed through the steps below, where the subscripts L1 and L2 refer to the first-level and the second-level cache, respectively:

    Average memory-access time = Hit time_L1 + Miss rate_L1 * Miss penalty_L1     (30)

and

    Miss penalty_L1 = Hit time_L2 + Miss rate_L2 * Miss penalty_L2     (31)

therefore,

    Average memory-access time = Hit time_L1 + Miss rate_L1 * (Hit time_L2 + Miss rate_L2 * Miss penalty_L2)     (32)

For a two-level cache, one should distinguish between the local miss rate and the global miss rate:
• Local miss rate: the number of misses in the cache divided by the total number of memory accesses to that cache; these are Miss rate_L1 and Miss rate_L2;
• Global miss rate: the number of misses in the cache divided by the total number of memory accesses generated by the CPU; for the second-level cache this is Miss rate_L1 * Miss rate_L2.
The second-level cache reduces the miss penalty of the first-level cache (equation (31)) and allows the designer to optimize the second-level cache for lowering this parameter, while the first-level cache is optimized for low hit time.
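Equations (30)-(32) can be exercised with a small calculation; the parameter values in the sketch below are illustrative assumptions, not measurements from the report.

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative parameters (cycles and miss rates, not from the report). */
        double hit_L1 = 1.0,  miss_rate_L1 = 0.04;
        double hit_L2 = 10.0, miss_rate_L2 = 0.20;   /* local miss rate of L2 */
        double miss_penalty_L2 = 100.0;

        double miss_penalty_L1 = hit_L2 + miss_rate_L2 * miss_penalty_L2;  /* eq. (31) */
        double amat = hit_L1 + miss_rate_L1 * miss_penalty_L1;             /* eq. (30) */

        printf("Miss penalty L1 : %.1f cycles\n", miss_penalty_L1);        /* 30.0  */
        printf("AMAT            : %.2f cycles\n", amat);                   /* 2.20  */
        printf("Global miss rate: %.3f\n", miss_rate_L1 * miss_rate_L2);   /* 0.008 */
        return 0;
    }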
6.8.2 Second-level Cache Design

The most important difference between the two levels of the cache is that the speed of the first-level cache affects the clock rate of the CPU, while the speed of the second-level cache affects only the miss penalty of the first-level cache (equation (31)). Hence, for the second-level cache there is a clear design goal: lower the miss penalty (equation (12), Section 5.2), where for a two-level cache the miss penalty is Miss penalty_L1 (equation (31)).

Capacity of the second-level cache
The second-level cache should be chosen bigger than the first-level cache, because the information in the first-level cache is likely to be in the second-level cache as well. If the second-level cache is just a little bigger than the first-level cache, its local miss rate will be high; if it is much larger than the first-level cache (which usually means above 256 KB), then the global miss rate is about the same as for a single-level cache of the same size. Typical second-level cache sizes range from 256 KB to 4 MB.

Associativity of the second-level cache
Unlike the first-level cache, where the associativity is limited by its impact on the clock cycle time (Section 6.1), high associativity may be helpful for the second-level cache, because here the sum expressed by equation (31) is what matters. Therefore, as long as increasing the associativity has a small impact on the second-level hit time, Hit time_L2, but a large impact on Miss rate_L2, it is worthwhile to increase it. However, for very large caches the benefits of associativity diminish, because the larger size has already eliminated many conflict misses; that is, the decrease in Miss rate_L2 no longer outweighs the increase in Hit time_L2.

Line size of the second-level cache
As shown in Section 5.9, increasing the block size reduces the compulsory misses as long as spatial locality holds, but it does not preserve temporal locality, leading to an increase in conflict misses. Because second-level caches are large, increasing the line size has a small effect on conflict misses, which favors larger line sizes. Moreover, if the access time of main memory is relatively long, then the effect of a large line size on the Miss penalty (i.e., increased transfer time) is tolerable. Therefore, second-level caches have larger line sizes, usually from 32 to 256 bytes.

6.9 Increasing Main Memory Bandwidth

The Miss penalty is the sum of the Access latency and the Transfer time. As shown in Section 5.9, increasing the line size may decrease the Miss ratio, but the increase in line size is limited by the associated increase in Miss penalty. The organization of main memory has a direct impact on the Miss penalty, because an improvement of the main memory bandwidth (i.e., a decrease in transfer time) allows the cache line size to increase without a corresponding increase in the Miss penalty.

6.9.1 Wider Main Memory

Let us consider a basic main memory organization in which a word has b bytes, described by the following parameters:
m1: the number of clock cycles to send the address to main memory;
m2: the number of clock cycles for the access time per word;
m3: the number of clock cycles to send a word of data;
w:  the width of the memory (and of the bus) in words.
We can compute the memory bandwidth B_w (i.e., the number of data bytes transferred in a clock cycle) corresponding to a bus width of w words:

    B_w = (b * w) / (m1 + m2 + m3)     (33)

We can compute the access latency for one word (assuming that a line contains a multiple of w words):

    Access latency_w = b / B_w = (m1 + m2 + m3) / w     (34)

and the Miss penalty is then computed using equation (9), Section 5.1:

    Miss penalty_w = (L * (m1 + m2 + m3)) / (b * w)     (35)

If the memory width were one word (i.e., w = 1), then the Miss penalty would be:

    Miss penalty_1 = (L * (m1 + m2 + m3)) / b     (36)

Therefore, increasing the width of main memory w times increases the memory bandwidth (equation (33)) and decreases the Miss penalty (equations (35) and (36)) by the factor w, allowing larger line sizes.
A wider bus poses, however, some problems. First, because the CPU accesses the cache one word at a time, a multiplexer is needed between the cache and the CPU, and the multiplexer is on the critical timing path. Another problem is that, because memories usually have error correction, writing only a portion of a word imposes a Read-Modify-Write (RMW) sequence in order to compute the error correction code. When error correction is done over the full width, the frequency of partial block writes will increase compared with a one-word width, and hence the frequency of RMW sequences will increase. This can be remedied by associating the error correction codes with every 32 bits of the bus width, because most writes are that size.

6.9.2 Interleaved Memory

Another way to increase the memory bandwidth is to organize the memory chips in banks that are one word wide. In this way the width of the bus to the cache is still one word, but sending addresses to the banks simultaneously permits them all to read simultaneously. If L is the line size in bytes and b the number of bytes per word, then the number of memory banks, N_banks, is:

    N_banks = L / b     (37)

The mapping of addresses to banks determines the interleaving factor. Interleaved memory normally means that word interleaving is used, that is, the following mapping function:

    Bank number = (Memory word address) modulo (L / b)     (38)

Word interleaving optimizes sequential memory accesses and is ideal for read miss penalty reduction, because when a word misses in the cache, the line that must be fetched from memory is built up from words with sequential addresses, which are hence located in
different banks. Write-back caches make writes as well as reads sequential, getting even more efficiency out of interleaved memory.
With the notation of Subsection 6.9.1, and taking into account that the memory is one word wide, that all the L/b words of a line are accessed in parallel, and that only the data transfer is serial, one obtains for the memory bandwidth of interleaved memory, B_i, the expression:

    B_i = L / (m1 + m2 + m3 * (L/b))     (39)

The access latency has the expression:

    Access latency_i = b / B_i = (b / L) * (m1 + m2 + m3 * (L/b))     (40)

The Miss penalty for interleaved memory is computed using equation (9), Section 5.1:

    Miss penalty_i = m1 + m2 + m3 * (L/b)     (41)

Interleaved memory provides a reduction of the Miss penalty compared with Miss penalty_1 (equation (36)) of the basic memory organization:

    Miss penalty_1 - Miss penalty_i = (L/b - 1) * (m1 + m2)     (42)

In an interleaved memory, the maximum number of banks is limited by memory-chip cost constraints. For example, consider a main memory of capacity 16 MB with 4-byte words (i.e., the main memory holds 4 mega-words). With a one-word-wide memory organization, and using 4-Mbit DRAM chips, 32 memory chips are needed. If the line size is chosen to be 16 words (i.e., 64 bytes), then, using interleaved memory, the number of banks in main memory must be 16 (equation (37)), and the capacity of a bank is 256 Kwords (1 MB). Therefore, a bank contains 32 chips of 256-Kbit DRAMs, and a total of 512 DRAM chips are necessary, as opposed to only 32 chips of 4-Mbit DRAM needed in the simplest design.
In conclusion, the number of banks increases linearly with the line size (equation (37)), and the maximum number of banks is limited by the cost of the memory. The availability of high-capacity DRAM chips causes an interleaved memory to be built from more memory chips than a one-word-wide memory (because the banks use smaller-capacity DRAM chips). This increases the cost of the interleaved memory and limits the maximum number of memory banks that can be economically used, and thus limits the line size.
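The three organizations can be compared with a short calculation based on equations (35), (36), and (41); the cycle counts m1, m2, m3 and the 64-byte line below are illustrative values, not parameters taken from the report.

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative timing parameters, in clock cycles (not from the report). */
        double m1 = 4, m2 = 24, m3 = 4;   /* send address, access word, send word */
        double b = 4, L = 64;             /* word size and line size in bytes     */
        double w = 4;                     /* bus width in words for the wide case */

        double penalty_1w   = (L / b) * (m1 + m2 + m3);     /* eq. (36) */
        double penalty_wide = penalty_1w / w;               /* eq. (35) */
        double penalty_il   = m1 + m2 + m3 * (L / b);       /* eq. (41) */

        printf("one-word-wide memory : %.0f cycles\n", penalty_1w);      /* 512 */
        printf("%g-word-wide memory  : %.0f cycles\n", w, penalty_wide); /* 128 */
        printf("interleaved memory   : %.0f cycles\n", penalty_il);      /*  92 */
        return 0;
    }

Note that the difference between the first and the last result, 512 - 92 = 420 cycles, agrees with equation (42): (16 - 1) * (4 + 24) = 420.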
7 SYNCHRONIZATION PROTOCOLS

7.1 Performance Impact of Synchronization

In parallel applications, synchronization points are used for interprocess synchronization and mutually exclusive access to shared data. According to the frequency of synchronization points, applications fall into three broad categories:
1. coarse-grained applications: applications in which the parallel processes synchronize infrequently;
2. medium-grained applications: applications in which the parallel processes synchronize with a moderate frequency;
3. fine-grained applications: applications in which the parallel processes synchronize frequently.
This aspect of application behavior is referred to as the granularity of the parallelism.
Synchronization involves accesses to synchronization variables. These variables are prone to becoming hot spots, that is, variables frequently accessed by many processors. This in turn causes memory traffic and may degrade the system performance up to the point of saturation.
The inefficiency caused by synchronization is twofold: waiting times at synchronization points, and the intrinsic overhead of the synchronization operations. Reducing the waiting time is the responsibility of the programmer (the characteristics of a parallel application determine the number of synchronization points and the waiting time), while reducing the synchronization overhead is a task for the computer architect.
The memory-consistency model (Chapter 8) influences the amount of synchronization activity: in machines exhibiting the weak or release consistency models of behavior, the frequency of synchronization points is greater than in machines supporting sequential consistency. The memory-consistency model and the cache-coherence protocol should be taken into account when selecting how to implement a synchronization method.

7.2 Hardware Synchronization Primitives

7.2.1 TEST&SET(lock) and RESET(lock)

A lock is a variable on which two atomic operations can be performed:
1. lock (also called acquire): a process locks a lock when the lock is free, that is, zero, and the process sets the lock to locked, that is, to one. A lock operation is accomplished by reading the lock variable (which is a shared variable) until the value zero is read, and then setting the lock to one (using, for example, an atomic RMW operation).
2. unlock (also called release): a process unlocks a lock when it frees the lock, that is, it sets the lock variable to zero. An unlock operation is always associated with a write to the lock variable.
Locks are useful for providing mutually exclusive access to shared variables. Two synchronization primitives, called TEST&SET and RESET, are a common means to implement a lock. The TEST&SET primitive (which performs an RMW operation) provides an atomic test of a variable and sets the variable to a specified value. The semantics of TEST&SET is:

    TEST&SET(lock) {
        temp = lock;
        lock = 1;
        return temp;
    }

The value returned by the operation is the value before setting the variable. An acquire operation is usually done by having the software repeat the TEST&SET until the returned value is zero. This repeated checking of a variable until it reaches a desired state is called busy waiting, or continual retry. Busy waiting ties up the processor in an idle loop, increases the memory traffic, and may lead to contention problems in the interconnection network. This type of lock, which forces the process to "spin" on the CPU while waiting for the lock to be released, is called a spin-lock. The RESET primitive is used to unlock (release) a lock and has the semantics:

    RESET(lock) {
        lock = 0;
    }

To avoid spinning, interprocessor interrupts are used. A lock that relies on interrupts instead of spinning is called a sleep-lock or suspend-lock. A sleep-lock is implemented as follows: whenever a process fails to acquire the lock, it records its status in one field of the lock and disables all interrupts except interprocessor interrupts. When a process releases the lock, it signals all waiting processes through an interprocessor interrupt. This mechanism prevents the excessive interconnection traffic caused by busy waiting but still consumes processor cycles: the processor is no longer spinning, but it is sleeping!

7.2.2 FETCH&ADD

The FETCH&ADD primitive provides an atomic increment (or decrement) operation on uncached memory locations. Let x be a shared-memory word and a its increment (or decrement, if negative). When a single processor executes FETCH&ADD on x, the semantics are:
    FETCH&ADD(x, a) {
        temp = x;
        x = temp + a;
        return temp;
    }

When N processes attempt to execute FETCH&ADD on the same memory word x simultaneously, the memory is updated only once, by adding the sum of the N increments, and each of the N processes receives a returned value that corresponds to an arbitrary serialization of the N requests. From the processor's point of view, the result is similar to a sequential execution of N FETCH&ADD instructions, but it is performed in one memory operation. The success of this primitive is based on the fact that its execution is distributed in the interconnection network, using a combining interconnection network (Subsection 7.4.1) that is able to combine multiple accesses to a memory location into a single access. In this way, the complexity of an N-way synchronization on the same memory word is independent of N.
This method of incrementing and decrementing has benefits compared with using a normal variable protected by a lock to achieve the atomic increment or decrement, because it involves less traffic, smaller latency, and decreased serialization. The serialization of this primitive is small because it is done directly at the memory site. This low serialization is important when many processors want to increment a location, as happens when getting the next index in a parallel loop. A multiprocessor using a combining network and this primitive is the IBM RP3 computer ([3]). FETCH&ADD is useful for implementing several synchronization methods, such as barriers, parallel loops, and work queues.

7.2.3 Full/Empty bit primitive

Under this primitive, a memory location is tagged as empty or full. LOADs of such words succeed only after the word is updated and tagged as full. After a successful LOAD, the tag is reset to empty. Similarly, a STORE on a full memory word can be prevented until the word has been read and the tag cleared. This primitive relies on busy waiting, and memory cycles are wasted on each trial: when a process attempts to execute a LOAD on an empty-tagged location, the process will spin on the CPU while waiting for the location to be tagged as full. By analogy with locks, one says that the process spin-locks on the Full/Empty bit. This mechanism can be used to synchronize processes, since a process can be made to wait on an empty memory word until some other process fills it.

7.3 Synchronization Methods

In this section we present methods for achieving mutual exclusion and conditional synchronization.
7.3.1 LOCK and UNLOCK operations

A LOCK operation on a lock variable changes the value of the lock variable from zero to one. If several processes attempt to execute a LOCK, only one process is allowed to successfully execute this operation and to proceed. All other processes that attempt to execute a LOCK will wait until the process that has acquired the lock releases it via an UNLOCK operation. An UNLOCK operation sets the lock variable to zero, signaling that the lock is free.
An UNLOCK operation may be implemented using the RESET(lock) primitive. A LOCK operation may be implemented using the TEST&SET primitive, as shown in the code segment:

    LOCK(lock) {
        repeat
            while (LOAD(lock) == 1) ;      // spin-lock with read cycle
        until (TEST&SET(lock) == 0);       // test the free lock and LOCK it
    }

A LOCK operation may be used to gain exclusive access to a set of data (Subsection 7.3.3).

7.3.2 Semaphores

A semaphore is a nonnegative integer variable, denoted by s, that can be accessed by two atomic operations, denoted by P and V. The semantics of the P and V operations are:

    P(s) {
        if (s > 0) then
            s = s - 1;
        else {
            Block the process and append it to the waiting list for s;
            Resume the highest-priority process in the READY LIST;
        }
    }

    V(s) {
        if (the waiting list for s is empty) then
            s = s + 1;
        else {
            Remove the highest-priority process blocked for s;
            Append it to the READY LIST;
        }
    }
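A minimal C11 sketch of the P and V operations above is shown below, using a spin-lock to protect the count; unlike the pseudocode, this simplification busy-waits instead of blocking the process and manipulating the READY LIST, which is the job of the process manager. The type and function names are illustrative.

    #include <stdatomic.h>

    typedef struct {
        atomic_flag guard;   /* protects the count (a spin-lock)  */
        int         count;   /* nonnegative semaphore value       */
    } sem_sketch;

    void sem_P(sem_sketch *s)
    {
        for (;;) {
            while (atomic_flag_test_and_set(&s->guard))   /* acquire guard */
                ;
            if (s->count > 0) {                           /* semaphore open */
                s->count--;
                atomic_flag_clear(&s->guard);
                return;
            }
            atomic_flag_clear(&s->guard);                 /* closed: release guard */
            /* A real P would block here and queue the process; this sketch spins. */
        }
    }

    void sem_V(sem_sketch *s)
    {
        while (atomic_flag_test_and_set(&s->guard))
            ;
        s->count++;   /* a real V would instead wake a process blocked on s */
        atomic_flag_clear(&s->guard);
    }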
In these two algorithms, shared lists are consulted and modified, namely the READY LIST and the waiting list for s. The READY LIST is a data structure containing the descriptors of the processes that are runnable. These accesses, as well as the test and modification of s, have to be protected by locks or FETCH&ADDs associated with the semaphores and with the lists.
Semaphores whose possible values are 0 and 1 are called binary semaphores. Those that can take the values 0 to n are called general or counting semaphores. When a semaphore has a value greater than 0, it is said to be open; otherwise, it is closed. Counting semaphores are useful when there are multiple instances of a shared resource.
In practice, P and V are processor instructions or microcoded routines, or they are operating system calls to the process manager. The process manager is the part of the system kernel controlling process creation, activation, and deletion, as well as management of the locks. Because the process manager can be called from different processors at the same time, its associated data structures must be protected. Semaphores are particularly well suited for synchronization. Unlike spin-locks and sleep-locks, semaphores do not waste processor cycles while a process is waiting, but their invocation requires more overhead. Note that locks are still necessary to implement semaphores. A drawback of semaphore-based synchronization is that it puts the responsibility for controlling access on the programmer or the parallelizing compiler, who must decide when to synchronize and on what conditions.

7.3.3 Mutual Exclusion

Mutually exclusive access to shared variables is achieved by enforcing sequential execution of the critical sections of different processes. The common methods used to control access to the critical section are locks and semaphores.

Mutually exclusive access using locks
If the machine supports an atomic TEST&SET primitive, mutually exclusive access can be implemented as follows:

    while (TEST&SET(lock) == 1) ;    // spin-lock with Read-Modify-Write cycles
    . . .                            // execute the critical section
    RESET(lock);                     // unlock the lock and exit the critical section

This segment of code protects access to a critical section via a spin-lock. A variant of the implementation uses the LOCK operation (Section 7.3.1) to control access to the critical section:

    repeat
        while (LOAD(lock) == 1) ;    // spin-lock with read cycle
    until (TEST&SET(lock) == 0);     // LOCK the lock
    . . .                            // critical section
    RESET(lock);                     // unlock the lock and exit the critical section

The performance of these two approaches is examined in Section 7.5.

Mutually exclusive access using semaphores
To provide mutual exclusion, a binary semaphore s (Subsection 7.3.2) is associated with a critical section and is used to guarantee sequential access to it. Before entering the critical section, each process must execute P(s); upon exiting, it must execute V(s).

7.3.4 Barriers

Barriers are used for a conditional synchronization that requires all synchronizing processes to reach a synchronization point, called a barrier, before any processor is allowed to continue. The BARRIER operation "joins" a number of parallel processes: all processes synchronizing at a barrier must reach the barrier before any one of them can continue. If there are N processes that must reach the barrier, then a barrier variable, denoted by count, which is used as a process counter and is initialized to zero, is used. The BARRIER operation is defined as follows:

    BARRIER(N) {
        count = count + 1;
        if (count >= N) then {
            Resume all processes on the barrier queue;
            Reset count;
        } else
            Block the task and place it in the barrier queue;
    }

The first N - 1 processes that execute the BARRIER operation are blocked and are put in a barrier queue, which can be implemented with a Full/Empty-tagged word in which the identifiers of blocked processes are written. The blocked processes spin-lock on the Full/Empty bit. Upon execution of BARRIER by the N-th process, all N processes are ready to resume; consequently, this process writes into the tagged memory location and wakes up all blocked processes.
A variant implementation of the BARRIER operation also uses a barrier variable that is incremented by each process when it reaches the synchronization point but, instead of using the Full/Empty bit, tests a barrier flag. After incrementing the barrier variable, each processor spin-locks on the barrier flag. The N-th processor that reaches the barrier increments the barrier variable to its final value, N, and writes into the barrier flag, thereby releasing the spinning processors. This method has the disadvantage of relying on busy waiting. Another variant is to use a sleep-lock for the barrier flag. The atomicity of the increment and test operations on the barrier variable must be enforced by some hardware synchronization mechanism, such as a lock. For the barrier variable, the FETCH&ADD primitive is a good choice, if available, because it provides the least contention thanks to the combining property of FETCH&ADD.
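The counter-plus-flag variant just described can be sketched in C11 as follows; an atomic fetch-and-add plays the role of the FETCH&ADD primitive, and the reuse of the barrier across successive phases (sense reversal) is deliberately omitted to keep the sketch short. The names are illustrative.

    #include <stdatomic.h>

    typedef struct {
        atomic_int count;   /* number of processes that have arrived        */
        atomic_int flag;    /* set by the last arrival to release the rest  */
    } barrier_sketch;

    void BARRIER(barrier_sketch *b, int N)
    {
        /* FETCH&ADD-style atomic increment of the barrier variable. */
        int arrived = atomic_fetch_add(&b->count, 1) + 1;

        if (arrived == N) {
            atomic_store(&b->count, 0);   /* reset the counter               */
            atomic_store(&b->flag, 1);    /* release the spinning processors */
        } else {
            while (atomic_load(&b->flag) == 0)
                ;                         /* spin on the barrier flag        */
        }
    }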
7.4 Hot Spots in Memory

When accesses from several processors are concentrated on data in a single memory module over a short period of time, the access pattern is likely to cause hot spots in memory. A hot spot is a memory location repeatedly accessed by several processors. Synchronization objects such as locks and barriers, and loop index variables for parallel loops, are examples of shared variables that can become hot spots.
Hot spots can significantly reduce the memory and network throughput, because they do not allow the parallelism of the machine architecture to be exploited as it can be under uniform memory access patterns. Hot spots can cause severe congestion in the interconnection network, which degrades the bandwidth of the shared-memory system.

7.4.1 Combining Networks

A widespread scheme to avoid memory contention is the combining network. The idea is to incorporate some hardware in the interconnection network to trap and combine data accesses as they fan in to the memory module that contains the shared variable. By combining data accesses in the interconnection network, the number of accesses to the shared variable is decreased. The extra hardware required for this scheme is estimated in [15] to increase the switch size and/or cost by a factor between 6 and 32 for combining networks consisting of 2 x 2 switches. The extra hardware also tends to add extra network delay, which penalizes most of the ordinary data accesses that do not need these facilities, unless the combining network is built separately.

7.4.2 Software Combining Trees

A software tree can be used to eliminate memory contention due to a hot-spot variable. The idea is similar to the concept of a combining network, but it is implemented in software instead of hardware. A software combining tree is used to do the combining of data accesses. This technique, which was proposed by Yew et al. in [16], does not require expensive hardware combining, while providing comparable performance.
The principle of a software combining tree is first illustrated for a barrier variable. Let us assume a multiprocessor architecture with N processors and N memory modules. We define the fan-in of the accesses to a memory location as the number of accesses to that location. We assume that a hot spot with a fan-in of N exists in the system, for example when a barrier variable (Subsection 7.3.4) is used to make sure that all processors have finished a given task before proceeding. The barrier variable is therefore addressed by the N processors, causing a hot spot with a fan-in of N. The barrier variable initially has the value zero, and each of the N processors has to increment this variable, so that when all processors are finished, the value will be N. Assuming that N = 1000, we have a hot spot with 1000 accesses. The software combining tree replaces the single variable with a tree of variables, with each variable in a different memory module. For the example given, if we decide to reduce the fan-in to 10, then for each group of 10 processors a variable is created in a different memory module, for a total of 100 variables (level 1 of the tree
in Figure 8). Therefore, we partition the processes into N/10 = 100 groups of 10, with each group sharing one variable at level 1 of the tree. Then we partition the 100 variables into 10 groups of 10, with each group sharing a variable at level 2 of the tree; thus another 10 variables are created in different memory modules. Finally, with the 10 variables at level 2 we associate a variable in another memory module that is the root of the tree and corresponds to the old hot spot. When the last process in each group increments its variable, it then increments the variable in the parent node. We have therefore increased the number of variables from one to

    100 + 10 + 1 = 111

and the number of memory accesses from 1000 to

    1000 + 100 + 10 = 1110

but instead of having one hot spot with 1000 accesses we have 111 hot spots with only 10 accesses each. This technique results in a significant improvement in throughput and bandwidth, even if we account for the increase in total accesses.

[Figure 8: Software Combining Tree. The level-1 variables are the leaves, groups of level-1 variables feed the level-2 variables, and the level-2 variables feed the root, which corresponds to the former hot-spot location.]

The idea of the combining tree can also be applied to implement conditional synchronization. Because this operation involves processors that are waiting for a shared variable to change in some way, and the variable will be changed by another processor, the combining tree is built here by assigning one processor to each node in the tree. Thus, each processor monitors the state of its node by continually reading the node. When the processor monitoring the root node detects the change in its node, it in turn changes the state of its children's nodes, and so on, until all processors have detected the change and are able to proceed with the next task.
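The counting part of the combining-tree barrier above can be sketched in C as follows; the placement of each counter in a distinct memory module cannot be expressed in portable C, so the arrays only model the counting logic, and the C11 atomic increments stand in for whatever hardware primitive is actually available. The fan-in of 10 and the 1000 processes follow the example in the text.

    #include <stdatomic.h>

    #define FANIN 10
    #define NPROC 1000                                   /* 1000 = 10 * 10 * 10 */

    static atomic_int level1[NPROC / FANIN];             /* 100 leaf counters    */
    static atomic_int level2[NPROC / (FANIN * FANIN)];   /* 10 interior counters */
    static atomic_int root;                              /* the former hot spot  */

    /* Called once by each process; only one in FANIN increments propagates
     * upward, so no single counter sees more than FANIN accesses. */
    void combining_tree_arrive(int pid)
    {
        int g1 = pid / FANIN;                    /* this process's level-1 group */
        if (atomic_fetch_add(&level1[g1], 1) + 1 == FANIN) {
            int g2 = g1 / FANIN;                 /* last of the group moves up   */
            if (atomic_fetch_add(&level2[g2], 1) + 1 == FANIN)
                atomic_fetch_add(&root, 1);      /* root reaches 10 when all are done */
        }
    }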
7.5 Performance Evaluation

The access patterns to locations used for synchronization may cause a great performance penalty. Spin-locks (which rely on busy waiting) should ideally meet the following criteria:
• a minimum amount of traffic generated while waiting,
• low-latency release of a waiting process, and
• low-latency acquisition of a free lock.
We shall examine how locks satisfy these criteria, with reference to the use of locks for achieving mutually exclusive access (Subsection 7.3.3) in a shared-memory system with caches, and show that the repeated execution of the TEST&SET operation by a process, while the lock is held by another process, causes the ping-pong effect.
Let us see what happens when mutual exclusion is achieved using locks (Subsection 7.3.3). Assume there are N processors trying to enter the critical section and waiting for another process, which has already acquired the lock, to release it. In the first variant of the code segment in Subsection 7.3.3, the TEST&SET instruction repeatedly tests the value of the lock variable and also writes the lock variable in the SET part of the operation. With this scheme, each process contending for the lock continuously generates invalidations (due to the WRITEs to the lock variable) to the other caches. As a result of the invalidation of the lock variable by one process spinning on the lock, when the other N - 1 spinning processors read the lock they do not have a copy of it in their caches and must get it via the interconnection network. Consequently, each execution of TEST&SET by one processor causes N - 1 data transfers; this is the ping-pong effect, which introduces a significant amount of traffic.
The second implementation variant of mutual exclusion with locks in Subsection 7.3.3 avoids the ping-pong effect by repeatedly executing a LOAD on the lock variable until the lock is seen to be free. After the first execution of a LOAD by a processor, the subsequent accesses to the lock variable are made to the cached copy in the private cache of each processor, until this copy is invalidated by the processor that releases the lock. In this way, waiting for the lock to be released is done on the cached copy of the lock and no traffic occurs, because the waiting process no longer writes the lock (thus, the criterion "minimum amount of traffic while waiting" is met).
Let us now examine what happens when the lock is released by the processor that had acquired it. When the lock is released, all N cached copies of the lock variable are invalidated. Therefore, the processors spinning on the lock must each generate a read to the memory system to read the lock. Depending on the timing, it is possible that all processors attempt the TEST&SET on the lock once they see that it is free, resulting in further invalidations and rereads. Therefore, even with the second variant, spin-locks may cause intense coherence activity if multiple processors are spinning when the lock is released by another processor. We can conclude that cached TEST&SET schemes are moderately successful in satisfying the efficiency criteria for low-contention locks, but they fail for highly contended locks, because all the waiting processors rush to grab the lock when the lock is released.
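The two variants discussed above can be written with C11 atomics as in the sketch below: the first spins on the atomic exchange itself (and ping-pongs the line), while the second spins on an ordinary read of the cached copy and attempts the exchange only when the lock looks free. This is a sketch of the idea, not of any particular machine's primitives, and the names are illustrative.

    #include <stdatomic.h>

    typedef atomic_int lock_t;     /* 0 = free, 1 = held */

    /* Variant 1: TEST&SET in a loop; every retry writes the lock variable,
     * invalidating the other caches (the ping-pong effect). */
    void lock_tas(lock_t *l)
    {
        while (atomic_exchange(l, 1) == 1)
            ;
    }

    /* Variant 2: spin on a plain read of the (cached) lock, and issue the
     * TEST&SET only once the lock has been observed free. */
    void lock_ttas(lock_t *l)
    {
        do {
            while (atomic_load(l) == 1)
                ;                             /* spin on the cached copy */
        } while (atomic_exchange(l, 1) == 1); /* try to acquire          */
    }

    void unlock(lock_t *l)
    {
        atomic_store(l, 0);                   /* RESET: release the lock */
    }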
8 SYSTEM CONSISTENCY MODELS

The memory consistency model is the set of allowable memory access (event) orderings. Consistency models place requirements on the order in which events from one process may be observed by other processes in the machine. The memory consistency model defines the logical behavior of the machine on the basis of the allowable order (sequence) of execution within the same process and among different processes.
For the memory to be consistent, two necessary conditions must be met:
1. the memory must be kept coherent, that is, all writes to the same location are serialized in some order and are performed in that order with respect to any processor;
2. uniprocessor data and control dependences must be respected (this is the responsibility of the hardware and the compiler); this is required for local-level consistency.
Several memory consistency models have been proposed ([13]), such as sequential consistency, processor consistency, weak consistency, and release consistency. Sequential consistency is the strictest model; it requires the execution of a parallel program to appear as some interleaving of the execution of the parallel processes on a sequential machine. This model offers a simple conceptual programming model but limits the amount of hardware optimization that could increase performance. The other models attempt to relax the constraints on the allowable event orderings, while still providing a reasonable programming model for the programmer.
The architectural organization of a system may or may not inherently support atomicity of memory accesses. While memory accesses are atomic in systems with a single copy of the data (a new value becomes visible to all processors at the same time), such atomicity may not be present in cache-based systems. The lack of atomicity introduces extra complexity in implementing consistency models. Caching of data also complicates the ordering of accesses by introducing multiple copies of the same location. When the interconnection network is not a bus but a general interconnection network (non-bus-based systems), the invalidations sent by a cache may reach the other caches in the system at different moments. Waiting for all nodes connected to the network to receive the invalidation and to send an acknowledgment would cause a serious performance penalty. As a result of the distributed memory system and general interconnection networks used by scalable multiprocessor architectures (for example, the DASH architecture [9,17] presented in Section 9.5.4), requests issued by a processor to distinct memory modules may execute out of order. Consequently, when distributed memory and general interconnection networks are used, performance would benefit if the consistency model allowed accesses to perform out of order, as long as local data and control dependences are observed.
The expected logical behavior of the machine must be known by the programmer in order to write correct programs. The memory consistency model has a direct effect on the complexity of the programming model that the machine presents to the programmer.
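As a concrete illustration of the programmer-visible difference between models, the classic two-process example below (not taken from the report) is often used: under sequential consistency the consumer can never observe the flag set while the data is still zero, while weaker models allow that outcome unless explicit synchronization orders the accesses. The C11 relaxed atomics here merely mimic unordered hardware accesses.

    #include <stdatomic.h>

    atomic_int data, flag;

    /* Producer: writes the data, then sets the flag. */
    void P0(void)
    {
        atomic_store_explicit(&data, 1, memory_order_relaxed);
        atomic_store_explicit(&flag, 1, memory_order_relaxed);
    }

    /* Consumer: waits for the flag, then reads the data. */
    int P1(void)
    {
        while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
            ;
        /* Sequential consistency forbids returning 0 here; a weakly ordered
         * machine may return 0 unless acquire/release synchronization or a
         * fence orders the two accesses. */
        return atomic_load_explicit(&data, memory_order_relaxed);
    }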
8.1 Event Ordering Aspects

Because a memory consistency model specifies what event orderings are legal when several processes access a common set of locations, we first examine the stages a memory request goes through and present some formal definitions.

Access ordering has several aspects. Program order is defined as the order in which accesses occur in the execution of a single process, given that no reordering takes place. All accesses in the program order that are before the current access are called previous accesses. Execution of a memory access has several stages: the access is issued by a processor, it is performed with respect to some processor, and it is observed by other processors. We refine the definition of performing a memory request given in Section 4.5 as follows, where Pi refers to processor i:

Definition: Performing a Memory Request
• A LOAD by Pi is considered performed with respect to Pk at a point in time when issuing a STORE to the same address by Pk cannot affect the value returned by the LOAD.
• A STORE by Pi is considered performed with respect to Pk at a point in time when an issued LOAD to the same address by Pk returns the value defined by this STORE (or a subsequent STORE to the same location).
• An access is performed when it is performed with respect to all processors, and is performing with respect to a processor when it has been issued but has not yet been performed with respect to that processor.
• A LOAD is globally performed if it is performed and if the STORE that is the source of the returned value has been performed.

The distinction between performed and globally performed LOAD accesses is only present in architectures with non-atomic STOREs, because an atomic STORE becomes readable to all processors at the same time and is not allowed to perform while another access to the same memory location is still performing. In architectures with caches and general interconnection networks a STORE operation is inherently non-atomic unless special hardware mechanisms are employed to ensure atomicity (for example, a cache-coherence protocol).

Total order means the order in which accesses occur as the result of executing the accesses of all processes. A total order exists only if each processor observes the same order of occurrence of the accesses issued by all other processors.

8.2 Categorization of Shared Memory Accesses

The knowledge from Chapter 7 allows a general categorization of memory accesses, which provides a global picture of shared-memory accesses and lays the groundwork for the formulation of consistency conditions for the different consistency models.
Conflicting accesses and Competing accesses

Two accesses are called conflicting if they are to the same memory location and at least one of them is a STORE (a Read-Modify-Write operation is treated as an atomic access consisting of both a LOAD and a STORE). Consider a pair of conflicting accesses a1 and a2 on different processors. If no ordering is guaranteed for the two accesses, then they may execute simultaneously, thus causing a race condition. Such accesses a1 and a2 are said to form a competing pair. If an access is involved in a competing pair under any execution, then the access is called a competing access.

A parallel program consisting of individual processes specifies the actions for each process and the interaction among processes. These interactions are coordinated through accesses to shared memory. For example, a producer process may set a flag variable to indicate to the consumer process that a data record is ready. Similarly, processes may enclose all updates to a shared data structure within LOCK and UNLOCK operations to prevent simultaneous access. All such accesses used to enforce an ordering among processes are called synchronization accesses. Synchronization accesses have two distinctive characteristics: they are competing accesses, with one process writing a variable and the other reading it; and they are frequently used to order conflicting accesses (i.e., make them noncompeting). For example, the LOCK and UNLOCK synchronization operations (defined in Subsection 7.3.1) are used to order the non-competing accesses made inside a critical section.

Synchronization accesses can be further partitioned into acquire and release accesses. An acquire synchronization access is performed to gain access to a set of shared locations. A release synchronization access grants this permission. An acquire is usually accomplished by reading a shared location until an appropriate value is read. Thus, an acquire is always associated with a read synchronization access. Similarly, a release is always associated with a write synchronization access. Examples of acquire synchronization accesses are a LOCK operation or a process spinning for a flag to be set. Examples of release synchronization accesses are an UNLOCK operation or a process setting a flag. In fact, a LOCK operation requires a Read-Modify-Write (RMW) access. Most architectures provide atomic RMWs (such as the TEST&SET operation used to gain exclusive access to a set of data) for efficiently dealing with competing accesses. An atomic RMW can be associated with a pair consisting of an acquire access (for the read part of the operation) and a release access (for the write part of the operation), but another categorization of the RMW operation is possible, as shown in the next Section.

A competing access is not necessarily a synchronization access. A competing access that is not a synchronization access is called a non-synchronization competing access. The categorization of shared writable memory accesses is depicted in Figure 9(a).

The categorization of shared accesses into these groups allows a more efficient implementation by using this information to relax the event ordering restrictions. The tradeoff is how easily that extra information can be obtained from the compiler or the programmer and what incremental performance benefits it can provide.
For example, the purpose of a release access is to inform other processes that accesses that appear before it in program order have completed. On the other hand, the purpose of an acquire access is to delay future accesses to data until informed by another process. These two remarks are used in the definition of the release consistency model (Section 8.7).

[Figure 9: Shared writable memory accesses: (a) Categorization; (b) Labeling. The categorization tree splits shared accesses into competing and non-competing, competing accesses into synchronization and non-synchronization, and synchronization accesses into acquire and release; the labeling tree mirrors it with the labels sharedL, specialL/ordinaryL, syncL/nsyncL, and acqL/relL.]

8.3 Memory Access Labeling and Properly-Labeled Programs

While the categorization of shared writable accesses refers to the intrinsic properties of an access, the programmer or the compiler asserts some categorization of accesses. The categorization of an access, as provided by the programmer or the compiler, is called the labeling of that access. The labelings for the memory accesses in a program are shown in Figure 9(b). The subscript L denotes that these are labels. Access labeling is usually done so that a label corresponds to the category of access that has the same position in the categorization tree of Figure 9(a). The labels acqL and relL refer to acquire and release accesses, respectively. Labels at the same level are disjoint, and a label at a leaf implies all its parent labels (e.g., an access labeled acqL is also labeled syncL, specialL, and sharedL).

For consistency models that use the information conveyed by the labels, the labels need to have a proper relationship to the actual category of accesses to ensure correctness. Although labeling normally corresponds to the categorization, sometimes it can be more
conservative than the categorization. For example, the ordinaryL label asserts that an access is non-competing. Since hardware may exploit the ordinaryL label to use less strict event ordering, it is important that the ordinaryL label be used only for non-competing accesses. However, a non-competing access can be labeled conservatively as specialL (thus, in some cases a distinction between the categorization and the labeling of an access is made). To ensure that accesses labeled ordinaryL are indeed non-competing, it is important that enough competing accesses (i.e., accesses labeled as specialL) be labeled as acqL and relL. The difficulty of ensuring enough syncL labels for a program depends on the amount of information about the category of accesses, but it is shown further in this section that the problem can be solved by following a conservative labeling strategy.

Because labels at the same level are disjoint and a label at a leaf implies all its parent labels, it follows that: (1) an acqL or relL label implies the syncL label, (2) any specialL access that is not labeled as syncL is labeled as nsyncL, and (3) any sharedL access that is not labeled as specialL is labeled as ordinaryL.

The LOAD and STORE accesses in a program are labeled based on their categorization. An atomic read-modify-write (such as the TEST&SET primitive) provided by most architectures is labeled by viewing it as a combination of a LOAD access and a STORE access and by labeling each access separately based on its categorization. The common labeling for a TEST&SET primitive is acqL for the LOAD access and nsyncL for the STORE access, because the STORE access does not function as a release. If the programmer or the compiler cannot categorize an RMW appropriately, then the conservative label for guaranteeing correctness is acqL for the LOAD part and relL for the STORE part of the RMW.

When all accesses in a program are appropriately labeled, the program is called a properly-labeled (PL) program. The conditions that ensure that a program is properly labeled are given in [13]:

Condition for Properly-Labeled (PL) Programs
A program is properly labeled if the following hold: (shared access) ⊆ sharedL, competing ⊆ specialL, and enough special accesses are labeled as acqL and relL.

There is no unique labeling that makes a program a PL program, that is, several labelings respect the above subset properties. Given perfect information about the category of an access, the access can easily be labeled by making the labels (Figure 9(b)) correspond to the categorization of accesses (Figure 9(a)). When perfect information for labeling is not available, proper labeling can still be provided by being conservative. The three possible labeling strategies (from conservative to aggressive) are:

1. If competing and non-competing accesses cannot be distinguished, then all reads can be labeled as acqL and all writes can be labeled as relL.

2. If competing accesses can be distinguished from non-competing accesses, but synchronization and non-synchronization accesses cannot be distinguished, then all accesses identified as non-competing can be labeled as ordinaryL and all competing accesses are labeled as acqL and relL (as in strategy (1)).
3. If competing and non-competing accesses are distinguished and synchronization and non-synchronization accesses are distinguished, then all non-competing accesses can be labeled as ordinaryL, all non-synchronization accesses are labeled as nsyncL, and all synchronization accesses are labeled as acqL and relL (as in strategy (1)).

There are two practical ways of labeling accesses to provide properly-labeled (PL) programs. The first involves parallelizing compilers that generate parallel code from sequential programs. Since the compiler does the parallelization, the information about which accesses are competing and which accesses are used for synchronization is known to the compiler and can be used to label the accesses properly. The second way of producing PL programs is to use a programming methodology that lends itself to proper labeling. For example, a large class of programs are written such that accesses to shared data are protected within critical sections. Such programs are called synchronized programs; in them, writes to shared locations are done in a mutually exclusive manner. In a synchronized program, all accesses (except accesses that are part of the synchronization constructs) can be labeled as ordinaryL. In addition, since synchronization constructs are predefined, the accesses within them can be labeled properly when the constructs are first implemented. For this labeling to be proper, the programmer must ensure that the program is synchronized.

8.4 Sequential Consistency Model

The strictest consistency model is called Sequential Consistency (SC) and has been defined by Lamport [2] as follows: A system is sequentially consistent if the result of any execution of a program is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

In other words, the sequential consistency model requires the execution of the parallel program to appear as some interleaving of the execution of the parallel processes on a sequential machine. An interleaving that is consistent with the program order is called a legal interleaving. Application of the above definition requires a specific interpretation of the terms operations and result. Operations are memory accesses (reads, writes, and read-modify-writes) and result refers to the union of the values returned by all the read operations in the execution and the final state of memory. The definition of sequential consistency translates into the following two conditions: (1) all memory accesses appear to execute atomically in some total order, and (2) all memory accesses of each processor appear to execute in the order specified by its program, that is, in program order.

In event ordering terms, a sequentially consistent memory ensures that the execution of processes is such that there is a total order of memory accesses that is consistent with the program order of each process. Under sequential consistency, identification of accesses that form a competing pair can be achieved with the following criterion: Two
conflicting accesses a1 and a2 on different processors form a competing pair if there exists at least one legal interleaving in which a1 and a2 are adjacent.

Assuming the SC model, the following criterion (given in [13]) may be used to determine whether enough accesses are labeled as syncL (i.e., as acqL and relL):

Condition for enough syncL labels
Pick any two accesses u on processor Pu and v on processor Pv (Pu not the same as Pv), such that the two accesses conflict and at least one is labeled as ordinaryL. If v appears after (before) u under any interleaving consistent with the program order, then there needs to be at least one relL (acqL) access on Pu and one acqL (relL) access on Pv separating u and v, such that the relL appears before the acqL. There are enough accesses labeled as syncL — that is, relL- and acqL-labeled accesses — if the above condition holds for all possible pairs u and v.

The SC model ignores all access labelings beyond sharedL. In systems that are sequentially consistent we say that events are strongly ordered: the order in which events are generated by a processor is the same as the order in which all the other processors observe the events, and events generated by two different processors are observed in the same order by all other processors.

8.4.1 Conditions for Sequential Consistency

Necessary and Sufficient Conditions for SC in Systems with Atomic Accesses
It has been shown by Dubois et al. ([3]) that the necessary and sufficient condition for a system with atomic memory accesses to be sequentially consistent is that memory accesses be performed in program order.

In architectures with caches and general interconnection networks, where accesses are inherently non-atomic, special hardware and software mechanisms must be employed to ensure sequential consistency. The sufficient conditions for sequential consistency (as given in [13]) in systems with non-atomic accesses are:

Sufficient Conditions for SC in Systems with Non-Atomic Accesses
(1) before a LOAD is allowed to perform with respect to any other processor, all previous LOAD accesses must be globally performed and all previous STORE accesses must be performed, and
(2) before a STORE is allowed to perform with respect to any other processor, all previous LOAD accesses must be globally performed and all previous STORE accesses must be performed.

8.4.2 Consistency and Shared-Memory Architecture

Let us examine the consistency model for the common shared-memory architectures: shared-bus systems without caches, shared-bus systems with caches, systems with general interconnection networks without caches, and systems with general interconnection networks with caches.
Shared-bus systems without caches
In this case, if memory access cycles (i.e., LOADs, STOREs, and RMWs) are atomic (i.e., data elements are accessed and modified in indivisible operations), then each access to an element applies to the latest copy. Simultaneous accesses to the same element of data are serialized by the hardware. Sequential consistency is thus guaranteed for this architecture if the hardware ensures that the accesses of a processor are issued in program order and if reads are not allowed to bypass writes in write buffers (to maintain atomicity of accesses).

Shared-bus systems with caches
In these systems, accesses are inherently non-atomic. To guarantee sequential consistency, in addition to the requirements that accesses be issued in program order and that reads do not bypass writes, special hardware mechanisms must be employed to make STOREs and LOADs appear to be atomic. A snooping cache protocol (as presented in Subsection 9.4.2) can be used for this purpose. The protocol exploits the simultaneous broadcast capability of buses: when a processor executes a STORE, it generates an invalidation signal on the bus; all cache controllers (and possibly the memory controller) simultaneously latch the invalidation generated by the STORE request. As soon as each controller has taken the proper action on the invalidation, the access can be considered performed.

Systems with general interconnection networks without caches
In these systems, accesses are inherently non-atomic because the time taken for an access to reach the target memory module depends on the path of the access and is generally unpredictable. To guarantee sequential consistency, in addition to the requirements that accesses be issued in program order and that reads do not bypass writes, special hardware mechanisms must be employed to ensure that accesses perform in program order.

Systems with general interconnection networks with caches
In these systems, accesses are inherently non-atomic because of the caches and the properties of the general interconnection network. To guarantee sequential consistency, in addition to the requirements that accesses be issued in program order and that reads do not bypass writes, special hardware mechanisms must be employed to ensure that accesses perform in program order and appear to execute atomically.

8.4.3 Performance of Sequential Consistency

Sequential consistency, while conceptually offering a simple programming model, imposes severe restrictions on the outstanding accesses that a process may have and prohibits many hardware optimizations that could increase performance. As will be apparent from the cache-coherence protocols that guarantee sequential consistency (for example, the full-map directory protocol, Subsection 9.5.2), this strict model limits performance. For many applications, such a model is too strict, and one can do with a weaker notion of consistency. As an example, consider the case of a processor updating a data structure within a critical section. If the computation requires several STOREs and the system is sequentially consistent, then each STORE will have to be delayed until the previous STORE is complete. But such delays are unnecessary, because the programmer has already made sure that no other process can rely on that data structure being consistent until the critical section is exited. Given that all synchronization points are identified, the memory need
only be consistent at those points. Several memory consistency models that attempt to relax the constraints on the allowable event orderings have been proposed; they are called relaxed consistency models. The most prominent relaxed consistency models are the processor consistency, weak consistency, and release consistency models. The larger latencies found in a distributed system, as compared to a shared-bus system, favor the relaxed consistency models because they allow higher-performance implementations than those allowed by the sequential consistency model.

8.5 Processor Consistency Model

The processor consistency (PC) model requires that all writes issued by a processor be observed only in the order in which they were issued, but allows that the order in which writes from two processors occur, as observed by themselves or by a third processor, may not be identical. The conditions for processor consistency are defined in [13] as follows:

Conditions for Processor Consistency
(1) before a LOAD is allowed to perform with respect to any other processor, all previous LOAD accesses must be performed, and
(2) before a STORE is allowed to perform with respect to any other processor, all previous accesses (LOADs and STOREs) must be performed.

The above conditions allow reads following a write to bypass the write. To avoid deadlock, the implementation should guarantee that a write that appears earlier in program order will eventually perform. The PC model ignores all access labelings aside from sharedL.

8.6 Weak Consistency Model

The weak consistency model and the release consistency model (next Section) employ the categorization and labeling of memory accesses (Sections 8.2 and 8.3) to relax the event ordering restrictions on the basis of extra information (provided by the programmer or the compiler) about the type of memory access.

The weak consistency model proposed by Dubois et al. ([3]) is based on the idea that the interaction between parallel processes manifests itself through synchronization accesses that are used to order events and through ordinary shared accesses. If synchronization accesses can be recognized, and sequential consistency is guaranteed only for synchronization accesses, then the ordinary accesses may proceed faster because they need to be ordered only with respect to synchronization accesses. This improves performance because ordinary accesses are more frequent than synchronization accesses. As an example, consider a processor updating a data structure within a critical section. If updating the structure requires several writes, each write in a sequentially consistent system will stall the processor until all other cached copies of that location have been invalidated. But these stalls are unnecessary, as the programmer has already made sure that no other process can rely on the consistency of that data structure until the critical section is exited. If the synchronization points can be identified, then the memory need only be consistent at those points.
The weak consistency model exploits this idea and guarantees that the memory is consistent only following a synchronization operation. The conditions that ensure weak consistency are given in [13]:

Conditions for Weak Consistency
(1) before an ordinary LOAD or STORE access is allowed to perform with respect to any other processor, all previous synchronization accesses must be performed,
(2) before a synchronization access is allowed to perform with respect to any other processor, all previous ordinary accesses (LOADs and STOREs) must be performed, and
(3) synchronization accesses are sequentially consistent with respect to one another.

In terms of access labeling, under the weak consistency model only the labels sharedL, ordinaryL, and specialL are taken into account, with an access labeled as specialL being treated as a synchronization access and as both an acquire and a release.

In a machine supporting weak consistency (also called weak ordering of events [3,13,17]) the programmer should make no assumption about the order in which the events that a process generates are observed by other processes between two explicit synchronization points. Accesses to shared writable data should be executed in a mutually exclusive manner, controlled by synchronization operations such as LOCK and UNLOCK. Only synchronization accesses are guaranteed to be sequentially consistent. Before a synchronization access can proceed, all previous ordinary accesses must be allowed to “settle down” (i.e., all shared-memory accesses made before the synchronization point was encountered must be completed before the synchronization access can proceed). In such systems we say that events are weakly ordered.

The advantage of the weak consistency model is that it provides the user with a reasonable programming model while permitting multiple memory accesses to be pipelined, thus allowing high performance. For example, consider a multiprocessor with a buffered, multistage, packet-switched interconnection network. If strong ordering is to be enforced, then the interface between the processor and the network can send global memory requests only one at a time. The reason is that in such a network the access time is variable and unpredictable because of conflicts; in many cases waiting for an acknowledgement from the memory controller is the only way to ensure that global accesses are performed in program order. In the case of weak ordering the interface can send the next global access directly after the current global access has been latched in the first stage of the interconnection network, resulting in better processor efficiency. However, the frequency of synchronization operations (such as LOCKs) will be higher in a program designed for a weakly ordered system. Therefore, weak consistency is expected to outperform sequential consistency in systems that do not synchronize frequently.

The disadvantage of the weak consistency model is that the programmer or the compiler must identify all synchronization accesses in order to support mutually exclusive access to shared writable data. Moreover, the synchronization accesses must be hardware-recognizable so that they can be enforced to be sequentially consistent.
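The “settle down” requirement can be mimicked in software with explicit fences. Below is a rough sketch, assuming a hypothetical flag-based synchronization and using C11 fences purely as present-day notation: the fence before the synchronization store stands for waiting until all previous ordinary accesses have performed, and the fence after the synchronization load orders the subsequent ordinary accesses.

    #include <stdatomic.h>

    int record[4];                      /* ordinary shared data (hypothetical) */
    atomic_int ready = 0;               /* synchronization variable            */

    void produce(void) {
        record[0] = 1; record[1] = 2;   /* ordinary STOREs, may be pipelined   */
        record[2] = 3; record[3] = 4;
        atomic_thread_fence(memory_order_seq_cst);               /* let ordinary accesses settle */
        atomic_store_explicit(&ready, 1, memory_order_seq_cst);  /* synchronization access       */
    }

    void consume(int out[4]) {
        while (atomic_load_explicit(&ready, memory_order_seq_cst) == 0)
            ;                           /* synchronization access: spin on the flag */
        atomic_thread_fence(memory_order_seq_cst);
        for (int i = 0; i < 4; i++)
            out[i] = record[i];         /* ordinary LOADs, ordered after the sync access */
    }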
8.7 Release Consistency

The release consistency (RC) model is an extension of the weak consistency model in which the ordering requirements on synchronization accesses and ordinary accesses are relaxed. The release consistency model exploits the information conveyed by the labels at the leaves of the labeling tree, that is, the labels ordinaryL, nsyncL, acqL, and relL are considered by the model. Basically, RC guarantees that memory is consistent only when a critical section is exited. The conditions for ensuring release consistency are given in [13] as follows:

Conditions for Release Consistency
(1) before an ordinary LOAD or STORE access is allowed to perform with respect to any other processor, all previous acquire accesses must be performed,
(2) before a release access is allowed to perform with respect to any other processor, all previous ordinary accesses (LOADs and STOREs) must be performed, and
(3) special accesses are processor consistent with respect to one another.

The ordering condition stated by the weak consistency model for synchronization accesses is extended under the release consistency model to special accesses, which include all competing accesses, both synchronization and non-synchronization. On the other hand, four of the ordering restrictions in weak consistency are not present in release consistency:

1. First, ordinary LOAD and STORE accesses following a release access do not have to wait for the release access to be performed. Because the release synchronization access is intended to signal that previous LOAD and STORE accesses in a critical section are complete, it is not related to the ordering of future accesses. Of course, the local dependences within a processor must still be respected by LOADs and STOREs.

2. Second, an acquire synchronization access need not be delayed for previous ordinary LOAD and STORE accesses to be performed. Because an acquire access is intended to prevent future accesses by other processors to a set of shared locations, and does not give permission to any other process to access the previously pending locations, there is no reason for the acquire to wait for the pending accesses to complete.

3. Third, a non-synchronization special access does not wait for previous ordinary accesses and does not delay future ordinary accesses; therefore, a non-synchronization access does not interact with ordinary accesses.

4. Fourth, the special accesses are only required to be processor consistent and not sequentially consistent. The reason is that, provided the applications meet some restrictions, sequential consistency and processor consistency for special accesses give the same results. The restrictions that allow this relaxed requirement on special accesses are given in [13] and have been verified there to hold for the parallel applications available at the time the study was conducted.

Essentially, RC guarantees that the memory is consistent when a critical section is exited, by requiring that all ordinary memory operations be performed before the critical section
is released. This requirement suffices because, while a processor is in its critical section modifying some shared data, no other process can access that data until the section is exited.

The RC model provides the user with a reasonable programming model, since the programmer is assured that when the critical section is exited, all other processors will have a consistent view of the modified data. The relaxed requirements on access ordering allow RC implementations to hide or mask the effects of memory access latency; that is, the effect of the memory access latency is deferred until the selected synchronization access occurs.

8.8 Correctness of Operation and Performance Issues

The programmer must know the consistency model to be able to write correct programs, because the memory consistency model determines the programming model presented by a machine to the programmer. In addition, the method for identifying an access as a competing access depends on the consistency model and is generally difficult. For example, it is possible for an access to be competing under processor consistency and non-competing under sequential consistency.

Consistency models differ in how they exploit the information conveyed by the access labels. The sequential and processor consistency models ignore all labels aside from sharedL. The weak consistency model ignores all labelings beyond ordinaryL and specialL; in weak consistency an access labeled as specialL is treated as a synchronization access and as both an acquire and a release. In contrast, the release consistency model exploits the information conveyed by the labels at the leaves of the labeling tree (i.e., ordinaryL, nsyncL, acqL, and relL). Labeling the accesses to provide a properly labeled program may be done either by using a parallelizing compiler or by requiring the programmer to design synchronized programs (as shown in Section 8.3).

The conditions for satisfying each consistency model have been formulated in Sections 8.4 – 8.7 such that a process needs to keep track only of requests initiated by itself. Thus, the compiler and hardware can enforce ordering on a per-process(or) basis. The memory consistency model supported by an architecture directly affects the efficiency of the implementation (e.g., the amount of buffering and pipelining that can take place among the memory requests). Sequential consistency presents a simple programming paradigm but it reduces potential performance, especially in a machine with a large number of processors or long delays in the interconnection network. While the weak consistency and release consistency models allow potentially greater performance, they require that a proper labeling of memory accesses be provided (that is, extra information about the labeling of memory accesses is required from the programmer or the compiler) to exploit that potential.

The correctness of a multiprocessor operation is related to the expected model of behavior of the machine. A programmer who expects a system to behave in a sequentially consistent manner will perceive the system to behave incorrectly if the system allows its processes
to execute accesses out of program order. For machines that are not sequentially consistent to produce the same results as SC, the program must include synchronization operations that order competing accesses. Synchronization allows a program to give results that are independent of the execution rates of the processors.

Thus, the consistency model has a direct effect both on the complexity of the programming model presented to the programmer and on performance. The challenge is to find the balance between providing a reasonable programming model to the programmer and achieving high performance by allowing freedom in the ordering among memory requests.
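To show how a release-consistent critical section might look to the programmer, here is a minimal sketch in C, with the C11 acquire/release orderings standing in for the acqL and relL labels (the lock and the protected variable are hypothetical):

    #include <stdatomic.h>

    atomic_int lock = 0;            /* 0 = free, 1 = held */
    int shared_counter = 0;         /* ordinary shared data protected by the lock */

    void update(void) {
        /* Acquire (acqL): ordinary accesses below may not be reordered before it. */
        while (atomic_exchange_explicit(&lock, 1, memory_order_acquire) != 0)
            ;                       /* spin until the lock is obtained */

        shared_counter++;           /* ordinary LOAD and STORE inside the critical section */

        /* Release (relL): may not perform until the ordinary accesses above
         * have performed, which is exactly condition (2) of release consistency. */
        atomic_store_explicit(&lock, 0, memory_order_release);
    }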
9 CACHE COHERENCE PROTOCOLS

9.1 Types of Protocols

Caches in multiprocessors must operate coherently. The coherence problem is related to two types of events: two (or more) processors trying to update the value of a shared variable, or program migration between processors. Caches will operate consistently if, for each processor, its memory accesses are directed to the currently active location of any variable whose true physical location can change. Solutions of different complexity are possible, but in general the simpler the solution, the greater the performance penalty incurred. A simple architectural solution is to disallow private caches and have only shared caches associated with the main memory modules. A network interconnects the processors to the shared cache modules and every data access is made to the shared cache. Because with this solution the advantage of caches in reducing memory traffic is lost, it is not considered a high-performance method.

A cache-coherence protocol consists of the set of possible states in the local caches, the states in the shared memory, and the state transitions caused by the messages transported through the interconnection network to keep memory coherent. There are three classes of protocols for maintaining cache coherence:

• Snooping — Every cache that has a copy of the data from a block of physical memory also has information about it, but this information does not specify where other copies of that block are. Accesses to caches are broadcast on the interconnection network, so that all other caches can check the block address and determine whether or not they have a copy of the shared block. The caches are usually on a shared-memory bus, and all cache controllers monitor, or snoop on, the bus. Depending on what happens on a write, snooping protocols are of two types:

1. Write invalidate — the writing processor causes all copies in other caches to be invalidated (by broadcasting the address of the data) before changing its local copy. This scheme allows multiple caches to read a datum, but only one cache can write it; this type of sharing is called Multiple Readers Single Writer (MRSW).

2. Write update (also called write broadcast) — the writing processor broadcasts the new data over the bus so that all copies are updated with the new value. This type of sharing is called Multiple Readers Multiple Writers (MRMW).

• Directory based — Information about the state of every block in physical memory is kept partially in a directory entry and partially in every cache:

1. A directory entry is associated with every memory block and is composed of a state bit together with a vector of pointers. The state bit indicates whether the line is not cached by any cache (uncached), shared in an unmodified state in one or more caches, or modified in a single cache (dirty). The pointers give the locations of the caches that have a copy of the line.
2. An additional status bit, called the private bit, is appended to every cache line and, together with the valid bit, indicates the state of the cache block in that cache. A cache block in a processor's cache, just like a memory block, may also be in one of three states: invalid, shared, or dirty. The shared state implies that there may be other processors caching that location. The dirty state implies that this cache contains an exclusive copy of the memory block, and that the block has been modified in this cache and nowhere else.

• Compiler-directed — Compile-time analysis is used to obtain information on accesses to a given line by multiple processors. Such information can allow each processor to manage its own cache without interprocessor runtime communication.

Depending on which action is taken on a write to shared data — invalidate or update — cache-coherence protocols are categorized as write-invalidate or write-update protocols. The correctness of a coherence protocol is a function of the memory consistency model adopted by the architecture. The selection of a cache-coherence protocol is related to the type of interconnection network. For a shared-memory bus architecture, snooping can be easily implemented because buses support the basic mechanism for broadcast: the bus transaction automatically ensures that all receivers are listening to the bus when the transmitting processor gains access to it. Thus, any memory access made by one device connected to the bus can be “seen” by all other devices connected to the bus. Buses, although suited for broadcast, have the flaw that they cannot support the heavy broadcast traffic that is likely to appear as the number of processors increases (an example is given in the next Section). General scalable interconnection networks, such as Omega networks or k-ary n-cubes, provide neither an efficient broadcast capability nor a convenient snooping mechanism. To achieve high performance, the coherence commands should be sent only to those caches that have a copy of the block. Because directory protocols maintain, for each memory block, information about which caches have copies of the block, they are suited for systems with general interconnection networks. However, directory coherence protocols may be used in bus-based systems as well, usually for multiprocessors with a large number of processors (about 100), where snooping protocols cannot scale.

An architecture is scalable if it achieves linear or near-linear performance growth as the number of processors increases. Since snooping schemes distribute the information about which processors are caching which data items among the caches, they require that all caches see every memory request from every processor. This inherently limits the scalability of these machines because the individual processor caches and the common bus eventually saturate. With today's high-performance RISC processors this saturation can occur with just a few processors. Directory structures avoid the scalability problems by removing the need to broadcast every memory request to all processor caches. The directory maintains pointers to the processor caches holding a copy of each memory block, and since only the caches with copies can be affected by an access to the memory block, only those caches need to be notified of the access. Thus, the processor caches and interconnection network will not saturate due to coherence requests.
Furthermore, directory-based coherence does not depend on any specific interconnection network, such as the bus used by most snooping schemes.
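To give a concrete picture of the bookkeeping involved, here is a rough C sketch of a full-map directory entry and of the per-line cache state described above (field widths and names are assumptions; a real controller implements this in hardware):

    #include <stdbool.h>
    #include <stdint.h>

    #define NPROC 32                 /* assumed number of processors/caches (at most 32 here) */

    /* One directory entry per memory block (full-map scheme). */
    typedef struct {
        bool     dirty;              /* set: exactly one cache holds a modified copy  */
        uint32_t presence;           /* bit i set: cache i has a copy (i < NPROC)     */
    } dir_entry_t;

    /* Per-line state kept in each cache. */
    typedef struct {
        bool     valid;              /* line holds valid data                         */
        bool     private_bit;        /* set: this cache owns the only (dirty) copy    */
        uint32_t tag;                /* address tag                                   */
    } cache_line_t;

    /* Caches that must be notified when processor `writer` writes the block. */
    uint32_t caches_to_invalidate(const dir_entry_t *e, int writer) {
        return e->presence & ~(UINT32_C(1) << writer);
    }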
9.2 Rules Enforcing Cache Coherence

When a processor writes a shared datum in its cache, the coherence protocol must locate all the caches that share the datum. The consequence of a write to shared data is either to invalidate all other copies or to broadcast the write to the shared copies in order to update them. When a write-back strategy is used, the coherence protocol must also help read misses determine who has the most up-to-date value, because with this strategy the shared memory may not have the current copy of the data; the current value of a data item may be in any of the caches. The two basic conditions that must be met to maintain cache coherence are:

1. If a read operation for a shared datum misses in the cache, then a means must exist to identify whether another cache (or other caches) holds the valid copy of the datum.

2. All write operations to a shared datum for which the processor does not have exclusive access must force copies of that datum in all other caches to be invalidated or updated.

Observing these rules may introduce significant performance penalties, due to increased coherence cache accesses and network contention. For example, in a snooping protocol, all other caches in the system are checked using bus-broadcast interrogation both for read misses and for writes to shared data. The first rule requires a broadcast of the interrogation over the interconnection network to all caches, followed by a cache read in every cache in the system. That tends to increase network contention and reduce available cache bandwidth. Since this operation takes place only on misses to shared data, its frequency should be just a few percent of the reads on any single processor. As the number of processors increases, however, the load on the communication network and the cache traffic quickly approach saturation. For example, assuming each processor issues roughly one shared-data reference per clock cycle, a 1 percent miss ratio on shared data in each of the 100 processors of a multiprocessor generates, on average, 100 x 0.01 = 1 broadcast request and one cache read in every cache per clock cycle. This broadcast traffic will saturate the communication system and the individual caches of all processors.

The second rule can cause potentially greater degradation for a write-update snooping protocol, given that it incurs a communication overhead on every write to a shared datum. Directory protocols try to avoid this by keeping information about which line is shared in which cache and by avoiding communication with caches that do not share the line, but hot-spot accesses can still appear. If two or more processes attempt to access and modify the same shared variable several times over a brief period of time, and if the requests of the processors are interleaved in some order, then the cache-coherence protocol generally causes heavy traffic, due to the access pattern that progressively moves the datum from one cache to another as it is read and modified repeatedly. This behavior appears in multiprocessor systems for barrier and lock variables (Sections 7.4 and 7.5).

9.3 Cache Invalidation Patterns

Knowledge of the access pattern to shared variables enables keeping memory system latency and contention as low as possible. The two write policies that can be used for
coherence protocols — write invalidate and write update — exhibit performance that depends on the sharing pattern. Snooping protocols may use either write invalidate or write update, while directory-based protocols use write invalidate. Write-invalidate schemes maintain cache coherence by invalidating copies of a memory block when the block is modified by a processor. For snooping-based protocols, the invalidation is broadcast and all caches check whether they have a copy of the line that must be invalidated, while for directory-based protocols only the caches that actually share the line receive the invalidation message.

The sharing pattern is characterized by several parameters, of particular importance being the number of caches sharing a data object and the write-run. The write-run has been defined by Eggers and Katz [18] as the uninterrupted sequence of write requests, interspersed with reads, to a shared cache line by one processor. A write-run is terminated when another processor reads or writes the same cache line. The length of the write-run is the number of writes in that write-run. Every new write-run requires an invalidation and a data transfer. When write-runs are short, the write-invalidate scheme generates frequent invalidations and the write-update scheme generates equally frequent updates. Since the total time cost of an invalidation and a data transfer is higher than the cost of updating one word, write-invalidate schemes are inferior for this sharing pattern. On the other hand, for long write-runs the write-update scheme generates many updates that are redundant, given the length of the write-run. Therefore, write invalidate performs better for long write-runs, because only the first write in a write-run causes invalidation of the shared copies of the written line. Furthermore, a write-invalidate scheme in a directory-based protocol sends one invalidation request per write-run only to the caches that actually share the line.

A study conducted on a simulated 32-processor machine [9] shows that, for a large number of applications, most writes cause invalidations in only a few caches, with only about 2% of all shared writes causing invalidation of more than 3 caches. Write-invalidate protocols perform fairly well for a broad range of sharing patterns. However, there exist some sharing patterns for which unnecessary invalidations are generated. A notable example is the invalidation overhead associated with data structures that are accessed within critical sections. Typically, processors read and modify such data structures one at a time. Processors that access data this way cause a cache miss followed by an invalidation request being sent to the cache attached to the processor that most recently exited the critical section. This sharing behavior, denoted migratory sharing, has been shown to be the major source of single invalidations (i.e., invalidation of one cache) by Gupta and Weber in [14]. An extension of the write-invalidate protocol that effectively eliminates most single invalidations caused by migratory sharing has been proposed by Stenström et al. in [19]. This scheme improves performance by reducing the shared-access penalty and the network traffic.
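As a rough illustration of the write-run metric defined above, the following C sketch counts the write-runs to a single cache line in a reference trace (the trace representation is hypothetical):

    #include <stdio.h>

    typedef struct { int proc; char op; } access_t;    /* op: 'R' or 'W' */

    /* Report the write-runs to one cache line: a run is the sequence of writes
     * (possibly interspersed with reads) by one processor, terminated as soon
     * as a different processor reads or writes the line. */
    void write_runs(const access_t *trace, int n) {
        int owner = -1;                 /* processor of the current run        */
        int run_len = 0;                /* number of writes in the current run */
        for (int i = 0; i < n; i++) {
            if (trace[i].proc != owner) {           /* another processor touches the line */
                if (run_len > 0)
                    printf("write-run of length %d by P%d\n", run_len, owner);
                owner = trace[i].proc;
                run_len = 0;
            }
            if (trace[i].op == 'W')
                run_len++;
        }
        if (run_len > 0)
            printf("write-run of length %d by P%d\n", run_len, owner);
    }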
9.4 Snooping Protocols

9.4.1 Implementation Issues

A bus is a convenient device for ensuring cache coherence because it allows all processors in the system to observe ongoing memory transactions. In a snooping protocol each cache snoops on the transactions of the other caches. When a cache controller sees an invalidation
or update message broadcast over the bus, it takes the appropriate action on the local copy of the line. Snooping protocols allow all data to be cached, and coherence is maintained by hardware. Sharing information is added alongside the valid bit in a cache line; this information is used in monitoring bus activity. Snooping protocols have the advantage that, because the sharing information about a memory block is kept in the caches that have a copy of the block, the amount of memory required to keep this information is proportional to the number of blocks in the cache, as opposed to directory protocols, where the directory memory is proportional to the number of blocks in main memory.

A write-update protocol broadcasts writes to shared data, while write invalidate deletes all other copies so that there is only one local copy for subsequent writes. On a read miss on the bus, all caches check whether they have a copy of the requested line and take the appropriate action, such as supplying the data to the cache that missed. Similarly, on a write miss on the bus, all caches check whether they have a copy, and if they have a copy of the written data they invalidate it or change it to the new value (depending on whether write invalidate or write broadcast is used).

Write-update protocols usually allow cache lines to be tagged as shared or private. Only shared data need to be updated on a write. If this information about data sharing is available, a write-update protocol acts like a write-through cache for shared data (broadcasting to other caches) and like a write-back cache for private data (the modified data leaves the cache only on a miss).

Write-invalidate protocols maintain a state bit for each cache block that, in conjunction with the valid bit, defines the state of the block. A block can be in one of the following three states:

1. clean (also called Read Only) — the copy of the block in the cache is also in main memory; that is, the block has not been modified in the cache, or the modification has been written to main memory;

2. dirty (also called Read/Write) — the block has been modified in the cache, but not in main memory;

3. invalid — the block does not hold valid data.

Most cache-based multiprocessors use write-back caches in order to reduce the bus traffic and allow more processors on a single bus. The dirty bit used by the write-back policy is also used by the cache-coherence protocol to define the state of the cache block, as described above. There is no obvious choice of which snooping protocol (write invalidate or write broadcast) is superior, because the performance of both variants depends on the sharing pattern of the application, as shown in Section 9.3.

9.4.2 Snooping Protocol Example

Let us build the finite-state machine that implements a write-invalidate protocol based on the write-back policy. The finite-state transition diagram is depicted in Figure 10.
[Figure 10: A Write-Invalidate Snooping Cache-Coherence Protocol. Two state-transition diagrams over the states Invalid (not a valid cache block), Read only (clean), and Read/Write (dirty): one showing cache state transitions on signals from the CPU (CPU read miss, CPU write miss, CPU write hit with invalidate sent, write-back of a dirty block), the other showing transitions on signals from the bus (read miss or write miss on the bus for this block, invalidate or write miss on the bus for this block).]
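A rough software rendering of the CPU-side transitions of Figure 10 is given below (the bus-side transitions and the real bus signalling are omitted; the names are hypothetical, and in an actual machine this logic lives in the cache controller hardware):

    #include <stdio.h>

    typedef enum { INVALID, READ_ONLY, READ_WRITE } line_state_t;  /* clean = READ_ONLY, dirty = READ_WRITE */
    typedef enum { CPU_READ_MISS, CPU_WRITE_HIT, CPU_WRITE_MISS } cpu_event_t;

    /* Stubs standing in for the bus actions raised by the transitions. */
    static void bus_send_invalidate(void)        { puts("invalidate on bus"); }
    static void bus_write_back_dirty_block(void) { puts("write back dirty block"); }
    static void bus_issue_read_miss(void)        { puts("read miss on bus"); }
    static void bus_issue_write_miss(void)       { puts("write miss on bus"); }

    /* CPU-side transition function for one cache line; read hits change nothing. */
    line_state_t cpu_transition(line_state_t s, cpu_event_t e) {
        switch (e) {
        case CPU_READ_MISS:
            if (s == READ_WRITE) bus_write_back_dirty_block();
            bus_issue_read_miss();
            return READ_ONLY;                            /* block is fetched in the clean state */
        case CPU_WRITE_HIT:
            if (s == READ_ONLY) bus_send_invalidate();   /* invalidate other shared copies */
            return READ_WRITE;                           /* the line becomes dirty */
        case CPU_WRITE_MISS:
            if (s == READ_WRITE) bus_write_back_dirty_block();
            bus_issue_write_miss();                      /* seen as an invalidate by the other caches */
            return READ_WRITE;
        }
        return s;
    }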
There is only one state machine in a cache, with stimuli coming either from the attached CPU or from the bus, but the figure shows the three states of the protocol in duplicate in order to distinguish the transitions based on CPU actions from the transitions based on bus operations. Transitions happen on read misses, write misses, or write hits; read hits do not change the cache state.

When the CPU has a read miss, it will change the state of that block to Read only and write back the old block if it was in the Read/Write state (dirty). All the caches snoop on the read miss to see whether the block is in their cache. If one cache has a copy and it is in the Read/Write state, then the block is written to memory and is then changed to the Invalid state (as shown in this protocol) or to Read only. When a CPU writes into a block, that block goes to the Read/Write state. If the write was a hit, an invalidate signal goes out over the bus. Because caches monitor the bus, all of them check whether they have a copy of that block; if they do, they invalidate it. If the write was a miss, all caches with copies go to the Invalid state. For simplicity, a write to clean data may be treated as a "write miss", so that there is no separate signal for invalidation; the same bus signal as for a write miss is used.

9.4.3 Improving Performance of Snooping Protocol

Reducing Interference Between Broadcasts and CPU Operation
Since every bus transaction requires checking the cache address tags, snooping would interfere with the CPU's accesses to the cache if a single copy of the address tags were accessed by both the CPU and the snooping logic. To remove this problem, the address-tag portion of the cache is duplicated so that an extra read port is available for snooping; the two identical copies of the address tags are called the snoop tags and the normal tags, respectively. In this way, snooping interferes with the CPU's accesses to the cache only when the tags must be changed, that is, when the CPU has a miss or when a coherence operation occurs. On a miss, the CPU arbitrates with the bus to change the snoop tags as well as the normal tags (to keep the address tags coherent). When a coherence operation occurs in the cache, the CPU will likely stall, since the cache is unavailable.

Reducing Invalidation Interference
Some designs ([20]) queue the invalidation requests. A list of the addresses to be invalidated is maintained in a small hardware-implemented queue called the Buffer Invalidation Address Stack (BIAS). The BIAS has high priority for cache cycles, and if the target line is found in the cache, it is invalidated. To reduce the interference between the invalidation accesses to the cache and the normal CPU accesses, a BIAS filter memory (BFM) ([20]) may be used. A BFM is associated with each cache and works by filtering out repeated requests to invalidate the same block in a cache.

Snooping protocols are fairly simple and inexpensive. For multiprocessors with a small number of processors they perform well. The disadvantage is that snooping protocols are not scalable: buses do not have the bandwidth to support a large number of processors. The coherence traffic quickly increases with the number of processors, because snooping protocols require that all caches see every memory request from every processor. The
shared bus and the need to broadcast every memory request to all processor caches inherently limit the scalability of snooping-protocol-based machines, because the common bus and the individual processor caches eventually saturate.

9.5 Directory-based Cache Coherence

9.5.1 Classification of Directory Schemes

A directory is a list of the locations of the cached copies of each line of shared data. A directory entry is associated with each memory block and contains a number of pointers that specify the locations of the copies of the block and a state bit that specifies whether or not a unique cache has permission to write that line. Depending on the amount of information stored in a directory entry, directory protocols fall into two categories:

• full-map directories — the directory stores, for each block in global memory, information about all caches in the system, so that every cache can simultaneously have a copy of any block of data. In this case, the pointers in a directory entry are simply presence bits associated with each cache. This type of protocol has the advantage of allowing full sharing of any memory block, but it is not scalable with respect to memory overhead. Indeed, assume that the amount of shared memory increases linearly with the number of processors N; then, because the size of a directory entry is proportional to the number of processors, and the number of entries is equal to the number of blocks, which is proportional to the memory size, the size of the directory is Θ(N) ∗ Θ(N) = Θ(N²).

• limited directories — each directory entry has a fixed number of pointers, regardless of the number of caches in the system; they have the disadvantage of restricting the number of simultaneously cached copies of a memory block, but the advantage of limiting the growth of the directory to a constant factor of the number of processors.

When a single directory is accessed by all the caches in the system, the directory structure is called a centralized directory. The architecture of a system using a full-map centralized-directory coherence protocol is shown in Figure 11. The main memory (and the directory) can be made up of several memory modules.

9.5.2 Full-Map Centralized-Directory Protocol

The classical centralized-directory protocol is a full-map protocol and was first proposed by Censier and Feautrier [1]. For each block of shared memory there is a directory entry that contains:

• one presence bit per processor cache. The set of presence bits is a bit vector called the presence vector;
Figure 11: Full-Map Centralized-Directory architecture (processors with private caches connected by an interconnection network to the main memory; each memory block has an associated directory entry holding a state bit and the presence bits).
• one state bit (also called the dirty bit) that indicates whether the block is uncached (not cached by any cache; all presence bits are zero), shared by multiple caches, or held exclusively by one cache. In the latter case the block is called dirty; otherwise the block is said to be clean. When the state bit indicates a dirty block, exactly one presence bit is set, identifying the cache that holds the current copy of the data, that is, the owner of the block.

Every cache in the system maintains two bits of state per block. One bit indicates whether a block is valid; the other bit, called the private bit, indicates whether a valid block may be written. If the private bit is set in a cache, then that cache has the only valid copy of that line, that is, the line is dirty; the corresponding directory entry has the dirty bit and the presence bit for that cache set. The cache that has the private bit set for a line is said to own the line. The cache-coherence protocol must keep the state bits in the directory (i.e., the presence and dirty bits) and those in the caches (i.e., the valid and private bits) consistent. Using the state and presence bits, the memory can tell which caches need to be invalidated when a location is written. Likewise, the directory indicates whether memory's copy of the block is up to date or which cache holds the most recent copy.

The full-map centralized-directory protocol presented below can be applied to systems with general interconnection networks and guarantees sequential consistency. A write-back strategy is assumed.

In the initial state of the directory entry associated with a line X, none of the caches in the system has a copy of that line: the valid and private bits for line X are reset to zero in all caches, and the directory entry for line X has all presence bits and the dirty bit reset to zero — that is, the line is clean and not cached. When a cache, denoted C1, has a read miss on line X, it requests the line from main memory. The main memory sends the line to the cache and sets the presence bit for C1 in the directory entry to indicate that C1 has a copy of line X. Cache C1 fetches the line and sets the corresponding valid bit. Similarly, when another cache, C2, requests a copy of line X, the presence bit for C2 is set in the directory entry, and C2 fetches the line and sets its valid bit.

Let us examine what happens when processor P2 issues a WRITE to a word belonging to line X:

1. Cache C2 detects that the word belongs to line X, which is valid, but it does not have permission to write the block because the private bit in the cache is not set;

2. Cache C2 issues a write request to the main memory and stalls processor P2;

3. The main memory issues an invalidate request to cache C1, which contains a copy of line X;

4. Cache C1 receives the invalidate request, resets the valid bit for line X to indicate that the cached information is no longer valid, and sends an acknowledgement back to the main memory;
5. The main memory receives the acknowledgement, sets the dirty bit, clears the presence bit for cache C1, and sends write permission to cache C2;

6. Cache C2 receives the write-permission message, updates line X, sets the private bit, and reactivates processor P2.

If processor P2 issues another write to a word in line X and cache C2 still owns line X, then the write takes place immediately in the cache. If processor P1 attempts to read a word in line X after P2 has obtained ownership of line X, then the following events occur:

1. Cache C1 detects that line X, which contains the word, is in the invalid state;

2. Cache C1 issues a read request to the main memory and stalls processor P1;

3. The main memory checks the dirty and presence bits in the directory entry for line X and finds that line X is dirty and that cache C2 has the only valid copy of line X;

4. The main memory issues a read request for line X to cache C2;

5. Cache C2 receives the read request for line X from main memory, clears the private bit, and sends the line to the main memory;

6. The main memory receives line X from cache C2, clears the dirty bit, sends line X to cache C1, and sets the presence bit for C1;

7. Cache C1 fetches line X, sets the valid bit, and reactivates processor P1.

The disadvantage of this directory scheme is that it is not scalable with respect to directory overhead: the directory size is Θ(N^2), where N is the number of processors, that is, the memory overhead grows as the square of the number of processors.

9.5.3 Limited-Directory Protocol

The limited-directory protocol, as proposed by Agarwal et al. [10], is designed to solve the directory-size problem by allowing only a constant number of caches to share any block, so that the directory-entry size does not change as the number of processors in the system increases. A directory entry in a limited-directory protocol contains a fixed number of pointers — denoted by i — which indicate the caches holding a copy of the line, and a dirty bit (with the same meaning as for the full-map directory). The limited-directory protocol behaves like the full-map protocol, except when more than i caches request read copies of a particular line of data. An i-pointer directory may be viewed as an i-way set-associative cache of pointers to shared copies. When the (i+1)-th cache requests a copy of line X, the main memory must invalidate the copy in one of the i caches currently sharing line X and replace its pointer with a pointer to the cache that will now share the line. This process of pointer replacement is called eviction.
Since the directory acts as a set-associative cache, it must have a pointer-replacement strategy. Pseudorandom eviction requires no extra memory overhead and is a good choice of replacement policy.

A pointer in the directory entry encodes a binary processor (and cache) identifier, so for a system with N processors a pointer requires log2(N) bits of memory. Therefore a directory entry requires i · log2(N) bits, and under the assumption that the amount of memory (and hence the number of memory lines) increases linearly with the number of processors, the memory overhead of the limited-directory protocol is

Θ(i · log2(N)) · Θ(N) = Θ(N · log2(N))

Because the memory overhead grows approximately linearly with the number of processors, this protocol is scalable with respect to memory overhead. The protocol works well for data that are not massively shared; for highly shared data, such as a barrier synchronization variable, pointer thrashing occurs because many processors spin-lock on the barrier variable.

Both the limited-directory and the full-map directory protocols have the drawback of using a centralized directory, which may become a bottleneck as the number of processors increases. If the memory and the directory are partitioned into independent units and connected to the processors by a scalable interconnect, the memory system can provide scalable memory bandwidth — this is the idea of distributed directory and memory, presented in the next subsection.

9.5.4 Distributed Directory and Memory

The idea is to achieve scalability by partitioning and distributing both the directory and the main memory, using a scalable interconnection network and a coherence protocol that can suitably exploit distributed directories. The architecture of such a system is shown in Figure 12, which depicts the architecture of the Stanford DASH multiprocessor [9]. The name DASH is an abbreviation of Directory Architecture for Shared Memory. The architecture provides both the ease of programming of single-address-space machines (with caching to reduce memory latency) and the scalability that was previously achievable only with message-passing machines, not with cache-coherent shared-address machines.

The DASH architecture consists of a set of clusters (also called processing nodes) connected by a general interconnection network. Each cluster consists of a small number (e.g., eight) of high-performance processors and a portion of the shared memory, interconnected by a bus. Multiprocessing within the cluster may be viewed either as increasing the power of each processing node or as reducing the cost of the directory and network interface by amortizing it over a larger number of processors. The DASH architecture removes the scalability limitation of centralized-directory architectures by partitioning and distributing the directory and main memory, and by using a new coherence protocol that can suitably exploit distributed directories. Distributing memory with the processors is essential because it allows the system to exploit locality. All private data and code references, along with some of the shared references, can be made local to the cluster. These references avoid the longer latency of remote references and reduce the bandwidth demands on the global interconnection.
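As a rough illustration of the overhead formulas above, the following sketch (Python; the memory sizes, line size, and pointer count are illustrative assumptions, not figures from the report) compares the directory bits required by a full-map and a limited directory as the number of processors grows.

    from math import ceil, log2

    def directory_bits(n_procs, mem_bits_per_proc, line_bits, pointers=None):
        """Total directory size in bits.

        Full map (pointers=None): one presence bit per processor plus a dirty bit
        per memory block.  Limited directory: `pointers` pointers of ceil(log2 N)
        bits plus a dirty bit per block.  Memory is assumed to grow linearly with
        the number of processors, as in the text.
        """
        blocks = n_procs * mem_bits_per_proc // line_bits
        if pointers is None:                       # full map: Theta(N^2)
            entry = n_procs + 1
        else:                                      # limited: Theta(N log N)
            entry = pointers * ceil(log2(n_procs)) + 1
        return blocks * entry

    # Illustrative numbers: 16 Mbit of shared memory per processor, 32-byte lines.
    for n in (16, 64, 256, 1024):
        full = directory_bits(n, 16 * 2**20, 32 * 8)
        lim = directory_bits(n, 16 * 2**20, 32 * 8, pointers=4)
        print(n, "processors: full-map", full, "bits, 4-pointer limited", lim, "bits")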
Figure 12: Distributed-Directory architecture (clusters of processors with private caches on a snooping bus, each cluster holding a portion of the main memory and a directory, connected by an interconnection network).
The DASH architecture is scalable in that it achieves linear or near-linear performance growth as the number of processors increases from a few to a few thousand. The memory bandwidth scales linearly with the number of processors because the physical memory is distributed and the interconnection network is scalable. Distributing the physical memory among the clusters provides scalable memory bandwidth to data objects residing in local memory, while the scalable interconnection network provides scalable bandwidth to remote data. The scalability of the network is not compromised by the cache-coherence traffic, because the use of distributed directories removes the need for broadcasts and the coherence traffic consists only of point-to-point messages between the processing nodes that are caching a given location. Since these nodes must originally have fetched the data, the coherence traffic stays within a small constant factor of the original data traffic.

Scalability may still be disrupted by a nonuniform distribution of accesses across the machine. This happens when accesses concentrate on data from the memory of a single cluster over a short period of time — such access patterns cause hot spots (Section 7.4) in memory. Hot spots can significantly reduce the memory and network throughput because the distribution of resources provided by the architecture is not exploited as it is under uniform access patterns. Many data hot spots can be avoided by caching shared writable data, which DASH allows. Other hot spots are removed by software techniques; for example, the hot spot generated by accesses to a barrier variable may be removed by using a software combining tree (Section 7.4.2).

The issue of memory-access latency becomes more prominent as an architecture is scaled to a larger number of nodes. There are two complementary approaches to reducing latency:

1. caching shared data — this significantly reduces the average latency of remote accesses because of the spatial and temporal locality of memory accesses. Hardware-coherent caches provide this latency-reduction mechanism. For references not satisfied by the cache, the protocol attempts to minimize latency by using a memory hierarchy, as shown further on;

2. latency-hiding mechanisms — these mechanisms are intended to manage the inherent latencies of a large machine associated with interprocess communication; the techniques used range from support for a relaxed memory-consistency model — release consistency (Section 8.7) — to support for nonblocking prefetch operations.

Regarding the scalability of the DASH machine with respect to the amount of directory memory required, assume that the physical memory in the machine grows proportionally with the number of processing nodes:

M = N × Mc Mbit

where N is the number of clusters, Mc is the number of megabits of memory per cluster, and M is the total physical memory (in megabits) of the machine. Using a full-map directory (that is, a presence-bit vector to keep track of all clusters caching a memory block) requires a total amount of directory memory, denoted D, of

D = N × M/L = N^2 × Mc/L Mbit

where L is the cache line size in bits. Thus the directory overhead grows as N^2/L relative to the cluster memory size, or as N/L relative to the total memory. For small and medium N, this growth is tolerable.
For example, consider a machine in which a cluster contains 8 processors and the cache line size is 32 bytes. For N = 32 clusters, that is, 256 processors, the overhead for directory memory is only 12.5 percent of the physical memory, which is comparable to the overhead of supporting an error-correcting code on memory. For larger machines, where the overhead of a full-map directory would become intolerable, the following approach can be used to achieve scalability: the full-map directory is replaced with a limited directory that has a small number of pointers (Subsection 9.5.3) and, in the unusual case when the number of pointers is smaller than the number of clusters caching a line, invalidations are broadcast on the interconnection network. The reason a limited directory can be used lies in the data-sharing and write-invalidation patterns exhibited by most applications. For example, it is shown in [9] that most writes cause invalidations in only a few caches, with only about 2% of shared writes causing invalidation of more than 3 caches.

A directory memory is associated with the memory present within each cluster. Each directory memory is contained in a Directory Controller (DC). The DC is responsible for maintaining cache coherence across the clusters and serves as the interface to the interconnection network. The clusters and their associated portions of main memory are categorized into three types, according to the role played in a given transaction:

1. the local cluster — the cluster that contains the processor originating a given request; local memory refers to the main memory associated with the local cluster;

2. the home cluster — the cluster that contains the main memory and directory for a given physical memory address;

3. a remote cluster — any other cluster; remote memory is any memory whose home is not the local cluster.

Thus the DASH memory system can be logically broken into four levels of hierarchy, as shown in Figure 13.

States in Directory and in Caches

The directory memory is organized as an array of directory entries, with one entry for each memory block of the corresponding memory module. A directory entry contains the following pieces of information:

1. a state bit that indicates whether the clusters have a read (shared) or read/write (dirty) copy of the data;

2. a presence bit vector, which contains a bit for each cluster in the system.

If the state bit indicates a read copy and none of the presence bits is set, the block is said to be uncached. A memory block can therefore be in one of three states, as indicated by its directory entry:

1. uncached-remote, that is, not cached by any remote cluster;
2. shared-remote, that is, cached in an unmodified state by one or more remote clusters;

3. dirty-remote, that is, cached in a modified state by a single remote cluster.

As with memory blocks, a cache block in a processor's cache may also be in one of three states: invalid, shared, or dirty. The shared state implies that there may be other processors caching that location. The dirty state implies that this cache contains an exclusive copy of the memory block and that the block has been modified.

The DASH coherence protocol is an invalidation-based ownership protocol that uses the information about the state of a memory block given by the directory entry associated with the block. The protocol maintains the notion of an owning cluster for each memory block. The owning cluster is nominally the home cluster; however, when the memory block is present in the dirty state in a remote cluster, that cluster is the owner. Only the owning cluster can complete a remote reference for a given block and update the directory state. While the directory entry is always maintained in the home cluster, a dirty cluster initiates all changes to the directory state of a block while it is the owner (such update messages also indicate that the dirty cluster is giving up ownership). The order in which operations reach the owning cluster determines their global order.

The directory does not maintain information about whether the home cluster itself is caching a memory block, because all transactions that change the state of a memory block are issued on the bus of the home cluster, and the snoopy bus protocol keeps the home cluster coherent. Issuing all transactions on the home cluster's bus does not significantly degrade performance, since most requests to the home cluster also require an access to main memory to retrieve the actual data.

To illustrate the directory protocol, we consider in turn how read requests and write requests issued by a processor traverse the memory hierarchy.

Read request servicing

• Processor level — If the requested location is present in the processor's cache, the cache simply supplies the data. Otherwise, the request goes to the local cluster.

• Local cluster level — If the data resides in one of the other caches within the local cluster, the data is supplied by that cache and no state change is required at the directory level. If the request must be sent beyond the local cluster level, it goes first to the home cluster corresponding to that address.

• Home cluster level — The home cluster examines the directory state of the memory location while simultaneously fetching the block from main memory. If the block is clean, the data is sent to the requester and the directory is updated to show sharing by the requester. If the location is dirty, the request is forwarded to the remote cluster indicated by the directory.

• Remote cluster level — The dirty cluster replies with a shared copy of the data, which is sent directly to the requester. In addition, a sharing write-back message is sent to the home level to update main memory and change the directory state to indicate that the requesting and remote clusters now have shared copies of the data. Having the dirty cluster respond directly to the requester, rather than routing the reply through the home cluster, reduces the latency seen by the requesting processor.
Figure 13: Memory Hierarchy of DASH — processor level (the processor's cache), local cluster level (the other processor caches within the local cluster), home cluster level (the directory and main memory associated with a given address), and remote cluster level (processor caches in remote clusters).
Write request servicing

• Processor level — If the location is dirty in the writing processor's cache, the write can complete immediately. Otherwise, a read-exclusive request is issued on the local cluster's bus to obtain exclusive ownership of the line and retrieve the remaining portion of the cache line.

• Local cluster level — If one of the caches within the cluster already owns the cache line, the read-exclusive request is serviced at the local level by a cache-to-cache transfer. This allows processors within a cluster to alternately modify the same memory block without any intercluster interaction. If no local cache owns the block, a read-exclusive request is sent to the home cluster.

• Home cluster level — The home cluster can immediately satisfy an ownership request for a location that is in the uncached or shared state. In addition, if a block is in the shared state, all cached copies must be invalidated. The directory indicates the clusters that have the block cached; invalidation requests are sent to these clusters while the home concurrently sends an exclusive data reply to the requesting cluster. If the directory indicates that the block is dirty, the read-exclusive request must be forwarded to the dirty cluster, as in the case of a read.

• Remote cluster level — If the directory had indicated that the memory block was shared, the remote clusters receive an invalidation request to eliminate their shared copies. Upon receiving the invalidation, the remote clusters send an acknowledgement to the requesting cluster. If the directory had indicated a dirty state, the dirty cluster receives a read-exclusive request. As in the case of a read, the remote cluster responds directly to the requesting cluster and sends a dirty-transfer message to the home cluster indicating that the requesting cluster now holds the block exclusively.

When the writing cluster has received all invalidation acknowledgements, or the reply from the home or dirty cluster, it is guaranteed that all copies of the old data have been purged from the system. If the processor delays completing the write until all acknowledgements are received, then the new value becomes available to all other processors at the same time. However, invalidations involve round-trip messages to multiple clusters, resulting in potentially long delays. Higher processor utilization can be obtained by allowing the write to proceed immediately after the ownership reply is received from the home; this leads to the memory model of release consistency.
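The following sketch (Python, with hypothetical state and message names) summarizes the home cluster's decision for an incoming read request, as described above. It is a schematic model of the protocol logic, not the actual DASH implementation.

    # Schematic model of the home cluster's handling of a remote read request.
    # States and message names are illustrative, not taken from the DASH hardware.
    UNCACHED, SHARED, DIRTY = "uncached-remote", "shared-remote", "dirty-remote"

    def home_cluster_read(state, presence, requester, owner=None):
        """Return (new_state, new_presence, messages) for a read request.

        `presence` is the set of clusters sharing the block; `owner` is the dirty
        cluster when state == DIRTY.
        """
        if state in (UNCACHED, SHARED):
            # Memory is up to date: reply with data and record the new sharer.
            return SHARED, presence | {requester}, [("data_reply", requester)]
        # Dirty block: forward the request; the owner replies directly to the
        # requester and sends a sharing write-back to the home cluster.
        return state, presence, [("forward_read", owner)]

    # Example: cluster 3 reads a block currently dirty in cluster 7.
    print(home_cluster_read(DIRTY, {7}, requester=3, owner=7))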
9.6 Compiler-directed Cache Coherence Protocols

Software-based coherence protocols require compiler assistance. Compile-time analysis is used to obtain information on accesses to a given line by multiple processors. Such information can allow each processor to manage its own cache without interprocessor runtime communication. Compiler-directed management of caches implies that a processor has to issue explicit instructions to invalidate cache lines.

To eliminate the stale-data problem caused by process migration, the following method can be used:

Cache-flushing — To eliminate the stale-data problem for cacheable, nonshared data, the processor can flush its cache each time a process leaves the processor. This guarantees that main memory becomes the current active location of each variable formerly held in the cache. The cache-flush approach can also be used for the I/O cache-coherence problem (Section 5.13.1). While this solution prevents stale data from being used, the cache invalidations caused by flushes may increase the miss rate.

Regarding the coherence problem caused by accesses to shared data, there are two simple coherence schemes:

Not caching shared data — Each shared datum can be made noncacheable to eliminate the difficulty of finding its current location among the caches and main memory. Data can be made noncacheable by several methods, for example, by providing a special range of addresses for noncacheable data, or by using special LOAD and STORE instructions that do not access the cache at all.

Not caching shared writable data — This is an improvement of the previous method. Since for performance reasons it is desirable to attach a private cache to each CPU, one can prevent data inconsistency by not caching shared writable data, that is, by making such data noncacheable. Examples of shared writable data are locks, shared data structures such as process queues, and any other data protected by critical sections. When shared writable data are not cached, no coherence problem can occur. This solution is implemented using programmer directives that instruct the compiler to allocate shared writable data in noncacheable regions of memory. The drawback of this scheme is that large data structures cannot be cached, although most of the time it would be safe to do so.

While these solutions have the advantage of simple implementation, they have a negative effect on performance because they reduce the effective use of the cache. As pointed out in many works (for example, in [8]), shared-data accesses account for a large portion of global memory accesses. Therefore, allowing shared writable data to be cached when it is safe to do so is crucial for performance.

Efficient Compiler-directed Cache Coherence

We shall present an efficient compiler-directed scheme following the ideas of Cheong and Veidenbaum [8]. Basically, caching is allowed when it is safe, and the cache is flushed when coherence problems could occur. The operating environment of the coherence algorithm is the following:
• Parallel task-execution model

The execution of a parallel program is represented by tasks, each executed by a single processor. Task migration is not allowed. Tasks independent of each other can be scheduled for parallel execution; dependent tasks are executed in the order defined by the program semantics, which is enforced through synchronization. The execution order is described by the dependence relationship among tasks, which can be modeled by a directed graph G = (T, E), where T is a set of nodes and E is a set of edges. A node Ti ∈ T represents a task, and a directed edge eij ∈ E indicates that some statements in Tj depend on statements in Ti; Ti is called a parent node and Tj a child node. Task nodes are combined into a single node using the following criterion: two nodes Ti and Tj connected by an edge eij can be combined into one node if Ti is the only parent of Tj and Tj is the only child of Ti. The task graph can be divided into levels L = {L0, . . ., Ln}, where each Li is the set of tasks such that the longest directed path from the starting node T0 to each task in the set has i edges, and tasks on the same level are not connected by any directed edge. Therefore, tasks on the same level do not perform write-write or read-write accesses to the same data from different processors, and such tasks can be executed in parallel without interprocess synchronization.

• Program Model

Parallelism in a program is assumed to be expressed in terms of parallel loops. A parallel loop specifies that iterations of the loop are started by multiple processors. In a Doall type of parallel loop, all iterations are independent and can be executed in any order. In a Doacross type of loop, there are dependences between iterations. In terms of tasks, one or more iterations of a Doall loop are bundled into a task, while in a Doacross loop one iteration is a task and synchronization exists between tasks.

• System Model

A weakly ordered system model is assumed; while it does not guarantee sequential consistency, the programming model is quite simple and allows higher performance than strongly ordered systems. In terms of the task-execution model, this implies that the values written in a task level must be deposited in shared memory before the task boundary can be crossed. Parallel execution without intertask synchronization is assumed. The memory references of a program consist of instruction fetches, private-data accesses, and shared-data accesses. Private data may only become a problem if task migration is considered. It is assumed that instructions, private data, and shared read-only data accesses can be recognized at runtime and will not be affected by the cache-coherence mechanism. The value in shared memory is assumed to be always current. Incoherence is defined as the condition in which a processor performs a memory fetch of a value X and a cache hit occurs, but the cache holds a value different from that in main memory; otherwise, the fact that the memory and the cache hold different values is not an error.

The following instructions are assumed to be available for cache management:

• Invalidate. This instruction invalidates the entire contents of a cache. Using resettable static random-access memories for the valid bits, this can be accomplished in one or two cycles at low hardware cost.
• Cache-on. This instruction causes all global memory references to be routed through the cache.

• Cache-off. This instruction causes all global memory references to bypass the cache and go directly to memory.

The cache state, on or off, must be part of the processor state and must be saved and restored on a context switch. Processes are created in the cache-off state.

• Cache Management Algorithm

The necessary conditions for cache incoherence to occur on a fetch of X are that: (1) a value of X is present in the cache of processor Pj, and (2) a new value has been stored in shared memory by another processor after the access by Pj that brought X into the cache. These conditions can be formulated in terms of data dependences, and a compiler could then check for a dependence structure that might result in coherence violations. However, this would be complex: first, the test would have to be performed for every read reference, and second, data-dependence information does not specify whether the references involved are executed by different processors. Therefore, the compiler performs data-dependence analysis only to determine the loop type, and processor assignment is part of the loop-execution model. By definition, any dependence between two statements inside a Doall loop is not across iterations; it follows that a statement Si in a Doall loop that depends on a statement Sj in the same loop is executed on the same processor as Sj. On the other hand, cross-iteration dependences are present in a Doacross loop: two statements with a cross-iteration dependence are executed on different processors, whereas statements with a dependence within the same iteration are executed on the same processor.

The algorithm uses loop types in its analysis as follows (a short sketch of these rules is given at the end of this subsection):

(1) A Doall loop has no dependences between statements executed on different processors. Therefore, any shared-memory access in such a loop can be cached; caching is turned on.

(2) A serial loop is executed by a single processor, and its shared-memory accesses can be cached; caching is turned on.

(3) Doacross or recurrence loops do have cross-iteration dependences, so the conditions for incoherence can hold; caching is turned off.

(4) An Invalidate instruction is executed by each processor entering a Doall or Doacross loop. The processor continuing execution after a Doall also executes an Invalidate instruction. In terms of the task graph, these points correspond to task-level boundaries.

Consider the program example in Figure 14. At the beginning, every processor executes the Cache-on instruction. Cache-management instructions inserted in parallel loops are executed once by every participating processor, not on every iteration of the loop. The correctness of the algorithm is proven by showing that the conditions necessary for an incoherence to occur are never satisfied in programs processed by the algorithm. The algorithm preserves temporal locality within each task level, eliminates runtime communication for coherence maintenance, and keeps the time cost of an Invalidate independent of the number of invalidated lines. However, it does not allow caching for Doacross loops and requires each processor to execute an Invalidate instruction when entering a Doall or a Doacross, thereby increasing the miss rate.
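A minimal sketch of rules (1)–(4), in Python; the loop-kind names and the emitted pseudo-instructions are illustrative, and the real transformation is of course performed by the compiler on its own intermediate representation.

    # Sketch of the compiler rules (1)-(4): decide, per loop, which cache-management
    # instructions to insert.  Loop kinds and instruction names are illustrative.
    def cache_directives(loop_kind):
        """Return (instructions at loop entry per processor, caching state inside)."""
        if loop_kind in ("doall", "serial"):
            # Rules (1)-(2): no cross-processor dependences -> caching allowed.
            entry = ["Invalidate"] if loop_kind == "doall" else []
            return entry + ["Cache-on"], "on"
        if loop_kind in ("doacross", "recurrence"):
            # Rule (3): cross-iteration dependences -> bypass the cache.
            # Rule (4): invalidate at the task-level boundary.
            return ["Invalidate", "Cache-off"], "off"
        raise ValueError("unknown loop kind: " + loop_kind)

    for kind in ("doall", "serial", "doacross"):
        print(kind, "->", cache_directives(kind))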
Figure 14: Program example with compiler-inserted cache-management instructions:

    Cache-on
    Doall i = 1, n
      Invalidate
      Y(i) = ...
      ... = W(i) ... Y(i)
      ... = X(i) ...
    enddo
    Invalidate
    ...
    Doall j = 1, n
      Invalidate
      ... = W(j) ... Y(j)
      X(j) = ...
      ...
    enddo
    Invalidate
    ...
    Doall k = 1, n
      Invalidate
      ... = W(k)
      ... = X(k) ...
      ... = Y(k) ...
    enddo
    Invalidate
    Doserial i = 1, n
      ... = X(i)
      ... = X(f(i))
      ...
    enddo
9.7 Line Size Effect on Coherence Protocol Performance

Line size plays an important role in cache coherence. For example, consider a single word that is alternately written and read by two processors. Whether a snooping protocol or a directory protocol is used, a line size of one word has an advantage over a larger line size because it causes invalidations only for the data that actually changes. Smaller line sizes can therefore decrease the coherence overhead.

Another problem with large line sizes is the effect called false sharing. It appears when two different shared variables are located in the same cache block: the block is exchanged between the processors even though the processors are in fact accessing different variables. Compiler technology is important for allocating data with high processor locality to the same blocks, thereby reducing the cache miss rate and avoiding false sharing. Success in this field could increase the desirability of larger blocks for multiprocessors. Measurements to date indicate that shared data has lower spatial and temporal locality than other types of data, independent of the coherence protocol.
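As a small illustration of false sharing, the sketch below (Python; the addresses and line size are made-up values) checks whether two variables fall into the same cache block, which is the condition under which writes to one variable generate coherence traffic for the other.

    def same_cache_block(addr_a, addr_b, line_size):
        """True if two byte addresses map to the same cache block."""
        return addr_a // line_size == addr_b // line_size

    # Two 4-byte shared counters placed 4 bytes apart (illustrative addresses):
    # with 32-byte lines they share a block, so writes by different processors
    # to different counters still invalidate each other's copies (false sharing).
    print(same_cache_block(0x1000, 0x1004, 32))   # True  -> false sharing possible
    # Padding each counter to its own 32-byte block removes the false sharing.
    print(same_cache_block(0x1000, 0x1020, 32))   # False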
10 MEMORY SYSTEM DESIGN AS A SYNERGY

10.1 Computer design requirements

Computer architects must design a computer to meet functional as well as price and performance goals. Often they also have to determine what the functional requirements are, and this can be a major task in itself. The requirements may be specific features inspired by the market; for example, the presence of a large market for a particular class of applications might encourage the designers to incorporate requirements that would make the machine competitive in that market. A classification of the functional requirements that need to be considered when a machine is designed ([6]) follows.

Application area — this aspect refers to the target of the computer:

• Special purpose: the typical feature is high performance for specific applications;

• General purpose: the typical feature is balanced performance across a range of tasks;

• Scientific: the typical feature is high performance for floating point;

• Commercial: typical features are support for databases, transaction processing, and decimal arithmetic.

Level of software compatibility — this aspect determines the amount of existing software for the machine:

• At the programming-language level: this type of compatibility is the most flexible for the designer but requires new compilers;

• Object-code or binary compatible: the architecture is completely defined — there is little flexibility — but no investment is needed in software or in porting programs.

Operating system requirements — the features necessary to support a particular operating system (OS) include:

• Size of address space: a very important feature; it may limit applications;

• Memory management: it may be flat, paged, or segmented;

• Protection: the OS and applications may have different needs, e.g., paged versus segmented protection;

• Context switch: this feature supports process interrupt and restart;

• Interrupts and traps: the type of support for these features affects the hardware design and the OS.
Standards — the machine may be required to comply with certain standards in the marketplace:

• Floating point: pertains to format and arithmetic; there are several standards, such as IEEE, DEC, and IBM;

• I/O bus: pertains to I/O devices, for which standards such as SCSI, VME, and Futurebus are defined;

• Network: support for different networks such as Ethernet and FDDI;

• Programming languages: related to the support of standards such as ANSI C, and affects the instruction set.

Once a set of functional requirements has been established, the architect must try to optimize the design. The optimal design may be considered the one that meets one of three criteria:

1. high-performance design — no cost is spared in achieving performance;

2. low-cost design — performance is sacrificed to achieve the lowest cost;

3. cost/performance design — cost is balanced against performance.

Optimizing cost/performance is largely a question of where the best place is to implement some required functionality: in hardware or in software. Balancing hardware and software leads to the best machine for the application domain. The performance of the machine can be quantified by using a set of programs chosen to represent that application domain. The measures of performance are CPU execution time, which reflects the combined performance of the CPU and the memory hierarchy, and response time, which measures the performance of the entire system, taking into account the operating system and input/output.

The design of the memory system involves three components: main memory, cache memory, and the interconnection network. The performance improvement of the CPU has been, and still is, faster than that of main memory: CPU performance has improved by 25% to 50% per year since 1985, while DRAM performance has improved by only 7% per year. Cache memories are supposed to bridge the gap between CPU and main-memory speeds, and thereby to decrease the CPU execution time. The elements that must be considered when evaluating the impact of caches on CPU execution time include: hit time, miss penalty, miss rate — and the effect of I/O and multiprocessing on miss rate —, memory-system latency and contention — which depend on the system architecture —, the memory-consistency model supported by the machine, the cache-coherence protocol, the synchronization methods, and the application behavior.
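The following sketch (Python; the numeric values are illustrative assumptions, not measurements from this report) shows the kind of CPU-execution-time estimate referred to above, folding the cache's miss rate and miss penalty into the CPU time along the lines of the CPU-time expression cited later (equation (12), Section 5.2); the exact form here follows the standard Hennessy–Patterson formulation.

    def cpu_time(instr_count, base_cpi, mem_refs_per_instr,
                 miss_rate, miss_penalty_cycles, clock_ns):
        """CPU execution time including memory-stall cycles (seconds).

        Standard formulation: stalls per instruction = memory references per
        instruction * miss rate * miss penalty (in clock cycles).
        """
        stall_cpi = mem_refs_per_instr * miss_rate * miss_penalty_cycles
        return instr_count * (base_cpi + stall_cpi) * clock_ns * 1e-9

    # Illustrative numbers: 100M instructions, base CPI 1.5, 1.3 memory refs/instr,
    # 4% miss rate, 40-cycle miss penalty, 10 ns clock.
    print(cpu_time(100e6, 1.5, 1.3, 0.04, 40, 10))   # with cache misses
    print(cpu_time(100e6, 1.5, 1.3, 0.00, 40, 10))   # ideal memory system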
10.2 General Memory Design Rules

In designing the memory hierarchy, one should take into account the pertinent rules of thumb [6]:

1. Amdahl/Case Rule: a balanced computer system needs about 1 megabyte of main-memory capacity and 1 megabit per second of I/O bandwidth per MIPS of CPU performance.

2. 90/10 Locality Rule: a program executes about 90% of its instructions in 10% of its code.

3. Address-Consumption Rule: the memory needed by the average program grows by a factor of about 1.5 to 2 per year; thus it consumes between 1/2 and 1 address bit per year.

4. 90/50 Branch-Taken Rule: about 90% of backward-going branches are taken, while about 50% of forward-going branches are taken.

5. 2:1 Cache Rule: the miss ratio of a direct-mapped cache of size X is about the same as that of a two-way set-associative cache of size X/2.

6. DRAM-Growth Rule: density increases by about 60% per year, quadrupling in 3 years.

7. Disk-Growth Rule: density increases by about 25% per year, doubling in 3 years.

Two remarks are worthwhile. First, the improvement in DRAM capacity (approximately 60% per year) has recently been faster than the improvement in CPU performance (25% to 50% per year). Second, the increase in DRAM speed is much slower than the increase in DRAM capacity, at only about 7% per year (these data are from Hennessy and Patterson [6]).
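A small sketch (Python; the 200-MIPS machine is a made-up example) of how two of these rules of thumb might be applied when sizing a system.

    def balanced_system(mips):
        """Amdahl/Case rule: memory (MB) and I/O bandwidth (Mbit/s) for a given MIPS."""
        return {"main_memory_MB": mips, "io_bandwidth_Mbit_s": mips}

    def dram_capacity_growth(years, rate=0.60):
        """DRAM-Growth rule: capacity multiplier after `years` at ~60% per year."""
        return (1.0 + rate) ** years

    # Illustrative 200-MIPS machine and a 3-year DRAM projection.
    print(balanced_system(200))               # -> 200 MB memory, 200 Mbit/s I/O
    print(round(dram_capacity_growth(3), 2))  # ~4.1, i.e. quadrupling in 3 years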
10.3 Dependences Between System Components

The optimum solution to the memory-system design results from a synergy between a multiprocessor's software and hardware components. Some dependences that should be considered when designing the memory hierarchy are:

1. The memory-consistency model supported by an architecture directly affects the complexity of the programming model and the performance;

2. The correctness of a coherence protocol is a function of the memory-consistency model adopted by the architecture;

3. The implementation of a synchronization method influences the memory and coherence traffic;

4. The cache organization, the replacement strategy, the write policy, and the coherence protocol influence the memory traffic;

5. The number of processors influences the application behavior; as the number of processors concurrently executing an application increases, each processor is expected to use a smaller part of the address space, but on the other hand the interprocess-communication overhead increases. Therefore, the synchronization activity is expected to increase the memory traffic as the number of processors increases;

6. The application behavior influences the memory traffic and therefore the latency of memory accesses. The application behavior can be characterized by several properties:

(a) locality of accesses;

(b) size and number of data locations shared among processes;

(c) the ratio between READs and WRITEs to a shared datum by a process;

(d) the length of the write-run;

(e) the number of processes that access each shared datum: data may be widely shared or shared by a small number of processes;

(f) the frequency of accesses to shared data by processes;

(g) the granularity of parallelism: coarse-grained, medium-grained, or fine-grained applications;

7. Compilers for parallel applications are important in achieving high performance. A parallelizing compiler can improve access locality, and it is also important for the support of synchronization (Section 8.3).

The software support is of particular importance. The parallelizing compiler extracts parallelism from programs written for sequential machines and tries to improve data locality. Locality may be enhanced by increasing cache utilization through blocking.

Therefore, the entire multiprocessor system must be studied when designing its components, and within each component the dependences should be considered. It is worth emphasizing the importance of software, and of the parallelism exhibited by applications, for achieving good performance on a highly parallel machine.

10.4 Optimizing cache design

Cache design involves choosing the cache size, the organization (associativity, line size, number of sets), the write strategy, the replacement algorithm, and the coherence protocol, and perhaps employing one of several performance-improvement schemes, such as those presented in Chapter 6. The performance parameters depend on several design choices:

• cache cost — affected by all design choices;

• hit time — affected by the cache size, cache associativity, and write strategy;

• miss rate — affected by the cache size, associativity, line size, and replacement algorithm;
• miss penalty — affected by the line size, the number of cache levels, the cache-coherence protocol, and the memory latency and bandwidth.

Subsections 10.4.1 to 10.4.6 show what criteria are taken into account when these design choices are made. Subsection 10.4.7 presents some design alternatives for improving performance. The steps followed in the synthesis of the cache are then described in Section 10.5.

10.4.1 Cache Size

Increasing the cache size has the positive effect of reducing the miss rate, more specifically the capacity misses and the conflict misses. However, the increase in cache size is limited by several factors:

1. the hit time of the cache must be at most one CPU clock cycle;

2. the page size and the degree of associativity (equation (22), Section 6.2);

3. the silicon area (or, alternatively, the number of chips) required to implement the cache.

As shown in Section 6.2, the requirement to perform a cache hit in one CPU clock cycle imposes on the cache size C the restriction

C = n · 2^(j+k) ≤ n · 2^p    (43)

where n is the degree of associativity, 2^j is the line size, 2^k is the number of sets, and 2^p is the page size. This is because the number of page-offset bits must be at least the sum of the number of bits that select the set, k, and the number of block-offset bits, j:

j + k ≤ p    (44)

Another limitation is the cost of the tag memory. The number of address-tag bits, tb, required for each cache line is

tb = m − (j + k)    (45)

where m is the number of bits in the memory address. The memory required by the address tags, T, is

T = n · 2^k · (m − (j + k)) = C · 2^(−j) · (m − (j + k))    (46)

Because the cache must be fast, it is built from SRAM chips, while main memory is built from DRAM chips because it must provide large capacity. For memories designed in comparable technologies, the cycle time (i.e., the minimum time between requests to memory) of SRAMs is 8 to 16 times shorter than that of DRAMs, while the capacity of DRAMs is roughly 16 times that of SRAMs. Because the cache hit time is a lower bound on the CPU clock cycle, and the miss penalty appears in the expression for CPU time (equation (12), Section 5.2) in CPU clock cycles and is affected by the main-memory access latency and bandwidth, the ratio of the SRAM cycle time to the DRAM cycle time directly affects the miss penalty.
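A small sketch (Python) of equations (43)–(46); the parameter values in the example are illustrative.

    def cache_geometry(assoc_n, line_bits_j, set_bits_k, addr_bits_m, page_bits_p):
        """Evaluate equations (43)-(46) for a candidate cache organization."""
        size_c = assoc_n * 2 ** (line_bits_j + set_bits_k)       # (43): bytes
        fits_one_cycle = line_bits_j + set_bits_k <= page_bits_p  # (44)
        tag_bits = addr_bits_m - (line_bits_j + set_bits_k)       # (45)
        tag_mem = assoc_n * 2 ** set_bits_k * tag_bits             # (46): bits
        return size_c, fits_one_cycle, tag_bits, tag_mem

    # Illustrative example: 2-way, 32-byte lines (j = 5), 128 sets (k = 7),
    # 32-bit addresses, 4-KB pages (p = 12).
    print(cache_geometry(2, 5, 7, 32, 12))
    # -> (8192, True, 20, 5120): an 8-KB cache that satisfies j + k <= p.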
10.4.2 Associativity

The choice of associativity influences several performance parameters, such as the miss rate, the cache access time, and the silicon area (or, alternatively, the number of chips). The positive effect of increased associativity is a decrease in miss rate.

The degree of associativity affects the size of the tag memory (equation (46), Subsection 10.4.1), which in turn affects the total cost of the cache. If the total cache size C and the line size 2^j are kept constant, then increasing the associativity n increases the number of blocks per set and thereby decreases the number of sets 2^k (as can be seen from equation (43), Subsection 10.4.1). But if k decreases, the number of address-tag bits per line increases (equation (45)) and the total size of the tag memory increases (equation (46)), so the cost increases. The degree of associativity also determines the number of comparators needed to check the tags against a given memory address and the complexity of the multiplexer required to select the line from the matching set, and it increases the hit time. Associativity is therefore expensive in hardware and may slow the access time, leading to lower overall performance. Thus, increasing associativity — although it has the beneficial effect of reducing the miss rate — is limited by several constraints:

1. increasing associativity makes the cache slower, affecting the hit time, which must be kept below the CPU clock cycle;

2. increasing associativity increases the cost of the cache and of the tag memory;

3. increasing associativity requires a larger silicon area (or, alternatively, more chips) to implement the cache.

Because direct-mapped caches allow only one data block to reside in the cache set specified by the index portion of the memory address, they have a worse miss rate than a set-associative cache of the same total size. However, the higher miss rate is mitigated by the smaller hit time: a set-associative cache of the same total size always has a higher hit time, because an associative search of a set is required on each reference, followed by multiplexing the appropriate line to the processor. As shown in Section 6.1, a direct-mapped cache can sometimes provide better performance than a 2-way set-associative cache of the same size. Furthermore, direct-mapped caches are simpler and easier to design, do not need logic to maintain a least-recently-used replacement policy, and require less area than a set-associative cache of the same size. Overall, direct-mapped caches are often the most economical choice for workstations, where cost/performance is the most important criterion.

Results obtained for many architectures and applications indicate that an associativity greater than 8 gives little or no decrease in miss ratio. Because greater associativity means a slower cache and higher cost, the associativity chosen is usually not greater than 8.
10.4.3 Line Size and Cache Fetch Algorithm

The minimum line size is determined by the machine word length. The maximum line size is limited by several factors, such as the miss rate, the memory bandwidth (which determines the transfer time and thereby the miss penalty), and the impact of line size on the performance of cache-coherent multiprocessors (Section 9.7). A larger line size reduces the cost of the tag memory T, as equation (46) in Subsection 10.4.1 shows, where the line size is 2^j. The line size influences the other parameters in the following ways:

1. increasing the line size decreases the compulsory misses (it exploits spatial locality) and increases the conflict misses (it does not preserve temporal locality). Therefore, for small line sizes, increasing the line size is expected to decrease the miss rate — due to spatial locality — while for large line sizes, increasing it further is expected to increase the miss rate — because temporal locality is not preserved;

2. increasing the line size increases the miss penalty, because the transfer time increases;

3. increasing the line size decreases the cost of the tag memory; the tag overhead becomes a smaller fraction of the total cost of the cache;

4. increasing the line size in multiprocessor architectures may increase the invalidation overhead and may cause false data sharing if the compiler does not ensure that different shared variables are located in different cache blocks (Section 9.7). On the other hand, because more information is invalidated at once with a larger line size, the frequency of invalidations may decrease as the line size increases, but this depends to a large extent on the data-sharing patterns (Section 9.3) and on the compiler.

The first two effects of the line size must be considered together because they both affect the average memory-access time (equation (15), Section 5.2); a small numerical sketch is given at the end of this discussion. For example, Smith has found ([20]) that for the IBM 3033 (64-byte line, 64-Kbyte cache) the line size that gives the minimum miss rate lies in the range 128 to 256 bytes. The reason IBM chose a line size of 64 bytes (which is not optimal with respect to the miss rate) is almost certainly that the transmission time for longer lines increases the miss penalty, and the main-memory data path required would be too wide and therefore too expensive. Measurements on different cache organizations and computer architectures indicate ([6], [20]) that the lowest average memory-access time is obtained for line sizes ranging from 8 to 64 bytes. When choosing the line size for a multiprocessor architecture, the effect of the line size on the overall traffic (that is, both data and coherence traffic) should be considered as well; for example, A. Gupta and W. Weber have found ([14]) that for the DASH multiprocessor architecture the line size that gives the minimum overall traffic is 32 bytes. When some criteria (e.g., miss rate) point to using a larger line size and the transfer time significantly affects the miss penalty, the methods explained in Section 6.9 may be used to improve the main-memory bandwidth.
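The tradeoff between effects 1 and 2 above can be explored with a sketch like the following (Python; the miss-rate values and memory parameters are made-up illustrations, not measurements), which computes the average memory-access time for a few candidate line sizes.

    def avg_mem_access_time(hit_time, miss_rate, latency, line_bytes, bytes_per_cycle):
        """AMAT = hit time + miss rate * (access latency + transfer time), in cycles."""
        transfer_time = line_bytes / bytes_per_cycle
        return hit_time + miss_rate * (latency + transfer_time)

    # Illustrative miss rates that first fall (spatial locality) and then rise
    # (lost temporal locality) as the line size grows.
    assumed_miss_rate = {16: 0.070, 32: 0.050, 64: 0.040, 128: 0.038, 256: 0.042}

    for line in sorted(assumed_miss_rate):
        amat = avg_mem_access_time(hit_time=1, miss_rate=assumed_miss_rate[line],
                                   latency=20, line_bytes=line, bytes_per_cycle=8)
        print(line, "bytes:", round(amat, 2), "cycles")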
The two available choices for cache fetch are demand fetching and prefetching. The purpose of prefetching (Section 6.5) is to reduce the miss rate for reads. However, prefetching introduces some penalties: it increases the memory traffic and introduces extra cache-lookup accesses. The major factor in determining whether prefetching is useful is the line size. Small line sizes generally benefit from prefetching, while large line sizes make prefetching ineffective. The reason is that when the line is large, a prefetch brings in a great deal of information, much or all of which may not be needed, and evicts an equally large amount of information, some of which may still be in use. Smith ([20]) has found that for line sizes greater than 256 bytes prefetching brings no improvement. The fastest hardware implementation of prefetching is one-block-lookahead (OBL) prefetch. Always-prefetch provides a greater decrease in miss ratio than prefetch-on-miss, but also introduces greater memory and cache overhead. As shown in Section 6.5, special steps must be taken to keep the interference of prefetching with normal program accesses at an acceptable level. As a general rule, one-block-lookahead prefetch with a line size of L bytes is a better choice than demand fetch with a line size of 2L bytes, because the former allows the processor to proceed while bytes L + 1 . . . 2L are fetched.

10.4.4 Line Replacement Strategy

The cache-line replacement strategy affects the miss rate and the memory traffic. The two candidate choices are random replacement and LRU (or an approximation of it). FIFO replacement is not considered a good choice because it has been shown to generally perform worse than random, and its hardware cost is greater than that of random. The replacement policy plays a greater role in smaller caches than in larger caches, where there are more choices of what to replace. Although LRU performs better than random, its edge over random in miss rate is less significant for large caches, where the virtue of random — being simple to build in hardware — may become more important. LRU is the best choice for small and medium-size caches because it preserves temporal locality.

10.4.5 Write Strategy

For uniprocessor architectures, either write through or write back can be used. Write back is usually combined with fetch-on-write — in the hope that the written line will be referenced again soon, by either write or read accesses — while write through usually does not use write-allocate — with the purpose of keeping room in the cache for data that is read, and because subsequent writes to that block still have to go to memory. Using write through in uniprocessors simplifies the cache-coherence problem for I/O, because with this policy the memory has an up-to-date copy of the information and special schemes are needed to prevent inconsistency only for I/O input, not for output (as shown in Subsection 5.13.1).
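A toy comparison (Python; the access trace is invented) of the main-memory write traffic generated by the two policies, illustrating why write back — discussed next for multiprocessors — sends fewer writes to memory when a line is written repeatedly.

    def memory_writes(block_trace, policy):
        """Count writes to main memory for a sequence of written block numbers.

        'write-through': every write goes to memory.
        'write-back': a dirty block is written to memory only when a different
        block displaces it (single-line cache, to keep the sketch minimal).
        """
        if policy == "write-through":
            return len(block_trace)
        writes, dirty_block = 0, None
        for block in block_trace:
            if dirty_block is not None and block != dirty_block:
                writes += 1                  # write back the displaced dirty line
            dirty_block = block
        return writes + (1 if dirty_block is not None else 0)  # final write-back

    trace = [7, 7, 7, 7, 3, 3, 7, 7]              # repeated writes within a line
    print(memory_writes(trace, "write-through"))  # 8 memory writes
    print(memory_writes(trace, "write-back"))     # 3 memory writes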
For multiprocessor architectures the typical write strategy is write back with fetch-on-write, in order to reduce the interconnection-network traffic. With write back, write hits occur at the speed of the cache memory, and multiple writes within a line require only one write to main memory. Since not every write goes to memory, write back uses less memory bandwidth, which is an important consideration in multiprocessors.

10.4.6 Cache Coherence Protocol

The cache-coherence protocol is selected depending on the type of interconnection network, the number of processors in the system, and the sharing patterns of the applications. In shared-bus systems, either snooping or directory protocols may be selected, and generally the choice is related to cost and to the number of processors: snooping protocols are less costly, but they are not able to scale to many processors. For general interconnection networks and scalable multiprocessor architectures, directory protocols are the better choice.

A design aspect of the coherence protocol is the write policy: write-invalidate or write-update. Directory-based protocols are based on the write-invalidate strategy. For snooping protocols there is no clear indication whether write-invalidate or write-update is better: some applications perform better with write-invalidate, others with write-update. This is because the performance of both schemes is sensitive to the sharing pattern, the write-run (Section 9.3) being of particular importance. The choice can be made on the basis of the sharing patterns of the applications for which the system is mainly targeted. The length of the write-run points to the following choice of write policy:

— for long write-runs, write-invalidate is the better choice;

— for short write-runs, write-update is the better choice.

10.4.7 Design Alternatives

When the desired performance cannot be achieved by adjusting the design parameters alone, the performance-improvement techniques described in Chapter 6 can be applied, depending on the problem:

1. Miss penalty — The read-miss penalty can be reduced by employing early restart or out-of-order fetch (Section 6.3). For write-back caches, a write buffer (Section 6.7) can be used; the write buffer is also useful for write-through caches because it reduces the write stalls. Two-level caches (Section 6.8) also provide a reduction in the miss penalty. Another approach is to reduce the transfer time by increasing the main-memory bandwidth (Section 6.9).

2. Miss rate — The miss rate can be reduced using prefetching (Section 6.5). For direct-mapped caches, a victim cache or a column-associative scheme may be used (Section 6.4).
10.4.7 Design Alternatives

When the desired performance cannot be achieved by adjusting the design parameters alone, the performance-improvement techniques described in Chapter 6 can be applied, depending on the problem:

1. Miss penalty: the read-miss penalty can be reduced by employing early restart or out-of-order fetch (Section 6.3). For write-back caches, a write buffer (Section 6.7) can be used; the write buffer is also useful for write-through caches because it reduces the write stalls. Two-level caches (Section 6.8) also provide a reduction in the miss penalty. Another approach is to reduce the transfer time by increasing the main memory bandwidth (Section 6.9).

2. Miss rate: the miss rate can be reduced using prefetching (Section 6.5). For direct-mapped caches, a victim cache or a column-associative scheme may be used (Section 6.4).

3. Hit time: the read hit time for cache organizations that do not satisfy condition (44) can be reduced by pipelining the TLB (Section 6.2). The write hit time can be reduced by pipelining the writes or by using subblock placement for write-through direct-mapped caches (Section 6.6).

10.5 Design Cycle

The synthesis of the memory hierarchy is achieved by dividing the global design goal into subgoals (i.e., subtasks) and solving these subtasks. This process, called goal reduction, is imposed by the complexity of the initial task. IDAMS maintains its knowledge in a modular form, reflecting the knowledge involved in solving the subtasks that make up the goal tree. Goal reduction is driven by an agenda that specifies the subtasks. Initially, the subtasks in the agenda may look like this:

1. Main Memory,
2. Cache Memory, and
3. Interconnection network.

The subtasks are solved by agents; as the agents solve subtasks, they remove them from the agenda and possibly replace them with other, simpler tasks. A clear definition of the goals is an important requirement on the input to IDAMS in order to obtain relevant results.
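The agenda mechanism can be outlined as in the following sketch, which is only a schematic rendering of the goal-reduction loop: the two agents and the subtasks they produce are hypothetical stand-ins for the knowledge modules of IDAMS.

    from collections import deque

    def main_memory_agent(task):
        # Hypothetical agent: reduces the 'Main Memory' goal into simpler subtasks.
        if task == 'Main Memory':
            return ['DRAM organization', 'Interleaving factor']
        return []

    def cache_agent(task):
        # Hypothetical agent: reduces the 'Cache Memory' goal into design parameters.
        if task == 'Cache Memory':
            return ['Cache size', 'Line size', 'Associativity', 'Write strategy']
        return []

    def goal_reduction(initial_agenda, agents):
        """Process the agenda until every subtask has been reduced to a leaf task."""
        agenda = deque(initial_agenda)
        solved = []
        while agenda:
            task = agenda.popleft()            # take the next subtask
            for agent in agents:
                subtasks = agent(task)
                if subtasks:                   # an agent reduced the task:
                    agenda.extend(subtasks)    # replace it with simpler subtasks
                    break
            else:
                solved.append(task)            # no agent applies: leaf task
        return solved

    agenda = ['Main Memory', 'Cache Memory', 'Interconnection network']
    print(goal_reduction(agenda, [main_memory_agent, cache_agent]))

In the actual tool the leaf tasks would be solved by applying the design rules of Section 10.4; the sketch stops at the reduction itself.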
The steps followed in designing the cache level of the memory hierarchy are:

1. input information analysis. This step collects the information about the System under Design (SUD) that affects the memory synthesis:
(a) uniprocessor or multiprocessor architecture;
(b) shared-memory or distributed-memory architecture;
(c) memory consistency model;
(d) type of interconnection network;
(e) instruction set;
(f) application programs;
(g) compiler technology.

2. extraction. This step extracts information from step 1 and translates it into more detailed parameters:
(a) an abstract model of the System Under Design (SUD);
(b) a model of the application behavior: working-set size, frequency of shared accesses, ratio of read to write accesses per processor, and length of the write runs;
(c) constraints for the memory, such as the available hit time and a coherence protocol consistent with the type of interconnection network and with the consistency model.

3. main design part. The parameters are adjusted using the information from step 2 and the design rules.

4. performance check. When analytical models are not available, simulation is used to determine the performance; the achieved performance is checked against the required performance. If the requirements are met, the design is complete; otherwise step 5 is taken, or the designer is asked to input additional information.

5. performance improvement. An analysis of more specific parameters is performed to determine where improvements should be made; the parameters analyzed to choose an improvement include compulsory misses, capacity misses, conflict misses, miss rate, transfer time, and miss penalty. Using the insight provided by this analysis, some design parameters are changed (for example, if the conflict-miss rate is high, the associativity is increased) or specific techniques (such as a column-associative cache, memory interleaving, prefetching, write pipelining, or two-level caches) are employed, as sketched below.

6. repeat step 4.
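Step 5 lends itself to a rule-based formulation. The sketch below shows one possible encoding of such improvement rules; the diagnostic keys and the thresholds are invented for this example and would in practice be supplied by the simulation of step 4 and by the designer.

    def suggest_improvements(diag):
        """Map simulation diagnostics to candidate improvement techniques.

        diag is a dictionary with (hypothetical) keys such as 'conflict_miss_rate',
        'capacity_miss_rate', 'compulsory_miss_rate', 'transfer_time_fraction' and
        'write_stall_fraction', each given as a fraction of the references or of
        the total memory-access time.
        """
        suggestions = []
        if diag.get('conflict_miss_rate', 0) > 0.02:
            # Conflict misses dominate: more associativity or a conflict-removal scheme.
            suggestions.append('increase associativity, or add a victim cache '
                               'or a column-associative cache (Section 6.4)')
        if diag.get('capacity_miss_rate', 0) > 0.02:
            suggestions.append('increase the cache size, or add a second-level '
                               'cache (Section 6.8)')
        if diag.get('compulsory_miss_rate', 0) > 0.01:
            suggestions.append('use prefetching or a larger line size (Section 6.5)')
        if diag.get('transfer_time_fraction', 0) > 0.3:
            suggestions.append('increase the main memory bandwidth, e.g. by '
                               'interleaving (Section 6.9)')
        if diag.get('write_stall_fraction', 0) > 0.1:
            suggestions.append('add a write buffer or pipeline the writes '
                               '(Sections 6.6 and 6.7)')
        return suggestions

    # Example diagnostics from a (fictitious) simulation run of step 4.
    diagnostics = {'conflict_miss_rate': 0.05, 'write_stall_fraction': 0.15}
    for s in suggest_improvements(diagnostics):
        print('suggest:', s)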
11 CONCLUSIONS

The report has addressed issues related to the synthesis of the upper level of the memory hierarchy, the cache, tackling the cache design choices, together with their internal and external dependences, and the performance-improvement strategies. The specialized knowledge involved in the design of caches, both for uniprocessor and for multiprocessor architectures, has been provided, and the performance impact of the design choices has been pointed out.

The domain-specific knowledge presented for cache synthesis is to be applied in building the IDAMS tool. The steps followed by IDAMS in the synthesis of the cache have been described. The results point out the importance of the generate-and-test strategy for evaluating design alternatives. The rule-based system paradigm is used as the problem-solving strategy in building IDAMS: the checking of design rules and constraints and the selection of design parameters and alternatives follow this strategy. Future work on IDAMS will address structuring the acquired knowledge and finding an adequate representation for it.
REFERENCES

1. L.M. Censier and P. Feautrier, "A New Solution to Coherence Problems in Multicache Systems." IEEE Trans. Computers, Vol. C-27, No. 12, Dec. 1978, pp. 1112–1118.

2. L. Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs." IEEE Trans. Computers, Vol. C-28, No. 9, Sept. 1979, pp. 690–691.

3. M. Dubois, C. Scheurich, and F.A. Briggs, "Synchronization, Coherence, and Ordering of Events in Multiprocessors." Computer, Vol. 21, No. 2, Feb. 1988, pp. 9–21.

4. M. Dubois and S. Thakkar, "Cache Architectures in Tightly Coupled Multiprocessors." Computer, Vol. 23, No. 6, June 1990, pp. 9–11.

5. J. Archibald and J.-L. Baer, "Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model." ACM Transactions on Computer Systems, Vol. 4, No. 4, Nov. 1986, pp. 273–298.

6. D.A. Patterson and J.L. Hennessy, "Computer Architecture: A Quantitative Approach." Morgan Kaufmann Publishers, San Mateo, Calif., 1990.

7. H.S. Stone, "High-Performance Computer Architecture." 2nd ed., Addison-Wesley, Reading, Mass., 1990.

8. H. Cheong and A.V. Veidenbaum, "Compiler-Directed Cache Management in Multiprocessors." Computer, Vol. 23, No. 6, June 1990, pp. 39–47.

9. D. Lenoski, J. Laudon, W. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam, "The Stanford Dash Multiprocessor." Computer, Vol. 25, No. 3, March 1992, pp. 63–79.

10. A. Agarwal et al., "An Evaluation of Directory Schemes for Cache Coherence." Proc. 15th Annual Intl. Symp. on Computer Architecture, IEEE Computer Society Press, Los Alamitos, Calif., June 1988, pp. 280–289.

11. A. Agarwal and S.D. Pudar, "Column-Associative Caches: A Technique for Reducing the Miss Rate of Direct-Mapped Caches." Proc. 20th Annual Intl. Symp. on Computer Architecture, IEEE Computer Society Press, Los Alamitos, Calif., May 1993, pp. 179–189.

12. N.P. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers." Proc. 17th Annual Intl. Symp. on Computer Architecture, IEEE Computer Society Press, Los Alamitos, Calif., May 1990, pp. 364–373.

13. K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors." Proc. 17th Annual Intl. Symp. on Computer Architecture, IEEE Computer Society Press, Los Alamitos, Calif., May 1990, pp. 15–26.

14. A. Gupta and W. Weber, "Cache Invalidation Patterns in Shared-Memory Multiprocessors." IEEE Trans. Computers, Vol. 41, No. 7, July 1992, pp. 794–810.
15. G. Pfister and V. Norton, "Hot Spot Contention and Combining in Multistage Interconnection Networks." IEEE Trans. Computers, Vol. C-34, No. 10, Oct. 1985, pp. 943–948.

16. P. Yew, N. Tzeng, and D. Lawrie, "Distributing Hot-Spot Addressing in Large-Scale Multiprocessors." IEEE Trans. Computers, Vol. C-36, No. 4, April 1987, pp. 388–395.

17. D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy, "The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor." Proc. 17th Annual Intl. Symp. on Computer Architecture, IEEE Computer Society Press, Los Alamitos, Calif., May 1990, pp. 148–159.

18. S.J. Eggers and R.H. Katz, "A Characterization of Sharing in Parallel Programs and its Application to Coherency Protocol Evaluation." Proc. 15th Annual Intl. Symp. on Computer Architecture, IEEE Computer Society Press, Los Alamitos, Calif., June 1988, pp. 373–382.

19. P. Stenström, M. Brorsson, and L. Sandberg, "An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing." Proc. 20th Annual Intl. Symp. on Computer Architecture, IEEE Computer Society Press, Los Alamitos, Calif., May 1993, pp. 109–118.

20. A.J. Smith, "Cache Memories." ACM Computing Surveys, Vol. 14, No. 3, Sept. 1982, pp. 473–530.