Implementation of coarse-grain coherence tracking support in ring-based multiprocessors

As the number of processors in multiprocessor system-on-chip devices continues to increase, the complexity required for full cache coherence support is often unwarranted for application-specific designs. Bus-based interconnects are no longer suitable for larger-scale systems, and the logic and storage overhead associated with the use of a complex packet-switched network and directory-based cache coherence may be undesirable in single-chip systems. Unidirectional rings are a suitable alternative because they offer many properties favorable to both on-chip implementation and to supporting cache coherence. Reducing the overhead of cache coherence traffic is, however, a concern for these systems. This thesis adapts two filter structures that are based on principles of coarse-grained coherence tracking, and applies them to a ring-based multiprocessor. The first structure tracks the total number of blocks of remote data cached by all processors in a node for a set of regions, where a region is a large area of memory referenced by the upper bits of an address. The second structure records regions of local data whose contents are not cached by any remote node. When used together to filter incoming or outgoing requests, these structures reduce the extent of coherence traffic and limit the transmission of coherent requests to the necessary parts of the system. A complete single-chip multiprocessor system that includes the proposed filters is designed and implemented in programmable logic for this thesis. The system is composed of nodes of bus-based multiprocessors, and each node includes a common memory, two or more pipelined 32-bit processors with coherent data caches, a split-transaction bus with separate lines for requests and responses, and an interface for the system-level ring interconnect. Two coarse-grained filters are attached to each node to reduce the impact of coherence traffic on the system. 
Cache coherence within the node is enforced through bus snooping, while coherence across the interconnect is supported by a reduced-complexity ring snooping protocol. Main memory is globally shared and is physically distributed among the nodes. Results are presented to highlight the system's key implementation points. Synthesis results are presented in order to evaluate hardware overhead, and operational results are shown to demonstrate the functionality of the multiprocessor system and of the filter structures.

Published in: Engineering

  1. 1. IMPLEMENTATION OF COARSE-GRAIN COHERENCE TRACKING SUPPORT IN RING-BASED MULTIPROCESSORS by EDMOND A. COTÉ. A thesis submitted to the Department of Electrical and Computer Engineering in conformity with the requirements for the degree of Master of Science (Engineering). Queen’s University, Kingston, Ontario, Canada, October 2007. Copyright © Edmond A. Coté, 2007
  2. 2. Abstract
     As the number of processors in multiprocessor system-on-chip devices continues to increase, the complexity required for full cache coherence support is often unwarranted for application-specific designs. Bus-based interconnects are no longer suitable for larger-scale systems, and the logic and storage overhead associated with the use of a complex packet-switched network and directory-based cache coherence may be undesirable in single-chip systems. Unidirectional rings are a suitable alternative because they offer many properties favorable to both on-chip implementation and to supporting cache coherence. Reducing the overhead of cache coherence traffic is, however, a concern for these systems.
     This thesis adapts two filter structures that are based on principles of coarse-grained coherence tracking, and applies them to a ring-based multiprocessor. The first structure tracks the total number of blocks of remote data cached by all processors in a node for a set of regions, where a region is a large area of memory referenced by the upper bits of an address. The second structure records regions of local data whose contents are not cached by any remote node. When used together to filter incoming or outgoing requests, these structures reduce the extent of coherence traffic and limit the transmission of coherent requests to the necessary parts of the system.
     A complete single-chip multiprocessor system that includes the proposed filters is designed and implemented in programmable logic for this thesis.
  3. 3. The system is composed of nodes of bus-based multiprocessors, and each node includes a common memory, two or more pipelined 32-bit processors with coherent data caches, a split-transaction bus with separate lines for requests and responses, and an interface for the system-level ring interconnect. Two coarse-grained filters are attached to each node to reduce the impact of coherence traffic on the system. Cache coherence within the node is enforced through bus snooping, while coherence across the interconnect is supported by a reduced-complexity ring snooping protocol. Main memory is globally shared and is physically distributed among the nodes.
     Results are presented to highlight the system’s key implementation points. Synthesis results are presented in order to evaluate hardware overhead, and operational results are shown to demonstrate the functionality of the multiprocessor system and of the filter structures.
  4. 4. Acknowledgements
     I would like to thank, first and foremost, Dr. N. Manjikian for the patience, constructive criticism, and guidance that he has provided to me over the past two years. His wealth of experience and attention to detail have left a lasting impression that will not soon be forgotten.
     I am grateful for the financial support provided by the Natural Sciences and Engineering Research Council of Canada, Communications and Information Technology Ontario, and Queen’s University, and to Canadian Microelectronics Corporation for providing the hardware and software necessary for my research. I am especially grateful to Hugh Pollitt-Smith, Todd Tyler, and Susan Xu at CMC Microsystems for helping me resolve synthesis tool issues, as well as the technical services staff at Queen’s University.
     Finally, I would like to thank my parents, colleagues, teammates, family, and friends, both new and old, whose consistent encouragement and sometimes unwanted distractions have helped see me through this thesis.
  5. 5. Table of Contents
     Abstract
     Acknowledgements
     Table of Contents
     List of Tables
     List of Figures
     Chapter 1: Introduction
       1.1 Thesis Contributions
       1.2 Thesis Organization
     Chapter 2: Background
       2.1 Shared-Memory Multiprocessor Architectures
       2.2 Coarse-Grained Coherence Tracking
       2.3 Enhanced Ring Snooping for Cache Coherence
       2.4 Summary
  6. 6. Chapter 3: Multiprocessor System Architecture
       3.1 Pipelined Processor
       3.2 Split-Transaction Bus
       3.3 Cache Controller
       3.4 Memory and I/O Devices
       3.5 Unidirectional Ring Network
       3.6 Reduced-Complexity Cache Coherence
       3.7 Memory Consistency Issues
       3.8 Summary
     Chapter 4: Region-Level Filtering of Coherence Traffic
       4.1 Node Cached-Region Hash
       4.2 Remote Non-Shared Region Table
       4.3 Filtering Correctness
       4.4 Summary
     Chapter 5: System Implementation and Results
       5.1 Implementation Platform
       5.2 Implementation Specifications
       5.3 Overview of Implementation
       5.4 Synthesis Results
       5.5 Operational Results
  7. 7. Chapter 6: Conclusion
       6.1 Future Work
     Bibliography
  8. 8. List of Tables
     2.1 Directory organization
     3.1 Supported instruction set
     3.2 Address mapping
     3.3 Rules for outbound packet generation
     4.1 Comparison of NCRH and RNSRT structures
     5.1 System address map
     5.2 NCRH control logic
     5.3 System resource utilization
  9. 9. List of Figures
     2.1 Cache coherence example
     2.2 Example parallel program to demonstrate concept of memory consistency
     2.3 Bus-based shared memory
     2.4 MSI cache coherence protocol
     2.5 Bus-based snooping for cache coherence
     2.6 Split-transaction bus example
     2.7 Two-level snooping hierarchy
     2.8 Ring interconnection network and multiprocessor node
     2.9 Ordering of ring requests
     2.10 Directory-based multiprocessor architecture
     2.11 Address to 64-kbyte region mapping
     3.1 Multiprocessor system node overview
     3.2 Instruction pipeline
     3.3 Request bus timing
     3.4 Buffering for memory and I/O devices
     3.5 Ring network
     3.6 Ring buffers
     3.7 Memory consistency timeline
  10. 10. 4.1 Location of NCRH and RNSRT filters
     4.2 NCRH structure and address mapping
     4.3 Additional request bus lines supporting the operation of the NCRH
     4.4 RNSRT structure and address mapping
     4.5 RNSRT filtering and flag-based synchronization
     5.1 Synthesis and simulation flow
     5.2 Software toolchain flow
     5.3 Close-up of the data cache
     5.4 Cache and snooping control logic
     5.5 Ring packet format
     5.6 Ring router
     5.7 NCRH datapath
     5.8 RNSRT datapath
     5.9 Chip floorplan
     5.10 Behavior of outstanding request table
     5.11 Servicing of remote read request on ring network
     5.12 NCRH increment operation
     5.13 NCRH decrement due to replacement operation
     5.14 NCRH early decrement
     5.15 NCRH late decrement
     5.16 NCRH filtering and RNSRT update
  11. 11. Chapter 1 Introduction
     In recent years, System-on-Chip (SoC) integration has emerged as one of the primary methods of combining all components of an electronic system into a single integrated circuit. This increased level of integration is made possible by advances in semiconductor fabrication technology. A typical SoC product consists of one or more microprocessors, memory blocks, and external interfaces, and these SoC devices are widely used in embedded systems.
     The same advances in fabrication technology have had a different impact on general-purpose processors, which have traditionally aimed at providing the most performance possible. These processors have approached limits related to power consumption and operating frequency, hence designers have begun integrating multiple processing cores per chip in order to keep pace with Moore’s Law on the rate of increase in components per chip.
     This thesis is concerned with System-on-Chip devices, in particular, Multiprocessor System-on-Chip (MPSoC) devices whose processor count is increasing in step with chips containing general-purpose processors. The requirements of these devices, however, are quite distinct.
  12. 12. The primary aim of the design of an application-specific MPSoC is to minimize both hardware implementation and software development complexity, as opposed to the aims of traditional general-purpose multiprocessors. Adopting a shared-memory programming model is a first step in reducing the complexity of software development, but traditional techniques for providing hardware support for cache coherence in these systems may introduce additional hardware complexity that is not desired and may not be warranted.
     One type of software application that is executed on MPSoC devices involves the processing of streaming data, multimedia or otherwise. The execution of these types of applications can be broken up into different parts and can be performed in a pipelined fashion. The processors in the system will not all operate in parallel on the same data set. One group of processors can be programmed to perform one task, and the following task can be assigned to a second group of processors. Upon completion, the computed results are transferred between groups. Full cache coherence support for this class of operations is not necessarily required.
     The choice of system interconnect must also be considered when a larger number of processors are integrated on a single chip. The shared bus is a popular interconnect for shared-memory multiprocessors, but it does not scale well when additional processors are added to a system because its bandwidth saturates rapidly. The use of a packet-switched network and directory-based coherence is another approach, but it is undesirable in the context of the increasing number of cores in single-chip multiprocessors due to its potentially high storage and logic requirements.
     The use of a ring interconnect for interprocessor communication has been the subject of much research [SST06, MH06]. Its short point-to-point wires and its low-complexity routing logic [JW05] make the architecture favorable for on-chip implementation.
  13. 13. The ordering properties of a unidirectional ring are also helpful when snooping cache coherence is considered. Each request placed on the ring is unconditionally broadcast to all nodes of the ring, and whereas processors on a bus snoop a request at the instant that it is seen on the interconnect, processors on a ring snoop each request individually as it goes by.
     Reducing the overhead of cache coherence traffic is yet another concern. Coarse-grained coherence tracking techniques have been proposed to reduce the impact of this type of traffic in bus-based multiprocessors. These techniques track the status of large, contiguous regions of memory that span several kilobytes or more, and they use this information to filter unnecessary coherence operations and to reduce energy consumption. Cantin et al. [CLS05] demonstrate that for a set of commercial, scientific, and multiprogrammed workloads, coarse-grained coherence tracking techniques can eliminate 55-97% of all coherence-related broadcasts on a bus and improve system performance by 8.8% on average. These techniques are implemented using relatively simple hardware structures that could be adapted to, and be of potential benefit to, MPSoC designs.
     The work in this thesis aims to combine the above ideas and presents an architecture suitable for application-specific multiprocessors. A ring network composed of nodes of bus-based multiprocessors is developed, and coarse-grained filter structures are adapted and integrated at each node of the network in order to support cache coherence with reduced complexity. The architecture is demonstrated by means of a complete, system-level prototype hardware implementation that was developed using programmable logic. This approach focuses on highlighting the details related to an actual implementation for full system functionality.
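As a rough illustration of the ring snooping behaviour described in this chapter, the following Python sketch models a unidirectional ring in which a request is forwarded hop by hop and snooped by each node in turn as it passes. The `Node` class, the `broadcast_on_ring` function, and the request tuple are hypothetical names chosen for this sketch; they are not taken from the thesis implementation, which is written in a hardware description language.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical ring node that records every request it snoops."""
    node_id: int
    snooped: list = field(default_factory=list)

    def snoop(self, request):
        self.snooped.append(request)


def broadcast_on_ring(nodes, source_id, request):
    """Forward a request around a unidirectional ring.

    Each node other than the source snoops the request as it goes by.
    Returns the node ids in the order in which they observed the request.
    """
    count = len(nodes)
    order = []
    hop = (source_id + 1) % count
    while hop != source_id:
        nodes[hop].snoop(request)
        order.append(hop)
        hop = (hop + 1) % count
    return order


nodes = [Node(i) for i in range(4)]
order = broadcast_on_ring(nodes, source_id=1, request=("READ", 0x1234))
print(order)  # [2, 3, 0]
```

In this simplified model every node sees every request, which is the unconditional broadcast that coarse-grained filtering aims to reduce.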
  14. 14. 1.1 Thesis Contributions
     This thesis makes two main contributions. The first contribution is the development of two filter structures that apply the principles of coarse-grained coherence tracking to a ring-based multiprocessor in order to efficiently support system-wide cache coherence with reduced complexity. The design of the filters for the large-scale ring-based architecture is adapted from the RegionScout approach [Mos05] where a similar set of filters is used in a smaller-scale bus-based multiprocessor. The first structure as adapted in this thesis tracks the total number of blocks of remote data cached by all processors in a node for a set of regions, where a region is a large area of memory referenced by the upper bits of an address. The second structure records regions of local data whose contents are not cached by any remote node. When used together to filter incoming or outgoing coherence requests, these structures reduce the extent of coherence traffic and limit the transmission of coherence requests to the necessary parts of the system.
     The second contribution of this thesis is the design and implementation of a full multiprocessor system that includes the above proposed filters. The system is composed of nodes of bus-based multiprocessors, and each node includes a common memory, two or more MIPS-derived pipelined 32-bit processors with coherent data caches, a split-transaction bus with separate lines for requests and responses, and an interface for the system-level ring interconnect. Each node of the ring contains a register to store circulating packets and an appropriate amount of routing logic. The two structures that use the principles of coarse-grained coherence tracking to filter traffic on the system are attached to each node. Cache coherence within the node is enforced through bus snooping, while coherence across the interconnect is supported by a reduced-complexity ring snooping protocol. Main memory is globally shared and is physically distributed among the nodes.
  15. 15. The complete system that was just described is developed using a hardware-description language, and has been implemented in programmable logic. Synthesis results for the system are presented in order to evaluate hardware overhead, and operational results are shown to demonstrate the behavior of the multiprocessor system and of the filter structures. 1.2 Thesis Organization This thesis is organized as follows. Chapter 2 presents background information related to shared-memory multiprocessor architectures, techniques for supporting cache coherence on a bus and on a ring, the concept of coarse-grained coherence tracking, and improvements to traditional snooping on a ring network. Chapter 3 describes the architecture of the ring-based multiprocessor system that was developed for the purposes of this work. Chapter 4 discusses the coarse-grained filter structures that are used at each node of the ring interface. Chapter 5 presents the key implementation points of the system, synthesis results from its implementation in programmable logic, and operational results obtained from register-transfer level simulation. Finally, Chapter 6 concludes with a summary of this work, and a discussion of possible avenues for future work.
  16. 16. Chapter 2 Background This chapter begins with a discussion of shared-memory multiprocessor architectures and the issues of cache coherence and memory consistency, including different approaches for enforcing cache coherence. An overview of the concept of coarse-grained coherence tracking is then provided. Finally, extensions for enforcing coherence in ring-based multiprocessors are reviewed. 2.1 Shared-Memory Multiprocessor Architectures Parallel computers enable the execution of a single program to be divided amongst multiple processing elements in order to reduce the program's overall execution time and to increase processing throughput. One important class of parallel computers comprises shared-memory multiprocessors [CSG99]. These systems use a shared-address programming model that is natively supported by hardware. Communication and data exchange between processes in this model is performed implicitly using conventional load and store instructions that operate on shared memory. A second class of parallel computers uses
  17. 17. the message-passing programming model where processors exchange data and control information using explicit send and receive commands because the memory associated with each processor is not logically shared among the processors. The different programming models for these two classes of parallel computers have various advantages and drawbacks related to ease of programming or performance optimization, depending on the scale of the parallel computer system. This thesis is focused on shared-memory multiprocessors with a sufficiently large scale to warrant consideration of interconnection networks other than a simple bus. The remainder of this section reviews the issues of cache coherence and memory consistency in such multiprocessors, and presents a number of approaches for enforcing coherence. 2.1.1 Cache Coherence In the field of computer architecture, caches have long been an essential requirement in the design of uniprocessors and multiprocessors. A cache is a small memory structure typically built using static random access memory (SRAM) that is used to hold recently accessed blocks of data that have been retrieved from slower, but larger, main memory. The time required to access a block of data in cache memory is at least an order of magnitude less than the time required to access the same block in main memory. Locality of access in typical programs enables most data requests to be serviced by the cache despite its small size, hence performance is improved due to the shorter cache access latency. The cache coherence problem emerges when caches are used in a multiprocessor system. This problem occurs when a block of data from main memory is shared by multiple caches and when one processor wishes to write a new value to this shared location. For example, processor P0 in Figure 2.1 must be informed of processor P1's write to the shared
  18. 18. address X; otherwise, the data read by processor P0 will not reflect the change in processor P1's cache, and the result of a subsequent read operation by P0 will be incorrect. [Figure 2.1: Cache coherence example. P0 and P1 each cache X: 7, memory holds X: 7, and P1 writes X=9.] The cache coherence problem is resolved by ensuring that the multiprocessor system maintains the application's program order at the memory system, and enforces the property that the result of any processor's read operation is always the value of the last write performed to that location by any processor. Program order, in this context, is loosely defined as the sequence of instructions in assembly code that is presented to the hardware. The most common approach for propagating the occurrence of writes in a shared-memory system with caches is to use a write-invalidation protocol in order to ensure that the most recent value of the requested block of shared data is always supplied to the requesting processor. Under a write-invalidation protocol, each memory store request for a block of memory generates an invalidation request that is broadcast to all processors contained in the system. The invalidation ensures that any cached copies of the block of data in other processors are no longer valid. As soon as this invalidation request is acknowledged by all processors, the writing processor becomes the only processor in the system with a valid copy of the block of data, and only then can its store operation be allowed to proceed. Details pertaining to hardware-assisted solutions to the cache coherence problem vary
  19. 19. depending on the system's processor-memory interconnection network. A number of different hardware implementations for enforcing cache coherence are described in the following sections, including two snooping protocols and a directory-based scheme, but first, an issue that is complementary to cache coherence must be discussed: memory consistency. 2.1.2 Memory Consistency The above discussion on cache coherence applies to actions related to a single memory location. Memory consistency is concerned with the order of accesses to different memory locations. Processors do not necessarily service their requests in program order, i.e., in the order in which instructions appear in assembly code. The example program in Figure 2.2 helps demonstrate the issue of memory consistency in shared-memory systems.
  Initially (A, B, U, V) = (0, 0, 0, 0)
  P0 code    P1 code    P0 mem. ops.    P1 mem. ops.
  A = 1      B = 1      store A         store B
  U = B      V = A      load B          load A
                        store U         store V
  Figure 2.2: Example parallel program to demonstrate concept of memory consistency
  At first glance, one would expect that the result of executing the parallel program would be (U, V) = (1, 1). An optimizing compiler may correctly determine that processor P0's store A and load B instructions are independent and possibly reorder them. Alternatively, the processor and/or memory system may complete the requests in a different order than the original code. If processor P0's load B operation completes prior to processor P1's store B operation, the final value of U seen by processor P0 remains unchanged and retains its initial value of 0. The issues discussed above have been formalized by a number of memory consistency
  20. 20. models [AG96]. These models specify constraints on the order in which memory operations must complete with respect to all processors. Low-level software programmers use these models as a guideline to understand multiprocessor behavior when developing operating systems and their low-level synchronization primitive libraries. High-level programmers must also be aware of the consistency model in order to interpret program results or possibly constrain the access order if necessary. Sequential Consistency One of the simplest memory consistency models is sequential consistency [Lam79], which states that each processor appears to issue memory operations one at a time, in program order, and that the result of parallel execution is consistent with a linear interleaving of all memory operations from all processors. Implementing sequential consistency can be achieved by requiring processors to observe a number of sufficient conditions [AH90a]. Each processor must issue its memory operations in program order, all processors must observe write operations in the same order, and all loads and stores to shared memory must be completed atomically. Processor Consistency A second memory consistency model, processor consistency [GGH91], relaxes some of the constraints inherent to sequential consistency. Whereas sequential consistency does not, strictly speaking, allow any memory access reordering by the cache controller, processor consistency allows for further request pipelining, thereby enabling the use of interconnection networks with variable latency and the use of compiler optimizations that reorder or
  21. 21. overlap the execution of memory accesses. Specifically, read operations are allowed to bypass unrelated write operations that are waiting to be serviced, and the ordering constraints between writes and other processors' reads are relaxed. Many common parallel programming idioms used in sequentially consistent systems, such as flag-based synchronization, can still be executed under processor consistency without modification to their original program code. Weak Consistency Models A number of alternative memory consistency models have been proposed that allow for more extensive request pipelining. The weak consistency [AH90b] model does not impose any restrictions on memory access reordering by the memory system for segments of code located between application synchronization points. Release consistency [GLL+90] extends this concept one step further by distinguishing between the different types of synchronization operations. Multiprocessor systems that implement weak consistency models are harder to use from the perspective of a low-level system software programmer [Hil98]. Generally speaking, weak consistency models have fallen out of favor because they do not provide sufficient additional performance to justify their resulting increase in hardware and software complexity when high-performance speculative out-of-order processors are used whose hardware makes it appear that memory operations complete in order [Hil98]. The use of simple memory consistency models extends to recent commercial implementations [CH07] and to the work in this thesis, where processors that support a single outstanding memory operation are used. The easiest and most straightforward way to provide hardware support for cache coherence and sequential consistency in a multiprocessor system is by using a shared-bus
  22. 22. and a bus snooping protocol. This system organization is the subject of the following section. [Figure 2.3: Bus-based shared memory interconnect. Processors P0 to P3, each with a cache, share a bus with the memory and an arbiter.] 2.1.3 Bus-Based Snooping for Cache Coherence The bus interconnect provides natural support for cache coherence. The system illustrated in Figure 2.3 contains a bus interconnect located between the caches and shared memory. The shared nature of the bus forces processors to take turns using the interconnect. Each processor must first request access to the shared resource using an arbiter. This property of the bus is important as it forces the serialization of cache-related operations and ensures the visibility and atomicity of write operations, i.e., memory operations are performed one at a time and are simultaneously observed by all processors. This feature of the bus interconnect is employed to help provide cache coherence. When one processor is driving the bus, all other processors will inspect, or snoop, the bus address and command lines and take action as necessary. The actions taken by processors that are snooping the bus and by those that are driving a coherence request on the bus are described in the following section where the MSI cache coherence protocol is discussed.
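To see what this serialization provides, the program of Figure 2.2 can be replayed under the assumption that memory operations are performed one at a time and in program order, as on the bus described above. The following is a minimal sketch, not part of the thesis: the function names `run` and `sc_outcomes` are hypothetical, and the load and the store of each assignment (for example, load B and store U) are merged into a single step because U and V are each written by one processor only.

```python
from itertools import combinations

def run(schedule):
    """Execute one interleaving; schedule lists which processor acts at each step."""
    mem = {"A": 0, "B": 0, "U": 0, "V": 0}
    # Each operation is (destination, source); an int source is a constant,
    # a str source is another shared variable that is read first.
    programs = {0: [("A", 1), ("U", "B")],   # P0: A = 1; U = B
                1: [("B", 1), ("V", "A")]}   # P1: B = 1; V = A
    pc = {0: 0, 1: 0}
    for p in schedule:
        dst, src = programs[p][pc[p]]
        mem[dst] = mem[src] if isinstance(src, str) else src
        pc[p] += 1
    return mem["U"], mem["V"]

def sc_outcomes():
    """Collect (U, V) over every interleaving that preserves program order."""
    outcomes = set()
    for p0_slots in combinations(range(4), 2):
        schedule = [0 if i in p0_slots else 1 for i in range(4)]
        outcomes.add(run(schedule))
    return outcomes

print(sorted(sc_outcomes()))
```

Under these assumptions the enumeration covers all six interleavings, and (U, V) = (0, 0) never appears: that outcome requires an operation to complete out of program order, which is the reordering discussed in Section 2.1.2.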
  23. 23. [Figure 2.4: MSI cache coherence protocol. State diagram with the states Invalid, Shared (read-only), and Modified (read/write). Arcs are labeled with processor-initiated transitions (load, store) and with bus-initiated transitions in parentheses: (read), (read-ex or upgrade).] MSI Cache Coherence Protocol Caches in a multiprocessor system are more sophisticated than those in a uniprocessor system because a distinction must be made between a shared (read-only) cache block and a modified (read/write) cache block. As discussed in Section 2.1.1, this distinction is important to ensure that only a single processor is allowed to write to a shared address at a given time. The rules governing the transitions between different cache line states and the actions
  24. 24. taken by processors to ensure cache coherence are determined by a cache coherence protocol. The MSI cache coherence protocol is made up of the three states that are shown in Figure 2.4. The actions that initiate a transition from one state to another and any bus operation resulting from the transition are indicated by the arcs of the graph. Three distinct bus operations result from processor activity: read, read-exclusive, and upgrade. A read request is used to retrieve a cache block in a shared or read-only state. A read-exclusive request is used to retrieve a cache block in a modified or read-write state. An upgrade request is issued when a processor is caching a block in the shared state and wishes to modify its contents. The upgrade request changes, or upgrades, the state of the requested cache block from shared to modified. When any of these processor-driven requests is seen on the bus, all processors other than the one currently driving the bus must probe their caches to determine whether they are holding a valid copy of the requested block. This operation is known as snooping and the term snoop hit is used when a processor finds a matching block in its cache. On a snoop hit, the snooping processor may need to take action on its copy of the requested block, as determined by the cache coherence protocol. The type of action depends on the state of its copy and the type of request seen on the bus. On a snoop hit for a read-exclusive or upgrade request where the state of the block is shared, the specified cached block is invalidated. On a snoop hit for a read or read-exclusive request where the state of the block is modified, the main memory must be inhibited from responding, and the processor with the modified copy of the block provides the data reply on the bus and changes the state of its block to shared for a read request, or invalid for a read-exclusive request.
If the reply of modified data is for a read request, then the memory accepts the data as a sharing writeback, in addition to the requesting processor waiting for the data.
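The transitions described above can be summarized as two small lookup functions, one for processor-initiated accesses and one for snooped bus requests. This is a minimal sketch written for illustration, not the hardware described in this thesis: the names `State`, `processor_access`, and `snoop` are hypothetical, and only the cases stated in the text are modeled.

```python
from enum import Enum

class State(Enum):
    INVALID = "I"
    SHARED = "S"      # read-only
    MODIFIED = "M"    # read/write

def processor_access(state, op):
    """Return (next_state, bus_request) for a local 'load' or 'store'."""
    if op == "load":
        if state is State.INVALID:
            return State.SHARED, "read"        # miss: fetch a shared copy
        return state, None                     # hit in S or M, no bus traffic
    if op == "store":
        if state is State.INVALID:
            return State.MODIFIED, "read-ex"   # miss: fetch an exclusive copy
        if state is State.SHARED:
            return State.MODIFIED, "upgrade"   # already cached: upgrade only
        return State.MODIFIED, None            # hit in M
    raise ValueError(f"unknown operation: {op}")

def snoop(state, bus_request):
    """Return (next_state, supplies_data) for a request seen on the bus."""
    if state is State.MODIFIED and bus_request == "read":
        return State.SHARED, True              # reply and keep a shared copy
    if state is State.MODIFIED and bus_request == "read-ex":
        return State.INVALID, True             # reply and give up the block
    if state is State.SHARED and bus_request in ("read-ex", "upgrade"):
        return State.INVALID, False            # invalidate the shared copy
    return state, False                        # no action in the modeled cases

# Scenario of Figure 2.1: P0 and P1 both cache X in the shared state.
p0 = p1 = State.SHARED
p1, request = processor_access(p1, "store")    # P1 writes X
p0, _ = snoop(p0, request)                     # P0 snoops the upgrade
print(p0, p1, request)
```

When `supplies_data` is true, the memory is inhibited from responding, and for a read request it also accepts the data as a sharing writeback, as described above.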
  25. 25. [Figure 2.5: Bus-based snooping for cache coherence. Step 1: P1 (Write X=9) issues an upgrade request, and P0 snoops it and invalidates its copy of X. Step 2: P0 (Read X) issues a read request, and P1 snoops it and provides the response, after which P0's cache and the memory hold X: 9.]
  26. 26. To emphasize the importance of snooping in a bus-based multiprocessor and to further describe the cache coherence protocol discussed above, the coherence problem that was presented in Figure 2.1 can be reconsidered. Figure 2.5 shows processor P1's store operation to shared address X that is cached in a shared state. The cache coherence protocol in Figure 2.4 indicates that processor P1 must broadcast an upgrade request on the bus interconnect. Once the request is on the bus, processor P0 immediately inspects, or snoops, the upgrade request issued by processor P1. After processor P0's cache controller determines that it holds a valid copy of the requested block X, it proceeds to invalidate its shared copy, as required by the cache coherence protocol. A subsequent attempt to read address X by processor P0 reveals that there is no valid copy in the cache. The coherence protocol indicates that processor P0 must broadcast a read request on the bus. Processor P1 snoops this request, determines that its cache contains a copy of the requested block in the modified state, initiates a data reply operation that inhibits the normal operation of memory, supplies a copy of the modified block of data to processor P0, and changes the state of the supplied block to shared. The same data is also accepted by the memory (which was inhibited by P1) as a sharing writeback. Performance and Implementation Issues The major problem associated with bus snooping is that its design does not scale well when additional processors are added to the system. A bus interconnect provides a fixed amount of total memory bandwidth and each additional processor added to the system reduces the amount of bandwidth available to each processor. For systems that implement an on-chip bus, energy efficiency and the long, global wires required are also primary concerns. Implementing these global wires that span large sections of a chip requires the
  27. 27. use of signal repeaters that increase the likelihood that the interconnect falls on the system's critical path. A longer critical path reduces the system's clock speed and further reduces available interconnect bandwidth. Ho et al. [HMH01] provide a circuit-level description of these issues and conclude that the physical implementation issues related to global on-chip communication force the development of alternative latency-aware architectures. Improvements for Bus-Based Snooping A number of solutions have been proposed to overcome the problems mentioned in the previous section. These improvements include increasing the size of cache blocks, the use of a split-transaction bus, and adopting a hierarchical snooping scheme. Increasing the cache block size and the width of the data bus is an initial step to increase the available memory system bandwidth because more data is transferred during each bus request. Unfortunately, the resulting increase in the false sharing of coherent data limits this technique's effectiveness. False sharing occurs when multiple processors access different words contained in the same cache block. If a processor writes to an individual word, the other processors caching the same block would incur an unnecessary invalidation, i.e., not one due to accessing the same word in the block. The larger the cache block size, the more likely this situation is to occur. A split-transaction bus increases available memory system bandwidth by decoupling requests and responses, and by using physically separate buses to pipeline the operation of the memory system. As shown in Figure 2.6, the split-transaction bus differs from an atomic bus as it allows the overlap of pending memory transactions. Specifically, the idle time between when read request X in Figure 2.6 is placed on the bus, and when a response is provided, can be used by other processors to initiate new requests. The split-transaction
  28. 28. bus also provides support for out-of-order transactions where responses can arrive in a different order than their respective requests were issued. To support this type of operation, a mechanism to match each outstanding request with its eventual response is required. Request matching is performed by assigning a unique identifier tag to each bus transaction. After a request is serviced by the selected slave device, its tag is transferred to the generated response. Processors monitor the reply bus and only accept a response whose tag matches the tag of one of their outstanding requests. [Figure 2.6: Split-transaction bus example. On the atomic bus, Read X is followed by Reply X. On the split-transaction bus, Read X and Read Y are placed on the request bus while Reply X and Reply Y are returned on the reply bus.] [Figure 2.7: Two-level snooping hierarchy. Node 1 (P0 and P1 with caches, and a memory) and Node 2 (P2 and P3 with caches, and a memory) each include a coherence monitor.] Hierarchical snooping is another technique that can be considered. It distributes the
  29. 29. centralized nature of a single bus by arranging multiple buses in the form of a tree structure [CSG99]. Figure 2.7 illustrates such an architecture with a two-level hierarchy that assembles pre-existing bus-based multiprocessors into a larger system. This technique allows for these existing designs to scale in performance. Each leaf of the tree contains a bus-based multiprocessor, and the upper level of the tree does not contain processors. The top level is only used to propagate coherence traffic between leaf nodes. An alternative to using a bus interconnect for hierarchical snooping is to use a unidirectional ring interconnect. The work in this thesis uses two of the techniques that were described above, a wider data bus and a split-transaction bus organization, in addition to the ring network that will be discussed in more detail in the following section. 2.1.4 Ring-Based Snooping for Cache Coherence This section describes the use of the unidirectional ring interconnect as an alternative to the bus interconnect for the design of shared-memory multiprocessors. While the unidirectional ring interconnect is not as widely used as the bus for the design of multiprocessors, a number of systems do implement this type of architecture [CJJ98]. Unidirectional ring networks, such as the one shown in Figure 2.8, have properties favorable to the design and physical implementation of shared-memory multiprocessors. These properties include short point-to-point wires, decentralized arbitration, a simple access control mechanism, trivial routing logic, unconditional broadcasting, and the preservation of the order of requests. Together, these properties are exploited to help provide cache coherence to the shared-memory system. Much like in bus snooping, each processor, or set of processors, connected to a node on a ring must snoop all requests initiated by other processors and, if applicable, take
  30. 30. [Figure 2.8: Ring interconnection network and multiprocessor node. Node 1 contains P0 and P1 with caches, a memory, and a ring interface whose router has In and Out ports.] [Figure 2.9: Ordering of ring requests. Processors P0 to P3 are attached to the ring, on which requests A and B circulate.]
  31. 31. appropriate action as determined by the cache coherence protocol. Whereas processors on a bus snoop a request at the instant that it is seen on the interconnect, processors on a ring snoop requests individually as they go by. The delay between the time a request is made and the time the request is snooped by a remote node introduces specific, but manageable, issues regarding the total ordering of ring requests and how cache coherence is provided for these systems. The example in Figure 2.9 illustrates how processor P1 receives the circulating messages in {A, B} order, while processor P3 sees the messages in {B, A} order. Under most memory consistency models, the order of requests must be maintained between different processors. Several solutions have been proposed to enforce the total ordering of ring requests in order to satisfy memory consistency in a ring-based multiprocessor. Two approaches will be discussed in this section: the use of an ordering point [MH06] and the use of a greedy-ordering scheme [BD91, MH06, SST06]. Ring Ordering Points Ordering points may be used in all types of unordered networks to enforce the total ordering of requests. All requests must travel through a special ordering node, one at a time, before the request activates. This mechanism is particularly useful in a ring-based system where requests naturally travel through the ordering point instead of having to be explicitly routed through the network. While ordering points can be used in other types of interconnect, their use is not common since they have the potential to create a network traffic bottleneck at the input of the ordering node. Request latency is also higher when an ordering point is used because all packets must travel to the specified node before they are routed to their destination. This latency is effectively doubled in a ring when an ordering point is used.
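The ordering problem of Figure 2.9 and the effect of an ordering point can be reproduced with a small timing model. The sketch below is illustrative only and rests on several assumptions that are not stated in the thesis: the ring has four nodes numbered in the direction of travel, request A is injected at P0 and request B at P2 at the same instant, each hop takes one time step, and the ordering node is P0. The function names are hypothetical.

```python
def arrival_order(num_nodes, injections, observer):
    """Order in which `observer` snoops requests on a plain unidirectional ring.

    injections maps a request name to its source node; all requests are
    injected at time 0 and advance one node per time step.
    """
    times = {}
    for name, source in injections.items():
        hops = (observer - source) % num_nodes
        if hops:                      # a node does not snoop its own request
            times[name] = hops
    return sorted(times, key=times.get)

def ordered_arrival(num_nodes, injections, observer, ordering_node=0):
    """Order seen by `observer` when requests activate only at the ordering node."""
    times = {}
    for name, source in injections.items():
        to_point = (ordering_node - source) % num_nodes      # inactive leg
        from_point = (observer - ordering_node) % num_nodes  # active leg
        times[name] = to_point + from_point
    return sorted(times, key=times.get)

requests = {"A": 0, "B": 2}           # assumed sources: A at P0, B at P2
print(arrival_order(4, requests, observer=1))    # P1's view
print(arrival_order(4, requests, observer=3))    # P3's view
print(ordered_arrival(4, requests, observer=1))
print(ordered_arrival(4, requests, observer=3))
```

In this model the active leg has the same length for every request seen by a given observer, so the observed order depends only on when each request passed the ordering node, and every node therefore agrees on it.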
  32. 32. Ring Greedy Ordering A second approach to ensure the total ordering of requests in a ring network involves modifying the cache coherence protocol to add pending read and write states to each cache line. Unlike when an ordering point is used, all requests are activated immediately and the first request to reach its destination is the one serviced. This accepted request immediately initiates a transition to a pending state for the specified cache line. Subsequent requests for a cached memory block in a pending state are negatively acknowledged and are retried by the requesting processor at a later time. This approach ensures that no action can be taken on a block while a corresponding request or response packet is in flight on the network. Although a lower average request latency is obtained with a greedy-order request policy compared to the use of an ordering point, these locked cache lines have been shown to result in a potentially unbounded number of retries that may cause the starvation of requests [MH06]. Providing support for cache coherence using snooping is limited in its scalability because all requests stemming from cache misses must travel to all processors when a coherent operation is performed. Therefore, for shared-memory multiprocessors to scale beyond the number of processing elements that can be practically implemented using a bus or ring interconnect, alternative architectures must be explored. One possible solution is to use a more complex interconnection network and a directory protocol that associates an entry in a hardware structure for each of the system's memory blocks. This type of system is described in further detail in the following section.
  33. 33. [Figure 2.10: Directory-based multiprocessor architecture. Node 1 (P0 and P1 with caches) and Node 2 (P2 and P3 with caches) each contain a memory, a directory, and a network interface (NI) attached to the interconnect.]
  Table 2.1: Directory organization
  Memory                  Directory
  Address       Data      Node 1    Node 2    State
  0x10000000    100       1         0         SHARED
  0x10000010    525       0         1         INVALID
  0x10000020    49        1         1         MODIFIED
  2.1.5 Directory-Based Cache Coherence Directory-based cache coherence [CF78, HHG99] is used to enable the design of scalable multiprocessors. This organization is illustrated in Figure 2.10 where memory is shown to be physically distributed among the nodes of the system. A directory structure is used to track the state of each block of memory in the system. Each memory block has an associated entry that indicates its state (typically shared, modified, and invalid) and a bit to indicate its presence in each of the system's different processors, or nodes. Table 2.1 illustrates an example directory for the two-node system seen in Figure 2.10 that has a cache block size of 16 bytes. In directory-based systems, requests from processor cache misses first consult the directory structure located at the memory block's home node. The directory then responds
  34. 34. and/or forwards requests to other nodes that may have modified data or copies of shared data, who in turn, provide a response or an acknowledgment to satisfy the original request. The amount and frequency of these requests will vary depending on the application's memory sharing patterns. The key difference between a directory protocol and a snooping protocol is that under a directory protocol, the location of shared blocks of data is known in advance, while under a snooping protocol, the information must first be gathered by querying all of the processors. The principal disadvantage of a directory protocol is the indirection delay from the forwarding of requests that can result in longer latencies for servicing misses. The storage requirements of a directory can also be prohibitive, as the total amount of storage required in bits is proportional to the product of the total number of memory blocks and the number of nodes. For example, a 64-node system with 64 GB of total memory organized with a cache block size of 256 bytes requires 2 GB of directory storage for the presence bits alone (2^28 blocks with 64 presence bits each). Many schemes have been proposed to mitigate directory storage overhead [JH98, CSG99]. Reducing the number of bits per directory entry can be achieved by storing a limited number of pointers per entry that track which processors are caching a block or by using a coarse vector scheme where each of the entry's bits indicates a group of processors. Another technique for reducing the number of directory entries is to organize the directory as a tagged structure, i.e., as a cache, where in the event of a directory-cache miss, the controller can simply broadcast the request to all nodes. A more recent approach is to introduce additional logic and storage throughout the network in order to permit the service for certain operations to be initiated more quickly [EPS06].
For example, if an in-flight read request crosses a node that contains a shared copy of the requested block, it may obtain the block directly from this node instead of having to continue
  35. 35. to the directory at the home node of the requested address. 2.1.6 Token Coherence An alternative approach to handling cache coherence that does not require a totally-ordered interconnect and avoids the request indirection delay of directory schemes is Token Coherence [MHW03b, MHW03a, MBH+05]. This approach associates N tokens to each memory block, where N is the number of processors in the system. A processor must obtain at least one of the block's tokens before a read can be performed and it must obtain all of the block's tokens before a write can be performed. This approach therefore ensures the serialization of writes and can be used on any type of interconnect network. Tokens are physically exchanged between the different processors in the form of a token packet. Token starvation may occur for writes when a processor fails to receive all tokens for the requested block of memory in a bounded amount of time. This scenario is resolved by using a special type of high priority request that forces remote processors to immediately forward all their tokens for the requested block to the writing processor. While token coherence is conceptually simple, it presents design implementation issues related to network livelock and its performance figures are yet unproven in actual hardware. Directory-like storage is also required at main memory because the memory will retain all tokens for uncached blocks. 2.2 Coarse-Grained Coherence Tracking Recent studies have shown that the spatial locality of cache activity extends beyond individual cache blocks to larger regions of memory, where a region is a contiguous portion of
  36. 36. memory whose size is a power of two [Mos05]. Figure 2.11 illustrates the mapping from memory address to a 64-kbyte region index.
  32-bit address: the upper 16 bits (31 to 16) select the region
  Address Range              Region
  0x00000000 - 0x0000FFFF    0
  0x00010000 - 0x0001FFFF    1
  ...                        ...
  0xFFFF0000 - 0xFFFFFFFF    65535
  Figure 2.11: Address to 64-kbyte region mapping
  Cantin et al. [CLS05] demonstrate that for a set of commercial, scientific and multiprogrammed workloads, coarse-grained coherence tracking techniques can eliminate 55-97% of all coherence-related broadcasts and improve system performance by 8.8% on average. The following section describes two region-based techniques that have been proposed for bus-based multiprocessors: RegionScout and Region Coherence Arrays. 2.2.1 RegionScout RegionScout [Mos05, CSL+06] exploits extended spatial locality to dynamically detect most non-shared regions using two imprecise filters. Coarse-grained information is collected and is used to filter unnecessary coherence broadcasts and avoid unneeded snoop operations. Benefits include an increase in available memory bandwidth and a reduction in energy expended for snooping. The first of the two structures, the Cached Region Hash (CRH), records a superset of all the regions that are locally cached by a single processor. The CRH is a non-tagged hash
  37. 37. table that contains a fixed number of entries similar to a Bloom filter [Blo70] or to inclusive-JETTY [MMFC01]. Each entry is associated with multiple regions of memory. Mapping more than one region to an entry is not required, but it keeps the size of the structure as small as possible without having a significant impact on filtering performance. The size of this structure described in previous work [Mos05, CSL+06] is on the order of 1-4 kbits. Each entry in the table contains both a count field and an optional presence bit. The count field for each entry indicates the number of blocks cached in the set of all regions that map to the entry. The presence bit for each entry is set when the count field is equal to zero. A set presence bit indicates that, for the set of regions that map to the entry, the processor is not caching a single block and that these regions are non-shared for this processor. The CRH is updated by monitoring all local cache allocations, replacements, and evictions for a single processor. When a cache block is allocated, the count of the corresponding CRH entry is incremented, and when a cache block is replaced or evicted, the count of the corresponding CRH entry is decremented.

The second structure, the non-shared region table (NSRT), is a cache of memory regions discovered to be non-shared for all other processors. The contents of the NSRT are maintained by snooping the CRH structures in other processors. For each request on the bus, if all snooping processors report that the CRH counts are zero for the entry corresponding to the requested address, the requester assumes that the region is non-shared. Based on this information, entries are added and removed from the NSRT. The information maintained in the NSRT is used by processors to avoid unnecessary coherence broadcasts and snoop operations, in order to increase available memory bandwidth and to reduce energy use.
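The interaction between the two RegionScout structures described above can be illustrated with a short sketch. It is not taken from the cited work: the table size, the hash function, and all class and method names are assumptions made for illustration, and the NSRT update rule is simplified to the outcome of a single bus request. The 64-kbyte region size follows Figure 2.11.

```python
REGION_BITS = 16      # 64-kbyte regions, matching Figure 2.11
CRH_ENTRIES = 256     # hypothetical table size, chosen for illustration


def region_of(addr: int) -> int:
    """Return the region index of a 32-bit address."""
    return (addr & 0xFFFFFFFF) >> REGION_BITS


class CachedRegionHash:
    """Non-tagged table of block counts; several regions share one entry."""

    def __init__(self, entries: int = CRH_ENTRIES) -> None:
        self.counts = [0] * entries

    def _index(self, addr: int) -> int:
        # Hypothetical hash: low-order bits of the region index.
        return region_of(addr) % len(self.counts)

    def on_allocate(self, addr: int) -> None:
        self.counts[self._index(addr)] += 1

    def on_evict(self, addr: int) -> None:
        i = self._index(addr)
        if self.counts[i] == 0:
            raise ValueError("eviction without a matching allocation")
        self.counts[i] -= 1

    def may_cache(self, addr: int) -> bool:
        """False is exact (no block of the region is cached locally);
        True may be a false positive caused by regions sharing an entry."""
        return self.counts[self._index(addr)] != 0


class NonSharedRegionTable:
    """Regions that every other processor reported as not cached."""

    def __init__(self) -> None:
        self.regions = set()

    def update(self, addr: int, remote_crhs) -> None:
        # Outcome of one bus request: the region is treated as non-shared
        # only if every remote CRH count for the matching entry is zero.
        if any(crh.may_cache(addr) for crh in remote_crhs):
            self.regions.discard(region_of(addr))
        else:
            self.regions.add(region_of(addr))

    def can_skip_broadcast(self, addr: int) -> bool:
        return region_of(addr) in self.regions
```

Because the CRH is non-tagged, a zero count is a definite answer while a non-zero count is only a hint; this asymmetry is what makes the filter safe for coherence, since it can cause an unnecessary broadcast but never suppress a necessary one.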
  38. 38. Through simulation experiments, RegionScout is shown to favorably exploit the coarse-grain sharing patterns of shared-memory applications. With the simple structures described above, coherence broadcasts are reduced by between 34% and 88%, and the potential to reduce the number of snoop operations remains above 30% for the largest region size (16 kbytes) and 37% for the smallest region size (256 bytes), for most applications [CSL+06].

2.2.2 Region Coherence Arrays

Region Coherence Arrays (RCA) [CLS05] use cache-like tagged structures to precisely record regions of memory for which blocks are cached and to associate a region coherence state with each entry in the structure. Like RegionScout, the information in RCAs is used to avoid unnecessary coherence broadcasts and to filter unnecessary cache tag lookups. A parallel can be drawn between the state of a region and the state of a much smaller cache block; just as a bus snooping protocol determines the actions for individual cache lines, the region protocol determines the actions for regions of memory. The region protocol determines the region state, the transition to the next region state, and the filtering status of each region cached in the RCA. For example, a region in the dirty-invalid state does not require any coherent broadcasts, whereas a region in the dirty-clean state requires broadcasts only for modifiable blocks. When compared to RegionScout filters, RCAs require more storage and are more complex. They provide slightly better filtering accuracy, however.

2.3 Enhanced Ring Snooping for Cache Coherence

The following section discusses a recent enhancement for snooping ring multiprocessors. The technique under investigation aims to minimize snoop request latency.
  39. 39. 2.3.1 Flexible Snooping

One of the main problems associated with ring snooping is the increased latency for cache misses that is incurred from all requests being snooped by all nodes on the ring. The Flexible Snooping algorithm [SST06] has been proposed to address this issue. Unlike RegionScout, the Flexible Snooping algorithm does not aim to filter coherence traffic. Its goal is to reduce latency in a unidirectional ring multiprocessor by modifying the algorithm used to ensure cache coherence.

The basis for Flexible Snooping is related to two existing approaches for ring snooping. The lazy forwarding algorithm states that a processor, or set of processors, connected to a node on the ring must snoop each passing coherent request and wait for the snoop reply before forwarding the request to the following node. The second approach, eager forwarding, allows the coherent request to be forwarded to the following node before initiating its snoop operation. Upon completion of the snoop operation, the resulting snoop reply is combined with the snoop outcome of the preceding nodes, which is a second message that trails the original request. The revised snoop outcome is forwarded to the next node where the process is repeated. These two approaches represent opposite ends of a spectrum. On one hand, lazy forwarding has a higher snoop request latency and generates a lower amount of ring traffic, whereas eager forwarding has a comparatively lower latency, but generates on average double the amount of ring traffic.

The Flexible Snooping algorithm introduces an intermediate solution to the two algorithms described above. In this new algorithm, a node first consults a predictor table before deciding on whether it should first service an inbound snoop request and then forward it to the following node, or whether it should first forward the request then perform the snoop operation.
The predictor structures in question track whether or not the node can provide a
  40. 40. response to a passing request for the specified address.

For Flexible Snooping, three different predictor structures with varying characteristics are described for snoop responders on a ring. The subset algorithm uses a cache-like tagged structure that tracks a small number of addresses for which a node is a known supplier to a matching request. The superset supplier predictor algorithm is a non-tagged JETTY [MMFC01] filter augmented with an exclude cache. The superset algorithm is similar to the CRH structure used in RegionScout; however, unlike the CRH that uses a single count per address, the structure used in Flexible Snooping divides the address into multiple fields and assigns a count to each field. To determine whether an address is not contained in the table, the superset supplier predictor must verify that all count entries are equal to zero. An exclude cache is used to reduce the number of false positives inherent in the use of these types of filters. A false positive occurs when, for a given address, a node is predicted to supply a response but does not actually do so. Finally, the exact algorithm is an enhanced subset algorithm that makes its prediction using information stored at the cache-line granularity by using additional state bits and a more complex cache coherence protocol.

Supporting the Flexible Snooping algorithm requires an enhanced MESI coherence protocol [CSG99]. The new coherence protocol adds an additional tagged state and a global/local qualifier to the shared state. These modifications require more complex cache controllers that are undesirable in the context of the work in this thesis. This thesis uses a structure that tracks data in a similar fashion as the superset supplier predictor, but it is used to filter packets from entering the ring network instead of providing routing hints to help improve snooping latency.
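The latency and traffic trade-off between lazy and eager forwarding described in this section can be made concrete with a simplified timing model. The model is an illustration only, not a result from [SST06]: it assumes a fixed per-hop delay and a fixed snoop delay at every node, ignores contention, and uses hypothetical function and parameter names.

```python
def lazy_forwarding(nodes: int, hop: int, snoop: int):
    """Each node snoops first and then forwards the combined request/reply.

    Returns (latency, messages) after the request has visited `nodes` nodes.
    """
    latency = nodes * (hop + snoop)   # the snoop delay is paid at every node
    messages = nodes                  # one message per ring hop
    return latency, messages


def eager_forwarding(nodes: int, hop: int, snoop: int):
    """Each node forwards the request at once and snoops in parallel.

    A trailing reply message carries the combined snoop outcome, so every
    hop sees two messages instead of one.
    """
    request_arrival = 0
    reply_ready = 0
    for _ in range(nodes):
        request_arrival += hop                    # request is not delayed
        local_done = request_arrival + snoop      # local snoop completes
        # The outgoing reply needs both the local snoop result and the
        # trailing reply from the preceding node.
        reply_ready = max(reply_ready + hop, local_done)
    return reply_ready, 2 * nodes


# Example with 8 nodes, a 1-cycle hop, and a 4-cycle snoop.
print(lazy_forwarding(8, 1, 4))    # (40, 8)
print(eager_forwarding(8, 1, 4))   # (12, 16)
```

Under these assumptions the snoop delays overlap in the eager scheme, so the latency approaches the pure ring traversal time at the cost of twice the message count, which matches the qualitative comparison in the text.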
  41. 41. 2.4 Summary

This thesis focuses on the design of shared-memory multiprocessors for application-specific implementations. Implementing a shared-memory multiprocessor requires providing a solution to the cache coherence problem. Traditional solutions such as using a bus interconnect and a snooping protocol do not scale when a large number of processors are considered. Snooping on a unidirectional ring-based network is an alternative solution. Although directory-based cache coherence and an arbitrary interconnect can be used, when application-specific implementations are considered, the associated memory and logic overhead are undesirable. Of particular significance to the work in this thesis is the background on coarse-grained coherence tracking techniques and the methods to reduce the latency of coherent requests in ring-based multiprocessors. This thesis explores, by means of a prototype implementation of an architectural enhancement for ring multiprocessors, an intermediate solution that combines all of the above techniques for large-scale application-specific implementations. One key goal is to reduce the complexity of such techniques while maintaining the necessary functionality. The following chapter describes the multiprocessor system that was developed as the basis for the prototype implementation and a subsequent chapter discusses the proposed enhancements.
  42. 42. Chapter 3

Multiprocessor System Architecture

A multiprocessor system was developed to support the research goals of this thesis. An overview of the different components of this system is illustrated in Figure 3.1. The architecture is composed of two or more bus-based symmetric multiprocessor nodes that are interconnected using a unidirectional bit-parallel slotted ring. The multiprocessor node in Figure 3.1 includes a common memory, two or more MIPS-derived pipelined 32-bit processors with coherent data caches, a split-transaction bus with separate lines for requests and responses, and an interface for the system-level ring interconnect. This ring interface is attached to the system interconnect using a bidirectional pair of FIFO buffers. In each of the nodes of the ring, a register is used to store circulating packets, and an appropriate amount of control logic is used to route packets between adjacent nodes. Processors access a globally shared memory that is physically distributed among the different multiprocessor nodes on the ring. Cache coherence within the multiprocessor node is enforced through bus snooping, while coherence between the different nodes on the ring is supported by a reduced-complexity ring snooping protocol. This chapter provides a high-level description of each of the components described
  43. 43. above, including the pipelined processor, the cache controller, the split-transaction bus, the I/O peripherals, and the unidirectional ring network. The chapter then concludes with an overview of the reduced-complexity cache coherence protocol and a description of the system’s memory consistency model.

Figure 3.1: Multiprocessor system node overview

3.1 Pipelined Processor

The 32-bit pipelined processor that is used in the multiprocessor system in Figure 3.1 is the subject of this section. Several features of the processor are discussed below, including its instruction pipeline, the structure of its instruction and data caches, and its supported instruction set.

3.1.1 Instruction Pipeline

The instruction pipeline is the key feature of the processor under consideration. A pipeline is used to shorten the processor’s cycle time and to reduce the number of required clock cycles per instruction (CPI), thereby increasing overall processing throughput [HP03]. The
  44. 44. pipeline itself, shown in Figure 3.2, contains the following stages: instruction fetch (IF), instruction decode and register fetch (ID), execute (EX), memory access (MEM), and write-back (WB). The instruction fetch stage retrieves the instruction at the address indicated by the processor’s program counter register. The instruction decode stage generates control signals for the fetched instruction and reads the requested register values from the register file. The execute stage provides the instruction’s arguments to the ALU and computes the desired result. The memory access stage performs any load or store instructions on the data cache. Finally, the writeback stage writes the final result back to the register file.

Figure 3.2: Instruction pipeline

Pipeline hazards are resolved using a forwarding mechanism and by stalling specific instructions. Branches are resolved in the memory access stage using a static branch predictor that assumes all branches are not taken. In the event of a taken branch, the pipeline is flushed to remove all incorrectly predicted instructions from the pipeline. A miss in the instruction cache or the data cache causes the entire pipeline to stall until both an acknowledgment and
  45. 45. a valid block of data are returned from the cache controller.

3.1.2 Instruction and Data Caches

The processor’s instruction and data caches are shown in Figure 3.2. The instruction cache stores read-only machine code and requires a single bit per block to represent the block’s state information, i.e., valid (V) or invalid (I). The data cache uses two bits per block to distinguish between a valid block that is clean and a valid block that has been modified, as well as an invalid block. The three states of interest are represented as: modified (M), shared (S), and invalid (I). Blocks held in the data cache are subject to the MSI cache coherence protocol described in Section 2.1.3 with a write-back, write-invalidate policy. The data cache uses duplicate tag and state information to support the snooping protocol.

3.1.3 Instruction Set Architecture

The processor supports a subset of the MIPS R3000 instruction set architecture [Kan89]. The supported instructions are summarized in Table 3.1. The pipeline provides no support for exception handling. Furthermore, it does not contain an unaligned load/store unit, a floating-point unit, a translation lookaside buffer (TLB), or a multiply and divide unit. The omission of these functional units is reflected in the list of supported instructions. The GNU C compiler and a custom linker script are used to translate high-level C and assembly code to MIPS machine code. Full binary compatibility is not maintained by the implemented processor for MIPS executable files. For example, the use of delay slots for branch and load instructions is not supported, nor is the use of unaligned load and store operations. The limited number of available instructions impacts the nature of applications supported by the processor, but for the purposes of this work, the instructions that have
  46. 46. Table 3.1: Supported instruction set

Category             Instruction             Description
Arithmetic           add, addu, addi, addiu  add
                     sub, subu               subtract
Logical              and, andi               logical and
                     or, ori                 logical or
                     xor, xori               logical xor
                     nor                     logical nor
Shift                sll, srl, sra           shift left, right
                     sllv, srlv, srav        shift left, right variable
Data transfer        lw                      load word
                     sw                      store word
                     lui                     load upper immediate
Conditional branch   beq                     branch on equal
                     bne                     branch not equal
                     slt, sltu               set less than
                     slti, sltiu             set less than immediate
Unconditional jump   j                       jump
                     jr                      jump register
                     jal                     jump and link
  47. 47. been implemented are sufficient to support compiler-generated code that can be used to validate the operation of the system-level interconnect and of the filter structures that are proposed in Chapter 4.

3.2 Split-Transaction Bus

In a coherent multiprocessor system, processors must be able to communicate with each other as well as with memory and I/O devices. The following section discusses this communication mechanism in more detail.

Each multiprocessor node in the system architecture under consideration uses a split-transaction bus to communicate between master and slave bus devices. A master device may initiate, or respond to, a bus transaction, whereas a slave device may only provide a response to a bus transaction. As was discussed in Section 2.1.3, a split-transaction bus decouples requests from responses and allows for increased parallelism and throughput by assigning a unique tag to each pending bus operation. This tag is used by processors to match each outstanding request with its eventual response.

One important complication arises from the use of a split-transaction bus. Under a certain scenario, a processor may wish to issue a read request for an address that has previously been requested for exclusive access by another processor that seeks to write to the same address, but whose response has not been received. This overlap between such pending read and write requests for the same block from different processors is known as a conflicting request. Conflicting requests must be avoided in order to guarantee correctness and to ensure cache coherence. Tag tracking tables located in each processor’s cache controller are used by many systems [CSG99] to avoid these conflicting requests. The system for this thesis employs a simplified version of a tag tracking table that is attached to the
  48. 48. bus as a slave device and uses a negative-acknowledge wired-or line to signal all devices that a new request is conflicting. The table maintains a list of current outstanding requests and negatively acknowledges any conflicting request after it is issued on the bus. Having a single table reduces implementation complexity when compared to placing separate tables in each processor’s cache controller, and this approach is appropriate for a prototype implementation.

Figure 3.3: Request bus timing (address, command, tag, NACK, and inhibit lines over clock cycles 1 and 2)

3.3 Cache Controller

The processor’s cache controller has two primary functions. The first is to service processor requests that miss in a cache. Each time a processor requests information from memory that is not presently available in its cache, the cache controller retrieves the requested block in the requested state. In a multiprocessor cache, a store operation cannot proceed until the block of memory is cached in a modified state, and a load operation cannot proceed until the block is cached in a shared or modified state. The controller supports, at most, a single
  49. 49. outstanding request for either the instruction cache or the data cache, and the processor itself is stalled while the miss is serviced. The other function of the cache controller is to inspect, or snoop, the requests of other processors on the bus and to take appropriate action as determined by the MSI cache coherence protocol described in Section 2.1.3.

Bus operations are performed by the controller using the two-clock-cycle schedule shown in Figure 3.3. At the beginning of the first clock cycle, the request address, command, and tag lines are asserted by the cache controller. The tag for each processor is its numeric identifier; because only one outstanding request is permitted per processor, the tag will be unique. In the same cycle, all other processors snoop the request that is currently on the bus using their duplicate tags. During the second clock cycle, the snooping processors will update the state of a matching valid block in their respective caches and, if applicable, assert their memory inhibit lines and initiate a response to the current read or read-exclusive request. Finally, the request may be negatively acknowledged in either of the two clock cycles due to a conflicting request or lack of buffer space, which will force a subsequent retry.

3.4 Memory and I/O Devices

Three types of slave devices on the bus support system operation: an embedded RAM, an embedded ROM, and a general-purpose input-output (GPIO) register. An input buffer is used to store pending requests for each device in order to support the pipelined nature of the split-transaction bus. The location of this buffer, in relation to the slave device and to the bus, is shown in Figure 3.4.
  50. 50. Figure 3.4: Buffering for memory and I/O devices

3.5 Unidirectional Ring Network

Among the different interconnection networks that can be used for interprocessor communication, unidirectional ring networks are a natural choice for large-scale shared-memory multiprocessors. The organization and memory hierarchy of the large-scale multiprocessor considered in this thesis are illustrated in Figure 3.5. In this section, the unidirectional bit-parallel slotted ring network that is used to connect nodes of bus-based shared-memory multiprocessors is described. The multiprocessor nodes contain the processors described in Section 3.1, the split-transaction bus described in Section 3.2, and the cache controller described in Section 3.3. The primary function of the unidirectional ring network is to provide processors with access to the globally-shared memory that is physically distributed among the different nodes on the ring. The interface between the multiprocessor node and the ring interconnect is made up of
  51. 51. Figure 3.5: Ring network

Figure 3.6: Ring buffers
  52. 52. three components that are illustrated in Figure 3.6. The ring interface connects the multiprocessor node’s split-transaction bus to the ring router using two separate FIFO buffers. One of the buffers in Figure 3.6 is used for inbound requests and the other is used for outbound requests. Inbound requests originate from the ring network and travel to the multiprocessor node’s bus, and outbound requests originate from the node’s bus and travel to the ring. The ring interface monitors both the request bus and the reply bus for specific coherent requests and data responses that must be sent outside of the local multiprocessor node. The ring interface also acts as a bus master for incoming requests, or responses, from the ring that must be serviced locally. The remainder of this section addresses the high-level architecture of the ring interface, the ring router, and issues related to flow control. First, however, a distinction must be made between local and remote addresses in the context of this system.

3.5.1 Local and Remote Addresses

Each memory address in a distributed shared-memory system is assigned a home node. The home node indicates the physical location of the memory address. Table 3.2 illustrates the mapping strategy used in this system where the top four address bits represent the index of the address’s home node. To facilitate future discussion, local and remote addresses must be distinguished with respect to their home node.

Table 3.2: Address mapping

Address Range              Home Node
0x00000000 - 0x0FFFFFFF    0
0x10000000 - 0x1FFFFFFF    1
...                        ...
0xF0000000 - 0xFFFFFFFF    15
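The mapping of Table 3.2 amounts to extracting the top four bits of a 32-bit address. A minimal sketch follows; the function names `home_node` and `is_local` are hypothetical and are introduced only for illustration.

```python
HOME_NODE_SHIFT = 28   # the top four bits of a 32-bit address (Table 3.2)
HOME_NODE_MASK = 0xF


def home_node(addr: int) -> int:
    """Return the index of the node whose memory holds this address."""
    return (addr >> HOME_NODE_SHIFT) & HOME_NODE_MASK


def is_local(addr: int, node_id: int) -> bool:
    """An address is local to a node when that node is its home node."""
    return home_node(addr) == node_id


print(home_node(0x10000000))    # 1
print(is_local(0x10000000, 1))  # True
print(is_local(0x10000000, 0))  # False
```

With four index bits, this scheme addresses at most 16 home nodes, each owning a contiguous 256-Mbyte range of the address space.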
  53. 53. Table 3.3: Rules for outbound packet generation

Bus command   Bus address   Broadcast to all ring nodes   Ring packet destination
Read          Local         N/A                           N/A
              Remote        No                            Home node of requested address
ReadEx        Local         Yes (as an invalidation)      Originating node (after visiting all nodes)
              Remote (*)    No                            Home node of requested address
Upgrade       Local         Yes (as an invalidation)      Originating node (after visiting all nodes)
              Remote (*)    N/A                           N/A
(*) Non-coherent operation

A local address indicates that the physical location, or the home node, of the memory address is contained within the requesting processor’s node. A remote address indicates that the home node of the specified memory address is located outside of the requesting processor’s node. For example, with reference to Table 3.2, address 0x10000000 is a local address for processors contained in node 1 and a remote address for all other processors.

3.5.2 Ring Interface

The ring interface in Figure 3.6 is used to send coherent requests and data responses from the multiprocessor node’s bus to the ring, and to issue coherent requests or responses originating from the ring to the node’s bus. Outgoing requests and responses are assembled for transmission on the ring using a packet structure that is large enough to accommodate the data for a cache block. Packets are moved between nodes in the packet-switched network using the packet router at each node. The information contained in the packet’s header is used to control these routers.

An inbound packet can represent a remote request for a local block of data, or a data
  54. 54. reply for a previously requested block of remote data. Depending on the type of packet, the ring interface acts either as a master device or as a slave device, and it behaves similarly to the blocking cache controller described in Section 3.3. Incoming remote requests are buffered by the ring interface and are serviced one at a time, in the same fashion as a cache miss.

The ring interface also inspects the request bus and the response bus for any information that must be transmitted to remote nodes. The interface snoops both buses in a similar fashion to the cache controller. Outbound packets are assembled for transmission on the ring using the current request or response seen on the bus and the rules listed in Table 3.3. These rules do not ensure full cache coherence and result in a reduced-complexity cache coherence protocol that is discussed in Section 3.6. The assembled packet is then latched in the outbound buffer for the interface where it waits in FIFO order to be transmitted on the ring.

The rules for outbound ring packet generation listed in Table 3.3 specify two different types of ring packets. Any read or read-exclusive request for a remote address generates the first type of packet: a ring read/read-exclusive request that is sent only to the home node of the remote address seen on the bus. The second type of packet is generated by any read-exclusive or upgrade request for a local address: a ring invalidation packet is broadcast to all nodes. This broadcast packet causes an upgrade request to appear on the local bus of each visited node to ensure the propagation of writes. This operation effectively invalidates all remotely cached copies of the requested block. All other types of requests seen on the local bus are ignored by the ring interface.
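The outbound rules of Table 3.3 can be expressed as a small decision function. The sketch below is illustrative only and does not reflect the hardware-description code of the prototype: the `RingPacket` fields, the command strings, and the function names are assumptions, and the home node is derived from the top four address bits as in Table 3.2.

```python
from typing import NamedTuple, Optional


class RingPacket(NamedTuple):
    """Hypothetical packet header fields used for illustration."""
    kind: str          # "Read", "ReadEx", or "Invalidate"
    broadcast: bool    # True if the packet must visit every ring node
    destination: int   # node at which the packet is removed from the ring
    address: int


def home_node(addr: int) -> int:
    # Top four address bits select the home node (Table 3.2).
    return (addr >> 28) & 0xF


def outbound_packet(command: str, addr: int, node_id: int) -> Optional[RingPacket]:
    """Return the ring packet for a bus request, or None if it is ignored."""
    local = home_node(addr) == node_id
    if local and command in ("ReadEx", "Upgrade"):
        # A write to a local block: broadcast an invalidation that returns
        # to the originating node after visiting all other nodes.
        return RingPacket("Invalidate", True, node_id, addr)
    if not local and command in ("Read", "ReadEx"):
        # A remote read or read-exclusive goes only to the home node.
        return RingPacket(command, False, home_node(addr), addr)
    # Local reads and remote upgrades generate no ring traffic.
    return None


print(outbound_packet("Read", 0x20000040, node_id=1))
print(outbound_packet("Upgrade", 0x10000040, node_id=1))
print(outbound_packet("Read", 0x10000040, node_id=1))   # None
```

Writing the rules this way makes the asymmetry of the protocol visible: only writes to local addresses generate broadcasts, while every remote request is a point-to-point message to a single home node.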
Ring outbound buffer contention arises if both a request and a response that require outbound servicing by the ring interface appear on the node’s split-transaction bus in the same clock cycle. If such contention arises, a simple arbitration mechanism is used to ensure
  55. 55. that only a single packet is granted access to the outbound buffer at a given time. Priority is given to the data found on the response bus, and the request found on the request bus is negatively acknowledged. This negative acknowledgment forces the requesting processor to retry its request at a later time. If the outbound buffer to the ring is full, the response from the slave device may also be negatively acknowledged, and both the requester and the responder must arbitrate with other devices on the bus for a retry.

3.5.3 Routing Logic

The ring router in Figure 3.6 is responsible for ensuring that packets are delivered to the specified destination node or are broadcast once around the ring. The routing information that is embedded in the packet’s structure enables the ring router to make the appropriate routing decisions.

The function of the ring router is two-fold. First, in each ring clock cycle, a routing decision is made for a valid packet originating from the preceding ring node using the routing information found in the packet’s header. The router is capable of three different operations: it can accept the packet into the inbound buffer to the multiprocessor, forward the packet to the following node, or remove the packet from the ring altogether. In some circumstances, it may perform more than one of these operations simultaneously; for instance, the router may accept a packet and forward it to the following node. When an incoming packet is removed from the ring, the ring register’s valid bit is reset and no queued outgoing packet from the local bus is placed in the ring register. This behavior creates a free slot that is passed to the following node in order to help reduce the likelihood of starvation.

The second function of the ring router is to initiate the transmission of an outgoing packet from the local node. When a free slot, as described above, is made available by the
  56. 56. preceding ring node, the local router may dequeue a packet from the outbound buffer of the local node and transmit its contents to the following node. A new packet may be broadcast to all nodes or routed directly from its source node to a destination node specified in the packet’s header.

3.5.4 Flow Control

A flow control mechanism is necessary to avoid packet loss in the event of buffer overflow. If any one of the inbound buffers at the ring nodes in Figure 3.6 becomes full, a signal is asserted to the other nodes that temporarily prevents any new packets from entering the ring. Once a packet is consumed from the full buffer, the signal is deasserted and normal ring operation will resume. If a ring outbound buffer is full, each new outbound request or response seen on the node’s local bus is negatively acknowledged and is retried at a later time.

3.6 Reduced-Complexity Cache Coherence

In many application-specific workloads, full cache coherence support is not crucial. For example, consider software applications that perform pipelined computations on streaming data where a producer-consumer relationship is established between nodes. This type of system operation is achieved by assigning each node of the multiprocessor system a portion of the streaming application’s required computation. The producers write to local memory and the consumers read from remote memories. Flag-based synchronization primitives are used to signal the consumer(s) when to initiate their read of the produced data. These flags are supported through an ordinary spinning mechanism where a processor in one node
  57. 57. spins in its cache on a variable that is written by a processor in a second node. Once data is transferred using remote read operations, the consumer node can begin its portion of the required computation on the streaming data. Full cache coherence through bus snooping is still applicable, and typically quite useful, for processors within individual nodes.

The design of the ring interface and of the ring router described in the previous section implements support for such a reduced-complexity cache coherence protocol. The protocol provides limited coherence support for processors between nodes and full coherence support to those contained within individual nodes. The architecture of this system is suitable for application-specific designs, particularly for those with streaming processing workloads as described above.

The protocol restricts coherent processor write operations to addresses contained in the portion of the shared memory that is local to the node containing the processor. This limitation provides some benefits in reducing implementation complexity and represents a trade-off between architectures supporting either full system-wide cache coherence for shared memory, or explicitly-scheduled message passing without shared memory. Implementation complexity is reduced because the logic and storage overhead associated with either directory-like structures or with more general ring snooping that would otherwise be required at each node are not necessary for this system architecture.

In a fully cache-coherent system, the appearance of a total, serial order on memory accesses to the same shared location must be observed. In this system, the total order is enforced by each multiprocessor node, for its local addresses only, using the serialization properties of the node’s bus interconnect.
The implemented behavior of the ring interface and of the ring router simplifies issues related to the total ordering of remote write opera- tions. By adopting the convention whereby coherent remote writes are not supported, there
will never be any contention between processors located in different nodes for ownership of, or permission to write to, a block of remote data. The only source of write contention that may occur for data is between individual processors in the home node for an address, a situation that is already resolved by the serial nature of the bus.

The main benefit of the reduced-complexity approach is that, for a remote read request, nodes other than the home node for the address are not required to take any snoop-initiated action as the read request traverses the ring. Therefore, the read request can be routed directly to its destination node. Caches in the nodes located between the source and destination node do not need to snoop the circulating read request because they will never contain a modified copy of the requested block, and therefore, these caches will never supply a response for a circulating read request. Furthermore, a read request for a local address does not exit its home node. This behavior is comparable to the Oracle approach in the Flexible Snooping algorithm [SST06], where the responding node is effectively known in advance.

The ring invalidation packet that is generated by the ring interface as a result of a read-exclusive or upgrade request for a local address causes an upgrade request to be initiated on each of the remote nodes, thereby invalidating all remotely cached copies of the requested block in the system and ensuring coherence. It should be noted, however, that the requesting processor is allowed to proceed while the corresponding ring invalidation is still circulating. The implications of this behavior manifest themselves in the system’s memory consistency model that is discussed in the following section.

3.7 Memory Consistency Issues

The issue of memory consistency must be addressed for this multiprocessor system. 
The purpose of this section is to relate the operation of this system to the memory consistency
models that were described in Section 2.1.2. To understand how the memory system operates, consider the operation of the cache controller. The implemented controller stalls the processor each time that it must service a cache miss and allows the processor to resume operation once the data reply for a read, or read-exclusive, request is returned from memory. For upgrade requests, the processor resumes operation once the request is granted access to, and is driven on, the request bus. When all processors on the bus operate exclusively on local addresses and the ring interface is not involved, the sequential consistency model is observed because all read and write operations are atomic and they complete in the same order in which they were issued.

Unidirectional ring networks do not observe the total ordering of requests unless an ordering point or a pending/retry mechanism is used. Both of these methods were discussed in Section 2.1.4. The system for this thesis does not include the added complexity for the total ordering on the ring, hence a different consistency model applies when remote addresses are considered, and local writes and remote reads are performed: processor consistency as discussed in Section 2.1.2.

[Figure 3.7: Memory consistency timeline. Node 0 executes A=1, then U=B: Upgrade A (broadcast ring invalidation packet), followed by Load B. Node 1 executes B=1, then V=A: Upgrade B (broadcast ring invalidation packet), followed by Load A. Each timeline is drawn along a time axis t.]

Consider the example in Figure 3.7, where the simple parallel program that was shown
earlier in Figure 2.2 is executed by two different processors that reside in two different nodes of the system. Each of the four variables is in a separate block of memory. At the beginning of the program’s execution, the four memory blocks of interest are cached by both processors in a shared state and each contains a value equal to zero.

The timelines in Figure 3.7 illustrate the sequence of operations by the processor in Node 0, and the processor in Node 1, respectively. After both upgrade requests are broadcast on the ring in the form of a ring invalidation request, each processor continues executing and at some point initiates a read for the value that was written by the other processor. The dotted line represents the transmission of an invalidation packet on the ring network. Due to the inherent latency of the ring network, the transmitted packet may potentially arrive on the destination’s local bus after the read for the soon-to-be invalidated block. The packet’s arrival time varies with respect to the amount of activity on the bus and on the ring. The longer transmission time observed between Node 1 and Node 0 than between Node 0 and Node 1 is due to the fact that the invalidation request must travel around the ring through Nodes 2, 3, 4, and so forth, before returning to the node that precedes it.

The processor consistency model is often used to describe a multiprocessor system that contains write buffers within each processor. A write buffer allows read operations to bypass unrelated write operations that are held in the buffer. The above example illustrates a write operation that logically precedes a read operation, but does not complete with respect to all processors before the read operation does. This observed behavior is representative of a relaxed memory consistency model where load operations may bypass unrelated store operations from the same processor.
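The outcome described for Figure 3.7 can be illustrated with a small timing model. The following Python sketch is not taken from the thesis; the function name `run`, the abstract time units, and the latency parameters are hypothetical. It assumes both writes complete locally at time 0, that a load hitting a still-valid cached copy returns the old value 0, and that a load issued after the invalidation has arrived misses and fetches the new value 1 from the home node.

```python
def run(latency_0_to_1, latency_1_to_0, load_time):
    """Return (U, V) for the program  A=1; U=B  ||  B=1; V=A.

    Hypothetical model: both upgrades are issued at time 0 and each
    processor issues its load at `load_time`. The latencies are the
    times at which each ring invalidation reaches the other node's bus.
    """
    # Node 0 still holds a valid (stale) copy of B until Node 1's
    # invalidation arrives; likewise Node 1 for A.
    b_invalidated_at_node0 = latency_1_to_0
    a_invalidated_at_node1 = latency_0_to_1

    # A hit on a stale shared copy returns 0; after invalidation the
    # load misses and the remote read returns the written value 1.
    u = 0 if load_time < b_invalidated_at_node0 else 1
    v = 0 if load_time < a_invalidated_at_node1 else 1
    return u, v


# Node 1 -> Node 0 takes longer because the packet travels most of the ring.
early = run(latency_0_to_1=2, latency_1_to_0=6, load_time=1)
middle = run(latency_0_to_1=2, latency_1_to_0=6, load_time=3)
late = run(latency_0_to_1=2, latency_1_to_0=6, load_time=7)
```

In this model the `early` case yields U=0 and V=0, a result that sequential consistency forbids but that is permitted when a load may complete before an earlier store is visible to all processors.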
The properties described above allow a similar type of bypassing to take place for remote addresses only. The processors used in this system do not contain a write buffer and stall until the cache miss is observed on the node’s local bus.
Writes are therefore completed with respect to the processors on the local bus, but not with respect to all of the remote processors on the ring. The node’s outbound buffer to the ring and the unidirectional ring serve the same purpose as a processor write buffer. Given these circumstances, the consistency model of the multiprocessor’s memory system is representative of processor consistency.

The processor consistency model that was discussed in Section 2.1.2 allows many of the common parallel programming idioms available in sequentially consistent systems, including flag-based synchronization, to be used without any modifications to their code. Processor consistency is also favorable from a hardware/software complexity standpoint because the additional complexity of more relaxed consistency models is unwarranted [Hil98].

3.8 Summary

This chapter presented the architecture of the multiprocessor system that was developed to support the research goals of this thesis. A pipelined processor with its cache controller and a split-transaction bus interconnect were described, followed by an overview of the system-level ring interconnect, and the memory consistency model of the system. Implementation details for these components are presented in Chapter 5. The next chapter, however, describes an enhancement to the ring-based multiprocessor.
Chapter 4
Region-Level Filtering of Coherence Traffic

The multiprocessor system described in the previous chapter generates a ring invalidation packet for each read-exclusive or upgrade request for a block of local data that appears on a node’s bus. An invalidation packet is formed from these types of requests and is queued on the node’s outbound buffer to the ring. The outbound buffer to the ring is dequeued when the ring node has permission to place a new packet on the ring interconnect. Invalidation packets are always broadcast for transmission around the ring.

As the invalidation packet circles the ring, it places a copy of itself in the inbound buffers of each visited node in the system. The packet is injected into each inbound buffer whether or not the node contains a copy of the soon-to-be invalidated block indicated by the invalidation packet’s address. If the caches in a given node do not contain a copy of the soon-to-be invalidated block, then the invalidation request that eventually is driven on the bus for that node will not result in a snoop hit by any of the node’s processors. In this case, the ring invalidation packet need not have entered that particular node, thereby avoiding the delay for bus arbitration and
incurred snooping overhead.

The purpose of this chapter is to introduce a two-part solution for reducing the impact of the circulating ring invalidation packets on the system by applying the region-based tracking concepts that were discussed in Section 2.2. The first element of the proposed solution involves filtering ring invalidation packets before they enter a node. This type of filtering is useful when some, but not all, nodes in the system are caching a copy of any block in the same region as the block that requires invalidation, as indicated by the address contained in the ring invalidation packet. By tracking the contents of each node’s caches, packets may be filtered from entering the nodes that are known not to hold a copy of any block in the region of interest.

The second element of the proposed solution considers the case when none of the nodes in the system are caching a copy of any block in the same region as the block that requires invalidation. As long as this condition holds true, the invalidation packet generated inside a node may be filtered from exiting the node and from entering the ring altogether. To collect the system-wide filtering information that will be used to enable this type of operation, each circulating ring invalidation packet is annotated with a single bit of sharing information. The bit of sharing information is updated by each visited node to reflect whether any block in the region indicated by the packet’s address may be cached by a processor in that node. If the invalidation packet returns with a positive non-shared bit, indicating that none of the visited nodes are caching a copy in the region of interest, then all of the following invalidation packets for the same region can be safely filtered prior to entering the ring, until the condition no longer holds.

Tracking the status of individual cache-line-sized blocks of data is possible using directories. 
The logic and storage overhead required by these structures is not necessarily
suitable for single-chip implementation. Similar to the related work on coarse-grained coherence tracking, all filtering decisions in this system are made based on the contents of a larger set of addresses. These sets of addresses are collectively known as regions of memory that encompass a power-of-two number of cache lines. The mapping from address to region and the specifics of the filtering technique are both based on the RegionScout approach for coarse-grained coherence tracking for bus-based systems that was described in Section 2.2.1.

The two structures proposed in RegionScout are adapted in this work for use in a unidirectional ring-based multiprocessor. In the original implementation, the first structure tracks the cache contents of a single processor, whereas in the work for this thesis, a similar structure is used to track the cache contents of all processors in a node, but only for remote data from other nodes. The second structure in RegionScout maintains a list of regions that do not contain a single cached block in the entire system. In the work for this thesis, a memory structure is used to maintain a list of regions corresponding to local data from a node that are not cached by any remote nodes. RegionScout uses its two structures to avoid unnecessary bus broadcasts and to reduce snoop tag lookups, while the proposed structures in this thesis are used to enable the filtering of ring invalidation packets produced by the reduced-complexity cache coherence protocol.

The two structures that are added to the baseline multiprocessor system of Chapter 3 are required to support the region-level filtering of invalidation packets. The location of these structures with respect to the multiprocessor node is illustrated in Figure 4.1. The first structure is the node cached-region hash (NCRH). 
The NCRH is used to prevent incoming ring invalidation packets from entering the node and also to update the non-shared bit of circulating ring invalidation packets. The filtering information maintained by the NCRH is obtained by tracking and counting the presence of remote blocks of data in all of the caches
in the associated multiprocessor node. The presence of remote blocks of data is counted for individual sets of regions and a count of zero for any such set indicates a possible filtering opportunity.

[Figure 4.1: Location of NCRH and RNSRT filters. Labels: ring router, inbound/outbound, ring interface, NCRH, RNSRT, request bus, response bus, multiprocessor node.]

Table 4.1: Comparison of NCRH and RNSRT structures

| | NCRH | RNSRT |
| Invalidation packet filtering direction | Incoming (to the multiprocessor node) | Outgoing (from the multiprocessor) |
| Type of filtering information | Count of remote blocks in local caches | List of non-shared regions of remote nodes |
| Source of filtering information | Requests appearing on multiprocessor node local bus | Returning invalidations and incoming remote read requests on ring |

The second structure in each node is the remote non-shared region table (RNSRT). The RNSRT monitors the arrival of ring read requests from other nodes and also the status of the non-shared bit of returning invalidation packets as they are removed from the ring. These two types of packets are used to maintain a list of current non-shared regions that are known not to be cached in remote nodes outside of the RNSRT’s node. A returning invalidation packet with its non-shared bit set creates an entry in the structure for its specified region of memory. An arriving remote read request will invalidate, if applicable, a valid entry in the RNSRT for its specified region of memory.
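The interaction between the two structures can be sketched in a few lines of Python. This is an illustrative model only, not the hardware design of the thesis: the region size (`REGION_BITS`), the modulo hash, the number of NCRH entries, the use of an unbounded set for the RNSRT, and all class and function names are assumptions. The sketch also assumes that the non-shared bit starts out set and is cleared by any visited node whose NCRH count for the region's set is non-zero.

```python
REGION_BITS = 10  # hypothetical: a region spans 2**10 bytes


def region_of(address):
    """Map a byte address to its region number."""
    return address >> REGION_BITS


class NCRH:
    """Counts remote blocks cached in one node, per hashed set of regions."""

    def __init__(self, entries=64):
        self.counts = [0] * entries

    def _index(self, address):
        return region_of(address) % len(self.counts)

    def block_allocated(self, address):
        self.counts[self._index(address)] += 1

    def block_evicted(self, address):
        self.counts[self._index(address)] -= 1

    def may_be_cached(self, address):
        # A zero count is a filtering opportunity; a non-zero count may be
        # a false positive because several regions share one counter.
        return self.counts[self._index(address)] > 0


class RNSRT:
    """Local regions currently known not to be cached by remote nodes."""

    def __init__(self):
        self.non_shared = set()

    def invalidation_returned(self, address, non_shared_bit):
        if non_shared_bit:
            self.non_shared.add(region_of(address))

    def remote_read_arrived(self, address):
        self.non_shared.discard(region_of(address))

    def can_filter(self, address):
        return region_of(address) in self.non_shared


def circulate_invalidation(address, remote_nodes):
    """Visit each (node_id, ncrh) pair in ring order.

    Returns the final non-shared bit and the nodes the packet entered.
    """
    non_shared = True
    entered = []
    for node_id, ncrh in remote_nodes:
        if ncrh.may_be_cached(address):
            non_shared = False
            entered.append(node_id)
    return non_shared, entered
```

A home node would consult `can_filter` before queuing an invalidation on its outbound buffer, and would call `invalidation_returned` and `remote_read_arrived` as the corresponding packets are observed on the ring.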
To further distinguish between the NCRH and the RNSRT, both structures are compared in Table 4.1. The NCRH filters incoming packets to the multiprocessor and the RNSRT filters outgoing packets from the multiprocessor. The NCRH maintains filtering information at the block level for caches inside its multiprocessor node, while the RNSRT maintains information at the region level for nodes outside its associated node. Finally, the NCRH monitors the multiprocessor node’s local bus to maintain its filtering information and the RNSRT monitors ring traffic.

The remainder of this chapter discusses the design and operation of the NCRH and RNSRT structures in more detail to highlight the issues that arise from adapting the bus-level region-based concepts to the larger scale of a ring-based multiprocessor. In particular, two methods for maintaining filtering information in the NCRH are presented, and finally, the correctness of full-system operation is explained.

4.1 Node Cached-Region Hash

The NCRH is a small memory structure that is used to filter inbound ring invalidation packets from entering the multiprocessor node and to provide region-level sharing information to circulating ring invalidation packets through the update of each packet’s non-shared bit. The NCRH counts the total number of blocks of remote data that are cached by all processors in a multiprocessor node for a set of regions. Filtering is achieved using imprecise information about the regions cached in each node, a behavior that results in the occurrence of false positives. When the NCRH is used as a filter, a false positive occurs when a block is reported to be shared when it is, in fact, non-shared. These false positives result in missed filtering opportunities and in a reduction of the filter’s accuracy, but they never violate correctness. The reduction in accuracy represents a trade-off between the filter’s
