Be the first to like this
As the number of processors in multiprocessor system-on-chip devices continues to increase, the complexity required for full cache coherence support is often unwarranted for application-specific designs. Bus-based interconnects are no longer suitable for larger-scale systems, and the logic and storage overhead associated with the use of a complex packet-switched network and directory-based cache coherence may be undesirable in single-chip systems. Unidirectional rings are a suitable alternative because they offer many properties favorable to both on-chip implementation and to supporting cache coherence. Reducing the overhead of cache coherence traffic is, however, a concern for these systems. This thesis adapts two filter structures that are based on principles of coarse-grained coherence tracking, and applies them to a ring-based multiprocessor. The first structure tracks the total number of blocks of remote data cached by all processors in a node for a set of regions, where a region is a large area of memory referenced by the upper bits of an address. The second structure records regions of local data whose contents are not cached by any remote node. When used together to filter incoming or outgoing requests, these structures reduce the extent of coherence traffic and limit the transmission of coherent requests to the necessary parts of the system. A complete single-chip multiprocessor system that includes the proposed filters is designed and implemented in programmable logic for this thesis. The system is composed of nodes of bus-based multiprocessors, and each node includes a common memory, two or more pipelined 32-bit processors with coherent data caches, a split-transaction bus with separate lines for requests and responses, and an interface for the system-level ring interconnect. Two coarse-grained filters are attached to each node to reduce the impact of coherence traffic on the system. Cache coherence within the node is enforced through bus snooping, while coherence across the interconnect is supported by a reduced-complexity ring snooping protocol. Main memory is globally shared and is physically distributed among the nodes. Results are presented to highlight the system's key implementation points. Synthesis results are presented in order to evaluate hardware overhead, and operational results are shown to demonstrate the functionality of the multiprocessor system and of the filter structures.