This document discusses enhancing cache coherent architectures for manycore embedded systems by taking advantage of regular memory access patterns. It proposes adding pattern storage and detection capabilities to cores to reduce coherence traffic. Called CoCCA (Codesigned Cache Coherent Architecture), it modifies the baseline cache coherence protocol to allow speculative fetching of cache lines according to detected patterns, defined during compilation. This could improve scalability over the baseline approach by reducing traffic from repetitive accesses to shared data following predictable patterns.
The document proposes a new multithreaded execution model and multi-ring architecture to exploit instruction-level parallelism. The model uses multiple instruction threads that are scheduled for execution based on data availability. The instructions from different threads are grouped together for parallel execution. The proposed multi-ring architecture features large resident thread activations, a high-speed buffer to avoid load/store stalls, and a dynamic instruction scheduler that selects instructions from multiple threads each cycle for execution on multiple pipelines. Initial simulation results show the architecture can achieve parallel execution of 7 instructions per cycle.
This document proposes a framework for modeling source traffic in a Network on Chip (NoC) that originates from a single source but is destined for multiple destinations, known as multicasting. It presents a model to characterize such traffic as a single stream at the source based on the probabilistic demultiplexing of that stream into multiple streams. The model shows that the burst parameters of the demultiplexed streams are related to those of the original stream. The model is implemented in an NoC simulator and experimental results validate that the demultiplexed streams remain bursty even as their burst parameters change according to the model.
Optimized Design of 2D Mesh NOC Router using Custom SRAM & Common Buffer Util... (VLSICS Design)
With shrinking technology, reduced scale and power-hungry chip IO lead to the System on Chip. The design of an SOC using a traditional standard bus scheme encounters issues like non-uniform delay and routing problems. Crossbars scale better than buses but tend to become huge with an increasing number of nodes. NOC has become the design paradigm for SOC design for its highly regularized interconnect structure, good scalability and linear design effort. The main components of an NoC topology are the network adapters, routing nodes, and network interconnect links. This paper mainly deals with the implementation of full custom SRAM based arrays over D FF based register arrays in the design of the input module of a routing node in a 2D mesh NOC topology. The custom SRAM blocks replace D FF (D flip flop) memory implementations to optimize the area and power of the input block. Full custom design of the SRAMs has been carried out with MILKYWAY, while physical implementation of the input module with SRAMs has been carried out with IC Compiler from SYNOPSYS. The improved design occupies approximately 30% of the area of the original design, in conformity with the ratio of the area of an SRAM cell to the area of a D flip flop, which is approximately 6:28. The power consumption is almost halved, to 1.5 mW. The maximum operating frequency is improved from 50 MHz to 200 MHz. It is intended to study and quantify the behavior of the single packet array design in relation to the multiple packet array design. Intuitively, a common packet buffer would result in better utilization of available buffer space. This in turn would translate into lower delays in transmission. A MATLAB model is used to show quantitatively how performance is improved in a common packet array design.
This document summarizes several dynamic cache replication mechanisms: Victim Replication replicates cache lines evicted from the local cache to reduce access latency. Adaptive Selective Replication dynamically adjusts replication based on estimated costs and benefits. Adaptive Probability Replication replicates blocks based on predicted reuse probabilities. Dynamic Reusability-based Replication replicates blocks with high reuse. Locality-Aware Data Replication only replicates high-locality blocks to reduce misses while maintaining low replication overhead. The document provides details on these schemes and compares their approaches to dynamic cache block replication.
This document summarizes a paper that proposes and evaluates the performance of a multithreaded architecture capable of exploiting both coarse-grained parallelism and fine-grained instruction-level parallelism. The architecture distributes processing across multiple processing elements connected by an interconnection network. Each processing element supports multiple concurrently executing threads by grouping instructions from different threads. The architecture introduces a distributed data structure cache to reduce network latency when accessing remote data. Simulation results indicate the architecture achieves high processor throughput and the data structure cache significantly reduces network latency.
The document provides an overview of basic concepts related to parallelization and data locality optimization. It discusses loop-level parallelism as a major target for parallelization, especially in applications using arrays. Long running applications tend to have large arrays, which lead to loops that have many iterations that can be divided among processors. The document also covers data locality and how the organization of computations can significantly impact performance by improving cache usage. It introduces concepts like symmetric multiprocessors and affine transform theory that are useful for parallelization and locality optimizations.
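As a concrete illustration of loop-level parallelism (a generic sketch, not code from the summarized document), the iterations below are independent, so a long-running loop over a large array can be divided among processors:

```cpp
#include <vector>

// Loop-level parallelism: compile with -fopenmp and OpenMP hands each
// thread a chunk of the iteration space; without it, the pragma is
// ignored and the loop simply runs serially.
void scale(std::vector<double>& a, double k) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(a.size()); ++i)
        a[i] *= k; // iterations are independent: safe to parallelize
}
```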
This document presents an analytical model to evaluate the performance of multithreaded multiprocessors with distributed shared memory. The model uses a multi-chain closed queuing network to model the processor, memory, and network subsystems in an integrated manner. This captures the strong coupling between subsystems. The model shows that high performance is achieved when the memory request rate matches the weighted sum of memory bandwidth and average remote memory access distance. The model is validated using a stochastic timed Petri net model.
Characteristics of an on chip cache on nec sx (Léia de Sousa)
The document discusses characteristics of an on-chip cache, called a vector cache, for the NEC SX vector architecture. It evaluates the performance of the vector cache with varying memory bandwidth rates from 1 to 4 bytes per flop. The evaluation uses kernel loops and five leading scientific applications from areas like GPR simulation, APFA simulation, PRF simulation, and SFHT simulation. The results show that the vector cache can boost computational efficiency for lower memory bandwidth systems, with a 2 bytes per flop system achieving performance comparable to a 4 bytes per flop system when cache hit rates exceed 50%.
This document summarizes a lecture on speculative multithreading. The lecture discussed two papers: one that categorized hardware support for speculative multithreading, and one that described a software-only approach using transactional memory. The class discussion covered dividing responsibilities between hardware and software, sources of parallelism for speculation beyond just loops, scaling speculation, and programming support options.
This thesis proposes a design methodology for dynamically reconfigurable multi-FPGA systems. The methodology includes three main phases: design extraction from VHDL, static global layout partitioning and placement, and reuse of blocks through dynamic reconfiguration when needed to minimize delays. The major contribution is a multi-FPGA design flow that exploits dynamic reconfiguration to reuse blocks and reduce the application area requirements. Experimental results show the proposed approaches partition and place designs efficiently. Future work includes improving clustering metrics, routing algorithms, and time estimation for dynamic block reuse.
This document summarizes a research paper that proposes a new cache coherence protocol called Phase-Priority Based (PPB) cache coherence. PPB aims to optimize directory-based cache coherence protocols for multicore processors. It introduces the concepts of "phase" and "priority" for coherence messages to reduce unnecessary transient states and message stalling. PPB differentiates messages into inner and outer phases based on their place in the coherence transaction ordering. It also prioritizes messages in the on-chip network to improve efficiency. Analysis shows PPB outperforms traditional MESI, reducing transient states and stalls by up to 24% with a 7.4% speedup.
Performance evaluation of ECC in single and multi (elliptic curve) (Danilo Calle)
The document discusses performance evaluation of ECC (Elliptic Curve Cryptography) implementation on FPGA-based embedded systems using single and dual processor architectures. It explores implementing ECC using a single MicroBlaze soft processor core and a dual MicroBlaze core design with shared memory for inter-processor communication. Experimental results show the dual core design encrypts data 3.3 times faster than the single core design, but utilizes more resources and power due to the additional processor core.
This document evaluates the performance of four memory consistency models (sequential consistency, processor consistency, weak consistency, and release consistency) for shared-memory multiprocessors using simulation studies of three applications (MP3D, LU, and PTHOR). The results show that sequential consistency performs significantly worse than the other models. Surprisingly, processor consistency performs almost as well as release consistency and better than weak consistency for one application, indicating that allowing reads to bypass pending writes provides more benefit than allowing writes to pipeline.
CRDOM: cell re-ordering based domino on-the-fly mapping (VLSICS Design)
Domino logic is often the choice for designing high speed CMOS circuits. VLSI designers often choose library based approaches to perform technology mapping of large scale circuits involving the static CMOS logic style. Cells designed in the Domino logic style have the flexibility to accommodate a wide range of functions. Hence, there is scope to adopt a library-free synthesis approach for circuits designed using Domino logic. In this work, we present an approach for mapping a Domino logic circuit using an On-the-fly technique. First, we present a node mapping algorithm which maps a given Domino logic netlist using the On-the-fly technique. Next, using an Equivalence Table, we re-order the cells along the critical path for delay and area benefit. Finally, we find an optimum re-ordering set which obtains maximum delay savings for a minimum area penalty. We have tested the efficacy of our approach with a set of standard benchmark circuits. Our proposed mapping approach (CRDOM) obtained 21% improvement in area and 17% improvement in delay compared to existing work.
Parallel platforms can be organized in various ways, from an ideal parallel random access machine (PRAM) to more conventional architectures. PRAMs allow concurrent access to shared memory and can be divided into subclasses based on how simultaneous memory accesses are handled. Physical parallel computers use interconnection networks to provide communication between processing elements and memory. These networks include bus-based, crossbar, multistage, and various topologies like meshes and hypercubes. Maintaining cache coherence across multiple processors is important and can be achieved using invalidate protocols, directories, and snooping.
An octa core processor with shared memory and message-passing (eSAT Journals)
Abstract: In this era of fast, high-performance computing, there is a need for efficient optimizations in the processor architecture and in the memory hierarchy as well. Every day, advancing applications in communication and multimedia systems compel an increase in the number of cores in the main processor, viz. dual-core, quad-core, octa-core and so on. But for enhancing the overall performance of a multiprocessor chip, there are stringent requirements to improve inter-core synchronization. Thus, an MPSoC with 8 cores supporting both message-passing and shared-memory inter-core communication mechanisms is implemented on a Virtex 5 LX110T FPGA. Each core is based on the MIPS (Microprocessor without Interlocked Pipelined Stages) III ISA, handles only integer instructions, and has a six-stage pipeline with a data hazard detection unit and forwarding logic. The eight processing cores and one central shared memory core are interconnected using a 3x3 2-D mesh topology based Network-on-Chip (NoC) with a virtual channel router. The router is four-stage pipelined, supports the DOR X-Y routing algorithm, and uses round-robin arbitration. For verification and functionality testing of the fully synthesized multicore processor, a matrix multiplication operation is mapped onto it. Partitioning and scheduling of the multiple multiplications and additions for each element of the resultant matrix has been done among the eight cores to get maximum throughput. All the processor design code is written in Verilog HDL. Keywords: MPSoC, message-passing, shared memory, MIPS, ISA, wormhole router, network-on-chip, SIMD, data level parallelism, 2-D Mesh, virtual channel
All new computers have multicore processors. To exploit this hardware parallelism for improved performance, the predominant approach today is multithreading using shared variables and locks. This approach has potential data races that can create a nondeterministic program. This paper presents a promising new approach to parallel programming that is both lock-free and deterministic. The standard forall primitive for parallel execution of for-loop iterations is extended into a more highly structured primitive called a Parallel Operation (POP). Each parallel process created by a POP may read shared variables (or shared collections) freely. Shared collections modified by a POP must be selected from a special set of predefined Parallel Access Collections (PAC). Each PAC has several Write Modes that govern parallel updates in a deterministic way. This paper presents an overview of a Prototype Library that implements this POP-PAC approach for the C++ language, including performance results for two benchmark parallel programs.
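The abstract does not show the library's actual API, so the following C++ sketch is purely hypothetical: parallelOp stands in for a POP, and ReducePAC for a PAC whose single Write Mode is a commutative reduction, which is what makes the parallel update deterministic. All names are invented for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical PAC: a shared accumulator whose only write mode is a
// commutative, associative "reduce", so concurrent updates yield the
// same final value regardless of thread interleaving.
struct ReducePAC {
    std::int64_t value = 0;
    void reduceAdd(std::int64_t x) {
        #pragma omp atomic
        value += x; // commutative update: deterministic final sum
    }
};

// Hypothetical POP: a structured forall whose body may read shared
// data freely but may write shared state only through a PAC.
template <typename Body>
void parallelOp(std::size_t n, Body body) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(n); ++i)
        body(static_cast<std::size_t>(i));
}

std::int64_t dot(const std::vector<int>& a, const std::vector<int>& b) {
    ReducePAC acc;
    parallelOp(a.size(), [&](std::size_t i) {
        acc.reduceAdd(static_cast<std::int64_t>(a[i]) * b[i]);
    });
    return acc.value;
}
```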
Artificial Neural Network Implementation on FPGA – a Modular Approach (Roee Levy)
This document presents an FPGA implementation of an artificial neural network using a modular approach. Key points:
- The implementation uses a multilayer perceptron topology trained with the backpropagation algorithm. It allows networks of any size to be synthesized quickly.
- The design achieves peak performance of 5.46 million connection updates per second during training and 8.24 million predictions per second during computation.
- It was tested on a breast cancer classification problem, achieving 96% accuracy.
- The paper emphasizes important FPGA design principles that make neural network development modular and parameterized. This allows the system to solve various neural network problems efficiently.
This document summarizes the key aspects of a design flow framework that aims to simplify the development of partially reconfigurable systems. The framework hides complexity related to reconfiguration from designers and supports different architectural paradigms and communication infrastructures. It was developed in three phases: studying existing approaches, realizing the framework based on separated tools, and validating it with a new communication protocol. The framework generates architectures from a system description and allows designers to focus on writing modules while handling reconfiguration details.
A talk on Transformers at GDG DevParty
27.06.2020
Link to Google Slides version: https://docs.google.com/presentation/d/1N7ayCRqgsFO7TqSjN4OWW-dMOQPT5DZcHXsZvw8-6FU/edit?usp=sharing
Migration To Multi Core - Parallel Programming Models (Zvi Avraham)
The document discusses multi-core and many-core processors and parallel programming models. It provides an overview of hardware trends including increasing numbers of cores in CPUs and GPUs. It also covers parallel programming approaches like shared memory, message passing, data parallelism and task parallelism. Specific APIs discussed include Win32 threads, OpenMP, and Intel TBB.
The document proposes a new framework for hardware/software co-design that raises the abstraction level to allow uniform system description and faster simulation. The framework uses parameterizable hardware cores along with estimation to explore design spaces and generate optimized implementations. It has been applied to the RoadRunner project and focuses on further developing the hardware aspects. Future work includes expanding to software domains and refining system simulation and estimation techniques.
PEARC17: Interactive Code Adaptation Tool for Modernizing Applications for In... (Ritu Arora)
Often, HPC software outlives the HPC systems for which they are initially developed. The innovations in the HPC platforms’ hardware and parallel programming standards drive the modernization of HPC applications so that they continue being performant. While such code modernization efforts may not be very challenging for HPC experts and well-funded research groups, many domain-experts may find it challenging to adapt their applications for latest HPC platforms due to lack of expertise, time, and funds. The challenges of such domain-experts can be mitigated by providing them high-level tools for code modernization and migration.
This document reviews Network-on-Chip (NoC) architectures that prioritize selected data streams to reduce communication latency. It categorizes the architectures based on the effect of prioritization (per end-to-end connection, per router, or per path segment) and discusses their pros and cons. Architectures that prioritize at the core-to-core level provide the highest latency reduction by bypassing the NoC, while those prioritizing per router or path segment require redetermining priority at each hop.
System on Chip Design and Modelling, Dr. David J Greaves (Satya Harish)
The document provides an overview of a course on system on chip design and modeling techniques. The course covers topics like register transfer language, SystemC components, basic SoC components, assertion-based design, network on chip structures, and architectural design exploration. It aims to cover the front end of the design automation process, including specification, modeling at different levels of abstraction, and logic synthesis. A running example evolves over the lectures to demonstrate a simple SoC.
The document summarizes research on parallelizing genetic algorithms to improve scalability when solving concept location problems. Four distributed architectures were developed and tested: 1) a simple client-server model with no data sharing, 2) a database configuration, 3) a hash-database configuration, and 4) a hash configuration where each server caches its own data locally. Experimental results showed the hash configuration performed best, reducing computation time by over 140 times compared to a single machine by efficiently storing and accessing already-computed data locally on each server. Future work aims to test different communication protocols and problems to validate the findings.
Assisting User’s Transition to Titan’s Accelerated Architecture (inside-BigData.com)
Oak Ridge National Lab is home of Titan, the largest GPU accelerated supercomputer in the world. This fact alone can be an intimidating experience for users new to leadership computing facilities. Our facility has collected over four years of experience helping users port applications to Titan. This talk will explain common paths and tools to successfully port applications, and expose common difficulties experienced by new users. Lastly, learn how our free and open training program can assist your organization in this transition.
The document discusses using a regular expression matching architecture called ReCPU for network intrusion detection systems (NIDS). ReCPU can efficiently match regular expressions in hardware and is well-suited for the high-speed regular expression matching needs of NIDS. It describes the ReCPU architecture, which uses parallel comparators to match multiple characters simultaneously, and how its design can be adapted for NIDS computation.
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS (cscpconf)
Our main aim of research is to find the limit of Amdahl's Law for multicore processors: the number of cores that gives the most efficiency to the overall architecture of the CMP (Chip Multiprocessor, a.k.a. Multicore Processor). As expected, this limit lies either in the architecture of the Multicore Processor or in the programming. We surveyed the architectures of the Multicore processors of various chip manufacturers, namely INTEL™, AMD™, IBM™ etc., and the various techniques they follow for improving the performance of Multicore Processors. We conducted cluster experiments to find this limit. In this paper we propose an alternate design of Multicore processor based on the results of our cluster experiment.
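For reference, Amdahl's Law (a standard result, not specific to the summarized paper) gives the limit the authors are probing: for a program whose parallelizable fraction is p, running on n cores,

```latex
S(n) = \frac{1}{(1 - p) + \frac{p}{n}},
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p}.
```

With p = 0.95, for instance, no number of cores can push the speedup past 20.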
Concurrent Replication of Parallel and Distributed Simulations (Gabriele D'Angelo)
Parallel and distributed simulations enable the analysis of complex systems by concurrently exploiting the aggregate computation power and memory of clusters of execution units. In this work we investigate a new direction for increasing both the speedup of a simulation process and the utilization of computation and communication resources. Many simulation-based investigations require to collect independent observations for a correct and significant statistical analysis of results. The execution of many independent parallel or distributed simulation runs may suffer the speedup reduction due to rollbacks under the optimistic approach, and due to idle CPU times originated by synchronization and communication bottlenecks under the conservative approach. We present a parallel and distributed simulation framework supporting concurrent replication of parallel and distributed simulations (CR-PADS), as an alternative to the execution of a linear sequence of multiple parallel or distributed simulation runs. Results obtained from tests executed under variable scenarios show that speedup and resource utilization gains could be obtained by adopting the proposed replication approach in addition to the pure parallel and distributed simulation.
The document describes a methodology for designing dynamically reconfigurable multi-FPGA systems. It presents an intermediate representation for hierarchical circuits and a design flow with three main phases: design extraction from VHDL, static global layout partitioning and placement, and reuse through dynamic reconfiguration to minimize delays. Experimental results validate the partitioning, placement and block-reuse approaches. Future work includes improving clustering metrics, time estimation, and adding routing algorithms.
The document summarizes research on distributing the computation of a genetic algorithm's fitness function to parallelize concept location. Four distributed architectures - simple client-server, database, hash-database, and hash configurations - were developed, tested, and compared. The experiments found that a simple architecture where each server tracked its own computations without data sharing had the fastest execution time, reducing the genetic algorithm computation by around 140 times compared to a single-machine implementation. Future work aims to experiment with different communication protocols and synchronization strategies on additional traces from other systems.
Improving Efficiency of Machine Learning Algorithms using HPCC Systems (HPCC Systems)
1) The document discusses improving the efficiency of machine learning algorithms using the HPCC Systems platform through parallelization.
2) It describes the HPCC Systems architecture and its advantages for distributed machine learning.
3) A parallel DBSCAN algorithm is implemented on the HPCC platform which shows improved performance over the serial algorithm, with execution times decreasing as more nodes are used.
The document discusses parallel computing platforms and trends in microprocessor architectures that enable implicit parallelism. It covers topics like pipelining, superscalar execution, limitations of memory performance, and how caches can improve effective memory latency. The key points are:
1) Microprocessor clock speeds have increased dramatically but limitations remain regarding memory latency and bandwidth. Parallelism addresses performance bottlenecks in processors, memory, and communication.
2) Techniques like pipelining and superscalar execution exploit implicit parallelism by executing multiple instructions concurrently, but dependencies and branch prediction limit performance gains.
3) Memory latency is often the bottleneck, but caches can reduce effective latency through data reuse and temporal locality.
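As a small illustration of point 3 (a sketch of the general technique, not code from the summarized lecture), cache blocking restructures a traversal so each fetched line is reused many times before eviction:

```cpp
#include <cstddef>

// Cache blocking (tiling): process the matrix in BxB tiles so the
// working set fits in cache; each fetched cache line is then reused
// about B times instead of once, hiding most of the memory latency.
constexpr std::size_t B = 64; // tile size, tuned to cache capacity

void transpose(const double* in, double* out, std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += B)
        for (std::size_t jj = 0; jj < n; jj += B)
            for (std::size_t i = ii; i < ii + B && i < n; ++i)
                for (std::size_t j = jj; j < jj + B && j < n; ++j)
                    out[j * n + i] = in[i * n + j];
}
```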
An Adaptive Load Balancing Middleware for Distributed Simulation (Gabriele D'Angelo)
The simulation is useful to support the design and performance evaluation of complex systems, possibly composed of a massive number of interacting entities. For this reason, the simulation of such systems may need aggregate computation and memory resources obtained from clusters of parallel and distributed execution units. Shared computer clusters composed of available Commercial-Off-the-Shelf hardware are preferable to dedicated systems, mainly for cost reasons. The performance of distributed simulations is influenced by the heterogeneity of execution units and by their respective background CPU load. Adaptive load balancing mechanisms could improve resource utilization and the simulation process execution by dynamically tuning the simulation load, with an eye to reducing synchronization and communication overheads. This work presents the GAIA+ framework: a new load balancing mechanism for distributed simulation. The framework has been evaluated by performing testbed simulations of a wireless ad hoc network model. Results confirm the effectiveness of the proposed solutions.
The document introduces Parallel Pixie Dust (PPD), a cross-platform thread library that aims to guarantee deadlock-free and race-condition free schedules that are optimal. It discusses the need for multiple threads due to factors like the memory wall. Current threading models are problematic because testing and debugging threaded code is difficult. PPD uses futures and thread pools to simulate data flow and generate tree-like thread schedules. It provides parallel versions of functions and thread-safe containers to enable multi-threaded standard library algorithms. The goal is to make writing correct multi-threaded programs easier.
A Survey on in-a-box parallel computing and its implications on system softwa... (ChangWoo Min)
1) The document surveys research on parallel computing using multicore CPUs and GPUs, and its implications for system software.
2) It discusses parallel programming models like OpenMP, Intel TBB, CUDA, and OpenCL. It also covers research on optimizing memory allocation, reducing system call overhead, and revisiting OS architecture for manycore systems.
3) The document reviews work on supporting GPUs in virtualized environments through techniques like GPU virtualization. It also summarizes projects that utilize the GPU in middleware for tasks like network packet processing.
This document summarizes a research paper that proposes a new heuristic called PAUSE for investigating the producer-consumer problem in distributed systems. The paper motivates the need to study this problem, describes PAUSE's approach of using compact configurations and decentralized components, outlines its implementation in Lisp and Java, and presents experimental results showing PAUSE outperforms previous methods. Related work investigating similar challenges is also discussed.
1. Enhancing Cache Coherent Architectures with Access Patterns for Embedded Manycore Systems
Jussara Marandola, Stéphane Louise, Loïc Cudennec, David A. Bader
stephane.louise@cea.fr
Oct 11-12, 2012, SoC 2012
2. Background
Multicore and manycore systems: architecture and its future:
Single-processor time is over
Multiprocessors are here and will remain
Down to embedded systems (e.g. my cellphone)
Manycore systems are on the verge of appearing (e.g. Tilera, but others are on the way)
The future is manycore, even in the embedded world
We have to prepare for this
Programmability?
3–6. Programmability
What are the programming paradigms for manycores? How do we program them?
New paradigms (e.g. stream programming): still young, need to learn a new way of programming. Bad for legacy software (porting costs!)
MPI (OK for HPC applications, but also heavy work for parallelization)
OpenMP and the like: “only” adding some pragmas to parallelize an application
OpenMP relies on a shared memory model, so a shared memory behavior must be provided, and if possible done in hardware (because it is faster)
7–8. Shared memory consistency for multicore/manycores
For manycore systems, memory consistency = cache coherence mechanisms, based on the four-state MESI protocol:
Modified: a single valid copy of the data exists in the system and was modified since its fetch from memory
Exclusive: the value is in only one core’s cache and wasn’t modified since it was accessed from memory
Shared: multiple copies of the value exist in the system, and only read operations were done
Invalid: the current copy that the core has must not be used and will be discarded
Use of Home Nodes to keep the state consistency:
For a given address in memory, only one core of the system keeps the coherence state
The distribution of home nodes is done as a modulo on an address mask (round-robin, line size) to avoid hot spots
A processor mask tracks the cores that share the cache line
This baseline protocol is the base for all memory consistency systems in the state of the art
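As a concrete reading of the round-robin home-node rule above, here is a minimal C++ sketch; the 64-byte line size and the names are illustrative assumptions, not taken from the slides:

```cpp
#include <cstdint>

// The four MESI states listed on the slide.
enum class Mesi { Modified, Exclusive, Shared, Invalid };

// Illustrative cache line size; the real mask width is a design choice.
constexpr std::uint64_t kLineSize = 64;

// Round-robin home-node selection: drop the in-line offset, then take
// the cache-line index modulo the core count, so consecutive lines map
// to consecutive cores and no single directory becomes a hot spot.
inline unsigned homeNode(std::uint64_t addr, unsigned numCores) {
    return static_cast<unsigned>((addr / kLineSize) % numCores);
}
```

With 16 cores, line 0 is tracked by core 0, line 1 by core 1, and so on, wrapping every 16 lines.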
9. Baseline implementation in a manycore system
[Figure: a tile with its L2 cache and cache coherence directory, attached to the memory interface and network interface; each directory entry holds an address plus coherence info (state and vector bit fields)]
10–13. Modification of a shared value by a given core
[Figure, animated over four steps ➀–➃, showing the coherence messages exchanged with the processors that hold a shared copy of the data when one core modifies it]
14. Introduction CoCCA approach evaluation Conclusion and perspective
Issues of baseline protocol and memory access patterns
Sometime a single write on a shared value triggers lot of
coherence traffic on the NoC
For regular but non conterminous access, lot accesses
Typical example: reading an image by column
But the accesses are simple and deterministic
In some areas the baseline protocol does not work as well as it
could and lacks a bit of scalability
In the embedded world lots of low level data processing display a
regular behavior WRT their memory accesses
Convolutions on images
usual transformations (e.g. FFT, DCT)
vector operation
etc.
7 / 23
15. Introduction CoCCA approach evaluation Conclusion and perspective
Issues of baseline protocol and memory access patterns
Sometimes a single write to a shared value triggers a lot of
coherence traffic on the NoC
Regular but non-contiguous accesses generate many coherence messages
Typical example: reading an image column by column
Yet these accesses are simple and deterministic
In such cases the baseline protocol performs below its potential
and lacks scalability
In the embedded world, much low-level data processing exhibits
regular memory access behavior:
convolutions on images
usual transformations (e.g. FFT, DCT)
vector operations
etc.
The idea: take advantage of these regular memory access patterns
to reduce coherence traffic and enable memory prefetch
State of the Art, Memory patterns and shared memory
coherence
Use of memory patterns:
Intel: special instructions to perform regular accesses to
memory, limited to a single core; Patent US 7,143,264 (2006)
IBM: special instructions used to detect and apply patterns,
also limited to a single cache; Patent US 7,395,407 (2008)
Other platforms:
the STAR project aims to provide a scalable manycore with a
coherent shared memory
Cache Coherence Architecture with patterns
Our enhancement to Cache Coherence Architecture (CCA)
Relies on the baseline protocol (adds to it, does not replace it)
Updates it with special cases for pattern management
Adds storage to each core for pattern storage and detection
Patterns are a result of the compilation process
It cannot work worse than the baseline, because the baseline
remains the default.
Modifies:
the core IP, with pattern storage and matching
the protocol, adding the speculative protocol to the baseline one
The patterns (and the speculative protocol) have their own Home
Node determination (which can be the same as or differ from the
baseline Home Node)
We call this modified system CoCCA (Codesigned CCA)
CoCCA architecture scheme
[Figure: CoCCA node architecture. Beside the baseline Coherence
Directory (entries: address, coherence info with state and vector
bit fields) and the memory and network interfaces, each node adds a
CoCCA Pattern Table (entries: address, pattern, coherence info with
state and bit fields).]
Estimated chip area overhead: ~+3%
Pattern definition and storage
Patterns are not stored the same way on use nodes and home
nodes
The minimum implementation uses a 2D strided access shape:
a start address
a stride length
a pattern length
On the home node, additionally: a pattern size
A speculative access fetches cache lines (as the baseline protocol
does), but the access pattern may need a more fine-grained
specification (overlaps)
Definition of triggers: a way of detecting the signature of a
pattern to fetch
the simplest trigger is the first address of the pattern access
Triggers and pattern definition
Pattern matching principle (hardware):
Pattern calculation (simple case):
Desc = fn(B_addr, s, δ)
B_addr: base address
s: size of the pattern
δ: interval (stride) between two consecutive accesses
E.g.: Pat(1, 4, 2)(@1) = { @2, @5, @8, @11 } (see the sketch below)
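A small C sketch of this expansion follows, under one plausible reading of the example: the first parameter is an offset from the trigger address, each access covers one address unit, and δ units separate two consecutive accesses, giving a step of δ + 1. This reading reproduces the addresses above; the actual hardware encoding may differ.

#include <stdio.h>

/* Expand Pat(offset, s, delta) fired at a trigger address. */
static void expand_pattern(unsigned long trigger, unsigned long offset,
                           unsigned long s, unsigned long delta)
{
    for (unsigned long i = 0; i < s; i++)
        printf("@%lu ", trigger + offset + i * (delta + 1));
    printf("\n");
}

int main(void)
{
    expand_pattern(1, 1, 4, 2);  /* prints: @2 @5 @8 @11 */
    return 0;
}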
Base of the modified protocol
[Flow chart: on the requester, a directory (DIR) lookup hit is
served by an L2 cache read; on a miss, a Pattern Table (PT) lookup
decides between sending a SPEC_RQ to the hybrid (CoCCA) Home Node on
a hit, and a plain RD_RQ to the baseline Home Node on a miss. The
baseline Home Node performs a DIR lookup and a memory access, then
answers with RD_RQ_AK. The hybrid Home Node first performs a pattern
lookup: on a hit it accesses memory and answers with RD_RQ_AK
messages covering the whole pattern length; on a miss it falls back
to the baseline path (DIR lookup, memory access, RD_RQ_AK).]
Without pattern information, or in case of a pattern miss, the
system acts as an ordinary baseline architecture
In case of a pattern hit, the speculative protocol is fired, as
sketched below
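Here is a C sketch of the requester-side decision, as read from the flow above; dir_lookup, pt_lookup and the send primitives are hypothetical placeholders for the hardware.

#include <stdbool.h>
#include <stdint.h>

extern bool dir_lookup(uint64_t addr);   /* local directory hit? */
extern bool pt_lookup(uint64_t addr);    /* pattern-table trigger hit? */
extern void read_l2(uint64_t addr);
extern void send_spec_rq(uint64_t addr); /* to the hybrid (CoCCA) Home Node */
extern void send_rd_rq(uint64_t addr);   /* to the baseline Home Node */

void request_line(uint64_t addr)
{
    if (dir_lookup(addr))
        read_l2(addr);       /* directory hit: read from the L2 cache */
    else if (pt_lookup(addr))
        send_spec_rq(addr);  /* pattern hit: fire the speculative protocol */
    else
        send_rd_rq(addr);    /* pattern miss: ordinary baseline request */
}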
Hardware tables and special instructions
A C-language description of the pattern-storing tables:

struct pattern_table {
    unsigned long capacity;   /* sizeof(address) */
    unsigned long size;       /* number of addresses */
    unsigned long *offset;    /* pattern offsets */
    unsigned long *length;    /* pattern lengths */
    unsigned long *stride;    /* pattern strides */
};

So it is possible to make a rough estimate of the size of an entry
in the pattern table.
A few specialized instructions manage the pattern tables (a usage
sketch follows):
PatternNew: to create a pattern,
PatternAddOffset: to add an offset entry,
PatternAddLength: to add a length entry,
PatternAddStride: to add a stride entry,
PatternFree: to release the pattern after use.
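A sketch of how compiler-emitted code might drive these instructions, treating them as intrinsics; the prototypes, operand lists and the column-read use case are assumptions.

/* Hypothetical intrinsic prototypes for the instructions above. */
extern unsigned long PatternNew(void);
extern void PatternAddOffset(unsigned long pat, unsigned long offset);
extern void PatternAddLength(unsigned long pat, unsigned long length);
extern void PatternAddStride(unsigned long pat, unsigned long stride);
extern void PatternFree(unsigned long pat);

/* Describe a column-wise read of an image of the given dimensions. */
void setup_column_pattern(unsigned long base, unsigned long width,
                          unsigned long height)
{
    unsigned long pat = PatternNew();   /* create the pattern entry */
    PatternAddOffset(pat, base);        /* first access at the image base */
    PatternAddLength(pat, height);      /* one access per image row */
    PatternAddStride(pat, width);       /* jump one row between accesses */
    /* ... the column traversal runs here, triggering the pattern ... */
    PatternFree(pat);                   /* release the entry after use */
}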
A first benchmark program for early evaluation
The benchmark program for our speculative protocol must:
be representative of typical embedded applications
stress several aspects of the protocol proposal
We chose a two-step cascading image filter (see the sketch below):
the result of the first filter is the source of the second filter
a 5x5 filter
applied to chunks of the image, one per core, with cache lines
shared both in read mode and in write mode
the result of the second filter is written back over the source
(write invalidation)
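A C sketch of the benchmark kernel under stated assumptions: the image size, kernel values and chunking are illustrative, and border handling and the inter-step barrier are omitted.

#define W 512                 /* assumed image width */
#define H 512                 /* assumed image height */

static float src[H][W], tmp[H][W];
extern const float k1[5][5], k2[5][5];  /* the two 5x5 kernels */

/* One 5x5 convolution tap; callers must stay 2 pixels from borders. */
static float conv5x5(const float img[H][W], const float k[5][5],
                     int y, int x)
{
    float acc = 0.0f;
    for (int i = 0; i < 5; i++)
        for (int j = 0; j < 5; j++)
            acc += k[i][j] * img[y + i - 2][x + j - 2];
    return acc;
}

/* Each core filters its own chunk [y0,y1)x[x0,x1); rows near the
 * chunk frontier are also read by neighbouring cores (sharing). */
void filter_chunk(int y0, int y1, int x0, int x1)
{
    for (int y = y0; y < y1; y++)        /* first filter: src -> tmp */
        for (int x = x0; x < x1; x++)
            tmp[y][x] = conv5x5(src, k1, y, x);
    /* (synchronization barrier between the two steps omitted) */
    for (int y = y0; y < y1; y++)        /* second filter: tmp -> src */
        for (int x = x0; x < x1; x++)
            src[y][x] = conv5x5(tmp, k2, y, x);  /* write back over source */
}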
Memory mapping of our benchmark program
[Figure: memory mapping of the benchmark program.]
Instrumentation choice: Pin/Pintools
Pin/Pintools:
Pin is a binary instrumentation framework that relies on JIT
techniques to accelerate instrumentation. It is an Intel project
Pin works in association with a programmable instrumentation
tool called a Pintool
Several Pintools are provided in the basic distribution of Pin
We used:
inscount: a Pintool that counts the executed instructions
pinatrace: a Pintool that traces and logs all the memory
accesses (load/store operations)
See paper for details.
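Typical invocations look as follows, assuming the stock example Pintools have been built; exact paths and output file names depend on the Pin version and platform.

$ pin -t obj-intel64/inscount0.so -- ./benchmark   # writes inscount.out
$ pin -t obj-intel64/pinatrace.so -- ./benchmark   # writes pinatrace.out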
Data sharing and prefetch
[Figure: read data sharing in conterminous rectangles Rect. i,
Rect. i+1, Rect. i+7 and Rect. i+8; legend: exclusive data (1 core
only), data shared by 2 cores, data shared by 4 cores.]
We can define three kinds of patterns on this benchmark:
Source image prefetch and setting of old Shared values (S) to
Exclusive values (E) when the source image becomes the
destination (2 patterns per core)
False concurrency of write accesses between two rectangles of
the destination image. This happens because the frontiers are
not aligned with L2 cache lines. The associated pattern is 6
vertical lines with 0 bytes in common
Shared read data (because convolution kernels read pixels in
conterminous rectangles, see the figure above). There are 6
vertical lines and 3 sets of two horizontal lines for these
patterns
After simplification, only 6 patterns are required
Evaluation results
Condition                          MESI    CoCCA
Shared line invalidation          34560    17283
Exclusive line sharing (2 cores)  12768    12768
Exclusive line sharing (4 cores)   1344      772
Total throughput                  48672    30723

37% reduction in coherence message throughput
prefetch accounts for 10% of cache accesses
this means that without prefetch the application runs 67% slower
(20 cycles for an on-chip shared cache access, 80 cycles for an
external memory access)
Contributions
Shared memory and coherence are important for the
programmability of CMPs
State-of-the-art cache coherence mechanisms fall into worst-case
behaviors for scenarios that seem simple: regular, patterned
accesses to memory
We defined an extension of the cores to store patterns
We extended the baseline protocol with a speculative protocol
For embedded systems, the pattern tables are built as part of
the compilation process
Only a few pattern entries are necessary for each typical
low-level filter
Patterns can significantly reduce coherence message throughput
Patterns allow early and efficient cache preloading, which
significantly accelerates applications
May provide a path to cache coherency in massive manycores
Future work and perspective
extend the number of benchmark applications to draw more
general conclusions
apply our ideas in a NoC simulator to run cycle-accurate
simulations
include it in a full-scale simulator (e.g. SoCLib)
extend our work toward an HPC-friendly architecture that would
determine patterns dynamically at runtime
Thank you for your attention
Questions?
ALCHEMY workshop @ ICCS 2013 (Barcelona)
The International Conference on Computational Science (ICCS) can be
a good place to talk with people using HPC architectures for their
needs.
Loïc Cudennec and I are organizing a workshop on the issues that
are arising with future manycore systems (number of cores > 1000
and beyond):
Architecture, Language, Compilation and Hardware support
for Emerging ManYcore systems: the ALCHEMY workshop
Topics:
Advanced architecture support for massive parallelism
management
Advanced architecture support for enhanced communication
for manycores
Full paper submission: December 15th. Notification: February 10.