Simultaneous Multi-Threading (SMT) processors improve system performance by allowing concurrent execution of multiple independent threads that share key datapath components, yielding better utilization of resources. Speculative execution allows modern processors to fetch continuously and reduce the delays caused by control instructions. However, a significant amount of resources is usually wasted on mis-speculation; those resources could have been used by other valid instructions, and such waste is even more pronounced in an SMT system. To minimize this waste, a speculative trace capping technique [1] was proposed to limit the number of speculative instructions in the system. In this paper, a thorough analysis investigates the trade-offs of applying this capping mechanism at different pipeline stages so as to maximize its benefits. Our simulations show that the best choice can improve overall system throughput by a very significant margin (up to 46%) without sacrificing execution fairness among the threads.
AN ATTEMPT TO IMPROVE THE PROCESSOR PERFORMANCE BY PROPER MEMORY MANAGEMENT F... - IJCSEA Journal
The performance of the processor is highly dependent on the regular supply of correct instructions at the right time. Whenever a miss occurs in the cache memory, the processor has to spend more cycles on the fetching operation. One of the methods used to reduce instruction cache misses is instruction prefetching, which in turn will increase the instruction supply to the processor. Technology developments in these fields indicate that in future the gap between the processing speed of the processor and the data transfer speed of memory is likely to increase. Branch predictors play a critical role in achieving effective performance in many modern pipelined microprocessor architectures.
Prefetching can be done either with software or hardware. In software prefetching the compiler inserts prefetch code into the program; in this case the actual memory capacity is not known to the compiler, which leads to some harmful prefetches. Hardware prefetching, instead of inserting prefetch code, makes use of extra hardware that is utilized during execution. The most significant source of lost performance is when the processor is waiting for the availability of the next instruction. The time wasted on a branch misprediction is equal to the number of stages in the pipeline from the fetch stage to the execute stage. All the prefetching methods stress only the fetching of instructions for execution, not the overall performance of the processor. In this paper we make an attempt to study branch handling in a uniprocessing environment: whenever a branch is identified, instead of invoking the branch predictor, proper cache memory management is enabled inside the memory management unit.
The effective way of processor performance enhancement by proper branch handling - csandit
The processor performance is highly dependent on the regular supply of correct instructions at the right time. To reduce instruction cache misses, one of the proposed mechanisms is instruction prefetching, which in turn will increase the instruction supply to the processor. Technology developments in these fields indicate that in future the gap between the processing speed of the processor and the data transfer speed of memory is likely to increase. Memory bandwidth can be increased significantly using prefetching, but unsuccessful prefetches will pollute the primary cache. Prefetching can be done either with software or hardware. In software prefetching the compiler inserts prefetch code into the program; in this case the actual memory capacity is not known to the compiler, which leads to some harmful prefetches. Hardware prefetching, instead of inserting prefetch code, makes use of extra hardware that is utilized during execution. The most significant source of lost performance is when the processor is waiting for the availability of the next instruction. All the prefetching methods stress only the fetching of instructions for execution, not the overall performance of the processor. This paper is an attempt to study branch handling in a uniprocessing environment, where, whenever a branch is identified, proper cache memory management is enabled inside the memory management unit.
THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL... - cscpconf
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS - cscpconf
The main aim of our research is to find the limit of Amdahl's law for multicore processors, that is, the number of cores beyond which adding more gives no further efficiency to the overall architecture of the CMP (Chip Multiprocessor, a.k.a. multicore processor). We expect this limit to lie either in the architecture of the multicore processor or in the programming. We surveyed the architectures of multicore processors from various chip manufacturers, namely Intel, AMD, IBM, etc., and the various techniques they follow for improving the performance of multicore processors. We conducted cluster experiments to find this limit. In this paper we propose an alternate design of a multicore processor based on the results of our cluster experiment.
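The limit this abstract refers to follows from Amdahl's law: with serial fraction s, speedup can never exceed 1/s no matter how many cores are added. A minimal sketch (the serial fraction 0.05 is an assumed workload parameter, not a figure from the paper):

```python
# Hypothetical sketch of Amdahl's law for a CMP with n cores.
# The serial fraction s = 0.05 is an assumed workload parameter.
def amdahl_speedup(n_cores: int, serial_fraction: float) -> float:
    """Speedup = 1 / (s + (1 - s) / n)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# Speedup approaches, but never exceeds, 1 / s = 20 as cores grow:
for n in (2, 8, 64, 1024):
    print(n, round(amdahl_speedup(n, 0.05), 2))
```

Doubling the core count past this point yields diminishing returns, which is the motivation for probing where the curve flattens experimentally.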
Affect of parallel computing on multicore processors - csandit
1. The timing behavior of the OS must be predictable - for all services of the OS there must be an upper bound on the execution time.
2. The OS must manage timing and scheduling - the OS possibly has to be aware of task deadlines (unless scheduling is done off-line).
3. The OS must be fast.
Enhancing network security and performance using optimized acls - ijfcstjournal
An Access Control List (ACL) plays a very important role in network security. A proper combination of rules for ACLs can close loopholes in the system, thus minimizing security breaches. An ACL can improve network performance to a good degree by limiting traffic and controlling the areas that are accessible to any device or user. However, if an ACL is not managed properly and efficiently, it causes packet latency and degrades network performance. In this paper we present various optimization mechanisms to achieve an optimal ACL that reduces packet latency. We also propose an efficient optimization algorithm to optimize the ACL to enhance network performance, and we discuss the importance of ACLs and the various rule anomalies.
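One common ACL optimization of the kind the abstract alludes to is reordering rules by observed hit count, so that frequently matched rules are checked first and average lookup (hence packet) latency drops. A hedged sketch; the rule fields are illustrative and not taken from the paper:

```python
# Hypothetical ACL rules with per-rule hit counters (illustrative data).
rules = [
    {"id": 1, "action": "deny",   "hits": 12},
    {"id": 2, "action": "permit", "hits": 940},
    {"id": 3, "action": "permit", "hits": 305},
]

# Reorder so the hottest rules are evaluated first. A stable sort keeps
# the relative order of equally hot rules. In a real ACL, reordering is
# only safe for rules whose match conditions do not overlap.
optimized = sorted(rules, key=lambda r: r["hits"], reverse=True)
print([r["id"] for r in optimized])  # hottest rule first
```

The overlap caveat in the comment is exactly where the paper's "rule anomalies" matter: two overlapping rules with different actions cannot be swapped without changing the ACL's semantics.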
Multilevel priority packet scheduling scheme for wireless networks - ijdpsjournal
Scheduling different types of packets, such as real-time and non-real-time data packets, in wireless links is necessary to reduce the energy consumption of the wireless device. Most existing packet scheduling mechanisms use opportunistic transmission scheduling, in which communication is postponed up to an acceptable time deadline until the best expected channel conditions to transmit are found. This algorithm incurs a large processing overhead and more energy consumption. In this paper we propose a Dynamic Multilevel Queue Scheduling algorithm. In the proposed scheme, the ready queue is partitioned into three levels of priority queues. Real-time packets are placed into the highest-priority queue and non-real-time data packets are placed into the two other queues. We evaluate the performance of the proposed Dynamic Multilevel Queue Scheduling scheme through simulations for real-time and non-real-time data. Simulation results illustrate that the multilevel priority packet scheduling scheme outperforms the conventional methods in terms of average data waiting time and end-to-end delay.
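The three-level ready queue described above can be sketched in a few lines. This is an illustrative model, not the paper's implementation; the class and packet names are assumptions:

```python
import heapq

# Minimal sketch of the multilevel queue idea: real-time packets go to
# the highest-priority level (0); non-real-time packets are split
# across two lower levels (1 and 2).
class MultilevelQueue:
    def __init__(self):
        self._heap = []   # entries are (level, seq, packet)
        self._seq = 0     # monotonically increasing FIFO tie-break

    def enqueue(self, packet, real_time=False, level=1):
        lvl = 0 if real_time else max(1, min(level, 2))
        heapq.heappush(self._heap, (lvl, self._seq, packet))
        self._seq += 1

    def dequeue(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

q = MultilevelQueue()
q.enqueue("bulk-data", level=2)
q.enqueue("sensor-reading", real_time=True)
q.enqueue("log-entry", level=1)
print(q.dequeue())  # prints "sensor-reading": real-time is served first
```

Because the heap orders on (level, seq), a real-time packet always preempts queued non-real-time packets, while packets within a level are served FIFO.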
This note is compiled with reference to many sites and according to the syllabus of Real Time Systems (6th semester CSIT). Dive deep into the never-ending knowledge.
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES - ijmvsc
In a simultaneous multithreaded system, a core’s pipeline resources are sometimes partitioned and otherwise shared amongst numerous active threads. One shared resource is the write buffer, which acts as an intermediary between a store instruction’s retirement from the pipeline and the store value being written to cache. The write buffer takes a completed store instruction from the load/store queue and eventually writes the value to the level-one data cache. Once a store is buffered with a write-allocate cache policy, the store must remain in the write buffer until its cache block is in the level-one data cache. This latency may vary from as little as a single clock cycle (in the case of a level-one cache hit) to several hundred clock cycles (in the case of a cache miss). This paper shows that cache misses routinely dominate the write buffer’s resources and prevent cache hits from being written to memory, thereby degrading the performance of simultaneous multithreaded systems. This paper proposes a technique to reduce denial of resources to cache hits by limiting the number of cache misses that may concurrently reside in the write buffer, and shows that system performance can be improved by using this technique.
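The capping policy described in this abstract can be sketched as a simple admission check at insertion time. This is a hedged illustration of the general idea, not the paper's actual mechanism; the class, capacities, and field names are assumptions:

```python
# Illustrative write buffer that caps how many outstanding cache-miss
# stores may occupy it, so short-latency cache-hit stores are not
# starved of entries by long-latency misses.
class WriteBuffer:
    def __init__(self, capacity=8, miss_cap=4):
        self.capacity = capacity
        self.miss_cap = miss_cap   # max concurrent cache-miss entries
        self.entries = []          # list of (addr, is_miss)

    def try_insert(self, addr, is_miss):
        if len(self.entries) >= self.capacity:
            return False           # buffer full: the store must stall
        if is_miss:
            misses = sum(1 for _, m in self.entries if m)
            if misses >= self.miss_cap:
                return False       # cap reached: only misses stall
        self.entries.append((addr, is_miss))
        return True

wb = WriteBuffer(capacity=8, miss_cap=2)
assert wb.try_insert(0x100, True)
assert wb.try_insert(0x200, True)
assert not wb.try_insert(0x300, True)   # third miss is rejected
assert wb.try_insert(0x400, False)      # but hits still get entries
```

The key property is that a burst of misses can no longer consume every buffer entry, which is the resource-denial effect the paper measures.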
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES - ijdpsjournal
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr... - IDES Editor
In this paper, we propose a novel architectural technique which can be used to boost the performance of modern-day processors. It is especially useful in certain code constructs like small loops and try-catch blocks. The technique is aimed at improving performance by reducing the number of instructions that need to enter the pipeline at all. We also demonstrate its working in a scalar pipelined soft-core processor developed by us. Lastly, we present how a superscalar microprocessor can take advantage of this technique and increase its performance.
WRITE BUFFER PARTITIONING WITH RENAME REGISTER CAPPING IN MULTITHREADED CORES - ijdpsjournal
In simultaneous multithreaded systems, several pipeline resources are shared amongst multiple threads concurrently. Among these shared resources are the register file and the write buffer. The physical register file is a critical shared resource in these types of systems due to the limited number of rename registers available for renaming. The write buffer, another shared resource, is also crucial since it serves as an intermediary between the retirement of a store instruction and the writing of its value to cache. Both components, if not configured properly, can become a bottleneck, leading to inefficient usage of the resources and undesirable performance.
However, when configuring both shared components concurrently, there is potential to alleviate common performance congestion. This paper shows that when implementing a static register capping algorithm (limiting the number of physical register entries for each thread), there is a byproduct of increased variety in sources for the write buffer. This presents an opportunity for the write buffer to choose a more suitable thread as its source at certain clock cycles. With this opportunity, this paper proposes a technique that allows the write buffer to both prioritize and enforce the choice of low-latency threads by partitioning the write buffer into two sections, cache-hit-priority and cache-hit-only partitions, showing that system performance and resource efficiency can be further improved by using this technique in a modified SMT environment.
CS 301 Computer Architecture - Student # 1, EID 09, Kingdom of ... (.docx) - faithxdunce63732
CS 301 Computer Architecture
Student # 1 - E, ID: 09 - Kingdom of Saudi Arabia, Royal Commission at Yanbu, Yanbu University College, Yanbu Al-Sinaiyah
Student # 2 - H, ID: 09 - Kingdom of Saudi Arabia, Royal Commission at Yanbu, Yanbu University College, Yanbu Al-Sinaiyah
1. Introduction
High-performance processor design has recently taken two distinct approaches. One approach is to increase the execution rate by increasing the clock frequency of the processor or by reducing the execution latency of the operations. While this approach is important, much of its performance gain comes as a consequence of circuit and layout improvements and is beyond the scope of this research. The other approach is to directly exploit the instruction-level parallelism (ILP) in the program and to issue and execute multiple operations concurrently. This approach requires both compiler and microarchitecture support.
Traditional processor designs that issue and execute at most one operation per cycle are often called scalar designs. Static and dynamic scheduling techniques have been used to achieve better-than-scalar performance by issuing and executing more than one operation per cycle. While Johnson [7] defines a superscalar processor as any design that achieves better-than-scalar performance, popular usage of this term refers exclusively to those processors that use dynamic scheduling techniques. For clarity, we use instruction-level parallel processors to refer to the general class of processors that execute more than one operation per cycle.
The primary static scheduling technique uses the compiler to determine sets of operations that have their source operands ready and have no dependencies within the set. These operations can then be scheduled within the same instruction, subject only to hardware resource limits. Since each of the operations in an instruction is guaranteed by the compiler to be independent, the hardware is able to issue and execute these operations directly with no dynamic analysis. These multi-operation instructions are very long in comparison with traditional single-operation instructions, and processors using them are commonly called very long instruction word (VLIW) processors.
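The compiler step described above, grouping operations whose operands are ready and which have no dependences within the group, can be sketched as a greedy bundling pass. This is an illustrative toy (the three-address operation format is an assumption, and real compilers also respect issue-width and functional-unit limits, omitted here):

```python
# Each op is (dest, src1, src2), meaning dest = src1 OP src2.
ops = [
    ("a", "x", "y"),   # depends only on inputs
    ("b", "a", "z"),   # depends on a
    ("c", "x", "z"),   # independent of a and b
    ("d", "b", "c"),   # depends on b and c
]

bundles = []
ready = set("xyz")     # values available before the first instruction
pending = list(ops)
while pending:
    # All ops whose sources are ready are mutually independent, so
    # they may share one multi-operation (VLIW-style) instruction.
    bundle = [op for op in pending if op[1] in ready and op[2] in ready]
    bundles.append(bundle)
    ready |= {dst for dst, _, _ in bundle}
    pending = [op for op in pending if op not in bundle]

print([[dst for dst, _, _ in b] for b in bundles])  # [['a', 'c'], ['b'], ['d']]
```

The pass packs a and c into one instruction because neither reads the other's result, exactly the independence guarantee the hardware relies on to skip dynamic analysis.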
CS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docxfaithxdunce63732
CS 301 Computer Architecture
Student # 1
E
ID: 09
Kingdom of Saudi Arabia Royal Commission at Yanbu Yanbu University College Yanbu Al-Sinaiyah
Student # 2
H
ID: 09
Kingdom of Saudi Arabia Royal Commission at Yanbu Yanbu University College Yanbu Al-Sinaiyah
1
1. Introduction
High-performance processor design has recently taken two distinct approaches. One approach is to increase the execution rate by increasing the clock frequency of the processor or by reducing the execution latency of the operations. While this approach is important, much of its performance gain comes as a consequence of circuit and layout improvements and is beyond the scope of this research. The other approach is to directly exploit the instruction-level parallelism (ILP) in the program and to issue and execute multiple operations concurrently. This approach requires both compiler and microarchitecture support.
Traditional processor designs that issue and execute at most one operation per cycle are often called scalar designs. Static and dynamic scheduling techniques have been used to achieve better-than scalar performance by issuing and executing more than one operation per cycle. While Johnson[7] defines a superscalar processor as a design that achieves better-than scalar performance, popular usage of this term refers exclusively to those processors that use dynamic scheduling techniques. For clarity, we use instruction-level parallel processors to refer to the general class of processors that execute more than one operation per cycle of the computer both at the personal level, or the level of a small network of computers to do not require more of these types.
The primary static scheduling technique uses the compiler to determine sets of operations that have their source operands ready and have no dependencies within the set. These operations can then be scheduled within the same instruction subject only to hardware resource limits. Since each of the operations in an instruction is guaranteed by the compiler to be independent, the hardware is able to is- sue and execute these operations directly with no dynamic analysis. These multi-operation instructions are very long in comparison with traditional single-operation instructions and processors using .
Hardback solution to accelerate multimedia computation through mgp in cmpeSAT Publishing House
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
DEVELOPING SCHEDULER TEST CASES TO VERIFY SCHEDULER IMPLEMENTATIONS IN TIMETR...ijesajournal
Despite that there is a “one-to-many” mapping between scheduling algorithms and scheduler
implementations, only a few studies have discussed the challenges and consequences of translating between
these two system models. There has been an argument that a wide gap exists between scheduling theory and
scheduling implementation in practical systems, where such a gap must be bridged to obtain an effective
validation of embedded systems. In this paper, we introduce a technique called “Scheduler Test Case”
(STC) aimed at bridging the gap between scheduling algorithms and scheduler implementations in singleprocessor embedded systems implemented using Time-Triggered Co-operative (TTC) architectures. We will
demonstrate how the STC technique can provide a simple and systematic way for documenting, verifying
(testing) and comparing various TTC scheduler implementations on particular hardware. However, STC is
a generic technique that provides a black-box tool for assessing and predicting the behaviour of
representative implementation sets of any real-time scheduling algorithm.
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORSijdpsjournal
Advances in Integrated Circuit processing allow for more microprocessor design options. As Chip Multiprocessor system (CMP) become the predominant topology for leading microprocessors, critical components of the system are now integrated on a single chip. This enables sharing of computation resources that was not previously possible. In addition the virtualization of these computation resources exposes the system to a mix of diverse and competing workloads. On chip Cache memory is a resource of primary concern as it can be dominant in controlling overall throughput. This Paper presents analysis of various parameters affecting the performance of Multi-core Architectures like varying the number of cores, changes L2 cache size, further we have varied directory size from 64 to 2048 entries on a 4 node, 8 node 16 node and 64 node Chip multiprocessor which in turn presents an open area of research on multicore processors with private/shared last level cache as the future trend seems to be towards tiled architecture executing multiple parallel applications with optimized silicon area utilization and excellent performance.
Developing Scheduler Test Cases to Verify Scheduler Implementations In Time-T...ijesajournal
Despite that there is a “one-to-many” mapping between scheduling algorithms and scheduler implementations, only a few studies have discussed the challenges and consequences of translating between these two system models. There has been an argument that a wide gap exists between scheduling theory and scheduling implementation in practical systems, where such a gap must be bridged to obtain an effective validation of embedded systems. In this paper, we introduce a technique called “Scheduler Test Case” (STC) aimed at bridging the gap between scheduling algorithms and scheduler implementations in single-processor embedded systems implemented using Time-Triggered Co-operative (TTC) architectures. We will demonstrate how the STC technique can provide a simple and systematic way for documenting, verifying (testing) and comparing various TTC scheduler implementations on particular hardware. However, STC is a generic technique that provides a black-box tool for assessing and predicting the behaviour of representative implementation sets of any real-time scheduling algorithm.
Final project report on grocery store management system..pdfKamal Acharya
In today’s fast-changing business environment, it’s extremely important to be able to respond to client needs in the most effective and timely manner. If your customers wish to see your business online and have instant access to your products or services.
Online Grocery Store is an e-commerce website, which retails various grocery products. This project allows viewing various products available enables registered users to purchase desired products instantly using Paytm, UPI payment processor (Instant Pay) and also can place order by using Cash on Delivery (Pay Later) option. This project provides an easy access to Administrators and Managers to view orders placed using Pay Later and Instant Pay options.
In order to develop an e-commerce website, a number of Technologies must be studied and understood. These include multi-tiered architecture, server and client-side scripting techniques, implementation technologies, programming language (such as PHP, HTML, CSS, JavaScript) and MySQL relational databases. This is a project with the objective to develop a basic website where a consumer is provided with a shopping cart website and also to know about the technologies used to develop such a website.
This document will discuss each of the underlying technologies to create and implement an e- commerce website.
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)MdTanvirMahtab2
This presentation is about the working procedure of Shahjalal Fertilizer Company Limited (SFCL). A Govt. owned Company of Bangladesh Chemical Industries Corporation under Ministry of Industries.
Immunizing Image Classifiers Against Localized Adversary Attacksgerogepatton
This paper addresses the vulnerability of deep learning models, particularly convolutional neural networks
(CNN)s, to adversarial attacks and presents a proactive training technique designed to counter them. We
introduce a novel volumization algorithm, which transforms 2D images into 3D volumetric representations.
When combined with 3D convolution and deep curriculum learning optimization (CLO), itsignificantly improves
the immunity of models against localized universal attacks by up to 40%. We evaluate our proposed approach
using contemporary CNN architectures and the modified Canadian Institute for Advanced Research (CIFAR-10
and CIFAR-100) and ImageNet Large Scale Visual Recognition Challenge (ILSVRC12) datasets, showcasing
accuracy improvements over previous techniques. The results indicate that the combination of the volumetric
input and curriculum learning holds significant promise for mitigating adversarial attacks without necessitating
adversary training.
About
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy in configuration using DIP switches.
Technical Specifications
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
Key Features
Indigenized remote control interface card suitable for MAFI system CCR equipment. Compatible for IDM8000 CCR. Backplane mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control on serial and TCP protocol.
• Remote control: Parallel or serial interface
• Compatible with MAFI CCR system
• Copatiable with IDM8000 CCR
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
Application
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy in configuration using DIP switches.
PERFORMANCE ENHANCEMENT WITH SPECULATIVE-TRACE CAPPING AT DIFFERENT PIPELINE STAGES IN SIMULTANEOUS MULTI-THREADING PROCESSORS
1. Computer Applications: An International Journal (CAIJ), Vol.3, No.4, November 2016
DOI:10.5121/caij.2016.3401 1
PERFORMANCE ENHANCEMENT WITH
SPECULATIVE-TRACE CAPPING AT DIFFERENT
PIPELINE STAGES IN SIMULTANEOUS
MULTI-THREADING PROCESSORS
Wenjun Wang and Wei-Ming Lin
Department of Electrical and Computer Engineering, The University of Texas at San
Antonio, San Antonio, TX 78249-0669, USA
ABSTRACT
Simultaneous Multi-Threading (SMT) processors improve system performance by allowing concurrent execution of multiple independent threads, sharing key datapath components for better utilization of resources. Speculative execution allows modern processors to fetch continuously and reduce the delays caused by control instructions. However, a significant amount of resources is usually wasted due to mis-speculation, resources that could otherwise have been used by valid instructions, and such waste is even more pronounced in an SMT system. To minimize this waste, a speculative-trace capping technique [1] was proposed to limit the number of speculative instructions in the system. In this paper, a thorough analysis is given to investigate the trade-offs of applying this capping mechanism at different pipeline stages so as to maximize its benefits. Our simulations show that the best choice can improve overall system throughput by a very significant margin (up to 46%) without sacrificing execution fairness among the threads.
KEYWORDS
Simultaneous Multi-Threading, Superscalar, Speculative Execution
1. INTRODUCTION
Simultaneous Multi-Threading (SMT) improves overall system performance by allowing concurrent execution of multiple independent threads for better utilization of the resources provided by modern processor architectures. Essentially, SMT improves overall performance by exploiting Thread-Level Parallelism (TLP) among threads to overcome the limited Instruction-Level Parallelism (ILP) available in a single thread, which inherently limits the potential of a superscalar system [2][3]. Various designs have targeted the improvement of overall system performance in an SMT system. Several techniques focus on the fairness of resource allocation and modify the fetch policy. Among many others, ICOUNT [4] gives priority to threads with the fewest instructions in the system prior to issuing. DCRA [5] monitors each thread's resource usage and gives a higher share of the available resources to memory-intensive threads. Several others address resource allocation at the rename stage [6][7], the dispatch stage [8][7] and the commit stage [9][10][11].
All modern processors adopt speculative execution to reduce the delays caused by control instructions. With basic speculative processing, instructions are speculatively fetched and executed along the predicted path without waiting for the actual outcome of the control instruction. If a misprediction occurs, instructions along the speculative trace are flushed out of the system
and all the resources thus occupied are considered wasted. Traditional branch prediction techniques [12][13][14] can partially alleviate the flush-out problem. Note that this problem is usually exacerbated in an SMT system: instructions from different threads intermix through shared components in the pipeline, which potentially extends the length of speculative traces. The number of mis-speculated instructions flushed out can easily account for a significant percentage of all instructions fetched. Coupled with the adverse effect that these resources could otherwise have been utilized by valid instructions, the net performance gain from speculative execution can be offset to an undesirable degree. Note that this offset does not exist in a non-SMT system.
A speculative trace-capping control mechanism was first proposed in [1], applied at the dispatch stage to limit the maximum number of instructions in a speculative trace allowed beyond this stage. However, the stage at which such a capping mechanism is applied can significantly affect the performance outcome. An obvious trade-off exists between capping at an earlier stage, to benefit more from a shorter speculative trace, and capping at a later stage, to diminish the adverse effect of the delay in replenishing the pipeline after a correct speculation is resolved. In this paper, a comprehensive study is performed to investigate this trade-off so as to identify the optimal pipeline stage at which to apply such a control mechanism and maximize its benefits. Our simulations show that when this mechanism is applied at the right stage, not only is overall system performance improved by up to 46%, but execution fairness is also enhanced by 29%.
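To make the mechanism concrete, the per-thread capping check can be sketched as follows. This is a minimal illustration of the idea described above; the cap value and all names (`CAP`, `may_pass_stage`, `spec_in_flight`) are our own assumptions, not identifiers from [1] or from any simulator.

```python
# Illustrative sketch of a speculative-trace capping check at one pipeline stage.
CAP = 8  # assumed maximum speculative instructions allowed past the capped stage

def may_pass_stage(spec_in_flight: int, is_speculative: bool, cap: int = CAP) -> bool:
    """Decide whether an instruction may pass the capped pipeline stage this cycle."""
    if not is_speculative:
        return True  # non-speculative instructions are never capped
    # A speculative instruction is held back once its thread's trace hits the cap.
    return spec_in_flight < cap
```

Under this sketch, a thread whose speculative trace has already reached the cap simply stalls at the capped stage until the speculation is resolved or flushed.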
The rest of the paper is organized as follows. Background information of SMT systems and
speculative execution is first given in Section 2, followed by the descriptions of the adopted
simulation environment (simulator and workload) and performance metrics in Section 3. A
detailed analysis illustrating our motivation and conjectures is provided in Section 4. Section 5
explains the proposed method, followed by the simulation results in Section 6. Concluding
remarks are then given in Section 7.
2. BACKGROUND
2.1. SMT Processors
Figure 1. Pipeline Stages in a 4-Threaded SMT System
(Diagram showing the pipeline stages 1. Fetch, 2. Decode/Rename, 3. Dispatch, 4. Issue, 5. Writeback and 6. Commit, connecting the IFQ, ROB, IQ, LSQ, physical rename register file, functional units, write buffer, cache and memory.)
An SMT processor is built from a typical out-of-order superscalar processor; the basic pipeline stages of a 4-threaded SMT system are shown in Figure 1. Instructions from a thread are fetched from memory (and cache) and put into the thread's private Instruction Fetch Queue (IFQ). After the decode and register-rename stages, instructions are allocated to their respective Re-Order Buffer (ROB) and pass through the dispatch stage into the shared Issue Queue (IQ). Load/store instructions have their memory operations sent into an individual Load/Store Queue (LSQ), with their address-calculation operations also sent into the IQ. When the issuing conditions (all operands ready and the required functional unit available) are met, operations are issued to the corresponding functional units and have their results written back to their target locations or forwarded to where they are needed in the IQ or LSQ. Load/store instructions, once their addresses are calculated, initiate their memory operations. Finally, all instructions are committed from the ROB (synchronized with load/store instructions in the LSQ) in order. The most common characteristic of SMT processors is the sharing of key datapath components among multiple independent threads. The shared resources of an SMT system normally include the physical register file (for renaming), various inter-stage bandwidths, buffers (e.g. the Issue Queue and the write buffer), and functional units. Due to this sharing, the amount of hardware required in an SMT system can be significantly less than that of multiple copies of superscalar processors while achieving comparable throughput.
2.2. Speculative Execution
Modern processors with a speculative mechanism use branch predictors at the fetch stage to allow continuous fetching even when a control instruction (a branch, or a jump with an execution-time-dependent target) is encountered. The fetch direction after these control instructions depends on the outcome of the branch predictor. The instructions fetched following these control instructions are called "speculative instructions"; they are allowed to proceed through the pipeline even before the branch condition and/or target address is resolved. Such a speculation is not resolved until the control instruction reaches the writeback stage. If the speculation is found to be correct, all speculative instructions following it are processed as normal instructions until another speculation is encountered. In this way, the processor can significantly reduce the delay from an unresolved control instruction compared to a system that simply suspends fetching of subsequent instructions until the control instruction is resolved. Conversely, with an incorrect speculation, all the subsequent speculative instructions, the so-called speculative trace, have to be flushed out of the system, which requires more complex control logic and leads to resource wastage. Note that such a flush-out does not pose a significant threat to performance in a superscalar system because of its single-thread platform: the resources thus wasted would not have been used by other "valid" instructions anyway. However, in an SMT system, instructions from different threads intermix in time while flowing through the shared components in the pipeline, potentially stretching the length of a speculative trace. Threads also potentially block one another, further lengthening speculative traces and prolonging the lifetime of speculative instructions in the system. As aforementioned, some critical resources in an SMT system (such as the register file, IQ, etc.) are shared among concurrently executing threads, and instructions from different threads may run into severe competition for available slots due to the limited sizes of these resources. These shared resources may be occupied by mis-speculated instructions for a long period of time, during which they could have been utilized by other valid instructions. Our analysis presented in a later section shows that the percentage of resources thus wasted can be undesirably high, which can be especially harmful to overall system performance.
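The flush-out mechanics described above can be sketched minimally as follows. This is our own illustrative model of squashing a speculative trace from a program-ordered buffer, not code from any actual simulator.

```python
def flush_speculative_trace(rob: list, branch_index: int) -> int:
    """On a misprediction, squash every instruction younger than the branch.

    `rob` is a program-ordered list of in-flight instructions; the returned
    count is the number of slots that were wasted on the mis-speculated trace.
    """
    wasted = len(rob) - (branch_index + 1)
    del rob[branch_index + 1:]  # remove the speculative trace
    return wasted
```

For example, with five in-flight instructions and a mispredicted branch at index 1, the three younger instructions are flushed and their slots counted as wasted.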
3. SIMULATION ENVIRONMENT
3.1. Simulator
We use M-sim [15], a multi-threaded microarchitectural simulation environment, to estimate the performance of an SMT system. M-sim explicitly models the key pipeline structures such as the Reorder Buffer (ROB), the Issue Queue (IQ), the Load/Store Queue (LSQ), separate integer and floating-point register files, register renaming, and the associated rename table. M-sim supports both single-threaded execution and the simultaneous execution of multiple threads. In SMT mode, the physical register file, IQ, functional units and write buffer are shared among threads. Table 1 gives the detailed configuration of the simulated processor.
Table 1. Configuration of Simulated Processor

Machine width: 8-wide fetch/dispatch/issue/commit
L/S queue size: 48-entry load/store queue
ROB & IQ size: 128-entry ROB, 32-entry IQ
Functional units & latency (total/issue): 4 Int Add (1/1); 1 Int Mult (3/1) / Div (20/19); 2 Load/Store (1/1); 4 FP Add (2/1); 1 FP Mult (4/1) / Div (12/12) / Sqrt (24/24)
Physical registers: integer and floating point, as specified in the paper
L1 I-cache: 64KB, 2-way set associative, 64-byte line
L1 D-cache: 64KB, 4-way set associative, 64-byte line, write back, 1-cycle access latency
L2 cache: unified 512KB, 16-way set associative, 64-byte line, write back, 10-cycle access latency
BTB: 512-entry, 4-way set associative
Branch predictor: bimod, 2K entries
Pipeline structure: 5-stage front-end (fetch-dispatch), scheduling (for register file access: 2 stages), execution, write back, commit
Memory: 32-bit wide, 300-cycle access latency
3.2. Workload
Simulations on simultaneous multi-threading use workloads mixed from the SPEC CPU2006 benchmark suite [16] with various levels of ILP. The ILP classification of each benchmark is obtained by initializing it in accordance with the procedure described for the SimPoint tool [17] and simulating it individually in a SimpleScalar environment. Three ILP classes are identified: low ILP (memory-bound), medium ILP, and high ILP (execution-bound). Table 2 and
Table 3 give the selected 4-threaded and 8-threaded workloads for our simulation with various mixtures of ILP classes, respectively (each number in a table denotes the number of programs of the corresponding class in the respective mix).
Table 2. Simulated SPEC CPU2006 4-Threaded Workload
Mix Benchmarks Classification (ILP)
Low Med High
Mix1 libquantum, dealII, gromacs, namd 0 0 4
Mix2 soplex, leslie3d, povray, milc 0 4 0
Mix3 hmmer, sjeng, gobmk, gcc 0 4 0
Mix4 lbm, cactusADM, xalancbmk, bzip2 4 0 0
Mix5 libquantum, dealII, gobmk, gcc 0 2 2
Mix6 gromacs, namd, soplex, leslie3d 0 2 2
Mix7 dealII, gromacs, lbm, cactusADM 2 0 2
Mix8 libquantum, namd, xalancbmk, bzip2 2 0 2
Mix9 povray, milc, cactusADM, xalancbmk 2 2 0
Mix10 hmmer, sjeng, lbm, bzip2 2 2 0
Table 3. Simulated SPEC CPU2006 8-Threaded Workload
Mix Benchmarks Classification (ILP)
Low Med High
Mix1 libquantum, dealII, gromacs, namd, soplex, leslie3d, povray, milc 0 4 4
Mix2 libquantum, dealII, gromacs, namd, lbm, cactusADM, xalancbmk, bzip2 4 0 4
Mix3 hmmer, sjeng, gobmk, gcc, lbm, cactusADM, xalancbmk, bzip2 4 4 0
Mix4 libquantum, dealII, gromacs, soplex, leslie3d, povray, lbm, cactusADM 2 3 3
Mix5 dealII, gromacs, namd, xalancbmk, hmmer, cactusADM, milc, bzip2 3 2 3
Mix6 gromacs, namd, sjeng, gobmk, gcc, lbm, cactusADM, xalancbmk 3 3 2
3.3. Metrics
For a multi-threaded workload, total combined IPC is a typical indicator used to measure the
overall performance, which is definedas the sum of each thread’s IPC:
∑ (1)
where n denotes the number of threads per mix in the system. However, in order to preclude starvation effects among threads, the so-called harmonic IPC is also adopted, which reflects the degree of execution fairness among the threads, namely,

IPC_{harmonic} = n \Big/ \sum_{i=1}^{n} \frac{1}{IPC_i} \qquad (2)
In this paper, these two indicators are used to compare the proposed algorithm to the baseline (default) system. The following metric indicates the improvement percentage averaged over the selected mixes, which is applied to both IPC_{total} and IPC_{harmonic}, namely,

\Delta IPC\% = \frac{1}{m} \sum_{j=1}^{m} \frac{IPC_j^{proposed} - IPC_j^{baseline}}{IPC_j^{baseline}} \times 100\% \qquad (3)

where m denotes the number of mixes of the workload in our simulation. Note that this improvement percentage represents the average of the improvement percentages over all mixes, rather than the improvement percentage of the average IPC, which could be skewed by large variations among the IPC values of different mixes.
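The three metrics above are simple arithmetic over per-thread IPC values; the following sketch (our own, with illustrative names) mirrors equations (1), (2) and (3):

```python
def total_ipc(ipcs):
    """Equation (1): the sum of per-thread IPCs."""
    return sum(ipcs)

def harmonic_ipc(ipcs):
    """Equation (2): n over the sum of reciprocal per-thread IPCs."""
    n = len(ipcs)
    return n / sum(1.0 / ipc for ipc in ipcs)

def avg_improvement_pct(baseline, proposed):
    """Equation (3): the improvement percentage averaged over the m mixes,
    not the improvement percentage of the average IPC."""
    m = len(baseline)
    return 100.0 * sum((p - b) / b for b, p in zip(baseline, proposed)) / m
```

The harmonic IPC penalizes starvation: two threads with IPCs of 1.0 and 3.0 give a total IPC of 4.0 but a harmonic IPC of only 1.5, whereas two balanced threads at 2.0 each give the same total with a harmonic IPC of 2.0.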
4. MOTIVATION
The analysis performed in this paper is based on the conjecture that the benefit brought by speculative execution in an SMT system may be outweighed by the detrimental effect of resources being wasted, for prolonged periods, on instructions that are eventually mis-speculated. This conjecture in turn rests on the two following assumptions:
(1) There is a significant amount of flush-outs in SMT systems.
(2) Competition for the shared resources among the threads is high.
If these assumptions hold, flushing mis-speculated instructions out of these already congested shared resources means precious resources are being wasted. Properly limiting the chances and amount of such waste can thus potentially tip the balance. This section is devoted to a comprehensive analysis supporting this conjecture and these assumptions. The simulation results presented in this section are based on the system configuration shown in Table 1 with the 10 mixes of the 4-threaded workload shown in Table 2.
4.1. Flush-out Rate and Throughput Analysis
Figure 2. Percentage of Flush-Outs of All Fetched Instructions
As in a superscalar system, not all instructions fetched in an SMT system are eventually committed; in an SMT system, however, the percentage of flushed-out instructions is even higher. Figure 2 shows the percentage of flush-outs among all fetched instructions: an average of 21% of fetched instructions are eventually flushed out. For mix10, nearly 32% of fetched instructions are flushed out, which clearly indicates a very significant amount of wasted resources.
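The flush-out percentage is simple bookkeeping over fetch and commit counts; a sketch with hypothetical counter names, under the simplifying assumption that every fetched-but-uncommitted instruction counts as flushed:

```python
def flush_out_pct(fetched: int, committed: int) -> float:
    """Percentage of fetched instructions that are eventually flushed out."""
    return 100.0 * (fetched - committed) / fetched
```

At the reported average, roughly 21 of every 100 fetched instructions never commit.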
We further investigate the flush-out rate at each of four pipeline stages: fetch, rename, dispatch and commit. Figure 3 shows the throughput of each stage with the physical register file size (denoted as Rt) set at 256, that is, the average number of instructions per clock cycle flowing through that stage, including the instructions eventually flushed out. As expected, throughput decreases as the pipeline goes downstream, because some of the flushed-out instructions never reach the next stage. Note that the three main causes of flush-outs are mis-speculations, interrupts and no-ops, with the first accounting for over 99% of all flush-outs [1]; it is therefore the sole focus of this study, and the other two insignificant causes are ignored.
Figure 3. Throughput at Each SMT Pipeline Stage
4.2. Utilization Analysis of Shared Resources
Of all the resources wasted by mis-speculation in an SMT system, not all are critical to overall system performance. Some resources are private buffers (such as the IFQ and ROB), which cannot be utilized by instructions from other threads; congestion in these affects overall performance to a much lesser degree than congestion in the shared resources. In this section, the utilization of several key shared resources is analyzed, including the shared bandwidths between adjacent pipeline stages, the physical register file (the earliest shared buffer in the pipeline) and the IQ (the shared buffer required for a speculation to be resolved as early as possible). This study is necessary to affirm the two aforementioned assumptions.
(1) Inter-stage Bandwidth: In addition to the analysis in Section 4.1 illustrating the percentages of instructions flushed out, the following analysis shows how a critical shared resource, bandwidth in this case, is wasted for the same reason. Figure 4 shows the bandwidth wastage between adjacent pipeline stages. An average of 21% of bandwidth is wasted fetching instructions into the IFQ that are eventually flushed out; 9% of bandwidth is
wasted renaming into the register files, and 7% is wasted dispatching into the IQ. All this wastage could have been utilized by other, non-flushed instructions for higher overall system performance.
(2) Physical Rename Register File: The physical rename register file is the first buffer shared among threads from the start of the pipeline. If no physical registers are left for a thread to rename with, none of the available buffers or bandwidth downstream in the pipeline matters. Utilization of this buffer is therefore critical to overall system performance. Figure 5 displays the occupancy rate of the physical register file with its size (Rt) set at 160. Each point represents an average over the 10 mixes of the 4-threaded workload. As the results show, the register file is fully occupied in nearly 80% of simulation clock cycles, which clearly indicates the need for a larger register file. Instead of increasing the register file size, which contradicts the design philosophy of an SMT processor, a more efficient resource allocation scheme is needed. The next analysis investigates the wastage of this resource due to miss-speculation; exploiting such wastage is only meaningful when the register file is full. Figure 6 shows the percentage of miss-speculated instructions in the register file when it is full. Up to 9% of registers are occupied by miss-speculated instructions that are eventually flushed out, with an average of 3.7% of registers wasted. Note that the adverse effect on IPC from this wastage is not truly reflected by its percentage; it is much worse in an SMT system, since such wastage can block other threads that could otherwise have contributed significantly to the overall IPC.
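The occupancy statistics behind Figures 5 through 7 reduce to a histogram over per-cycle occupancy samples. The sketch below is illustrative only — the sample trace and function name are our own, not M-sim output:

```python
# Sketch: turn per-cycle occupancy samples into the distribution plotted in
# Figures 5 and 7. The sample list is a toy trace, not M-sim output.
from collections import Counter

def occupancy_distribution(samples, capacity):
    """Return ({occupied_entries: % of cycles}, % of cycles fully occupied)."""
    counts = Counter(samples)
    total = len(samples)
    dist = {occ: 100.0 * n / total for occ, n in sorted(counts.items())}
    pct_full = dist.get(capacity, 0.0)
    return dist, pct_full

# Toy trace: a 32-entry IQ that is full in 2 of 5 sampled cycles.
dist, pct_full = occupancy_distribution([32, 32, 16, 0, 8], capacity=32)
print(pct_full)  # 40.0
```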
Figure 4. Bandwidth Wastage at Each Stage
Figure 5. Average Register File Occupancy with Rt=160
(3) Issue Queue: A similar analysis is performed on another critical shared resource, the issue queue. Due to its relatively limited number of entries (compared to the size of the ROB at its up-stream stage), competition among concurrently executing threads for IQ entries can be severe at times. Figure 7 displays the percentage of simulation clock cycles in which a given number of IQ entries is occupied. The IQ is fully occupied in about 44% of the simulation time, mostly due to stagnant threads whose instructions are stuck in the IQ for long periods waiting on data dependencies (such as cache misses) or resource dependencies (on the corresponding functional units). Such a high percentage of full occupancy clearly indicates the need for better resource management. On the other hand, the IQ is completely empty about 15% of the time. Such low IQ occupancy may be a direct outcome of instruction flush-outs at the IQ, which is partially explained by the next result.
Figure 6. Percentage of Miss-Speculated Instruction in Register File (RF) when It Is Full
Figure 7. Average IQ Occupancy with IQ Size Equals to 32
Figure 8 shows the percentage of miss-speculated instructions in the IQ when the IQ is full. Up to 28% of IQ entries are occupied by miss-speculated instructions, with an average of 16% of entries wasted due to miss-speculation. That is, on average roughly one out of every six IQ entries could have been used by another thread to boost the overall IPC. The combination of high occupancy and a high percentage of flush-outs in the IQ again confirms the two assumptions of this study. If at all possible, miss-speculated instructions should be prevented from occupying IQ entries.
Figure 8. Percentage of Miss-Speculated Instructions in IQ when IQ is Full
5. CAPPING SPECULATIVE TRACES
From the analysis in Section 4, we find that a large amount of resources, including shared bandwidths, the physical register file and the IQ, is wasted on flushed-out miss-speculated instructions. A technique was proposed in [1] to impose a real-time limit on the number of speculative instructions per speculative trace that are allowed to be dispatched into the IQ. Although employing this technique at the dispatch stage yields a very significant performance improvement, one may argue, given the high percentage of flush-outs and heavy resource competition shown in the previous section, that the dispatch stage may not be the best
point at which to apply the capping mechanism; that is, the speculative trace may already have grown undesirably long, taking up too many shared resources and/or using too much bandwidth upstream in the pipeline. In this paper, three different pipeline stages are identified as the targets for initiating the capping mechanism:
(1) Fetch
(2) Rename
(3) Dispatch
Essentially, at each of the selected stages, once a trace enters speculative mode, the number of instructions of this thread allowed to proceed past the stage is limited by a fixed cap value until the speculation is resolved: if correct, the capping is lifted; if incorrect, all speculative instructions are flushed out. Only these three stages are targeted because speculation is resolved at the writeback stage, so there is no reason to cap a trace at that stage or any later stage, such as commit. In addition, once an instruction enters the issue stage, its occupancy time on the shared resource (the functional unit) is deterministic, so very little resource-utilization optimization is available at this stage unless more units are employed.
In the proposed capping technique, a counter records the number of speculative instructions fetched/renamed/dispatched so far. Once this number reaches a threshold (the pre-set cap value), fetching/renaming/dispatching from this thread is stalled until the respective control instruction is resolved at the writeback stage. Figure 9 illustrates the differences among capping at the three respective stages. As shown in the figure, the pre-set cap value is denoted C, and the number of instructions already in the system before the capping point is denoted X (Xd for capping-at-dispatch, Xr for capping-at-rename and Xf (= 0) for capping-at-fetch). The main objective of this technique is to reduce the resources wasted by these X + C instructions upon miss-speculation. Note that there is a trade-off stemming from (1) the cap value employed and (2) the stage selected, and the effectiveness of the capping mechanism is heavily influenced by both selections. Obviously, the larger X + C is, the more potential resource wastage there is, which leads to more blocking of other threads trying to proceed through the pipeline. On the other hand, the smaller X + C is, the bigger the pipeline "bubble" (or gap) caused by the capping: after a correctly speculated trace is resolved and resumed, the subsequent instructions blocked by the capping take more time to replenish the pipeline. This is especially true in an SMT system, where instructions from different threads intermix through shared pipeline components and compete for precious resources, potentially prolonging the latency of re-supplying subsequent instructions to a correctly speculated trace.
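The per-thread counter mechanism described above can be sketched as follows; the class and method names are our own illustration of the scheme from [1], which the paper specifies only at the microarchitectural level:

```python
# Minimal sketch of the per-thread speculative-trace cap from [1]: count
# instructions passing the capping stage while the thread is speculative,
# stall at C, and lift the cap (or flush) when the branch resolves.

class ThreadCapState:
    def __init__(self, cap):
        self.cap = cap            # pre-set cap value C
        self.speculative = False  # inside an unresolved speculative trace?
        self.count = 0            # speculative instructions past the cap stage

    def may_pass(self):
        """Called at the capping stage (fetch/rename/dispatch) per instruction."""
        if not self.speculative:
            return True           # non-speculative instructions are never capped
        if self.count >= self.cap:
            return False          # stall this thread until the branch resolves
        self.count += 1
        return True

    def enter_speculation(self):
        self.speculative = True
        self.count = 0

    def resolve(self, correct):
        # Correct: the cap is lifted and the thread resumes. Incorrect: the
        # speculative instructions are flushed. The counter resets either way.
        self.speculative = False
        self.count = 0

t = ThreadCapState(cap=2)
t.enter_speculation()
print([t.may_pass() for _ in range(4)])  # [True, True, False, False]
t.resolve(correct=True)
print(t.may_pass())  # True
```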
Figure 9. Capping Speculative Traces at the Fetch, Rename, and Dispatch Stages
(1) Cap value selection
A strict (small) cap value C, no matter which capping stage is chosen, leads to a smaller X + C and thus entails a compromise between losing the benefits of correctly speculated traces and saving the resources wasted by miss-speculated ones. A larger cap value instead diminishes the benefit gained from reducing resource wastage due to miss-speculation.
(2) Cap stage selection
With a fixed cap value C, capping at an earlier stage, e.g., the fetch stage, can potentially spare more shared resources than capping at a later stage, e.g., the dispatch stage, due to a smaller X value. However, such a gain can easily be offset by slowing down a correctly speculated trace, as noted above, due to the larger gap to fill.
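To make the interaction of the two selections concrete, the following toy cost model — entirely our own construction; the linear terms and weights are illustrative assumptions, not from the paper — pits a miss-speculation waste term that grows with X + C against a refill-bubble term that shrinks with it:

```python
# Toy illustration (ours, not from the paper) of the two capping trade-offs:
# miss-speculated traces waste resources in proportion to X + C, while
# correctly speculated traces pay a refill bubble that shrinks as X + C grows.

def expected_cost(x, c, p_miss, pipeline_depth, w_waste=1.0, w_bubble=1.0):
    held = x + c                                           # speculative instructions in flight
    waste = p_miss * held                                  # resources lost on a flush
    bubble = (1 - p_miss) * max(pipeline_depth - held, 0)  # gap to refill if correct
    return w_waste * waste + w_bubble * bubble

# With the same cap C, a fetch-like capping point (x = 0) holds fewer
# instructions than a dispatch-like one (larger x): smaller waste term,
# larger bubble term.
print(expected_cost(x=0, c=4, p_miss=0.1, pipeline_depth=16))  # fetch-like
print(expected_cost(x=8, c=4, p_miss=0.1, pipeline_depth=16))  # dispatch-like
```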
6. SIMULATION RESULTS
The proposed technique is simulated with M-sim and compared against the default settings using 10 mixes for the 4-threaded workload and 6 mixes for the 8-threaded workload, as shown in Table 2 and Table 3 respectively. The IPC improvement in this section is evaluated by Equation 3.
6.1. 4-threaded Workload
We first examine the effect on IPC as the cap value is varied with the capping technique applied at each of the three selected pipeline stages. To clearly demonstrate the trade-off effects under various resource-availability situations, we also vary the physical register file size (Rt), since it has been shown to be the most dominant shared resource [6]. In the 4-threaded case, four Rt values are used, ranging from 160 (a very tightly contested resource) to 256 (a rarely congested one). Cap values are selected to cover the relevant range for each Rt value, as shown in Figures 10, 11, 12, and 13, respectively.
Figure 10. IPC Improvement for Each Cap Value with Rt=160 on 4-Threaded Workload
Figure 11. IPC Improvement for Each Cap Value with Rt=192 on 4-Threaded Workload
There are five important observations and conclusions one can draw from these results:
(1) Trade-off due to the cap value (C):
The results clearly verify our hypothesis and the first trade-off: for each setting, there is an optimal C value that maximizes the overall gain. Setting C too small, and thus leaving too large a gap, may fail to compensate the loss of the intended speculation benefit with the savings in resource wastage. Conversely, as the cap value grows, the loss from resource wastage gradually overtakes the benefit from correct speculation: increasing the cap value diminishes the gain from reducing wasted resources while abating the potential loss from interrupting correctly speculated traces. Increasing the cap value far enough eventually eliminates the resource-saving benefit entirely, defaulting back to the original non-capping case.
Figure 12. IPC Improvement for Each Cap Value with Rt=224 on 4-Threaded Workload
Figure 13. IPC Improvement for Each Cap Value with Rt=256 on 4-Threaded Workload
(2) Effectiveness of capping with respect to the setting of Rt:
The loss of intended speculation benefit from setting the cap value too small is more prominent when the shared resource (Rt in this case) is less stringent and can afford the wastage. For example, when capping is imposed at the fetch stage, setting the cap value to 1 still gives an average IPC improvement of 12% for Rt = 160, but only 5% for Rt = 192, and a net negative impact of −8% for Rt = 224 and −24% for Rt = 256. This clearly shows the impact of Rt on the selection of the cap value C: the larger Rt is, the less the system can afford a small cap value. Also, the overall effectiveness of capping continues to dwindle as the shared resource size increases, reflecting the degree of competition among the threads for the registers.
(3) Optimal cap value with respect to the setting of Rt:
The optimal IPC gain occurs at a different cap value for each Rt setting. For the capping-at-dispatch case, when Rt = 160 the optimal IPC occurs at a cap value of 15, and the optimal point continues to rise: 30 for Rt = 192, 45 for Rt = 224 and 50 for Rt = 256. That is, the larger the shared resource, the larger the cap value needs to be to keep the loss from pipeline bubbles from overcoming the benefit of resource saving.
(4) Trade-off from where the capping is imposed:
Among the three stages, capping-at-rename almost consistently delivers the best result, outgaining the other two by another 2-3%. This reflects the trade-off between capping at an earlier stage, which benefits more from a shorter speculative trace, and capping at a later stage, which diminishes the adverse effect of the delay in replenishing the pipeline after a correct speculation is resolved. That is, as shown in Figure 9, with the same cap value imposed, capping-at-fetch allows the shortest overall speculative trace (Xf = 0) before resolution and thus leads to the smallest resource wastage; however, the ensuing larger pipeline gap (bubble) takes longer to replenish if the speculation turns out to be correct. On the other hand, capping-at-dispatch allows the longest trace (largest X) to exist, leading to the smallest benefit from curtailing resource wastage, but does not suffer as much in replenishing the pipeline.
(5) Optimal cap value with respect to the stage of capping:
Further investigation of the results yields another interesting observation: the optimal cap value that produces the highest IPC gain differs among the three capping stages, shifting downward as the capping stage moves down-stream. For example, for Rt = 160 (Figure 10), capping-at-fetch peaks at a cap value of 15, capping-at-rename at 12 and capping-at-dispatch at 8. That is, capping at an earlier stage has to adopt a larger cap value to prevent the aforementioned instruction-replenishing delay from severely offsetting the benefit of resource saving.
6.2. 8-threaded Workload
A similar set of simulations is performed for the 8-threaded case, with the results displayed in Figures 14, 15, 16, and 17 for Rt set at 320, 384, 448 and 512, respectively. These sizes are chosen to allow the same resource allocation quota per thread as in the 4-threaded case¹. Compared to the results from the 4-threaded workload, the 8-threaded results are very similar, except that the distinction among the three capping stages is much more pronounced. The respective five observations are briefly summarized in the following:
(1) The trade-off due to the cap setting is in general similar to the 4-threaded results.
(2) The effectiveness of capping is even better than in the 4-threaded case, consistently up to 35%, even for a very large register file. In addition, even a very small cap value leads to improvement in most cases. All of this clearly indicates heavier resource competition among the larger number of threads even though the per-thread quota remains the same. That is, a system with more threads has a higher chance of imbalanced resource occupancy and thus tends to benefit more from the capping technique.
¹ Note that, for the rename physical register file, n × 32 entries are automatically pre-reserved for the n architectural register files, where n is the number of concurrent threads. Thus, Rt = 160 for the 4-threaded case actually leaves a total of 32 (160 − 4 × 32) registers for rename allocation among the 4 threads – a quota of 8 per thread; similarly, Rt = 320 for the 8-threaded case allows for 64 (320 − 8 × 32) rename registers, also a quota of 8 per thread.
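The footnote's quota arithmetic generalizes to a one-line helper (illustrative; the 32 pre-reserved architectural registers per thread follow the footnote):

```python
# Per-thread rename-register quota per the footnote: n x 32 entries are
# pre-reserved for the n architectural register files, and the remainder
# of Rt is shared for renaming.

def rename_quota(rt, n_threads, arch_regs=32):
    shared = rt - n_threads * arch_regs   # registers left for renaming
    return shared // n_threads            # even quota per thread

print(rename_quota(160, 4))  # 8  (32 shared registers / 4 threads)
print(rename_quota(320, 8))  # 8  (64 shared registers / 8 threads)
```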
Figure 14. IPC Improvement for Each Cap Value with Rt=320 on 8-Threaded Workload
Figure 15. IPC Improvement for Each Cap Value with Rt=384 on 8-Threaded Workload
Figure 16. IPC Improvement for Each Cap Value with Rt=448 on 8-Threaded Workload
Figure 17. IPC Improvement for Each Cap Value with Rt=512 on 8-Threaded Workload
(3) The optimal cap value shifts higher as Rt increases, similar to the 4-threaded case; however, the optimal point is comparably lower than in the 4-threaded case, another indication of heavier resource competition.
(4) Capping-at-rename is the best of the three when Rt is small, and capping-at-dispatch takes over the top when Rt becomes larger. This implies that the pipeline bubbles created by capping are harder to fill as Rt increases.
(5) The optimal cap value shifts downward as the capping stage moves down-stream, a trend even more prominent than in the 4-threaded case.
6.3. Execution Fairness
To further demonstrate that the overall IPC improvement from the proposed technique does not jeopardize execution fairness, Figure 18 and Figure 19 display the Harmonic IPC (defined in Equation 2) for each mix with Rt = 160 for 4 threads and Rt = 448 for 8 threads, respectively. In these simulation runs, the stage-wise overall-IPC-optimal cap values are used – 15, 12, and 8 for the fetch, rename and dispatch stages on the 4-threaded workload, and 40, 30, and 30 for the 8-threaded workload. These optimal cap values are adopted in this part of the simulations because the highest degree of unfair execution usually occurs when overall system throughput (overall IPC) is optimized by favoring faster threads over slower ones. As Figure 18 shows, even when overall system throughput is maximized, and irrespective of the stage at which capping is imposed, the technique still improves the Harmonic IPC by up to 29%, a remarkable result for execution fairness. An improvement of up to 6% is shown for the 8-threaded workload.
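Equation 2 is not reproduced in this excerpt; assuming the conventional harmonic-mean definition over the per-thread IPCs, the metric is dragged down by any starved thread, which is why it serves as a fairness indicator:

```python
# Harmonic mean of per-thread IPCs (assumed form of Equation 2). Unlike the
# arithmetic mean, one starved thread collapses the result, so the metric
# rewards balanced per-thread progress.

def harmonic_ipc(ipcs):
    return len(ipcs) / sum(1.0 / x for x in ipcs)

balanced = harmonic_ipc([1.0, 1.0, 1.0, 1.0])      # 1.0
skewed = harmonic_ipc([3.55, 0.15, 0.15, 0.15])    # same arithmetic mean (1.0)
print(balanced, round(skewed, 3))                  # the skewed mix scores far lower
```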
Combining all the results presented in this section, including the analysis of the two trade-offs on overall IPC and execution fairness, we can see that the rename stage is in most cases the best stage at which to apply the capping mechanism, outperforming the other two stages by up to 10%. This choice is especially preferable when the amount of available resources is tight. All selections come without any loss of execution fairness. That is, properly capping speculative traces not only prevents an undesirable amount of resource wastage
from flush-outs but also balances the progress of the threads by preventing any single thread from overwhelming the shared resources and blocking the others.
7. CONCLUSIONS
This paper presented a thorough analysis identifying the ideal pipeline stage at which to place the speculative trace capping mechanism so as to maximize system throughput, and explained the underlying causes. The analysis results set a foundation for future extensions that further customize the capping technique to best exploit the trade-offs discussed here. One obvious direction is to dynamically adjust the cap value in real time based on the run-time speculation prediction result; that is, a speculative trace is allowed to grow longer if it has a higher chance of being correct. Such a dynamic approach can minimize the pipeline gap caused by capping when the speculation is likely to be correct, while curtailing the wastage of shared resources when it is not. Another potential direction is a hybrid capping mechanism that imposes multiple capping points at different pipeline stages for each individual trace. One intended benefit of this technique is to spread, and thus dilute, the congestion across all stages instead of allowing all speculative instructions to flow toward the writeback stage, lessening the bottleneck problem.
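The dynamic-cap direction above could, for example, scale the cap with a branch-confidence estimate; the estimator interface and the linear scaling below are our own illustrative assumptions, not part of [1]:

```python
# Illustrative sketch of the dynamic-cap idea from the conclusions: let the
# cap grow with the predicted probability that the speculation is correct.
# The confidence source and linear mapping are assumptions, not from [1].

def dynamic_cap(confidence, c_min=2, c_max=30):
    """Map a branch-confidence estimate in [0, 1] to a cap value."""
    confidence = min(max(confidence, 0.0), 1.0)   # clamp out-of-range estimates
    return round(c_min + confidence * (c_max - c_min))

print(dynamic_cap(0.95))  # high confidence -> long trace allowed (29)
print(dynamic_cap(0.30))  # low confidence  -> tight cap (10)
```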
Figure 18. Harmonic IPC for Each Mix with Rt=160 on a 4-Threaded Workload
Figure 19. Harmonic IPC for Each Mix with Rt=448 on an 8-Threaded Workload
REFERENCES
[1] Y. Zhang and W.-M. Lin, "Capping Speculative Traces to Improve Performance in Simultaneous Multi-Threading CPUs", Proc. of the 2013 IEEE 27th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), IEEE, 2013.
[2] H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase and T. Nishizawa, "An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads", Proc. of the 19th Annual International Symposium on Computer Architecture, May 1992, pp. 136-145.
[3] D. Tullsen, S. J. Eggers and H. M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism", Proc. of the 22nd Annual International Symposium on Computer Architecture, May 1995, pp. 392-403.
[4] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo and R. L. Stamm, "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multi-Threading Processor", Proc. of the 23rd Annual International Symposium on Computer Architecture, May 1996, pp. 191-202.
[5] F. J. Cazorla, A. Ramirez, M. Valero and E. Fernandez, "Dynamically Controlled Resource Allocation in SMT Processors", Proc. of the 37th International Symposium on Microarchitecture, Dec 2004, pp. 171-192.
[6] Y. Zhang and W.-M. Lin, "Efficient Physical Register File Allocation in Simultaneous Multi-Threading CPUs", Proc. of the 33rd IEEE International Performance Computing and Communications Conference (IPCCC 2014), Austin, Texas, Dec 2014.
[7] Y. Zhang and W.-M. Lin, "Intelligent Usage Management of Shared Resources in Simultaneous Multi-Threading Processors", Proc. of the 21st International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'15), Las Vegas, NV, July 2015.
[8] Y. Zhang, C. Douglas and W.-M. Lin, "Recalling Instructions from Idling Threads to Maximize Resource Utilization for Simultaneous Multi-Threading Processor", Computers and Electrical Engineering, 40 (2013).
[9] Y. Zhang and W.-M. Lin, "Write Buffer Sharing Control in SMT Processors", Proc. of the 19th International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'13), Las Vegas, NV, July 22-25, 2013.
[10] S. Lawal, Y. Zhang and W.-M. Lin, "Prioritizing Write Buffer Occupancy in Simultaneous Multi-Threading Processors", Journal of Emerging Trends in Computing and Information Sciences, Vol. 6, No. 10, pp. 515-522, Nov 2015.
[11] S. Carroll and W.-M. Lin, "Latency-Aware Write Buffer Resource Control in Multithreaded Cores", International Journal of Distributed and Parallel Systems (IJDPS), Vol. 7, No. 1, Jan 2016.
[12] S. McFarling, "Combining Branch Predictors", DEC Western Research Laboratory Technical Note TN-36, June 1993.
[13] S. Pan, K. So and J. T. Rahmeh, "Improving the Accuracy of Dynamic Branch Prediction Using Branch Correlation", Proc. of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, Oct 1992, pp. 76-84.
[14] J. E. Smith, "A Study of Branch Prediction Strategies", Proc. of the 8th Annual International Symposium on Computer Architecture, May 1981, pp. 135-148.
[15] J. Sharkey, "M-Sim: A Flexible, Multi-threaded Simulation Environment", Tech. Report CS-TR-05-DP1, Department of Computer Science, SUNY Binghamton, 2005.
[16] Standard Performance Evaluation Corporation (SPEC) website, http://www.spec.org/.
[17] T. Sherwood, E. Perelman, G. Hamerly and B. Calder, "Automatically Characterizing Large Scale Program Behavior", Proc. of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, Oct 2002, pp. 45-57.