Load Speculation
Jong-Jiann Shieh
Department of Computer Science and Engineering
Tatung University
shieh@ttu.edu.tw
and
Cheng-Chun Lin
Night.cola@msa.hinet.net
and
Shin-Rung Chen
apsras@amigo.cse.ttu.edu.tw
Abstract
A superscalar processor must issue instructions as early as possible to achieve high performance, but a load instruction can only be issued once its register dependencies are resolved and its memory dependencies are known. A register dependence forces a load to wait until the prior instruction that writes its source register has completed. A memory dependence means a load cannot be issued before its ambiguity with prior stores is resolved. A load can therefore issue only when no register dependencies remain and the effective addresses of all prior stores have been calculated. This paper combines two mechanisms, value prediction (VP) and a load forwarding history table (LFHT), to execute load instructions speculatively. Our study shows that doing so yields about a 15% average speedup over the baseline architecture.
Keywords: load speculation, register dependence, memory dependence, value prediction, load forwarding
1. Introduction
Modern superscalar processors allow instructions to
execute out of program order to find more instruction level
parallelism (ILP). These processors must monitor data
dependencies to maintain correct program behavior. There are
two types of data dependencies, register dependence and
memory dependence.
Register dependence is detected in the instruction decode stage by examining instructions' register operand fields. If a load depends on a prior instruction, the load must wait until that instruction completes before the operand value can be used.
The lack of information about memory dependence at
instruction decode time is a problem for an out-of-order
instruction scheduler. If the scheduler executes a load before a
prior store that writes to the same memory location, the load
will read the wrong value. In this event the load and all
subsequent dependent instructions must be re-executed,
resulting in a huge performance penalty.
To avoid these memory order violations, the instruction scheduler can conservatively prevent loads from executing until all prior stores have executed. This approach hurts performance because, in the majority of cases, loads are made falsely dependent on stores with which they never alias, as the data in Section 3 show.
In this paper, we use a simple value predictor to predict operand values, so that register dependences do not stall loads, and we propose a structure called the Load Forwarding History Table (LFHT) to exploit memory dependence speculation at run time. Combining the two mechanisms, the predictor helps the LFHT let more load instructions execute without waiting for prior stores' effective addresses to be calculated, so more loads are issued earlier. When a load instruction is speculatively executed, the instructions that depend on it are also speculatively executed.
The rest of this paper is organized as follows. Section 2 surveys related work. Section 3 describes the whole structure within a superscalar processor. Section 4 describes our CPU model and simulation environment. Performance is evaluated in Section 5. Finally, Section 6 concludes the paper.
2. Related Work
Traditional work on memory disambiguation was done in the context of compiler and hardware mechanisms for non-speculative disambiguation that ensure program correctness. Franklin and Sohi [2] proposed the address resolution buffer (ARB). The ARB directs memory references into bins according to their address; the bins are used to enforce a temporal order between references to the same address. The ARB is a banked structure: multiple disambiguation requests can be dispatched in one cycle, provided they all target different banks.
Chrysos and Emer used a predictor to solve the memory disambiguation problem [5]. Their goal is to schedule load instructions as soon as possible without causing any memory order violations. The proposed predictor is based on store sets. A store set for a specific load is the set of all stores upon which the load has ever depended. The processor adds a store to a load's store set when a memory order violation is caused by the load executing before that store. On the next instance of the load instruction, its store set is accessed to determine which stores the load must wait for before executing.
A. Yoaz, M. Erez, R. Ronen, and S. Jourdan designed the CHT predictor [7]. The CHT predictor predicts whether a load instruction will conflict with any store in the instruction window, allocating a new entry only when a load collides for the first time and invalidating the entry when its state changes to non-colliding. It does not predict which store instruction the load will conflict with; it is therefore easier to design, but it does not provide the best possible information for disambiguation purposes.
The color set scheme [10] is a simple mechanism that incorporates multiple speculation levels within the processor and classifies load and store instructions at run time into the appropriate speculation level. Each speculation level is termed a color, and the sets of load and store instructions are called color sets. The colors divide the load instructions into distinct sets, starting with the base color, which corresponds to the no-violation case; that is, the set of load instructions that have never collided with unready store instructions in the past. Each color in the spectrum represents an increasing level of aggressiveness in load speculation; a load instruction is allowed to issue only if its color is less than or equal to the current speculation level. If the processor later discovers that the load has collided with a store, the color assigned to the load in the predictor is increased.
3. VALUE PREDICTION AND LFHT
3.1 Issuing a Load
When a load or store instruction executes, it is split into two micro-instructions inside the processor [1]. One calculates the effective address, and the other performs the memory access once the effective address has been calculated and any potential store alias dependencies have been resolved. In the baseline architecture, each store and load instruction must wait until its effective address calculation completes. In addition, all stores are issued in order with respect to prior stores, and each load must wait on the most recent store before it can be speculatively issued.
There are three cases in which a load instruction spends cycles: (1) waiting on its effective address calculation (ea), (2) waiting for prior store addresses to be calculated (dep), and (3) the latency of fetching the data (mem). This paper focuses on (1) and (2): we use data prediction to attack problem (1) and the LFHT to attack problem (2).
Figure 3.1 shows how many cycles each load instruction spends waiting on its effective address [13]. As the figure shows, each load instruction waits 7 cycles on average before it obtains its effective address, which wastes a great deal of time.
[Bar chart: one bar per benchmark (bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr); y-axis 0-30 cycles.]
Figure 3.1 Cycles per load instruction spent waiting on its effective address in the baseline architecture.
In the conventional memory dependence disambiguation mechanism [14], load forwarding can detect a store alias and forward the store's data. Figure 3.2 shows the percentage of loads that can take advantage of load forwarding [12]. On the baseline simulation architecture (described in Section 4), most load instructions neither forward store data nor conflict with a prior store: the average fraction of forwarding loads is 12.7%, and the lowest is only 2.7%. This means that most load instructions are left pending unnecessarily for memory dependence disambiguation.
[Bar chart: one bar per benchmark (bzip2, crafty, gap, gzip, mcf, parser, twolf); y-axis 0%-50%.]
Figure 3.2 Percentage of forwarding load instructions.
3.2 Value Prediction
All loads must wait until their effective address is calculated before they can be issued. If the load is on the critical path and the address can be accurately predicted, it can be beneficial to speculate on the address value and load the data as soon as possible.
A load instruction is effectively split into two instructions inside the processor, one of which calculates the effective address. To speed up this instruction, we predict its operand so that it need not wait for the prior instruction on which it depends.
Since we predict instruction operands, instructions with no register dependence on a prior instruction need no prediction: their register values are already exact. In our simulation, we therefore predict only instructions with a register dependence, but all load instructions update the predictor so that its accuracy is maintained.
Data prediction speeds up the effective address calculation for a load; the load then only has to wait on potential store aliases before issuing. If the operand was incorrectly predicted, a recovery mechanism takes over once the actual operand is available.
Value prediction has been studied for a long time, and many schemes have been proposed [4, 6, 11]. In this paper we use the simplest scheme, a stride predictor, to predict the operand of a load instruction.
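To illustrate the scheme, the following C sketch shows a PC-indexed stride predictor with the init/transient/steady confidence states mentioned in Section 5.2. The table size matches the VP(2K) configuration of figure 5.4, but the field names, indexing, and exact state transitions are our assumptions, not the simulator's code.

    #include <stdint.h>

    enum vp_state { VP_INIT, VP_TRANSIENT, VP_STEADY };

    /* One entry of a PC-indexed stride value predictor (layout assumed). */
    struct vp_entry {
        uint64_t tag;         /* load PC, to detect table aliasing          */
        uint64_t last_value;  /* last operand value observed for this load  */
        int64_t  stride;      /* difference between the last two values     */
        enum vp_state state;  /* predictions are made only in VP_STEADY     */
    };

    #define VP_ENTRIES 2048   /* the VP(2K) configuration of figure 5.4 */
    static struct vp_entry vp_table[VP_ENTRIES];

    static unsigned vp_index(uint64_t pc) { return (unsigned)((pc >> 2) % VP_ENTRIES); }

    /* Returns 1 and writes a predicted operand if confident, else 0. */
    int vp_predict(uint64_t pc, uint64_t *pred)
    {
        struct vp_entry *e = &vp_table[vp_index(pc)];
        if (e->tag != pc || e->state != VP_STEADY)
            return 0;                  /* init or transient: do not predict */
        *pred = e->last_value + (uint64_t)e->stride;
        return 1;
    }

    /* Every load updates the predictor once its actual operand is known. */
    void vp_update(uint64_t pc, uint64_t value)
    {
        struct vp_entry *e = &vp_table[vp_index(pc)];
        if (e->tag != pc) {            /* miss: (re)allocate the entry */
            e->tag = pc;
            e->last_value = value;
            e->stride = 0;
            e->state = VP_INIT;
            return;
        }
        int64_t new_stride = (int64_t)(value - e->last_value);
        if (new_stride == e->stride)
            e->state = VP_STEADY;      /* stride repeated: gain confidence */
        else
            e->state = (e->state == VP_STEADY) ? VP_TRANSIENT : VP_INIT;
        e->stride = new_stride;
        e->last_value = value;
    }

On a misprediction, updating with the actual value demotes the entry out of the steady state, so the load is not predicted again until its stride stabilizes.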
3.3 Load Forwarding History Table
When a load is issued, it looks up the store buffer for a non-committed aliased store and performs its data cache access in parallel. If a store alias is found, the store's data is forwarded to the load, and the load completes with a shorter latency. If there is no store alias and the data cache hits, the load takes longer because of the pipelined data cache. If the data cache misses, the miss is processed only if no alias is found in the store buffer. Load forwarding can thus detect store aliases and forward store data, so load instructions can be issued out of order without waiting for prior stores to execute [8, 9, 14].
The conventional memory dependence disambiguation mechanism cannot provide this information for a load instruction at the decode stage. To exploit load-forwarding behavior and obtain all of these benefits, we therefore propose the load forwarding history table (LFHT). The LFHT records the load-forwarding outcome of a load instruction's most recent execution and, when the same load is encountered again, determines whether to issue it out of order.
Each LFHT entry contains two fields: a tag field and an alias bit. The LFHT is organized as a direct-mapped cache indexed by the PC. The alias bit is sticky: after the first load-forwarding event that indicates a conflict with a store at execution time, the load is always treated as aliased and waits until all prior store addresses have been calculated before issuing.
The LFHT is filled and updated according to each load instruction's load-forwarding behavior at run time. On an LFHT miss, part of the load instruction's PC is written into the corresponding entry as the tag, and the alias bit is set or cleared according to the load-forwarding behavior. On an LFHT hit, the alias bit is likewise set or cleared, except that if the alias bit is already set and the load-forwarding behavior indicates no conflict with a prior store, the alias bit stays set. On an LFHT hit with the alias bit clear, the load instruction is speculatively executed.
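The table's policy can be summarized in the following C sketch. The entry layout, table size (one of the sizes evaluated in Section 5.2), and tag width are illustrative assumptions; only the sticky-alias update rule is taken from the description above.

    #include <stdbool.h>
    #include <stdint.h>

    /* One LFHT entry: a partial-PC tag plus a sticky alias bit. */
    struct lfht_entry {
        uint32_t tag;    /* upper PC bits written on allocation           */
        bool     alias;  /* sticky: set on the first observed store alias */
        bool     valid;
    };

    #define LFHT_ENTRIES 2048   /* one of the sizes evaluated in Sec. 5.2 */
    static struct lfht_entry lfht[LFHT_ENTRIES];

    static unsigned lfht_index(uint64_t pc) { return (unsigned)((pc >> 2) % LFHT_ENTRIES); }
    static uint32_t lfht_tag(uint64_t pc)   { return (uint32_t)(pc >> 13); }

    /* Decode-time query: may this load issue before prior store
     * addresses are known?  Only a hit with a clear alias bit says yes. */
    bool lfht_may_speculate(uint64_t pc)
    {
        struct lfht_entry *e = &lfht[lfht_index(pc)];
        return e->valid && e->tag == lfht_tag(pc) && !e->alias;
    }

    /* Execution-time update with the observed load-forwarding outcome. */
    void lfht_update(uint64_t pc, bool store_alias_seen)
    {
        struct lfht_entry *e = &lfht[lfht_index(pc)];
        if (!e->valid || e->tag != lfht_tag(pc)) {  /* miss: allocate */
            e->valid = true;
            e->tag   = lfht_tag(pc);
            e->alias = store_alias_seen;
            return;
        }
        if (store_alias_seen)  /* hit: sticky bit can be set, never cleared */
            e->alias = true;
    }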
Speculative load instructions are validated or invalidated as each prior store address is calculated. Each time a store address is computed, all executed speculative loads that come after the store in the instruction window have their addresses checked for an alias. If an alias is found, recovery action is taken, the load must be re-issued, and the corresponding alias bit is set to prevent incorrect speculative execution of that load in the future.
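The check can be sketched as follows, assuming a circular load/store queue kept in program order. The queue layout and the recovery hook squash_and_reissue are hypothetical names, not simulator interfaces; lfht_update is the routine from the sketch above.

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified load/store queue entry; the fields are assumptions. */
    struct lsq_entry {
        bool     is_load, executed, speculative;
        uint64_t pc, addr;
        unsigned size;                 /* access size in bytes */
    };

    #define LSQ_SIZE 128               /* 128-entry load/store queue (Table 1) */
    static struct lsq_entry lsq[LSQ_SIZE];   /* index order == program order */

    static bool overlap(uint64_t a, unsigned an, uint64_t b, unsigned bn)
    {
        return a < b + bn && b < a + an;
    }

    extern void squash_and_reissue(struct lsq_entry *ld);        /* hypothetical */
    extern void lfht_update(uint64_t pc, bool store_alias_seen); /* Sec. 3.3    */

    /* Called when the store at index s computes its effective address:
     * every younger, already-executed speculative load is checked. */
    void on_store_address_ready(unsigned s, unsigned tail)
    {
        for (unsigned i = (s + 1) % LSQ_SIZE; i != tail; i = (i + 1) % LSQ_SIZE) {
            struct lsq_entry *ld = &lsq[i];
            if (ld->is_load && ld->speculative && ld->executed &&
                overlap(ld->addr, ld->size, lsq[s].addr, lsq[s].size)) {
                squash_and_reissue(ld);     /* recovery action    */
                lfht_update(ld->pc, true);  /* set the sticky bit */
            }
        }
    }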
3.4 Combining VP and LFHT
We combine the LFHT introduced in Section 3.3 with the value predictor discussed in Section 3.2.
If only VP is used, some load instructions obtain their operands faster, but they must still wait for prior store instructions' effective addresses to be calculated to ensure there is no memory dependence.
If only the LFHT is used, the memory disambiguation problem is overcome, but register dependences remain: some load instructions' operands are not yet available, so those loads must wait for their operands before issuing to a function unit.
We therefore combine the two mechanisms, as shown in figure 3.3, to address both memory dependence and register dependence.
[Block diagram: instruction fetch unit, decode unit, register update unit, load/store function unit and other function units, and register file, with the VP and LFHT attached via an additional data path.]
Figure 3.3 Architecture data path with VP and LFHT.
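Putting the two pieces together, the issue condition for a load can be sketched as below. The flag names are our own; lfht_may_speculate is the query from the Section 3.3 sketch. This is a simplification of the policy, not the simulator's scheduling code.

    #include <stdbool.h>
    #include <stdint.h>

    /* Per-load status flags for the combined decision (names assumed). */
    struct load_status {
        bool     operand_ready;       /* register operand available           */
        bool     vp_confident;        /* stride predictor in its steady state */
        bool     prior_stores_known;  /* all earlier store addresses computed */
        uint64_t pc;
    };

    extern bool lfht_may_speculate(uint64_t pc);   /* Sec. 3.3 sketch */

    /* A load may issue once it can form an effective address (a real or
     * predicted operand) and memory ordering is safe (every prior store
     * address known, or the LFHT predicts no alias). */
    bool load_may_issue(const struct load_status *ld)
    {
        bool ea_possible   = ld->operand_ready || ld->vp_confident;
        bool ordering_safe = ld->prior_stores_known || lfht_may_speculate(ld->pc);
        return ea_possible && ordering_safe;
    }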
4. Evaluation Methodology
4.1 Machine Model
The simulator used in this work is derived from the SimpleScalar 2.0 and 3.0c tool sets [3], a suite of functional and timing simulators. The instruction set architecture employed is the Alpha AXP ISA.
Table 1 summarizes some of the parameters used in our
baseline architecture. Table 2 shows the architectures we
studied in this evaluation.
Table 1 Baseline Architecture Configuration
Instruction fetch: 8 instructions per cycle.
Out-of-order execution mechanism: issue of 8 instructions/cycle; 256-entry RUU (the ROB and instruction window combined); 128-entry load/store queue. Loads execute only after all preceding store addresses are known. Values are bypassed to loads from matching stores ahead in the load/store queue. 2-cycle load forwarding latency.
Architected registers: 32 integer, hi, lo, 32 floating-point.
Functional units (FU): 8 integer ALUs, 8 load/store units, 4 FP adders, 1 integer MULT/DIV, 1 FP MULT/DIV.
FU latency: int ALU 1, load/store 1, int mult 3, int div 12, FP adder 2, FP mult 4, FP div 12, FP sqrt 24.
L1 instruction cache: 64K bytes, 2-way set assoc., 32-byte blocks, 4-cycle hit latency.
L1 data cache: 64K bytes, 2-way set assoc., 32-byte blocks, 4-cycle hit latency; dual ported.
L2 unified cache: 1024K bytes, 4-way set assoc., 64-byte blocks, 12-cycle hit latency.
Memory: access latency (first 36, rest 4) cycles; memory bus width 32 bytes.
TLB miss: 30 cycles.
Table 2 Architectures we studied
Baseline: baseline architecture
VP: baseline + VP
VP + LFHT: baseline + VP + LFHT
VP + LFHT with Cycle Clear: baseline + VP + LFHT with Cycle Clear
Perfect VP + Perfect LFHT: baseline + perfect VP + perfect LFHT
Note: Cycle Clear is detailed in Section 5.3.
4.2 Benchmarks
To perform our experimental study, we collected results for the SPEC2000 benchmarks. The programs were compiled with the gcc compiler included in the tool set. Table 4 shows the input data set for each integer and floating-point benchmark. In simulating the benchmarks, we skipped the first billion instructions and collected statistics over the next five hundred million instructions.
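Skipping and sampling of this kind is typically done with SimpleScalar's fast-forward and instruction-limit options; a representative invocation might look like the line below. The binary name and argument placement are our reconstruction, not taken from the paper.

    sim-outorder -fastfwd 1000000000 -max:inst 500000000 bzip2.alpha input.source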
Table 4 Input data sets for the benchmarks
SPECint 2000 (input):
bzip2: input.source
crafty: crafty.in
gap: ref.in
gcc: 166.i
gzip: input.graphic
mcf: inp.in
parser: ref.in
twolf: ./twolf/ref
vortex: lendian.raw
vpr: net.in & arch.in
SPECfp 2000 (input):
ammp: ammp.in
applu: applu.in
art: a10.img &
equake: inp.in
galgel: galgel.in
lucas: lucas2.in
mesa: mesa.in
mgrid: mgrid.in
swim: swim.in
5. Performance Analysis
In this section, we examine the performance improvement gained by the proposed mechanism. We also explore detailed configurations of VP and the LFHT.
First, figures 5.1 and 5.2 characterize the per-load behavior of each benchmark, which helps in analyzing the data.
Note: load instructions with the same program counter are instances of the same static instruction.
5.1 VP: Cycles Wasted Waiting for the EA
Figure 5.1 shows how many cycles each load wastes waiting for its effective address when VP is used. Compared with figure 3.1 in Section 3.1, 4.5 cycles per load instruction are saved by using VP.
[Bar chart: one bar per benchmark (bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr); y-axis 0-6 cycles.]
Figure 5.1 Cycles per load instruction spent waiting on its effective address when using VP.
5.2 LFHT
The average LFHT hit rates for the integer benchmarks are 68% with 128 entries, 85% with 512, 94% with 2048, and 99% with 8192; for the floating-point benchmarks they are 80% with 128, 99% with 512, 97% with 2048, and 99% with 8192.
Figure 5.2 shows how many cycles each load wastes waiting for its effective address when using the LFHT; each load instruction spends 3.43 cycles on average.
Figure 5.3 shows the cycles per load spent waiting for its effective address after combining both VP and the LFHT; each load instruction now spends 1.91 cycles on average.
The remaining 1.91 cycles are due to two factors: (1) the effective address calculation needs at least one cycle, and (2) about 40% of loads with unavailable operands cannot be predicted, because their VP state fields are init or transient.
5.3 IPC & Speedup
Figure 5.4 shows IPC for the integer benchmarks, and figure 5.5 shows IPC for the floating-point benchmarks.
The alias bit in the LFHT is sticky: it stays set after the first conflicting store is detected. A load instruction whose alias bit is set always waits until all prior store addresses have been calculated before issuing, whether or not a store alias actually recurs. Over a long run this can suppress load speculation unnecessarily, as false data dependences accumulate.
To prevent the LFHT from becoming too conservative and creating false data dependences, all alias bits in the LFHT are cleared at a regular interval. In this paper, we model a clearing interval of 50,000 cycles, called Cycle Clear (CC).
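Extending the LFHT sketch of Section 3.3, Cycle Clear amounts to the following; where the cycle counter lives is an implementation detail we leave open.

    /* Cycle Clear: every CC_INTERVAL cycles, clear every alias bit so that
     * loads whose store conflicts were transient may speculate again. */
    #define CC_INTERVAL 50000

    void lfht_cycle_clear_tick(uint64_t cycle)
    {
        if (cycle % CC_INTERVAL == 0)
            for (unsigned i = 0; i < LFHT_ENTRIES; i++)
                lfht[i].alias = false;  /* tags stay; only stickiness resets */
    }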
Figure 5.6 shows the integer benchmarks' speedup over the baseline: the average speedup is 14.5% without Cycle Clear and 16.1% with it. Figure 5.7 shows the floating-point benchmarks' speedup over the baseline: the average speedup is 5% both with and without Cycle Clear.
Speedup is calculated as (new scheme's IPC - baseline's IPC) / baseline's IPC.
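For example, if the baseline IPC were 2.00 and a scheme achieved 2.29 (illustrative numbers, not measured values), the speedup would be (2.29 - 2.00) / 2.00 = 14.5%.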
In our simulation, we also model a perfect scheme for both VP and the LFHT, to check the maximum performance that can be gained from this approach.
With perfect VP, all load instructions are predicted perfectly: when a load is dispatched to the RUU, its actual operand value can be used whether or not the operand is available, as if the predictor were 100% accurate.
With the LFHT, the perfect method means a load instruction waits only for store instructions with the same effective address, so no time is spent unnecessarily waiting for other stores' effective addresses.
The average speedup with perfect prediction is 20.5% for the integer benchmarks and 7.5% for the floating-point benchmarks. Our realistic results, 16.1% for integer and 5% for floating-point, come close to this perfect case.
The main reason vortex performs worse than the baseline is its low value prediction accuracy, which costs extra cycles to squash instructions and re-fetch them into the instruction window.
[Bar chart: one bar per benchmark (bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr); y-axis 0-6 cycles.]
Figure 5.2 Cycles per load instruction spent waiting on its effective address when using the LFHT.
[Bar chart: one bar per benchmark (bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr); y-axis 0-3.5 cycles.]
Figure 5.3 Cycles per load spent waiting on its effective address after combining both VP and the LFHT.
[Grouped bar chart titled "IPC (integer)": one group per benchmark (bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr); y-axis 0-7; series: baseline, VP(2K), VP+LFHT with CC, perfect.]
Figure 5.4 IPC (integer benchmarks)
[Grouped bar chart titled "IPC (floating-point)": one group per benchmark (ammp, applu, art, equake, galgel, lucas, mesa, mgrid, swim); y-axis 0-6; series: baseline, VP, VP+LFHT with CC, perfect.]
Figure 5.5 IPC (floating-point benchmarks)
[Grouped bar chart titled "speedup (integer)": bars over the integer benchmarks (axis labels include bzip2, gap, gzip, parser, vortex); y-axis -0.1 to 0.6; series: VP, VP+LFHT with CC, perfect.]
Figure 5.6 Speedup over baseline (integer benchmarks).
[Grouped bar chart titled "speedup (floating-point)": one group per benchmark (ammp, applu, art, equake, galgel, lucas, mesa, mgrid, swim); y-axis -0.02 to 0.2; series: VP, VP+LFHT with CC, perfect.]
Figure 5.7 Speedup over baseline (floating-point benchmarks).
6. Conclusions
In this paper we present a combined mechanism for improving the load instruction issue policy in modern superscalar processors. Conventionally, load instructions are issued only once it is certain that no dependences exist, which reduces instruction-level parallelism. We proposed a scheme that combines two mechanisms, value prediction (VP) and the load forwarding history table (LFHT), to execute load instructions speculatively.
All VP and LFHT information is established and updated at run time from load instruction history and load-forwarding behavior, and it provides the memory disambiguation information needed for speculative load issue at issue time. In this study we have not only examined the load instruction issue policy but also revisited memory dependence and disambiguation from two aspects: first, we studied the characteristics of load instructions and used that information for memory dependence and disambiguation; second, we proposed a combined scheme that exploits these observations to improve instruction-level parallelism.
We evaluated the performance of our proposed architecture with SimpleScalar. VP alone provides an average speedup of 8.5% over the baseline simulation architecture. With VP and the LFHT, the speedup is 14.5% over the baseline. With the LFHT's Cycle Clear, a 16.1% speedup over the baseline is achieved.
References
[1] M. Johnson, "Superscalar Microprocessor Design," Prentice Hall, Englewood Cliffs, 1991.
[2] M. Franklin and G. S. Sohi, "ARB: A Hardware Mechanism for Dynamic Reordering of Memory References," IEEE Transactions on Computers, May 1996.
[3] D. C. Burger and T. M. Austin, "The SimpleScalar Tool Set, Version 2.0," Technical Report CS-TR-97-1342, University of Wisconsin, Madison, June 1997.
[4] K. Wang and M. Franklin, "Highly Accurate Data Value Prediction Using Hybrid Predictors," in Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture, pages 281-290, Dec. 1997.
[5] G. Z. Chrysos and J. S. Emer, "Memory Dependence Prediction Using Store Sets," in Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 142-153, June 1998.
[6] G. Reinman and B. Calder, "Predictive Techniques for Aggressive Load Speculation," in Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, pages 127-137, Nov. 1998.
[7] A. Yoaz, M. Erez, R. Ronen, and S. Jourdan, "Speculation Techniques for Improving Load Related Instruction Scheduling," in Proceedings of the 26th International Symposium on Computer Architecture, May 1999.
[8] G. Reinman and B. Calder, "A Comparative Survey of Load Speculation Architectures," Journal of Instruction-Level Parallelism, May 2000.
[9] A. Moshovos and G. S. Sohi, "Reducing Memory Latency via Read-after-Read Memory Dependence Prediction," IEEE Transactions on Computers, pages 313-326, March 2002.
[10] S. Onder, "Cost Effective Memory Dependence Prediction Using Speculation Levels and Color Sets," in Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, pages 232-241, Sept. 2002.
[11] H. Zhou, J. Flanagan, and T. M. Conte, "Detecting Global Stride Locality in Value Streams," in Proceedings of the 30th Annual International Symposium on Computer Architecture, pages 324-335, June 2003.
[12] S.-R. Chen, "Memory Disambiguation Using Load Forwarding," Master's Thesis, Department of Computer Science and Engineering, Tatung University, July 2004.
[13] C.-C. Lin, "Load Speculation," Master's Thesis, Department of Computer Science and Engineering, Tatung University, July 2005.
[14] J. P. Shen and M. H. Lipasti, "Modern Processor Design: Fundamentals of Superscalar Processors," McGraw-Hill, 2005.
More Related Content

What's hot

MOOC backbone using Netty and Protobuf
MOOC backbone using Netty and ProtobufMOOC backbone using Netty and Protobuf
MOOC backbone using Netty and ProtobufGaurav Bhardwaj
 
Pinterest like site using REST and Bottle
Pinterest like site using REST and Bottle Pinterest like site using REST and Bottle
Pinterest like site using REST and Bottle Gaurav Bhardwaj
 
IRJET- Implementation of Mesi Protocol using Verilog
IRJET- Implementation of Mesi Protocol using VerilogIRJET- Implementation of Mesi Protocol using Verilog
IRJET- Implementation of Mesi Protocol using VerilogIRJET Journal
 
Comparative study on Cache Coherence Protocols
Comparative study on Cache Coherence ProtocolsComparative study on Cache Coherence Protocols
Comparative study on Cache Coherence Protocolsiosrjce
 
Summary of Simultaneous Multithreading: Maximizing On-Chip Parallelism
Summary of Simultaneous Multithreading: Maximizing On-Chip ParallelismSummary of Simultaneous Multithreading: Maximizing On-Chip Parallelism
Summary of Simultaneous Multithreading: Maximizing On-Chip ParallelismFarwa Ansari
 
Cache coherency controller for MESI protocol based on FPGA
Cache coherency controller for MESI protocol based on FPGA Cache coherency controller for MESI protocol based on FPGA
Cache coherency controller for MESI protocol based on FPGA IJECEIAES
 
Dynamic MPLS with Feedback
Dynamic MPLS with FeedbackDynamic MPLS with Feedback
Dynamic MPLS with FeedbackIJCSEA Journal
 
Process Synchronization
Process SynchronizationProcess Synchronization
Process SynchronizationShipra Swati
 
Virtual Machine Maanager
Virtual Machine MaanagerVirtual Machine Maanager
Virtual Machine MaanagerGaurav Bhardwaj
 

What's hot (17)

MOOC backbone using Netty and Protobuf
MOOC backbone using Netty and ProtobufMOOC backbone using Netty and Protobuf
MOOC backbone using Netty and Protobuf
 
Pinterest like site using REST and Bottle
Pinterest like site using REST and Bottle Pinterest like site using REST and Bottle
Pinterest like site using REST and Bottle
 
shashank_hpca1995_00386533
shashank_hpca1995_00386533shashank_hpca1995_00386533
shashank_hpca1995_00386533
 
StateKeeper Report
StateKeeper ReportStateKeeper Report
StateKeeper Report
 
IRJET- Implementation of Mesi Protocol using Verilog
IRJET- Implementation of Mesi Protocol using VerilogIRJET- Implementation of Mesi Protocol using Verilog
IRJET- Implementation of Mesi Protocol using Verilog
 
Comparative study on Cache Coherence Protocols
Comparative study on Cache Coherence ProtocolsComparative study on Cache Coherence Protocols
Comparative study on Cache Coherence Protocols
 
Summary of Simultaneous Multithreading: Maximizing On-Chip Parallelism
Summary of Simultaneous Multithreading: Maximizing On-Chip ParallelismSummary of Simultaneous Multithreading: Maximizing On-Chip Parallelism
Summary of Simultaneous Multithreading: Maximizing On-Chip Parallelism
 
Compiler design
Compiler designCompiler design
Compiler design
 
Cache coherency controller for MESI protocol based on FPGA
Cache coherency controller for MESI protocol based on FPGA Cache coherency controller for MESI protocol based on FPGA
Cache coherency controller for MESI protocol based on FPGA
 
Bus Based Multiprocessors v2
Bus Based Multiprocessors v2Bus Based Multiprocessors v2
Bus Based Multiprocessors v2
 
Dynamic MPLS with Feedback
Dynamic MPLS with FeedbackDynamic MPLS with Feedback
Dynamic MPLS with Feedback
 
Process Synchronization
Process SynchronizationProcess Synchronization
Process Synchronization
 
PhaseII_1
PhaseII_1PhaseII_1
PhaseII_1
 
Virtual Machine Maanager
Virtual Machine MaanagerVirtual Machine Maanager
Virtual Machine Maanager
 
BigDataDebugging
BigDataDebuggingBigDataDebugging
BigDataDebugging
 
Bt0070
Bt0070Bt0070
Bt0070
 
S peculative multi
S peculative multiS peculative multi
S peculative multi
 

Viewers also liked

Richard sinnott.docx july 2014
Richard sinnott.docx july 2014Richard sinnott.docx july 2014
Richard sinnott.docx july 2014Rich Sinnott
 
Publication Can safe cover
Publication Can safe cover Publication Can safe cover
Publication Can safe cover Nebojsa Maric
 
A journey in the public clouds
A journey in the public cloudsA journey in the public clouds
A journey in the public cloudsAlexis Lê-Quôc
 
1.ความรู้เบื้องต้นเกี่ยวกับinternet1
1.ความรู้เบื้องต้นเกี่ยวกับinternet11.ความรู้เบื้องต้นเกี่ยวกับinternet1
1.ความรู้เบื้องต้นเกี่ยวกับinternet1Mevenwen Singollo
 
A Novel Approach for Detection of Routes with Misbehaving Nodes in MANETs
A Novel Approach for Detection of Routes with Misbehaving Nodes in MANETsA Novel Approach for Detection of Routes with Misbehaving Nodes in MANETs
A Novel Approach for Detection of Routes with Misbehaving Nodes in MANETsIDES Editor
 
Institutions
InstitutionsInstitutions
InstitutionsBela17
 
Conversión entre binarios, octal y hexadecimal
Conversión entre binarios, octal y hexadecimalConversión entre binarios, octal y hexadecimal
Conversión entre binarios, octal y hexadecimalLiliana Avila
 
MinorHotels_Brand Presentation_July2016
MinorHotels_Brand Presentation_July2016MinorHotels_Brand Presentation_July2016
MinorHotels_Brand Presentation_July2016Luis Coelho
 

Viewers also liked (14)

Richard sinnott.docx july 2014
Richard sinnott.docx july 2014Richard sinnott.docx july 2014
Richard sinnott.docx july 2014
 
Publication Can safe cover
Publication Can safe cover Publication Can safe cover
Publication Can safe cover
 
A journey in the public clouds
A journey in the public cloudsA journey in the public clouds
A journey in the public clouds
 
1.ความรู้เบื้องต้นเกี่ยวกับinternet1
1.ความรู้เบื้องต้นเกี่ยวกับinternet11.ความรู้เบื้องต้นเกี่ยวกับinternet1
1.ความรู้เบื้องต้นเกี่ยวกับinternet1
 
Urkontinentalak
UrkontinentalakUrkontinentalak
Urkontinentalak
 
Guia Zoomorficas
Guia ZoomorficasGuia Zoomorficas
Guia Zoomorficas
 
Andres pastrana
Andres pastranaAndres pastrana
Andres pastrana
 
A Novel Approach for Detection of Routes with Misbehaving Nodes in MANETs
A Novel Approach for Detection of Routes with Misbehaving Nodes in MANETsA Novel Approach for Detection of Routes with Misbehaving Nodes in MANETs
A Novel Approach for Detection of Routes with Misbehaving Nodes in MANETs
 
Institutions
InstitutionsInstitutions
Institutions
 
Conversión entre binarios, octal y hexadecimal
Conversión entre binarios, octal y hexadecimalConversión entre binarios, octal y hexadecimal
Conversión entre binarios, octal y hexadecimal
 
The Conversation Prism : vision prospective des réseaux sociaux
The Conversation Prism : vision prospective des réseaux sociauxThe Conversation Prism : vision prospective des réseaux sociaux
The Conversation Prism : vision prospective des réseaux sociaux
 
MinorHotels_Brand Presentation_July2016
MinorHotels_Brand Presentation_July2016MinorHotels_Brand Presentation_July2016
MinorHotels_Brand Presentation_July2016
 
Vision prospective : À quoi ressemblera le magasin de demain ?
Vision prospective : À quoi ressemblera le magasin de demain ?Vision prospective : À quoi ressemblera le magasin de demain ?
Vision prospective : À quoi ressemblera le magasin de demain ?
 
C chap1
C chap1C chap1
C chap1
 

Similar to shieh06a

Code scheduling constraints
Code scheduling constraintsCode scheduling constraints
Code scheduling constraintsArchanaMani2
 
Load balancing in Distributed Systems
Load balancing in Distributed SystemsLoad balancing in Distributed Systems
Load balancing in Distributed SystemsRicha Singh
 
Enhanced equally distributed load balancing algorithm for cloud computing
Enhanced equally distributed load balancing algorithm for cloud computingEnhanced equally distributed load balancing algorithm for cloud computing
Enhanced equally distributed load balancing algorithm for cloud computingeSAT Publishing House
 
Enhanced equally distributed load balancing algorithm for cloud computing
Enhanced equally distributed load balancing algorithm for cloud computingEnhanced equally distributed load balancing algorithm for cloud computing
Enhanced equally distributed load balancing algorithm for cloud computingeSAT Journals
 
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES
LATENCY-AWARE WRITE BUFFER RESOURCE  CONTROL IN MULTITHREADED CORESLATENCY-AWARE WRITE BUFFER RESOURCE  CONTROL IN MULTITHREADED CORES
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORESijmvsc
 
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORESLATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORESijdpsjournal
 
Iaetsd appliances of harmonizing model in cloud
Iaetsd appliances of harmonizing model in cloudIaetsd appliances of harmonizing model in cloud
Iaetsd appliances of harmonizing model in cloudIaetsd Iaetsd
 
Modified Active Monitoring Load Balancing with Cloud Computing
Modified Active Monitoring Load Balancing with Cloud ComputingModified Active Monitoring Load Balancing with Cloud Computing
Modified Active Monitoring Load Balancing with Cloud Computingijsrd.com
 
Continental division of load and balanced ant
Continental division of load and balanced antContinental division of load and balanced ant
Continental division of load and balanced antIJCI JOURNAL
 
A method for balancing heterogeneous request load in dht based p2 p
A method for balancing heterogeneous request load in dht based p2 pA method for balancing heterogeneous request load in dht based p2 p
A method for balancing heterogeneous request load in dht based p2 pIAEME Publication
 
A method for balancing heterogeneous request load in dht based p2 p
A method for balancing heterogeneous request load in dht based p2 pA method for balancing heterogeneous request load in dht based p2 p
A method for balancing heterogeneous request load in dht based p2 pIAEME Publication
 
Scalable Distributed Job Processing with Dynamic Load Balancing
Scalable Distributed Job Processing with Dynamic Load BalancingScalable Distributed Job Processing with Dynamic Load Balancing
Scalable Distributed Job Processing with Dynamic Load Balancingijdpsjournal
 
A load balancing algorithm based on
A load balancing algorithm based onA load balancing algorithm based on
A load balancing algorithm based onijp2p
 
Distributed System Management
Distributed System ManagementDistributed System Management
Distributed System ManagementIbrahim Amer
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORScscpconf
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processorscsandit
 

Similar to shieh06a (20)

Final report
Final reportFinal report
Final report
 
Code scheduling constraints
Code scheduling constraintsCode scheduling constraints
Code scheduling constraints
 
1844 1849
1844 18491844 1849
1844 1849
 
compiler design
compiler designcompiler design
compiler design
 
Load balancing in Distributed Systems
Load balancing in Distributed SystemsLoad balancing in Distributed Systems
Load balancing in Distributed Systems
 
Enhanced equally distributed load balancing algorithm for cloud computing
Enhanced equally distributed load balancing algorithm for cloud computingEnhanced equally distributed load balancing algorithm for cloud computing
Enhanced equally distributed load balancing algorithm for cloud computing
 
Enhanced equally distributed load balancing algorithm for cloud computing
Enhanced equally distributed load balancing algorithm for cloud computingEnhanced equally distributed load balancing algorithm for cloud computing
Enhanced equally distributed load balancing algorithm for cloud computing
 
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES
LATENCY-AWARE WRITE BUFFER RESOURCE  CONTROL IN MULTITHREADED CORESLATENCY-AWARE WRITE BUFFER RESOURCE  CONTROL IN MULTITHREADED CORES
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES
 
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORESLATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES
 
Iaetsd appliances of harmonizing model in cloud
Iaetsd appliances of harmonizing model in cloudIaetsd appliances of harmonizing model in cloud
Iaetsd appliances of harmonizing model in cloud
 
Modified Active Monitoring Load Balancing with Cloud Computing
Modified Active Monitoring Load Balancing with Cloud ComputingModified Active Monitoring Load Balancing with Cloud Computing
Modified Active Monitoring Load Balancing with Cloud Computing
 
Minimize Staleness and Stretch in Streaming Data Warehouses
Minimize Staleness and Stretch in Streaming Data WarehousesMinimize Staleness and Stretch in Streaming Data Warehouses
Minimize Staleness and Stretch in Streaming Data Warehouses
 
Continental division of load and balanced ant
Continental division of load and balanced antContinental division of load and balanced ant
Continental division of load and balanced ant
 
A method for balancing heterogeneous request load in dht based p2 p
A method for balancing heterogeneous request load in dht based p2 pA method for balancing heterogeneous request load in dht based p2 p
A method for balancing heterogeneous request load in dht based p2 p
 
A method for balancing heterogeneous request load in dht based p2 p
A method for balancing heterogeneous request load in dht based p2 pA method for balancing heterogeneous request load in dht based p2 p
A method for balancing heterogeneous request load in dht based p2 p
 
Scalable Distributed Job Processing with Dynamic Load Balancing
Scalable Distributed Job Processing with Dynamic Load BalancingScalable Distributed Job Processing with Dynamic Load Balancing
Scalable Distributed Job Processing with Dynamic Load Balancing
 
A load balancing algorithm based on
A load balancing algorithm based onA load balancing algorithm based on
A load balancing algorithm based on
 
Distributed System Management
Distributed System ManagementDistributed System Management
Distributed System Management
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processors
 

shieh06a

  • 1. Load Speculation Jong-Jiann Shieh Department of Computer Science and Engineering Tatung University shieh@ttu.edu.tw and Cheng-Chun Lin Night.cola@msa.hinet.net and Shin-Rung Chen apsras@amigo.cse.ttu.edu.tw Abstract The superscalar processor must issues instructions as early as possible to enhance the performance. But load instructions would be issued with register dependencies are solved and memory dependencies are known. Register dependence makes load instruction must wait until prior instruction with same destination register is completed. Memory dependence results in load instruction cannot be issued before the ambiguities are resolved. Therefore load instructions only could be issued when no register dependencies exist and all prior stores’ effective addresses calculated. This paper combines two mechanisms: value prediction (VP) and load forwarding history table (LFHT) to speculatively execute load instructions. Our study shows that by doing so there is about 15% average speedup up over baseline architecture. Keyword: load speculation, register dependence, memory dependence, value prediction, load forwarding 1. Introduction Modern superscalar processors allow instructions to execute out of program order to find more instruction level parallelism (ILP). These processors must monitor data dependencies to maintain correct program behavior. There are two types of data dependencies, register dependence and memory dependence. Register dependence is detected in the instruction decode stage by examining instructions’ register operand fields. If there is an instruction which load instruction depends on, the load instruction must wait until prior instruction completed, then the value of operand can be used. The lack of information about memory dependence at instruction decode time is a problem for an out-of-order instruction scheduler. If the scheduler executes a load before a prior store that writes to the same memory location, the load will read the wrong value. In this event the load and all subsequent dependent instructions must be re-executed, resulting in a huge performance penalty. To avoid these memory order violations, the instruction scheduler should be conservative to prevent loads from executing until all prior stores have executed. This approach decreases performance because loads in majority cases will be made falsely dependent on no alias stores as data on section 3 shown. In this paper, we use a simple value predictor to predict the operand value to avoid register dependence and propose a structure called Load Forwarding History Table (LFHT) to exploit memory dependence speculation at run time. As we combine these two mechanisms, the predictor can help LFHT making more load instructions execute without waiting for the prior stores’ effective addresses calculated, this result in more load instructions will be issued earlier. When a load instruction is speculatively executed, instructions that are dependent upon the load instruction will also be speculatively executed. The organization of the rest of this paper is as follows. Section 2 surveys previously proposed related works. Section 3 illustrates whole structure in superscalar processor. Section 4 describes our CPU model and simulation environment. The performance is evaluated in section 5. Finally, the conclusion of this paper is presented in section 6. 2. 
Related Works The traditional works on memory disambiguation were done in the context of compiler and hardware mechanisms for non-speculative disambiguation to ensure program correctness. Franklin and Sohi [2] proposed the address resolution buffer (ARB). The ARB indicates memory references into bins according to their address. The bins are used to cause a temporal order between references to the same address. The ARB is a structure based on bank. Multiple disambiguation requests can be dispatched in one cycle, provided that they are all to different banks. Chrysos and Emer used predictor to solve memory disambiguation problem in [5]. The goal of the designers is to be able to schedule load instructions as soon as possible without causing any memory order violations. The predictor proposed is
  • 2. based on store-sets. A store set for a specific load is the set of all stores upon which the load has ever depended. The processor adds a store to the store set of the load if a memory order violation is caused when the load executes before that store. In the next instance of the load instruction, the store set is accessed to determine which stores the load will need to wait for before executing. A. Yoaz., M. Erez., R. Ronen. and S. Jourdan designed a CHT predictor [7]. The CHT predictor provides a prediction about whether a load instruction will conflict with any store in the instruction window. Allocating a new entry only when a load collides for the first time and invalidating its entry when its state changes to non-colliding. It does not predict which store instruction the load will conflict with. Therefore, it is easier to design but it does not provide the best possible information for disambiguation purposes. Color set [10] presents a simple mechanism which incorporates multiple speculation levels within the processor and classifies the load and the store instructions at run time to the appropriate speculation level. Each speculation level is termed as a color and the sets of load and store instructions are called color sets. These colors divide the load instructions into distinct sets, starting with the base color which corresponds to the no violation case. In other words, this set is the set of load instructions which have never collided with unready store instructions in the past. Each color in the spectrum represents increasing levels of aggressiveness in load speculation; a load instruction is allowed to issue only if its color is less than or equal to the current speculation level. If the processor later discovers that the load has collided with a store, the color assigned to the load instruction in the predictor is increased. 3. VALUE PREDICTION AND LFHT 3.1 Issuing a Load When executing a load or store instruction, the instruction is split into two micro instructions inside the processor [1]. One instruction calculates the effective address, and the other instruction performs the memory access once the effective address calculated and any potential store alias dependencies resolved. In the baseline architecture, each store and load instruction must wait until its effective address calculation completes. In addition, all stores are issued in-order with respect to prior stores, and each load must wait on the most recent store before it can be speculatively issued. There are three cases that a load instruction always spends cycles on, (1) waiting on its effective address calculation (ea), (2) waiting for prior store addresses to be calculated (dep), and (3) the latency for fetching the data (mem). This paper focus on (1) and (2). We use data prediction to solve problem (1), and use LFHT to solve problem (2). Figure 3.1 shows how many cycles per load instruction waiting on its effective address [13]. As the figure shows that each load instruction must wait 7 cycles in average so that it can get its effective address, this make a lot of wasting. 0 5 10 15 20 25 30 bzip2 crafty gap gcc gzip m cf parser twolf vortex vpr Figure 3.1 Cycles per load instruction spend on waiting its effective address in baseline architecture. In the conventional disambiguation memory dependence mechanism [14], load-forwarding behavior can detect store alias and forward store data. Figure 3.2 shows percentages of load that can take advantage of load-forwarding behavior [12]. 
Most load instructions will not forward store data and conflict with prior store on the baseline simulation architecture (describe in section 4), the average amount of forwarding load is 12.7% and the lowest amount of forwarding load is only 2.7%. It means that most load instructions are unnecessarily pending for disambiguating memory dependence. 0.0% 10.0% 20.0% 30.0% 40.0% 50.0% bzip2 crafty gap gzip m cf parser tw olf Figure 3.2 Percent of forwarding load instructions 3.2 Value Prediction All loads have to wait until their effective address is calculated before they can be issued. If the load is on the critical path, and the address can be accurately predicted, then it can be beneficial to speculate the value of the address and load the data as soon as possible. A load instruction is effectively split into two instructions inside the processor, one instruction calculates the effective address. In order to predict this instruction, we predict instruction’s operand so that we don’t have to wait prior instruction which this instruction depend on. Since we predict instruction’s operand, there are some instructions that don’t have their register dependence with prior instruction, these instructions don’t need to predict operand because they already have exactly register value. In our
  • 3. simulation, we only predict instructions with register dependence. But all load instructions must update the predictor, so that we can maintain predictor’s accurate rate. Data prediction helps speedup the effective address calculation for a load. The load then has only to wait on potential store aliases before issuing. But if the operand was incorrectly predicted, a recovery mechanism will take place when the actual operand is available. Value prediction has been studied for a long time and many schemes have been proposed [4, 6, 11]. In this paper we use the simplest scheme, stride predictor, to predict the operand of a load instruction. 3.3 Load Forwarding History Table When a load is issued, it performs a lookup in the store buffer for a non-committed aliased store and it performs its data cache access in parallel. If a store alias is found, store data forward to load and the load has a shorter latency. If there is no store alias, and there is a data cache hit, the load has a longer latency because of the pipelined data cache. If there is a miss in the data cache, the miss will only be processed if no alias is found in the store buffer, load-forwarding behavior can detect store alias and forward store data. This way, load instructions can be issued out of order without waiting for prior stores executed [8, 9, 14]. Conventional disambiguation memory dependence mechanism unable to provide information for load instruction in the decode stage. For that reason, in order to exploit load-forwarding behavior and bring about all of these benefits, a mechanism is proposed: the load forwarding history table (LFHT). The LFHT records the result produced when the load instruction was executed for load forwarding behavior of the last time, and determines whether or not to out of order issue the load when the load instruction is encountered in the future. Each LFHT entry contains two fields: the tag field and alias bit field. The LFHT is considered as a direct mapped cache, indexed by the PC. The alias bit field is a sticky bit, the load instruction is always treated as alias and waits until all prior store addresses have been calculated before issuing, after the load instruction encounter the first load forwarding behavior indicate conflict with store at the execution time. LFHT will be established or updated according to load forwarding behavior of a load instruction at run time. If LFHT miss, part of load instruction’s PC is written to the corresponding entry as tag and the alias bit will be set or clear depend on the load forwarding behavior. If LFHT hit, alias bit will also be set or clear depend on the load forwarding behavior. But if alias bit is in set state and load forwarding behavior indicate no conflict with prior store, alias bit is still kept in set state. If LFHT hit, and alias bit is in clear state, the load instruction will be speculatively executed. The validation/invalidation of speculative load instruction is performed when each prior store address has been calculated. Each time a store address is calculates, all the executed speculative loads that occur after store in the instruction window have their addresses checked for an alias. If an alias is found, recovery action is taken for the load, and the load must be re-issued; corresponding alias bit changed into set state to avoid incorrect speculative load execution in the future. 3.4 Combine VP and LFHT We used LFHT introduced in section 3.3 to combine with value predictor discussed in section 3.2. 
If only VP is used, although some load instructions can get their operand faster, but these load instructions still must wait prior store instruction’s effective address calculated to ensure that there are no memory dependence. But if only LFHT is used, although we have overcome memory disambiguation problem, but there still exist register dependence, it means, some load instructions’ operand isn’t available to use. So that these load instructions must wait operand ready to issue to the function unit. So we combine these two mechanisms as shown in figure 3.3 to solve both memory dependence and register dependence. Instr. Fetch Unit Decode Unit Register Update Unit Load/Store Function Unit Function Unit Function Unit LFHT Register file The additional data path VP Figure 3.3 Architecture data path with VP and LFHT. 4. Evaluation Methodology 4.1 Machine Model The simulator used in this work is derived from the SimpleScalar 2.0 and 3.0c tool set [3], a suite of functional and timing simulation tools. The instruction set architecture employed is the Alpha instruction set, which is based on the Alpha AXP ISA. Table 1 summarizes some of the parameters used in our baseline architecture. Table 2 shows the architectures we studied in this evaluation. Table 1 Baseline Architecture Configuration Instruction fetch 8 instructions per cycle. Out-of-Order execution mechanism Issue of 8 instructions /cycle, 256 entry RUU(which is the ROB and the IW combined), 128 entry load/store queue. Loads executed only after all preceding store addresses are known. Value bypassed to loads from matching stores ahead in the load/store queue. 2 cycle load forwarding latency.
  • 4. Architected registers 32 interger, hi, lo, 32 floating point. Functional units (FU) 8-integer ALUs, 8 load/store units, 4-FP adders, 1-Integer MULT/DIV, 1-FP MULT/DIV FU latency int alu--1, load/store--1, int mult--3, int div--12, fp adder--2, fp mult--4, fp div--12, fp sqrt--24 L1 Instruction cache 64K bytes, 2-way set assoc., 32 byte block, 4 cycles hit latency. L1 Data cache 64K bytes, 2-way set assoc., 32 byte block, 4 cycles hit latency. Dual ported. L2 unified cache 1024K bytes, 4-way set assoc., 64 byte block, 12 cycles hit latency Memory Memory access latency (first-36, rest-4) cycle. Width of memory bus is 32 bytes. TLB miss 30 cycles Table 2 Architectures we studied Baseline Baseline architecture VP Baseline + VP VP + LFHT Baseline + VP + LFHT VP + LFHT with Cycle Clear Baseline + VP + LFHT with Cycle Clear Perfect VP + Perfect LFHT Baseline + Perfect VP + Perfect LFHT Note: Cycle Clear is a keyword detailed in section 5.3 4.2 Benchmarks To perform our experimental study, we have collected results of the SPEC2000 benchmarks. The programs were compiled with the gcc compiler included in the tool set. Table 4 shows the input data set for each integer benchmark. Table 5 shows the floating-point benchmark. In simulating the benchmarks, we skipped the first billion instructions, and collected statistics on the next five hundred million instructions. Table 4 Input data set for benchmarks SPECint 2000 Input SPECfp 2000 Input bzip2 input.source ammp ammp.in crafty crafty.in applu applu.in gap ref.in art a10.img & gcc 166.i equake inp.in gzip input.graphic galgel galgel.in mcf inp.in licas lucas2.in parser ref.in mesa mesa.in twolf ./twolf/ref mgrid mgrid.in vortex lendian.raw swim swim.in vpr net.in & arch.in 5. Performance Analysis In this section, we will examine the performance improvement gained by using the proposed mechanism. We also explore detail configuration of VP and LFHT. First of all, we show how many load instructions each benchmarks have (as shown in figure 5.1 and 5.2). This can help analyzing data. Note: load instructions with same program counter means these instructions are the same instruction. 5.1 VP: Cycles Waste for Waiting EA Figure 5.1 shows how many cycles per load waste for waiting its effective address when using VP. Compare to figure 3.1 in section 3.1, 4.5 cycles per load instruction are saved after using VP. 0 1 2 3 4 5 6 bzip2 crafty gap gcc gzip m cf parser twolf vortex vpr Figure 5.1 Cycles per load instructions spend on waiting its effective address after using VP. 5.2 LFHT The average LFHT hit rate of integer benchmarks are 68% for 128, 85% for 512, 94% for 2048, 99% for 8192, of floating-point benchmarks are 80% for 128, 99% for 512, 97% for 2048, 99% for 8192. Figure 5.2 shows how many cycles per load waste for waiting its effective address when using LFHT, each load instruction spends average 3.43 cycles. Figure 5.3 shows how many cycles per load instructions spend on waiting its effective address after we combined both VP and LFHT. Each load instruction now spends average 1.91 cycles.. Because of (1) effective address calculation need at least one cycle, (2) there are 40% of load instructions with unavailable operand can’t predict (because these load instructions’ state field in VP are init or transient), so we still have to wait 1.91 cycles. 5.3 IPC & Speedup Figure 5.4 shows IPC in integer benchmarks. Figure 5.5 shows IPC in floating-point benchmarks. 
5.3 IPC & Speedup

Figure 5.4 shows the IPC of the integer benchmarks, and Figure 5.5 shows the IPC of the floating-point benchmarks.

The alias bit is a sticky bit in the LFHT: it stays set after the first conflicting store is detected. When a load instruction conflicts with a prior store and its alias bit is set, the load always waits until all prior store addresses have been calculated before issuing, whether or not a store alias actually occurs on that execution. Over a long run this can curtail load speculation through false data dependences. To prevent the LFHT from becoming too conservative in this way, all alias bit fields in the LFHT are cleared at a regular interval of cycles; in this paper we modeled a 50,000-cycle clear (CC) for the LFHT (a minimal sketch of this policy follows the figure captions below).

Figure 5.6 shows the integer benchmarks' speedup over the baseline: the average speedup is 14.5% without cycle clear and 16.1% with cycle clear. Figure 5.7 shows the floating-point benchmarks' speedup over the baseline: the average speedup is 5% both with and without cycle clear. Speedup is calculated as (new scheme's IPC - baseline's IPC) / baseline's IPC; for example, an IPC of 1.16 against a baseline IPC of 1.00 would be a 16% speedup.

In our simulation we also model a perfect scheme for both VP and LFHT, to check how much performance can at most be gained from this approach. For VP, we use perfect prediction on all load instructions: when a load instruction is dispatched to the RUU, its actual operand value can be used whether or not the operand is available, because prediction is 100% accurate. For the LFHT, the perfect method means a load instruction waits only for stores with the same effective address, so no time is spent unnecessarily waiting for store effective addresses. The average speedup with perfect prediction is 20.5% for the integer benchmarks and 7.5% for the floating-point benchmarks. Our realistic results, 16.1% for the integer benchmarks and 5% for the floating-point benchmarks, are close to this perfect case. The main reason vortex performs worse than the baseline is its low value-prediction accuracy, which costs extra cycles squashing instructions and re-fetching them into the instruction window.

Figure 5.2 Cycles per load instruction spent waiting for its effective address after using the LFHT (benchmarks bzip2 through vpr; y-axis 0 to 6 cycles).

Figure 5.3 Cycles per load spent waiting for its EA after combining both VP and LFHT (y-axis 0 to 3.5 cycles).

Figure 5.4 IPC (integer benchmarks); bars: baseline, VP(2K), VP+LFHT with CC, perfect.

Figure 5.5 IPC (floating-point benchmarks); bars: baseline, VP, VP+LFHT with CC, perfect.
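To make the Cycle Clear policy concrete, here is a minimal C sketch of the sticky alias bit and its periodic clearing. The 50,000-cycle interval is the one modeled in the paper; the table size, the per-cycle hook, and all identifiers are illustrative assumptions.

    #define LFHT_SIZE      2048
    #define CLEAR_INTERVAL 50000ULL   /* the clearing period modeled above */

    static int alias_bit[LFHT_SIZE];

    /* On a detected store-load conflict the bit is set and, being sticky,
     * stays set: from then on this load waits for all prior store
     * addresses before issuing. */
    void lfht_record_conflict(unsigned index)
    {
        alias_bit[index] = 1;
    }

    /* Cycle Clear: wipe every alias bit at a fixed interval so that a
     * stale conflict cannot keep a load conservative forever. */
    void lfht_tick(unsigned long long cycle)
    {
        if (cycle % CLEAR_INTERVAL == 0)
            for (unsigned i = 0; i < LFHT_SIZE; i++)
                alias_bit[i] = 0;
    }

Clearing trades a brief burst of re-learned conflicts (and possible misspeculations) for the removal of stale false dependences; the 16.1% versus 14.5% integer speedup reported above suggests the trade is worthwhile at this interval.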
Figure 5.6 Speedup over the baseline (integer benchmarks); bars: VP, VP+LFHT with CC, perfect; y-axis -0.1 to 0.6.

Figure 5.7 Speedup over the baseline (floating-point benchmarks); bars: VP, VP+LFHT with CC, perfect; y-axis -0.02 to 0.2.

6. Conclusions

In this paper we present a combined mechanism for improving the load-instruction issue rule in modern superscalar processors. Conventionally, a load instruction issues only once it is certain that no dependences exist, which reduces instruction-level parallelism. We proposed a scheme that combines two mechanisms, value prediction (VP) and the load forwarding history table (LFHT), to execute load instructions speculatively. All VP and LFHT information is established and updated at run time from load-instruction history and load-forwarding behavior, and it supplies the memory disambiguation information needed for speculative load issue at issue time.

Throughout this study we have not only examined the load-instruction issue rule but also revisited memory dependence and disambiguation from two angles: first, we studied the characteristics of load instructions and used that information for memory dependence and disambiguation; second, we proposed a combined scheme that exploits these properties to improve instruction-level parallelism. We evaluated the performance of the proposed architecture with SimpleScalar. VP alone provides an average speedup of 8.5% over the baseline architecture; with VP and LFHT the speedup is 14.5%; and with Cycle Clear in the LFHT a 16.1% speedup over the baseline is achieved.

References

[1] M. Johnson, "Superscalar Microprocessor Design," Prentice Hall, Englewood Cliffs, 1991.
[2] M. Franklin and G. S. Sohi, "ARB: A Hardware Mechanism for Dynamic Reordering of Memory References," IEEE Transactions on Computers, May 1996.
[3] D. C. Burger and T. M. Austin, "The SimpleScalar Tool Set, Version 2.0," Technical Report CS-TR-97-1342, University of Wisconsin-Madison, June 1997.
[4] K. Wang and M. Franklin, "Highly Accurate Data Value Prediction Using Hybrid Predictors," in Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture, pages 281-290, Dec. 1997.
[5] G. Z. Chrysos and J. S. Emer, "Memory Dependence Prediction Using Store Sets," in Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 142-153, June 1998.
[6] G. Reinman and B. Calder, "Predictive Techniques for Aggressive Load Speculation," in Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, pages 127-137, Nov. 1998.
[7] A. Yoaz, M. Erez, R. Ronen, and S. Jourdan, "Speculation Techniques for Improving Load Related Instruction Scheduling," in Proceedings of the 26th International Symposium on Computer Architecture, May 1999.
[8] G. Reinman and B. Calder, "A Comparative Survey of Load Speculation Architectures," Journal of Instruction-Level Parallelism, May 2000.
[9] A. Moshovos and G. S. Sohi, "Reducing Memory Latency via Read-after-Read Memory Dependence Prediction," IEEE Transactions on Computers, pages 313-326, March 2002.
[10] S. Onder, "Cost Effective Memory Dependence Prediction Using Speculation Levels and Color Sets," in Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, pages 232-241, Sept. 2002.
[11] H. Zhou, J. Flanagan, and T. M. Conte, "Detecting Global Stride Locality in Value Streams," in Proceedings of the 30th Annual International Symposium on Computer Architecture, pages 324-335, June 2003.
[12] S.-R. Chen, "Memory Disambiguation Using Load Forwarding," Master's Thesis, Department of Computer Science and Engineering, Tatung University, July 2004.
[13] C.-C. Lin, "Load Speculation," Master's Thesis, Department of Computer Science and Engineering, Tatung University, July 2005.
[14] J. P. Shen and M. H. Lipasti, Modern Processor Design: Fundamentals of Superscalar Processors, McGraw-Hill, 2005.