Load Speculation
Jong-Jiann Shieh
Department of Computer Science and Engineering
Tatung University
shieh@ttu.edu.tw
and
Cheng-Chun Lin
Night.cola@msa.hinet.net
and
Shin-Rung Chen
apsras@amigo.cse.ttu.edu.tw
Abstract
A superscalar processor must issue instructions as early as possible to achieve high performance, but a load instruction can only be issued once its register dependencies are resolved and its memory dependencies are known. A register dependence forces a load to wait until the prior instruction that writes its source register has completed. A memory dependence means a load cannot be issued before its ambiguity with prior stores is resolved. A load can therefore issue only when no register dependencies remain and the effective addresses of all prior stores have been calculated. This paper combines two mechanisms, value prediction (VP) and a load forwarding history table (LFHT), to execute load instructions speculatively. Our study shows that doing so yields about a 15% average speedup over the baseline architecture.
Keywords: load speculation, register dependence, memory dependence, value prediction, load forwarding
1. Introduction
Modern superscalar processors allow instructions to
execute out of program order to find more instruction level
parallelism (ILP). These processors must monitor data
dependencies to maintain correct program behavior. There are
two types of data dependencies, register dependence and
memory dependence.
Register dependence is detected in the instruction decode stage by examining instructions' register operand fields. If a load depends on a prior instruction, the load must wait until that instruction completes before the operand value can be used.
The lack of information about memory dependence at
instruction decode time is a problem for an out-of-order
instruction scheduler. If the scheduler executes a load before a
prior store that writes to the same memory location, the load
will read the wrong value. In this event the load and all
subsequent dependent instructions must be re-executed,
resulting in a huge performance penalty.
To avoid these memory order violations, the instruction scheduler can conservatively prevent loads from executing until all prior stores have executed. This approach hurts performance because, in the majority of cases, loads are made falsely dependent on stores with which they never alias, as the data in Section 3 show.
In this paper, we use a simple value predictor to predict operand values, so that register dependences do not stall loads, and we propose a structure called the Load Forwarding History Table (LFHT) to exploit memory dependence speculation at run time. Combining the two mechanisms, the predictor helps the LFHT let more load instructions execute without waiting for prior stores' effective addresses to be calculated, so more loads are issued earlier. When a load instruction is speculatively executed, the instructions that depend on it are also speculatively executed.
The rest of this paper is organized as follows. Section 2 surveys related work. Section 3 describes the whole structure within a superscalar processor. Section 4 describes our CPU model and simulation environment. Performance is evaluated in Section 5. Finally, Section 6 concludes the paper.
2. Related Work
Traditional work on memory disambiguation was done in the context of compiler and hardware mechanisms for non-speculative disambiguation that ensure program correctness. Franklin and Sohi [2] proposed the address resolution buffer (ARB). The ARB directs memory references into bins according to their address; the bins are used to enforce a temporal order between references to the same address. The ARB is a banked structure: multiple disambiguation requests can be dispatched in one cycle, provided they all target different banks.
Chrysos and Emer used a predictor to solve the memory disambiguation problem [5]. Their goal is to schedule load instructions as soon as possible without causing any memory order violations. The proposed predictor is based on store sets. A store set for a specific load is the set of all stores upon which the load has ever depended. The processor adds a store to a load's store set when a memory order violation is caused by the load executing before that store. On the next instance of the load instruction, its store set is accessed to determine which stores the load must wait for before executing.
A. Yoaz, M. Erez, R. Ronen, and S. Jourdan designed the CHT predictor [7]. The CHT predictor predicts whether a load instruction will conflict with any store in the instruction window, allocating a new entry only when a load collides for the first time and invalidating the entry when its state changes to non-colliding. It does not predict which store instruction the load will conflict with; it is therefore easier to design, but it does not provide the best possible information for disambiguation purposes.
The color set scheme [10] is a simple mechanism that incorporates multiple speculation levels within the processor and classifies load and store instructions at run time into the appropriate speculation level. Each speculation level is termed a color, and the sets of load and store instructions are called color sets. The colors divide the load instructions into distinct sets, starting with the base color, which corresponds to the no-violation case; that is, the set of load instructions that have never collided with unready store instructions in the past. Each color in the spectrum represents an increasing level of aggressiveness in load speculation; a load instruction is allowed to issue only if its color is less than or equal to the current speculation level. If the processor later discovers that the load has collided with a store, the color assigned to the load in the predictor is increased.
3. VALUE PREDICTION AND LFHT
3.1 Issuing a Load
When a load or store instruction executes, it is split into two micro-instructions inside the processor [1]. One calculates the effective address, and the other performs the memory access once the effective address has been calculated and any potential store alias dependencies have been resolved. In the baseline architecture, each store and load instruction must wait until its effective address calculation completes. In addition, all stores are issued in order with respect to prior stores, and each load must wait on the most recent store before it can be speculatively issued.
There are three cases in which a load instruction spends cycles: (1) waiting on its effective address calculation (ea), (2) waiting for prior store addresses to be calculated (dep), and (3) the latency of fetching the data (mem). This paper focuses on (1) and (2): we use data prediction to attack problem (1) and the LFHT to attack problem (2).
Figure 3.1 shows how many cycles each load instruction spends waiting on its effective address [13]. As the figure shows, each load instruction waits 7 cycles on average before it obtains its effective address, which wastes a great deal of time.
[Bar chart: one bar per benchmark (bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr); y-axis 0-30 cycles.]
Figure 3.1 Cycles per load instruction spent waiting on its effective address in the baseline architecture.
In the conventional memory dependence disambiguation mechanism [14], load forwarding can detect a store alias and forward the store's data. Figure 3.2 shows the percentage of loads that can take advantage of load forwarding [12]. On the baseline simulation architecture (described in Section 4), most load instructions neither forward store data nor conflict with a prior store: the average fraction of forwarding loads is 12.7%, and the lowest is only 2.7%. This means that most load instructions are left pending unnecessarily for memory dependence disambiguation.
[Bar chart: one bar per benchmark (bzip2, crafty, gap, gzip, mcf, parser, twolf); y-axis 0%-50%.]
Figure 3.2 Percentage of forwarding load instructions.
3.2 Value Prediction
All loads must wait until their effective address is calculated before they can be issued. If the load is on the critical path and the address can be accurately predicted, it can be beneficial to speculate on the address value and load the data as soon as possible.
A load instruction is effectively split into two instructions inside the processor, one of which calculates the effective address. To speed up this instruction, we predict its operand so that it need not wait for the prior instruction on which it depends.
Since we predict instruction operands, instructions with no register dependence on a prior instruction need no prediction: their register values are already exact. In our simulation, we therefore predict only instructions with a register dependence, but all load instructions update the predictor so that its accuracy is maintained.
Data prediction speeds up the effective address calculation for a load; the load then only has to wait on potential store aliases before issuing. If the operand was incorrectly predicted, a recovery mechanism takes over once the actual operand is available.
Value prediction has been studied for a long time, and many schemes have been proposed [4, 6, 11]. In this paper we use the simplest scheme, a stride predictor, to predict the operand of a load instruction.
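To illustrate the scheme, the following C sketch shows a PC-indexed stride predictor with the init/transient/steady confidence states mentioned in Section 5.2. The table size matches the VP(2K) configuration of figure 5.4, but the field names, indexing, and exact state transitions are our assumptions, not the simulator's code.

    #include <stdint.h>

    enum vp_state { VP_INIT, VP_TRANSIENT, VP_STEADY };

    /* One entry of a PC-indexed stride value predictor (layout assumed). */
    struct vp_entry {
        uint64_t tag;         /* load PC, to detect table aliasing          */
        uint64_t last_value;  /* last operand value observed for this load  */
        int64_t  stride;      /* difference between the last two values     */
        enum vp_state state;  /* predictions are made only in VP_STEADY     */
    };

    #define VP_ENTRIES 2048   /* the VP(2K) configuration of figure 5.4 */
    static struct vp_entry vp_table[VP_ENTRIES];

    static unsigned vp_index(uint64_t pc) { return (unsigned)((pc >> 2) % VP_ENTRIES); }

    /* Returns 1 and writes a predicted operand if confident, else 0. */
    int vp_predict(uint64_t pc, uint64_t *pred)
    {
        struct vp_entry *e = &vp_table[vp_index(pc)];
        if (e->tag != pc || e->state != VP_STEADY)
            return 0;                  /* init or transient: do not predict */
        *pred = e->last_value + (uint64_t)e->stride;
        return 1;
    }

    /* Every load updates the predictor once its actual operand is known. */
    void vp_update(uint64_t pc, uint64_t value)
    {
        struct vp_entry *e = &vp_table[vp_index(pc)];
        if (e->tag != pc) {            /* miss: (re)allocate the entry */
            e->tag = pc;
            e->last_value = value;
            e->stride = 0;
            e->state = VP_INIT;
            return;
        }
        int64_t new_stride = (int64_t)(value - e->last_value);
        if (new_stride == e->stride)
            e->state = VP_STEADY;      /* stride repeated: gain confidence */
        else
            e->state = (e->state == VP_STEADY) ? VP_TRANSIENT : VP_INIT;
        e->stride = new_stride;
        e->last_value = value;
    }

On a misprediction, updating with the actual value demotes the entry out of the steady state, so the load is not predicted again until its stride stabilizes.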
3.3 Load Forwarding History Table
When a load is issued, it looks up the store buffer for a non-committed aliased store and performs its data cache access in parallel. If a store alias is found, the store's data is forwarded to the load, and the load completes with a shorter latency. If there is no store alias and the data cache hits, the load takes longer because of the pipelined data cache. If the data cache misses, the miss is processed only if no alias is found in the store buffer. Load forwarding can thus detect store aliases and forward store data, so load instructions can be issued out of order without waiting for prior stores to execute [8, 9, 14].
The conventional memory dependence disambiguation mechanism cannot provide this information for a load instruction at the decode stage. To exploit load-forwarding behavior and obtain all of these benefits, we therefore propose the load forwarding history table (LFHT). The LFHT records the load-forwarding outcome of a load instruction's most recent execution and, when the same load is encountered again, determines whether to issue it out of order.
Each LFHT entry contains two fields: a tag field and an alias bit. The LFHT is organized as a direct-mapped cache indexed by the PC. The alias bit is sticky: after the first load-forwarding event that indicates a conflict with a store at execution time, the load is always treated as aliased and waits until all prior store addresses have been calculated before issuing.
The LFHT is filled and updated according to each load instruction's load-forwarding behavior at run time. On an LFHT miss, part of the load instruction's PC is written into the corresponding entry as the tag, and the alias bit is set or cleared according to the load-forwarding behavior. On an LFHT hit, the alias bit is likewise set or cleared, except that if the alias bit is already set and the load-forwarding behavior indicates no conflict with a prior store, the alias bit stays set. On an LFHT hit with the alias bit clear, the load instruction is speculatively executed.
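The table's policy can be summarized in the following C sketch. The entry layout, table size (one of the sizes evaluated in Section 5.2), and tag width are illustrative assumptions; only the sticky-alias update rule is taken from the description above.

    #include <stdbool.h>
    #include <stdint.h>

    /* One LFHT entry: a partial-PC tag plus a sticky alias bit. */
    struct lfht_entry {
        uint32_t tag;    /* upper PC bits written on allocation           */
        bool     alias;  /* sticky: set on the first observed store alias */
        bool     valid;
    };

    #define LFHT_ENTRIES 2048   /* one of the sizes evaluated in Sec. 5.2 */
    static struct lfht_entry lfht[LFHT_ENTRIES];

    static unsigned lfht_index(uint64_t pc) { return (unsigned)((pc >> 2) % LFHT_ENTRIES); }
    static uint32_t lfht_tag(uint64_t pc)   { return (uint32_t)(pc >> 13); }

    /* Decode-time query: may this load issue before prior store
     * addresses are known?  Only a hit with a clear alias bit says yes. */
    bool lfht_may_speculate(uint64_t pc)
    {
        struct lfht_entry *e = &lfht[lfht_index(pc)];
        return e->valid && e->tag == lfht_tag(pc) && !e->alias;
    }

    /* Execution-time update with the observed load-forwarding outcome. */
    void lfht_update(uint64_t pc, bool store_alias_seen)
    {
        struct lfht_entry *e = &lfht[lfht_index(pc)];
        if (!e->valid || e->tag != lfht_tag(pc)) {  /* miss: allocate */
            e->valid = true;
            e->tag   = lfht_tag(pc);
            e->alias = store_alias_seen;
            return;
        }
        if (store_alias_seen)  /* hit: sticky bit can be set, never cleared */
            e->alias = true;
    }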
Speculative load instructions are validated or invalidated as each prior store address is calculated. Each time a store address is computed, all executed speculative loads that come after the store in the instruction window have their addresses checked for an alias. If an alias is found, recovery action is taken, the load must be re-issued, and the corresponding alias bit is set to prevent incorrect speculative execution of that load in the future.
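The check can be sketched as follows, assuming a circular load/store queue kept in program order. The queue layout and the recovery hook squash_and_reissue are hypothetical names, not simulator interfaces; lfht_update is the routine from the sketch above.

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified load/store queue entry; the fields are assumptions. */
    struct lsq_entry {
        bool     is_load, executed, speculative;
        uint64_t pc, addr;
        unsigned size;                 /* access size in bytes */
    };

    #define LSQ_SIZE 128               /* 128-entry load/store queue (Table 1) */
    static struct lsq_entry lsq[LSQ_SIZE];   /* index order == program order */

    static bool overlap(uint64_t a, unsigned an, uint64_t b, unsigned bn)
    {
        return a < b + bn && b < a + an;
    }

    extern void squash_and_reissue(struct lsq_entry *ld);        /* hypothetical */
    extern void lfht_update(uint64_t pc, bool store_alias_seen); /* Sec. 3.3    */

    /* Called when the store at index s computes its effective address:
     * every younger, already-executed speculative load is checked. */
    void on_store_address_ready(unsigned s, unsigned tail)
    {
        for (unsigned i = (s + 1) % LSQ_SIZE; i != tail; i = (i + 1) % LSQ_SIZE) {
            struct lsq_entry *ld = &lsq[i];
            if (ld->is_load && ld->speculative && ld->executed &&
                overlap(ld->addr, ld->size, lsq[s].addr, lsq[s].size)) {
                squash_and_reissue(ld);     /* recovery action    */
                lfht_update(ld->pc, true);  /* set the sticky bit */
            }
        }
    }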
3.4 Combining VP and LFHT
We combine the LFHT introduced in Section 3.3 with the value predictor discussed in Section 3.2.
If only VP is used, some load instructions obtain their operands faster, but they must still wait for prior store instructions' effective addresses to be calculated to ensure there is no memory dependence.
If only the LFHT is used, the memory disambiguation problem is overcome, but register dependences remain: some load instructions' operands are not yet available, so those loads must wait for their operands before issuing to a function unit.
We therefore combine the two mechanisms, as shown in figure 3.3, to address both memory dependence and register dependence.
[Block diagram: instruction fetch unit, decode unit, register update unit, load/store function unit and other function units, and register file, with the VP and LFHT attached via an additional data path.]
Figure 3.3 Architecture data path with VP and LFHT.
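Putting the two pieces together, the issue condition for a load can be sketched as below. The flag names are our own; lfht_may_speculate is the query from the Section 3.3 sketch. This is a simplification of the policy, not the simulator's scheduling code.

    #include <stdbool.h>
    #include <stdint.h>

    /* Per-load status flags for the combined decision (names assumed). */
    struct load_status {
        bool     operand_ready;       /* register operand available           */
        bool     vp_confident;        /* stride predictor in its steady state */
        bool     prior_stores_known;  /* all earlier store addresses computed */
        uint64_t pc;
    };

    extern bool lfht_may_speculate(uint64_t pc);   /* Sec. 3.3 sketch */

    /* A load may issue once it can form an effective address (a real or
     * predicted operand) and memory ordering is safe (every prior store
     * address known, or the LFHT predicts no alias). */
    bool load_may_issue(const struct load_status *ld)
    {
        bool ea_possible   = ld->operand_ready || ld->vp_confident;
        bool ordering_safe = ld->prior_stores_known || lfht_may_speculate(ld->pc);
        return ea_possible && ordering_safe;
    }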
4. Evaluation Methodology
4.1 Machine Model
The simulator used in this work is derived from the SimpleScalar 2.0 and 3.0c tool sets [3], a suite of functional and timing simulators. The instruction set architecture employed is the Alpha AXP ISA.
Table 1 summarizes some of the parameters used in our
baseline architecture. Table 2 shows the architectures we
studied in this evaluation.
Table 1 Baseline Architecture Configuration
Instruction fetch: 8 instructions per cycle.
Out-of-order execution mechanism: issue of 8 instructions/cycle; 256-entry RUU (the ROB and instruction window combined); 128-entry load/store queue. Loads execute only after all preceding store addresses are known. Values are bypassed to loads from matching stores ahead in the load/store queue. 2-cycle load forwarding latency.
Architected registers: 32 integer, hi, lo, 32 floating-point.
Functional units (FU): 8 integer ALUs, 8 load/store units, 4 FP adders, 1 integer MULT/DIV, 1 FP MULT/DIV.
FU latency: int ALU 1, load/store 1, int mult 3, int div 12, FP adder 2, FP mult 4, FP div 12, FP sqrt 24.
L1 instruction cache: 64K bytes, 2-way set assoc., 32-byte blocks, 4-cycle hit latency.
L1 data cache: 64K bytes, 2-way set assoc., 32-byte blocks, 4-cycle hit latency; dual ported.
L2 unified cache: 1024K bytes, 4-way set assoc., 64-byte blocks, 12-cycle hit latency.
Memory: access latency (first 36, rest 4) cycles; memory bus width 32 bytes.
TLB miss: 30 cycles.
Table 2 Architectures we studied
Baseline: baseline architecture
VP: baseline + VP
VP + LFHT: baseline + VP + LFHT
VP + LFHT with Cycle Clear: baseline + VP + LFHT with Cycle Clear
Perfect VP + Perfect LFHT: baseline + perfect VP + perfect LFHT
Note: Cycle Clear is detailed in Section 5.3.
4.2 Benchmarks
To perform our experimental study, we collected results for the SPEC2000 benchmarks. The programs were compiled with the gcc compiler included in the tool set. Table 4 shows the input data set for each integer and floating-point benchmark. In simulating the benchmarks, we skipped the first billion instructions and collected statistics over the next five hundred million instructions.
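Skipping and sampling of this kind is typically done with SimpleScalar's fast-forward and instruction-limit options; a representative invocation might look like the line below. The binary name and argument placement are our reconstruction, not taken from the paper.

    sim-outorder -fastfwd 1000000000 -max:inst 500000000 bzip2.alpha input.source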
Table 4 Input data sets for the benchmarks
SPECint 2000 (input):
bzip2: input.source
crafty: crafty.in
gap: ref.in
gcc: 166.i
gzip: input.graphic
mcf: inp.in
parser: ref.in
twolf: ./twolf/ref
vortex: lendian.raw
vpr: net.in & arch.in
SPECfp 2000 (input):
ammp: ammp.in
applu: applu.in
art: a10.img &
equake: inp.in
galgel: galgel.in
lucas: lucas2.in
mesa: mesa.in
mgrid: mgrid.in
swim: swim.in
5. Performance Analysis
In this section, we examine the performance improvement gained by the proposed mechanism. We also explore detailed configurations of VP and the LFHT.
First, figures 5.1 and 5.2 characterize the per-load behavior of each benchmark, which helps in analyzing the data.
Note: load instructions with the same program counter are instances of the same static instruction.
5.1 VP: Cycles Wasted Waiting for the EA
Figure 5.1 shows how many cycles each load wastes waiting for its effective address when VP is used. Compared with figure 3.1 in Section 3.1, 4.5 cycles per load instruction are saved by using VP.
[Bar chart: one bar per benchmark (bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr); y-axis 0-6 cycles.]
Figure 5.1 Cycles per load instruction spent waiting on its effective address when using VP.
5.2 LFHT
The average LFHT hit rates for the integer benchmarks are 68% with 128 entries, 85% with 512, 94% with 2048, and 99% with 8192; for the floating-point benchmarks they are 80% with 128, 99% with 512, 97% with 2048, and 99% with 8192.
Figure 5.2 shows how many cycles each load wastes waiting for its effective address when using the LFHT; each load instruction spends 3.43 cycles on average.
Figure 5.3 shows the cycles per load spent waiting for its effective address after combining both VP and the LFHT; each load instruction now spends 1.91 cycles on average.
The remaining 1.91 cycles are due to two factors: (1) the effective address calculation needs at least one cycle, and (2) about 40% of loads with unavailable operands cannot be predicted, because their VP state fields are init or transient.
5.3 IPC & Speedup
Figure 5.4 shows IPC for the integer benchmarks, and figure 5.5 shows IPC for the floating-point benchmarks.
The alias bit in the LFHT is sticky: it stays set after the first conflicting store is detected. A load instruction whose alias bit is set always waits until all prior store addresses have been calculated before issuing, whether or not a store alias actually recurs. Over a long run this can suppress load speculation unnecessarily, as false data dependences accumulate.
To prevent the LFHT from becoming too conservative and creating false data dependences, all alias bits in the LFHT are cleared at a regular interval. In this paper, we model a clearing interval of 50,000 cycles, called Cycle Clear (CC).
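Extending the LFHT sketch of Section 3.3, Cycle Clear amounts to the following; where the cycle counter lives is an implementation detail we leave open.

    /* Cycle Clear: every CC_INTERVAL cycles, clear every alias bit so that
     * loads whose store conflicts were transient may speculate again. */
    #define CC_INTERVAL 50000

    void lfht_cycle_clear_tick(uint64_t cycle)
    {
        if (cycle % CC_INTERVAL == 0)
            for (unsigned i = 0; i < LFHT_ENTRIES; i++)
                lfht[i].alias = false;  /* tags stay; only stickiness resets */
    }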
Figure 5.6 shows the integer benchmarks' speedup over the baseline: the average speedup is 14.5% without Cycle Clear and 16.1% with it. Figure 5.7 shows the floating-point benchmarks' speedup over the baseline: the average speedup is 5% both with and without Cycle Clear.
Speedup is calculated as (new scheme's IPC - baseline's IPC) / baseline's IPC.
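For example, if the baseline IPC were 2.00 and a scheme achieved 2.29 (illustrative numbers, not measured values), the speedup would be (2.29 - 2.00) / 2.00 = 14.5%.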
In our simulation, we also model a perfect scheme for both VP and the LFHT, to check the maximum performance that can be gained from this approach.
With perfect VP, all load instructions are predicted perfectly: when a load is dispatched to the RUU, its actual operand value can be used whether or not the operand is available, as if the predictor were 100% accurate.
With the LFHT, the perfect method means a load instruction waits only for store instructions with the same effective address, so no time is spent unnecessarily waiting for other stores' effective addresses.
The average speedup with perfect prediction is 20.5% for the integer benchmarks and 7.5% for the floating-point benchmarks. Our realistic results, 16.1% for integer and 5% for floating-point, come close to this perfect case.
The main reason vortex performs worse than the baseline is its low value prediction accuracy, which costs extra cycles to squash instructions and re-fetch them into the instruction window.
[Bar chart: one bar per benchmark (bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr); y-axis 0-6 cycles.]
Figure 5.2 Cycles per load instruction spent waiting on its effective address when using the LFHT.
[Bar chart: one bar per benchmark (bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr); y-axis 0-3.5 cycles.]
Figure 5.3 Cycles per load spent waiting on its effective address after combining both VP and the LFHT.
[Grouped bar chart titled "IPC (integer)": one group per benchmark (bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, vpr); y-axis 0-7; series: baseline, VP(2K), VP+LFHT with CC, perfect.]
Figure 5.4 IPC (integer benchmarks)
[Grouped bar chart titled "IPC (floating-point)": one group per benchmark (ammp, applu, art, equake, galgel, lucas, mesa, mgrid, swim); y-axis 0-6; series: baseline, VP, VP+LFHT with CC, perfect.]
Figure 5.5 IPC (floating-point benchmarks)
[Grouped bar chart titled "speedup (integer)": bars over the integer benchmarks (axis labels include bzip2, gap, gzip, parser, vortex); y-axis -0.1 to 0.6; series: VP, VP+LFHT with CC, perfect.]
Figure 5.6 Speedup over baseline (integer benchmarks).
[Grouped bar chart titled "speedup (floating-point)": one group per benchmark (ammp, applu, art, equake, galgel, lucas, mesa, mgrid, swim); y-axis -0.02 to 0.2; series: VP, VP+LFHT with CC, perfect.]
Figure 5.7 Speedup over baseline (floating-point benchmarks).
6. Conclusions
In this paper we present a combined mechanism for improving the load instruction issue policy in modern superscalar processors. Conventionally, load instructions are issued only once it is certain that no dependences exist, which reduces instruction-level parallelism. We proposed a scheme that combines two mechanisms, value prediction (VP) and the load forwarding history table (LFHT), to execute load instructions speculatively.
All VP and LFHT information is established and updated at run time from load instruction history and load-forwarding behavior, and it provides the memory disambiguation information needed for speculative load issue at issue time. In this study we have not only examined the load instruction issue policy but also revisited memory dependence and disambiguation from two aspects: first, we studied the characteristics of load instructions and used that information for memory dependence and disambiguation; second, we proposed a combined scheme that exploits these observations to improve instruction-level parallelism.
We evaluated the performance of our proposed architecture with SimpleScalar. VP alone provides an average speedup of 8.5% over the baseline simulation architecture. With VP and the LFHT, the speedup is 14.5% over the baseline. With the LFHT's Cycle Clear, a 16.1% speedup over the baseline is achieved.
References
[1] M. Johnson, "Superscalar Microprocessor Design," Prentice Hall, Englewood Cliffs, 1991.
[2] M. Franklin and G. S. Sohi, "ARB: A Hardware Mechanism for Dynamic Reordering of Memory References," IEEE Transactions on Computers, May 1996.
[3] D. C. Burger and T. M. Austin, "The SimpleScalar Tool Set, Version 2.0," Technical Report CS-TR-97-1342, University of Wisconsin, Madison, June 1997.
[4] K. Wang and M. Franklin, "Highly Accurate Data Value Prediction Using Hybrid Predictors," in Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture, pages 281-290, Dec. 1997.
[5] G. Z. Chrysos and J. S. Emer, "Memory Dependence Prediction Using Store Sets," in Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 142-153, June 1998.
[6] G. Reinman and B. Calder, "Predictive Techniques for Aggressive Load Speculation," in Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, pages 127-137, Nov. 1998.
[7] A. Yoaz, M. Erez, R. Ronen, and S. Jourdan, "Speculation Techniques for Improving Load Related Instruction Scheduling," in Proceedings of the 26th International Symposium on Computer Architecture, May 1999.
[8] G. Reinman and B. Calder, "A Comparative Survey of Load Speculation Architectures," Journal of Instruction-Level Parallelism, May 2000.
[9] A. Moshovos and G. S. Sohi, "Reducing Memory Latency via Read-after-Read Memory Dependence Prediction," IEEE Transactions on Computers, pages 313-326, March 2002.
[10] S. Onder, "Cost Effective Memory Dependence Prediction Using Speculation Levels and Color Sets," in Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, pages 232-241, Sept. 2002.
[11] H. Zhou, J. Flanagan, and T. M. Conte, "Detecting Global Stride Locality in Value Streams," in Proceedings of the 30th Annual International Symposium on Computer Architecture, pages 324-335, June 2003.
[12] S.-R. Chen, "Memory Disambiguation Using Load Forwarding," Master's Thesis, Department of Computer Science and Engineering, Tatung University, July 2004.
[13] C.-C. Lin, "Load Speculation," Master's Thesis, Department of Computer Science and Engineering, Tatung University, July 2005.
[14] J. P. Shen and M. H. Lipasti, "Modern Processor Design: Fundamentals of Superscalar Processors," McGraw-Hill, 2005.
More Related Content

What's hot

MOOC backbone using Netty and Protobuf
MOOC backbone using Netty and ProtobufMOOC backbone using Netty and Protobuf
MOOC backbone using Netty and ProtobufGaurav Bhardwaj
 
Pinterest like site using REST and Bottle
Pinterest like site using REST and Bottle Pinterest like site using REST and Bottle
Pinterest like site using REST and Bottle Gaurav Bhardwaj
 
IRJET- Implementation of Mesi Protocol using Verilog
IRJET- Implementation of Mesi Protocol using VerilogIRJET- Implementation of Mesi Protocol using Verilog
IRJET- Implementation of Mesi Protocol using VerilogIRJET Journal
 
Comparative study on Cache Coherence Protocols
Comparative study on Cache Coherence ProtocolsComparative study on Cache Coherence Protocols
Comparative study on Cache Coherence Protocolsiosrjce
 
Summary of Simultaneous Multithreading: Maximizing On-Chip Parallelism
Summary of Simultaneous Multithreading: Maximizing On-Chip ParallelismSummary of Simultaneous Multithreading: Maximizing On-Chip Parallelism
Summary of Simultaneous Multithreading: Maximizing On-Chip ParallelismFarwa Ansari
 
Cache coherency controller for MESI protocol based on FPGA
Cache coherency controller for MESI protocol based on FPGA Cache coherency controller for MESI protocol based on FPGA
Cache coherency controller for MESI protocol based on FPGA IJECEIAES
 
Dynamic MPLS with Feedback
Dynamic MPLS with FeedbackDynamic MPLS with Feedback
Dynamic MPLS with FeedbackIJCSEA Journal
 
Process Synchronization
Process SynchronizationProcess Synchronization
Process SynchronizationShipra Swati
 
Virtual Machine Maanager
Virtual Machine MaanagerVirtual Machine Maanager
Virtual Machine MaanagerGaurav Bhardwaj
 

What's hot (17)

MOOC backbone using Netty and Protobuf
MOOC backbone using Netty and ProtobufMOOC backbone using Netty and Protobuf
MOOC backbone using Netty and Protobuf
 
Pinterest like site using REST and Bottle
Pinterest like site using REST and Bottle Pinterest like site using REST and Bottle
Pinterest like site using REST and Bottle
 
shashank_hpca1995_00386533
shashank_hpca1995_00386533shashank_hpca1995_00386533
shashank_hpca1995_00386533
 
StateKeeper Report
StateKeeper ReportStateKeeper Report
StateKeeper Report
 
IRJET- Implementation of Mesi Protocol using Verilog
IRJET- Implementation of Mesi Protocol using VerilogIRJET- Implementation of Mesi Protocol using Verilog
IRJET- Implementation of Mesi Protocol using Verilog
 
Comparative study on Cache Coherence Protocols
Comparative study on Cache Coherence ProtocolsComparative study on Cache Coherence Protocols
Comparative study on Cache Coherence Protocols
 
Summary of Simultaneous Multithreading: Maximizing On-Chip Parallelism
Summary of Simultaneous Multithreading: Maximizing On-Chip ParallelismSummary of Simultaneous Multithreading: Maximizing On-Chip Parallelism
Summary of Simultaneous Multithreading: Maximizing On-Chip Parallelism
 
Compiler design
Compiler designCompiler design
Compiler design
 
Cache coherency controller for MESI protocol based on FPGA
Cache coherency controller for MESI protocol based on FPGA Cache coherency controller for MESI protocol based on FPGA
Cache coherency controller for MESI protocol based on FPGA
 
Bus Based Multiprocessors v2
Bus Based Multiprocessors v2Bus Based Multiprocessors v2
Bus Based Multiprocessors v2
 
Dynamic MPLS with Feedback
Dynamic MPLS with FeedbackDynamic MPLS with Feedback
Dynamic MPLS with Feedback
 
Process Synchronization
Process SynchronizationProcess Synchronization
Process Synchronization
 
PhaseII_1
PhaseII_1PhaseII_1
PhaseII_1
 
Virtual Machine Maanager
Virtual Machine MaanagerVirtual Machine Maanager
Virtual Machine Maanager
 
BigDataDebugging
BigDataDebuggingBigDataDebugging
BigDataDebugging
 
Bt0070
Bt0070Bt0070
Bt0070
 
S peculative multi
S peculative multiS peculative multi
S peculative multi
 

Viewers also liked

Richard sinnott.docx july 2014
Richard sinnott.docx july 2014Richard sinnott.docx july 2014
Richard sinnott.docx july 2014Rich Sinnott
 
Publication Can safe cover
Publication Can safe cover Publication Can safe cover
Publication Can safe cover Nebojsa Maric
 
A journey in the public clouds
A journey in the public cloudsA journey in the public clouds
A journey in the public cloudsAlexis Lê-Quôc
 
1.ความรู้เบื้องต้นเกี่ยวกับinternet1
1.ความรู้เบื้องต้นเกี่ยวกับinternet11.ความรู้เบื้องต้นเกี่ยวกับinternet1
1.ความรู้เบื้องต้นเกี่ยวกับinternet1Mevenwen Singollo
 
A Novel Approach for Detection of Routes with Misbehaving Nodes in MANETs
A Novel Approach for Detection of Routes with Misbehaving Nodes in MANETsA Novel Approach for Detection of Routes with Misbehaving Nodes in MANETs
A Novel Approach for Detection of Routes with Misbehaving Nodes in MANETsIDES Editor
 
Institutions
InstitutionsInstitutions
InstitutionsBela17
 
Conversión entre binarios, octal y hexadecimal
Conversión entre binarios, octal y hexadecimalConversión entre binarios, octal y hexadecimal
Conversión entre binarios, octal y hexadecimalLiliana Avila
 
MinorHotels_Brand Presentation_July2016
MinorHotels_Brand Presentation_July2016MinorHotels_Brand Presentation_July2016
MinorHotels_Brand Presentation_July2016Luis Coelho
 

Viewers also liked (14)

Richard sinnott.docx july 2014
Richard sinnott.docx july 2014Richard sinnott.docx july 2014
Richard sinnott.docx july 2014
 
Publication Can safe cover
Publication Can safe cover Publication Can safe cover
Publication Can safe cover
 
A journey in the public clouds
A journey in the public cloudsA journey in the public clouds
A journey in the public clouds
 
1.ความรู้เบื้องต้นเกี่ยวกับinternet1
1.ความรู้เบื้องต้นเกี่ยวกับinternet11.ความรู้เบื้องต้นเกี่ยวกับinternet1
1.ความรู้เบื้องต้นเกี่ยวกับinternet1
 
Urkontinentalak
UrkontinentalakUrkontinentalak
Urkontinentalak
 
Guia Zoomorficas
Guia ZoomorficasGuia Zoomorficas
Guia Zoomorficas
 
Andres pastrana
Andres pastranaAndres pastrana
Andres pastrana
 
A Novel Approach for Detection of Routes with Misbehaving Nodes in MANETs
A Novel Approach for Detection of Routes with Misbehaving Nodes in MANETsA Novel Approach for Detection of Routes with Misbehaving Nodes in MANETs
A Novel Approach for Detection of Routes with Misbehaving Nodes in MANETs
 
Institutions
InstitutionsInstitutions
Institutions
 
Conversión entre binarios, octal y hexadecimal
Conversión entre binarios, octal y hexadecimalConversión entre binarios, octal y hexadecimal
Conversión entre binarios, octal y hexadecimal
 
The Conversation Prism : vision prospective des réseaux sociaux
The Conversation Prism : vision prospective des réseaux sociauxThe Conversation Prism : vision prospective des réseaux sociaux
The Conversation Prism : vision prospective des réseaux sociaux
 
MinorHotels_Brand Presentation_July2016
MinorHotels_Brand Presentation_July2016MinorHotels_Brand Presentation_July2016
MinorHotels_Brand Presentation_July2016
 
Vision prospective : À quoi ressemblera le magasin de demain ?
Vision prospective : À quoi ressemblera le magasin de demain ?Vision prospective : À quoi ressemblera le magasin de demain ?
Vision prospective : À quoi ressemblera le magasin de demain ?
 
C chap1
C chap1C chap1
C chap1
 

Similar to shieh06a

Code scheduling constraints
Code scheduling constraintsCode scheduling constraints
Code scheduling constraintsArchanaMani2
 
Load balancing in Distributed Systems
Load balancing in Distributed SystemsLoad balancing in Distributed Systems
Load balancing in Distributed SystemsRicha Singh
 
Enhanced equally distributed load balancing algorithm for cloud computing
Enhanced equally distributed load balancing algorithm for cloud computingEnhanced equally distributed load balancing algorithm for cloud computing
Enhanced equally distributed load balancing algorithm for cloud computingeSAT Publishing House
 
Enhanced equally distributed load balancing algorithm for cloud computing
Enhanced equally distributed load balancing algorithm for cloud computingEnhanced equally distributed load balancing algorithm for cloud computing
Enhanced equally distributed load balancing algorithm for cloud computingeSAT Journals
 
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES
LATENCY-AWARE WRITE BUFFER RESOURCE  CONTROL IN MULTITHREADED CORESLATENCY-AWARE WRITE BUFFER RESOURCE  CONTROL IN MULTITHREADED CORES
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORESijmvsc
 
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORESLATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORESijdpsjournal
 
Iaetsd appliances of harmonizing model in cloud
Iaetsd appliances of harmonizing model in cloudIaetsd appliances of harmonizing model in cloud
Iaetsd appliances of harmonizing model in cloudIaetsd Iaetsd
 
Modified Active Monitoring Load Balancing with Cloud Computing
Modified Active Monitoring Load Balancing with Cloud ComputingModified Active Monitoring Load Balancing with Cloud Computing
Modified Active Monitoring Load Balancing with Cloud Computingijsrd.com
 
Continental division of load and balanced ant
Continental division of load and balanced antContinental division of load and balanced ant
Continental division of load and balanced antIJCI JOURNAL
 
A method for balancing heterogeneous request load in dht based p2 p
A method for balancing heterogeneous request load in dht based p2 pA method for balancing heterogeneous request load in dht based p2 p
A method for balancing heterogeneous request load in dht based p2 pIAEME Publication
 
A method for balancing heterogeneous request load in dht based p2 p
A method for balancing heterogeneous request load in dht based p2 pA method for balancing heterogeneous request load in dht based p2 p
A method for balancing heterogeneous request load in dht based p2 pIAEME Publication
 
Scalable Distributed Job Processing with Dynamic Load Balancing
Scalable Distributed Job Processing with Dynamic Load BalancingScalable Distributed Job Processing with Dynamic Load Balancing
Scalable Distributed Job Processing with Dynamic Load Balancingijdpsjournal
 
A load balancing algorithm based on
A load balancing algorithm based onA load balancing algorithm based on
A load balancing algorithm based onijp2p
 
Distributed System Management
Distributed System ManagementDistributed System Management
Distributed System ManagementIbrahim Amer
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORScscpconf
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processorscsandit
 

Similar to shieh06a (20)

Final report
Final reportFinal report
Final report
 
Code scheduling constraints
Code scheduling constraintsCode scheduling constraints
Code scheduling constraints
 
1844 1849
1844 18491844 1849
1844 1849
 
compiler design
compiler designcompiler design
compiler design
 
Load balancing in Distributed Systems
Load balancing in Distributed SystemsLoad balancing in Distributed Systems
Load balancing in Distributed Systems
 
Enhanced equally distributed load balancing algorithm for cloud computing
Enhanced equally distributed load balancing algorithm for cloud computingEnhanced equally distributed load balancing algorithm for cloud computing
Enhanced equally distributed load balancing algorithm for cloud computing
 
Enhanced equally distributed load balancing algorithm for cloud computing
Enhanced equally distributed load balancing algorithm for cloud computingEnhanced equally distributed load balancing algorithm for cloud computing
Enhanced equally distributed load balancing algorithm for cloud computing
 
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES
LATENCY-AWARE WRITE BUFFER RESOURCE  CONTROL IN MULTITHREADED CORESLATENCY-AWARE WRITE BUFFER RESOURCE  CONTROL IN MULTITHREADED CORES
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES
 
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORESLATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES
 
Iaetsd appliances of harmonizing model in cloud
Iaetsd appliances of harmonizing model in cloudIaetsd appliances of harmonizing model in cloud
Iaetsd appliances of harmonizing model in cloud
 
Modified Active Monitoring Load Balancing with Cloud Computing
Modified Active Monitoring Load Balancing with Cloud ComputingModified Active Monitoring Load Balancing with Cloud Computing
Modified Active Monitoring Load Balancing with Cloud Computing
 
Minimize Staleness and Stretch in Streaming Data Warehouses
Minimize Staleness and Stretch in Streaming Data WarehousesMinimize Staleness and Stretch in Streaming Data Warehouses
Minimize Staleness and Stretch in Streaming Data Warehouses
 
Continental division of load and balanced ant
Continental division of load and balanced antContinental division of load and balanced ant
Continental division of load and balanced ant
 
A method for balancing heterogeneous request load in dht based p2 p
A method for balancing heterogeneous request load in dht based p2 pA method for balancing heterogeneous request load in dht based p2 p
A method for balancing heterogeneous request load in dht based p2 p
 
A method for balancing heterogeneous request load in dht based p2 p
A method for balancing heterogeneous request load in dht based p2 pA method for balancing heterogeneous request load in dht based p2 p
A method for balancing heterogeneous request load in dht based p2 p
 
Scalable Distributed Job Processing with Dynamic Load Balancing
Scalable Distributed Job Processing with Dynamic Load BalancingScalable Distributed Job Processing with Dynamic Load Balancing
Scalable Distributed Job Processing with Dynamic Load Balancing
 
A load balancing algorithm based on
A load balancing algorithm based onA load balancing algorithm based on
A load balancing algorithm based on
 
Distributed System Management
Distributed System ManagementDistributed System Management
Distributed System Management
 
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORSAFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS
 
Affect of parallel computing on multicore processors
Affect of parallel computing on multicore processorsAffect of parallel computing on multicore processors
Affect of parallel computing on multicore processors
 

shieh06a

  • 1. Load Speculation Jong-Jiann Shieh Department of Computer Science and Engineering Tatung University shieh@ttu.edu.tw and Cheng-Chun Lin Night.cola@msa.hinet.net and Shin-Rung Chen apsras@amigo.cse.ttu.edu.tw Abstract The superscalar processor must issues instructions as early as possible to enhance the performance. But load instructions would be issued with register dependencies are solved and memory dependencies are known. Register dependence makes load instruction must wait until prior instruction with same destination register is completed. Memory dependence results in load instruction cannot be issued before the ambiguities are resolved. Therefore load instructions only could be issued when no register dependencies exist and all prior stores’ effective addresses calculated. This paper combines two mechanisms: value prediction (VP) and load forwarding history table (LFHT) to speculatively execute load instructions. Our study shows that by doing so there is about 15% average speedup up over baseline architecture. Keyword: load speculation, register dependence, memory dependence, value prediction, load forwarding 1. Introduction Modern superscalar processors allow instructions to execute out of program order to find more instruction level parallelism (ILP). These processors must monitor data dependencies to maintain correct program behavior. There are two types of data dependencies, register dependence and memory dependence. Register dependence is detected in the instruction decode stage by examining instructions’ register operand fields. If there is an instruction which load instruction depends on, the load instruction must wait until prior instruction completed, then the value of operand can be used. The lack of information about memory dependence at instruction decode time is a problem for an out-of-order instruction scheduler. If the scheduler executes a load before a prior store that writes to the same memory location, the load will read the wrong value. In this event the load and all subsequent dependent instructions must be re-executed, resulting in a huge performance penalty. To avoid these memory order violations, the instruction scheduler should be conservative to prevent loads from executing until all prior stores have executed. This approach decreases performance because loads in majority cases will be made falsely dependent on no alias stores as data on section 3 shown. In this paper, we use a simple value predictor to predict the operand value to avoid register dependence and propose a structure called Load Forwarding History Table (LFHT) to exploit memory dependence speculation at run time. As we combine these two mechanisms, the predictor can help LFHT making more load instructions execute without waiting for the prior stores’ effective addresses calculated, this result in more load instructions will be issued earlier. When a load instruction is speculatively executed, instructions that are dependent upon the load instruction will also be speculatively executed. The organization of the rest of this paper is as follows. Section 2 surveys previously proposed related works. Section 3 illustrates whole structure in superscalar processor. Section 4 describes our CPU model and simulation environment. The performance is evaluated in section 5. Finally, the conclusion of this paper is presented in section 6. 2. 
Related Works The traditional works on memory disambiguation were done in the context of compiler and hardware mechanisms for non-speculative disambiguation to ensure program correctness. Franklin and Sohi [2] proposed the address resolution buffer (ARB). The ARB indicates memory references into bins according to their address. The bins are used to cause a temporal order between references to the same address. The ARB is a structure based on bank. Multiple disambiguation requests can be dispatched in one cycle, provided that they are all to different banks. Chrysos and Emer used predictor to solve memory disambiguation problem in [5]. The goal of the designers is to be able to schedule load instructions as soon as possible without causing any memory order violations. The predictor proposed is
  • 2. based on store-sets. A store set for a specific load is the set of all stores upon which the load has ever depended. The processor adds a store to the store set of the load if a memory order violation is caused when the load executes before that store. In the next instance of the load instruction, the store set is accessed to determine which stores the load will need to wait for before executing. A. Yoaz., M. Erez., R. Ronen. and S. Jourdan designed a CHT predictor [7]. The CHT predictor provides a prediction about whether a load instruction will conflict with any store in the instruction window. Allocating a new entry only when a load collides for the first time and invalidating its entry when its state changes to non-colliding. It does not predict which store instruction the load will conflict with. Therefore, it is easier to design but it does not provide the best possible information for disambiguation purposes. Color set [10] presents a simple mechanism which incorporates multiple speculation levels within the processor and classifies the load and the store instructions at run time to the appropriate speculation level. Each speculation level is termed as a color and the sets of load and store instructions are called color sets. These colors divide the load instructions into distinct sets, starting with the base color which corresponds to the no violation case. In other words, this set is the set of load instructions which have never collided with unready store instructions in the past. Each color in the spectrum represents increasing levels of aggressiveness in load speculation; a load instruction is allowed to issue only if its color is less than or equal to the current speculation level. If the processor later discovers that the load has collided with a store, the color assigned to the load instruction in the predictor is increased. 3. VALUE PREDICTION AND LFHT 3.1 Issuing a Load When executing a load or store instruction, the instruction is split into two micro instructions inside the processor [1]. One instruction calculates the effective address, and the other instruction performs the memory access once the effective address calculated and any potential store alias dependencies resolved. In the baseline architecture, each store and load instruction must wait until its effective address calculation completes. In addition, all stores are issued in-order with respect to prior stores, and each load must wait on the most recent store before it can be speculatively issued. There are three cases that a load instruction always spends cycles on, (1) waiting on its effective address calculation (ea), (2) waiting for prior store addresses to be calculated (dep), and (3) the latency for fetching the data (mem). This paper focus on (1) and (2). We use data prediction to solve problem (1), and use LFHT to solve problem (2). Figure 3.1 shows how many cycles per load instruction waiting on its effective address [13]. As the figure shows that each load instruction must wait 7 cycles in average so that it can get its effective address, this make a lot of wasting. 0 5 10 15 20 25 30 bzip2 crafty gap gcc gzip m cf parser twolf vortex vpr Figure 3.1 Cycles per load instruction spend on waiting its effective address in baseline architecture. In the conventional disambiguation memory dependence mechanism [14], load-forwarding behavior can detect store alias and forward store data. Figure 3.2 shows percentages of load that can take advantage of load-forwarding behavior [12]. 
Most load instructions will not forward store data and conflict with prior store on the baseline simulation architecture (describe in section 4), the average amount of forwarding load is 12.7% and the lowest amount of forwarding load is only 2.7%. It means that most load instructions are unnecessarily pending for disambiguating memory dependence. 0.0% 10.0% 20.0% 30.0% 40.0% 50.0% bzip2 crafty gap gzip m cf parser tw olf Figure 3.2 Percent of forwarding load instructions 3.2 Value Prediction All loads have to wait until their effective address is calculated before they can be issued. If the load is on the critical path, and the address can be accurately predicted, then it can be beneficial to speculate the value of the address and load the data as soon as possible. A load instruction is effectively split into two instructions inside the processor, one instruction calculates the effective address. In order to predict this instruction, we predict instruction’s operand so that we don’t have to wait prior instruction which this instruction depend on. Since we predict instruction’s operand, there are some instructions that don’t have their register dependence with prior instruction, these instructions don’t need to predict operand because they already have exactly register value. In our
  • 3. simulation, we only predict instructions with register dependence. But all load instructions must update the predictor, so that we can maintain predictor’s accurate rate. Data prediction helps speedup the effective address calculation for a load. The load then has only to wait on potential store aliases before issuing. But if the operand was incorrectly predicted, a recovery mechanism will take place when the actual operand is available. Value prediction has been studied for a long time and many schemes have been proposed [4, 6, 11]. In this paper we use the simplest scheme, stride predictor, to predict the operand of a load instruction. 3.3 Load Forwarding History Table When a load is issued, it performs a lookup in the store buffer for a non-committed aliased store and it performs its data cache access in parallel. If a store alias is found, store data forward to load and the load has a shorter latency. If there is no store alias, and there is a data cache hit, the load has a longer latency because of the pipelined data cache. If there is a miss in the data cache, the miss will only be processed if no alias is found in the store buffer, load-forwarding behavior can detect store alias and forward store data. This way, load instructions can be issued out of order without waiting for prior stores executed [8, 9, 14]. Conventional disambiguation memory dependence mechanism unable to provide information for load instruction in the decode stage. For that reason, in order to exploit load-forwarding behavior and bring about all of these benefits, a mechanism is proposed: the load forwarding history table (LFHT). The LFHT records the result produced when the load instruction was executed for load forwarding behavior of the last time, and determines whether or not to out of order issue the load when the load instruction is encountered in the future. Each LFHT entry contains two fields: the tag field and alias bit field. The LFHT is considered as a direct mapped cache, indexed by the PC. The alias bit field is a sticky bit, the load instruction is always treated as alias and waits until all prior store addresses have been calculated before issuing, after the load instruction encounter the first load forwarding behavior indicate conflict with store at the execution time. LFHT will be established or updated according to load forwarding behavior of a load instruction at run time. If LFHT miss, part of load instruction’s PC is written to the corresponding entry as tag and the alias bit will be set or clear depend on the load forwarding behavior. If LFHT hit, alias bit will also be set or clear depend on the load forwarding behavior. But if alias bit is in set state and load forwarding behavior indicate no conflict with prior store, alias bit is still kept in set state. If LFHT hit, and alias bit is in clear state, the load instruction will be speculatively executed. The validation/invalidation of speculative load instruction is performed when each prior store address has been calculated. Each time a store address is calculates, all the executed speculative loads that occur after store in the instruction window have their addresses checked for an alias. If an alias is found, recovery action is taken for the load, and the load must be re-issued; corresponding alias bit changed into set state to avoid incorrect speculative load execution in the future. 3.4 Combine VP and LFHT We used LFHT introduced in section 3.3 to combine with value predictor discussed in section 3.2. 
If only VP is used, although some load instructions can get their operand faster, but these load instructions still must wait prior store instruction’s effective address calculated to ensure that there are no memory dependence. But if only LFHT is used, although we have overcome memory disambiguation problem, but there still exist register dependence, it means, some load instructions’ operand isn’t available to use. So that these load instructions must wait operand ready to issue to the function unit. So we combine these two mechanisms as shown in figure 3.3 to solve both memory dependence and register dependence. Instr. Fetch Unit Decode Unit Register Update Unit Load/Store Function Unit Function Unit Function Unit LFHT Register file The additional data path VP Figure 3.3 Architecture data path with VP and LFHT. 4. Evaluation Methodology 4.1 Machine Model The simulator used in this work is derived from the SimpleScalar 2.0 and 3.0c tool set [3], a suite of functional and timing simulation tools. The instruction set architecture employed is the Alpha instruction set, which is based on the Alpha AXP ISA. Table 1 summarizes some of the parameters used in our baseline architecture. Table 2 shows the architectures we studied in this evaluation. Table 1 Baseline Architecture Configuration Instruction fetch 8 instructions per cycle. Out-of-Order execution mechanism Issue of 8 instructions /cycle, 256 entry RUU(which is the ROB and the IW combined), 128 entry load/store queue. Loads executed only after all preceding store addresses are known. Value bypassed to loads from matching stores ahead in the load/store queue. 2 cycle load forwarding latency.
  • 4. Architected registers 32 interger, hi, lo, 32 floating point. Functional units (FU) 8-integer ALUs, 8 load/store units, 4-FP adders, 1-Integer MULT/DIV, 1-FP MULT/DIV FU latency int alu--1, load/store--1, int mult--3, int div--12, fp adder--2, fp mult--4, fp div--12, fp sqrt--24 L1 Instruction cache 64K bytes, 2-way set assoc., 32 byte block, 4 cycles hit latency. L1 Data cache 64K bytes, 2-way set assoc., 32 byte block, 4 cycles hit latency. Dual ported. L2 unified cache 1024K bytes, 4-way set assoc., 64 byte block, 12 cycles hit latency Memory Memory access latency (first-36, rest-4) cycle. Width of memory bus is 32 bytes. TLB miss 30 cycles Table 2 Architectures we studied Baseline Baseline architecture VP Baseline + VP VP + LFHT Baseline + VP + LFHT VP + LFHT with Cycle Clear Baseline + VP + LFHT with Cycle Clear Perfect VP + Perfect LFHT Baseline + Perfect VP + Perfect LFHT Note: Cycle Clear is a keyword detailed in section 5.3 4.2 Benchmarks To perform our experimental study, we have collected results of the SPEC2000 benchmarks. The programs were compiled with the gcc compiler included in the tool set. Table 4 shows the input data set for each integer benchmark. Table 5 shows the floating-point benchmark. In simulating the benchmarks, we skipped the first billion instructions, and collected statistics on the next five hundred million instructions. Table 4 Input data set for benchmarks SPECint 2000 Input SPECfp 2000 Input bzip2 input.source ammp ammp.in crafty crafty.in applu applu.in gap ref.in art a10.img & gcc 166.i equake inp.in gzip input.graphic galgel galgel.in mcf inp.in licas lucas2.in parser ref.in mesa mesa.in twolf ./twolf/ref mgrid mgrid.in vortex lendian.raw swim swim.in vpr net.in & arch.in 5. Performance Analysis In this section, we will examine the performance improvement gained by using the proposed mechanism. We also explore detail configuration of VP and LFHT. First of all, we show how many load instructions each benchmarks have (as shown in figure 5.1 and 5.2). This can help analyzing data. Note: load instructions with same program counter means these instructions are the same instruction. 5.1 VP: Cycles Waste for Waiting EA Figure 5.1 shows how many cycles per load waste for waiting its effective address when using VP. Compare to figure 3.1 in section 3.1, 4.5 cycles per load instruction are saved after using VP. 0 1 2 3 4 5 6 bzip2 crafty gap gcc gzip m cf parser twolf vortex vpr Figure 5.1 Cycles per load instructions spend on waiting its effective address after using VP. 5.2 LFHT The average LFHT hit rate of integer benchmarks are 68% for 128, 85% for 512, 94% for 2048, 99% for 8192, of floating-point benchmarks are 80% for 128, 99% for 512, 97% for 2048, 99% for 8192. Figure 5.2 shows how many cycles per load waste for waiting its effective address when using LFHT, each load instruction spends average 3.43 cycles. Figure 5.3 shows how many cycles per load instructions spend on waiting its effective address after we combined both VP and LFHT. Each load instruction now spends average 1.91 cycles.. Because of (1) effective address calculation need at least one cycle, (2) there are 40% of load instructions with unavailable operand can’t predict (because these load instructions’ state field in VP are init or transient), so we still have to wait 1.91 cycles. 5.3 IPC & Speedup Figure 5.4 shows IPC in integer benchmarks. Figure 5.5 shows IPC in floating-point benchmarks. 
5.3 IPC & Speedup

Figure 5.4 shows the IPC of the integer benchmarks, and Figure 5.5 shows the IPC of the floating-point benchmarks.

The alias bit is a sticky bit in the LFHT: it stays set after the first conflicting store is detected. When a load instruction conflicts with a prior store and its alias bit is set, the load always waits until all prior store addresses have been calculated before issuing, whether or not a store alias actually occurs on that execution. Over a long run this can curtail load speculation through false data dependences. To prevent the LFHT from becoming too conservative in this way, all alias bit fields in the LFHT are cleared at a regular interval of cycles; in this paper we modeled a 50,000-cycle clear (CC) for the LFHT (a minimal sketch of this policy follows the figure captions below).

Figure 5.6 shows the integer benchmarks' speedup over the baseline: the average speedup is 14.5% without cycle clear and 16.1% with cycle clear. Figure 5.7 shows the floating-point benchmarks' speedup over the baseline: the average speedup is 5% both with and without cycle clear. Speedup is calculated as (new scheme's IPC - baseline's IPC) / baseline's IPC; for example, an IPC of 1.16 against a baseline IPC of 1.00 would be a 16% speedup.

In our simulation we also model a perfect scheme for both VP and LFHT, to check how much performance can at most be gained from this approach. For VP, we use perfect prediction on all load instructions: when a load instruction is dispatched to the RUU, its actual operand value can be used whether or not the operand is available, because prediction is 100% accurate. For the LFHT, the perfect method means a load instruction waits only for stores with the same effective address, so no time is spent unnecessarily waiting for store effective addresses. The average speedup with perfect prediction is 20.5% for the integer benchmarks and 7.5% for the floating-point benchmarks. Our realistic results, 16.1% for the integer benchmarks and 5% for the floating-point benchmarks, are close to this perfect case. The main reason vortex performs worse than the baseline is its low value-prediction accuracy, which costs extra cycles squashing instructions and re-fetching them into the instruction window.

Figure 5.2 Cycles per load instruction spent waiting for its effective address after using the LFHT (benchmarks bzip2 through vpr; y-axis 0 to 6 cycles).

Figure 5.3 Cycles per load spent waiting for its EA after combining both VP and LFHT (y-axis 0 to 3.5 cycles).

Figure 5.4 IPC (integer benchmarks); bars: baseline, VP(2K), VP+LFHT with CC, perfect.

Figure 5.5 IPC (floating-point benchmarks); bars: baseline, VP, VP+LFHT with CC, perfect.
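To make the Cycle Clear policy concrete, here is a minimal C sketch of the sticky alias bit and its periodic clearing. The 50,000-cycle interval is the one modeled in the paper; the table size, the per-cycle hook, and all identifiers are illustrative assumptions.

    #define LFHT_SIZE      2048
    #define CLEAR_INTERVAL 50000ULL   /* the clearing period modeled above */

    static int alias_bit[LFHT_SIZE];

    /* On a detected store-load conflict the bit is set and, being sticky,
     * stays set: from then on this load waits for all prior store
     * addresses before issuing. */
    void lfht_record_conflict(unsigned index)
    {
        alias_bit[index] = 1;
    }

    /* Cycle Clear: wipe every alias bit at a fixed interval so that a
     * stale conflict cannot keep a load conservative forever. */
    void lfht_tick(unsigned long long cycle)
    {
        if (cycle % CLEAR_INTERVAL == 0)
            for (unsigned i = 0; i < LFHT_SIZE; i++)
                alias_bit[i] = 0;
    }

Clearing trades a brief burst of re-learned conflicts (and possible misspeculations) for the removal of stale false dependences; the 16.1% versus 14.5% integer speedup reported above suggests the trade is worthwhile at this interval.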
Figure 5.6 Speedup over the baseline (integer benchmarks); bars: VP, VP+LFHT with CC, perfect; y-axis -0.1 to 0.6.

Figure 5.7 Speedup over the baseline (floating-point benchmarks); bars: VP, VP+LFHT with CC, perfect; y-axis -0.02 to 0.2.

6. Conclusions

In this paper we present a combined mechanism for improving the load-instruction issue rule in modern superscalar processors. Conventionally, a load instruction issues only once it is certain that no dependences exist, which reduces instruction-level parallelism. We proposed a scheme that combines two mechanisms, value prediction (VP) and the load forwarding history table (LFHT), to execute load instructions speculatively. All VP and LFHT information is established and updated at run time from load-instruction history and load-forwarding behavior, and it supplies the memory disambiguation information needed for speculative load issue at issue time.

Throughout this study we have not only examined the load-instruction issue rule but also revisited memory dependence and disambiguation from two angles: first, we studied the characteristics of load instructions and used that information for memory dependence and disambiguation; second, we proposed a combined scheme that exploits these properties to improve instruction-level parallelism. We evaluated the performance of the proposed architecture with SimpleScalar. VP alone provides an average speedup of 8.5% over the baseline architecture; with VP and LFHT the speedup is 14.5%; and with Cycle Clear in the LFHT a 16.1% speedup over the baseline is achieved.

References

[1] M. Johnson, "Superscalar Microprocessor Design," Prentice Hall, Englewood Cliffs, 1991.
[2] M. Franklin and G. S. Sohi, "ARB: A Hardware Mechanism for Dynamic Reordering of Memory References," IEEE Transactions on Computers, May 1996.
[3] D. C. Burger and T. M. Austin, "The SimpleScalar Tool Set, Version 2.0," Technical Report CS-TR-97-1342, University of Wisconsin-Madison, June 1997.
[4] K. Wang and M. Franklin, "Highly Accurate Data Value Prediction Using Hybrid Predictors," in Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture, pages 281-290, Dec. 1997.
[5] G. Z. Chrysos and J. S. Emer, "Memory Dependence Prediction Using Store Sets," in Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 142-153, June 1998.
[6] G. Reinman and B. Calder, "Predictive Techniques for Aggressive Load Speculation," in Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, pages 127-137, Nov. 1998.
[7] A. Yoaz, M. Erez, R. Ronen, and S. Jourdan, "Speculation Techniques for Improving Load Related Instruction Scheduling," in Proceedings of the 26th International Symposium on Computer Architecture, May 1999.
[8] G. Reinman and B. Calder, "A Comparative Survey of Load Speculation Architectures," Journal of Instruction-Level Parallelism, May 2000.
[9] A. Moshovos and G. S. Sohi, "Reducing Memory Latency via Read-after-Read Memory Dependence Prediction," IEEE Transactions on Computers, pages 313-326, March 2002.
[10] S. Onder, "Cost Effective Memory Dependence Prediction Using Speculation Levels and Color Sets," in Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, pages 232-241, Sept. 2002.
[11] H. Zhou, J. Flanagan, and T. M. Conte, "Detecting Global Stride Locality in Value Streams," in Proceedings of the 30th Annual International Symposium on Computer Architecture, pages 324-335, June 2003.
[12] S.-R. Chen, "Memory Disambiguation Using Load Forwarding," Master's Thesis, Department of Computer Science and Engineering, Tatung University, July 2004.
[13] C.-C. Lin, "Load Speculation," Master's Thesis, Department of Computer Science and Engineering, Tatung University, July 2005.
[14] J. P. Shen and M. H. Lipasti, Modern Processor Design: Fundamentals of Superscalar Processors, McGraw-Hill, 2005.