Design and Performance Evaluation of a Multithreaded Architecture *

R. Govindarajan, Dept. of Computer Science, Memorial Univ. of Newfoundland, St. John's, A1C 5S7, Canada (govind@cs.mun.ca)
S. S. Nemawarkar, Dept. of Electrical Engineering, McGill University, Montreal, H3A 2A7, Canada (shashank@macs.ee.mcgill.ca)
Philip LeNir, Dept. of Electrical Engineering, McGill University, Montreal, H3A 2A7, Canada (lenir@ee470.ee.mcgill.ca)

* This work was supported by MICRONET - Network Centres of Excellence, Canada.
Abstract
Multithreaded architectures have the ability to tolerate long memory latencies and unpredictable synchronization delays. In this paper we propose a multithreaded architecture that is capable of exploiting both coarse-grain parallelism and fine-grain instruction-level parallelism in a program. Instruction-level parallelism is exploited by grouping instructions from a number of active threads at runtime. The architecture supports multiple resident activations to improve the extent of locality exploited. Further, a distributed data structure cache organization is proposed to reduce both the network traffic and the latency in accessing remote locations.

Initial performance evaluation using discrete-event simulation indicates that the architecture is capable of achieving very high processor throughput. The introduction of the data structure cache reduces the network latency significantly. The impact of various cache organizations on the performance of the architecture is also discussed in this paper.
1 Introduction
Multithreaded architectures [10, 11] are based on a hybrid evaluation model which combines the von Neumann execution model and data-driven evaluation. In the hybrid model, a program is represented as a partially-ordered graph of nodes. The nodes, called threads, consist of a sequence of instructions which are executed in the conventional von Neumann way. Individual threads are scheduled in a dataflow-like manner, driven by the availability of necessary input operands to the threads. Also, in the hybrid evaluation model, a long-latency memory operation is performed as a split-phase operation, where the access request is issued by one thread and the accessed value is used in a different thread; the response to the access request provides the necessary synchronization for the second thread. As a result the processor does not idle on a long-latency operation; instead it switches to the execution of another thread.
In order to improve the extent of locality exploited
and to perform synchronization efficiently, it is nec-
essary to recognize the three levels of program hier-
archy, namely code-block, threads, and instructions,
and use appropriate synchronization and scheduling
mechanisms at each level of hierarchy [6]. Based on
the above design philosophy, in our earlier work, we
have proposed the Scalable Multithreaded Architec-
ture to exploit Large Locality (SMALL) [8]. A salient
feature of SMALL is maintaining multiple resident ac-
tivations in the processor which ensures the exploita-
tion of high locality and zero load stalls in accessing
the local variables of a function.
In this paper, we extend SMALL to exploit both
coarse-grain parallelism and fine-grain instruction-
level parallelism. Coarse-grain parallelism is exploited
by distributing the execution of various invocations of
a function body (or loop body) across several Process-
ing Elements (PEs). Fine-grain instruction-level par-
allelism is exploited by a runtime grouping of instruc-
tions from multiple threads and executing them con-
currently on multiple execution pipes available in each
PE. Further, in the proposed architecture, we intro-
duce a distributed Data Structure cache (DS-Cache)
organization for shared data structures. Our archi-
tecture supports caching of two types of data struc-
tures, namely the I-Structures [4], where each loca-
tion is written at most once, and the normal data
structures, where an individual location can be writ-
ten many times. Coherence in caches is maintained by
using a special scheme for I-Structures, and by using
software cache coherence mechanisms for normal data
structures.
The performance of the multithreaded architecture
is evaluated using discrete-event simulation. Our sim-
ulation results indicate:
(1) With a small number (2 or 3) of execution pipes,
the architecture can effectively exploit all the paral-
lelism available at a PE.
(2) The throughput with two execution pipes per PE
is almost equal to that of a configuration with twice
the number of PEs and a single execution pipe.
(3) The introduction of a set-associative cache leads to
a near-linear performance improvement with respect
to the number of PEs. Further, our simulation results
show the presence of the DS-Cache is essential to re-
alize the performance gains due to multiple execution
pipes.
(4) Lastly, the presence of the DS-Cache reduces the
network latency experienced by remote read requests
by a factor of 3 to 6.
The details of the architecture are described in Sec-
tion 2. The cache organization and the protocol for
caching I-Structures are discussed in the subsequent
section. Section 4 deals with the performance eval-
uation of our architecture using various benchmark
programs. In Section 5 we investigate the feasibility
of the architecture by comparing the functionality of
various modules with the standard modules present
in other commercially available processors. We com-
pare our work with other multithreaded architectures
in Section 6. Concluding remarks are presented in
Section 7.
2 The Architecture
The execution model of our architecture, which is
similar to that proposed in the threaded abstract ma-
chine [6], is discussed in [8]. Readers are referred to [8]
for details.
The architecture consists of a number of Processing
Elements (PEs) connected by an interconnection net-
work. For the purpose of this paper we consider the
binary n-cube network as the topology of the inter-
connection network. In this section, we describe the
organization of a PE (refer to Fig. 1) and all functional
units except the DS-Cache. The processor part of the
architecture is enclosed in a dotted box in the figure.
2.1 Frame Memory Unit
The activation frames of a program are stored in
the frame memory unit. The frame memory unit also
consists of a simple memory manager which is respon-
sible for managing the frame memory. On a func-
tion invocation, the frame memory manager allocates
Figure 1: Organization of a Processing Element
a frame of appropriate size in the frame memory. In-
put arguments to an activation also reach the frame
memory unit in the form of tokens, and the values are written into the respective frame memory locations by the memory manager. A signal indicating the arrival
of an operand is sent to the thread synchronization
unit where the synchronization of input arguments to
a thread is performed. The frame memory unit also
receives a signal from the execution pipe when an ac-
tivation terminates. In that case, the frame memory
manager deallocates the corresponding frame.
2.2 Thread Synchronization Unit
The thread synchronization unit performs the syn-
chronization of inputs to a thread. It is similar to the
explicit token store matching unit of the Monsoon ar-
chitecture [16]. A thread becomes enabled when it has
received all its input arguments. The activation con-
taining this thread becomes enabled, if the activation
is not already enabled. An enabled thread is sent to the filter unit.
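As an illustration of the enabling mechanism just described, the following minimal C sketch tracks operand arrivals with a per-thread synchronization count, in the spirit of an explicit-token-store matcher; the data layout, the table size, and all names are our assumptions, not part of the architecture specification.

    #include <stdbool.h>

    #define MAX_THREADS_PER_ACT 16   /* hypothetical limit */

    typedef struct {
        int  sync_count[MAX_THREADS_PER_ACT]; /* operands still outstanding per thread */
        bool enabled;                         /* does this activation have an enabled thread? */
    } activation_t;

    /* Called when the frame memory unit signals the arrival of an operand for
     * thread `tid` of activation `act`. Returns true when the thread has just
     * become enabled and must be forwarded to the filter unit. */
    bool operand_arrived(activation_t *act, int tid)
    {
        if (--act->sync_count[tid] > 0)
            return false;        /* still waiting for more inputs */
        act->enabled = true;     /* the containing activation becomes enabled too */
        return true;             /* enqueue this thread for the filter unit */
    }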
2.3 Filter Unit
If the incoming thread belongs to an activation
whose activation frame is already resident in the high-
speed buffer (refer to Section 2.4), the filter unit for-
wards the thread to the scoreboard unit. If the thread
belongs to an activation which is not resident, then
the filter unit checks whether the activation can be
loaded in the high-speed buffer. If so it instructs the
frame memory unit to load the activation frame in the
high-speed buffer (refer to Section 2.4). When the ac-
tivation is successfully loaded, it is said to be resident
(in the high-speed buffer), and the incoming thread is
sent to the scoreboard unit. On the other hand, if the
activation corresponding to the incoming thread can-
not be loaded, then the thread is queued inside the
filter unit until an activation is freed. The filter unit
proceeds to service other threads in its input queue.
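The filter unit's decision for an incoming enabled thread can be summarized by the following sketch; the helper routines (is_resident, try_load_frame, and so on) are hypothetical names standing in for the high-speed buffer bookkeeping described above.

    #include <stdbool.h>

    typedef struct { int activation_id; int thread_id; } thread_t;

    /* Hypothetical helpers for the resident-activation table and queues. */
    bool is_resident(int activation_id);
    bool try_load_frame(int activation_id);
    void send_to_scoreboard(thread_t t);
    void enqueue_waiting(thread_t t);

    void filter_incoming_thread(thread_t t)
    {
        if (is_resident(t.activation_id)) {
            send_to_scoreboard(t);          /* activation frame already resident */
        } else if (try_load_frame(t.activation_id)) {
            /* frame memory unit loads the frame into the high-speed buffer;
             * once loading completes the thread proceeds to the scoreboard */
            send_to_scoreboard(t);
        } else {
            enqueue_waiting(t);             /* no free slot; retry when an activation is freed */
        }
    }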
2.4 High-speed Buffer
The high-speed buffer in our architecture is a mul-
tiported cache-like memory with a single cycle access
time. On a request from the filter unit, the frame
memory loads a frame in the high-speed buffer with
the help of the buffer loader.
When there are no enabled threads in a resident
activation, the high-speed buffer receives signals from
the score-board unit either to flush or to off-load a
frame from the high-speed buffer. If the request is to
off-load the frame, then the contents are stored back
in the frame memory. A flush request issued by the
filter unit indicates that the corresponding activation
has terminated and therefore there is no need to save
the frame.
2.5 Score-Board Unit
The score-board unit performs book-keeping op-
erations such as how many threads in an activation
are currently enabled. When the number of enabled
threads in an activation becomes zero, it instructs the
high-speed buffer to either flush or off-load the cor-
responding activation frame. Threads arriving at the
score-board unit are sent to the Ready Thread Queue after the book-keeping operations.
2.6 Ready Thread Queue
The Ready Thread Queue maintains a pool of
threads which await certain resources in the instruc-
tion scheduler. A resource corresponds to a set of
registers, namely a program counter, a base-address
register for the activation frame, an intermediate in-
struction register, an instruction register, and a stall
register. The role of these registers will be explained
in the following subsection on instruction scheduler.
When a resource becomes available, the instruction
scheduler takes a thread from the pool and allocates
the resource to it. A thread that has acquired a re-
source is said to be active.
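The register set that makes up one such resource can be pictured as the following C structure; the field names are ours, and the array size of 32 resources follows the simulation parameters given later in Section 4.1.

    /* A minimal sketch of one "resource" in the instruction scheduler. */
    typedef struct {
        unsigned pc;           /* program counter of the active thread */
        unsigned frame_base;   /* base address of its activation frame */
        unsigned inter_instr;  /* intermediate instruction register */
        unsigned instr;        /* instruction register */
        unsigned stall;        /* remaining stall cycles for this thread */
        int      busy;         /* is the resource allocated to an active thread? */
    } resource_t;

    /* A PE would then hold, e.g.:  resource_t resources[32]; */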
2.7 Instruction Scheduler
The instruction scheduler consists of two units, the
fetch unit and the schedule unit. The instruction fetch
unit receives a thread from the ready thread queue
and loads the program counter allocated to it. The
instruction pointed by the program counter is fetched
from the instruction cache. The instruction is stored
in the Intermediate Instruction Register and the associated stall count¹ is loaded into the stall register.
The set of intermediate registers together form the so
called instruction window for the multiple execution
pipes.
The schedule unit, at each execution cycle, checks
the intermediate instruction registers. It selects up to n instructions for execution, where n is the number of
execution pipes. These selected instructions are then
moved to the instruction register. These instructions
are initiated in the execution pipes in the following
cycle. The associated stall registers are decremented
every cycle until they become zero. The corresponding
program counters are then incremented and the next instructions from those threads are fetched.
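The per-cycle behaviour of the schedule unit can be sketched as below, reusing the hypothetical resource_t structure from the Ready Thread Queue example; this is a simplified single-cycle interpretation of the text, and the helper names, the pipe count of 2, and the placement of the stall update are our assumptions.

    #define NUM_PIPES     2    /* n: the results later suggest 2-3 pipes suffice */
    #define NUM_RESOURCES 32   /* matches the simulation parameters in Section 4.1 */

    /* Hypothetical helpers standing in for the fetch unit, the execution pipes,
     * and the compile-time stall annotation of an instruction. */
    void     issue_to_pipe(int pipe, resource_t *r);
    void     fetch_into(resource_t *r);        /* refill inter_instr from the I-cache */
    unsigned stall_count(unsigned instr);

    void schedule_cycle(resource_t res[NUM_RESOURCES])
    {
        int issued = 0;
        for (int i = 0; i < NUM_RESOURCES; i++) {
            if (!res[i].busy)
                continue;
            if (res[i].stall > 0) {
                /* a previously issued instruction still stalls its successor */
                if (--res[i].stall == 0) {
                    res[i].pc++;
                    fetch_into(&res[i]);       /* fetch the thread's next instruction */
                }
            } else if (issued < NUM_PIPES) {
                res[i].instr = res[i].inter_instr;         /* move to instruction register */
                issue_to_pipe(issued++, &res[i]);          /* initiated in the next cycle */
                res[i].stall = stall_count(res[i].instr);  /* compile-time stall annotation */
                if (res[i].stall == 0) {
                    res[i].pc++;
                    fetch_into(&res[i]);       /* no dependency stall: fetch successor now */
                }
            }
        }
    }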
2.8 Instruction Cache
The instruction cache unit is similar to the instruc-
tion cache in conventional machines. Only code-blocks
corresponding to resident activations are present in the
instruction cache.
2.9 Execution Pipe
Each PE consists of a number of instruction execu-
tion pipes. The execution pipes used in our architec-
ture are generic in nature, each capable of performing
any operation. They are assumed to be fully pipelined.
The execution pipes are load/store in nature: every
operation other than memory load and store uses reg-
ister operands. Since an activation frame is pre-loaded
in the high-speed buffer before instructions of that ac-
tivation are scheduled for execution, it is guaranteed
that a load operation (on a frame location) does not
cause any load stalls. The execution pipes share a
register file. The register file is logically divided into
¹ At compile time, a stall count is associated with each instruction which indicates how many stall cycles are needed, to take care of data dependency in pipeline execution, before the next instruction in this thread can be initiated.
a number of register banks, one corresponding to each
resident activation. In the execution model, the logi-
cal register name specified in an instruction is used as
an offset within the register bank.
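As a small illustration of this register addressing, one straightforward bank layout is shown below; only the offset rule is stated in the text, so the linear layout and the bank size of 32 registers (taken from the simulation parameters in Section 4.1) are assumptions.

    #define REGS_PER_ACTIVATION 32

    /* The logical register named in an instruction is an offset into the
     * register bank of the corresponding resident activation slot. */
    static inline int physical_reg(int resident_slot, int logical_reg)
    {
        return resident_slot * REGS_PER_ACTIVATION + logical_reg;
    }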
The instruction set of our architecture includes spe-
cial instructions to invoke a new activation, to signal the termination of a thread or of an activation, and to communicate results to an external activa-
tion. An execution pipe sends a message to the router
unit whenever it has to (i) access the data structure
memory (either local or remote), (ii) communicate ar-
guments to an activation, or (iii) invoke an activation.
2.10 Router Unit
The router unit receives messages from the execu-
tion pipes, from the data structure memory, or from
the network switches. The messages are appropriately
routed to the frame memory, the data structure mem-
ory, the DS-cache, or the network switch.
2.11 Network Switch
The PEs of our architecture are connected by a bi-
nary n-cube interconnection network which provides a
high network bandwidth. In this paper, we focus on
the use of a fixed routing scheme along the shortest
path to the destination.
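The paper states only that a fixed shortest-path route is used on the binary n-cube; dimension-order (e-cube) routing is one standard scheme with that property, sketched here for illustration.

    /* One-hop routing decision on a binary n-cube: correct the lowest
     * differing address bit first (dimension-order routing). */
    int next_hop(int current_pe, int dest_pe)
    {
        int diff = current_pe ^ dest_pe;   /* differing bits = remaining dimensions */
        if (diff == 0)
            return current_pe;             /* message has arrived */
        int dim = 0;
        while (!(diff & (1 << dim)))
            dim++;                         /* lowest differing dimension */
        return current_pe ^ (1 << dim);    /* neighbour across that dimension */
    }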
3 Cache Organization
In our architecture the data structure memory is shared and is distributed across the PEs. The data structure elements are uniformly distributed across the PEs.
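The exact mapping is not spelled out in the paper; one plausible interleaving consistent with a uniform distribution is a simple modulo placement, shown here only as an assumption.

    /* Hypothetical home-node mapping: element i of a shared structure
     * lives on PE (i mod P). */
    int home_pe(long element_index, int num_pes)
    {
        return (int)(element_index % num_pes);
    }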
A set of remote memory locations, called a cacheline, can be cached in the data structure cache (DS-Cache) of the PE. The DS-Cache can have either a direct-mapped or a k-way set-associative organization. However, for expository purposes, in this section we will consider the k-way set-associative organization.
tion. The DS-Cache consists of two parts, one for I-
Structures, whose elements are written only once, and
the other for read-write data structures. The respec-
tive caches are referred to as IS-Cache and RW-Cache.
A remote memory access request arriving at the router
unit is sent to the DS-Cache. If the corresponding
cacheline is present in the local cache, the location
is accessed from the cache and the contents are sent
to the router unit. Otherwise a request to fetch the
cacheline will be sent to the router. The router in turn
forwards the request to the appropriate PE through
the network switch. In the following two subsections
we discuss the issues in caching shared data structures
and the policies followed by the DS-Cache.
3.1 Caching Read-Write Data Structures
Whenever an access to a remote memory location
arrives at the DS-cache, the cache memory is searched
for the cacheline. If a hit occurs the access request is
satisfied by the RW-Cache. On a miss, a free cache
block is reserved for this cacheline. The tag and refer-
ence count fields of this cache block are set appropri-
ately. A pending bit associated with the cache block
is set to 1. The pending bit indicates that a request
to the cacheline has been sent on the network and the
response is awaited. When the response is received
the pending flag is reset. The reserved cacheline along
with the pending bit avoids multiple requests for the
same cacheline being sent to the network. Such re-
quests are queued inside the RW-cache until the pend-
ing bit is reset. When the requested cacheline arrives,
the pending bit is reset and the waiting requests are
serviced by sending the appropriate data values.
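The read path just described can be sketched in C as follows; only the tag, pending-bit, and queueing behaviour follows the text, while the block layout and the helper routines are hypothetical.

    #include <stdbool.h>

    typedef struct { int requester_pe; unsigned addr; } request_t;   /* hypothetical */

    typedef struct {
        unsigned tag;
        bool     valid;       /* cacheline contents present */
        bool     pending;     /* fetch sent to the remote PE, reply awaited */
        int      ref_count;   /* reference count field mentioned above */
    } rw_cache_block_t;

    /* Hypothetical helpers for replying, queueing, and network traffic. */
    void reply_from_cache(request_t req);
    void queue_on_block(rw_cache_block_t *blk, request_t req);
    void send_remote_fetch(unsigned tag);

    void rw_cache_read(rw_cache_block_t *blk, unsigned tag, request_t req)
    {
        if (blk->valid && blk->tag == tag) {
            reply_from_cache(req);          /* hit: satisfied locally */
        } else if (blk->pending && blk->tag == tag) {
            queue_on_block(blk, req);       /* line already in flight: wait, send nothing */
        } else {
            blk->tag = tag;                 /* miss: reserve the block for this cacheline */
            blk->valid = false;
            blk->pending = true;
            blk->ref_count = 0;
            send_remote_fetch(tag);         /* one network request per missing cacheline */
            queue_on_block(blk, req);
        }
    }
    /* When the cacheline arrives: blk->valid = true; blk->pending = false;
     * and all requests queued on the block are answered. */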
A consequence of reserving cache blocks (at the
time of miss) is that all cache blocks may be reserved,
awaiting response from the remote locations. In this
case, new requests arriving at the RW-Cache are queued
until a cache block becomes free. We refer to such a
case as an overflow or collision in the RW-Cache. All
writes in a read-write structure are performed only in
the structure memory. In order to maintain coherence
in caches for read-write structures, we follow a selec-
tive invalidation scheme similar to the one proposed
in [5]. A special invalidation instruction is executed
prior to accessing a data structure memory location if
there is a possibility that the corresponding cacheline
is modified after its last access. The invalidation mes-
sage is sent to the RW-Cache which invalidates the
cacheline if the cacheline is present in the local RW-
Cache. However, due to cache replacements the cacheline may not be present in the RW-Cache, in which case the invalidation signal is simply ignored.
3.2 Caching I-Structures
A request to a remote I-Structure location is first
searched set-associatively in the IS-Cache. If the
cacheline is found in the IS-Cache, then depending
on whether the location is full or empty, the request
is either satisfied or suspended. A suspended request
is queued in the IS-Cache. On a miss, a cache block
is reserved as is done in the RW-Cache. A pending
bit is associated with the cache block to avoid further
requests on the pending cache block being sent to the
network.
At the remote I-structure memory, an access to
a cacheline results in the following actions: The re-
quested cacheline is fetched and sent to the requesting
PE. For each empty location in the cacheline, an entry
indicating a pending read is queued. This is because
a remote PE has cached this cacheline, and it (the
remote PE) may try to access some of the empty lo-
cations from its IS-Cache. These requests may get
queued in the remote PE. Therefore, to release the
above requests, a wake-up message should be sent from
the I-Structure memory whenever any of these empty
locations becomes full. It may well be the case that
the remote PE might not have accessed this location,
and, further, might have even discarded the cached
block. Nonetheless the wake-up message should be
sent to the remote PE. If the remote PE has discarded this cached block, it will simply ignore this message. On the other hand, if the remote PE still retains the cached block in its IS-Cache, then the value passed along with the wake-up message is written in the appropriate cache location. Any pending requests in the remote PE then get reactivated.
A write to an I-structure location is sent to the I-
structure memory. An I-Structure write is never done
at the IS-Cache. This is to avoid inconsistent states
and multiple writes in the I-Structure memory. Since
an I-Structure location can be written at most once, it
becomes a read-only structure once it is written, and
hence does not cause any coherence problem.
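The full/empty behaviour of an IS-Cache read can be pictured with the following sketch; the slot layout and names are illustrative assumptions, and only the defer-until-wake-up behaviour follows the text.

    #include <stdbool.h>

    typedef enum { IS_EMPTY, IS_FULL } is_state_t;

    typedef struct {
        is_state_t state;
        long       value;
    } is_slot_t;

    /* Returns true (with the value) if the read can be satisfied now;
     * otherwise the request is deferred, queued in the IS-Cache, until a
     * wake-up message from the home I-Structure memory fills the location. */
    bool is_cache_read(is_slot_t *slot, long *value_out)
    {
        if (slot->state == IS_FULL) {
            *value_out = slot->value;
            return true;
        }
        return false;   /* caller queues the request on this slot */
    }

    /* A wake-up message carrying the newly written value fills the slot and
     * releases any deferred reads:
     *     slot->value = v; slot->state = IS_FULL;  then re-issue queued reads. */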
4 Performance Evaluation
In this section we evaluate the performance of our
architecture using discrete-event simulation.
4.1 Simulator Details
A function-level simulator for the architecture de-
scribed in Section 2 with the DS-Cache organization
was written in C to run on a Unix platform. Con-
stant processing time, in number of clock cycles, was
assigned to each functional unit.
The number of PEs, the number of execution pipes
per PE, the size of the DS-Cache, its organization,
and the size of cache-block can be varied in each sim-
ulation run. The number of resident activations and
the number of resources available in a PE were each
assumed to be 32 as our earlier work on the perfor-
mance of SMALL [8] has revealed no significant effect of increasing these numbers beyond 16. Each resi-
dent activation is allocated 32 general purpose regis-
ters in the execution pipe. The sizes of the instruction
cache and the high-speed buffer were assumed to be
8K words each. A frame memory of 16K words per
PE was used in the simulation. Lastly, each PE has a
data structure memory of 16K words. The data struc-
tures used in the program are uniformly distributed over the memory modules in all the PEs.
The performance of our architecture is evaluated
using four representative scientific application pro-
grams, namely SAXPBY (5120 elements), Matrix Multiplication (squaring a 24 x 24 matrix), the Livermore Loop 2 (with 4096 elements) and Image Averaging (on 64 x 64 pixels). The SAXPBY program is similar to the Linpack kernel SAXPY, except that SAXPBY computes A[i]*X + B[i]*Y. The Livermore loop is programmed using I-Structures. The image
averaging application is a low-pass filter which uses
a barrier synchronization. All application programs
were hand coded and run on the simulator. In hand-coding the applications, no special optimizations were performed. Also, no effort was made to map the given application onto the PEs. Further, all performance experiments for an application program are conducted with
the same problem size. The instruction mix in these
programs is shown in Table 1. The last column in the
table gives an estimate of the overhead introduced by
following the multithreaded approach. Synchroniza-
tion operations account for synchronization instruc-
tions. Integer operations include all address calcula-
tion instructions as well as instructions that perform
transfer operations between frame-memory and regis-
ters.
Benchmark          Int.    FP      Ld./Store  Ctrl.   Synch.  Over-
                   Ops.    Ops.    Ops.       Ops.    Ops.    head
SAXPBY             53.12   10.87   10.87      1.86    17.27   --
Matrix Mult.       55.05   14.26   14.56      4.18    6.85    5.10
Livermore Loop     45.46   8.33    12.50      5.75    9.76    18.20
Image Averaging    40.69   16.70   13.36      1.10    10.38   17.76

Table 1: Instruction Mix in Application Programs (%-age)
4.2 Performance Results
Throughput vs. Number of PEs
First we evaluate the throughput of our architec-
ture with respect to the various parameters. Through-
put is defined as the total number of instructions that
are completed by all the PEs in an execution cycle. In
Figure 2: Throughput vs. Number of PEs
this experiment, we used a 4-way set-associative DS-
Cache, with a cache-line size of 4 words. In a later
experiment we study the effect of the DS-Cache on
the throughput.
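Stated as a formula (this is our reading of the definition above, which the paper gives only in words):

\[ \textrm{Throughput} \;=\; \frac{1}{C}\sum_{p=1}^{P} I_p , \]

where P is the number of PEs, I_p is the number of instructions completed by PE p over the run, and C is the total number of execution cycles. The average instruction-level parallelism per PE reported later is then this throughput divided by P.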
Fig. 2 plots the overall throughput of our architecture against the number of PEs, where both axes are drawn in the logarithmic scale. We observe that the throughput of our architecture increases almost linearly with the number of PEs for all benchmark programs. When the number of execution pipes is increased from 1 to 2, a considerable improvement in throughput is observed for all benchmark programs. This is true irrespective of the number of PEs in the system. Increasing the number of execution pipes beyond 3 does not result in any improvement in the throughput. An interesting observation that can be made from the plots of Fig. 2 is that the throughput with 2 execution pipes per PE is almost equal to that of a configuration with twice the number of PEs and a single execution pipe.
Average Instruction-Level Parallelism
Benchmark Program     Number of PEs
                      1      2      4      8      16     32     64
SAXPBY                1.97   1.96   1.95   1.88   1.80   1.46   1.12
Matrix Mult.          1.66   1.68   1.68   1.67   1.66   1.37   1.03

Table 2: Average Instruction-Level Parallelism
The average number of instructions executed per
clock cycle per PE is the instruction-level parallelism
exploited by the architecture. Table 2 tabulates the ex-
ploited instruction-level parallelism when 4 execution
pipes per PE were used. As seen from this table, the
instruction-level parallelism never exceeds 2 which in-
dicates that 2 execution pipes are sufficient to fully utilize the synchronizing capabilities of a PE. For Liver-
more Loop and the Image Averaging programs, the ex-
ploited instruction-level parallelism is low. This is due
to high synchronization overheads involved in these
applications.
Effect of DS-Cache on Throughput and
Network Latency
As mentioned earlier, the introduction of the DS-
Cache reduces the network latency and thus improves
the performance of the system. In Fig. 3, we compare
an architecture without the DS-Cache (with 1, 2, or 3 execution pipes) with an architecture with 3 execution pipes and a 128-set, 4-way set-associative DS-Cache for
the SAXPBY program. Other benchmarks exhibit a
similar trend and hence are not shown here. It can
be noticed that the introduction of the DS-Cache im-
proves the throughput significantly, especially when
the number of PEs is large, i.e., greater than 8. This is because, when the number of PEs is small, the par-
allelism available per PE is large enough to tolerate
the remote memory latency. Hence, the presence of
the DS-Cache does not influence the performance in
a significant way. However, with a large number of PEs, the parallelism per PE decreases (since no scal-
ing of application programs is considered here), and
therefore the remote memory access latency becomes
crucial.
It may be observed from Fig. 3 that in the absence
of the DS-Cache, there is no gain in the throughput
by increasing the number of execution pipes. Another
important observation that can be made from Fig. 3 is
Figure 2: (contd.) Throughput vs. Number of PEs

Figure 3: Throughput with and without the DS-Cache

Figure 4: Effect of DS-Cache on Network Latency
that the throughput of the architecture with the DS-
Cache equals the throughput of a configuration with-
out the DS-Cache, but with twice the number of PEs.
Earlier we made a similar observation with respect to
the number of execution pipes. Thus one can infer
that it is the DS-Cache that allows the PEs to exploit
higher instruction-level parallelism.
The introduction of the DS-Cache influences the
performance of the architecture in two ways. First,
it reduces the access time for remote read requests
whenever the corresponding cacheline is present in the
DS-Cache. Secondly, in the event of a hit, the re-
mote requests are serviced by the cache, and hence
the requests do not enter the network. This in turn
reduces the network traffic and the network latency.
In Fig. 4, we plot the average network latency ob-
served in two application programs when the num-
ber of PEs is increased from 2 to 64. In this exper-
iment, we consider architecture configurations, with
and without DS-Cache, but with 3 execution pipes per
PE in both cases. The maximum average latency en-
countered without DS-Cache is more than 2500 time
units for the Image Averaging application and 1500
time units for SAXPBY. These large values indicate
the enormous contention and the queuing delay en-
countered in the network. Further, this shows that
the saturated network latency is two orders of magni-
tude higher than the unloaded network latency. When
the architecture supports DS-Cache, the network la-
tency increases to a maximum value of 600 time units
(roughly) which is only one order of magnitude higher
than the unloaded network latency. Thus supporting
a DS-Cache in the architecture reduces the network la-
tency by nearly a factor of 3 to 6. The higher network
latency values for the image averaging application are
due to the large number of synchronizing messages
used for the barrier synchronization.
The network latency for architecture configurations
with DS-Caches increases initially with the number of PEs and then decreases in Fig. 4. Though this may
seem counter-intuitive, it can be explained in the fol-
lowing manner. When the number of PEs is increased,
the size of the network and the number of network
switches increase, decreasing the extent of contention
in the network. Further as the size of the applica-
tion is not scaled, the available parallelism per PE decreases with an increase in the number of PEs. Thus
the ‘lack’ of work makes the PE wait for the response
to come from the network before the PE can pump additional messages into the network. This in turn
reduces the number of messages (remote memory re-
quests) sent by a PE to the network, which in turn fur-
ther reduces the contention and the network latency.
However, when the number of PEs is 8, each PE had
‘enough’ parallelism to tolerate even a very high net-
work latency. Thus the PES kept pumping more and
more messages into the network, keeping the network
always in saturation.
Effect of Cache Organizations
In this experiment we keep the number of PEs as 32
and the number of execution pipes as 4. We used two
different DS-Cache sizes, viz. 1K words and 2K words
(per PE), even though a much larger cache size can
be supported in practice. There are two reasons for
this. First, no improvement in the overall throughput
was observed for our benchmark programs when the
cache size is increased beyond 2K words. Secondly,
the benchmark programs that were considered have a
smaller problem size compared to real-world examples.
Thus, the smaller DS-Cache assumed in the simulation
provides some kind of a scale-down effect, matching
well with the smaller problem sizes considered in the
simulation experiments. For each of the cache sizes,
namely 1K and 2K words, we considered 2 different
cacheline sizes (4 and 8 words) and 4 different cache
organizations, namely direct-mapped, 2-way, 4-way, and 8-way set-associative organizations.
The throughput of the architecture is once again an
important performance metric to judge the suitabil-
ity of the cache organization. The average number
of times each cache block is accessed before another
cacheline overwrites it, is a measure of the utilization
of the cache block. This is referred to as the average
cache block reuse. The number of times a cache block
cannot be reserved in the DS-Cache due to the non-
availability of a cache block in the corresponding set²,
is referred to as the collisions in the DS-Cache. When-
ever a collision occurs, the request is queued until a
cache block becomes free. The queuing of requests in-
² Recall that this could happen if all the blocks in the appropriate set are waiting, with their pending flags set to 1, for their read requests to be satisfied by the remote data structure memory.
creases the response time for the read and hence will
affect the throughput of the architecture.
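In formula form (again, our reading of the definitions above):

\[ \textrm{average cache block reuse} \;=\; \frac{\textrm{number of DS-Cache accesses satisfied by cached blocks}}{\textrm{number of cachelines brought into the DS-Cache}} , \]

and a collision is counted each time a request finds every block of its target set reserved, i.e., with the pending bit set.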
Table 3 summarizes these performance parameters
for the different cache organizations. From the table,
the following observations can be made.
(1) The direct-mapped cache organization yields a
lower throughput compared to the set-associative or-
ganization. The lower throughput in direct-mapped
caches is an expected result, as direct-mapping pro-
vides less flexibility in mapping a cacheline to a cache
block.
(2) The value for the collision parameter decreases
drastically as the associativity is increased. Also, the
cache block reuse increases to a small extent with the
associativity.
(3) Increasing the cacheline size increases the cache
block reuse factor by a small amount. A small increase
in the throughput of the architecture is also observed
with the increase in the cacheline size.
(4) The throughput of the architecture is nearly the
same for all cache organizations with an associativity
greater than or equal to 2.
5 A Study on Feasibility
In this section we provide approximate estimates for transistor counts of various functional blocks in
our architecture. These estimates are based on com-
parisons with the blocks of similar capability present
in currently available commercial and experimental
microprocessors, namely, Message-Driven Processor
(MDP) [7], SuperSPARC [1] and RS-6000 [14]. MDP
uses a standard cell approach, and hence the corre-
sponding estimates are conservative. We plan to con-
duct VHDL simulations in the future to provide a detailed understanding of the feasibility aspects.
The processor is divided into two parts, namely, on-chip storage on the one hand, and the datapath, control logic, and interface logic on the other. On-chip storage on SuperSPARC and
MDP uses nearly 70-80% of the total transistors, but covers only 30% of the chip area due to its regular struc-
ture. On-chip storage contains an 8K word Instruction
Cache and an 8K word high-speed buffer. Based on
the estimates for a 4K word SRAM in MDP, the 8K word Instruction Cache requires 1.6M transistors. The
high-speed buffer of 8K words is divided into 32 pages
and is fully associative at page level only. We assume
5 ports to the high-speed buffer, one for each (up to 4) ex-
ecution pipe and one for the buffer-loader. A 5-ported
64K Byte on-chip memory uses 4.5M transistors in
RS-6000, so the high-speed buffer will require nearly
3M transistors.
Table 3: Effect of Cache Organization on Performance
The register file contains up to 1024 registers and 8 ports³. Such register files are common in super-
scalar/VLIW processors. The instruction scheduler
contains 32 resources with a total of 128 registers.
A resource uses the space for 4 registers. The ready
thread queue has a length of 10 threads, and uses 3
registers for each thread. Thus thread management
functions use 158 registers with one read and one write
port.
Datapath, control logic and interface logic use the
remaining 20-30% transistors, but consume 70% of
the chip area. Our architecture uses generic execu-
tion pipes, similar to that of an MDP which requires
39K transistors. Buffer and instruction cache loaders
use a counter for loading an activation or a code-block,
once the base address is specified. Otherwise, address
logic is similar to the address arithmetic unit of MDP,
which consumes 75K transistors. Remaining control
and interface logic (for internal and external memory
and chip I/O) uses 56K transistors in MDP.
Thus, the multithreaded processor proposed in this
paper requires approximately 5M transistors (a more
extensive execution pipeline may increase this num-
ber to 6M transistors), a reasonable size for the cur-
rent technology. Now let us consider other functional
units, namely the filter unit and the thread synchro-
nization unit. The filter unit maintains two tables,
one for 32 resident activations (with base addresses of
the frames), and the other for 32 active threads (instruction pointers). The thread synchronization unit maintains a queue of up to 32 threads (3 addresses each) which
do not have resident activations. Logic in these units
is fairly simple and can be implemented as finite state
machines, using conventional approaches like PLAs or
FPGAs.

³ A study in [14] shows that on an average one read and one read/write port is sufficient for one instruction execution.
6 Related Work
Several multithreaded architectures have been pro-
posed in the literature (refer, for example, to [2, 3, 6, 12, 13, 15] and to [11] for a survey). Like the Threaded Abstract Machine (TAM) [6] and *T [15], our ar-
chitecture realizes three levels of program hierarchy
based on synchronization and scheduling. TAM uses
a compiler-controlled approach to achieve fine-grain
parallelism and synchronization. In contrast, we ad-
vocate the use of suitable compilation techniques and
necessary hardware support. Further, our architecture
supports multiple resident activations, which help to
mask the cost of context switching while off-loading a
resident frame.
The processor coupling proposal [12] and the ar-
chitecture proposed by Hirata et al. [9] use dynamic packing of instructions from different threads to exploit instruction-level parallelism. Our approach is similar, ex-
cept that each instruction in a thread in our archi-
tecture contains a single operation. While the use of
multi-operation instructions in a thread in processor
coupling [12] improves the throughput, it makes the
runtime scheduler more complex. Our architecture
also supports a two-level cache structure to achieve
high throughput.
Our work differs from Tera [3] and its predecessors in the following ways: (i) Tera uses long-word instructions, with three operations per instruction, and (ii) no dynamic packing of instructions is performed in Tera. In contrast, the processor in *T [15]
is a superscalar. Our results indicate that the syn-
chronizing capability of a PE can be fully utilized by
using more than one execution pipe.
7 Conclusions
In this paper we have described the design of a scal-
able multithreaded architecture. The salient features
of the architecture are (i) its ability to exploit both
coarse-grain parallelism and fine-grain instruction-
level parallelism, (ii) a distributed DS-Cache which
significantly reduces the network latency and makes
the system scalable, (iii) a high-speed buffer organi-
zation which completely avoids load stalls on access
to local variables of an activation, (iv) a layered ap-
proach to synchronization and scheduling which helps
to achieve very high processor throughput and utilization. The performance of the architecture is evaluated using simulation. Initial simulation results are promising and indicate that:
(1) With a small number (2 or 3) of execution
pipes, the architecture can effectively exploit all the
instruction-level parallelism available at a PE. The
throughput of our architecture with two execution
pipes is almost equal to that of a configuration with
twice the number of PEs and a single execution pipe.
(2) The use of a set-associative cache leads to a near-
linear performance improvement with respect to the
number of PEs. The presence of the DS-Cache reduces the network latency experienced by remote read
requests by a factor of 3 to 6.
(3) The DS-Cache reduces network traffic which is es-
sential in realizing the performance improvements due
to multiple execution pipes.
Acknowledgements
The authors are grateful to the reviewers whose
comments have improved the presentation of this pa-
per. This work was supported by MICRONET - Net-
work Centres of Excellence in Canada.
References
[1] F. Abu-Nofal et al. A three million transistor microprocessor. In Digest of Technical Papers, 1992 IEEE International Solid State Circuits Conference, pages 108-109, Feb. 1992.
[2] A. Agarwal, B-H. Lim, D. Kranz, and J. Kubiatowicz.
APRIL: A processor architecture for multiprocessing.
In Proc. of the 17th Ann. Intl. Symp. on Computer
Architecture, pages 104-114. Seattle, Wash., June
1990.
[3] R. Alverson, D. Callahan, D. Cummings, B. Koblenz,
A. Porterfield, and B. Smith. The Tera computer
system. In Conf. Proc., 1990 Intl. Conf. on Super-
computing, pages 1-6, Amsterdam, The Netherlands,
June 1990.
[4] Arvind, R. S. Nikhil, and K. Pingali. I-structures:
Data structures for parallel computing. ACM Trans.
on Programming Languages and Systems, 11(4):598-
632, Oct. 1989.
[5] H. Cheong and A.V. Veidenbaum. Compiler-directed
cache management in multiprocessors. IEEE Com-
puter, pages 39-47, June 1990.
[6] D.E. Culler et al. Fine-grain parallelism with minimal
hardware support: A compiler-controlled threaded
abstract machine. In Proc. of the 4th Intl. Conf.
on Architectural Support for Programming Languages
and Operating Systems, pages 164-175, Santa Clara,
Calif., April 1991.
[7] W.J. Dally et al. A message-driven processor: A
multicomputer processing node with efficient mech-
anisms. IEEE Micro, pages 24-38, April 1992.
[8] R. Govindarajan and S.S. Nemawarkar. SMALL: A
scalable multithreaded architecture to exploit large
locality. In Proc. of the 4th IEEE Symp. on Parallel
and Distributed Processing, pages 32-39, Dec. 1992.
[9] H. Hirata et al. An elementary processor architecture
with simultaneous instruction issuing from multiple
threads. In Proc. of the 19th Intl. Symp. on Computer
Architecture, pages 136-145. Gold Coast, Australia,
May 1992.
[10] R. A. Iannucci. Toward a dataflow/von Neumann
hybrid architecture. In Proc. of the 15th Ann. Intl.
Symp. on Computer Architecture, pages 131-140,
Honolulu, Hawaii, June 1988.
[11] R. A. Iannucci, G. R. Gao, R. H. Halstead, Jr., and
B. Smith. Multithreaded Computer Architecture: A
Summary of the State of the Art. Kluwer, Norwell,
Mass., 1994.
[12] S.W. Keckler and W.J. Dally. Processor coupling: Integrating compile time and runtime scheduling for parallelism. In Proc. of the 19th Intl. Symp. on Computer Architecture, pages 202-213, Gold Coast, Australia, May 1992.
[13] Y. Kodama, S. Sakai, and Y. Yamaguchi. A proto-
type of a highly parallel dataflow machine EM-4 and
its preliminary evaluation. In Proc. of InfoJapan 90,
pages 291-298, Oct. 1990.
[14] M. Misra. IBM RISC System/6000 Technology, First edition. IBM, Austin, Tx., 1990.
[15] R. S. Nikhil, G. M. Papadopoulos, and Arvind. *T:
A multithreaded massively parallel architecture. In
Proc. of the 19th Ann. Intl. Symp. on Computer
Architecture, pages 156-167, Gold Coast, Australia,
May 1992.
[16] G. M. Papadopoulos and D. E. Culler. Monsoon: an
explicit token-store architecture. In Proc. of the 17th
Ann. Intl. Symp. on Computer Architecture, pages
82-91. Seattle, Wash., June 1990.

More Related Content

What's hot

Analysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific ApplicationsAnalysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific ApplicationsJames McGalliard
 
Memory consistency models
Memory consistency modelsMemory consistency models
Memory consistency models
palani kumar
 
Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...
eSAT Publishing House
 
Compositional Analysis for the Multi-Resource Server
Compositional Analysis for the Multi-Resource ServerCompositional Analysis for the Multi-Resource Server
Compositional Analysis for the Multi-Resource Server
Ericsson
 
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORSSTUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
ijdpsjournal
 
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERS
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERSVTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERS
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERS
vtunotesbysree
 
Lecture 7 cuda execution model
Lecture 7   cuda execution modelLecture 7   cuda execution model
Lecture 7 cuda execution model
Vajira Thambawita
 
Parallel Processing Concepts
Parallel Processing Concepts Parallel Processing Concepts
Parallel Processing Concepts
Dr Shashikant Athawale
 
A Kernel-Level Traffic Probe to Capture and Analyze Data Flows with Priorities
A Kernel-Level Traffic Probe to Capture and Analyze Data Flows with PrioritiesA Kernel-Level Traffic Probe to Capture and Analyze Data Flows with Priorities
A Kernel-Level Traffic Probe to Capture and Analyze Data Flows with Priorities
idescitation
 
Crayl
CraylCrayl
Cray xt3
Cray xt3Cray xt3
Cray xt3
Léia de Sousa
 
Survey_Report_Deep Learning Algorithm
Survey_Report_Deep Learning AlgorithmSurvey_Report_Deep Learning Algorithm
Survey_Report_Deep Learning AlgorithmSahil Kaw
 
Cache memory
Cache memoryCache memory
Cache memory
Eklavya Gupta
 
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor ...
Multithreading: Exploiting Thread-Level  Parallelism to Improve Uniprocessor ...Multithreading: Exploiting Thread-Level  Parallelism to Improve Uniprocessor ...
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor ...
Ahmed kasim
 
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0Sahil Kaw
 
Lecture 3 parallel programming platforms
Lecture 3   parallel programming platformsLecture 3   parallel programming platforms
Lecture 3 parallel programming platforms
Vajira Thambawita
 
Latency aware write buffer resource
Latency aware write buffer resourceLatency aware write buffer resource
Latency aware write buffer resource
ijdpsjournal
 

What's hot (20)

Analysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific ApplicationsAnalysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific Applications
 
Memory consistency models
Memory consistency modelsMemory consistency models
Memory consistency models
 
Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...
 
Compositional Analysis for the Multi-Resource Server
Compositional Analysis for the Multi-Resource ServerCompositional Analysis for the Multi-Resource Server
Compositional Analysis for the Multi-Resource Server
 
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORSSTUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
 
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERS
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERSVTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERS
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERS
 
Lecture 7 cuda execution model
Lecture 7   cuda execution modelLecture 7   cuda execution model
Lecture 7 cuda execution model
 
Parallel Processing Concepts
Parallel Processing Concepts Parallel Processing Concepts
Parallel Processing Concepts
 
A Kernel-Level Traffic Probe to Capture and Analyze Data Flows with Priorities
A Kernel-Level Traffic Probe to Capture and Analyze Data Flows with PrioritiesA Kernel-Level Traffic Probe to Capture and Analyze Data Flows with Priorities
A Kernel-Level Traffic Probe to Capture and Analyze Data Flows with Priorities
 
Crayl
CraylCrayl
Crayl
 
Cray xt3
Cray xt3Cray xt3
Cray xt3
 
Survey_Report_Deep Learning Algorithm
Survey_Report_Deep Learning AlgorithmSurvey_Report_Deep Learning Algorithm
Survey_Report_Deep Learning Algorithm
 
Cache memory
Cache memoryCache memory
Cache memory
 
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor ...
Multithreading: Exploiting Thread-Level  Parallelism to Improve Uniprocessor ...Multithreading: Exploiting Thread-Level  Parallelism to Improve Uniprocessor ...
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor ...
 
SoC-2012-pres-2
SoC-2012-pres-2SoC-2012-pres-2
SoC-2012-pres-2
 
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
 
Lecture 3 parallel programming platforms
Lecture 3   parallel programming platformsLecture 3   parallel programming platforms
Lecture 3 parallel programming platforms
 
S peculative multi
S peculative multiS peculative multi
S peculative multi
 
Latency aware write buffer resource
Latency aware write buffer resourceLatency aware write buffer resource
Latency aware write buffer resource
 
Chap2 slides
Chap2 slidesChap2 slides
Chap2 slides
 

Viewers also liked

Como funciona un virus informático trabajo MERLIS SALINAS LEAL
Como funciona un virus informático trabajo MERLIS SALINAS LEALComo funciona un virus informático trabajo MERLIS SALINAS LEAL
Como funciona un virus informático trabajo MERLIS SALINAS LEAL
MYGUEDVY
 
Cómo Pasar Datos desde el iPhone a Samsung
Cómo Pasar Datos desde el iPhone a SamsungCómo Pasar Datos desde el iPhone a Samsung
Cómo Pasar Datos desde el iPhone a Samsung
Jihosoft
 
Tecnologias aplicadas na educação
Tecnologias aplicadas na educaçãoTecnologias aplicadas na educação
Tecnologias aplicadas na educação
Jonalto Guirra
 
Funcionamiento de los virus informáticos ingrid palacio
Funcionamiento de los virus informáticos ingrid palacioFuncionamiento de los virus informáticos ingrid palacio
Funcionamiento de los virus informáticos ingrid palacio
ingrid margarita palacio bolaño
 
Presentación unidad ii
Presentación unidad iiPresentación unidad ii
Presentación unidad ii
Gloria Viridiana Valencia Cid
 
CS1 Brochure
CS1 BrochureCS1 Brochure
CS1 BrochureTim Arnst
 
Cómo funcionan los virus informáticos
Cómo funcionan los virus informáticosCómo funcionan los virus informáticos
Cómo funcionan los virus informáticos
yulisa del carmen carrasquilla mijares
 
Cómo Pasar Contactos de iPhone a Android
Cómo Pasar Contactos de iPhone a AndroidCómo Pasar Contactos de iPhone a Android
Cómo Pasar Contactos de iPhone a Android
Jihosoft
 
como funcionan los virus informaticos
como funcionan los virus informaticos como funcionan los virus informaticos
como funcionan los virus informaticos
Karoll Perez Hernandez
 
Como funcionan los virus informaticos
Como funcionan los virus informaticosComo funcionan los virus informaticos
Como funcionan los virus informaticos
Jeisson David Santoya Mendoza
 

Viewers also liked (12)

Como funciona un virus informático trabajo MERLIS SALINAS LEAL
Como funciona un virus informático trabajo MERLIS SALINAS LEALComo funciona un virus informático trabajo MERLIS SALINAS LEAL
Como funciona un virus informático trabajo MERLIS SALINAS LEAL
 
Cómo Pasar Datos desde el iPhone a Samsung
Cómo Pasar Datos desde el iPhone a SamsungCómo Pasar Datos desde el iPhone a Samsung
Cómo Pasar Datos desde el iPhone a Samsung
 
Tecnologias aplicadas na educação
Tecnologias aplicadas na educaçãoTecnologias aplicadas na educação
Tecnologias aplicadas na educação
 
Funcionamiento de los virus informáticos ingrid palacio
Funcionamiento de los virus informáticos ingrid palacioFuncionamiento de los virus informáticos ingrid palacio
Funcionamiento de los virus informáticos ingrid palacio
 
Presentación unidad ii
Presentación unidad iiPresentación unidad ii
Presentación unidad ii
 
CS1 Brochure
CS1 BrochureCS1 Brochure
CS1 Brochure
 
Cómo funcionan los virus informáticos
Cómo funcionan los virus informáticosCómo funcionan los virus informáticos
Cómo funcionan los virus informáticos
 
Cómo Pasar Contactos de iPhone a Android
Cómo Pasar Contactos de iPhone a AndroidCómo Pasar Contactos de iPhone a Android
Cómo Pasar Contactos de iPhone a Android
 
como funcionan los virus informaticos
como funcionan los virus informaticos como funcionan los virus informaticos
como funcionan los virus informaticos
 
shashank_mascots1996_00501002
shashank_mascots1996_00501002shashank_mascots1996_00501002
shashank_mascots1996_00501002
 
Como funcionan los virus informaticos
Como funcionan los virus informaticosComo funcionan los virus informaticos
Como funcionan los virus informaticos
 
MS project poster
MS project posterMS project poster
MS project poster
 

Similar to shashank_hpca1995_00386533

DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...
DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...
DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...
Ilango Jeyasubramanian
 
Co question bank LAKSHMAIAH
Co question bank LAKSHMAIAH Co question bank LAKSHMAIAH
Co question bank LAKSHMAIAH
veena babu
 
Cloud Module 3 .pptx
Cloud Module 3 .pptxCloud Module 3 .pptx
Cloud Module 3 .pptx
ssuser41d319
 
CS8603_Notes_003-1_edubuzz360.pdf
CS8603_Notes_003-1_edubuzz360.pdfCS8603_Notes_003-1_edubuzz360.pdf
CS8603_Notes_003-1_edubuzz360.pdf
KishaKiddo
 
Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...
Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...
Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...
CSCJournals
 
CS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docx
CS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docxCS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docx
CS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docx
faithxdunce63732
 
Performance evaluation of ecc in single and multi( eliptic curve)
Performance evaluation of ecc in single and multi( eliptic curve)Performance evaluation of ecc in single and multi( eliptic curve)
Performance evaluation of ecc in single and multi( eliptic curve)
Danilo Calle
 
ICT III - MPMC - Answer Key.pdf
ICT III - MPMC - Answer Key.pdfICT III - MPMC - Answer Key.pdf
ICT III - MPMC - Answer Key.pdf
GowriShankar881783
 
Computer architecture
Computer architectureComputer architecture
Computer architecture
PrabhanshuKatiyar1
 
Oversimplified CA
Oversimplified CAOversimplified CA
Oversimplified CA
PrabhanshuKatiyar1
 
Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...
Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...
Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...
Nikhil Jain
 
Fine-grain instruction-level parallelism is exploited by a runtime grouping of instructions from multiple threads and executing them concurrently on the multiple execution pipes available in each PE. Further, in the proposed architecture, we introduce a distributed Data Structure Cache (DS-Cache) organization for shared data structures. Our architecture supports caching of two types of data structures, namely the I-Structures [4], where each location is written at most once, and normal data structures, where an individual location can be written many times. Coherence in the caches is maintained by a special scheme for I-Structures, and by software cache coherence mechanisms for normal data structures.
The performance of the multithreaded architecture is evaluated using discrete-event simulation. Our simulation results indicate:
(1) With a small number (2 or 3) of execution pipes, the architecture can effectively exploit all the parallelism available at a PE.
(2) The throughput with two execution pipes per PE is almost equal to that of a configuration with twice the number of PEs and a single execution pipe.
(3) The introduction of a set-associative cache leads to a near-linear performance improvement with respect to the number of PEs. Further, our simulation results show that the presence of the DS-Cache is essential to realize the performance gains due to multiple execution pipes.
(4) Lastly, the presence of the DS-Cache reduces the network latency experienced by remote read requests by a factor of 3 to 6.

The details of the architecture are described in Section 2. The cache organization and the protocol for caching I-Structures are discussed in the subsequent section. Section 4 deals with the performance evaluation of our architecture using various benchmark programs. In Section 5 we investigate the feasibility of the architecture by comparing the functionality of various modules with standard modules present in commercially available processors. We compare our work with other multithreaded architectures in Section 6. Concluding remarks are presented in Section 7.

2 The Architecture

The execution model of our architecture, which is similar to that proposed in the Threaded Abstract Machine [6], is discussed in [8]. Readers are referred to [8] for details.

The architecture consists of a number of Processing Elements (PEs) connected by an interconnection network. For the purposes of this paper we consider the binary n-cube network as the topology of the interconnection network. In this section, we describe the organization of a PE (refer to Fig. 1) and all functional units except the DS-Cache. The processor part of the architecture is enclosed in a dotted box in the figure.

Figure 1: Organization of a Processing Element

2.1 Frame Memory Unit

The activation frames of a program are stored in the frame memory unit. The frame memory unit also contains a simple memory manager which is responsible for managing the frame memory. On a function invocation, the frame memory manager allocates a frame of appropriate size in the frame memory. Input arguments to an activation also reach the frame memory unit in the form of tokens, and the values are written into the respective frame memory locations by the memory manager. A signal indicating the arrival of an operand is sent to the thread synchronization unit, where the synchronization of input arguments to a thread is performed. The frame memory unit also receives a signal from the execution pipe when an activation terminates; in that case, the frame memory manager deallocates the corresponding frame.

2.2 Thread Synchronization Unit

The thread synchronization unit performs the synchronization of inputs to a thread. It is similar to the explicit token store matching unit of the Monsoon architecture [16]. A thread becomes enabled when it has received all its input arguments. The activation containing this thread becomes enabled, if the activation is not already enabled. An enabled thread is sent to the filter unit.
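To make the frame memory unit's role concrete, the following C sketch shows one plausible way its memory manager could handle an incoming argument token: write the operand into the activation frame and signal the thread synchronization unit. This is only an illustration of the behaviour described above; all names (frame_t, token_t, signal_sync_unit, frame_memory_on_token) and field layouts are our own assumptions, not part of the paper.

```c
#include <stdint.h>
#include <stdio.h>

#define FRAME_MEM_WORDS (16 * 1024)   /* 16K-word frame memory per PE (Section 4.1) */

/* Hypothetical descriptors; field names are illustrative only. */
typedef struct { uint32_t base, size; int in_use; } frame_t;
typedef struct { uint32_t activation, slot, thread; int64_t value; } token_t;

static int64_t frame_memory[FRAME_MEM_WORDS];

/* Stub: in the real PE this signal goes to the thread synchronization unit. */
static void signal_sync_unit(uint32_t activation, uint32_t thread)
{
    printf("operand arrived: activation %u, thread %u\n", activation, thread);
}

/* An argument token arrives: write the value into the activation frame and
   notify the synchronization unit that one more input is present. */
static void frame_memory_on_token(const frame_t *frames, const token_t *t)
{
    const frame_t *f = &frames[t->activation];
    frame_memory[f->base + t->slot] = t->value;
    signal_sync_unit(t->activation, t->thread);
}
```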
2.3 Filter Unit

If the incoming thread belongs to an activation whose activation frame is already resident in the high-speed buffer (refer to Section 2.4), the filter unit forwards the thread to the scoreboard unit. If the thread belongs to an activation which is not resident, the filter unit checks whether the activation can be loaded into the high-speed buffer. If so, it instructs the frame memory unit to load the activation frame into the high-speed buffer. When the activation has been successfully loaded, it is said to be resident (in the high-speed buffer), and the incoming thread is sent to the scoreboard unit. On the other hand, if the activation corresponding to the incoming thread cannot be loaded, the thread is queued inside the filter unit until an activation is freed. The filter unit then proceeds to service other threads in its input queue.
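A minimal sketch of the filter unit's decision policy follows. The helper functions (is_resident, buffer_has_free_slot, request_frame_load, and the pending queue) are hypothetical; the real unit is a hardware state machine, and this only mirrors the policy described above.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t activation, thread; } enabled_thread_t;

extern bool is_resident(uint32_t activation);          /* frame in high-speed buffer? */
extern bool buffer_has_free_slot(void);                /* room for one more frame?    */
extern void request_frame_load(uint32_t activation);   /* ask frame memory to load it */
extern void send_to_scoreboard(enabled_thread_t t);
extern void enqueue_pending(enabled_thread_t t);       /* wait until a frame is freed */

void filter_unit_accept(enabled_thread_t t)
{
    if (is_resident(t.activation)) {
        send_to_scoreboard(t);               /* frame already in the buffer          */
    } else if (buffer_has_free_slot()) {
        request_frame_load(t.activation);    /* make the activation resident         */
        send_to_scoreboard(t);               /* forwarded once the load completes    */
    } else {
        enqueue_pending(t);                  /* retried when an activation is freed  */
    }
}
```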
2.4 High-Speed Buffer

The high-speed buffer in our architecture is a multiported, cache-like memory with a single-cycle access time. On a request from the filter unit, the frame memory loads a frame into the high-speed buffer with the help of the buffer loader. When there are no enabled threads in a resident activation, the high-speed buffer receives a signal from the scoreboard unit either to flush or to off-load a frame from the high-speed buffer. If the request is to off-load the frame, its contents are stored back in the frame memory. A flush request indicates that the corresponding activation has terminated and therefore there is no need to save the frame.

2.5 Scoreboard Unit

The scoreboard unit performs bookkeeping operations, such as tracking how many threads in an activation are currently enabled. When the number of enabled threads in an activation becomes zero, it instructs the high-speed buffer to either flush or off-load the corresponding activation frame. Threads arriving at the scoreboard unit are sent to the Ready Thread Queue after the bookkeeping operations.

2.6 Ready Thread Queue

The Ready Thread Queue maintains a pool of threads which await certain resources in the instruction scheduler. A resource corresponds to a set of registers, namely a program counter, a base-address register for the activation frame, an intermediate instruction register, an instruction register, and a stall register. The role of these registers is explained in the following subsection on the instruction scheduler. When a resource becomes available, the instruction scheduler takes a thread from the pool and allocates the resource to it. A thread that has acquired a resource is said to be active.

2.7 Instruction Scheduler

The instruction scheduler consists of two units, the fetch unit and the schedule unit. The instruction fetch unit receives a thread from the ready thread queue and loads the program counter allocated to it. The instruction pointed to by the program counter is fetched from the instruction cache. The instruction is stored in the intermediate instruction register and the associated stall count is loaded into the stall register. (At compile time, a stall count is associated with each instruction; it indicates how many stall cycles are needed, to take care of data dependences in pipelined execution, before the next instruction in this thread can be initiated.) The set of intermediate instruction registers together forms the instruction window for the multiple execution pipes.

The schedule unit, at each execution cycle, checks the intermediate instruction registers. It selects up to n instructions for execution, where n is the number of execution pipes. The selected instructions are moved to the instruction registers and are initiated in the execution pipes in the following cycle. The associated stall registers are decremented every cycle until they become zero. The corresponding program counters are then incremented and the next instructions from those threads are fetched.
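The per-cycle selection made by the schedule unit can be sketched as follows, under the assumption that each of the 32 resources holds one intermediate instruction register and one stall register. The structure, the helper functions and the choice of two pipes are ours; the paper specifies only the policy, not an implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_RESOURCES 32
#define NUM_PIPES      2      /* 2-3 pipes suffice per the simulation results */

typedef struct {
    bool     valid;           /* an active thread owns this resource            */
    uint32_t pc;              /* program counter of the thread                  */
    uint32_t intermediate_ir; /* instruction waiting in the instruction window  */
    uint32_t stall;           /* remaining stall cycles before it may issue     */
} resource_t;

extern void     issue_to_pipe(int pipe, uint32_t instr);  /* start execution    */
extern uint32_t icache_fetch(uint32_t pc);                /* next instruction   */

void schedule_one_cycle(resource_t r[NUM_RESOURCES])
{
    int issued = 0;
    for (int i = 0; i < NUM_RESOURCES && issued < NUM_PIPES; i++) {
        if (r[i].valid && r[i].stall == 0) {
            issue_to_pipe(issued++, r[i].intermediate_ir);
            r[i].pc++;                                    /* advance the thread  */
            r[i].intermediate_ir = icache_fetch(r[i].pc); /* refill the window   */
            /* the new instruction's stall count would be loaded here as well   */
        }
    }
    for (int i = 0; i < NUM_RESOURCES; i++)               /* count down stalls   */
        if (r[i].valid && r[i].stall > 0)
            r[i].stall--;
}
```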
2.8 Instruction Cache

The instruction cache unit is similar to the instruction cache in conventional machines. Only the code-blocks corresponding to resident activations are present in the instruction cache.

2.9 Execution Pipe

Each PE consists of a number of instruction execution pipes. The execution pipes used in our architecture are generic in nature, each capable of performing any operation, and they are assumed to be fully pipelined. The execution pipes are load/store in nature: every operation other than a memory load or store uses register operands. Since an activation frame is pre-loaded into the high-speed buffer before instructions of that activation are scheduled for execution, it is guaranteed that a load operation (on a frame location) does not cause any load stalls. The execution pipes share a register file. The register file is logically divided into a number of register banks, one corresponding to each resident activation. In the execution model, the logical register name specified in an instruction is used as an offset within the register bank.

The instruction set of our architecture includes special instructions to invoke a new activation, to signal the termination of a thread or of an activation, and to communicate results to an external activation. An execution pipe sends a message to the router unit whenever it has to (i) access the data structure memory (either local or remote), (ii) communicate arguments to an activation, or (iii) invoke an activation.
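The register file organization of Section 2.9 amounts to a simple base-plus-offset mapping. The sketch below assumes 32 resident activations with 32 registers each, matching the simulation parameters of Section 4.1; the function name and layout are illustrative only.

```c
#include <stdint.h>

#define NUM_RESIDENT   32
#define REGS_PER_BANK  32

static int64_t register_file[NUM_RESIDENT * REGS_PER_BANK];

/* activation_id selects the bank; logical_reg is the register name encoded in
 * the instruction, used as an offset within that bank. */
static inline int64_t *reg(uint32_t activation_id, uint32_t logical_reg)
{
    return &register_file[activation_id * REGS_PER_BANK + logical_reg];
}
```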
2.10 Router Unit

The router unit receives messages from the execution pipes, from the data structure memory, or from the network switches. The messages are appropriately routed to the frame memory, the data structure memory, the DS-Cache, or the network switch.

2.11 Network Switch

The PEs of our architecture are connected by a binary n-cube interconnection network, which provides high network bandwidth. In this paper, we focus on the use of a fixed routing scheme along the shortest path to the destination.

3 Cache Organization

In our architecture the data structure memory is shared and is distributed across the PEs. The data structure elements are uniformly distributed across the PEs. A set of remote memory locations, called a cacheline, can be cached in the Data Structure Cache (DS-Cache) of the PE. The DS-Cache can have either a direct-mapped or a k-way set-associative organization; for expository purposes, in this section we consider the k-way set-associative organization. The DS-Cache consists of two parts, one for I-Structures, whose elements are written only once, and the other for read-write data structures; the respective caches are referred to as the IS-Cache and the RW-Cache. A remote memory access request arriving at the router unit is sent to the DS-Cache. If the corresponding cacheline is present in the local cache, the location is accessed from the cache and the contents are sent to the router unit. Otherwise a request to fetch the cacheline is sent to the router, which in turn forwards the request to the appropriate PE through the network switch. In the following two subsections we discuss the issues in caching shared data structures and the policies followed by the DS-Cache.
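One possible decomposition of a shared data-structure address, consistent with the uniform distribution of elements over the PEs and with the cacheline and set parameters used later in the experiments, is sketched below. The interleaving scheme and field widths are our assumptions; the paper does not state how addresses are decoded.

```c
#include <stdint.h>

#define NUM_PES        32
#define LINE_WORDS      4          /* 4-word cachelines used in Section 4.2 */
#define NUM_SETS      128          /* 128-set, 4-way DS-Cache configuration */

typedef struct {
    uint32_t home_pe;    /* PE whose data structure memory holds the word */
    uint32_t line;       /* cacheline number within that PE's memory      */
    uint32_t set;        /* DS-Cache set the line maps to                 */
    uint32_t offset;     /* word offset within the cacheline              */
} ds_addr_t;

static ds_addr_t decode_ds_address(uint32_t word_addr)
{
    ds_addr_t a;
    a.home_pe = word_addr % NUM_PES;         /* uniform distribution over PEs */
    uint32_t local = word_addr / NUM_PES;    /* word index on the home PE     */
    a.line   = local / LINE_WORDS;
    a.offset = local % LINE_WORDS;
    a.set    = a.line % NUM_SETS;
    return a;
}
```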
3.1 Caching Read-Write Data Structures

Whenever an access to a remote memory location arrives at the DS-Cache, the cache memory is searched for the cacheline. If a hit occurs, the access request is satisfied by the RW-Cache. On a miss, a free cache block is reserved for this cacheline, and the tag and reference count fields of the cache block are set appropriately. A pending bit associated with the cache block is set to 1; it indicates that a request for the cacheline has been sent on the network and the response is awaited. When the response is received, the pending bit is reset. The reserved cacheline, together with the pending bit, prevents multiple requests for the same cacheline from being sent to the network: such requests are queued inside the RW-Cache until the pending bit is reset. When the requested cacheline arrives, the pending bit is reset and the waiting requests are serviced by sending the appropriate data values.

A consequence of reserving cache blocks at the time of a miss is that all cache blocks in a set may be reserved, awaiting responses from remote locations. In this case, new requests arriving at the RW-Cache are queued until a cache block becomes free. We refer to such a case as an overflow or collision in the RW-Cache.

All writes to a read-write structure are performed only in the structure memory. In order to maintain coherence in the caches for read-write structures, we follow a selective invalidation scheme similar to the one proposed in [5]. A special invalidation instruction is executed prior to accessing a data structure memory location if there is a possibility that the corresponding cacheline was modified after its last access. The invalidation message is sent to the RW-Cache, which invalidates the cacheline if it is present in the local RW-Cache. However, due to cache replacements the cacheline may not be present in the RW-Cache, in which case the invalidation signal is simply ignored.
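The RW-Cache read path described above can be summarized in the following sketch, assuming a cache block record with a tag, a pending bit, a reference count and a per-block queue of waiting requests. Helper names (send_line_request, reply_to_requester, enqueue) and the exact bookkeeping are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS        4
#define LINE_WORDS  4

typedef struct request { uint32_t addr, requester; struct request *next; } request_t;

typedef struct {
    bool       valid, pending;      /* pending: fetch issued, reply awaited   */
    uint32_t   tag, refcount;
    int64_t    data[LINE_WORDS];
    request_t *waiters;             /* requests queued until the line arrives */
} rw_block_t;

extern void send_line_request(uint32_t line);                 /* to the home PE */
extern void reply_to_requester(uint32_t requester, int64_t v);
extern void enqueue(request_t **q, request_t *r);

void rw_cache_read(rw_block_t set[WAYS], uint32_t tag, request_t *req)
{
    for (int w = 0; w < WAYS; w++) {
        rw_block_t *b = &set[w];
        if (b->valid && b->tag == tag) {
            if (b->pending) { enqueue(&b->waiters, req); return; } /* fetch underway */
            b->refcount++;
            reply_to_requester(req->requester, b->data[req->addr % LINE_WORDS]);
            return;                                               /* hit             */
        }
    }
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid) {                      /* miss: reserve a free block      */
            set[w].valid = true;  set[w].pending = true;
            set[w].tag = tag;     set[w].refcount = 0;
            set[w].waiters = NULL;
            enqueue(&set[w].waiters, req);
            send_line_request(tag);               /* the response resets the pending bit */
            return;
        }
    }
    /* All blocks in the set are reserved: an overflow/collision; the request
       waits (in a queue managed by the caller) until a block becomes free. */
}
```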
3.2 Caching I-Structures

A request to a remote I-Structure location is first searched set-associatively in the IS-Cache. If the cacheline is found in the IS-Cache, then depending on whether the location is full or empty, the request is either satisfied or suspended. A suspended request is queued in the IS-Cache. On a miss, a cache block is reserved as is done in the RW-Cache, and a pending bit associated with the cache block prevents further requests for the pending cacheline from being sent to the network.

At the remote I-Structure memory, an access to a cacheline results in the following actions. The requested cacheline is fetched and sent to the requesting PE. For each empty location in the cacheline, an entry indicating a pending read is queued. This is because a remote PE has cached this cacheline and may try to access some of the empty locations from its IS-Cache; those requests would get queued in the remote PE. Therefore, to release such requests, a wake-up message is sent from the I-Structure memory whenever any of these empty locations becomes full. It may well be the case that the remote PE has not accessed the location, and, further, may even have discarded the cached block; nonetheless the wake-up message is sent to the remote PE. If the remote PE has discarded the cached block, it simply ignores the message. On the other hand, if the remote PE still retains the cached block in its IS-Cache, the value passed along with the wake-up message is written into the appropriate cache location, and any pending requests in the remote PE are reactivated.

A write to an I-Structure location is sent to the I-Structure memory; an I-Structure write is never performed at the IS-Cache. This avoids inconsistent states and multiple writes in the I-Structure memory. Since an I-Structure location can be written at most once, it becomes a read-only structure once it is written, and hence does not cause any coherence problem.
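The IS-Cache read path and the handling of a wake-up message can be sketched as follows, assuming each cached I-Structure element carries a presence (full/empty) bit and a list of deferred reads. The data structures and helper names are our own illustration of the protocol described above, not the paper's implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_WORDS 4

typedef struct deferred { uint32_t thread; struct deferred *next; } deferred_t;

typedef struct {
    bool        full;       /* presence bit of the I-Structure element        */
    int64_t     value;
    deferred_t *readers;    /* reads suspended until the element is written   */
} is_elem_t;

typedef struct { bool valid, pending; uint32_t tag; is_elem_t elem[LINE_WORDS]; } is_block_t;

extern void deliver_value(uint32_t thread, int64_t v);
extern void defer_read(is_elem_t *e, uint32_t thread);

/* Read of a cached I-Structure element: satisfied if full, suspended if empty. */
void is_cache_read(is_block_t *b, uint32_t offset, uint32_t thread)
{
    is_elem_t *e = &b->elem[offset];
    if (e->full) deliver_value(thread, e->value);
    else         defer_read(e, thread);        /* released by a wake-up message */
}

/* Wake-up from the home I-Structure memory: the element has just been written.
 * If the block was meanwhile discarded, the caller simply drops the message.  */
void is_cache_wakeup(is_block_t *b, uint32_t offset, int64_t value)
{
    is_elem_t *e = &b->elem[offset];
    e->full  = true;
    e->value = value;
    for (deferred_t *d = e->readers; d != NULL; d = d->next)
        deliver_value(d->thread, value);       /* reactivate pending reads      */
    e->readers = NULL;
}
```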
4 Performance Evaluation

In this section we evaluate the performance of our architecture using discrete-event simulation.

4.1 Simulator Details

A function-level simulator for the architecture described in Section 2, with the DS-Cache organization, was written in C to run on a Unix platform. A constant processing time, in number of clock cycles, was assigned to each functional unit.

The number of PEs, the number of execution pipes per PE, the size of the DS-Cache, its organization, and the size of a cache block can be varied in each simulation run. The number of resident activations and the number of resources available in a PE were each assumed to be 32, as our earlier work on the performance of SMALL [8] revealed no significant effect from increasing these numbers beyond 16. Each resident activation is allocated 32 general-purpose registers in the execution pipe. The sizes of the instruction cache and the high-speed buffer were assumed to be 8K words each. A frame memory of 16K words per PE was used in the simulation. Lastly, each PE has a data structure memory of 16K words. The data structures used in a program are uniformly distributed over the memory modules in all the PEs.

The performance of our architecture is evaluated using four representative scientific application programs, namely SAXPBY (5120 elements), Matrix Multiplication (squaring a 24 x 24 matrix), Livermore Loop 2 (with 4096 elements) and Image Averaging (on 64 x 64 pixels). The SAXPBY program is similar to the Linpack kernel SAXPY, except that SAXPBY computes A[i]*X + B[i]*Y. The Livermore loop is programmed using I-Structures. The image averaging application is a low-pass filter which uses a barrier synchronization. All application programs were hand coded and run on the simulator. In hand-coding the applications no special optimizations were performed, and no effort was made to map a given application onto the PEs. Further, all performance experiments for an application program are conducted with the same problem size.

The instruction mix in these programs is shown in Table 1. The last column of the table gives an estimate of the overhead introduced by following the multithreaded approach. Synchronization operations account for synchronization instructions. Integer operations include all address calculation instructions as well as instructions that perform transfer operations between the frame memory and registers.

Table 1: Instruction Mix in Application Programs (percentages)

Benchmark          Int. Ops.  Flt. Ops.  Ld./Store Ops.  Ctrl. Ops.  Synch. Ops.  Overhead
SAXPBY               53.12       --          10.87          10.87        1.86       17.27
Matrix Mult.         55.05      14.26        14.56           4.18        6.85        5.10
Livermore Loop       45.46       8.33        12.50           5.75        9.76       18.20
Image Averaging      40.69      16.70        13.36           1.10       10.38       17.76

4.2 Performance Results

Throughput vs. Number of PEs

First we evaluate the throughput of our architecture with respect to the various parameters. Throughput is defined as the total number of instructions that are completed by all the PEs in an execution cycle.
In this experiment, we used a 4-way set-associative DS-Cache with a cacheline size of 4 words; the effect of the DS-Cache on the throughput is studied in a later experiment.

Figure 2: Throughput vs. number of PEs

Fig. 2 plots the overall throughput of our architecture against the number of PEs, with both axes drawn on a logarithmic scale. We observe that the throughput of our architecture increases almost linearly with the number of PEs for all benchmark programs. When the number of execution pipes is increased from 1 to 2, a considerable improvement in throughput is observed for all benchmark programs, irrespective of the number of PEs in the system. Increasing the number of execution pipes beyond 3 does not result in any improvement in the throughput. An interesting observation that can be made from the plots of Fig. 2 is that the throughput with 2 execution pipes per PE is almost equal to that of a configuration with twice the number of PEs and a single execution pipe.

Average Instruction-Level Parallelism

The average number of instructions executed per clock cycle per PE is the instruction-level parallelism exploited by the architecture. Table 2 tabulates the exploited instruction-level parallelism when 4 execution pipes per PE were used.

Table 2: Average Instruction-Level Parallelism

Number of PEs     1     2     4     8     16    32    64
SAXPBY           1.97  1.96  1.95  1.88  1.80  1.46  1.12
Matrix Mult.     1.66  1.68  1.68  1.67  1.66  1.37  1.03

As seen from this table, the instruction-level parallelism never exceeds 2, which indicates that 2 execution pipes are sufficient to fully utilize the synchronizing capabilities of a PE. For the Livermore Loop and Image Averaging programs, the exploited instruction-level parallelism is low; this is due to the high synchronization overheads involved in these applications.
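The two metrics used here are related by a simple division. The snippet below shows how they could be derived from raw simulator counters; the counter names and example values are ours, purely to make the definitions concrete.

```c
#include <stdio.h>

/* Hypothetical counters accumulated by the simulator during one run. */
struct sim_stats {
    unsigned long long instructions_completed;  /* summed over all PEs    */
    unsigned long long cycles;                  /* total execution cycles */
    unsigned           num_pes;
};

int main(void)
{
    struct sim_stats s = { 3200000ULL, 100000ULL, 32 };  /* illustrative inputs only */

    /* Throughput: instructions completed by all PEs per execution cycle.         */
    double throughput = (double)s.instructions_completed / (double)s.cycles;

    /* Instruction-level parallelism exploited: instructions per cycle per PE.    */
    double ilp_per_pe = throughput / (double)s.num_pes;

    printf("throughput = %.2f instr/cycle, ILP per PE = %.2f\n",
           throughput, ilp_per_pe);
    return 0;
}
```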
Figure 2 (contd.): Throughput vs. number of PEs
Figure 3: Throughput with and without the DS-Cache
Figure 4: Effect of the DS-Cache on network latency

Effect of DS-Cache on Throughput and Network Latency

As mentioned earlier, the introduction of the DS-Cache reduces the network latency and thus improves the performance of the system. In Fig. 3, we compare an architecture without the DS-Cache (with 1, 2, or 3 execution pipes) against an architecture with 3 execution pipes and a 128-set, 4-way set-associative DS-Cache for the SAXPBY program. Other benchmarks exhibit a similar trend and hence are not shown here. It can be seen that the introduction of the DS-Cache improves the throughput significantly, especially when the number of PEs is large, i.e., greater than 8. When the number of PEs is small, the parallelism available per PE is large enough to tolerate the remote memory latency, so the presence of the DS-Cache does not influence the performance in a significant way. However, with a large number of PEs, the parallelism per PE decreases (since no scaling of the application programs is considered here), and therefore the remote memory access latency becomes crucial.

It may also be observed from Fig. 3 that in the absence of the DS-Cache there is no gain in throughput from increasing the number of execution pipes. Another important observation from Fig. 3 is that the throughput of the architecture with the DS-Cache equals the throughput of a configuration without the DS-Cache but with twice the number of PEs. Earlier we made a similar observation with respect to the number of execution pipes. Thus one can infer that it is the DS-Cache that allows the PEs to exploit higher instruction-level parallelism.

The introduction of the DS-Cache influences the performance of the architecture in two ways. First, it reduces the access time for remote read requests whenever the corresponding cacheline is present in the DS-Cache. Second, in the event of a hit, the remote requests are serviced by the cache and hence do not enter the network; this in turn reduces the network traffic and the network latency. In Fig. 4, we plot the average network latency observed for two application programs when the number of PEs is increased from 2 to 64. In this experiment, we consider architecture configurations with and without the DS-Cache, but with 3 execution pipes per PE in both cases. The maximum average latency encountered without the DS-Cache is more than 2500 time units for the Image Averaging application and 1500 time units for SAXPBY. These large values indicate the enormous contention and queuing delay encountered in the network; further, they show that the saturated network latency is two orders of magnitude higher than the unloaded network latency. When the architecture supports the DS-Cache, the network latency increases to a maximum of roughly 600 time units, which is only one order of magnitude higher than the unloaded network latency. Thus supporting a DS-Cache in the architecture reduces the network latency by nearly a factor of 3 to 6. The higher network latency values for the Image Averaging application are due to the large number of synchronizing messages used for the barrier synchronization.

The network latency for architecture configurations with DS-Caches in Fig. 4 increases initially with the number of PEs and then decreases. Though this may seem counter-intuitive, it can be explained in the following manner.
When the number of PEs is increased, the size of the network and the number of network switches increase, decreasing the extent of contention in the network. Further, as the size of the application is not scaled, the available parallelism per PE decreases with an increase in the number of PEs. The resulting 'lack' of work makes a PE wait for responses to come back from the network before it can pump additional messages into the network. This reduces the number of messages (remote memory requests) sent by a PE to the network, which in turn further reduces the contention and the network latency. However, when the number of PEs is 8, each PE had 'enough' parallelism to tolerate even a very high network latency; the PEs therefore kept pumping more and more messages into the network, keeping the network always in saturation.

Effect of Cache Organizations

In this experiment we keep the number of PEs at 32 and the number of execution pipes at 4. We used two different DS-Cache sizes, viz. 1K words and 2K words (per PE), even though a much larger cache size can be supported in practice. There are two reasons for this. First, no improvement in the overall throughput was observed for our benchmark programs when the cache size was increased beyond 2K words. Second, the benchmark programs considered here have a smaller problem size than real-world examples; the smaller DS-Cache assumed in the simulation thus provides a scale-down effect, matching the smaller problem sizes considered in the simulation experiments. For each of the cache sizes, namely 1K and 2K words, we considered two different cacheline sizes (4 and 8 words) and four different cache organizations, namely direct-mapped, 2-way, 4-way, and 8-way set-associative organizations.

The throughput of the architecture is once again an important performance metric to judge the suitability of a cache organization. The average number of times each cache block is accessed before another cacheline overwrites it is a measure of the utilization of the cache block; it is referred to as the average cache block reuse. The number of times a cache block cannot be reserved in the DS-Cache due to the non-availability of a cache block in the corresponding set (recall that this can happen if all the blocks in the appropriate set are waiting, with their pending flags set to 1, for their read requests to be satisfied by the remote data structure memory) is referred to as the collisions in the DS-Cache. Whenever a collision occurs, the request is queued until a cache block becomes free. The queuing of requests increases the response time for the read and hence affects the throughput of the architecture.
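The two bookkeeping quantities just defined could be gathered in a simulator roughly as follows; the block record, counter names and replacement policy details are our assumptions, added only to make the definitions of reuse and collision concrete.

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4

typedef struct {
    bool     valid, pending;
    uint32_t tag;
    uint32_t accesses;        /* hits to this block since it was last filled  */
} block_t;

static unsigned long long reuse_total, reuse_blocks;   /* for average block reuse   */
static unsigned long long collisions;                  /* set full of pending blocks */

/* Fold a block's access count into the running reuse average when it is overwritten. */
static void account_reuse(block_t *b)
{
    reuse_total += b->accesses;
    reuse_blocks++;
    b->accesses = 0;
}

/* Try to reserve a block in the set for a missing cacheline.  A collision is
 * recorded when every block is still pending on an earlier outstanding fetch. */
static bool try_reserve(block_t set[WAYS], uint32_t tag)
{
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].pending) {                 /* free or replaceable block    */
            if (set[w].valid) account_reuse(&set[w]);
            set[w].valid = true;  set[w].pending = true;
            set[w].tag = tag;     set[w].accesses = 0;
            return true;
        }
    }
    collisions++;                              /* the request must be queued   */
    return false;
}
```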
Table 3: Effect of cache organization on performance

Table 3 summarizes these performance parameters for the different cache organizations. From the table, the following observations can be made.
(1) The direct-mapped cache organization yields a lower throughput than the set-associative organizations. The lower throughput of direct-mapped caches is an expected result, as direct mapping provides less flexibility in mapping a cacheline to a cache block.
(2) The value of the collision parameter decreases drastically as the associativity is increased. Also, the cache block reuse increases to a small extent with the associativity.
(3) Increasing the cacheline size increases the cache block reuse factor by a small amount. A small increase in the throughput of the architecture is also observed with the increase in the cacheline size.
(4) The throughput of the architecture is nearly the same for all cache organizations with an associativity greater than or equal to 2.

5 A Study on Feasibility

In this section we provide approximate estimates of the transistor counts of the various functional blocks in our architecture. These estimates are based on comparisons with blocks of similar capability present in currently available commercial and experimental microprocessors, namely the Message-Driven Processor (MDP) [7], SuperSPARC [1] and the RS/6000 [14]. The MDP uses a standard-cell approach, and hence the corresponding estimates are conservative. We plan to conduct VHDL simulations in the future to provide a more detailed understanding of the feasibility aspects.

The processor is divided into two parts: (i) on-chip storage, and (ii) datapath, control logic and interface logic. On-chip storage on SuperSPARC and the MDP uses nearly 70-80% of the total transistors, but covers only 30% of the chip area due to its regular structure. The on-chip storage comprises an 8K-word instruction cache and an 8K-word high-speed buffer. Based on the estimates for a 4K-word SRAM in the MDP, the 8K-word instruction cache requires 1.6M transistors. The high-speed buffer of 8K words is divided into 32 pages and is fully associative at the page level only. We assume 5 ports to the high-speed buffer, one for each of the (up to 4) execution pipes and one for the buffer loader. A 5-ported 64K-byte on-chip memory uses 4.5M transistors in the RS/6000, so the high-speed buffer will require nearly 3M transistors.
The register file contains up to 1024 registers and 8 ports. (A study in [14] shows that, on average, one read port and one read/write port are sufficient for one instruction execution.) Such register files are common in superscalar/VLIW processors. The instruction scheduler contains 32 resources with a total of 128 registers; a resource uses the space of 4 registers. The ready thread queue has a length of 10 threads and uses 3 registers for each thread. Thus the thread management functions use 158 registers with one read and one write port.

The datapath, control logic and interface logic use the remaining 20-30% of the transistors, but consume 70% of the chip area. Our architecture uses generic execution pipes, similar to that of the MDP, which requires 39K transistors. The buffer and instruction cache loaders use a counter for loading an activation or a code-block once the base address is specified; otherwise, the address logic is similar to the address arithmetic unit of the MDP, which consumes 75K transistors. The remaining control and interface logic (for internal and external memory and chip I/O) uses 56K transistors in the MDP.

Thus, the multithreaded processor proposed in this paper requires approximately 5M transistors (a more extensive execution pipeline may increase this number to 6M transistors), a reasonable size for the current technology. Now let us consider the other functional units, namely the filter unit and the thread synchronization unit. The filter unit maintains two tables, one for the 32 resident activations (with the base addresses of their frames) and the other for the 32 active threads (instruction pointers). The thread synchronization unit maintains a queue of up to 32 threads (3 addresses each) which do not have resident activations. The logic in these units is fairly simple and can be implemented as finite state machines, using conventional approaches such as PLAs or FPGAs.

6 Related Work

Several multithreaded architectures have been proposed in the literature (refer, for example, to [2, 3, 6, 12, 13, 15] and to [11] for a survey). Like the Threaded Abstract Machine (TAM) [6] and *T [15], our architecture realizes three levels of program hierarchy based on synchronization and scheduling. TAM uses a compiler-controlled approach to achieve fine-grain parallelism and synchronization; in contrast, we advocate the use of suitable compilation techniques together with the necessary hardware support. Further, our architecture supports multiple resident activations, which help to mask the cost of context switching while off-loading a resident frame.

The processor coupling proposal [12] and the architecture proposed by Hirata et al. [9] use dynamic packing of instructions from different threads to exploit instruction-level parallelism. Our approach is similar, except that each instruction in a thread in our architecture contains a single operation. While the use of multi-operation instructions in a thread in processor coupling [12] improves the throughput, it makes the runtime scheduler more complex. Our architecture also supports a two-level cache structure to achieve high throughput.

Our work differs from Tera [3] and its predecessors in the following ways: (i) Tera uses long-word instructions (three operations per instruction), and (ii) no dynamic packing of instructions is performed on Tera. In contrast, the processor in *T [15] is a superscalar. Our results indicate that the synchronizing capability of a PE can be fully utilized by using more than one execution pipe.
7 Conclusions

In this paper we have described the design of a scalable multithreaded architecture. The salient features of the architecture are (i) its ability to exploit both coarse-grain parallelism and fine-grain instruction-level parallelism, (ii) a distributed DS-Cache which significantly reduces the network latency and makes the system scalable, (iii) a high-speed buffer organization which completely avoids load stalls on accesses to the local variables of an activation, and (iv) a layered approach to synchronization and scheduling which helps to achieve very high processor throughput and utilization.

The performance of the architecture is evaluated using simulation. Initial simulation results are promising and indicate that:
(1) With a small number (2 or 3) of execution pipes, the architecture can effectively exploit all the instruction-level parallelism available at a PE. The throughput of our architecture with two execution pipes is almost equal to that of a configuration with twice the number of PEs and a single execution pipe.
(2) The use of a set-associative cache leads to a near-linear performance improvement with respect to the number of PEs. The presence of the DS-Cache reduces the network latency experienced by remote read requests by a factor of 3 to 6.
(3) The DS-Cache reduces network traffic, which is essential in realizing the performance improvements due to multiple execution pipes.

Acknowledgements

The authors are grateful to the reviewers whose comments have improved the presentation of this paper. This work was supported by MICRONET - Network Centres of Excellence in Canada.

References

[1] F. Abu-Nofal et al. A three million transistor microprocessor. In Digest of Technical Papers, 1992 IEEE International Solid-State Circuits Conference, pages 108-109, Feb. 1992.
[2] A. Agarwal, B.-H. Lim, D. Kranz, and J. Kubiatowicz. APRIL: A processor architecture for multiprocessing. In Proc. of the 17th Ann. Intl. Symp. on Computer Architecture, pages 104-114, Seattle, Wash., June 1990.
[3] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith. The Tera computer system. In Conf. Proc., 1990 Intl. Conf. on Supercomputing, pages 1-6, Amsterdam, The Netherlands, June 1990.
[4] Arvind, R. S. Nikhil, and K. Pingali. I-structures: Data structures for parallel computing. ACM Trans. on Programming Languages and Systems, 11(4):598-632, Oct. 1989.
[5] H. Cheong and A. V. Veidenbaum. Compiler-directed cache management in multiprocessors. IEEE Computer, pages 39-47, June 1990.
[6] D. E. Culler et al. Fine-grain parallelism with minimal hardware support: A compiler-controlled threaded abstract machine. In Proc. of the 4th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 164-175, Santa Clara, Calif., April 1991.
[7] W. J. Dally et al. A message-driven processor: A multicomputer processing node with efficient mechanisms. IEEE Micro, pages 24-38, April 1992.
[8] R. Govindarajan and S. S. Nemawarkar. SMALL: A scalable multithreaded architecture to exploit large locality. In Proc. of the 4th IEEE Symp. on Parallel and Distributed Processing, pages 32-39, Dec. 1992.
[9] H. Hirata et al. An elementary processor architecture with simultaneous instruction issuing from multiple threads. In Proc. of the 19th Intl. Symp. on Computer Architecture, pages 136-145, Gold Coast, Australia, May 1992.
[10] R. A. Iannucci. Toward a dataflow/von Neumann hybrid architecture. In Proc. of the 15th Ann. Intl. Symp. on Computer Architecture, pages 131-140, Honolulu, Hawaii, June 1988.
[11] R. A. Iannucci, G. R. Gao, R. H. Halstead, Jr., and B. Smith. Multithreaded Computer Architecture: A Summary of the State of the Art. Kluwer, Norwell, Mass., 1994.
[12] S. W. Keckler and W. J. Dally. Processor coupling: Integrating compile time and runtime scheduling for parallelism. In Proc. of the 19th Intl. Symp. on Computer Architecture, pages 202-213, Gold Coast, Australia, May 1992.
[13] Y. Kodama, S. Sakai, and Y. Yamaguchi. A prototype of a highly parallel dataflow machine EM-4 and its preliminary evaluation. In Proc. of InfoJapan 90, pages 291-298, Oct. 1990.
[14] M. Misra. IBM RISC System/6000 Technology, First edition. IBM, Austin, Tx., 1990.
[15] R. S. Nikhil, G. M. Papadopoulos, and Arvind. *T: A multithreaded massively parallel architecture. In Proc. of the 19th Ann. Intl. Symp. on Computer Architecture, pages 156-167, Gold Coast, Australia, May 1992.
[16] G. M. Papadopoulos and D. E. Culler. Monsoon: An explicit token-store architecture. In Proc. of the 17th Ann. Intl. Symp. on Computer Architecture, pages 82-91, Seattle, Wash., June 1990.