Design and Performance Evaluation of a Multithreaded Architecture *

R. Govindarajan, Dept. of Computer Science, Memorial Univ. of Newfoundland, St. John's, A1C 5S7, Canada (govind@cs.mun.ca)
S. S. Nemawarkar, Dept. of Electrical Engineering, McGill University, Montreal, H3A 2A7, Canada (shashank@macs.ee.mcgill.ca)
Philip LeNir, Dept. of Electrical Engineering, McGill University, Montreal, H3A 2A7, Canada (lenir@ee470.ee.mcgill.ca)

* This work was supported by MICRONET - Network Centres of Excellence, Canada.
Abstract
Multithreaded architectures have the ability to tolerate long memory latencies and unpredictable synchronization delays. In this paper we propose a multithreaded architecture that is capable of exploiting both coarse-grain parallelism and fine-grain instruction-level parallelism in a program. Instruction-level parallelism is exploited by grouping instructions from a number of active threads at runtime. The architecture supports multiple resident activations to improve the extent of locality exploited. Further, a distributed data structure cache organization is proposed to reduce both the network traffic and the latency in accessing remote locations.

Initial performance evaluation using discrete-event simulation indicates that the architecture is capable of achieving very high processor throughput. The introduction of the data structure cache reduces the network latency significantly. The impact of various cache organizations on the performance of the architecture is also discussed in this paper.
1 Introduction
Multithreaded architectures [10, 11] are based on a hybrid evaluation model which combines the von Neumann execution model and data-driven evaluation. In the hybrid model, a program is represented as a partially-ordered graph of nodes. The nodes, called threads, consist of a sequence of instructions which are executed in the conventional von Neumann way. Individual threads are scheduled in a dataflow-like manner, driven by the availability of necessary input operands to the threads. Also, in the hybrid evaluation model, a long-latency memory operation is performed as a split-phase operation, where the access request is issued by one thread and the accessed value is used in a different thread; the response to the access request provides the necessary synchronization for the second thread. As a result the processor does not idle on a long-latency operation; instead it switches to the execution of another thread.
In order to improve the extent of locality exploited
and to perform synchronization efficiently, it is nec-
essary to recognize the three levels of program hier-
archy, namely code-block, threads, and instructions,
and use appropriate synchronization and scheduling
mechanisms at each level of hierarchy [6]. Based on
the above design philosophy, in our earlier work, we
have proposed the Scalable Multithreaded Architec-
ture to exploit Large Locality (SMALL) [8]. A salient
feature of SMALL is maintaining multiple resident ac-
tivations in the processor which ensures the exploita-
tion of high locality and zero load stalls in accessing
the local variables of a function.
In this paper, we extend SMALL to exploit both
coarse-grain parallelism and fine-grain instruction-
level parallelism. Coarse-grain parallelism is exploited
by distributing the execution of various invocations of
a function body (or loop body) across several Process-
ing Elements (PEs). Fine-grain instruction-level par-
allelism is exploited by a runtime grouping of instruc-
tions from multiple threads and executing them con-
currently on multiple execution pipes available in each
PE. Further, in the proposed architecture, we intro-
duce a distributed Data Structure cache (DS-Cache)
organization for shared data structures. Our archi-
tecture supports caching of two types of data struc-
tures, namely the I-Structures [4], where each loca-
tion is written at most once, and the normal data
structures, where an individual location can be writ-
ten many times. Coherence in caches is maintained by
using a special scheme for I-Structures, and by using
software cache coherence mechanisms for normal data
structures.
The performance of the multithreaded architecture
is evaluated using discrete-event simulation. Our sim-
ulation results indicate:
(1) With a small number (2 or 3) of execution pipes,
the architecture can effectively exploit all the paral-
lelism available at a PE.
(2) The throughput with two execution pipes per PE
is almost equal to that of a configuration with twice
the number of PEs and a single execution pipe.
(3) The introduction of a set-associative cache leads to
a near-linear performance improvement with respect
to the number of PEs. Further, our simulation results
show the presence of the DS-Cache is essential to re-
alize the performance gains due to multiple execution
pipes.
(4) Lastly, the presence of the DS-Cache reduces the
network latency experienced by remote read requests
by a factor of 3 to 6.
The details of the architecture are described in Sec-
tion 2. The cache organization and the protocol for
caching I-Structures are discussed in the subsequent
section. Section 4 deals with the performance eval-
uation of our architecture using various benchmark
programs. In Section 5 we investigate the feasibility
of the architecture by comparing the functionality of
various modules with the standard modules present
in other commercially available processors. We com-
pare our work with other multithreaded architectures
in Section 6. Concluding remarks are presented in
Section 7.
2 The Architecture
The execution model of our architecture, which is
similar to that proposed in the threaded abstract ma-
chine [6], is discussed in [8]. Readers are referred to [8]
for details.
The architecture consists of a number of Processing
Elements (PEs) connected by an interconnection net-
work. For the purpose of this paper we consider the
binary n-cube network as the topology of the inter-
connection network. In this section, we describe the
organization of a PE (refer to Fig. 1) and all functional
units except the DS-Cache. The processor part of the
architecture is enclosed in a dotted box in the figure.
2.1 Frame Memory Unit
The activation frames of a program are stored in
the frame memory unit. The frame memory unit also
consists of a simple memory manager which is respon-
sible for managing the frame memory. On a func-
tion invocation, the frame memory manager allocates
Figure 1: Organization of a Processing Element
a frame of appropriate size in the frame memory. In-
put arguments to an activation also reach the frame
memory unit in the form of tokens, and the values are written into the respective frame memory locations by the memory manager. A signal indicating the arrival
of an operand is sent to the thread synchronization
unit where the synchronization of input arguments to
a thread is performed. The frame memory unit also
receives a signal from the execution pipe when an ac-
tivation terminates. In that case, the frame memory
manager deallocates the corresponding frame.
2.2 Thread Synchronization Unit
The thread synchronization unit performs the syn-
chronization of inputs to a thread. It is similar to the
explicit token store matching unit of the Monsoon ar-
chitecture [16]. A thread becomes enabled when it has
received all its input arguments. The activation con-
taining this thread becomes enabled, if the activation
is not already enabled. An enabled thread is sent to the filter unit.
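As an illustration of the enabling mechanism just described, the following minimal C sketch tracks operand arrivals with a per-thread synchronization count, in the spirit of an explicit-token-store matcher; the data layout, the table size, and all names are our assumptions, not part of the architecture specification.

    #include <stdbool.h>

    #define MAX_THREADS_PER_ACT 16   /* hypothetical limit */

    typedef struct {
        int  sync_count[MAX_THREADS_PER_ACT]; /* operands still outstanding per thread */
        bool enabled;                         /* does this activation have an enabled thread? */
    } activation_t;

    /* Called when the frame memory unit signals the arrival of an operand for
     * thread `tid` of activation `act`. Returns true when the thread has just
     * become enabled and must be forwarded to the filter unit. */
    bool operand_arrived(activation_t *act, int tid)
    {
        if (--act->sync_count[tid] > 0)
            return false;        /* still waiting for more inputs */
        act->enabled = true;     /* the containing activation becomes enabled too */
        return true;             /* enqueue this thread for the filter unit */
    }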
2.3 Filter Unit
If the incoming thread belongs to an activation
whose activation frame is already resident in the high-
speed buffer (refer to Section 2.4), the filter unit for-
wards the thread to the scoreboard unit. If the thread
belongs to an activation which is not resident, then
the filter unit checks whether the activation can be
loaded in the high-speed buffer. If so it instructs the
frame memory unit to load the activation frame in the
high-speed buffer (refer to Section 2.4). When the ac-
tivation is successfully loaded, it is said to be resident
(in the high-speed buffer), and the incoming thread is
sent to the scoreboard unit. On the other hand, if the
activation corresponding to the incoming thread can-
not be loaded, then the thread is queued inside the
filter unit until an activation is freed. The filter unit
proceeds to service other threads in its input queue.
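The filter unit's decision for an incoming enabled thread can be summarized by the following sketch; the helper routines (is_resident, try_load_frame, and so on) are hypothetical names standing in for the high-speed buffer bookkeeping described above.

    #include <stdbool.h>

    typedef struct { int activation_id; int thread_id; } thread_t;

    /* Hypothetical helpers for the resident-activation table and queues. */
    bool is_resident(int activation_id);
    bool try_load_frame(int activation_id);
    void send_to_scoreboard(thread_t t);
    void enqueue_waiting(thread_t t);

    void filter_incoming_thread(thread_t t)
    {
        if (is_resident(t.activation_id)) {
            send_to_scoreboard(t);          /* activation frame already resident */
        } else if (try_load_frame(t.activation_id)) {
            /* frame memory unit loads the frame into the high-speed buffer;
             * once loading completes the thread proceeds to the scoreboard */
            send_to_scoreboard(t);
        } else {
            enqueue_waiting(t);             /* no free slot; retry when an activation is freed */
        }
    }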
2.4 High-speed Buffer
The high-speed buffer in our architecture is a mul-
tiported cache-like memory with a single cycle access
time. On a request from the filter unit, the frame
memory loads a frame in the high-speed buffer with
the help of the buffer loader.
When there are no enabled threads in a resident
activation, the high-speed buffer receives signals from
the score-board unit either to flush or to off-load a
frame from the high-speed buffer. If the request is to
off-load the frame, then the contents are stored back
in the frame memory. A flush request issued by the
filter unit indicates that the corresponding activation
has terminated and therefore there is no need to save
the frame.
2.5 Score-Board Unit
The score-board unit performs book-keeping op-
erations such as how many threads in an activation
are currently enabled. When the number of enabled
threads in an activation becomes zero, it instructs the
high-speed buffer to either flush or off-load the cor-
responding activation frame. Threads arriving at the
score-board unit are sent to the Ready Thread Queue after the book-keeping operations.
2.6 Ready Thread Queue
The Ready Thread Queue maintains a pool of
threads which await certain resources in the instruc-
tion scheduler. A resource corresponds to a set of
registers, namely a program counter, a base-address
register for the activation frame, an intermediate in-
struction register, an instruction register, and a stall
register. The role of these registers will be explained
in the following subsection on instruction scheduler.
When a resource becomes available, the instruction
scheduler takes a thread from the pool and allocates
the resource to it. A thread that has acquired a re-
source is said to be active.
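The register set that makes up one such resource can be pictured as the following C structure; the field names are ours, and the array size of 32 resources follows the simulation parameters given later in Section 4.1.

    /* A minimal sketch of one "resource" in the instruction scheduler. */
    typedef struct {
        unsigned pc;           /* program counter of the active thread */
        unsigned frame_base;   /* base address of its activation frame */
        unsigned inter_instr;  /* intermediate instruction register */
        unsigned instr;        /* instruction register */
        unsigned stall;        /* remaining stall cycles for this thread */
        int      busy;         /* is the resource allocated to an active thread? */
    } resource_t;

    /* A PE would then hold, e.g.:  resource_t resources[32]; */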
2.7 Instruction Scheduler
The instruction scheduler consists of two units, the
fetch unit and the schedule unit. The instruction fetch
unit receives a thread from the ready thread queue
and loads the program counter allocated to it. The
instruction pointed by the program counter is fetched
from the instruction cache. The instruction is stored
in the Intermediate Instruction Register and the associated stall count¹ is loaded into the stall register.
The set of intermediate registers together form the so
called instruction window for the multiple execution
pipes.
The schedule unit, at each execution cycle, checks
the intermediate instruction registers. It selects up to n instructions for execution, where n is the number of
execution pipes. These selected instructions are then
moved to the instruction register. These instructions
are initiated in the execution pipes in the following
cycle. The associated stall registers are decremented
every cycle until they become zero. The corresponding
program counters are then incremented and the next instructions from those threads are fetched.
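The per-cycle behaviour of the schedule unit can be sketched as below, reusing the hypothetical resource_t structure from the Ready Thread Queue example; this is a simplified single-cycle interpretation of the text, and the helper names, the pipe count of 2, and the placement of the stall update are our assumptions.

    #define NUM_PIPES     2    /* n: the results later suggest 2-3 pipes suffice */
    #define NUM_RESOURCES 32   /* matches the simulation parameters in Section 4.1 */

    /* Hypothetical helpers standing in for the fetch unit, the execution pipes,
     * and the compile-time stall annotation of an instruction. */
    void     issue_to_pipe(int pipe, resource_t *r);
    void     fetch_into(resource_t *r);        /* refill inter_instr from the I-cache */
    unsigned stall_count(unsigned instr);

    void schedule_cycle(resource_t res[NUM_RESOURCES])
    {
        int issued = 0;
        for (int i = 0; i < NUM_RESOURCES; i++) {
            if (!res[i].busy)
                continue;
            if (res[i].stall > 0) {
                /* a previously issued instruction still stalls its successor */
                if (--res[i].stall == 0) {
                    res[i].pc++;
                    fetch_into(&res[i]);       /* fetch the thread's next instruction */
                }
            } else if (issued < NUM_PIPES) {
                res[i].instr = res[i].inter_instr;         /* move to instruction register */
                issue_to_pipe(issued++, &res[i]);          /* initiated in the next cycle */
                res[i].stall = stall_count(res[i].instr);  /* compile-time stall annotation */
                if (res[i].stall == 0) {
                    res[i].pc++;
                    fetch_into(&res[i]);       /* no dependency stall: fetch successor now */
                }
            }
        }
    }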
2.8 Instruction Cache
The instruction cache unit is similar to the instruc-
tion cache in conventional machines. Only code-blocks
corresponding to resident activations are present in the
instruction cache.
2.9 Execution Pipe
Each PE consists of a number of instruction execu-
tion pipes. The execution pipes used in our architec-
ture are generic in nature, each capable of performing
any operation. They are assumed to be fully pipelined.
The execution pipes are load/store in nature: every
operation other than memory load and store uses reg-
ister operands. Since an activation frame is pre-loaded
in the high-speed buffer before instructions of that ac-
tivation are scheduled for execution, it is guaranteed
that a load operation (on a frame location) does not
cause any load stalls. The execution pipes share a
register file. The register file is logically divided into
¹ At compile time, a stall count is associated with each instruction which indicates how many stall cycles are needed, to take care of data dependency in pipeline execution, before the next instruction in this thread can be initiated.
a number of register banks, one corresponding to each
resident activation. In the execution model, the logi-
cal register name specified in an instruction is used as
an offset within the register bank.
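As a small illustration of this register addressing, one straightforward bank layout is shown below; only the offset rule is stated in the text, so the linear layout and the bank size of 32 registers (taken from the simulation parameters in Section 4.1) are assumptions.

    #define REGS_PER_ACTIVATION 32

    /* The logical register named in an instruction is an offset into the
     * register bank of the corresponding resident activation slot. */
    static inline int physical_reg(int resident_slot, int logical_reg)
    {
        return resident_slot * REGS_PER_ACTIVATION + logical_reg;
    }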
The instruction set of our architecture includes spe-
cial instructions to invoke a new activation, to signal the termination of a thread or of an activation, and to communicate results to an external activa-
tion. An execution pipe sends a message to the router
unit whenever it has to (i) access the data structure
memory (either local or remote), (ii) communicate ar-
guments to an activation, or (iii) invoke an activation.
2.10 Router Unit
The router unit receives messages from the execu-
tion pipes, from the data structure memory, or from
the network switches. The messages are appropriately
routed to the frame memory, the data structure mem-
ory, the DS-cache, or the network switch.
2.11 Network Switch
The PEs of our architecture are connected by a bi-
nary n-cube interconnection network which provides a
high network bandwidth. In this paper, we focus on
the use of a fixed routing scheme along the shortest
path to the destination.
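The paper states only that a fixed shortest-path route is used on the binary n-cube; dimension-order (e-cube) routing is one standard scheme with that property, sketched here for illustration.

    /* One-hop routing decision on a binary n-cube: correct the lowest
     * differing address bit first (dimension-order routing). */
    int next_hop(int current_pe, int dest_pe)
    {
        int diff = current_pe ^ dest_pe;   /* differing bits = remaining dimensions */
        if (diff == 0)
            return current_pe;             /* message has arrived */
        int dim = 0;
        while (!(diff & (1 << dim)))
            dim++;                         /* lowest differing dimension */
        return current_pe ^ (1 << dim);    /* neighbour across that dimension */
    }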
3 Cache Organization
In our architecture the data structure memory is shared and is distributed across the PEs. The data structure elements are uniformly distributed across the PEs.
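The exact mapping is not spelled out in the paper; one plausible interleaving consistent with a uniform distribution is a simple modulo placement, shown here only as an assumption.

    /* Hypothetical home-node mapping: element i of a shared structure
     * lives on PE (i mod P). */
    int home_pe(long element_index, int num_pes)
    {
        return (int)(element_index % num_pes);
    }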
A set of remote memory locations, called a cacheline, can be cached in the data structure cache (DS-Cache) of the PE. The DS-Cache can have either a direct-mapped or a k-way set-associative organization. However, for expository purposes, in this section we will consider the k-way set-associative organization.
tion. The DS-Cache consists of two parts, one for I-
Structures, whose elements are written only once, and
the other for read-write data structures. The respec-
tive caches are referred to as IS-Cache and RW-Cache.
A remote memory access request arriving at the router
unit is sent to the DS-Cache. If the corresponding
cacheline is present in the local cache, the location
is accessed from the cache and the contents are sent
to the router unit. Otherwise a request to fetch the
cacheline will be sent to the router. The router in turn
forwards the request to the appropriate PE through
the network switch. In the following two subsections
we discuss the issues in caching shared data structures
and the policies followed by the DS-Cache.
3.1 Caching Read-Write Data Structures
Whenever an access to a remote memory location
arrives at the DS-cache, the cache memory is searched
for the cacheline. If a hit occurs the access request is
satisfied by the RW-Cache. On a miss, a free cache
block is reserved for this cacheline. The tag and refer-
ence count fields of this cache block are set appropri-
ately. A pending bit associated with the cache block
is set to 1. The pending bit indicates that a request
to the cacheline has been sent on the network and the
response is awaited. When the response is received
the pending flag is reset. The reserved cacheline along
with the pending bit avoids multiple requests for the
same cacheline being sent to the network. Such re-
quests are queued inside the RW-cache until the pend-
ing bit is reset. When the requested cacheline arrives,
the pending bit is reset and the waiting requests are
serviced by sending the appropriate data values.
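The read path just described can be sketched in C as follows; only the tag, pending-bit, and queueing behaviour follows the text, while the block layout and the helper routines are hypothetical.

    #include <stdbool.h>

    typedef struct { int requester_pe; unsigned addr; } request_t;   /* hypothetical */

    typedef struct {
        unsigned tag;
        bool     valid;       /* cacheline contents present */
        bool     pending;     /* fetch sent to the remote PE, reply awaited */
        int      ref_count;   /* reference count field mentioned above */
    } rw_cache_block_t;

    /* Hypothetical helpers for replying, queueing, and network traffic. */
    void reply_from_cache(request_t req);
    void queue_on_block(rw_cache_block_t *blk, request_t req);
    void send_remote_fetch(unsigned tag);

    void rw_cache_read(rw_cache_block_t *blk, unsigned tag, request_t req)
    {
        if (blk->valid && blk->tag == tag) {
            reply_from_cache(req);          /* hit: satisfied locally */
        } else if (blk->pending && blk->tag == tag) {
            queue_on_block(blk, req);       /* line already in flight: wait, send nothing */
        } else {
            blk->tag = tag;                 /* miss: reserve the block for this cacheline */
            blk->valid = false;
            blk->pending = true;
            blk->ref_count = 0;
            send_remote_fetch(tag);         /* one network request per missing cacheline */
            queue_on_block(blk, req);
        }
    }
    /* When the cacheline arrives: blk->valid = true; blk->pending = false;
     * and all requests queued on the block are answered. */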
A consequence of reserving cache blocks (at the
time of miss) is that all cache blocks may be reserved,
awaiting response from the remote locations. In this
case, new requests arriving at the RW-Cache are queued
until a cache block becomes free. We refer to such a
case as an overflow or collision in the RW-Cache. All
writes in a read-write structure are performed only in
the structure memory. In order to maintain coherence
in caches for read-write structures, we follow a selec-
tive invalidation scheme similar to the one proposed
in [5]. A special invalidation instruction is executed
prior to accessing a data structure memory location if
there is a possibility that the corresponding cacheline
is modified after its last access. The invalidation mes-
sage is sent to the RW-Cache which invalidates the
cacheline if the cacheline is present in the local RW-
Cache. However, due to cache replacements the cacheline may not be present in the RW-Cache, in which case the invalidation signal is simply ignored.
3.2 Caching I-Structures
A request to a remote I-Structure location is first
searched set-associatively in the IS-Cache. If the
cacheline is found in the IS-Cache, then depending
on whether the location is full or empty, the request
is either satisfied or suspended. A suspended request
is queued in the IS-Cache. On a miss, a cache block
is reserved as is done in the RW-Cache. A pending
bit is associated with the cache block to avoid further
requests on the pending cache block being sent to the
network.
At the remote I-structure memory, an access to
a cacheline results in the following actions: The re-
quested cacheline is fetched and sent to the requesting
PE. For each empty location in the cacheline, an entry
indicating a pending read is queued. This is because
a remote PE has cached this cacheline, and it (the
remote PE) may try to access some of the empty lo-
cations from its IS-Cache. These requests may get
queued in the remote PE. Therefore, to release the
above requests, a wake-up message should be sent from
the I-Structure memory whenever any of these empty
locations becomes full. It may well be the case that
the remote PE might not have accessed this location,
and, further, might have even discarded the cached
block. Nonetheless the wake-up message should be
sent to the remote PE. If the remote PE has discarded this cached block, it will simply ignore this message. On the other hand, if the remote PE still retains the cached block in its IS-Cache, then the value passed along with the wake-up message is written in the appropriate cache location. Any pending requests in the remote PE then get reactivated.
A write to an I-structure location is sent to the I-
structure memory. An I-Structure write is never done
at the IS-Cache. This is to avoid inconsistent states
and multiple writes in the I-Structure memory. Since
an I-Structure location can be written at most once, it
becomes a read-only structure once it is written, and
hence does not cause any coherence problem.
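The full/empty behaviour of an IS-Cache read can be pictured with the following sketch; the slot layout and names are illustrative assumptions, and only the defer-until-wake-up behaviour follows the text.

    #include <stdbool.h>

    typedef enum { IS_EMPTY, IS_FULL } is_state_t;

    typedef struct {
        is_state_t state;
        long       value;
    } is_slot_t;

    /* Returns true (with the value) if the read can be satisfied now;
     * otherwise the request is deferred, queued in the IS-Cache, until a
     * wake-up message from the home I-Structure memory fills the location. */
    bool is_cache_read(is_slot_t *slot, long *value_out)
    {
        if (slot->state == IS_FULL) {
            *value_out = slot->value;
            return true;
        }
        return false;   /* caller queues the request on this slot */
    }

    /* A wake-up message carrying the newly written value fills the slot and
     * releases any deferred reads:
     *     slot->value = v; slot->state = IS_FULL;  then re-issue queued reads. */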
4 Performance Evaluation
In this section we evaluate the performance of our
architecture using discrete-event simulation.
4.1 Simulator Details
A function-level simulator for the architecture de-
scribed in Section 2 with the DS-Cache organization
was written in C to run on a Unix platform. Con-
stant processing time, in number of clock cycles, was
assigned to each functional unit.
The number of PEs, the number of execution pipes
per PE, the size of the DS-Cache, its organization,
and the size of cache-block can be varied in each sim-
ulation run. The number of resident activations and
the number of resources available in a PE were each
assumed to be 32 as our earlier work on the perfor-
mance of SMALL [8] has revealed no significant effect of increasing these numbers beyond 16. Each resi-
dent activation is allocated 32 general purpose regis-
ters in the execution pipe. The sizes of the instruction
cache and the high-speed buffer were assumed to be
8K words each. A frame memory of 16K words per
PE was used in the simulation. Lastly, each PE has a
data structure memory of 16K words. The data struc-
tures used in the program are uniformly distributed over the memory modules in all the PEs.
The performance of our architecture is evaluated
using four representative scientific application pro-
grams, namely SAXPBY (5120 elements), Matrix Multiplication (squaring a 24 x 24 matrix), the Livermore Loop 2 (with 4096 elements) and Image Averaging (on 64 x 64 pixels). The SAXPBY program is similar to the Linpack kernel SAXPY, except that SAXPBY computes A[i]*X + B[i]*Y. The Livermore loop is programmed using I-Structures. The image
averaging application is a low-pass filter which uses
a barrier synchronization. All application programs
were hand coded and run on the simulator. In hand-coding the applications, no special optimizations were performed. Also, no effort was made to map the given application onto the PEs. Further, all performance experiments for an application program are conducted with
the same problem size. The instruction mix in these
programs is shown in Table 1. The last column in the
table gives an estimate of the overhead introduced by
following the multithreaded approach. Synchroniza-
tion operations account for synchronization instruc-
tions. Integer operations include all address calcula-
tion instructions as well as instructions that perform
transfer operations between frame-memory and regis-
ters.
Benchmark          Int.    FP      Ld./Store  Ctrl.   Synch.  Over-
                   Ops.    Ops.    Ops.       Ops.    Ops.    head
SAXPBY             53.12   10.87   10.87      1.86    17.27   --
Matrix Mult.       55.05   14.26   14.56      4.18    6.85    5.10
Livermore Loop     45.46   8.33    12.50      5.75    9.76    18.20
Image Averaging    40.69   16.70   13.36      1.10    10.38   17.76

Table 1: Instruction Mix in Application Programs (%-age)
4.2 Performance Results
Throughput vs. Number of PEs
First we evaluate the throughput of our architec-
ture with respect to the various parameters. Through-
put is defined as the total number of instructions that
are completed by all the PEs in an execution cycle. In
Figure 2: Throughput vs. Number of PEs
this experiment, we used a 4-way set-associative DS-
Cache, with a cache-line size of 4 words. In a later
experiment we study the effect of the DS-Cache on
the throughput.
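Stated as a formula (this is our reading of the definition above, which the paper gives only in words):

\[ \textrm{Throughput} \;=\; \frac{1}{C}\sum_{p=1}^{P} I_p , \]

where P is the number of PEs, I_p is the number of instructions completed by PE p over the run, and C is the total number of execution cycles. The average instruction-level parallelism per PE reported later is then this throughput divided by P.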
Fig. 2 plots the overall throughput of our architecture against the number of PEs, where both axes are drawn in the logarithmic scale. We observe that the throughput of our architecture increases almost linearly with the number of PEs for all benchmark programs. When the number of execution pipes is increased from 1 to 2, a considerable improvement in throughput is observed for all benchmark programs. This is true irrespective of the number of PEs in the system. Increasing the number of execution pipes beyond 3 does not result in any improvement in the throughput. An interesting observation that can be made from the plots of Fig. 2 is that the throughput with 2 execution pipes per PE is almost equal to that of a configuration with twice the number of PEs and a single execution pipe.
Average Instruction-Level Parallelism
Benchmark Program     Number of PEs
                      1      2      4      8      16     32     64
SAXPBY                1.97   1.96   1.95   1.88   1.80   1.46   1.12
Matrix Mult.          1.66   1.68   1.68   1.67   1.66   1.37   1.03

Table 2: Average Instruction-Level Parallelism
The average number of instructions executed per
clock cycle per PE is the instruction-level parallelism
exploited by the architecture. Table 2 tabulates the ex-
ploited instruction-level parallelism when 4 execution
pipes per PE were used. As seen from this table, the
instruction-level parallelism never exceeds 2 which in-
dicates that 2 execution pipes are sufficient to fully utilize the synchronizing capabilities of a PE. For Liver-
more Loop and the Image Averaging programs, the ex-
ploited instruction-level parallelism is low. This is due
to high synchronization overheads involved in these
applications.
Effect of DS-Cache on Throughput and
Network Latency
As mentioned earlier, the introduction of the DS-
Cache reduces the network latency and thus improves
the performance of the system. In Fig. 3, we compare
an architecture without the DS-Cache (with 1, 2, or 3 execution pipes) with an architecture with 3 execution pipes and a 128-set, 4-way set-associative DS-Cache for
the SAXPBY program. Other benchmarks exhibit a
similar trend and hence are not shown here. It can
be noticed that the introduction of the DS-Cache im-
proves the throughput significantly, especially when
the number of PEs is large, i.e., greater than 8. This is because, when the number of PEs is small, the par-
allelism available per PE is large enough to tolerate
the remote memory latency. Hence, the presence of
the DS-Cache does not influence the performance in
a significant way. However, with a large number of PEs, the parallelism per PE decreases (since no scal-
ing of application programs is considered here), and
therefore the remote memory access latency becomes
crucial.
It may be observed from Fig. 3 that in the absence
of the DS-Cache, there is no gain in the throughput
by increasing the number of execution pipes. Another
important observation that can be made from Fig. 3 is
Figure 2: (contd.) Throughput vs. Number of PEs

Figure 3: Throughput with and without the DS-Cache

Figure 4: Effect of DS-Cache on Network Latency
that the throughput of the architecture with the DS-
Cache equals the throughput of a configuration with-
out the DS-Cache, but with twice the number of PEs.
Earlier we made a similar observation with respect to
the number of execution pipes. Thus one can infer
that it is the DS-Cache that allows the PEs to exploit
higher instruction-level parallelism.
The introduction of the DS-Cache influences the
performance of the architecture in two ways. First,
it reduces the access time for remote read requests
whenever the corresponding cacheline is present in the
DS-Cache. Secondly, in the event of a hit, the re-
mote requests are serviced by the cache, and hence
the requests do not enter the network. This in turn
reduces the network traffic and the network latency.
In Fig. 4, we plot the average network latency ob-
served in two application programs when the num-
ber of PEs is increased from 2 to 64. In this exper-
iment, we consider architecture configurations, with
and without DS-Cache, but with 3 execution pipes per
PE in both cases. The maximum average latency en-
countered without DS-Cache is more than 2500 time
units for the Image Averaging application and 1500
time units for SAXPBY. These large values indicate
the enormous contention and the queuing delay en-
countered in the network. Further, this shows that
the saturated network latency is two orders of magni-
tude higher than the unloaded network latency. When
the architecture supports DS-Cache, the network la-
tency increases to a maximum value of 600 time units
(roughly) which is only one order of magnitude higher
than the unloaded network latency. Thus supporting
a DS-Cache in the architecture reduces the network la-
tency by nearly a factor of 3 to 6. The higher network
latency values for the image averaging application are
due to the large number of synchronizing messages
used for the barrier synchronization.
The network latency for architecture configurations
with DS-Caches increases initially with the number of PEs and then decreases in Fig. 4. Though this may
seem counter-intuitive, it can be explained in the fol-
lowing manner. When the number of PEs is increased,
the size of the network and the number of network
switches increase, decreasing the extent of contention
in the network. Further as the size of the applica-
tion is not scaled, the available parallelism per PE decreases with an increase in the number of PEs. Thus
the ‘lack’ of work makes the PE wait for the response
to come from the network before the PE can pump additional messages into the network. This in turn
reduces the number of messages (remote memory re-
quests) sent by a PE to the network, which in turn fur-
ther reduces the contention and the network latency.
However, when the number of PEs is 8, each PE had
‘enough’ parallelism to tolerate even a very high net-
work latency. Thus the PES kept pumping more and
more messages into the network, keeping the network
always in saturation.
Effect of Cache Organizations
In this experiment we keep the number of PEs as 32
and the number of execution pipes as 4. We used two
different DS-Cache sizes, viz. 1K words and 2K words
(per PE), even though a much larger cache size can
be supported in practice. There are two reasons for
this. First, no improvement in the overall throughput
was observed for our benchmark programs when the
cache size is increased beyond 2K words. Secondly,
the benchmark programs that were considered have a
smaller problem size compared to real-world examples.
Thus, the smaller DS-Cache assumed in the simulation
provides some kind of a scale-down effect, matching
well with the smaller problem sizes considered in the
simulation experiments. For each of the cache sizes,
namely 1K and 2K words, we considered 2 different
cacheline sizes (4 and 8 words) and 4 different cache
organizations, namely direct-mapped, 2-way, 4-way, and 8-way set-associative organizations.
The throughput of the architecture is once again an
important performance metric to judge the suitabil-
ity of the cache organization. The average number
of times each cache block is accessed before another
cacheline overwrites it, is a measure of the utilization
of the cache block. This is referred to as the average
cache block reuse. The number of times a cache block
cannot be reserved in the DS-Cache due to the non-
availability of a cache block in the corresponding set²,
is referred to as the collisions in the DS-Cache. When-
ever a collision occurs, the request is queued until a
cache block becomes free. The queuing of requests in-
² Recall that this could happen if all the blocks in the appropriate set are waiting, with their pending flags set to 1, for their read requests to be satisfied by the remote data structure memory.
creases the response time for the read and hence will
affect the throughput of the architecture.
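In formula form (again, our reading of the definitions above):

\[ \textrm{average cache block reuse} \;=\; \frac{\textrm{number of DS-Cache accesses satisfied by cached blocks}}{\textrm{number of cachelines brought into the DS-Cache}} , \]

and a collision is counted each time a request finds every block of its target set reserved, i.e., with the pending bit set.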
Table 3 summarizes these performance parameters
for the different cache organizations. From the table,
the following observations can be made.
(1) The direct-mapped cache organization yields a
lower throughput compared to the set-associative or-
ganization. The lower throughput in direct-mapped
caches is an expected result, as direct-mapping pro-
vides less flexibility in mapping a cacheline to a cache
block.
(2) The value for the collision parameter decreases
drastically as the associativity is increased. Also, the
cache block reuse increases to a small extent with the
associativity.
(3) Increasing the cacheline size increases the cache
block reuse factor by a small amount. A small increase
in the throughput of the architecture is also observed
with the increase in the cacheline size.
(4) The throughput of the architecture is nearly the
same for all cache organizations with an associativity
greater than or equal to 2.
5 A Study on Feasibility
In this section we provide approximate estimates for transistor counts of various functional blocks in
our architecture. These estimates are based on com-
parisons with the blocks of similar capability present
in currently available commercial and experimental
microprocessors, namely, Message-Driven Processor
(MDP) [7], SuperSPARC [1] and RS-6000 [14]. MDP
uses a standard cell approach, and hence the corre-
sponding estimates are conservative. We plan to con-
duct VHDL simulations in the future to provide a detailed understanding of the feasibility aspects.
The processor is divided into two parts, namely, on-chip storage on the one hand, and the datapath, control logic, and interface logic on the other. On-chip storage on SuperSPARC and
MDP uses nearly 70-80% of the total transistors, but covers only 30% of the chip area due to its regular struc-
ture. On-chip storage contains an 8K word Instruction
Cache and an 8K word high-speed buffer. Based on
the estimates for a 4K word SRAM in MDP, the 8K word Instruction Cache requires 1.6M transistors. The
high-speed buffer of 8K words is divided into 32 pages
and is fully associative at page level only. We assume
5 ports to the high-speed buffer, one for each (up to 4) ex-
ecution pipe and one for the buffer-loader. A 5-ported
64K Byte on-chip memory uses 4.5M transistors in
RS-6000, so the high-speed buffer will require nearly
3M transistors.
Table 3: Effect of Cache Organization on Performance
The register file contains up to 1024 registers and 8 ports³. Such register files are common in super-
scalar/VLIW processors. The instruction scheduler
contains 32 resources with a total of 128 registers.
A resource uses the space for 4 registers. The ready
thread queue has a length of 10 threads, and uses 3
registers for each thread. Thus thread management
functions use 158 registers with one read and one write
port.
Datapath, control logic and interface logic use the
remaining 20-30% transistors, but consume 70% of
the chip area. Our architecture uses generic execu-
tion pipes, similar to that of an MDP which requires
39K transistors. Buffer and instruction cache loaders
use a counter for loading an activation or a code-block,
once the base address is specified. Otherwise, address
logic is similar to the address arithmetic unit of MDP,
which consumes 75K transistors. Remaining control
and interface logic (for internal and external memory
and chip I/O) uses 56K transistors in MDP.
Thus, the multithreaded processor proposed in this
paper requires approximately 5M transistors (a more
extensive execution pipeline may increase this num-
ber to 6M transistors), a reasonable size for the cur-
rent technology. Now let us consider other functional
units, namely the filter unit and the thread synchro-
nization unit. The filter unit maintains two tables,
one for 32 resident activations (with base addresses of
the frames), and the other for 32 active threads (instruction pointers). The thread synchronization unit maintains a queue of up to 32 threads (3 addresses each) which
do not have resident activations. Logic in these units
is fairly simple and can be implemented as finite state
machines, using conventional approaches like PLAs or
FPGAs.

³ A study in [14] shows that on an average one read and one read/write port is sufficient for one instruction execution.
6 Related Work
Several multithreaded architectures have been pro-
posed in the literature (refer, for example, to [2, 3, 6, 12, 13, 15] and to [11] for a survey). Like the Threaded Abstract Machine (TAM) [6] and *T [15], our ar-
chitecture realizes three levels of program hierarchy
based on synchronization and scheduling. TAM uses
a compiler-controlled approach to achieve fine-grain
parallelism and synchronization. In contrast, we ad-
vocate the use of suitable compilation techniques and
necessary hardware support. Further, our architecture
supports multiple resident activations, which help to
mask the cost of context switching while off-loading a
resident frame.
The processor coupling proposal [12] and the ar-
chitecture proposed by Hirata et al. [9] use dynamic packing of instructions from different threads to exploit instruction-level parallelism. Our approach is similar, ex-
cept that each instruction in a thread in our archi-
tecture contains a single operation. While the use of
multi-operation instructions in a thread in processor
coupling [12] improves the throughput, it makes the
runtime scheduler more complex. Our architecture
also supports a two-level cache structure to achieve
high throughput.
Our work differs from Tera [3] and its predecessors in the following ways: (i) Tera uses long-word instructions, with three operations per instruction, and (ii) no dynamic packing of instructions is performed in Tera. In contrast, the processor in *T [15]
is a superscalar. Our results indicate that the syn-
chronizing capability of a PE can be fully utilized by
using more than one execution pipe.
7 Conclusions
In this paper we have described the design of a scal-
able multithreaded architecture. The salient features
of the architecture are (i) its ability to exploit both
coarse-grain parallelism and fine-grain instruction-
level parallelism, (ii) a distributed DS-Cache which
significantly reduces the network latency and makes
the system scalable, (iii) a high-speed buffer organi-
zation which completely avoids load stalls on access
to local variables of an activation, (iv) a layered ap-
proach to synchronization and scheduling which helps
to achieve very high processor throughput and utilization. The performance of the architecture is evaluated using simulation. Initial simulation results are promising and indicate that:
(1) With a small number (2 or 3) of execution
pipes, the architecture can effectively exploit all the
instruction-level parallelism available at a PE. The
throughput of our architecture with two execution
pipes is almost equal to that of a configuration with
twice the number of PEs and a single execution pipe.
(2) The use of a set-associative cache leads to a near-
linear performance improvement with respect to the
number of PEs. The presence of the DS-Cache reduces the network latency experienced by remote read
requests by a factor of 3 to 6.
(3) The DS-Cache reduces network traffic which is es-
sential in realizing the performance improvements due
to multiple execution pipes.
Acknowledgements
The authors are grateful to the reviewers whose
comments have improved the presentation of this pa-
per. This work was supported by MICRONET - Net-
work Centres of Excellence in Canada.
References
[1] F. Abu-Nofal et al. A three million transistor microprocessor. In Digest of Technical Papers, 1992 IEEE International Solid State Circuits Conference, pages 108-109, Feb. 1992.
[2] A. Agarwal, B-H. Lim, D. Kranz, and J. Kubiatowicz.
APRIL: A processor architecture for multiprocessing.
In Proc. of the 17th Ann. Intl. Symp. on Computer
Architecture, pages 104-114. Seattle, Wash., June
1990.
[3] R. Alverson, D. Callahan, D. Cummings, B. Koblenz,
A. Porterfield, and B. Smith. The Tera computer
system. In Conf. Proc., 1990 Intl. Conf. on Super-
computing, pages 1-6, Amsterdam, The Netherlands,
June 1990.
[4] Arvind, R. S. Nikhil, and K. Pingali. I-structures:
Data structures for parallel computing. ACM Trans.
on Programming Languages and Systems, 11(4):598-
632, Oct. 1989.
[5] H. Cheong and A.V. Veidenbaum. Compiler-directed
cache management in multiprocessors. IEEE Com-
puter, pages 39-47, June 1990.
[6] D.E. Culler et al. Fine-grain parallelism with minimal
hardware support: A compiler-controlled threaded
abstract machine. In Proc. of the 4th Intl. Conf.
on Architectural Support for Programming Languages
and Operating Systems, pages 164-175, Santa Clara,
Calif., April 1991.
[7] W.J. Dally et al. A message-driven processor: A
multicomputer processing node with efficient mech-
anisms. IEEE Micro, pages 24-38, April 1992.
[8] R. Govindarajan and S.S. Nemawarkar. SMALL: A
scalable multithreaded architecture to exploit large
locality. In Proc. of the 4th IEEE Symp. on Parallel
and Distributed Processing, pages 32-39, Dec. 1992.
[9] H. Hirata et al. An elementary processor architecture
with simultaneous instruction issuing from multiple
threads. In Proc. of the 19th Intl. Symp. on Computer
Architecture, pages 136-145. Gold Coast, Australia,
May 1992.
[10] R. A. Iannucci. Toward a dataflow/von Neumann
hybrid architecture. In Proc. of the 15th Ann. Intl.
Symp. on Computer Architecture, pages 131-140,
Honolulu, Hawaii, June 1988.
[11] R. A. Iannucci, G. R. Gao, R. H. Halstead, Jr., and
B. Smith. Multithreaded Computer Architecture: A
Summary of the State of the Art. Kluwer, Norwell,
Mass., 1994.
[12] S.W. Keckler and W.J. Dally. Processor coupling: Integrating compile time and runtime scheduling for parallelism. In Proc. of the 19th Intl. Symp. on Computer Architecture, pages 202-213, Gold Coast, Australia, May 1992.
[13] Y. Kodama, S. Sakai, and Y. Yamaguchi. A proto-
type of a highly parallel dataflow machine EM-4 and
its preliminary evaluation. In Proc. of InfoJapan 90,
pages 291-298, Oct. 1990.
[14] M. Misra. IBM RISC System/6000 Technology, First edition. IBM, Austin, Tx., 1990.
[15] R. S. Nikhil, G. M. Papadopoulos, and Arvind. *T:
A multithreaded massively parallel architecture. In
Proc. of the 19th Ann. Intl. Symp. on Computer
Architecture, pages 156-167, Gold Coast, Australia,
May 1992.
[16] G. M. Papadopoulos and D. E. Culler. Monsoon: an
explicit token-store architecture. In Proc. of the 17th
Ann. Intl. Symp. on Computer Architecture, pages
82-91. Seattle, Wash., June 1990.

More Related Content

What's hot

Analysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific ApplicationsAnalysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific ApplicationsJames McGalliard
 
Memory consistency models
Memory consistency modelsMemory consistency models
Memory consistency models
palani kumar
 
Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...
eSAT Publishing House
 
Compositional Analysis for the Multi-Resource Server
Compositional Analysis for the Multi-Resource ServerCompositional Analysis for the Multi-Resource Server
Compositional Analysis for the Multi-Resource Server
Ericsson
 
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORSSTUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
ijdpsjournal
 
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERS
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERSVTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERS
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERS
vtunotesbysree
 
Lecture 7 cuda execution model
Lecture 7   cuda execution modelLecture 7   cuda execution model
Lecture 7 cuda execution model
Vajira Thambawita
 
Parallel Processing Concepts
Parallel Processing Concepts Parallel Processing Concepts
Parallel Processing Concepts
Dr Shashikant Athawale
 
A Kernel-Level Traffic Probe to Capture and Analyze Data Flows with Priorities
A Kernel-Level Traffic Probe to Capture and Analyze Data Flows with PrioritiesA Kernel-Level Traffic Probe to Capture and Analyze Data Flows with Priorities
A Kernel-Level Traffic Probe to Capture and Analyze Data Flows with Priorities
idescitation
 
Crayl
CraylCrayl
Cray xt3
Cray xt3Cray xt3
Cray xt3
Léia de Sousa
 
Survey_Report_Deep Learning Algorithm
Survey_Report_Deep Learning AlgorithmSurvey_Report_Deep Learning Algorithm
Survey_Report_Deep Learning AlgorithmSahil Kaw
 
Cache memory
Cache memoryCache memory
Cache memory
Eklavya Gupta
 
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor ...
Multithreading: Exploiting Thread-Level  Parallelism to Improve Uniprocessor ...Multithreading: Exploiting Thread-Level  Parallelism to Improve Uniprocessor ...
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor ...
Ahmed kasim
 
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0Sahil Kaw
 
Lecture 3 parallel programming platforms
Lecture 3   parallel programming platformsLecture 3   parallel programming platforms
Lecture 3 parallel programming platforms
Vajira Thambawita
 
Latency aware write buffer resource
Latency aware write buffer resourceLatency aware write buffer resource
Latency aware write buffer resource
ijdpsjournal
 

What's hot (20)

Analysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific ApplicationsAnalysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific Applications
 
Memory consistency models
Memory consistency modelsMemory consistency models
Memory consistency models
 
Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...
 
Compositional Analysis for the Multi-Resource Server
Compositional Analysis for the Multi-Resource ServerCompositional Analysis for the Multi-Resource Server
Compositional Analysis for the Multi-Resource Server
 
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORSSTUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
 
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERS
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERSVTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERS
VTU 5TH SEM CSE OPERATING SYSTEMS SOLVED PAPERS
 
Lecture 7 cuda execution model
Lecture 7   cuda execution modelLecture 7   cuda execution model
Lecture 7 cuda execution model
 
Parallel Processing Concepts
Parallel Processing Concepts Parallel Processing Concepts
Parallel Processing Concepts
 
A Kernel-Level Traffic Probe to Capture and Analyze Data Flows with Priorities
A Kernel-Level Traffic Probe to Capture and Analyze Data Flows with PrioritiesA Kernel-Level Traffic Probe to Capture and Analyze Data Flows with Priorities
A Kernel-Level Traffic Probe to Capture and Analyze Data Flows with Priorities
 
Crayl
CraylCrayl
Crayl
 
Cray xt3
Cray xt3Cray xt3
Cray xt3
 
Survey_Report_Deep Learning Algorithm
Survey_Report_Deep Learning AlgorithmSurvey_Report_Deep Learning Algorithm
Survey_Report_Deep Learning Algorithm
 
Cache memory
Cache memoryCache memory
Cache memory
 
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor ...
Multithreading: Exploiting Thread-Level  Parallelism to Improve Uniprocessor ...Multithreading: Exploiting Thread-Level  Parallelism to Improve Uniprocessor ...
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor ...
 
SoC-2012-pres-2
SoC-2012-pres-2SoC-2012-pres-2
SoC-2012-pres-2
 
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
DeepLearningAlgorithmAccelerationOnHardwarePlatforms_V2.0
 
Lecture 3 parallel programming platforms
Lecture 3   parallel programming platformsLecture 3   parallel programming platforms
Lecture 3 parallel programming platforms
 
S peculative multi
S peculative multiS peculative multi
S peculative multi
 
Latency aware write buffer resource
Latency aware write buffer resourceLatency aware write buffer resource
Latency aware write buffer resource
 
Chap2 slides
Chap2 slidesChap2 slides
Chap2 slides
 

Viewers also liked

Como funciona un virus informático trabajo MERLIS SALINAS LEAL
Como funciona un virus informático trabajo MERLIS SALINAS LEALComo funciona un virus informático trabajo MERLIS SALINAS LEAL
Como funciona un virus informático trabajo MERLIS SALINAS LEAL
MYGUEDVY
 
Cómo Pasar Datos desde el iPhone a Samsung
Cómo Pasar Datos desde el iPhone a SamsungCómo Pasar Datos desde el iPhone a Samsung
Cómo Pasar Datos desde el iPhone a Samsung
Jihosoft
 
Tecnologias aplicadas na educação
Tecnologias aplicadas na educaçãoTecnologias aplicadas na educação
Tecnologias aplicadas na educação
Jonalto Guirra
 
Funcionamiento de los virus informáticos ingrid palacio
Funcionamiento de los virus informáticos ingrid palacioFuncionamiento de los virus informáticos ingrid palacio
Funcionamiento de los virus informáticos ingrid palacio
ingrid margarita palacio bolaño
 
Presentación unidad ii
Presentación unidad iiPresentación unidad ii
Presentación unidad ii
Gloria Viridiana Valencia Cid
 
CS1 Brochure
CS1 BrochureCS1 Brochure
CS1 BrochureTim Arnst
 
Cómo funcionan los virus informáticos
Cómo funcionan los virus informáticosCómo funcionan los virus informáticos
Cómo funcionan los virus informáticos
yulisa del carmen carrasquilla mijares
 
Cómo Pasar Contactos de iPhone a Android
Cómo Pasar Contactos de iPhone a AndroidCómo Pasar Contactos de iPhone a Android
Cómo Pasar Contactos de iPhone a Android
Jihosoft
 
como funcionan los virus informaticos
como funcionan los virus informaticos como funcionan los virus informaticos
como funcionan los virus informaticos
Karoll Perez Hernandez
 
Como funcionan los virus informaticos
Como funcionan los virus informaticosComo funcionan los virus informaticos
Como funcionan los virus informaticos
Jeisson David Santoya Mendoza
 

Viewers also liked (12)

Como funciona un virus informático trabajo MERLIS SALINAS LEAL
Como funciona un virus informático trabajo MERLIS SALINAS LEALComo funciona un virus informático trabajo MERLIS SALINAS LEAL
Como funciona un virus informático trabajo MERLIS SALINAS LEAL
 
Cómo Pasar Datos desde el iPhone a Samsung
Cómo Pasar Datos desde el iPhone a SamsungCómo Pasar Datos desde el iPhone a Samsung
Cómo Pasar Datos desde el iPhone a Samsung
 
Tecnologias aplicadas na educação
Tecnologias aplicadas na educaçãoTecnologias aplicadas na educação
Tecnologias aplicadas na educação
 
Funcionamiento de los virus informáticos ingrid palacio
Funcionamiento de los virus informáticos ingrid palacioFuncionamiento de los virus informáticos ingrid palacio
Funcionamiento de los virus informáticos ingrid palacio
 
Presentación unidad ii
Presentación unidad iiPresentación unidad ii
Presentación unidad ii
 
CS1 Brochure
CS1 BrochureCS1 Brochure
CS1 Brochure
 
Cómo funcionan los virus informáticos
Cómo funcionan los virus informáticosCómo funcionan los virus informáticos
Cómo funcionan los virus informáticos
 
Cómo Pasar Contactos de iPhone a Android
Cómo Pasar Contactos de iPhone a AndroidCómo Pasar Contactos de iPhone a Android
Cómo Pasar Contactos de iPhone a Android
 
como funcionan los virus informaticos
como funcionan los virus informaticos como funcionan los virus informaticos
como funcionan los virus informaticos
 
shashank_mascots1996_00501002
shashank_mascots1996_00501002shashank_mascots1996_00501002
shashank_mascots1996_00501002
 
Como funcionan los virus informaticos
Como funcionan los virus informaticosComo funcionan los virus informaticos
Como funcionan los virus informaticos
 
MS project poster
MS project posterMS project poster
MS project poster
 

Similar to shashank_hpca1995_00386533

DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...
DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...
DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...
Ilango Jeyasubramanian
 
Co question bank LAKSHMAIAH
Co question bank LAKSHMAIAH Co question bank LAKSHMAIAH
Co question bank LAKSHMAIAH
veena babu
 
Cloud Module 3 .pptx
Cloud Module 3 .pptxCloud Module 3 .pptx
Cloud Module 3 .pptx
ssuser41d319
 
CS8603_Notes_003-1_edubuzz360.pdf
CS8603_Notes_003-1_edubuzz360.pdfCS8603_Notes_003-1_edubuzz360.pdf
CS8603_Notes_003-1_edubuzz360.pdf
KishaKiddo
 
Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...
Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...
Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...
CSCJournals
 
CS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docx
CS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docxCS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docx
CS 301 Computer ArchitectureStudent # 1 EID 09Kingdom of .docx
faithxdunce63732
 
Performance evaluation of ecc in single and multi( eliptic curve)
Performance evaluation of ecc in single and multi( eliptic curve)Performance evaluation of ecc in single and multi( eliptic curve)
Performance evaluation of ecc in single and multi( eliptic curve)
Danilo Calle
 
ICT III - MPMC - Answer Key.pdf
ICT III - MPMC - Answer Key.pdfICT III - MPMC - Answer Key.pdf
ICT III - MPMC - Answer Key.pdf
GowriShankar881783
 
Computer architecture
Computer architectureComputer architecture
Computer architecture
PrabhanshuKatiyar1
 
Oversimplified CA
Oversimplified CAOversimplified CA
Oversimplified CA
PrabhanshuKatiyar1
 
Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...
Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...
Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...
Nikhil Jain
 
Fine-grain instruction-level parallelism is exploited by a runtime grouping of instructions from multiple threads and executing them concurrently on the multiple execution pipes available in each PE. Further, in the proposed architecture, we introduce a distributed Data Structure Cache (DS-Cache) organization for shared data structures. Our architecture supports caching of two types of data structures, namely the I-Structures [4], where each location is written at most once, and normal data structures, where an individual location can be written many times. Coherence in the caches is maintained by a special scheme for I-Structures, and by software cache coherence mechanisms for normal data structures.
The performance of the multithreaded architecture is evaluated using discrete-event simulation. Our simulation results indicate:
(1) With a small number (2 or 3) of execution pipes, the architecture can effectively exploit all the parallelism available at a PE.
(2) The throughput with two execution pipes per PE is almost equal to that of a configuration with twice the number of PEs and a single execution pipe.
(3) The introduction of a set-associative cache leads to a near-linear performance improvement with respect to the number of PEs. Further, our simulation results show that the presence of the DS-Cache is essential to realize the performance gains due to multiple execution pipes.
(4) Lastly, the presence of the DS-Cache reduces the network latency experienced by remote read requests by a factor of 3 to 6.

The details of the architecture are described in Section 2. The cache organization and the protocol for caching I-Structures are discussed in the subsequent section. Section 4 deals with the performance evaluation of our architecture using various benchmark programs. In Section 5 we investigate the feasibility of the architecture by comparing the functionality of various modules with standard modules present in commercially available processors. We compare our work with other multithreaded architectures in Section 6. Concluding remarks are presented in Section 7.

2 The Architecture

The execution model of our architecture, which is similar to that proposed in the Threaded Abstract Machine [6], is discussed in [8]. Readers are referred to [8] for details.

The architecture consists of a number of Processing Elements (PEs) connected by an interconnection network. For the purposes of this paper we consider the binary n-cube network as the topology of the interconnection network. In this section, we describe the organization of a PE (refer to Fig. 1) and all functional units except the DS-Cache. The processor part of the architecture is enclosed in a dotted box in the figure.

Figure 1: Organization of a Processing Element

2.1 Frame Memory Unit

The activation frames of a program are stored in the frame memory unit. The frame memory unit also contains a simple memory manager which is responsible for managing the frame memory. On a function invocation, the frame memory manager allocates a frame of appropriate size in the frame memory. Input arguments to an activation also reach the frame memory unit in the form of tokens, and the values are written into the respective frame memory locations by the memory manager. A signal indicating the arrival of an operand is sent to the thread synchronization unit, where the synchronization of input arguments to a thread is performed. The frame memory unit also receives a signal from the execution pipe when an activation terminates; in that case, the frame memory manager deallocates the corresponding frame.

2.2 Thread Synchronization Unit

The thread synchronization unit performs the synchronization of inputs to a thread. It is similar to the explicit token store matching unit of the Monsoon architecture [16]. A thread becomes enabled when it has received all its input arguments. The activation containing this thread becomes enabled, if the activation is not already enabled. An enabled thread is sent to the filter unit.
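To make the frame memory unit's role concrete, the following C sketch shows one plausible way its memory manager could handle an incoming argument token: write the operand into the activation frame and signal the thread synchronization unit. This is only an illustration of the behaviour described above; all names (frame_t, token_t, signal_sync_unit, frame_memory_on_token) and field layouts are our own assumptions, not part of the paper.

```c
#include <stdint.h>
#include <stdio.h>

#define FRAME_MEM_WORDS (16 * 1024)   /* 16K-word frame memory per PE (Section 4.1) */

/* Hypothetical descriptors; field names are illustrative only. */
typedef struct { uint32_t base, size; int in_use; } frame_t;
typedef struct { uint32_t activation, slot, thread; int64_t value; } token_t;

static int64_t frame_memory[FRAME_MEM_WORDS];

/* Stub: in the real PE this signal goes to the thread synchronization unit. */
static void signal_sync_unit(uint32_t activation, uint32_t thread)
{
    printf("operand arrived: activation %u, thread %u\n", activation, thread);
}

/* An argument token arrives: write the value into the activation frame and
   notify the synchronization unit that one more input is present. */
static void frame_memory_on_token(const frame_t *frames, const token_t *t)
{
    const frame_t *f = &frames[t->activation];
    frame_memory[f->base + t->slot] = t->value;
    signal_sync_unit(t->activation, t->thread);
}
```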
2.3 Filter Unit

If the incoming thread belongs to an activation whose activation frame is already resident in the high-speed buffer (refer to Section 2.4), the filter unit forwards the thread to the scoreboard unit. If the thread belongs to an activation which is not resident, the filter unit checks whether the activation can be loaded into the high-speed buffer. If so, it instructs the frame memory unit to load the activation frame into the high-speed buffer. When the activation has been successfully loaded, it is said to be resident (in the high-speed buffer), and the incoming thread is sent to the scoreboard unit. On the other hand, if the activation corresponding to the incoming thread cannot be loaded, the thread is queued inside the filter unit until an activation is freed. The filter unit then proceeds to service other threads in its input queue.
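A minimal sketch of the filter unit's decision policy follows. The helper functions (is_resident, buffer_has_free_slot, request_frame_load, and the pending queue) are hypothetical; the real unit is a hardware state machine, and this only mirrors the policy described above.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t activation, thread; } enabled_thread_t;

extern bool is_resident(uint32_t activation);          /* frame in high-speed buffer? */
extern bool buffer_has_free_slot(void);                /* room for one more frame?    */
extern void request_frame_load(uint32_t activation);   /* ask frame memory to load it */
extern void send_to_scoreboard(enabled_thread_t t);
extern void enqueue_pending(enabled_thread_t t);       /* wait until a frame is freed */

void filter_unit_accept(enabled_thread_t t)
{
    if (is_resident(t.activation)) {
        send_to_scoreboard(t);               /* frame already in the buffer          */
    } else if (buffer_has_free_slot()) {
        request_frame_load(t.activation);    /* make the activation resident         */
        send_to_scoreboard(t);               /* forwarded once the load completes    */
    } else {
        enqueue_pending(t);                  /* retried when an activation is freed  */
    }
}
```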
2.4 High-Speed Buffer

The high-speed buffer in our architecture is a multiported, cache-like memory with a single-cycle access time. On a request from the filter unit, the frame memory loads a frame into the high-speed buffer with the help of the buffer loader. When there are no enabled threads in a resident activation, the high-speed buffer receives a signal from the scoreboard unit either to flush or to off-load a frame from the high-speed buffer. If the request is to off-load the frame, its contents are stored back in the frame memory. A flush request indicates that the corresponding activation has terminated and therefore there is no need to save the frame.

2.5 Scoreboard Unit

The scoreboard unit performs bookkeeping operations, such as tracking how many threads in an activation are currently enabled. When the number of enabled threads in an activation becomes zero, it instructs the high-speed buffer to either flush or off-load the corresponding activation frame. Threads arriving at the scoreboard unit are sent to the Ready Thread Queue after the bookkeeping operations.

2.6 Ready Thread Queue

The Ready Thread Queue maintains a pool of threads which await certain resources in the instruction scheduler. A resource corresponds to a set of registers, namely a program counter, a base-address register for the activation frame, an intermediate instruction register, an instruction register, and a stall register. The role of these registers is explained in the following subsection on the instruction scheduler. When a resource becomes available, the instruction scheduler takes a thread from the pool and allocates the resource to it. A thread that has acquired a resource is said to be active.

2.7 Instruction Scheduler

The instruction scheduler consists of two units, the fetch unit and the schedule unit. The instruction fetch unit receives a thread from the ready thread queue and loads the program counter allocated to it. The instruction pointed to by the program counter is fetched from the instruction cache. The instruction is stored in the intermediate instruction register and the associated stall count is loaded into the stall register. (At compile time, a stall count is associated with each instruction; it indicates how many stall cycles are needed, to take care of data dependences in pipelined execution, before the next instruction in this thread can be initiated.) The set of intermediate instruction registers together forms the instruction window for the multiple execution pipes.

The schedule unit, at each execution cycle, checks the intermediate instruction registers. It selects up to n instructions for execution, where n is the number of execution pipes. The selected instructions are moved to the instruction registers and are initiated in the execution pipes in the following cycle. The associated stall registers are decremented every cycle until they become zero. The corresponding program counters are then incremented and the next instructions from those threads are fetched.
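The per-cycle selection made by the schedule unit can be sketched as follows, under the assumption that each of the 32 resources holds one intermediate instruction register and one stall register. The structure, the helper functions and the choice of two pipes are ours; the paper specifies only the policy, not an implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_RESOURCES 32
#define NUM_PIPES      2      /* 2-3 pipes suffice per the simulation results */

typedef struct {
    bool     valid;           /* an active thread owns this resource            */
    uint32_t pc;              /* program counter of the thread                  */
    uint32_t intermediate_ir; /* instruction waiting in the instruction window  */
    uint32_t stall;           /* remaining stall cycles before it may issue     */
} resource_t;

extern void     issue_to_pipe(int pipe, uint32_t instr);  /* start execution    */
extern uint32_t icache_fetch(uint32_t pc);                /* next instruction   */

void schedule_one_cycle(resource_t r[NUM_RESOURCES])
{
    int issued = 0;
    for (int i = 0; i < NUM_RESOURCES && issued < NUM_PIPES; i++) {
        if (r[i].valid && r[i].stall == 0) {
            issue_to_pipe(issued++, r[i].intermediate_ir);
            r[i].pc++;                                    /* advance the thread  */
            r[i].intermediate_ir = icache_fetch(r[i].pc); /* refill the window   */
            /* the new instruction's stall count would be loaded here as well   */
        }
    }
    for (int i = 0; i < NUM_RESOURCES; i++)               /* count down stalls   */
        if (r[i].valid && r[i].stall > 0)
            r[i].stall--;
}
```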
2.8 Instruction Cache

The instruction cache unit is similar to the instruction cache in conventional machines. Only the code-blocks corresponding to resident activations are present in the instruction cache.

2.9 Execution Pipe

Each PE consists of a number of instruction execution pipes. The execution pipes used in our architecture are generic in nature, each capable of performing any operation, and they are assumed to be fully pipelined. The execution pipes are load/store in nature: every operation other than a memory load or store uses register operands. Since an activation frame is pre-loaded into the high-speed buffer before instructions of that activation are scheduled for execution, it is guaranteed that a load operation (on a frame location) does not cause any load stalls. The execution pipes share a register file. The register file is logically divided into a number of register banks, one corresponding to each resident activation. In the execution model, the logical register name specified in an instruction is used as an offset within the register bank.

The instruction set of our architecture includes special instructions to invoke a new activation, to signal the termination of a thread or of an activation, and to communicate results to an external activation. An execution pipe sends a message to the router unit whenever it has to (i) access the data structure memory (either local or remote), (ii) communicate arguments to an activation, or (iii) invoke an activation.
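The register file organization of Section 2.9 amounts to a simple base-plus-offset mapping. The sketch below assumes 32 resident activations with 32 registers each, matching the simulation parameters of Section 4.1; the function name and layout are illustrative only.

```c
#include <stdint.h>

#define NUM_RESIDENT   32
#define REGS_PER_BANK  32

static int64_t register_file[NUM_RESIDENT * REGS_PER_BANK];

/* activation_id selects the bank; logical_reg is the register name encoded in
 * the instruction, used as an offset within that bank. */
static inline int64_t *reg(uint32_t activation_id, uint32_t logical_reg)
{
    return &register_file[activation_id * REGS_PER_BANK + logical_reg];
}
```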
2.10 Router Unit

The router unit receives messages from the execution pipes, from the data structure memory, or from the network switches. The messages are appropriately routed to the frame memory, the data structure memory, the DS-Cache, or the network switch.

2.11 Network Switch

The PEs of our architecture are connected by a binary n-cube interconnection network, which provides high network bandwidth. In this paper, we focus on the use of a fixed routing scheme along the shortest path to the destination.

3 Cache Organization

In our architecture the data structure memory is shared and is distributed across the PEs. The data structure elements are uniformly distributed across the PEs. A set of remote memory locations, called a cacheline, can be cached in the Data Structure Cache (DS-Cache) of the PE. The DS-Cache can have either a direct-mapped or a k-way set-associative organization; for expository purposes, in this section we consider the k-way set-associative organization. The DS-Cache consists of two parts, one for I-Structures, whose elements are written only once, and the other for read-write data structures; the respective caches are referred to as the IS-Cache and the RW-Cache. A remote memory access request arriving at the router unit is sent to the DS-Cache. If the corresponding cacheline is present in the local cache, the location is accessed from the cache and the contents are sent to the router unit. Otherwise a request to fetch the cacheline is sent to the router, which in turn forwards the request to the appropriate PE through the network switch. In the following two subsections we discuss the issues in caching shared data structures and the policies followed by the DS-Cache.
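One possible decomposition of a shared data-structure address, consistent with the uniform distribution of elements over the PEs and with the cacheline and set parameters used later in the experiments, is sketched below. The interleaving scheme and field widths are our assumptions; the paper does not state how addresses are decoded.

```c
#include <stdint.h>

#define NUM_PES        32
#define LINE_WORDS      4          /* 4-word cachelines used in Section 4.2 */
#define NUM_SETS      128          /* 128-set, 4-way DS-Cache configuration */

typedef struct {
    uint32_t home_pe;    /* PE whose data structure memory holds the word */
    uint32_t line;       /* cacheline number within that PE's memory      */
    uint32_t set;        /* DS-Cache set the line maps to                 */
    uint32_t offset;     /* word offset within the cacheline              */
} ds_addr_t;

static ds_addr_t decode_ds_address(uint32_t word_addr)
{
    ds_addr_t a;
    a.home_pe = word_addr % NUM_PES;         /* uniform distribution over PEs */
    uint32_t local = word_addr / NUM_PES;    /* word index on the home PE     */
    a.line   = local / LINE_WORDS;
    a.offset = local % LINE_WORDS;
    a.set    = a.line % NUM_SETS;
    return a;
}
```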
3.1 Caching Read-Write Data Structures

Whenever an access to a remote memory location arrives at the DS-Cache, the cache memory is searched for the cacheline. If a hit occurs, the access request is satisfied by the RW-Cache. On a miss, a free cache block is reserved for this cacheline, and the tag and reference count fields of the cache block are set appropriately. A pending bit associated with the cache block is set to 1; it indicates that a request for the cacheline has been sent on the network and the response is awaited. When the response is received, the pending bit is reset. The reserved cacheline, together with the pending bit, prevents multiple requests for the same cacheline from being sent to the network: such requests are queued inside the RW-Cache until the pending bit is reset. When the requested cacheline arrives, the pending bit is reset and the waiting requests are serviced by sending the appropriate data values.

A consequence of reserving cache blocks at the time of a miss is that all cache blocks in a set may be reserved, awaiting responses from remote locations. In this case, new requests arriving at the RW-Cache are queued until a cache block becomes free. We refer to such a case as an overflow or collision in the RW-Cache.

All writes to a read-write structure are performed only in the structure memory. In order to maintain coherence in the caches for read-write structures, we follow a selective invalidation scheme similar to the one proposed in [5]. A special invalidation instruction is executed prior to accessing a data structure memory location if there is a possibility that the corresponding cacheline was modified after its last access. The invalidation message is sent to the RW-Cache, which invalidates the cacheline if it is present in the local RW-Cache. However, due to cache replacements the cacheline may not be present in the RW-Cache, in which case the invalidation signal is simply ignored.
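The RW-Cache read path described above can be summarized in the following sketch, assuming a cache block record with a tag, a pending bit, a reference count and a per-block queue of waiting requests. Helper names (send_line_request, reply_to_requester, enqueue) and the exact bookkeeping are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS        4
#define LINE_WORDS  4

typedef struct request { uint32_t addr, requester; struct request *next; } request_t;

typedef struct {
    bool       valid, pending;      /* pending: fetch issued, reply awaited   */
    uint32_t   tag, refcount;
    int64_t    data[LINE_WORDS];
    request_t *waiters;             /* requests queued until the line arrives */
} rw_block_t;

extern void send_line_request(uint32_t line);                 /* to the home PE */
extern void reply_to_requester(uint32_t requester, int64_t v);
extern void enqueue(request_t **q, request_t *r);

void rw_cache_read(rw_block_t set[WAYS], uint32_t tag, request_t *req)
{
    for (int w = 0; w < WAYS; w++) {
        rw_block_t *b = &set[w];
        if (b->valid && b->tag == tag) {
            if (b->pending) { enqueue(&b->waiters, req); return; } /* fetch underway */
            b->refcount++;
            reply_to_requester(req->requester, b->data[req->addr % LINE_WORDS]);
            return;                                               /* hit             */
        }
    }
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid) {                      /* miss: reserve a free block      */
            set[w].valid = true;  set[w].pending = true;
            set[w].tag = tag;     set[w].refcount = 0;
            set[w].waiters = NULL;
            enqueue(&set[w].waiters, req);
            send_line_request(tag);               /* the response resets the pending bit */
            return;
        }
    }
    /* All blocks in the set are reserved: an overflow/collision; the request
       waits (in a queue managed by the caller) until a block becomes free. */
}
```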
3.2 Caching I-Structures

A request to a remote I-Structure location is first searched set-associatively in the IS-Cache. If the cacheline is found in the IS-Cache, then depending on whether the location is full or empty, the request is either satisfied or suspended. A suspended request is queued in the IS-Cache. On a miss, a cache block is reserved as is done in the RW-Cache, and a pending bit associated with the cache block prevents further requests for the pending cacheline from being sent to the network.

At the remote I-Structure memory, an access to a cacheline results in the following actions. The requested cacheline is fetched and sent to the requesting PE. For each empty location in the cacheline, an entry indicating a pending read is queued. This is because a remote PE has cached this cacheline and may try to access some of the empty locations from its IS-Cache; those requests would get queued in the remote PE. Therefore, to release such requests, a wake-up message is sent from the I-Structure memory whenever any of these empty locations becomes full. It may well be the case that the remote PE has not accessed the location, and, further, may even have discarded the cached block; nonetheless the wake-up message is sent to the remote PE. If the remote PE has discarded the cached block, it simply ignores the message. On the other hand, if the remote PE still retains the cached block in its IS-Cache, the value passed along with the wake-up message is written into the appropriate cache location, and any pending requests in the remote PE are reactivated.

A write to an I-Structure location is sent to the I-Structure memory; an I-Structure write is never performed at the IS-Cache. This avoids inconsistent states and multiple writes in the I-Structure memory. Since an I-Structure location can be written at most once, it becomes a read-only structure once it is written, and hence does not cause any coherence problem.
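The IS-Cache read path and the handling of a wake-up message can be sketched as follows, assuming each cached I-Structure element carries a presence (full/empty) bit and a list of deferred reads. The data structures and helper names are our own illustration of the protocol described above, not the paper's implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_WORDS 4

typedef struct deferred { uint32_t thread; struct deferred *next; } deferred_t;

typedef struct {
    bool        full;       /* presence bit of the I-Structure element        */
    int64_t     value;
    deferred_t *readers;    /* reads suspended until the element is written   */
} is_elem_t;

typedef struct { bool valid, pending; uint32_t tag; is_elem_t elem[LINE_WORDS]; } is_block_t;

extern void deliver_value(uint32_t thread, int64_t v);
extern void defer_read(is_elem_t *e, uint32_t thread);

/* Read of a cached I-Structure element: satisfied if full, suspended if empty. */
void is_cache_read(is_block_t *b, uint32_t offset, uint32_t thread)
{
    is_elem_t *e = &b->elem[offset];
    if (e->full) deliver_value(thread, e->value);
    else         defer_read(e, thread);        /* released by a wake-up message */
}

/* Wake-up from the home I-Structure memory: the element has just been written.
 * If the block was meanwhile discarded, the caller simply drops the message.  */
void is_cache_wakeup(is_block_t *b, uint32_t offset, int64_t value)
{
    is_elem_t *e = &b->elem[offset];
    e->full  = true;
    e->value = value;
    for (deferred_t *d = e->readers; d != NULL; d = d->next)
        deliver_value(d->thread, value);       /* reactivate pending reads      */
    e->readers = NULL;
}
```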
4 Performance Evaluation

In this section we evaluate the performance of our architecture using discrete-event simulation.

4.1 Simulator Details

A function-level simulator for the architecture described in Section 2, with the DS-Cache organization, was written in C to run on a Unix platform. A constant processing time, in number of clock cycles, was assigned to each functional unit.

The number of PEs, the number of execution pipes per PE, the size of the DS-Cache, its organization, and the size of a cache block can be varied in each simulation run. The number of resident activations and the number of resources available in a PE were each assumed to be 32, as our earlier work on the performance of SMALL [8] revealed no significant effect from increasing these numbers beyond 16. Each resident activation is allocated 32 general-purpose registers in the execution pipe. The sizes of the instruction cache and the high-speed buffer were assumed to be 8K words each. A frame memory of 16K words per PE was used in the simulation. Lastly, each PE has a data structure memory of 16K words. The data structures used in a program are uniformly distributed over the memory modules in all the PEs.

The performance of our architecture is evaluated using four representative scientific application programs, namely SAXPBY (5120 elements), Matrix Multiplication (squaring a 24 x 24 matrix), Livermore Loop 2 (with 4096 elements) and Image Averaging (on 64 x 64 pixels). The SAXPBY program is similar to the Linpack kernel SAXPY, except that SAXPBY computes A[i]*X + B[i]*Y. The Livermore loop is programmed using I-Structures. The image averaging application is a low-pass filter which uses a barrier synchronization. All application programs were hand coded and run on the simulator. In hand-coding the applications no special optimizations were performed, and no effort was made to map a given application onto the PEs. Further, all performance experiments for an application program are conducted with the same problem size.

The instruction mix in these programs is shown in Table 1. The last column of the table gives an estimate of the overhead introduced by following the multithreaded approach. Synchronization operations account for synchronization instructions. Integer operations include all address calculation instructions as well as instructions that perform transfer operations between the frame memory and registers.

Table 1: Instruction Mix in Application Programs (percentages)

Benchmark          Int. Ops.  Flt. Ops.  Ld./Store Ops.  Ctrl. Ops.  Synch. Ops.  Overhead
SAXPBY               53.12       --          10.87          10.87        1.86       17.27
Matrix Mult.         55.05      14.26        14.56           4.18        6.85        5.10
Livermore Loop       45.46       8.33        12.50           5.75        9.76       18.20
Image Averaging      40.69      16.70        13.36           1.10       10.38       17.76

4.2 Performance Results

Throughput vs. Number of PEs

First we evaluate the throughput of our architecture with respect to the various parameters. Throughput is defined as the total number of instructions that are completed by all the PEs in an execution cycle.
In this experiment, we used a 4-way set-associative DS-Cache with a cacheline size of 4 words; the effect of the DS-Cache on the throughput is studied in a later experiment.

Figure 2: Throughput vs. number of PEs

Fig. 2 plots the overall throughput of our architecture against the number of PEs, with both axes drawn on a logarithmic scale. We observe that the throughput of our architecture increases almost linearly with the number of PEs for all benchmark programs. When the number of execution pipes is increased from 1 to 2, a considerable improvement in throughput is observed for all benchmark programs, irrespective of the number of PEs in the system. Increasing the number of execution pipes beyond 3 does not result in any improvement in the throughput. An interesting observation that can be made from the plots of Fig. 2 is that the throughput with 2 execution pipes per PE is almost equal to that of a configuration with twice the number of PEs and a single execution pipe.

Average Instruction-Level Parallelism

The average number of instructions executed per clock cycle per PE is the instruction-level parallelism exploited by the architecture. Table 2 tabulates the exploited instruction-level parallelism when 4 execution pipes per PE were used.

Table 2: Average Instruction-Level Parallelism

Number of PEs     1     2     4     8     16    32    64
SAXPBY           1.97  1.96  1.95  1.88  1.80  1.46  1.12
Matrix Mult.     1.66  1.68  1.68  1.67  1.66  1.37  1.03

As seen from this table, the instruction-level parallelism never exceeds 2, which indicates that 2 execution pipes are sufficient to fully utilize the synchronizing capabilities of a PE. For the Livermore Loop and Image Averaging programs, the exploited instruction-level parallelism is low; this is due to the high synchronization overheads involved in these applications.
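The two metrics used here are related by a simple division. The snippet below shows how they could be derived from raw simulator counters; the counter names and example values are ours, purely to make the definitions concrete.

```c
#include <stdio.h>

/* Hypothetical counters accumulated by the simulator during one run. */
struct sim_stats {
    unsigned long long instructions_completed;  /* summed over all PEs    */
    unsigned long long cycles;                  /* total execution cycles */
    unsigned           num_pes;
};

int main(void)
{
    struct sim_stats s = { 3200000ULL, 100000ULL, 32 };  /* illustrative inputs only */

    /* Throughput: instructions completed by all PEs per execution cycle.         */
    double throughput = (double)s.instructions_completed / (double)s.cycles;

    /* Instruction-level parallelism exploited: instructions per cycle per PE.    */
    double ilp_per_pe = throughput / (double)s.num_pes;

    printf("throughput = %.2f instr/cycle, ILP per PE = %.2f\n",
           throughput, ilp_per_pe);
    return 0;
}
```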
Figure 2 (contd.): Throughput vs. number of PEs
Figure 3: Throughput with and without the DS-Cache
Figure 4: Effect of the DS-Cache on network latency

Effect of DS-Cache on Throughput and Network Latency

As mentioned earlier, the introduction of the DS-Cache reduces the network latency and thus improves the performance of the system. In Fig. 3, we compare an architecture without the DS-Cache (with 1, 2, or 3 execution pipes) against an architecture with 3 execution pipes and a 128-set, 4-way set-associative DS-Cache for the SAXPBY program. Other benchmarks exhibit a similar trend and hence are not shown here. It can be seen that the introduction of the DS-Cache improves the throughput significantly, especially when the number of PEs is large, i.e., greater than 8. When the number of PEs is small, the parallelism available per PE is large enough to tolerate the remote memory latency, so the presence of the DS-Cache does not influence the performance in a significant way. However, with a large number of PEs, the parallelism per PE decreases (since no scaling of the application programs is considered here), and therefore the remote memory access latency becomes crucial.

It may also be observed from Fig. 3 that in the absence of the DS-Cache there is no gain in throughput from increasing the number of execution pipes. Another important observation from Fig. 3 is that the throughput of the architecture with the DS-Cache equals the throughput of a configuration without the DS-Cache but with twice the number of PEs. Earlier we made a similar observation with respect to the number of execution pipes. Thus one can infer that it is the DS-Cache that allows the PEs to exploit higher instruction-level parallelism.

The introduction of the DS-Cache influences the performance of the architecture in two ways. First, it reduces the access time for remote read requests whenever the corresponding cacheline is present in the DS-Cache. Second, in the event of a hit, the remote requests are serviced by the cache and hence do not enter the network; this in turn reduces the network traffic and the network latency. In Fig. 4, we plot the average network latency observed for two application programs when the number of PEs is increased from 2 to 64. In this experiment, we consider architecture configurations with and without the DS-Cache, but with 3 execution pipes per PE in both cases. The maximum average latency encountered without the DS-Cache is more than 2500 time units for the Image Averaging application and 1500 time units for SAXPBY. These large values indicate the enormous contention and queuing delay encountered in the network; further, they show that the saturated network latency is two orders of magnitude higher than the unloaded network latency. When the architecture supports the DS-Cache, the network latency increases to a maximum of roughly 600 time units, which is only one order of magnitude higher than the unloaded network latency. Thus supporting a DS-Cache in the architecture reduces the network latency by nearly a factor of 3 to 6. The higher network latency values for the Image Averaging application are due to the large number of synchronizing messages used for the barrier synchronization.

The network latency for architecture configurations with DS-Caches in Fig. 4 increases initially with the number of PEs and then decreases. Though this may seem counter-intuitive, it can be explained in the following manner.
When the number of PEs is increased, the size of the network and the number of network switches increase, decreasing the extent of contention in the network. Further, as the size of the application is not scaled, the available parallelism per PE decreases with an increase in the number of PEs. The resulting 'lack' of work makes a PE wait for responses to come back from the network before it can pump additional messages into the network. This reduces the number of messages (remote memory requests) sent by a PE to the network, which in turn further reduces the contention and the network latency. However, when the number of PEs is 8, each PE had 'enough' parallelism to tolerate even a very high network latency; the PEs therefore kept pumping more and more messages into the network, keeping the network always in saturation.

Effect of Cache Organizations

In this experiment we keep the number of PEs at 32 and the number of execution pipes at 4. We used two different DS-Cache sizes, viz. 1K words and 2K words (per PE), even though a much larger cache size can be supported in practice. There are two reasons for this. First, no improvement in the overall throughput was observed for our benchmark programs when the cache size was increased beyond 2K words. Second, the benchmark programs considered here have a smaller problem size than real-world examples; the smaller DS-Cache assumed in the simulation thus provides a scale-down effect, matching the smaller problem sizes considered in the simulation experiments. For each of the cache sizes, namely 1K and 2K words, we considered two different cacheline sizes (4 and 8 words) and four different cache organizations, namely direct-mapped, 2-way, 4-way, and 8-way set-associative organizations.

The throughput of the architecture is once again an important performance metric to judge the suitability of a cache organization. The average number of times each cache block is accessed before another cacheline overwrites it is a measure of the utilization of the cache block; it is referred to as the average cache block reuse. The number of times a cache block cannot be reserved in the DS-Cache due to the non-availability of a cache block in the corresponding set (recall that this can happen if all the blocks in the appropriate set are waiting, with their pending flags set to 1, for their read requests to be satisfied by the remote data structure memory) is referred to as the collisions in the DS-Cache. Whenever a collision occurs, the request is queued until a cache block becomes free. The queuing of requests increases the response time for the read and hence affects the throughput of the architecture.
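The two bookkeeping quantities just defined could be gathered in a simulator roughly as follows; the block record, counter names and replacement policy details are our assumptions, added only to make the definitions of reuse and collision concrete.

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4

typedef struct {
    bool     valid, pending;
    uint32_t tag;
    uint32_t accesses;        /* hits to this block since it was last filled  */
} block_t;

static unsigned long long reuse_total, reuse_blocks;   /* for average block reuse   */
static unsigned long long collisions;                  /* set full of pending blocks */

/* Fold a block's access count into the running reuse average when it is overwritten. */
static void account_reuse(block_t *b)
{
    reuse_total += b->accesses;
    reuse_blocks++;
    b->accesses = 0;
}

/* Try to reserve a block in the set for a missing cacheline.  A collision is
 * recorded when every block is still pending on an earlier outstanding fetch. */
static bool try_reserve(block_t set[WAYS], uint32_t tag)
{
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].pending) {                 /* free or replaceable block    */
            if (set[w].valid) account_reuse(&set[w]);
            set[w].valid = true;  set[w].pending = true;
            set[w].tag = tag;     set[w].accesses = 0;
            return true;
        }
    }
    collisions++;                              /* the request must be queued   */
    return false;
}
```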
Table 3: Effect of cache organization on performance

Table 3 summarizes these performance parameters for the different cache organizations. From the table, the following observations can be made.
(1) The direct-mapped cache organization yields a lower throughput than the set-associative organizations. The lower throughput of direct-mapped caches is an expected result, as direct mapping provides less flexibility in mapping a cacheline to a cache block.
(2) The value of the collision parameter decreases drastically as the associativity is increased. Also, the cache block reuse increases to a small extent with the associativity.
(3) Increasing the cacheline size increases the cache block reuse factor by a small amount. A small increase in the throughput of the architecture is also observed with the increase in the cacheline size.
(4) The throughput of the architecture is nearly the same for all cache organizations with an associativity greater than or equal to 2.

5 A Study on Feasibility

In this section we provide approximate estimates of the transistor counts of the various functional blocks in our architecture. These estimates are based on comparisons with blocks of similar capability present in currently available commercial and experimental microprocessors, namely the Message-Driven Processor (MDP) [7], SuperSPARC [1] and the RS/6000 [14]. The MDP uses a standard-cell approach, and hence the corresponding estimates are conservative. We plan to conduct VHDL simulations in the future to provide a more detailed understanding of the feasibility aspects.

The processor is divided into two parts: (i) on-chip storage, and (ii) datapath, control logic and interface logic. On-chip storage on SuperSPARC and the MDP uses nearly 70-80% of the total transistors, but covers only 30% of the chip area due to its regular structure. The on-chip storage comprises an 8K-word instruction cache and an 8K-word high-speed buffer. Based on the estimates for a 4K-word SRAM in the MDP, the 8K-word instruction cache requires 1.6M transistors. The high-speed buffer of 8K words is divided into 32 pages and is fully associative at the page level only. We assume 5 ports to the high-speed buffer, one for each of the (up to 4) execution pipes and one for the buffer loader. A 5-ported 64K-byte on-chip memory uses 4.5M transistors in the RS/6000, so the high-speed buffer will require nearly 3M transistors.
The register file contains up to 1024 registers and 8 ports. (A study in [14] shows that, on average, one read port and one read/write port are sufficient for one instruction execution.) Such register files are common in superscalar/VLIW processors. The instruction scheduler contains 32 resources with a total of 128 registers; a resource uses the space of 4 registers. The ready thread queue has a length of 10 threads and uses 3 registers for each thread. Thus the thread management functions use 158 registers with one read and one write port.

The datapath, control logic and interface logic use the remaining 20-30% of the transistors, but consume 70% of the chip area. Our architecture uses generic execution pipes, similar to that of the MDP, which requires 39K transistors. The buffer and instruction cache loaders use a counter for loading an activation or a code-block once the base address is specified; otherwise, the address logic is similar to the address arithmetic unit of the MDP, which consumes 75K transistors. The remaining control and interface logic (for internal and external memory and chip I/O) uses 56K transistors in the MDP.

Thus, the multithreaded processor proposed in this paper requires approximately 5M transistors (a more extensive execution pipeline may increase this number to 6M transistors), a reasonable size for the current technology. Now let us consider the other functional units, namely the filter unit and the thread synchronization unit. The filter unit maintains two tables, one for the 32 resident activations (with the base addresses of their frames) and the other for the 32 active threads (instruction pointers). The thread synchronization unit maintains a queue of up to 32 threads (3 addresses each) which do not have resident activations. The logic in these units is fairly simple and can be implemented as finite state machines, using conventional approaches such as PLAs or FPGAs.

6 Related Work

Several multithreaded architectures have been proposed in the literature (refer, for example, to [2, 3, 6, 12, 13, 15] and to [11] for a survey). Like the Threaded Abstract Machine (TAM) [6] and *T [15], our architecture realizes three levels of program hierarchy based on synchronization and scheduling. TAM uses a compiler-controlled approach to achieve fine-grain parallelism and synchronization; in contrast, we advocate the use of suitable compilation techniques together with the necessary hardware support. Further, our architecture supports multiple resident activations, which help to mask the cost of context switching while off-loading a resident frame.

The processor coupling proposal [12] and the architecture proposed by Hirata et al. [9] use dynamic packing of instructions from different threads to exploit instruction-level parallelism. Our approach is similar, except that each instruction in a thread in our architecture contains a single operation. While the use of multi-operation instructions in a thread in processor coupling [12] improves the throughput, it makes the runtime scheduler more complex. Our architecture also supports a two-level cache structure to achieve high throughput.

Our work differs from Tera [3] and its predecessors in the following ways: (i) Tera uses long-word instructions (three operations per instruction), and (ii) no dynamic packing of instructions is performed on Tera. In contrast, the processor in *T [15] is a superscalar. Our results indicate that the synchronizing capability of a PE can be fully utilized by using more than one execution pipe.
7 Conclusions

In this paper we have described the design of a scalable multithreaded architecture. The salient features of the architecture are (i) its ability to exploit both coarse-grain parallelism and fine-grain instruction-level parallelism, (ii) a distributed DS-Cache which significantly reduces the network latency and makes the system scalable, (iii) a high-speed buffer organization which completely avoids load stalls on accesses to the local variables of an activation, and (iv) a layered approach to synchronization and scheduling which helps to achieve very high processor throughput and utilization.

The performance of the architecture is evaluated using simulation. Initial simulation results are promising and indicate that:
(1) With a small number (2 or 3) of execution pipes, the architecture can effectively exploit all the instruction-level parallelism available at a PE. The throughput of our architecture with two execution pipes is almost equal to that of a configuration with twice the number of PEs and a single execution pipe.
(2) The use of a set-associative cache leads to a near-linear performance improvement with respect to the number of PEs. The presence of the DS-Cache reduces the network latency experienced by remote read requests by a factor of 3 to 6.
(3) The DS-Cache reduces network traffic, which is essential in realizing the performance improvements due to multiple execution pipes.

Acknowledgements

The authors are grateful to the reviewers whose comments have improved the presentation of this paper. This work was supported by MICRONET - Network Centres of Excellence in Canada.

References

[1] F. Abu-Nofal et al. A three million transistor microprocessor. In Digest of Technical Papers, 1992 IEEE International Solid-State Circuits Conference, pages 108-109, Feb. 1992.
[2] A. Agarwal, B.-H. Lim, D. Kranz, and J. Kubiatowicz. APRIL: A processor architecture for multiprocessing. In Proc. of the 17th Ann. Intl. Symp. on Computer Architecture, pages 104-114, Seattle, Wash., June 1990.
[3] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith. The Tera computer system. In Conf. Proc., 1990 Intl. Conf. on Supercomputing, pages 1-6, Amsterdam, The Netherlands, June 1990.
[4] Arvind, R. S. Nikhil, and K. Pingali. I-structures: Data structures for parallel computing. ACM Trans. on Programming Languages and Systems, 11(4):598-632, Oct. 1989.
[5] H. Cheong and A. V. Veidenbaum. Compiler-directed cache management in multiprocessors. IEEE Computer, pages 39-47, June 1990.
[6] D. E. Culler et al. Fine-grain parallelism with minimal hardware support: A compiler-controlled threaded abstract machine. In Proc. of the 4th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 164-175, Santa Clara, Calif., April 1991.
[7] W. J. Dally et al. A message-driven processor: A multicomputer processing node with efficient mechanisms. IEEE Micro, pages 24-38, April 1992.
[8] R. Govindarajan and S. S. Nemawarkar. SMALL: A scalable multithreaded architecture to exploit large locality. In Proc. of the 4th IEEE Symp. on Parallel and Distributed Processing, pages 32-39, Dec. 1992.
[9] H. Hirata et al. An elementary processor architecture with simultaneous instruction issuing from multiple threads. In Proc. of the 19th Intl. Symp. on Computer Architecture, pages 136-145, Gold Coast, Australia, May 1992.
[10] R. A. Iannucci. Toward a dataflow/von Neumann hybrid architecture. In Proc. of the 15th Ann. Intl. Symp. on Computer Architecture, pages 131-140, Honolulu, Hawaii, June 1988.
[11] R. A. Iannucci, G. R. Gao, R. H. Halstead, Jr., and B. Smith. Multithreaded Computer Architecture: A Summary of the State of the Art. Kluwer, Norwell, Mass., 1994.
[12] S. W. Keckler and W. J. Dally. Processor coupling: Integrating compile time and runtime scheduling for parallelism. In Proc. of the 19th Intl. Symp. on Computer Architecture, pages 202-213, Gold Coast, Australia, May 1992.
[13] Y. Kodama, S. Sakai, and Y. Yamaguchi. A prototype of a highly parallel dataflow machine EM-4 and its preliminary evaluation. In Proc. of InfoJapan 90, pages 291-298, Oct. 1990.
[14] M. Misra. IBM RISC System/6000 Technology, First edition. IBM, Austin, Tx., 1990.
[15] R. S. Nikhil, G. M. Papadopoulos, and Arvind. *T: A multithreaded massively parallel architecture. In Proc. of the 19th Ann. Intl. Symp. on Computer Architecture, pages 156-167, Gold Coast, Australia, May 1992.
[16] G. M. Papadopoulos and D. E. Culler. Monsoon: An explicit token-store architecture. In Proc. of the 17th Ann. Intl. Symp. on Computer Architecture, pages 82-91, Seattle, Wash., June 1990.