Analysis of Multithreaded Multiprocessors with
Distributed Shared Memory*
S.S. Nemawarkar†, R. Govindarajan†, G.R. Gao‡, V.K. Agarwal†
†Department of Electrical Engineering, ‡School of Computer Science
McGill University, Montreal, H3A 2A7, Canada
{shashank,govindr,gao,agarwal}@pike.ee.mcgill.ca
Abstract
In this paper we propose an analytical model, based
on multi-chain closed queuing networks, to evaluate
the performance of multithreaded multiprocessors. The
queuing network is solved by using approximate Mean
Value Analysis. Unlike earlier work which modeled in-
dividual subsystems in isolation, our work models pro-
cessor, memory and network subsystems an an inte-
grated manner. Such an approach brings out a strong
coupling between each pair of subsystems. For ex-
ample, the processor and memory Utilizations respond
identically to the variations in the network character-
istics. Further, we observe that high performance on
an application is achieved when the memory request
rate of a processor equals the weighted sum of memory
bandwidth and the average round trip distance of the
remote memory across the network.
1 Introduction
A multithreaded processor has the potential to tol-
erate the effects of long memory latency, high network
delays and unpredictable synchronization delays, as
long as it has available tasks to execute. The processor executes small sequences of instructions, called threads, interspersed with memory accesses and synchronizations. Multithreaded architectures mask
the long memory latency by suspending the execution
of the current thread upon encountering the long la-
tency operation and switching to another thread. This
improves the processor utilization as the computation
is overlapped with the memory access and synchro-
nization delays. A context switch incurs an associated
overhead of saving the state of the current thread and
restoring that of the newly scheduled thread.
The performance of a multithreaded architecture depends on a number of parameters related to the architecture (memory latency, context switch time and switch delay) and to the application (number of threads, thread runlength and remote memory access pattern¹). In this paper, we are interested in an an-
alytical study on the performance of a multithreaded
multiprocessor with the processing elements connected
across a two-dimensional mesh. The processors access
a shared memory which is physically distributed across
the processing elements.
Previous studies on multithreaded architectures have focussed only on the processor utilization. Similarly, the studies on network characteristics concentrate only on the network switch responses. These works [1, 3, 11] analyze a subsystem in isolation, without considering the feedback effects of the other subsystems.
In contrast, we develop an integrated model of pro-
cessor, memory and network subsystems, which helps
in identifying the relationships among various param-
eters for achieving high performance. The model is
based on the multi-chain closed queuing network. A
key intuitive observation for the model is that with the
availability of an adequate number of enabled threads,
we can anticipate that the system will behave as a
closed queuing network provided that sufficiently large
queues are maintained at each stage of the service sta-
tion. In [6], Kruskal and Snir show that the difference between finite and infinite buffers is not significant when the buffer size is larger than a few entries. Based on these observations, we can solve the closed queuing network. Owing to the amount of computation needed to study a multiprocessor system with many nodes [10], we use an Approximate Mean Value Analysis (AMVA). This analysis yields the performance
measures such as processor and memory utilizations,
which can be used to fine tune the program partition-
ing for high performance. The performance results of our study establish that:
(i) both the memory and network subsystems strongly influence the processor utilization, and a strong coupling exists between each pair of subsystems; such behavior is not explicitly reported in earlier work;
(ii) when the rate of issue of memory requests in an application program is such that the resulting average thread runlength equals the weighted sum of the local memory latency and the network latencies, the processor gets the responses from the memory and the network before it runs out of workload;
(iii) this matching of the thread runlength with the memory and network latencies also ensures a high utilization of all subsystems, and hence represents a suitable operating range for an architecture.

*This work has been supported by MICRONET - Network Centers of Excellence.
¹Refer to Section 2.1 for a precise definition of some of the terms used here.
An equivalence between Markov queuing models and generalized stochastic Petri nets has been shown in [8]. So, for validating these analytical solutions, we develop a stochastic timed Petri net (STPN) model for a multithreaded architecture.
The rest of this paper is organized as follows. In the following section we develop the analytical model. Section 3 describes the validation of the analytical results using the STPN model. The results of this study are presented in Section 4. In Section 5, we compare our approach with related work.
2 Analytical Model for a Multithreaded Architecture
2.1 A Multithreaded Architecture
In the multithreaded execution model, a program is a collection of partially-ordered threads. A thread consists of a sequence of instructions which are executed sequentially, as in the von Neumann model. The scheduling of individual threads is similar to a dataflow model. We list the terminology used in the rest of the paper:
A thread undergoes the following states during its lifetime:
Suspended: when it is waiting for its long-latency operation to be satisfied. Only one such operation per thread is allowed.
Ready: when the long-latency operation for which a thread was waiting is satisfied.
Executing: when a ready thread is scheduled for execution on the processor pipeline.
Each processor executes a set of n_t parallel threads. We assume that n_t is constant. Further, the threads interact only through locations in either the local memory or a remote memory. A processor on which a group of n_t threads executes is called their host processor. These threads cannot be scheduled for execution on any other processor.
A thread is executed on the processor pipeline for a duration called its runlength R, before getting suspended on a memory access. On suspension, the state of the outgoing thread is saved and the context of the newly scheduled thread is restored. This context switch time is C.

Figure 1: A Processing Element
A memory access is directed to a remote memory module with probability p_remote. The local memory module services the remaining fraction (1 - p_remote) of memory accesses. At each switch node, a remote memory access suffers a delay of S time units.
Memory latency L is the access time of the local memory without queuing delays. Observed memory latency L_obs is the response time experienced by a memory access, including the waiting time at the memory.
The remote memory access pattern across the memory modules follows a geometric distribution: the probability of a remote memory access requesting a memory module at a distance of h hops from the host processor is p_sw^h / u, where p_sw is the probability of accessing the nearest memory modules across the network, and u is a normalizing constant. A lower value of p_sw concentrates the accesses closer to the host processor, i.e., gives a higher locality of memory accesses.
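To make the distribution concrete, the sketch below (our own illustration, not code from the paper; the function name and the choice to normalize over hop distances 1 through 4 are assumptions, consistent with the d_avg = 1.733 value quoted in Section 4.2) computes the normalized per-distance probabilities p(h) = p_sw^h / u on a 4 x 4 torus:

```python
import numpy as np

def geometric_hop_probs(p_sw, max_hops):
    """Normalized probability p(h) = p_sw^h / u that a remote access
    targets a module h hops away, for h = 1 .. max_hops."""
    w = p_sw ** np.arange(1, max_hops + 1)
    return w / w.sum()

# On a 4 x 4 bidirectional torus the farthest module is 4 hops away.
for p_sw in (0.1, 0.5, 0.9):
    p = geometric_hop_probs(p_sw, max_hops=4)
    d_avg = np.arange(1, 5) @ p        # mean hop distance of a remote access
    print(f"p_sw={p_sw}: p(h)={p.round(3)}, d_avg={d_avg:.3f}")
# p_sw = 0.5 gives d_avg = 1.733, the value used in Section 4.2; a smaller
# p_sw concentrates accesses near the host, i.e., yields higher locality.
```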
Processor utilization U_p is the fraction of time the processor is performing useful work in the execution pipeline. Similarly, memory utilization U_m is the fraction of time for which the memory port is busy, and switch utilization U_net is the fraction of time for which the switch is busy. System utilization U_sys is the average of the utilizations of all three components in a processing element.
R, L, C and S can be measured in numbers of cycles, or in time units.
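In these terms, the system utilization used in Section 4 is simply the unweighted average of the three component utilizations stated above:

```latex
U_{sys} \;=\; \frac{U_p + U_m + U_{net}}{3}
```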
Figure 1 shows a processing element consisting of
a processor pipeline, memory and its switch inter-
face. Multiple such processing elements are connected
across a network of switches to form a multiprocessor
system. A processor sends a remote memory access
to the network through the switch. The access is for-
warded along any of the available shortest-paths to the
remote memory module. Upon service, the response is
returned to the processor issuing this memory request.
In this paper, we use a two-dimensional, bidirectional torus topology for the interconnection network (shown in Figure 2), similar to the one used in the ALEWIFE multiprocessor system [2].

Figure 2: 4 x 4 Multiprocessor with 2-dimensional mesh
2.2 Analytical Model and Assumptions
We propose the use of a closed queuing network, as shown in Figure 3, to model the multithreaded architecture described in Section 2.1. Such a queuing model has a product-form solution (refer to [4, 10] for details) for two reasons. One, there is sufficient buffer space in the thread pool, so the processor is not blocked while a thread waits at the thread pool for the memory system to respond to its access. Two, there is no build-up of active threads in the system, allowing the use of finite queues in the system without blocking the network switches.
The queuing network model is composed of the processing elements, with three types of nodes, namely processor, memory and switch, as shown in Figure 3.
Processor node:
The processor is a single-server node. Ready threads are executed one at a time with the FCFS service discipline, with an exponential service time having mean R. Further, all the processor nodes have the same service time distribution. In our model, a thread is statically assigned to one of the processors and gets executed only on that processor. The threads executed on processor i are considered to be class i customers in the queuing network. A thread alternates between execution on its host processor and an access to a memory module. Thus, the visit ratio² of a thread to its host processor is unity, and to other processors is zero.
Memory Module:
A memory module has a single server, with an exponentially distributed access time whose mean value is L time units. The visit ratio of a thread (belonging to a host i) to a memory module j is em_ij. The value of em_ij depends on the distribution of remote memory requests across the memory modules. In this paper, we consider the geometric distribution discussed in Section 2.1.
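As an illustration of how the em_ij values could be derived from this distribution (our own sketch: the torus-distance helper and the even split of the per-distance probability among equidistant modules are assumptions, not the paper's stated procedure):

```python
import numpy as np

def torus_dist(i, j, side=4):
    """Hop (Manhattan) distance between PEs i and j on a side x side torus."""
    (xi, yi), (xj, yj) = divmod(i, side), divmod(j, side)
    dx = min(abs(xi - xj), side - abs(xi - xj))
    dy = min(abs(yi - yj), side - abs(yi - yj))
    return dx + dy

def memory_visit_ratios(i, p_remote, p_sw, side=4):
    """em[j]: visit ratio of class-i threads to memory module j."""
    n = side * side
    h = np.array([torus_dist(i, j, side) for j in range(n)])
    p_h = p_sw ** np.arange(1.0, h.max() + 1)
    p_h /= p_h.sum()                          # p(h) = p_sw^h / u, as in Sec. 2.1
    em = np.zeros(n)
    em[i] = 1.0 - p_remote                    # local accesses stay at the host
    for k, ph in enumerate(p_h, start=1):
        at_k = np.flatnonzero(h == k)         # modules exactly k hops away
        em[at_k] = p_remote * ph / len(at_k)  # even split among them (assumed)
    return em                                 # sums to 1: one access per visit

em = memory_visit_ratios(i=0, p_remote=0.5, p_sw=0.5)
print(em.round(4), em.sum())                  # em.sum() == 1.0
```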
Switch Node:
A switch node has a single server with an exponential service time distribution with mean value S. The switch node interfaces a processor-memory block with the four³ neighboring switch nodes in the 2-dimensional torus network. The visit ratio es_ij, from a processor i to a switch j, is the sum of:
(i) twice the visit ratio em_ij to the memory module j, for class i, and (ii) the contribution due to the memory modules situated at a distance greater than the number of hops from node i to node j, as written out below.
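In symbols (our notation; the routing fraction r_ijk below is our own symbol for the share of class-i traffic to module k that passes through switch j, which the paper describes only in words):

```latex
e^{s}_{ij} \;=\; 2\,e^{m}_{ij} \;+\; \sum_{k \,:\, d(i,k) \,>\, d(i,j)} 2\, e^{m}_{ik}\, r_{ijk}
```

The factor of 2 accounts for the forward and return paths of each remote access.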
In summary, a processing element consisting of a
processor, a memory, and a switch node, is connected
to other processing elements through one or more
switch nodes, as shown in Figure 3.
Solving the above queuing network accurately is computationally intensive. The state space to be considered for computing the normalization constant, G, of the product-form solution [10] is enormous. For example, a two-processor system with 10 threads on each processor has 1,166,400 states. So we prefer the use of Approximate Mean Value Analysis (AMVA) [7]. For n_t threads on each of the P processors in the system, the AMVA evaluates:
(i) the arrival rate λ_i for the threads belonging to each processor i; (ii) the waiting time w_{i,l} at each node l; and (iii) the queue length n_{i,l}, for the population vectors N = (n_t, ..., n_t) and N - 1_i, i.e., with one customer less in the i-th class.
Based on λ_i, w_{i,l}, the service times and the visit ratios, we can evaluate U_p, U_m, U_net and L_obs. We investigate the behavior of these performance measures in terms of the architectural and application parameters in Section 4. Section 3 presents the details of the simulation model based on STPN used for the validation of the analytical results.

Figure 3: Queueing Network Model

²The visit ratio for a class of threads at a node in a chain of a closed queuing network is the frequency with which a thread belonging to that class visits the node, relative to a reference node (say the processor, selected arbitrarily for monitoring the system) in that chain.
³This number depends upon the topology of the interconnection network.
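For concreteness, here is a minimal sketch of a Schweitzer-style approximate MVA iteration for a multi-class closed network (our own code and naming; the paper's exact AMVA variant [7] and its handling of the fixed context-switch delay C may differ). Station service times are R for processors, L for memories and S for switches, and the visit ratios come from the em_ij and es_ij values above:

```python
import numpy as np

def amva(visit, service, pop, tol=1e-8):
    """Schweitzer approximate MVA for a multi-class closed queuing network.

    visit[c, k] : visit ratio of class c at station k
    service[k]  : mean service time of station k (R, L or S here)
    pop[c]      : population of class c (the n_t threads of processor c)
    Returns per-class throughputs and per-class/station queue lengths.
    """
    visit = np.asarray(visit, float)
    service = np.asarray(service, float)
    pop = np.asarray(pop, float)
    C, K = visit.shape
    n = np.outer(pop, np.full(K, 1.0 / K))     # initial guess: even spread
    while True:
        # queue an arriving class-c customer 'sees' (Schweitzer estimate)
        seen = n.sum(axis=0)[None, :] - n / pop[:, None]
        w = service[None, :] * (1.0 + seen)    # per-class waiting times
        x = pop / (visit * w).sum(axis=1)      # throughputs (Little's law)
        n_new = x[:, None] * visit * w         # updated queue lengths
        if np.abs(n_new - n).max() < tol:
            return x, n_new
        n = n_new

# Station utilizations then follow as U_k = sum over c of x[c] * visit[c, k] * service[k].
```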
3 Validation of the Analytical Model
In this section we describe an STPN model for
the multithreaded multiprocessor architecture which
mimics the behavior of the queuing network model.
3.1 The STPN Model
Figure 4 shows the STPN model for a processing element containing a multithreaded processor, a memory module and the network interface. The processor subsystem is modeled by the transitions t0, t1 and t2, and the places p0, p1 and p2. The place p4 maintains a pool of ready threads. A thread executes for the duration R at the transition t1, before it encounters a long-latency memory access. Transition t2 models the context switch time C for saving/restoring the states of the outgoing and the newly scheduled threads, respectively. Further, t2 sends the memory access to a remote memory through p7 with probability p_remote, and to the local memory through p3. The memory access distribution is used to determine the destination for a remote access.
Figure 4: Petri Net Model for a Processing Element
The transitions t_recv, t_send and t8, and the places p7, p8 and p_port, model the network interface. Place p_port models the state of the network port. An incoming message from the network for a suspended thread on this processor is forwarded to p4, while a request to access the memory is forwarded to p3.
The memory port, modeled by a token in p6, takes up a memory access from p3 for service at t3 with duration L. Transition t3 routes the response to p4 or p7, based on the processor originating the request.
Transitions with non-zero delays are represented using rectangular boxes. R and L have exponentially distributed service times, and C has a fixed time delay.
3.2 Validation
The above STPN model is simulated using an event-driven Petri net simulator, Voltaire [9]. The performance results from the simulations, for various sets of input parameters, are compared with the analytical results. We report two representative cases with the following parameters: R = 10, C = 0, L = 10, S = 10, and p_remote = 0.1 or 0.5. The memory access pattern is geometric with p_sw = 0.5.
Figure 5 shows the processor utilization values obtained from the analysis and the Petri net simulations, represented as MODEL-Util and SIM-Util respectively. MODEL-Latency and SIM-Latency represent L_obs from the analysis and the simulations. First, we observe that U_p increases rapidly with n_t and then saturates. For example, at p_remote = 0.1, the values of U_p are 57%, 79% and 89% for n_t = 2, 5 and 10, respectively. At p_remote = 0.5, the corresponding values of U_p are lower. Secondly, L_obs increases linearly with n_t at p_remote = 0.1, since more threads wait at the memory module for service. At p_remote = 0.5, L_obs is low and almost constant after n_t = 4, since a saturated network limits the memory access rate. In both cases, the analytical results match well with the simulations throughout the range of the experiment. Thus the STPN simulations confirm the accuracy of our queuing analysis.

Figure 5: U_p and L_obs with respect to n_t.
Section 4 discusses the results based on the analysis
of Section 2.2. As the analytical results match well
with the simulations, we report the former only.
4 Results
Our objective is to identify the relationships among the various application and architecture parameters so that we can achieve a low execution time while maintaining a high utilization of all subsystems. We use the analytical model developed above, with the base architecture parameters L = 10 and S = 10, on a 4 x 4 mesh of processing elements. The default values for the application parameters R, n_t and p_remote are 10, 8 and 0.1, unless stated otherwise. The default distribution for remote memory accesses is geometric with p_sw = 0.5. These values ensure saturated processor and memory performance if p_remote or S is zero.
4.1 Subsystem Utilizations
With p_remote = 0, the memory requests are restricted to the local memory module. An increase in p_remote increases the number of messages routed to remote memory modules across the network. This has a two-fold effect on performance: (i) since the latency for a remote access is higher (than the local memory latency) due to the extra time spent in traversing the network, the corresponding thread is suspended for a longer duration; (ii) a larger number of messages on the network leads to higher contention, or network congestion, which in turn increases the network latency. This reduces the utilization of the processor and memory subsystems. Figure 6 shows this effect of p_remote on the subsystem utilizations, for L = 10 and L = 20. At L = 10, an increase in p_remote from 0.2 to 0.8 reduces the values of U_p and U_m from nearly 90% to 23% and 22%, respectively. When U_net saturates, the fall in the values of U_p and U_m is steep. For L = 20, U_p and U_m decrease rapidly after the network saturates, in the same way. Also, a variation in p_remote affects both U_p and U_m identically.
Figure 6: Subsystem Utilizations (versus remote access probability).
Similar observations can be made when we consider the effect of the memory latency on the processor and network utilizations, or the effect of S on the processor and memory utilizations. If the memory latency is increased, then the number of requests waiting at the memory increases, reducing the values of U_p and U_net. Similarly, an increase in S increases the network latency, so more threads at the processor remain suspended waiting for the corresponding memory responses to arrive. This in turn decreases the rate at which memory accesses are sent, resulting in a fall in the U_p and U_m values with respect to an increase in S.
Thus we observe a close coupling among the subsystems, based on our integrated model of the processor, memory and network subsystems.
4.2 System Utilization
Since variations in any parameter of the system can affect the utilizations of all subsystems, we define the system utilization U_sys as the average of the utilizations of all subsystems. Knowing the behavior of the subsystem utilizations (from Section 4.1), we are interested in the ability of U_sys to track the transitions corresponding to the saturation of these subsystems. Figure 7 plots the subsystem utilizations and U_sys with respect to the memory latency, for R = 10 and R = 20.

Figure 7: The System Utilization with respect to L.
When L is close to zero, the system utilization is low due to the low utilization of the memory. At values of L close to 100, the memory subsystem saturates, but U_sys is low (the limiting value is 33%) due to low U_p and U_net. For U_sys, a peak occurs when L = R = S = 10, since all subsystems are close to their maximum utilization values. With L > 10, both U_p and U_net drop off sharply with L, and only a small rise occurs in U_m, resulting in a low value of U_sys. The maximum value of U_sys is referred to as the peak system utilization (PSU). Let the corresponding memory latency be L_PSU. From Figure 7 we observe that:
(i) U_sys reflects the relative values of U_p, U_m and U_net. When the parameters of the processor and memory subsystems are considered, PSU occurs at L = R. We note that PSU represents a transition phase in which one subsystem approaches saturation and the utilizations of the other subsystems drop. This is due to the balance of throughput between any pair of subsystems.
(ii) For R = 10 and L ≤ 10, at PSU, U_p is only 5% less than its maximum value, while U_sys has improved by close to 25%. For R = 20 and L ≤ 20, these differences for U_p and U_sys are 7% and 30%. Thus, by keeping the operating range near PSU, we can gain considerably in overall system utilization with a small loss in processor utilization.
(iii) For any value of L less than L_PSU, U_p is high. Thus L_PSU represents the slowest memory with which we can operate without significantly hampering the system performance.
The bell-shaped plot for the system utilization also occurs with respect to changes in other parameters, such as R and S.
Effect of Network Parameters
Figure 8 shows the effect of p_remote on the system utilization for various values of S. We observe that: (i) PSU lies between 70% and 80% for a wide range of S; (ii) for faster switches, i.e. low S, U_net does not saturate until p_remote is high.
Figure 8: Effect of p_remote on U_sys for various S.
Effect of Thread Runlength
Figure 9 plots the system utilization with respect to p_remote for various values of the thread runlength. Let the network latency T_avg be the average time taken by a message on the unloaded network to complete a round trip. For the geometric distribution of memory accesses with p_sw = 0.5, a remote memory access travels a distance d_avg = 1.733 hops on a 4 x 4 mesh. Thus a round trip takes 2 x 1.733 x 10 time units in the unloaded network. In addition, a delay of S (= 10) time units is incurred at the local switch on the forward as well as the return path of the message. Hence T_avg (= 34.66 + 20 = 54.66) is given by:

    T_avg = 2 x d_avg x S + 2 x S    (1)
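Numerically (a small check in our own notation):

```python
S, d_avg = 10, 1.733                 # switch delay and mean remote distance
T_avg = 2 * d_avg * S + 2 * S        # hops out and back, plus the local
print(round(T_avg, 2))               # switch on each path: 34.66 + 20 = 54.66
```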
In Figure 9, for R ≤ 10, PSU increases with R from 67% to 79%. Also, the PSU almost always occurs at p_remote ≈ 0.18. Since R ≤ L, a thread spends less time at the processor than it spends at the memory module, and PSU is a result of the matching of throughput between the memory and network subsystems. A memory module returns the remote memory accesses to the network at the rate p_remote/L. At PSU, the throughput of the incoming messages from the network (= 1/T_avg) equals the throughput of the responses from the memory module (= p_remote/L):

    1/T_avg = p_remote/L    (2)

So p_remote is L/T_avg = 0.18293. For R ≥ 10, the processor and network subsystems govern the PSU value. A processor sends out memory requests at the rate 1/R. A fraction (= p_remote) of these is directed across the network to the remote memory modules. The network delivers the messages to the processor at the rate 1/T_avg. As the throughputs should match at PSU, p_remote should equal R/T_avg. Considering these two scenarios together, the maximum value of PSU occurs when the throughputs of the three subsystems are equal. That is, the thread runlength, memory latency and network latency should be such that:

    R = L = p_remote x T_avg    (3)

Equation 3 is a direct result of Equation 2, when we consider the throughput balance at the processor subsystem. In Figure 6, we observe that upon network saturation the values of U_p and U_m are close to R/(p_remote x T_avg) and L/(p_remote x T_avg), respectively. Similarly, for increasing L, when the memory subsystem reaches saturation, the values of U_p and U_net are proportional to R/L and (p_remote x T_avg)/L, respectively.

Figure 9: Effect of p_remote on U_sys for various R.
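The PSU operating point implied by Equations 2 and 3 can be verified with the same numbers (again our own sketch):

```python
L, T_avg = 10, 54.66
p_remote = L / T_avg                 # Equation 2: memory-network balance
R = p_remote * T_avg                 # Equation 3 then forces R = L
print(round(p_remote, 3), R)         # ~0.183, the PSU point in Figure 9; 10.0
```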
4.3 Locality of Memory Accesses
If the remote memory access pattern is a geometric distribution, an increase in p_sw increases d_avg for a message on the network, and hence the network latency. Figure 10 shows the effect of increasing p_sw on the system utilization, for various values of the thread runlength, when p_remote = 0.17. For low values of p_sw, PSU occurs due to saturation at the processor and memory subsystems. PSU increases from 65% to 78% when p_sw is increased from 0.1 to 0.7, due to an increase in the value of U_net. A further increase in p_sw to 0.9 brings PSU down to 72%, due to lower values of U_p and U_m.
Figure 10: U_sys with Geometric Distribution (versus thread runlength, for various p_sw).
4.4 Summary of Results
Our study suggests the following conditions for achieving high performance:
• Overall high utilization of all subsystems is achieved when (i) the thread runlength R equals the memory latency L; and (ii) the remote memory access rate (p_remote/R) equals the network service rate 1/T_avg.
• The above condition is necessary irrespective of how large n_t is.
• Applications with larger locality can tolerate slower networks without much degradation in performance, due to the reduced network traffic.
5 Related Work
A few analytical studies on multithreaded architectures have been reported in the literature [1, 3, 11]. In [3] and [11], stochastic timed Petri nets (STPNs) have been used for modeling. These analyses assume that the response time of the memory is constant; equivalently, the parallelism in the application (i.e. n_t) has no impact on the throughput of the memory subsystem. Further, [3] studies a bus-based multiprocessor without contention effects on the bus. In contrast, using an integrated model of a multiprocessor, we study a realistic system with queuing delays at the network and memory subsystems.
In [1], the analysis has been performed for a cache-based multiprocessor architecture. The analysis models a finite number of threads and their interference in the cache. It focuses on the performance of a processor, but the other subsystems (like the memory and the network) have not been studied. On the other hand, our analysis does not model caches explicitly; the thread runlength R is related to the cache miss rate for an application. The two approaches are complementary. The analysis presented by Johnson [5] provides a framework by combining simple models of application, processor and network behavior. The model assumes an unsaturated network, and does not consider the memory subsystem in detail. We develop a fairly simple integrated model of the system which can be adapted to different networks quickly. Our model is applicable to saturated as well as unsaturated subsystems.
In [12], performance results for trace-driven simulations of a multithreaded system with a shared bus are reported. They conclude that a small number of threads in an application can achieve near 100% processor utilization, but that large global traffic can limit the performance benefits of multithreading. Our study extends these results by suggesting the operating range for obtaining higher performance.
6 Conclusions
In this paper, we have proposed a simple analytical model for a multithreaded multiprocessor architecture, based on a closed queuing network with a finite thread population. The performance study based on this analytical model, which integrates the processor, memory and network subsystems, shows that:
• a strong coupling exists between these subsystems: variations in the parameters of one subsystem affect the utilizations of the other subsystems as well;
• for high performance, the partitioning of a program should result in thread runlengths close to the weighted sum of the memory latency and the network latency; this is necessary for high performance irrespective of the application parallelism n_t;
• a larger locality in the application program reduces the network traffic, resulting in higher performance.
References
[1] A. Agarwal. Performance tradeoffs in multithreaded processors. IEEE Transactions on Parallel and Distributed Systems, 2(4), September 1992.
[2] A. Agarwal, B.H. Lim, D. Kranz, and J. Kubiatowicz. APRIL: A processor architecture for multiprocessing. In Proc. of the 17th Int'l. Symp. on Computer Architecture, pages 104-114, 1990.
[3] L. Alkalaj and R.V. Bopanna. Performance of multithreaded execution in a shared-memory multiprocessor. In Proc. of 3rd Ann. IEEE Symp. on Parallel and Distributed Processing, pages 330-333, Dallas, USA, December 1991. IEEE.
[4] F. Baskett, K. Mani Chandy, R.R. Muntz, and F.G. Palacios. Open, closed, and mixed networks of queues with different classes of customers. Journal of the ACM, 22(2):248-260, April 1975.
[5] K. Johnson. The impact of communication locality on large-scale multiprocessor performance. In Proceedings of the 19th International Symposium on Computer Architecture, pages 392-402. ACM, May 1992.
[6] C.P. Kruskal and M. Snir. The performance of multistage interconnection networks. IEEE Transactions on Computers, C-32(12):1091-1098, December 1983.
[7] E.D. Lazowska, J. Zahorjan, G.S. Graham, and K.C. Sevcik. Quantitative System Performance: Computer System Analysis Using Queueing Network Models. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1984.
[8] T. Murata. Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4):541-580, April 1989.
[9] P. Parent and O. Tanir. Voltaire: a discrete event simulator. In Proceedings of the Fourth International Workshop on Petri Nets and Performance Models, Melbourne, Australia, December 1991.
[10] M. Reiser and S. Lavenberg. Mean value analysis of closed multichain queueing networks. Journal of the ACM, 27(2):313-322, April 1980.
[11] R.H. Saavedra-Barrera, D.E. Culler, and T. von Eicken. Analysis of multithreaded architectures for parallel computing. In Proc. of 2nd Ann. ACM Symp. on Parallel Algorithms and Architectures, Crete, Greece, July 1990. ACM.
[12] W.D. Weber and A. Gupta. Exploring the benefits of multiple contexts in a multiprocessor architecture: Preliminary results. In Proceedings of the 16th Annual International Symposium on Computer Architecture, pages 273-280. ACM, 1989.

More Related Content

What's hot

A New Approach to Improve the Efficiency of Distributed Scheduling in IEEE 80...
A New Approach to Improve the Efficiency of Distributed Scheduling in IEEE 80...A New Approach to Improve the Efficiency of Distributed Scheduling in IEEE 80...
A New Approach to Improve the Efficiency of Distributed Scheduling in IEEE 80...IDES Editor
 
An octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingAn octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingeSAT Journals
 
Communication synchronization in cluster based wireless sensor network a re...
Communication synchronization in cluster based wireless sensor network   a re...Communication synchronization in cluster based wireless sensor network   a re...
Communication synchronization in cluster based wireless sensor network a re...eSAT Journals
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentEricsson
 
AN EFFICIENT ROUTING PROTOCOL FOR DELAY TOLERANT NETWORKS (DTNs)
AN EFFICIENT ROUTING PROTOCOL FOR DELAY TOLERANT NETWORKS (DTNs)AN EFFICIENT ROUTING PROTOCOL FOR DELAY TOLERANT NETWORKS (DTNs)
AN EFFICIENT ROUTING PROTOCOL FOR DELAY TOLERANT NETWORKS (DTNs)cscpconf
 
Conference Paper: Towards High Performance Packet Processing for 5G
Conference Paper: Towards High Performance Packet Processing for 5GConference Paper: Towards High Performance Packet Processing for 5G
Conference Paper: Towards High Performance Packet Processing for 5GEricsson
 
Memory consistency models
Memory consistency modelsMemory consistency models
Memory consistency modelspalani kumar
 
system interconnect architectures in ACA
system interconnect architectures in ACAsystem interconnect architectures in ACA
system interconnect architectures in ACAPankaj Kumar Jain
 
Clustering effects on wireless mobile ad hoc networks performances
Clustering effects on wireless mobile ad hoc networks performancesClustering effects on wireless mobile ad hoc networks performances
Clustering effects on wireless mobile ad hoc networks performancesijcsit
 
On the Tree Construction of Multi hop Wireless Mesh Networks with Evolutionar...
On the Tree Construction of Multi hop Wireless Mesh Networks with Evolutionar...On the Tree Construction of Multi hop Wireless Mesh Networks with Evolutionar...
On the Tree Construction of Multi hop Wireless Mesh Networks with Evolutionar...CSCJournals
 
18068 system software suppor t for router fault tolerance(word 2 column)
18068 system software suppor t for router fault tolerance(word 2 column)18068 system software suppor t for router fault tolerance(word 2 column)
18068 system software suppor t for router fault tolerance(word 2 column)Ashenafi Workie
 
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORSSTUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORSijdpsjournal
 
Cache performance-x86-2009
Cache performance-x86-2009Cache performance-x86-2009
Cache performance-x86-2009Léia de Sousa
 
Implementation of Spanning Tree Protocol using ns-3
Implementation of Spanning Tree Protocol using ns-3Implementation of Spanning Tree Protocol using ns-3
Implementation of Spanning Tree Protocol using ns-3Naishil Shah
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)ijceronline
 

What's hot (19)

A New Approach to Improve the Efficiency of Distributed Scheduling in IEEE 80...
A New Approach to Improve the Efficiency of Distributed Scheduling in IEEE 80...A New Approach to Improve the Efficiency of Distributed Scheduling in IEEE 80...
A New Approach to Improve the Efficiency of Distributed Scheduling in IEEE 80...
 
An octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingAn octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passing
 
Communication synchronization in cluster based wireless sensor network a re...
Communication synchronization in cluster based wireless sensor network   a re...Communication synchronization in cluster based wireless sensor network   a re...
Communication synchronization in cluster based wireless sensor network a re...
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environment
 
AN EFFICIENT ROUTING PROTOCOL FOR DELAY TOLERANT NETWORKS (DTNs)
AN EFFICIENT ROUTING PROTOCOL FOR DELAY TOLERANT NETWORKS (DTNs)AN EFFICIENT ROUTING PROTOCOL FOR DELAY TOLERANT NETWORKS (DTNs)
AN EFFICIENT ROUTING PROTOCOL FOR DELAY TOLERANT NETWORKS (DTNs)
 
Conference Paper: Towards High Performance Packet Processing for 5G
Conference Paper: Towards High Performance Packet Processing for 5GConference Paper: Towards High Performance Packet Processing for 5G
Conference Paper: Towards High Performance Packet Processing for 5G
 
Memory consistency models
Memory consistency modelsMemory consistency models
Memory consistency models
 
system interconnect architectures in ACA
system interconnect architectures in ACAsystem interconnect architectures in ACA
system interconnect architectures in ACA
 
Clustering effects on wireless mobile ad hoc networks performances
Clustering effects on wireless mobile ad hoc networks performancesClustering effects on wireless mobile ad hoc networks performances
Clustering effects on wireless mobile ad hoc networks performances
 
Dos unit3
Dos unit3Dos unit3
Dos unit3
 
On the Tree Construction of Multi hop Wireless Mesh Networks with Evolutionar...
On the Tree Construction of Multi hop Wireless Mesh Networks with Evolutionar...On the Tree Construction of Multi hop Wireless Mesh Networks with Evolutionar...
On the Tree Construction of Multi hop Wireless Mesh Networks with Evolutionar...
 
18068 system software suppor t for router fault tolerance(word 2 column)
18068 system software suppor t for router fault tolerance(word 2 column)18068 system software suppor t for router fault tolerance(word 2 column)
18068 system software suppor t for router fault tolerance(word 2 column)
 
SoC-2012-pres-2
SoC-2012-pres-2SoC-2012-pres-2
SoC-2012-pres-2
 
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORSSTUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
 
Cache performance-x86-2009
Cache performance-x86-2009Cache performance-x86-2009
Cache performance-x86-2009
 
1
11
1
 
Implementation of Spanning Tree Protocol using ns-3
Implementation of Spanning Tree Protocol using ns-3Implementation of Spanning Tree Protocol using ns-3
Implementation of Spanning Tree Protocol using ns-3
 
C04511822
C04511822C04511822
C04511822
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 

Viewers also liked

Hyundai mobile
Hyundai mobileHyundai mobile
Hyundai mobileHannah Lee
 
Omnichannel Commerce: la reinvención del comercio
Omnichannel Commerce: la reinvención del comercioOmnichannel Commerce: la reinvención del comercio
Omnichannel Commerce: la reinvención del comercioFrancisco Egea Castejón
 
Priyanka_Resume_Oct102015
Priyanka_Resume_Oct102015Priyanka_Resume_Oct102015
Priyanka_Resume_Oct102015priyanka gadia
 
Certification of employement And Recommendation
Certification of employement And RecommendationCertification of employement And Recommendation
Certification of employement And RecommendationMAYSAM GAMINI
 
гуравдугаар сарын гео уулзалт(1)
гуравдугаар сарын гео уулзалт(1)гуравдугаар сарын гео уулзалт(1)
гуравдугаар сарын гео уулзалт(1)GeoMedeelel
 

Viewers also liked (6)

Hyundai mobile
Hyundai mobileHyundai mobile
Hyundai mobile
 
Medina Sidonia
Medina SidoniaMedina Sidonia
Medina Sidonia
 
Omnichannel Commerce: la reinvención del comercio
Omnichannel Commerce: la reinvención del comercioOmnichannel Commerce: la reinvención del comercio
Omnichannel Commerce: la reinvención del comercio
 
Priyanka_Resume_Oct102015
Priyanka_Resume_Oct102015Priyanka_Resume_Oct102015
Priyanka_Resume_Oct102015
 
Certification of employement And Recommendation
Certification of employement And RecommendationCertification of employement And Recommendation
Certification of employement And Recommendation
 
гуравдугаар сарын гео уулзалт(1)
гуравдугаар сарын гео уулзалт(1)гуравдугаар сарын гео уулзалт(1)
гуравдугаар сарын гео уулзалт(1)
 

Similar to shashank_spdp1993_00395543

Operating Systems R20 Unit 2.pptx
Operating Systems R20 Unit 2.pptxOperating Systems R20 Unit 2.pptx
Operating Systems R20 Unit 2.pptxPrudhvi668506
 
Enhanced transformer long short-term memory framework for datastream prediction
Enhanced transformer long short-term memory framework for datastream predictionEnhanced transformer long short-term memory framework for datastream prediction
Enhanced transformer long short-term memory framework for datastream predictionIJECEIAES
 
Distributed system lectures
Distributed system lecturesDistributed system lectures
Distributed system lecturesmarwaeng
 
QoS Framework for a Multi-stack based Heterogeneous Wireless Sensor Network
QoS Framework for a Multi-stack based Heterogeneous Wireless Sensor Network QoS Framework for a Multi-stack based Heterogeneous Wireless Sensor Network
QoS Framework for a Multi-stack based Heterogeneous Wireless Sensor Network IJECEIAES
 
DL for sentence classification project Write-up
DL for sentence classification project Write-upDL for sentence classification project Write-up
DL for sentence classification project Write-upHoàng Triều Trịnh
 
Compositional Analysis for the Multi-Resource Server
Compositional Analysis for the Multi-Resource ServerCompositional Analysis for the Multi-Resource Server
Compositional Analysis for the Multi-Resource ServerEricsson
 
week_2Lec02_CS422.pptx
week_2Lec02_CS422.pptxweek_2Lec02_CS422.pptx
week_2Lec02_CS422.pptxmivomi1
 
CS8603_Notes_003-1_edubuzz360.pdf
CS8603_Notes_003-1_edubuzz360.pdfCS8603_Notes_003-1_edubuzz360.pdf
CS8603_Notes_003-1_edubuzz360.pdfKishaKiddo
 
Limitations of memory system performance
Limitations of memory system performanceLimitations of memory system performance
Limitations of memory system performanceSyed Zaid Irshad
 
Traffic Engineering in Metro Ethernet
Traffic Engineering in Metro EthernetTraffic Engineering in Metro Ethernet
Traffic Engineering in Metro EthernetCSCJournals
 
Benefit based data caching in ad hoc networks (synopsis)
Benefit based data caching in ad hoc networks (synopsis)Benefit based data caching in ad hoc networks (synopsis)
Benefit based data caching in ad hoc networks (synopsis)Mumbai Academisc
 
Optimization of Remote Core Locking Synchronization in Multithreaded Programs...
Optimization of Remote Core Locking Synchronization in Multithreaded Programs...Optimization of Remote Core Locking Synchronization in Multithreaded Programs...
Optimization of Remote Core Locking Synchronization in Multithreaded Programs...ITIIIndustries
 
thread_ multiprocessor_ scheduling_a.ppt
thread_ multiprocessor_ scheduling_a.pptthread_ multiprocessor_ scheduling_a.ppt
thread_ multiprocessor_ scheduling_a.pptnaghamallella
 
Analysis of data transmission in wireless lan for 802.11
Analysis of data transmission in wireless lan for 802.11Analysis of data transmission in wireless lan for 802.11
Analysis of data transmission in wireless lan for 802.11eSAT Publishing House
 

Similar to shashank_spdp1993_00395543 (20)

Chap2 slides
Chap2 slidesChap2 slides
Chap2 slides
 
Compiler design
Compiler designCompiler design
Compiler design
 
Operating Systems R20 Unit 2.pptx
Operating Systems R20 Unit 2.pptxOperating Systems R20 Unit 2.pptx
Operating Systems R20 Unit 2.pptx
 
Enhanced transformer long short-term memory framework for datastream prediction
Enhanced transformer long short-term memory framework for datastream predictionEnhanced transformer long short-term memory framework for datastream prediction
Enhanced transformer long short-term memory framework for datastream prediction
 
Distributed system lectures
Distributed system lecturesDistributed system lectures
Distributed system lectures
 
QoS Framework for a Multi-stack based Heterogeneous Wireless Sensor Network
QoS Framework for a Multi-stack based Heterogeneous Wireless Sensor Network QoS Framework for a Multi-stack based Heterogeneous Wireless Sensor Network
QoS Framework for a Multi-stack based Heterogeneous Wireless Sensor Network
 
Os
OsOs
Os
 
ICICCE0298
ICICCE0298ICICCE0298
ICICCE0298
 
DL for sentence classification project Write-up
DL for sentence classification project Write-upDL for sentence classification project Write-up
DL for sentence classification project Write-up
 
Compositional Analysis for the Multi-Resource Server
Compositional Analysis for the Multi-Resource ServerCompositional Analysis for the Multi-Resource Server
Compositional Analysis for the Multi-Resource Server
 
week_2Lec02_CS422.pptx
week_2Lec02_CS422.pptxweek_2Lec02_CS422.pptx
week_2Lec02_CS422.pptx
 
CS8603_Notes_003-1_edubuzz360.pdf
CS8603_Notes_003-1_edubuzz360.pdfCS8603_Notes_003-1_edubuzz360.pdf
CS8603_Notes_003-1_edubuzz360.pdf
 
Limitations of memory system performance
Limitations of memory system performanceLimitations of memory system performance
Limitations of memory system performance
 
Traffic Engineering in Metro Ethernet
Traffic Engineering in Metro EthernetTraffic Engineering in Metro Ethernet
Traffic Engineering in Metro Ethernet
 
Benefit based data caching in ad hoc networks (synopsis)
Benefit based data caching in ad hoc networks (synopsis)Benefit based data caching in ad hoc networks (synopsis)
Benefit based data caching in ad hoc networks (synopsis)
 
Optimization of Remote Core Locking Synchronization in Multithreaded Programs...
Optimization of Remote Core Locking Synchronization in Multithreaded Programs...Optimization of Remote Core Locking Synchronization in Multithreaded Programs...
Optimization of Remote Core Locking Synchronization in Multithreaded Programs...
 
thread_ multiprocessor_ scheduling_a.ppt
thread_ multiprocessor_ scheduling_a.pptthread_ multiprocessor_ scheduling_a.ppt
thread_ multiprocessor_ scheduling_a.ppt
 
DESIGN AND IMPLEMENTATION OF ADVANCED MULTILEVEL PRIORITY PACKET SCHEDULING S...
DESIGN AND IMPLEMENTATION OF ADVANCED MULTILEVEL PRIORITY PACKET SCHEDULING S...DESIGN AND IMPLEMENTATION OF ADVANCED MULTILEVEL PRIORITY PACKET SCHEDULING S...
DESIGN AND IMPLEMENTATION OF ADVANCED MULTILEVEL PRIORITY PACKET SCHEDULING S...
 
Mq3624532158
Mq3624532158Mq3624532158
Mq3624532158
 
Analysis of data transmission in wireless lan for 802.11
Analysis of data transmission in wireless lan for 802.11Analysis of data transmission in wireless lan for 802.11
Analysis of data transmission in wireless lan for 802.11
 

shashank_spdp1993_00395543

  • 1. Analysis of Multithreaded Multiprocessors with Distributed Shared Memory* S.S. Nemawarkart, R. Govindarajant, G.R. Gao$, V.K. Agarwalt tDepartment of Electrical Engineering, $School of Computer Science McGill University, Montreal, H3A 2A7, Canada {shashank,govindr,gao,agarwal}@pike.ee.mcgill.ca Abstract In this paper we propose an analytical model, based on multi-chain closed queuing networks, to evaluate the performance of multithreaded multiprocessors. The queuing network is solved by using approximate Mean Value Analysis. Unlike earlier work which modeled in- dividual subsystems in isolation, our work models pro- cessor, memory and network subsystems an an inte- grated manner. Such an approach brings out a strong coupling between each pair of subsystems. For ex- ample, the processor and memory Utilizations respond identically to the variations in the network character- istics. Further, we observe that high performance on an application is achieved when the memory request rate of a processor equals the weighted sum of memory bandwidth and the average round trip distance of the remote memory across the network. 1 Introduction A multithreaded processor has the potential to tol- erate the effects of long memory latency, high network delays and unpredictable synchronization delays, as long as it has some available tasks to execute on. The processor executes on small sequences of instructions, called threads, interspersed with memory accesses and synchronizations. Multi-threaded architectures mask the long memory latency by suspending the execution of the current thread upon encountering the long la- tency operation and switching to another thread. This improves the processor utilization as the computation is overlapped with the memory access and synchro- nization delays. A context switch incurs an associated overhead of saving the state of the current thread and restoring that of the newly scheduled thread. The performance of a multithreaded architecture depends on a number of parameters related to the architecture-memory latency, context switch time, switch delay, and to the application- number of threads, thread runlength and remote memory access pattern'. In this paper, we are interested in an an- alytical study on the performance of a multithreaded multiprocessor with the processing elements connected across a two-dimensional mesh. The processors access a shared memory which is physically distributed across the processing elements. Previous studies on multithreaded architectures have focussed only on the processor utilization. Sim- ilarly, the studies on network characteristics concen- trate only on the network switch responses. These work [l,3, 111analyze a subsystem in isolation without considering the feedback effect of other subsystems. In contrast, we develop an integrated model of pro- cessor, memory and network subsystems, which helps in identifying the relationships among various param- eters for achieving high performance. The model is based on the multi-chain closed queuing network. A key intuitive observation for the model is that with the availability of an adequate number of enabled threads, we can anticipate that the system will behave as a closed queuing network provided that sufficiently large queues are maintained at each stage of the service sta- tion. In [6], Kruskal and Snir show that the difference between finite buffer and infinite buffer is not signifi- cant when the buffer size is greater than a few. 
Based on these observations, we can solve the closed queu- ing network. Owing to the amount of computations needed to study a multiprocessor system with many nodes [lo], we use an Approximate Mean Value Anal- ysis (AMVA). This analysis yields the performance measures such as processor and memory utilizations, which can be used to fine tune the program partition- ing for high performance. The performance results of our study establish that: (i) both the memory and network subsystems strongly *This work has been supported by MICRONET - Network Centers of Excellence 'Refer to Section 2.1 for a precise definition of some of the terms used here. 114 1063-637#93 $03.000 1993IEEE
  • 2. influence the processor utilization. A strong coupling exists between each pair of subsystems. Such behavior is not explicitly reported in earlier work. (ii)- when the rate of issue of memory requests in an application program is such that the resulting aver- age thread runlength equals the weighted sum of local memory latency and the network latencies, the proces- sor gets the response from the memory and the net- work before it runs out of workload. (iii)- the above matching of thread runlength with memory and network latencies also ensures a high uti- lization of all subsystems, and hence represents a suit- able operating range on an architecture. An equivalence between markov queuing models and the generalized stochastic petri nets has been shown in [8].So for validating these analytical solu- tions, we develop a stochastic timed petri net (STPN) model for a multithreaded architecture. In the following section we develop the analytical model. Section 3 describes the validation of the analytical re- sults using the STPN model. The results of this study are presented in Section 4. In Section 5, we compare our approach with related work. 2 Analytical Model for Multithreaded 2.1 A Multithreaded Architecture The rest of this paper organized as follows. Architecture In the multithreaded execution model, a program is a collection of partially-ordered threads. A thread consists of a sequence of instructions which are ex- ecuted sequentially as in the von Neumann model. The scheduling of individual threads is similar to a dataflow model. We enlist the terminology to be used in the rest of the paper: A thread undergoes followingstates during its lifetime: Suspended when it is waiting for its long-latency op- eration to be satisfied. Only one such operation per thread is allowed. Ready: when the long-latency operation for which a thread was waiting, is satisfied. Executing: when a ready thread is scheduled for exe- cution on processor pipeline. Each processor executes on a set of nt parallel threads. We assume that nt is a constant. Further, the threads interact only through the locations in either the lo- cal memory or the remote memory. A processor on which a group of nt threads executes, is called the host processor. These threads can not be scheduled for execution on any other processor. A thread is executed on the processor pipeline for the duration called runlength R, before getting suspended I Processor Subsystem - - - - - - Memory Subsystem I 1 . 1 I ' 4 ' 5 ' 6 ' 7 I I I II Figure 1: A Processing Element on a memory access. On suspension, the state of the outgoing thread is saved and the context of the newly scheduled thread is restored. This context switch time is C. A memory access is directed to a remote memory mod- ule with a probability premote.Local memory module services the remaining fraction (1- premote) of mem- ory accesses. At each switch node, a remote memory access suffers a delay of S time units. Memory latency L , is the access time of the local mem- ory without queuing delays. Observed memory latency Lobs is the response time experienced by a memory ac- cess, including the waiting time at the memory. The remote memory access pattern across the memory modules follows a geometric distribution. 
With geo- metric distribution, the probability of a remote mem- ory access requesting a memory module at a distance of h hops from the host processor, is p t w / u , where p,, is the probability of accessing the nearest mem- ory module across the network, and U is a normalizing constant. Higher value of p,, leads to higher locality of memory accesses. Processor utilization U,, is the fraction of time the processor is performing useful task in the execution pipeline. Similarly Memory utilization U,, is the fraction of time for which the memory port is busy, and Switch Utilization Unet,is the fraction of time for which the switch is busy. System Utilization Usys,is 115
  • 3. the average of the utilizations of all three components in a processing element. R, L, C and S can be measured as number of cycles, or in time units. Figure 1 shows a processing element consisting of a processor pipeline, memory and its switch inter- face. Multiple such processing elements are connected across a network of switches to form a multiprocessor system. A processor sends a remote memory access to the network through the switch. The access is for- warded along any of the available shortest-paths to the remote memory module. Upon service, the response is returned to the processor issuing this memory request. In this paper, we use a two-dimensional, bidirectional torus topology for the interconnection network (shown in Figure 2 ), similar to the one used in the ALEWIFE multiprocessor system [a]. Figure 2: 4 x 4 Multiprocessor with 2-dimensional mesh 2.2 Analytical Model and Assumptions We propose the use of a closed queuing network as shown in Figure 3 to model a multithreaded ar- chitecture described in Section 2.1. Such a queuing model has a product-form solution (refer to [4, 101for details) due to two reasons. One, there is sufficient buffer space in the thread pool, so that the processor is not blocked when a thread may wait at the thread pool for memory system to respond to its access. Two, there is no build up of active threads in the system, allowing the use of finite queues in the system without blocking the network switches. The queuing network model is composed of the pro- cessing elements with three types of nodes, namely : processor, memory and switch, a s shown in Figure 3. Processor node : The processor is a single server node. Ready threads are executed one at a time with the FCFS service dis- cipline, with exponential service time having a mean R. Further, all the processor nodes have same service time distribution. In our model, a thread is statically assigned to one of the processors and gets executed only in that processor. The threads executed on pro- cessor i are considered to be class i customers in the queuing network. A thread alternates between the ex- ecution on its host processor, and an access to memory module. Thus, the visit ratio' of a thread to its host processor is unity, and to other processors is zero. Memory Module : A memory module has a single server, with exponen- tially distributed access time, and the mean value is L time units. The visit ratio of a thread (belonging to a host i ) to a memory module j is emij. The value of emij depends on the distribution of remote memory requests across the memory modules. In this paper, we consider geometric distribution as discussed in Sec- tion 2.1. Switch Node : A switch node has a single server with an exponen- tial service time distribution with mean value S. The switch node interfaces a processor-memory block with the four3 neighboring switch nodes in a 2-dimensional torus network. The visit ratio esij ,from a processor i to a switch j, is the sum of : (i) twice the visit ratio emij to the memory module j, for class i, and (ii) the contribution due to the mem- ory modules situated at a distance greater than the number of hops from node i to node j . In summary, a processing element consisting of a processor, a memory, and a switch node, is connected to other processing elements through one or more switch nodes, as shown in Figure 3. Solving the above queuing network accurately is computationally intensive. 
Solving the above queuing network exactly is computationally intensive: the state space to be considered for computing the normalization constant G of the product-form solution [10] is enormous. For example, a two-processor system with 10 threads on each processor has 1,166,400 states. We therefore prefer the use of Approximate Mean Value Analysis (AMVA) [7]. For n_t threads on each of the P processors in the system, the AMVA evaluates: (i) the arrival rate λ_i of the threads belonging to each processor i; (ii) the waiting time w_{i,l} at each node l; and (iii) the queue length n_{i,l}; each for the population vectors N = (n_t, ..., n_t) and N − 1_i, i.e., with one customer fewer in the i-th class.

²The visit ratio for a class of threads at a node in a chain of a closed queuing network is the frequency with which a thread belonging to that class visits the node, with respect to a node (say the processor, selected arbitrarily for monitoring the system) in that chain.
³This number depends on the topology of the interconnection network.
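For concreteness, here is a minimal Schweitzer-style approximate MVA sketch for a closed multi-chain network of single-server FCFS stations. It is our own illustration of the technique, not the authors' implementation (their AMVA follows [7]); the array names e, s and n are ours.

```python
import numpy as np

def amva(e, s, n, tol=1e-8, max_iter=10_000):
    """Schweitzer approximate MVA for a closed multi-chain queuing network.
    e[c, k]: visit ratio of chain c at station k
    s[k]   : mean service time of station k (single-server, FCFS)
    n[c]   : thread population of chain c
    Returns chain throughputs lam[c] and total station utilizations U[k]."""
    e, s, n = np.asarray(e, float), np.asarray(s, float), np.asarray(n, float)
    C, K = e.shape
    q = np.zeros((C, K))
    for c in range(C):                       # spread each chain's threads evenly
        visited = e[c] > 0                   # over the stations it visits
        q[c, visited] = n[c] / visited.sum()
    for _ in range(max_iter):
        # queue seen by a chain-c arrival: everyone else's queue, plus
        # (n_c - 1)/n_c of its own chain's queue (Schweitzer's approximation)
        seen = q.sum(axis=0)[None, :] - q / n[:, None]
        w = s[None, :] * (1.0 + seen)        # waiting (residence) time per visit
        lam = n / (e * w).sum(axis=1)        # chain throughput, by Little's law
        q_new = lam[:, None] * e * w
        if np.abs(q_new - q).max() < tol:
            q = q_new
            break
        q = q_new
    U = (lam[:, None] * e * s[None, :]).sum(axis=0)
    return lam, U
```

With the model of Section 2.2, a 4 x 4 system has C = 16 chains and K = 48 stations (a processor, a memory and a switch per processing element); s[k] is R, L or S by station type, and grouping U[k] by type gives U_p, U_m and U_net.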
Figure 3: Queueing Network Model

Based on λ_i, w_{i,l}, the service times and the visit ratios, we can evaluate U_p, U_m, U_net and L_obs. We investigate the behavior of these performance measures in terms of the architectural and application parameters in Section 4. Section 3 presents the details of the simulation model, based on an STPN, used for the validation of the analytical results.

3 Validation of the Analytical Model

In this section we describe an STPN model for the multithreaded multiprocessor architecture which mimics the behavior of the queuing network model.

3.1 The STPN Model

Figure 4 shows the STPN model for a processing element containing a multithreaded processor, a memory module and the network interface. The processor subsystem is modeled by the transitions t0, t1 and t2, and the places p0, p1 and p2. The place p4 maintains a pool of ready threads. A thread executes for the duration R, at transition t1, before it encounters a long-latency memory access. Transition t2 models the context switch time C for saving and restoring the states of the outgoing and the newly scheduled threads, respectively. Further, t2 sends the memory access to a remote memory through p7 with probability p_remote, and to the local memory through p3. The memory access distribution is used to determine the destination of a remote access.

Figure 4: Petri Net Model for a Processing Element

The transitions t_recv, t_send and t8, and the places p7, p8 and p_port, model the network interface. Place p_port models the state of the network port. An incoming message from the network for a suspended thread in this processor is forwarded to p4, while a request to access the memory is forwarded to p3. The memory port, modeled by a token in p6, takes up a memory access from p3 for service at t3, with duration L. Transition t3 routes the response to p4 or p7 based on the processor originating the request. Transitions with non-zero delays are represented by rectangular boxes. R and L have exponentially distributed service times, and C has a fixed delay.

3.2 Validation

The above STPN model is simulated using the event-driven Petri net simulator Voltaire [9]. The performance results from the simulations, for various sets of input parameters, are compared with the analytical results. We report two representative cases with the following parameters: R = 10, C = 0, L = 10, S = 10, and p_remote = 0.1 or 0.5. The memory access pattern is geometric with p_sw = 0.5.

Figure 5 shows the processor utilization values obtained from the analysis and the Petri net simulations, represented as MODEL-Util and SIM-Util respectively. MODEL-Latency and SIM-Latency represent L_obs from the analysis and the simulations. First, we observe that U_p increases rapidly with n_t and then saturates. For example, at p_remote = 0.1, the values of U_p are 57%, 79% and 89% for n_t = 2, 5 and 10, respectively. At p_remote = 0.5, the corresponding values of U_p are lower. Secondly, L_obs increases linearly with n_t at p_remote = 0.1, since more threads wait at the memory module for service.
Figure 5: U_p and L_obs with respect to n_t

At p_remote = 0.5, L_obs is low and almost constant after n_t = 4, since a saturated network limits the memory access rate. In both cases, the analytical results match the simulations well throughout the range of the experiment. Thus the STPN simulations confirm the accuracy of our queuing analysis. Section 4 discusses the results based on the analysis of Section 2.2. As the analytical results match the simulations well, we report only the former.

4 Results

Our objective is to identify the relationships among the various application and architecture parameters so that we can achieve low execution time while maintaining high utilization of all subsystems. We use the analytical model developed above, with the base architecture parameters L = 10 and S = 10, on a 4 x 4 mesh of processing elements. The default values of the application parameters R, n_t and p_remote are 10, 8 and 0.1, unless stated otherwise. The default distribution of remote memory accesses is geometric with p_sw = 0.5. These values ensure saturated processor and memory performance if p_remote or S is zero.

4.1 Subsystem Utilizations

With p_remote = 0, the memory requests are restricted to the local memory module. An increase in p_remote increases the number of messages routed to remote memory modules across the network. This has a two-fold effect on performance: (i) since the latency of a remote access is higher than the local memory latency, due to the extra time spent traversing the network, the corresponding thread is suspended for a longer duration; (ii) a larger number of messages on the network leads to higher contention, or network congestion, which in turn increases the network latency. This reduces the utilization of the processor and memory subsystems. Figure 6 shows this effect of p_remote on the subsystem utilizations, for L = 10 and L = 20. At L = 10, an increase in p_remote from 0.2 to 0.8 reduces the values of U_p and U_m from nearly 90% to 23% and 22%, respectively. When U_net saturates, the fall in the values of U_p and U_m is steep. For L = 20, U_p and U_m decrease rapidly after the network saturates, in the same way. Also, a variation in p_remote affects U_p and U_m identically.

Figure 6: Subsystem Utilizations

Similar observations can be made when we consider the effect of the memory latency on the processor and network utilizations, or the effect of S on the processor and memory utilizations. If the memory latency is increased, the number of requests waiting at the memory increases, reducing the values of U_p and U_net. Similarly, an increase in S increases the network latency, so more threads at the processor remain suspended waiting for the corresponding memory responses to arrive. This in turn decreases the rate at which memory accesses are sent, so the values of U_p and U_m fall as S increases. Thus we observe a close coupling among the subsystems, based on our integrated model of the processor, memory and network subsystems.

4.2 System Utilization

Since a variation in any parameter of the system can affect the utilizations of all subsystems, we define the system utilization U_sys as the average of the utilizations of all subsystems. Having characterized the behavior of the subsystem utilizations in Section 4.1, we are interested in the ability of U_sys to track the transitions corresponding to the saturation of these subsystems.
Figure 7: The System Utilization with respect to L

Figure 7 plots the subsystem utilizations and U_sys with respect to the memory latency, for R = 10 and R = 20. When L is close to zero, the system utilization is low due to the low utilization of the memory. At values of L close to 100, the memory subsystem saturates but U_sys is low (the limiting value is 33%), due to low U_p and U_net. U_sys peaks when L = R = S = 10, since all subsystems are then close to their maximum utilizations. With L > 10, both U_p and U_net drop off sharply with L while only a small rise occurs in U_m, resulting in a low value of U_sys. The maximum value of U_sys is referred to as the peak system utilization (PSU); let the corresponding memory latency be L_PSU. From Figure 7 we observe that:

(i) U_sys reflects the relative values of U_p, U_m and U_net. When the parameters of the processor and memory subsystems are considered, the PSU occurs at L = R. We note that the PSU represents a transition phase in which one subsystem approaches saturation while the utilizations of the other subsystems drop. This is due to the balance of throughput between each pair of subsystems.

(ii) For R = 10 and L ≤ 10, at the PSU, U_p is only 5% below its maximum value while U_sys has improved by close to 25%. For R = 20 and L ≤ 20, these differences for U_p and U_sys are 7% and 30%. Thus, by keeping the operating range near the PSU, we gain considerably in overall system utilization at a small loss in processor utilization.

(iii) For any value of L less than L_PSU, U_p is high. Thus L_PSU represents the slowest memory with which we can operate without significantly hampering high system performance.

The bell-shaped plot of the system utilization also occurs with respect to changes in other parameters, such as R and S.

Effect of Network Parameters

Figure 8 shows the effect of p_remote on the system utilization for various values of S. We observe that: (i) the PSU lies between 70% and 80% for a wide range of S; and (ii) for faster switches, i.e. low S, U_net does not saturate until p_remote is high.

Figure 8: Effect of p_remote on U_sys for various S

Effect of Thread Runlength

Figure 9 plots the system utilization with respect to p_remote for various values of the thread runlength. Let the network latency T_avg be the average time taken by a message on the unloaded network to complete a round trip. For the geometric distribution of memory accesses with p_sw = 0.5, a remote memory access travels a distance d_avg = 1.733 hops on a 4 x 4 mesh. Thus a round trip takes 2 × 1.733 × 10 time units in the unloaded network. In addition, a delay of S (= 10) time units is incurred at the local switch on both the forward and the return path of the message. Hence T_avg (= 34.66 + 20 = 54.66) is given by:

T_avg = 2 × d_avg × S + 2 × S    (1)

In Figure 9, for R ≤ 10, the PSU increases with R from 67% to 79%. Also, the PSU almost always occurs at p_remote ≈ 0.18. Since R ≤ L, a thread spends less time at the processor than at the memory module, so the PSU is a result of the matching of throughput between the memory and network subsystems.
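Plugging the base parameters into Equation 1 gives a quick numeric check (a sketch; the variable names are ours) of the round-trip latency and of the balance point derived in the next paragraph:

```python
S, L, d_avg = 10, 10, 1.7333        # base parameters of Section 4
T_avg = 2 * d_avg * S + 2 * S       # Eq. 1: hops out and back, plus the
                                    # local switch in each direction
print(round(T_avg, 2))              # 54.67 (the text truncates to 54.66)
print(round(L / T_avg, 5))          # 0.18293: memory/network balance point
```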
Figure 9: Effect of p_remote on U_sys for various R

A memory module returns remote memory accesses to the network at the rate p_remote/L. At the PSU, the throughput of the incoming messages from the network (= 1/T_avg) equals the throughput of the responses from the memory module (= p_remote/L):

1/T_avg = p_remote/L    (2)

So p_remote is L/T_avg = 0.18293. For R ≥ 10, the processor and network subsystems govern the PSU value. A processor sends out memory requests at the rate 1/R. A fraction (= p_remote) of these is directed across the network to remote memory modules. The network delivers the messages to the processor at the rate 1/T_avg. As the throughputs should match at the PSU, p_remote should equal R/T_avg. Considering these two scenarios together, the maximum value of the PSU occurs when the throughputs of the three subsystems are equal. That is, the thread runlength, memory latency and network latency should be such that:

R = L = p_remote × T_avg    (3)

Equation 3 is a direct result of Equation 2 when we consider the throughput balance at the processor subsystem. In Figure 6, we observe that upon network saturation the values of U_p and U_m are close to R/(p_remote × T_avg) and L/(p_remote × T_avg), respectively. Similarly, for increasing L, when the memory subsystem reaches saturation, the values of U_p and U_net are proportional to R/L and (p_remote × T_avg)/L, respectively.

4.3 Locality of Memory Accesses

If the remote memory access pattern is a geometric distribution, an increase in p_sw increases d_avg for a message on the network, and hence the network latency. Figure 10 shows the effect of increasing p_sw on the system utilization, for various values of the thread runlength, with p_remote = 0.17. For low values of p_sw, the PSU occurs due to saturation of the processor and memory subsystems. The PSU increases from 65% to 78% when p_sw is increased from 0.1 to 0.7, due to an increase in the value of U_net. A further increase in p_sw to 0.9 brings the PSU down to 72%, due to lower values of U_p and U_m.

Figure 10: U_sys with Geometric Distribution (vs. thread runlength, for p_sw = 0.3 to 0.9)

4.4 Summary of Results

Our study suggests the following conditions for achieving high performance:

• Overall high utilization of all subsystems is achieved when (i) the thread runlength R equals the memory latency L; and (ii) the remote memory access rate (p_remote/R) equals the network service rate 1/T_avg.

• The above condition is necessary irrespective of how large n_t is.

• Applications with larger locality can tolerate slower networks without much degradation in performance, due to the reduced network traffic.

5 Related Work

A few analytical studies on multithreaded architectures have been reported in the literature [1, 3, 11]. In [3] and [11], stochastic timed Petri nets (STPNs) have been used for modeling. These analyses assume that the response time of the memory is constant; equivalently, that the parallelism in the application (i.e., n_t) has no impact on the throughput of the memory subsystem.
Further, [3] studies a bus-based multiprocessor without contention effects on the bus. In contrast, using an integrated model of a multiprocessor, we study a realistic system with queuing delays at the network and memory subsystems.

In [1], the analysis has been performed for a cache-based multiprocessor architecture. The analysis models a finite number of threads and their interference in the cache. It focuses on the performance of the processor, but the other subsystems (like memory and network) are not studied. On the other hand, our analysis does not model caches explicitly; the thread runlength R is related to the cache miss rate of an application. The two approaches are complementary. The analysis presented by Johnson [5] provides a framework combining simple models of application, processor and network behavior. The model assumes an unsaturated network, but does not consider the memory subsystem in detail. We develop a fairly simple integrated model of the system which can be adapted to different networks quickly, and which is applicable to both saturated and unsaturated subsystems.

In [12], performance results for trace-driven simulations of a multithreaded system with a shared bus are reported. They conclude that a small number of threads in an application can achieve near-100% processor utilization, but that heavy global traffic can limit the performance benefits of multithreading. Our study extends these results by suggesting the operating range for obtaining higher performance.

6 Conclusions

In this paper, we have proposed a simple analytical model for a multithreaded multiprocessor architecture, based on a closed queuing network with a finite thread population. The performance study based on this analytical model, integrating the processor, memory and network subsystems, shows that:

• a strong coupling exists between these subsystems: variations in the parameters of one subsystem affect the utilizations of the other subsystems as well;

• for high performance, the partitioning of a program should result in thread runlengths close to the weighted sum of the memory latency and the network latency; this is necessary irrespective of the application parallelism n_t;

• a larger locality in the application program reduces the network traffic, resulting in higher performance.

References

[1] A. Agarwal. Performance tradeoffs in multithreaded processors. IEEE Transactions on Parallel and Distributed Systems, 2(4), September 1992.
[2] A. Agarwal, B.H. Lim, D. Kranz, and J. Kubiatowicz. APRIL: A processor architecture for multiprocessing. In Proc. of the 17th Int'l. Symp. on Computer Architecture, pages 104-114, 1990.
[3] L. Alkalaj and R.V. Bopanna. Performance of multithreaded execution in a shared-memory multiprocessor. In Proc. of the 3rd Ann. IEEE Symp. on Parallel and Distributed Processing, pages 330-333, Dallas, USA, December 1991.
[4] F. Baskett, K. Mani Chandy, R.R. Muntz, and F.G. Palacios. Open, closed, and mixed networks of queues with different classes of customers. Journal of the ACM, 22(2):248-260, April 1975.
[5] K. Johnson. The impact of communication locality on large-scale multiprocessor performance. In Proceedings of the 19th International Symposium on Computer Architecture, pages 392-402, May 1992.
[6] C.P. Kruskal and M. Snir. The performance of multistage interconnection networks. IEEE Transactions on Computers, C-32(12):1091-1098, December 1983.
[7] E.D. Lazowska, J. Zahorjan, G.S. Graham, and K.C. Sevcik. Quantitative System Performance: Computer System Analysis Using Queueing Network Models. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1984.
[8] T. Murata. Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4):541-580, April 1989.
[9] P. Parent and O. Tanir. Voltaire: A discrete event simulator. In Proceedings of the Fourth International Workshop on Petri Nets and Performance Models, Melbourne, Australia, December 1991.
[10] M. Reiser and S. Lavenberg. Mean value analysis of closed multichain queueing networks. Journal of the ACM, 27(2):313-322, April 1980.
[11] R.H. Saavedra-Barrera, D.E. Culler, and T. von Eicken. Analysis of multithreaded architectures for parallel computing. In Proc. of the 2nd Ann. ACM Symp. on Parallel Algorithms and Architectures, Crete, Greece, July 1990.
[12] W.D. Weber and A. Gupta. Exploring the benefits of multiple contexts in a multiprocessor architecture: Preliminary results. In Proceedings of the 16th Annual International Symposium on Computer Architecture, pages 273-280, 1989.