Analysis of Multithreaded Multiprocessors with
Distributed Shared Memory*
S.S. Nemawarkar†, R. Govindarajan†, G.R. Gao‡, V.K. Agarwal†
†Department of Electrical Engineering, ‡School of Computer Science
McGill University, Montreal, H3A 2A7, Canada
{shashank,govindr,gao,agarwal}@pike.ee.mcgill.ca
Abstract
In this paper we propose an analytical model, based
on multi-chain closed queuing networks, to evaluate
the performance of multithreaded multiprocessors. The
queuing network is solved by using approximate Mean
Value Analysis. Unlike earlier work which modeled in-
dividual subsystems in isolation, our work models pro-
cessor, memory and network subsystems an an inte-
grated manner. Such an approach brings out a strong
coupling between each pair of subsystems. For ex-
ample, the processor and memory Utilizations respond
identically to the variations in the network character-
istics. Further, we observe that high performance on
an application is achieved when the memory request
rate of a processor equals the weighted sum of memory
bandwidth and the average round trip distance of the
remote memory across the network.
1 Introduction
A multithreaded processor has the potential to tol-
erate the effects of long memory latency, high network
delays and unpredictable synchronization delays, as
long as it has available tasks to execute. The processor executes small sequences of instructions, called threads, interspersed with memory accesses and synchronizations. Multithreaded architectures mask
the long memory latency by suspending the execution
of the current thread upon encountering the long la-
tency operation and switching to another thread. This
improves the processor utilization as the computation
is overlapped with the memory access and synchro-
nization delays. A context switch incurs an associated
overhead of saving the state of the current thread and
restoring that of the newly scheduled thread.
The performance of a multithreaded architecture depends on a number of parameters related to the architecture (memory latency, context switch time and switch delay) and to the application (number of threads, thread runlength and remote memory access pattern¹). In this paper, we are interested in an an-
alytical study on the performance of a multithreaded
multiprocessor with the processing elements connected
across a two-dimensional mesh. The processors access
a shared memory which is physically distributed across
the processing elements.
Previous studies on multithreaded architectures have focussed only on the processor utilization. Similarly, the studies on network characteristics concentrate only on the network switch responses. These works [1, 3, 11] analyze a subsystem in isolation, without considering the feedback effects of the other subsystems.
In contrast, we develop an integrated model of pro-
cessor, memory and network subsystems, which helps
in identifying the relationships among various param-
eters for achieving high performance. The model is
based on the multi-chain closed queuing network. A
key intuitive observation for the model is that with the
availability of an adequate number of enabled threads,
we can anticipate that the system will behave as a
closed queuing network provided that sufficiently large
queues are maintained at each stage of the service sta-
tion. In [6], Kruskal and Snir show that the difference between finite and infinite buffers is not significant when the buffer size is larger than a few entries. Based on these observations, we can solve the closed queuing network. Owing to the amount of computation needed to study a multiprocessor system with many nodes [10], we use an Approximate Mean Value Analysis (AMVA). This analysis yields the performance
measures such as processor and memory utilizations,
which can be used to fine tune the program partition-
ing for high performance. The performance results of our study establish that:
(i) both the memory and network subsystems strongly influence the processor utilization, and a strong coupling exists between each pair of subsystems; such behavior is not explicitly reported in earlier work;
(ii) when the rate of issue of memory requests in an application program is such that the resulting average thread runlength equals the weighted sum of the local memory latency and the network latencies, the processor gets the responses from the memory and the network before it runs out of workload;
(iii) this matching of the thread runlength with the memory and network latencies also ensures a high utilization of all subsystems, and hence represents a suitable operating range for an architecture.

*This work has been supported by MICRONET - Network Centers of Excellence.
¹Refer to Section 2.1 for a precise definition of some of the terms used here.
An equivalence between Markov queuing models and generalized stochastic Petri nets has been shown in [8]. So, for validating these analytical solutions, we develop a stochastic timed Petri net (STPN) model for a multithreaded architecture.
The rest of this paper is organized as follows. In the following section we develop the analytical model. Section 3 describes the validation of the analytical results using the STPN model. The results of this study are presented in Section 4. In Section 5, we compare our approach with related work.
2 Analytical Model for a Multithreaded Architecture
2.1 A Multithreaded Architecture
In the multithreaded execution model, a program is a collection of partially-ordered threads. A thread consists of a sequence of instructions which are executed sequentially, as in the von Neumann model. The scheduling of individual threads is similar to a dataflow model. We list the terminology used in the rest of the paper:
A thread undergoes the following states during its lifetime:
Suspended: when it is waiting for its long-latency operation to be satisfied. Only one such operation per thread is allowed.
Ready: when the long-latency operation for which a thread was waiting is satisfied.
Executing: when a ready thread is scheduled for execution on the processor pipeline.
Each processor executes a set of n_t parallel threads. We assume that n_t is constant. Further, the threads interact only through locations in either the local memory or a remote memory. A processor on which a group of n_t threads executes is called their host processor. These threads cannot be scheduled for execution on any other processor.
A thread is executed on the processor pipeline for a duration called its runlength R, before getting suspended on a memory access. On suspension, the state of the outgoing thread is saved and the context of the newly scheduled thread is restored. This context switch time is C.

Figure 1: A Processing Element
A memory access is directed to a remote memory module with probability p_remote. The local memory module services the remaining fraction (1 - p_remote) of memory accesses. At each switch node, a remote memory access suffers a delay of S time units.
Memory latency L is the access time of the local memory without queuing delays. Observed memory latency L_obs is the response time experienced by a memory access, including the waiting time at the memory.
The remote memory access pattern across the memory modules follows a geometric distribution: the probability of a remote memory access requesting a memory module at a distance of h hops from the host processor is p_sw^h / u, where p_sw is the probability of accessing the nearest memory modules across the network, and u is a normalizing constant. A lower value of p_sw concentrates the accesses closer to the host processor, i.e., gives a higher locality of memory accesses.
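To make the distribution concrete, the sketch below (our own illustration, not code from the paper; the function name and the choice to normalize over hop distances 1 through 4 are assumptions, consistent with the d_avg = 1.733 value quoted in Section 4.2) computes the normalized per-distance probabilities p(h) = p_sw^h / u on a 4 x 4 torus:

```python
import numpy as np

def geometric_hop_probs(p_sw, max_hops):
    """Normalized probability p(h) = p_sw^h / u that a remote access
    targets a module h hops away, for h = 1 .. max_hops."""
    w = p_sw ** np.arange(1, max_hops + 1)
    return w / w.sum()

# On a 4 x 4 bidirectional torus the farthest module is 4 hops away.
for p_sw in (0.1, 0.5, 0.9):
    p = geometric_hop_probs(p_sw, max_hops=4)
    d_avg = np.arange(1, 5) @ p        # mean hop distance of a remote access
    print(f"p_sw={p_sw}: p(h)={p.round(3)}, d_avg={d_avg:.3f}")
# p_sw = 0.5 gives d_avg = 1.733, the value used in Section 4.2; a smaller
# p_sw concentrates accesses near the host, i.e., yields higher locality.
```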
Processor utilization U_p is the fraction of time the processor is performing useful work in the execution pipeline. Similarly, memory utilization U_m is the fraction of time for which the memory port is busy, and switch utilization U_net is the fraction of time for which the switch is busy. System utilization U_sys is the average of the utilizations of all three components in a processing element.
R, L, C and S can be measured in numbers of cycles, or in time units.
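In these terms, the system utilization used in Section 4 is simply the unweighted average of the three component utilizations stated above:

```latex
U_{sys} \;=\; \frac{U_p + U_m + U_{net}}{3}
```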
Figure 1 shows a processing element consisting of
a processor pipeline, memory and its switch inter-
face. Multiple such processing elements are connected
across a network of switches to form a multiprocessor
system. A processor sends a remote memory access
to the network through the switch. The access is for-
warded along any of the available shortest-paths to the
remote memory module. Upon service, the response is
returned to the processor issuing this memory request.
In this paper, we use a two-dimensional, bidirectional torus topology for the interconnection network (shown in Figure 2), similar to the one used in the ALEWIFE multiprocessor system [2].

Figure 2: 4 x 4 Multiprocessor with 2-dimensional mesh
2.2 Analytical Model and Assumptions
We propose the use of a closed queuing network, as shown in Figure 3, to model the multithreaded architecture described in Section 2.1. Such a queuing model has a product-form solution (refer to [4, 10] for details) for two reasons. One, there is sufficient buffer space in the thread pool, so the processor is not blocked while a thread waits at the thread pool for the memory system to respond to its access. Two, there is no build-up of active threads in the system, allowing the use of finite queues in the system without blocking the network switches.
The queuing network model is composed of the processing elements, with three types of nodes, namely processor, memory and switch, as shown in Figure 3.
Processor node:
The processor is a single-server node. Ready threads are executed one at a time with the FCFS service discipline, with an exponential service time having mean R. Further, all the processor nodes have the same service time distribution. In our model, a thread is statically assigned to one of the processors and gets executed only on that processor. The threads executed on processor i are considered to be class i customers in the queuing network. A thread alternates between execution on its host processor and an access to a memory module. Thus, the visit ratio² of a thread to its host processor is unity, and to other processors is zero.
Memory Module:
A memory module has a single server, with an exponentially distributed access time whose mean value is L time units. The visit ratio of a thread (belonging to a host i) to a memory module j is em_ij. The value of em_ij depends on the distribution of remote memory requests across the memory modules. In this paper, we consider the geometric distribution discussed in Section 2.1.
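As an illustration of how the em_ij values could be derived from this distribution (our own sketch: the torus-distance helper and the even split of the per-distance probability among equidistant modules are assumptions, not the paper's stated procedure):

```python
import numpy as np

def torus_dist(i, j, side=4):
    """Hop (Manhattan) distance between PEs i and j on a side x side torus."""
    (xi, yi), (xj, yj) = divmod(i, side), divmod(j, side)
    dx = min(abs(xi - xj), side - abs(xi - xj))
    dy = min(abs(yi - yj), side - abs(yi - yj))
    return dx + dy

def memory_visit_ratios(i, p_remote, p_sw, side=4):
    """em[j]: visit ratio of class-i threads to memory module j."""
    n = side * side
    h = np.array([torus_dist(i, j, side) for j in range(n)])
    p_h = p_sw ** np.arange(1.0, h.max() + 1)
    p_h /= p_h.sum()                          # p(h) = p_sw^h / u, as in Sec. 2.1
    em = np.zeros(n)
    em[i] = 1.0 - p_remote                    # local accesses stay at the host
    for k, ph in enumerate(p_h, start=1):
        at_k = np.flatnonzero(h == k)         # modules exactly k hops away
        em[at_k] = p_remote * ph / len(at_k)  # even split among them (assumed)
    return em                                 # sums to 1: one access per visit

em = memory_visit_ratios(i=0, p_remote=0.5, p_sw=0.5)
print(em.round(4), em.sum())                  # em.sum() == 1.0
```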
Switch Node:
A switch node has a single server with an exponential service time distribution with mean value S. The switch node interfaces a processor-memory block with the four³ neighboring switch nodes in the 2-dimensional torus network. The visit ratio es_ij, from a processor i to a switch j, is the sum of:
(i) twice the visit ratio em_ij to the memory module j, for class i, and (ii) the contribution due to the memory modules situated at a distance greater than the number of hops from node i to node j, as written out below.
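In symbols (our notation; the routing fraction r_ijk below is our own symbol for the share of class-i traffic to module k that passes through switch j, which the paper describes only in words):

```latex
e^{s}_{ij} \;=\; 2\,e^{m}_{ij} \;+\; \sum_{k \,:\, d(i,k) \,>\, d(i,j)} 2\, e^{m}_{ik}\, r_{ijk}
```

The factor of 2 accounts for the forward and return paths of each remote access.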
In summary, a processing element consisting of a
processor, a memory, and a switch node, is connected
to other processing elements through one or more
switch nodes, as shown in Figure 3.
Solving the above queuing network accurately is computationally intensive. The state space to be considered for computing the normalization constant, G, of the product-form solution [10] is enormous. For example, a two-processor system with 10 threads on each processor has 1,166,400 states. So we prefer the use of Approximate Mean Value Analysis (AMVA) [7]. For n_t threads on each of the P processors in the system, the AMVA evaluates:
(i) the arrival rate λ_i for the threads belonging to each processor i; (ii) the waiting time w_{i,l} at each node l; and (iii) the queue length n_{i,l}, for the population vectors N = (n_t, ..., n_t) and N - 1_i, i.e., with one customer less in the i-th class.
Based on λ_i, w_{i,l}, the service times and the visit ratios, we can evaluate U_p, U_m, U_net and L_obs. We investigate the behavior of these performance measures in terms of the architectural and application parameters in Section 4. Section 3 presents the details of the simulation model based on STPN used for the validation of the analytical results.

Figure 3: Queueing Network Model

²The visit ratio for a class of threads at a node in a chain of a closed queuing network is the frequency with which a thread belonging to that class visits the node, relative to a reference node (say the processor, selected arbitrarily for monitoring the system) in that chain.
³This number depends upon the topology of the interconnection network.
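For concreteness, here is a minimal sketch of a Schweitzer-style approximate MVA iteration for a multi-class closed network (our own code and naming; the paper's exact AMVA variant [7] and its handling of the fixed context-switch delay C may differ). Station service times are R for processors, L for memories and S for switches, and the visit ratios come from the em_ij and es_ij values above:

```python
import numpy as np

def amva(visit, service, pop, tol=1e-8):
    """Schweitzer approximate MVA for a multi-class closed queuing network.

    visit[c, k] : visit ratio of class c at station k
    service[k]  : mean service time of station k (R, L or S here)
    pop[c]      : population of class c (the n_t threads of processor c)
    Returns per-class throughputs and per-class/station queue lengths.
    """
    visit = np.asarray(visit, float)
    service = np.asarray(service, float)
    pop = np.asarray(pop, float)
    C, K = visit.shape
    n = np.outer(pop, np.full(K, 1.0 / K))     # initial guess: even spread
    while True:
        # queue an arriving class-c customer 'sees' (Schweitzer estimate)
        seen = n.sum(axis=0)[None, :] - n / pop[:, None]
        w = service[None, :] * (1.0 + seen)    # per-class waiting times
        x = pop / (visit * w).sum(axis=1)      # throughputs (Little's law)
        n_new = x[:, None] * visit * w         # updated queue lengths
        if np.abs(n_new - n).max() < tol:
            return x, n_new
        n = n_new

# Station utilizations then follow as U_k = sum over c of x[c] * visit[c, k] * service[k].
```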
3 Validation of the Analytical Model
In this section we describe an STPN model for
the multithreaded multiprocessor architecture which
mimics the behavior of the queuing network model.
3.1 The STPN Model
Figure 4 shows the STPN model for a processing element containing a multithreaded processor, a memory module and the network interface. The processor subsystem is modeled by the transitions t0, t1 and t2, and the places p0, p1 and p2. The place p4 maintains a pool of ready threads. A thread executes for the duration R at the transition t1, before it encounters a long-latency memory access. Transition t2 models the context switch time C for saving/restoring the states of the outgoing and the newly scheduled threads, respectively. Further, t2 sends the memory access to a remote memory through p7 with probability p_remote, and to the local memory through p3. The memory access distribution is used to determine the destination for a remote access.
Figure 4: Petri Net Model for a Processing Element
The transitions t_recv, t_send and t8, and the places p7, p8 and p_port, model the network interface. Place p_port models the state of the network port. An incoming message from the network for a suspended thread on this processor is forwarded to p4, while a request to access the memory is forwarded to p3.
The memory port, modeled by a token in p6, takes up a memory access from p3 for service at t3 with duration L. Transition t3 routes the response to p4 or p7, based on the processor originating the request.
Transitions with non-zero delays are represented using rectangular boxes. R and L have exponentially distributed service times, and C has a fixed time delay.
3.2 Validation
The above STPN model is simulated using an event-driven Petri net simulator, Voltaire [9]. The performance results from the simulations, for various sets of input parameters, are compared with the analytical results. We report two representative cases with the following parameters: R = 10, C = 0, L = 10, S = 10, and p_remote = 0.1 or 0.5. The memory access pattern is geometric with p_sw = 0.5.
Figure 5 shows the processor utilization values obtained from the analysis and the Petri net simulations, represented as MODEL-Util and SIM-Util respectively. MODEL-Latency and SIM-Latency represent L_obs from the analysis and the simulations. First, we observe that U_p increases rapidly with n_t and then saturates. For example, at p_remote = 0.1, the values of U_p are 57%, 79% and 89% for n_t = 2, 5 and 10, respectively. At p_remote = 0.5, the corresponding values of U_p are lower. Secondly, L_obs increases linearly with n_t at p_remote = 0.1, since more threads wait at the memory module for service. At p_remote = 0.5, L_obs is low and almost constant after n_t = 4, since a saturated network limits the memory access rate. In both cases, the analytical results match well with the simulations throughout the range of the experiment. Thus the STPN simulations confirm the accuracy of our queuing analysis.

Figure 5: U_p and L_obs with respect to n_t.
Section 4 discusses the results based on the analysis
of Section 2.2. As the analytical results match well
with the simulations, we report the former only.
4 Results
Our objective is to identify the relationships among the various application and architecture parameters so that we can achieve a low execution time while maintaining a high utilization of all subsystems. We use the analytical model developed above, with the base architecture parameters L = 10 and S = 10, on a 4 x 4 mesh of processing elements. The default values for the application parameters R, n_t and p_remote are 10, 8 and 0.1, unless stated otherwise. The default distribution for remote memory accesses is geometric with p_sw = 0.5. These values ensure saturated processor and memory performance if p_remote or S is zero.
4.1 Subsystem Utilizations
With p_remote = 0, the memory requests are restricted to the local memory module. An increase in p_remote increases the number of messages routed to remote memory modules across the network. This has a two-fold effect on performance: (i) since the latency for a remote access is higher (than the local memory latency) due to the extra time spent in traversing the network, the corresponding thread is suspended for a longer duration; (ii) a larger number of messages on the network leads to higher contention, or network congestion, which in turn increases the network latency. This reduces the utilization of the processor and memory subsystems. Figure 6 shows this effect of p_remote on the subsystem utilizations, for L = 10 and L = 20. At L = 10, an increase in p_remote from 0.2 to 0.8 reduces the values of U_p and U_m from nearly 90% to 23% and 22%, respectively. When U_net saturates, the fall in the values of U_p and U_m is steep. For L = 20, U_p and U_m decrease rapidly after the network saturates, in the same way. Also, a variation in p_remote affects both U_p and U_m identically.
Figure 6: Subsystem Utilizations (versus remote access probability).
Similar observations can be made when we consider the effect of the memory latency on the processor and network utilizations, or the effect of S on the processor and memory utilizations. If the memory latency is increased, then the number of requests waiting at the memory increases, reducing the values of U_p and U_net. Similarly, an increase in S increases the network latency, so more threads at the processor remain suspended waiting for the corresponding memory responses to arrive. This in turn decreases the rate at which memory accesses are sent, resulting in a fall in the U_p and U_m values with respect to an increase in S.
Thus we observe a close coupling among the subsystems, based on our integrated model of the processor, memory and network subsystems.
4.2 System Utilization
Since variations in any parameter of the system can affect the utilizations of all subsystems, we define the system utilization U_sys as the average of the utilizations of all subsystems. Knowing the behavior of the subsystem utilizations (from Section 4.1), we are interested in the ability of U_sys to track the transitions corresponding to the saturation of these subsystems. Figure 7 plots the subsystem utilizations and U_sys with respect to the memory latency, for R = 10 and R = 20.

Figure 7: The System Utilization with respect to L.
When L is close to zero, the system utilization is low due to the low utilization of the memory. At values of L close to 100, the memory subsystem saturates, but U_sys is low (the limiting value is 33%) due to low U_p and U_net. For U_sys, a peak occurs when L = R = S = 10, since all subsystems are close to their maximum utilization values. With L > 10, both U_p and U_net drop off sharply with L, and only a small rise occurs in U_m, resulting in a low value of U_sys. The maximum value of U_sys is referred to as the peak system utilization (PSU). Let the corresponding memory latency be L_PSU. From Figure 7 we observe that:
(i) U_sys reflects the relative values of U_p, U_m and U_net. When the parameters of the processor and memory subsystems are considered, PSU occurs at L = R. We note that PSU represents a transition phase in which one subsystem approaches saturation and the utilizations of the other subsystems drop. This is due to the balance of throughput between any pair of subsystems.
(ii) For R = 10 and L ≤ 10, at PSU, U_p is only 5% less than its maximum value, while U_sys has improved by close to 25%. For R = 20 and L ≤ 20, these differences for U_p and U_sys are 7% and 30%. Thus, by keeping the operating range near PSU, we can gain considerably in overall system utilization with a small loss in processor utilization.
(iii) For any value of L less than L_PSU, U_p is high. Thus L_PSU represents the slowest memory with which we can operate without significantly hampering the system performance.
The bell-shaped plot for the system utilization also occurs with respect to changes in other parameters, such as R and S.
Effect of Network Parameters
Figure 8 shows the effect of p_remote on the system utilization for various values of S. We observe that: (i) PSU lies between 70% and 80% for a wide range of S; (ii) for faster switches, i.e. low S, U_net does not saturate until p_remote is high.
Figure 8: Effect of p_remote on U_sys for various S.
Effect of Thread Runlength
Figure 9 plots the system utilization with respect to p_remote for various values of the thread runlength. Let the network latency T_avg be the average time taken by a message on the unloaded network to complete a round trip. For the geometric distribution of memory accesses with p_sw = 0.5, a remote memory access travels a distance d_avg = 1.733 hops on a 4 x 4 mesh. Thus a round trip takes 2 x 1.733 x 10 time units in the unloaded network. In addition, a delay of S (= 10) time units is incurred at the local switch on the forward as well as the return path of the message. Hence T_avg (= 34.66 + 20 = 54.66) is given by:

    T_avg = 2 x d_avg x S + 2 x S    (1)
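Numerically (a small check in our own notation):

```python
S, d_avg = 10, 1.733                 # switch delay and mean remote distance
T_avg = 2 * d_avg * S + 2 * S        # hops out and back, plus the local
print(round(T_avg, 2))               # switch on each path: 34.66 + 20 = 54.66
```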
In Figure 9, for R ≤ 10, PSU increases with R from 67% to 79%. Also, the PSU almost always occurs at p_remote ≈ 0.18. Since R ≤ L, a thread spends less time at the processor than it spends at the memory module, and PSU is a result of the matching of throughput between the memory and network subsystems. A memory module returns the remote memory accesses to the network at the rate p_remote/L. At PSU, the throughput of the incoming messages from the network (= 1/T_avg) equals the throughput of the responses from the memory module (= p_remote/L):

    1/T_avg = p_remote/L    (2)

So p_remote is L/T_avg = 0.18293. For R ≥ 10, the processor and network subsystems govern the PSU value. A processor sends out memory requests at the rate 1/R. A fraction (= p_remote) of these is directed across the network to the remote memory modules. The network delivers the messages to the processor at the rate 1/T_avg. As the throughputs should match at PSU, p_remote should equal R/T_avg. Considering these two scenarios together, the maximum value of PSU occurs when the throughputs of the three subsystems are equal. That is, the thread runlength, memory latency and network latency should be such that:

    R = L = p_remote x T_avg    (3)

Equation 3 is a direct result of Equation 2, when we consider the throughput balance at the processor subsystem. In Figure 6, we observe that upon network saturation the values of U_p and U_m are close to R/(p_remote x T_avg) and L/(p_remote x T_avg), respectively. Similarly, for increasing L, when the memory subsystem reaches saturation, the values of U_p and U_net are proportional to R/L and (p_remote x T_avg)/L, respectively.

Figure 9: Effect of p_remote on U_sys for various R.
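The PSU operating point implied by Equations 2 and 3 can be verified with the same numbers (again our own sketch):

```python
L, T_avg = 10, 54.66
p_remote = L / T_avg                 # Equation 2: memory-network balance
R = p_remote * T_avg                 # Equation 3 then forces R = L
print(round(p_remote, 3), R)         # ~0.183, the PSU point in Figure 9; 10.0
```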
4.3 Locality of Memory Accesses
If the remote memory access pattern is a geometric distribution, an increase in p_sw increases d_avg for a message on the network, and hence the network latency. Figure 10 shows the effect of increasing p_sw on the system utilization, for various values of the thread runlength, when p_remote = 0.17. For low values of p_sw, PSU occurs due to saturation at the processor and memory subsystems. PSU increases from 65% to 78% when p_sw is increased from 0.1 to 0.7, due to an increase in the value of U_net. A further increase in p_sw to 0.9 brings PSU down to 72%, due to lower values of U_p and U_m.
Figure 10: U_sys with Geometric Distribution (versus thread runlength, for various p_sw).
4.4 Summary of Results
Our study suggests the following conditions for achieving high performance:
• Overall high utilization of all subsystems is achieved when (i) the thread runlength R equals the memory latency L; and (ii) the remote memory access rate (p_remote/R) equals the network service rate 1/T_avg.
• The above condition is necessary irrespective of how large n_t is.
• Applications with larger locality can tolerate slower networks without much degradation in performance, due to the reduced network traffic.
5 Related Work
A few analytical studies on multithreaded architectures have been reported in the literature [1, 3, 11]. In [3] and [11], stochastic timed Petri nets (STPNs) have been used for modeling. These analyses assume that the response time of the memory is constant; equivalently, the parallelism in the application (i.e. n_t) has no impact on the throughput of the memory subsystem. Further, [3] studies a bus-based multiprocessor without contention effects on the bus. In contrast, using an integrated model of a multiprocessor, we study a realistic system with queuing delays at the network and memory subsystems.
In [1], the analysis has been performed for a cache-based multiprocessor architecture. The analysis models a finite number of threads and their interference in the cache. It focuses on the performance of a processor, but the other subsystems (like the memory and the network) have not been studied. On the other hand, our analysis does not model caches explicitly; the thread runlength R is related to the cache miss rate for an application. The two approaches are complementary. The analysis presented by Johnson [5] provides a framework by combining simple models of application, processor and network behavior. The model assumes an unsaturated network, and does not consider the memory subsystem in detail. We develop a fairly simple integrated model of the system which can be adapted to different networks quickly. Our model is applicable to saturated as well as unsaturated subsystems.
In [12], performance results for trace-driven simulations of a multithreaded system with a shared bus are reported. They conclude that a small number of threads in an application can achieve near 100% processor utilization, but that large global traffic can limit the performance benefits of multithreading. Our study extends these results by suggesting the operating range for obtaining higher performance.
6 Conclusions
In this paper, we have proposed a simple analytical model for a multithreaded multiprocessor architecture, based on a closed queuing network with a finite thread population. The performance study based on this analytical model, which integrates the processor, memory and network subsystems, shows that:
• a strong coupling exists between these subsystems: variations in the parameters of one subsystem affect the utilizations of the other subsystems as well;
• for high performance, the partitioning of a program should result in thread runlengths close to the weighted sum of the memory latency and the network latency; this is necessary for high performance irrespective of the application parallelism n_t;
• a larger locality in the application program reduces the network traffic, resulting in higher performance.
References
[1] A. Agarwal. Performance tradeoffs in multithreaded processors. IEEE Transactions on Parallel and Distributed Systems, 2(4), September 1992.
[2] A. Agarwal, B.H. Lim, D. Kranz, and J. Kubiatowicz. APRIL: A processor architecture for multiprocessing. In Proc. of the 17th Int'l. Symp. on Computer Architecture, pages 104-114, 1990.
[3] L. Alkalaj and R.V. Bopanna. Performance of multithreaded execution in a shared-memory multiprocessor. In Proc. of 3rd Ann. IEEE Symp. on Parallel and Distributed Processing, pages 330-333, Dallas, USA, December 1991. IEEE.
[4] F. Baskett, K. Mani Chandy, R.R. Muntz, and F.G. Palacios. Open, closed, and mixed networks of queues with different classes of customers. Journal of the ACM, 22(2):248-260, April 1975.
[5] K. Johnson. The impact of communication locality on large-scale multiprocessor performance. In Proceedings of the 19th International Symposium on Computer Architecture, pages 392-402. ACM, May 1992.
[6] C.P. Kruskal and M. Snir. The performance of multistage interconnection networks. IEEE Transactions on Computers, C-32(12):1091-1098, December 1983.
[7] E.D. Lazowska, J. Zahorjan, G.S. Graham, and K.C. Sevcik. Quantitative System Performance: Computer System Analysis Using Queueing Network Models. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1984.
[8] T. Murata. Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4):541-580, April 1989.
[9] P. Parent and O. Tanir. Voltaire: a discrete event simulator. In Proceedings of the Fourth International Workshop on Petri Nets and Performance Models, Melbourne, Australia, December 1991.
[10] M. Reiser and S. Lavenberg. Mean value analysis of closed multichain queueing networks. Journal of the ACM, 27(2):313-322, April 1980.
[11] R.H. Saavedra-Barrera, D.E. Culler, and T. von Eicken. Analysis of multithreaded architectures for parallel computing. In Proc. of 2nd Ann. ACM Symp. on Parallel Algorithms and Architectures, Crete, Greece, July 1990. ACM.
[12] W.D. Weber and A. Gupta. Exploring the benefits of multiple contexts in a multiprocessor architecture: Preliminary results. In Proceedings of the 16th Annual International Symposium on Computer Architecture, pages 273-280. ACM, 1989.

More Related Content

What's hot

A New Approach to Improve the Efficiency of Distributed Scheduling in IEEE 80...
A New Approach to Improve the Efficiency of Distributed Scheduling in IEEE 80...A New Approach to Improve the Efficiency of Distributed Scheduling in IEEE 80...
A New Approach to Improve the Efficiency of Distributed Scheduling in IEEE 80...IDES Editor
 
An octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingAn octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingeSAT Journals
 
Communication synchronization in cluster based wireless sensor network a re...
Communication synchronization in cluster based wireless sensor network   a re...Communication synchronization in cluster based wireless sensor network   a re...
Communication synchronization in cluster based wireless sensor network a re...eSAT Journals
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentEricsson
 
AN EFFICIENT ROUTING PROTOCOL FOR DELAY TOLERANT NETWORKS (DTNs)
AN EFFICIENT ROUTING PROTOCOL FOR DELAY TOLERANT NETWORKS (DTNs)AN EFFICIENT ROUTING PROTOCOL FOR DELAY TOLERANT NETWORKS (DTNs)
AN EFFICIENT ROUTING PROTOCOL FOR DELAY TOLERANT NETWORKS (DTNs)cscpconf
 
Conference Paper: Towards High Performance Packet Processing for 5G
Conference Paper: Towards High Performance Packet Processing for 5GConference Paper: Towards High Performance Packet Processing for 5G
Conference Paper: Towards High Performance Packet Processing for 5GEricsson
 
Memory consistency models
Memory consistency modelsMemory consistency models
Memory consistency modelspalani kumar
 
system interconnect architectures in ACA
system interconnect architectures in ACAsystem interconnect architectures in ACA
system interconnect architectures in ACAPankaj Kumar Jain
 
Clustering effects on wireless mobile ad hoc networks performances
Clustering effects on wireless mobile ad hoc networks performancesClustering effects on wireless mobile ad hoc networks performances
Clustering effects on wireless mobile ad hoc networks performancesijcsit
 
On the Tree Construction of Multi hop Wireless Mesh Networks with Evolutionar...
On the Tree Construction of Multi hop Wireless Mesh Networks with Evolutionar...On the Tree Construction of Multi hop Wireless Mesh Networks with Evolutionar...
On the Tree Construction of Multi hop Wireless Mesh Networks with Evolutionar...CSCJournals
 
18068 system software suppor t for router fault tolerance(word 2 column)
18068 system software suppor t for router fault tolerance(word 2 column)18068 system software suppor t for router fault tolerance(word 2 column)
18068 system software suppor t for router fault tolerance(word 2 column)Ashenafi Workie
 
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORSSTUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORSijdpsjournal
 
Cache performance-x86-2009
Cache performance-x86-2009Cache performance-x86-2009
Cache performance-x86-2009Léia de Sousa
 
Implementation of Spanning Tree Protocol using ns-3
Implementation of Spanning Tree Protocol using ns-3Implementation of Spanning Tree Protocol using ns-3
Implementation of Spanning Tree Protocol using ns-3Naishil Shah
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)ijceronline
 

What's hot (19)

A New Approach to Improve the Efficiency of Distributed Scheduling in IEEE 80...
A New Approach to Improve the Efficiency of Distributed Scheduling in IEEE 80...A New Approach to Improve the Efficiency of Distributed Scheduling in IEEE 80...
A New Approach to Improve the Efficiency of Distributed Scheduling in IEEE 80...
 
An octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingAn octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passing
 
Communication synchronization in cluster based wireless sensor network a re...
Communication synchronization in cluster based wireless sensor network   a re...Communication synchronization in cluster based wireless sensor network   a re...
Communication synchronization in cluster based wireless sensor network a re...
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environment
 
AN EFFICIENT ROUTING PROTOCOL FOR DELAY TOLERANT NETWORKS (DTNs)
AN EFFICIENT ROUTING PROTOCOL FOR DELAY TOLERANT NETWORKS (DTNs)AN EFFICIENT ROUTING PROTOCOL FOR DELAY TOLERANT NETWORKS (DTNs)
AN EFFICIENT ROUTING PROTOCOL FOR DELAY TOLERANT NETWORKS (DTNs)
 
Conference Paper: Towards High Performance Packet Processing for 5G
Conference Paper: Towards High Performance Packet Processing for 5GConference Paper: Towards High Performance Packet Processing for 5G
Conference Paper: Towards High Performance Packet Processing for 5G
 
Memory consistency models
Memory consistency modelsMemory consistency models
Memory consistency models
 
system interconnect architectures in ACA
system interconnect architectures in ACAsystem interconnect architectures in ACA
system interconnect architectures in ACA
 
Clustering effects on wireless mobile ad hoc networks performances
Clustering effects on wireless mobile ad hoc networks performancesClustering effects on wireless mobile ad hoc networks performances
Clustering effects on wireless mobile ad hoc networks performances
 
Dos unit3
Dos unit3Dos unit3
Dos unit3
 
On the Tree Construction of Multi hop Wireless Mesh Networks with Evolutionar...
On the Tree Construction of Multi hop Wireless Mesh Networks with Evolutionar...On the Tree Construction of Multi hop Wireless Mesh Networks with Evolutionar...
On the Tree Construction of Multi hop Wireless Mesh Networks with Evolutionar...
 
18068 system software suppor t for router fault tolerance(word 2 column)
18068 system software suppor t for router fault tolerance(word 2 column)18068 system software suppor t for router fault tolerance(word 2 column)
18068 system software suppor t for router fault tolerance(word 2 column)
 
SoC-2012-pres-2
SoC-2012-pres-2SoC-2012-pres-2
SoC-2012-pres-2
 
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORSSTUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
 
Cache performance-x86-2009
Cache performance-x86-2009Cache performance-x86-2009
Cache performance-x86-2009
 
1
11
1
 
Implementation of Spanning Tree Protocol using ns-3
Implementation of Spanning Tree Protocol using ns-3Implementation of Spanning Tree Protocol using ns-3
Implementation of Spanning Tree Protocol using ns-3
 
C04511822
C04511822C04511822
C04511822
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 

Viewers also liked

Hyundai mobile
Hyundai mobileHyundai mobile
Hyundai mobileHannah Lee
 
Omnichannel Commerce: la reinvención del comercio
Omnichannel Commerce: la reinvención del comercioOmnichannel Commerce: la reinvención del comercio
Omnichannel Commerce: la reinvención del comercioFrancisco Egea Castejón
 
Priyanka_Resume_Oct102015
Priyanka_Resume_Oct102015Priyanka_Resume_Oct102015
Priyanka_Resume_Oct102015priyanka gadia
 
Certification of employement And Recommendation
Certification of employement And RecommendationCertification of employement And Recommendation
Certification of employement And RecommendationMAYSAM GAMINI
 
гуравдугаар сарын гео уулзалт(1)
гуравдугаар сарын гео уулзалт(1)гуравдугаар сарын гео уулзалт(1)
гуравдугаар сарын гео уулзалт(1)GeoMedeelel
 

Viewers also liked (6)

Hyundai mobile
Hyundai mobileHyundai mobile
Hyundai mobile
 
Medina Sidonia
Medina SidoniaMedina Sidonia
Medina Sidonia
 
Omnichannel Commerce: la reinvención del comercio
Omnichannel Commerce: la reinvención del comercioOmnichannel Commerce: la reinvención del comercio
Omnichannel Commerce: la reinvención del comercio
 
Priyanka_Resume_Oct102015
Priyanka_Resume_Oct102015Priyanka_Resume_Oct102015
Priyanka_Resume_Oct102015
 
Certification of employement And Recommendation
Certification of employement And RecommendationCertification of employement And Recommendation
Certification of employement And Recommendation
 
гуравдугаар сарын гео уулзалт(1)
гуравдугаар сарын гео уулзалт(1)гуравдугаар сарын гео уулзалт(1)
гуравдугаар сарын гео уулзалт(1)
 

Similar to shashank_spdp1993_00395543

Operating Systems R20 Unit 2.pptx
Operating Systems R20 Unit 2.pptxOperating Systems R20 Unit 2.pptx
Operating Systems R20 Unit 2.pptxPrudhvi668506
 
Enhanced transformer long short-term memory framework for datastream prediction
Enhanced transformer long short-term memory framework for datastream predictionEnhanced transformer long short-term memory framework for datastream prediction
Enhanced transformer long short-term memory framework for datastream predictionIJECEIAES
 
Distributed system lectures
Distributed system lecturesDistributed system lectures
Distributed system lecturesmarwaeng
 
QoS Framework for a Multi-stack based Heterogeneous Wireless Sensor Network
QoS Framework for a Multi-stack based Heterogeneous Wireless Sensor Network QoS Framework for a Multi-stack based Heterogeneous Wireless Sensor Network
QoS Framework for a Multi-stack based Heterogeneous Wireless Sensor Network IJECEIAES
 
DL for sentence classification project Write-up
DL for sentence classification project Write-upDL for sentence classification project Write-up
DL for sentence classification project Write-upHoàng Triều Trịnh
 
Compositional Analysis for the Multi-Resource Server
Compositional Analysis for the Multi-Resource ServerCompositional Analysis for the Multi-Resource Server
Compositional Analysis for the Multi-Resource ServerEricsson
 
week_2Lec02_CS422.pptx
week_2Lec02_CS422.pptxweek_2Lec02_CS422.pptx
week_2Lec02_CS422.pptxmivomi1
 
CS8603_Notes_003-1_edubuzz360.pdf
CS8603_Notes_003-1_edubuzz360.pdfCS8603_Notes_003-1_edubuzz360.pdf
CS8603_Notes_003-1_edubuzz360.pdfKishaKiddo
 
Limitations of memory system performance
Limitations of memory system performanceLimitations of memory system performance
Limitations of memory system performanceSyed Zaid Irshad
 
Traffic Engineering in Metro Ethernet
Traffic Engineering in Metro EthernetTraffic Engineering in Metro Ethernet
Traffic Engineering in Metro EthernetCSCJournals
 
Benefit based data caching in ad hoc networks (synopsis)
Benefit based data caching in ad hoc networks (synopsis)Benefit based data caching in ad hoc networks (synopsis)
Benefit based data caching in ad hoc networks (synopsis)Mumbai Academisc
 
Optimization of Remote Core Locking Synchronization in Multithreaded Programs...
Optimization of Remote Core Locking Synchronization in Multithreaded Programs...Optimization of Remote Core Locking Synchronization in Multithreaded Programs...
Optimization of Remote Core Locking Synchronization in Multithreaded Programs...ITIIIndustries
 
thread_ multiprocessor_ scheduling_a.ppt
thread_ multiprocessor_ scheduling_a.pptthread_ multiprocessor_ scheduling_a.ppt
thread_ multiprocessor_ scheduling_a.pptnaghamallella
 
Analysis of data transmission in wireless lan for 802.11
Analysis of data transmission in wireless lan for 802.11Analysis of data transmission in wireless lan for 802.11
Analysis of data transmission in wireless lan for 802.11eSAT Publishing House
 

Similar to shashank_spdp1993_00395543 (20)

Chap2 slides
Chap2 slidesChap2 slides
Chap2 slides
 
Compiler design
Compiler designCompiler design
Compiler design
 
Operating Systems R20 Unit 2.pptx
Operating Systems R20 Unit 2.pptxOperating Systems R20 Unit 2.pptx
Operating Systems R20 Unit 2.pptx
 
Enhanced transformer long short-term memory framework for datastream prediction
Enhanced transformer long short-term memory framework for datastream predictionEnhanced transformer long short-term memory framework for datastream prediction
Enhanced transformer long short-term memory framework for datastream prediction
 
Distributed system lectures
Distributed system lecturesDistributed system lectures
Distributed system lectures
 
QoS Framework for a Multi-stack based Heterogeneous Wireless Sensor Network
QoS Framework for a Multi-stack based Heterogeneous Wireless Sensor Network QoS Framework for a Multi-stack based Heterogeneous Wireless Sensor Network
QoS Framework for a Multi-stack based Heterogeneous Wireless Sensor Network
 
Os
OsOs
Os
 
ICICCE0298
ICICCE0298ICICCE0298
ICICCE0298
 
DL for sentence classification project Write-up
DL for sentence classification project Write-upDL for sentence classification project Write-up
DL for sentence classification project Write-up
 
Compositional Analysis for the Multi-Resource Server
Compositional Analysis for the Multi-Resource ServerCompositional Analysis for the Multi-Resource Server
Compositional Analysis for the Multi-Resource Server
 
week_2Lec02_CS422.pptx
week_2Lec02_CS422.pptxweek_2Lec02_CS422.pptx
week_2Lec02_CS422.pptx
 
CS8603_Notes_003-1_edubuzz360.pdf
CS8603_Notes_003-1_edubuzz360.pdfCS8603_Notes_003-1_edubuzz360.pdf
CS8603_Notes_003-1_edubuzz360.pdf
 
Limitations of memory system performance
Limitations of memory system performanceLimitations of memory system performance
Limitations of memory system performance
 
Traffic Engineering in Metro Ethernet
Traffic Engineering in Metro EthernetTraffic Engineering in Metro Ethernet
Traffic Engineering in Metro Ethernet
 
Benefit based data caching in ad hoc networks (synopsis)
Benefit based data caching in ad hoc networks (synopsis)Benefit based data caching in ad hoc networks (synopsis)
Benefit based data caching in ad hoc networks (synopsis)
 
Optimization of Remote Core Locking Synchronization in Multithreaded Programs...
Optimization of Remote Core Locking Synchronization in Multithreaded Programs...Optimization of Remote Core Locking Synchronization in Multithreaded Programs...
Optimization of Remote Core Locking Synchronization in Multithreaded Programs...
 
thread_ multiprocessor_ scheduling_a.ppt
thread_ multiprocessor_ scheduling_a.pptthread_ multiprocessor_ scheduling_a.ppt
thread_ multiprocessor_ scheduling_a.ppt
 
DESIGN AND IMPLEMENTATION OF ADVANCED MULTILEVEL PRIORITY PACKET SCHEDULING S...
DESIGN AND IMPLEMENTATION OF ADVANCED MULTILEVEL PRIORITY PACKET SCHEDULING S...DESIGN AND IMPLEMENTATION OF ADVANCED MULTILEVEL PRIORITY PACKET SCHEDULING S...
DESIGN AND IMPLEMENTATION OF ADVANCED MULTILEVEL PRIORITY PACKET SCHEDULING S...
 
Mq3624532158
Mq3624532158Mq3624532158
Mq3624532158
 
Analysis of data transmission in wireless lan for 802.11
Analysis of data transmission in wireless lan for 802.11Analysis of data transmission in wireless lan for 802.11
Analysis of data transmission in wireless lan for 802.11
 

shashank_spdp1993_00395543

  • 1. Analysis of Multithreaded Multiprocessors with Distributed Shared Memory* S.S. Nemawarkart, R. Govindarajant, G.R. Gao$, V.K. Agarwalt tDepartment of Electrical Engineering, $School of Computer Science McGill University, Montreal, H3A 2A7, Canada {shashank,govindr,gao,agarwal}@pike.ee.mcgill.ca Abstract In this paper we propose an analytical model, based on multi-chain closed queuing networks, to evaluate the performance of multithreaded multiprocessors. The queuing network is solved by using approximate Mean Value Analysis. Unlike earlier work which modeled in- dividual subsystems in isolation, our work models pro- cessor, memory and network subsystems an an inte- grated manner. Such an approach brings out a strong coupling between each pair of subsystems. For ex- ample, the processor and memory Utilizations respond identically to the variations in the network character- istics. Further, we observe that high performance on an application is achieved when the memory request rate of a processor equals the weighted sum of memory bandwidth and the average round trip distance of the remote memory across the network. 1 Introduction A multithreaded processor has the potential to tol- erate the effects of long memory latency, high network delays and unpredictable synchronization delays, as long as it has some available tasks to execute on. The processor executes on small sequences of instructions, called threads, interspersed with memory accesses and synchronizations. Multi-threaded architectures mask the long memory latency by suspending the execution of the current thread upon encountering the long la- tency operation and switching to another thread. This improves the processor utilization as the computation is overlapped with the memory access and synchro- nization delays. A context switch incurs an associated overhead of saving the state of the current thread and restoring that of the newly scheduled thread. The performance of a multithreaded architecture depends on a number of parameters related to the architecture-memory latency, context switch time, switch delay, and to the application- number of threads, thread runlength and remote memory access pattern'. In this paper, we are interested in an an- alytical study on the performance of a multithreaded multiprocessor with the processing elements connected across a two-dimensional mesh. The processors access a shared memory which is physically distributed across the processing elements. Previous studies on multithreaded architectures have focussed only on the processor utilization. Sim- ilarly, the studies on network characteristics concen- trate only on the network switch responses. These work [l,3, 111analyze a subsystem in isolation without considering the feedback effect of other subsystems. In contrast, we develop an integrated model of pro- cessor, memory and network subsystems, which helps in identifying the relationships among various param- eters for achieving high performance. The model is based on the multi-chain closed queuing network. A key intuitive observation for the model is that with the availability of an adequate number of enabled threads, we can anticipate that the system will behave as a closed queuing network provided that sufficiently large queues are maintained at each stage of the service sta- tion. In [6], Kruskal and Snir show that the difference between finite buffer and infinite buffer is not signifi- cant when the buffer size is greater than a few. 
Based on these observations, we can solve the closed queu- ing network. Owing to the amount of computations needed to study a multiprocessor system with many nodes [lo], we use an Approximate Mean Value Anal- ysis (AMVA). This analysis yields the performance measures such as processor and memory utilizations, which can be used to fine tune the program partition- ing for high performance. The performance results of our study establish that: (i) both the memory and network subsystems strongly *This work has been supported by MICRONET - Network Centers of Excellence 'Refer to Section 2.1 for a precise definition of some of the terms used here. 114 1063-637#93 $03.000 1993IEEE
  • 2. influence the processor utilization. A strong coupling exists between each pair of subsystems. Such behavior is not explicitly reported in earlier work. (ii)- when the rate of issue of memory requests in an application program is such that the resulting aver- age thread runlength equals the weighted sum of local memory latency and the network latencies, the proces- sor gets the response from the memory and the net- work before it runs out of workload. (iii)- the above matching of thread runlength with memory and network latencies also ensures a high uti- lization of all subsystems, and hence represents a suit- able operating range on an architecture. An equivalence between markov queuing models and the generalized stochastic petri nets has been shown in [8].So for validating these analytical solu- tions, we develop a stochastic timed petri net (STPN) model for a multithreaded architecture. In the following section we develop the analytical model. Section 3 describes the validation of the analytical re- sults using the STPN model. The results of this study are presented in Section 4. In Section 5, we compare our approach with related work. 2 Analytical Model for Multithreaded 2.1 A Multithreaded Architecture The rest of this paper organized as follows. Architecture In the multithreaded execution model, a program is a collection of partially-ordered threads. A thread consists of a sequence of instructions which are ex- ecuted sequentially as in the von Neumann model. The scheduling of individual threads is similar to a dataflow model. We enlist the terminology to be used in the rest of the paper: A thread undergoes followingstates during its lifetime: Suspended when it is waiting for its long-latency op- eration to be satisfied. Only one such operation per thread is allowed. Ready: when the long-latency operation for which a thread was waiting, is satisfied. Executing: when a ready thread is scheduled for exe- cution on processor pipeline. Each processor executes on a set of nt parallel threads. We assume that nt is a constant. Further, the threads interact only through the locations in either the lo- cal memory or the remote memory. A processor on which a group of nt threads executes, is called the host processor. These threads can not be scheduled for execution on any other processor. A thread is executed on the processor pipeline for the duration called runlength R, before getting suspended I Processor Subsystem - - - - - - Memory Subsystem I 1 . 1 I ' 4 ' 5 ' 6 ' 7 I I I II Figure 1: A Processing Element on a memory access. On suspension, the state of the outgoing thread is saved and the context of the newly scheduled thread is restored. This context switch time is C. A memory access is directed to a remote memory mod- ule with a probability premote.Local memory module services the remaining fraction (1- premote) of mem- ory accesses. At each switch node, a remote memory access suffers a delay of S time units. Memory latency L , is the access time of the local mem- ory without queuing delays. Observed memory latency Lobs is the response time experienced by a memory ac- cess, including the waiting time at the memory. The remote memory access pattern across the memory modules follows a geometric distribution. 
With geo- metric distribution, the probability of a remote mem- ory access requesting a memory module at a distance of h hops from the host processor, is p t w / u , where p,, is the probability of accessing the nearest mem- ory module across the network, and U is a normalizing constant. Higher value of p,, leads to higher locality of memory accesses. Processor utilization U,, is the fraction of time the processor is performing useful task in the execution pipeline. Similarly Memory utilization U,, is the fraction of time for which the memory port is busy, and Switch Utilization Unet,is the fraction of time for which the switch is busy. System Utilization Usys,is 115
  • 3. the average of the utilizations of all three components in a processing element. R, L, C and S can be measured as number of cycles, or in time units. Figure 1 shows a processing element consisting of a processor pipeline, memory and its switch inter- face. Multiple such processing elements are connected across a network of switches to form a multiprocessor system. A processor sends a remote memory access to the network through the switch. The access is for- warded along any of the available shortest-paths to the remote memory module. Upon service, the response is returned to the processor issuing this memory request. In this paper, we use a two-dimensional, bidirectional torus topology for the interconnection network (shown in Figure 2 ), similar to the one used in the ALEWIFE multiprocessor system [a]. Figure 2: 4 x 4 Multiprocessor with 2-dimensional mesh 2.2 Analytical Model and Assumptions We propose the use of a closed queuing network as shown in Figure 3 to model a multithreaded ar- chitecture described in Section 2.1. Such a queuing model has a product-form solution (refer to [4, 101for details) due to two reasons. One, there is sufficient buffer space in the thread pool, so that the processor is not blocked when a thread may wait at the thread pool for memory system to respond to its access. Two, there is no build up of active threads in the system, allowing the use of finite queues in the system without blocking the network switches. The queuing network model is composed of the pro- cessing elements with three types of nodes, namely : processor, memory and switch, a s shown in Figure 3. Processor node : The processor is a single server node. Ready threads are executed one at a time with the FCFS service dis- cipline, with exponential service time having a mean R. Further, all the processor nodes have same service time distribution. In our model, a thread is statically assigned to one of the processors and gets executed only in that processor. The threads executed on pro- cessor i are considered to be class i customers in the queuing network. A thread alternates between the ex- ecution on its host processor, and an access to memory module. Thus, the visit ratio' of a thread to its host processor is unity, and to other processors is zero. Memory Module : A memory module has a single server, with exponen- tially distributed access time, and the mean value is L time units. The visit ratio of a thread (belonging to a host i ) to a memory module j is emij. The value of emij depends on the distribution of remote memory requests across the memory modules. In this paper, we consider geometric distribution as discussed in Sec- tion 2.1. Switch Node : A switch node has a single server with an exponen- tial service time distribution with mean value S. The switch node interfaces a processor-memory block with the four3 neighboring switch nodes in a 2-dimensional torus network. The visit ratio esij ,from a processor i to a switch j, is the sum of : (i) twice the visit ratio emij to the memory module j, for class i, and (ii) the contribution due to the mem- ory modules situated at a distance greater than the number of hops from node i to node j . In summary, a processing element consisting of a processor, a memory, and a switch node, is connected to other processing elements through one or more switch nodes, as shown in Figure 3. Solving the above queuing network accurately is computationally intensive. 
Solving the above queuing network exactly is computationally intensive: the state space to be considered for computing the normalization constant G of the product-form solution [10] is enormous. For example, a two-processor system with 10 threads on each processor has 1,166,400 states. We therefore prefer the use of Approximate Mean Value Analysis (AMVA) [7]. For n_t threads on each of the P processors in the system, the AMVA evaluates: (i) the arrival rate λ_i of the threads belonging to each processor i; (ii) the waiting time w_{i,l} at each node l; and (iii) the queue length n_{i,l}; each for the population vectors N = (n_t, ..., n_t) and N − 1_i, i.e., with one customer fewer in the i-th class.

²The visit ratio for a class of threads at a node in a chain of a closed queuing network is the frequency with which a thread belonging to that class visits the node, with respect to a node (say the processor, selected arbitrarily for monitoring the system) in that chain.
³This number depends on the topology of the interconnection network.
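For concreteness, here is a minimal Schweitzer-style approximate MVA sketch for a closed multi-chain network of single-server FCFS stations. It is our own illustration of the technique, not the authors' implementation (their AMVA follows [7]); the array names e, s and n are ours.

```python
import numpy as np

def amva(e, s, n, tol=1e-8, max_iter=10_000):
    """Schweitzer approximate MVA for a closed multi-chain queuing network.
    e[c, k]: visit ratio of chain c at station k
    s[k]   : mean service time of station k (single-server, FCFS)
    n[c]   : thread population of chain c
    Returns chain throughputs lam[c] and total station utilizations U[k]."""
    e, s, n = np.asarray(e, float), np.asarray(s, float), np.asarray(n, float)
    C, K = e.shape
    q = np.zeros((C, K))
    for c in range(C):                       # spread each chain's threads evenly
        visited = e[c] > 0                   # over the stations it visits
        q[c, visited] = n[c] / visited.sum()
    for _ in range(max_iter):
        # queue seen by a chain-c arrival: everyone else's queue, plus
        # (n_c - 1)/n_c of its own chain's queue (Schweitzer's approximation)
        seen = q.sum(axis=0)[None, :] - q / n[:, None]
        w = s[None, :] * (1.0 + seen)        # waiting (residence) time per visit
        lam = n / (e * w).sum(axis=1)        # chain throughput, by Little's law
        q_new = lam[:, None] * e * w
        if np.abs(q_new - q).max() < tol:
            q = q_new
            break
        q = q_new
    U = (lam[:, None] * e * s[None, :]).sum(axis=0)
    return lam, U
```

With the model of Section 2.2, a 4 x 4 system has C = 16 chains and K = 48 stations (a processor, a memory and a switch per processing element); s[k] is R, L or S by station type, and grouping U[k] by type gives U_p, U_m and U_net.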
Figure 3: Queueing Network Model

Based on λ_i, w_{i,l}, the service times and the visit ratios, we can evaluate U_p, U_m, U_net and L_obs. We investigate the behavior of these performance measures in terms of the architectural and application parameters in Section 4. Section 3 presents the details of the simulation model, based on an STPN, used for the validation of the analytical results.

3 Validation of the Analytical Model

In this section we describe an STPN model for the multithreaded multiprocessor architecture which mimics the behavior of the queuing network model.

3.1 The STPN Model

Figure 4 shows the STPN model for a processing element containing a multithreaded processor, a memory module and the network interface. The processor subsystem is modeled by the transitions t0, t1 and t2, and the places p0, p1 and p2. The place p4 maintains a pool of ready threads. A thread executes for the duration R, at transition t1, before it encounters a long-latency memory access. Transition t2 models the context switch time C for saving and restoring the states of the outgoing and the newly scheduled threads, respectively. Further, t2 sends the memory access to a remote memory through p7 with probability p_remote, and to the local memory through p3. The memory access distribution is used to determine the destination of a remote access.

Figure 4: Petri Net Model for a Processing Element

The transitions t_recv, t_send and t8, and the places p7, p8 and p_port, model the network interface. Place p_port models the state of the network port. An incoming message from the network for a suspended thread in this processor is forwarded to p4, while a request to access the memory is forwarded to p3. The memory port, modeled by a token in p6, takes up a memory access from p3 for service at t3, with duration L. Transition t3 routes the response to p4 or p7 based on the processor originating the request. Transitions with non-zero delays are represented by rectangular boxes. R and L have exponentially distributed service times, and C has a fixed delay.

3.2 Validation

The above STPN model is simulated using the event-driven Petri net simulator Voltaire [9]. The performance results from the simulations, for various sets of input parameters, are compared with the analytical results. We report two representative cases with the following parameters: R = 10, C = 0, L = 10, S = 10, and p_remote = 0.1 or 0.5. The memory access pattern is geometric with p_sw = 0.5.

Figure 5 shows the processor utilization values obtained from the analysis and the Petri net simulations, represented as MODEL-Util and SIM-Util respectively. MODEL-Latency and SIM-Latency represent L_obs from the analysis and the simulations. First, we observe that U_p increases rapidly with n_t and then saturates. For example, at p_remote = 0.1, the values of U_p are 57%, 79% and 89% for n_t = 2, 5 and 10, respectively. At p_remote = 0.5, the corresponding values of U_p are lower. Secondly, L_obs increases linearly with n_t at p_remote = 0.1, since more threads wait at the memory module for service.
Figure 5: U_p and L_obs with respect to n_t

At p_remote = 0.5, L_obs is low and almost constant after n_t = 4, since a saturated network limits the memory access rate. In both cases, the analytical results match the simulations well throughout the range of the experiment. Thus the STPN simulations confirm the accuracy of our queuing analysis. Section 4 discusses the results based on the analysis of Section 2.2. As the analytical results match the simulations well, we report only the former.

4 Results

Our objective is to identify the relationships among the various application and architecture parameters so that we can achieve low execution time while maintaining high utilization of all subsystems. We use the analytical model developed above, with the base architecture parameters L = 10 and S = 10, on a 4 x 4 mesh of processing elements. The default values of the application parameters R, n_t and p_remote are 10, 8 and 0.1, unless stated otherwise. The default distribution of remote memory accesses is geometric with p_sw = 0.5. These values ensure saturated processor and memory performance if p_remote or S is zero.

4.1 Subsystem Utilizations

With p_remote = 0, the memory requests are restricted to the local memory module. An increase in p_remote increases the number of messages routed to remote memory modules across the network. This has a two-fold effect on performance: (i) since the latency of a remote access is higher than the local memory latency, due to the extra time spent traversing the network, the corresponding thread is suspended for a longer duration; (ii) a larger number of messages on the network leads to higher contention, or network congestion, which in turn increases the network latency. This reduces the utilization of the processor and memory subsystems. Figure 6 shows this effect of p_remote on the subsystem utilizations, for L = 10 and L = 20. At L = 10, an increase in p_remote from 0.2 to 0.8 reduces the values of U_p and U_m from nearly 90% to 23% and 22%, respectively. When U_net saturates, the fall in the values of U_p and U_m is steep. For L = 20, U_p and U_m decrease rapidly after the network saturates, in the same way. Also, a variation in p_remote affects U_p and U_m identically.

Figure 6: Subsystem Utilizations

Similar observations can be made when we consider the effect of the memory latency on the processor and network utilizations, or the effect of S on the processor and memory utilizations. If the memory latency is increased, the number of requests waiting at the memory increases, reducing the values of U_p and U_net. Similarly, an increase in S increases the network latency, so more threads at the processor remain suspended waiting for the corresponding memory responses to arrive. This in turn decreases the rate at which memory accesses are sent, so the values of U_p and U_m fall as S increases. Thus we observe a close coupling among the subsystems, based on our integrated model of the processor, memory and network subsystems.

4.2 System Utilization

Since a variation in any parameter of the system can affect the utilizations of all subsystems, we define the system utilization U_sys as the average of the utilizations of all subsystems. Having characterized the behavior of the subsystem utilizations in Section 4.1, we are interested in the ability of U_sys to track the transitions corresponding to the saturation of these subsystems.
Figure 7: The System Utilization with respect to L

Figure 7 plots the subsystem utilizations and U_sys with respect to the memory latency, for R = 10 and R = 20. When L is close to zero, the system utilization is low due to the low utilization of the memory. At values of L close to 100, the memory subsystem saturates but U_sys is low (the limiting value is 33%), due to low U_p and U_net. U_sys peaks when L = R = S = 10, since all subsystems are then close to their maximum utilizations. With L > 10, both U_p and U_net drop off sharply with L while only a small rise occurs in U_m, resulting in a low value of U_sys. The maximum value of U_sys is referred to as the peak system utilization (PSU); let the corresponding memory latency be L_PSU. From Figure 7 we observe that:

(i) U_sys reflects the relative values of U_p, U_m and U_net. When the parameters of the processor and memory subsystems are considered, the PSU occurs at L = R. We note that the PSU represents a transition phase in which one subsystem approaches saturation while the utilizations of the other subsystems drop. This is due to the balance of throughput between each pair of subsystems.

(ii) For R = 10 and L ≤ 10, at the PSU, U_p is only 5% below its maximum value while U_sys has improved by close to 25%. For R = 20 and L ≤ 20, these differences for U_p and U_sys are 7% and 30%. Thus, by keeping the operating range near the PSU, we gain considerably in overall system utilization at a small loss in processor utilization.

(iii) For any value of L less than L_PSU, U_p is high. Thus L_PSU represents the slowest memory with which we can operate without significantly hampering high system performance.

The bell-shaped plot of the system utilization also occurs with respect to changes in other parameters, such as R and S.

Effect of Network Parameters

Figure 8 shows the effect of p_remote on the system utilization for various values of S. We observe that: (i) the PSU lies between 70% and 80% for a wide range of S; and (ii) for faster switches, i.e. low S, U_net does not saturate until p_remote is high.

Figure 8: Effect of p_remote on U_sys for various S

Effect of Thread Runlength

Figure 9 plots the system utilization with respect to p_remote for various values of the thread runlength. Let the network latency T_avg be the average time taken by a message on the unloaded network to complete a round trip. For the geometric distribution of memory accesses with p_sw = 0.5, a remote memory access travels a distance d_avg = 1.733 hops on a 4 x 4 mesh. Thus a round trip takes 2 × 1.733 × 10 time units in the unloaded network. In addition, a delay of S (= 10) time units is incurred at the local switch on both the forward and the return path of the message. Hence T_avg (= 34.66 + 20 = 54.66) is given by:

T_avg = 2 × d_avg × S + 2 × S    (1)

In Figure 9, for R ≤ 10, the PSU increases with R from 67% to 79%. Also, the PSU almost always occurs at p_remote ≈ 0.18. Since R ≤ L, a thread spends less time at the processor than at the memory module, so the PSU is a result of the matching of throughput between the memory and network subsystems.
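Plugging the base parameters into Equation 1 gives a quick numeric check (a sketch; the variable names are ours) of the round-trip latency and of the balance point derived in the next paragraph:

```python
S, L, d_avg = 10, 10, 1.7333        # base parameters of Section 4
T_avg = 2 * d_avg * S + 2 * S       # Eq. 1: hops out and back, plus the
                                    # local switch in each direction
print(round(T_avg, 2))              # 54.67 (the text truncates to 54.66)
print(round(L / T_avg, 5))          # 0.18293: memory/network balance point
```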
Figure 9: Effect of p_remote on U_sys for various R

A memory module returns remote memory accesses to the network at the rate p_remote/L. At the PSU, the throughput of the incoming messages from the network (= 1/T_avg) equals the throughput of the responses from the memory module (= p_remote/L):

1/T_avg = p_remote/L    (2)

So p_remote is L/T_avg = 0.18293. For R ≥ 10, the processor and network subsystems govern the PSU value. A processor sends out memory requests at the rate 1/R. A fraction (= p_remote) of these is directed across the network to remote memory modules. The network delivers the messages to the processor at the rate 1/T_avg. As the throughputs should match at the PSU, p_remote should equal R/T_avg. Considering these two scenarios together, the maximum value of the PSU occurs when the throughputs of the three subsystems are equal. That is, the thread runlength, memory latency and network latency should be such that:

R = L = p_remote × T_avg    (3)

Equation 3 is a direct result of Equation 2 when we consider the throughput balance at the processor subsystem. In Figure 6, we observe that upon network saturation the values of U_p and U_m are close to R/(p_remote × T_avg) and L/(p_remote × T_avg), respectively. Similarly, for increasing L, when the memory subsystem reaches saturation, the values of U_p and U_net are proportional to R/L and (p_remote × T_avg)/L, respectively.

4.3 Locality of Memory Accesses

If the remote memory access pattern is a geometric distribution, an increase in p_sw increases d_avg for a message on the network, and hence the network latency. Figure 10 shows the effect of increasing p_sw on the system utilization, for various values of the thread runlength, with p_remote = 0.17. For low values of p_sw, the PSU occurs due to saturation of the processor and memory subsystems. The PSU increases from 65% to 78% when p_sw is increased from 0.1 to 0.7, due to an increase in the value of U_net. A further increase in p_sw to 0.9 brings the PSU down to 72%, due to lower values of U_p and U_m.

Figure 10: U_sys with Geometric Distribution (vs. thread runlength, for p_sw = 0.3 to 0.9)

4.4 Summary of Results

Our study suggests the following conditions for achieving high performance:

• Overall high utilization of all subsystems is achieved when (i) the thread runlength R equals the memory latency L; and (ii) the remote memory access rate (p_remote/R) equals the network service rate 1/T_avg.

• The above condition is necessary irrespective of how large n_t is.

• Applications with larger locality can tolerate slower networks without much degradation in performance, due to the reduced network traffic.

5 Related Work

A few analytical studies on multithreaded architectures have been reported in the literature [1, 3, 11]. In [3] and [11], stochastic timed Petri nets (STPNs) have been used for modeling. These analyses assume that the response time of the memory is constant; equivalently, that the parallelism in the application (i.e., n_t) has no impact on the throughput of the memory subsystem.
Further, [3] studies a bus-based multiprocessor without contention effects on the bus. In contrast, using an integrated model of a multiprocessor, we study a realistic system with queuing delays at the network and memory subsystems.

In [1], the analysis has been performed for a cache-based multiprocessor architecture. The analysis models a finite number of threads and their interference in the cache. It focuses on the performance of the processor, but the other subsystems (like memory and network) are not studied. On the other hand, our analysis does not model caches explicitly; the thread runlength R is related to the cache miss rate of an application. The two approaches are complementary. The analysis presented by Johnson [5] provides a framework combining simple models of application, processor and network behavior. The model assumes an unsaturated network, but does not consider the memory subsystem in detail. We develop a fairly simple integrated model of the system which can be adapted to different networks quickly, and which is applicable to both saturated and unsaturated subsystems.

In [12], performance results for trace-driven simulations of a multithreaded system with a shared bus are reported. They conclude that a small number of threads in an application can achieve near-100% processor utilization, but that heavy global traffic can limit the performance benefits of multithreading. Our study extends these results by suggesting the operating range for obtaining higher performance.

6 Conclusions

In this paper, we have proposed a simple analytical model for a multithreaded multiprocessor architecture, based on a closed queuing network with a finite thread population. The performance study based on this analytical model, integrating the processor, memory and network subsystems, shows that:

• a strong coupling exists between these subsystems: variations in the parameters of one subsystem affect the utilizations of the other subsystems as well;

• for high performance, the partitioning of a program should result in thread runlengths close to the weighted sum of the memory latency and the network latency; this is necessary irrespective of the application parallelism n_t;

• a larger locality in the application program reduces the network traffic, resulting in higher performance.

References

[1] A. Agarwal. Performance tradeoffs in multithreaded processors. IEEE Transactions on Parallel and Distributed Systems, 2(4), September 1992.
[2] A. Agarwal, B.H. Lim, D. Kranz, and J. Kubiatowicz. APRIL: A processor architecture for multiprocessing. In Proc. of the 17th Int'l. Symp. on Computer Architecture, pages 104-114, 1990.
[3] L. Alkalaj and R.V. Bopanna. Performance of multithreaded execution in a shared-memory multiprocessor. In Proc. of the 3rd Ann. IEEE Symp. on Parallel and Distributed Processing, pages 330-333, Dallas, USA, December 1991.
[4] F. Baskett, K. Mani Chandy, R.R. Muntz, and F.G. Palacios. Open, closed, and mixed networks of queues with different classes of customers. Journal of the ACM, 22(2):248-260, April 1975.
[5] K. Johnson. The impact of communication locality on large-scale multiprocessor performance. In Proceedings of the 19th International Symposium on Computer Architecture, pages 392-402, May 1992.
[6] C.P. Kruskal and M. Snir. The performance of multistage interconnection networks. IEEE Transactions on Computers, C-32(12):1091-1098, December 1983.
[7] E.D. Lazowska, J. Zahorjan, G.S. Graham, and K.C. Sevcik. Quantitative System Performance: Computer System Analysis Using Queueing Network Models. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1984.
[8] T. Murata. Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4):541-580, April 1989.
[9] P. Parent and O. Tanir. Voltaire: A discrete event simulator. In Proceedings of the Fourth International Workshop on Petri Nets and Performance Models, Melbourne, Australia, December 1991.
[10] M. Reiser and S. Lavenberg. Mean value analysis of closed multichain queueing networks. Journal of the ACM, 27(2):313-322, April 1980.
[11] R.H. Saavedra-Barrera, D.E. Culler, and T. von Eicken. Analysis of multithreaded architectures for parallel computing. In Proc. of the 2nd Ann. ACM Symp. on Parallel Algorithms and Architectures, Crete, Greece, July 1990.
[12] W.D. Weber and A. Gupta. Exploring the benefits of multiple contexts in a multiprocessor architecture: Preliminary results. In Proceedings of the 16th Annual International Symposium on Computer Architecture, pages 273-280, 1989.