2. SCALABILITY
• Almost all computers allow system capacity to be increased in
some form, for example by adding memory, I/O cards, or disks,
or by upgrading the processor(s), but the increase typically
runs into hard limits
• A scalable system attempts to avoid inherent design limits on
the extent to which resources can be added to the system
• Four aspects of scalability:
– How does the bandwidth or throughput of the system increase with additional
processors?
– How does the latency or time per operation increase?
– How does the cost of the system increase?
– How do we actually package the systems and put them together?
3. Bandwidth Scaling
• If a large number of processors are to exchange
information simultaneously with many other
processors or memories, a large number of
independent wires must connect them.
• Thus scalable machines must be organized in
the manner shown in figure (next slide) where
a large number of processor modules and
memory modules are connected together by
independent wires through a large number of
switches
4. [Figure: a scalable organization in which processor modules and memory modules are connected by independent wires through a large number of switches]
5. • A switch may be realized by a bus, a crossbar or even a collection
of multiplexers
• The number of outputs (or inputs) of the switch is called the
degree of the switch
• Switches are limited in scale but may be interconnected to form
large configurations, that is, networks
• Controllers are also available to determine which inputs are to be
connected to which outputs at each instant in time
• A network switch is a more general-purpose device, in which the
information presented at the input is enough for the switch
controller to determine the proper output without consulting all
the nodes
• Pairs of modules are connected by routes through network switches
6. • The most common structure for scalable
machines is illustrated by the generic
architecture shown in fig (next slide)
• Here one or more processors are packaged
together with one or more memory modules
and a communication assist as an easily
replicated unit, which is called a node
• The intranode switch is typically a high-
performance bus
9. • If the memory modules are on the opposite side
of the interconnect, as in fig (previous slide) the
network bandwidth requirement scales linearly
with the number of processors, even when no
communication occurs between processes
• Providing adequate bandwidth scaling may not
be enough for the computational performance to
scale perfectly since the access latency increases
with the number of processors
• By distributing the memories across the
processors, all processes can access local memory
with fixed latency, independent of the number of
processors; thus the computational performance
of the system can scale perfectly
10. The following assumptions are made to achieve scalable
bandwidth:
• It must be possible to have a very large number of
concurrent transactions using different wires
• They are initiated independently and without global
arbitration
• The effects of a transaction (such as changes of state)
are directly visible only by the nodes involved in the
transaction
• The effects may eventually become visible to other
nodes as they are propagated by additional
transactions
• Although it is possible to broadcast information to all
nodes, broadcast bandwidth (i.e., the rate at which
broadcasts can be performed) does not increase with
the number of nodes
11. Latency Scaling
The time to transfer n bytes between two nodes
is given by
T(n) = Overhead + Channel Time + Routing Delay
where Overhead is the processing time in
initiating or completing the transfer,
Channel Time is n/B (where B is the bandwidth
of the thinnest channel in the path), and
Routing Delay is a function f(H, n) of the number
of routing steps, or hops, H in the transfer and the
number of bytes transferred n
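The formula can be written directly as a small calculation. In this sketch the routing delay f(H, n) is passed in already evaluated, since its exact form depends on the network:

```python
def transfer_time(n_bytes, bandwidth, overhead, routing_delay):
    """T(n) = Overhead + Channel Time + Routing Delay.

    bandwidth is B for the thinnest channel, so channel time is n/B;
    routing_delay is the already-evaluated f(H, n) for the route taken.
    All times are in seconds, bandwidth in bytes per second.
    """
    return overhead + n_bytes / bandwidth + routing_delay

# Example with the numbers used in Prob 7.1 below:
# 128 bytes over a 64 MB/s link, 1 us overhead, 6 hops at 200 ns each
t = transfer_time(128, 64e6, 1e-6, 6 * 200e-9)   # 4.2 us
```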
12. Prob 7.1: Many classic networks are
constructed out of fixed-degree switches in a
configuration or topology, such that for n nodes
the distance from any network input to any
network output is log2n and the total number of
switches is α n log n for some small constant α.
Assume an overhead of 1 µs per message, a link
bandwidth of 64 MB/s, and a router delay of
200 ns per hop. How much does the time for
a 128-byte transfer increase as the machine is
scaled from 64 to 1,024 nodes?
Solution: At 64 nodes, six hops are required, so
T(128) = 1 µs + 128 B / (64 MB/s) + 6 × 200 ns
= 1 µs + 2 µs + 1.2 µs = 4.2 µs
13. This increases to 5 µs on a 1,024-node
configuration (ten hops: 1 µs + 2 µs + 10 × 200 ns).
Thus, the latency increases by
less than 20% with a 16-fold increase in
machine size. Even with this small transfer
size, a store-and-forward delay would add
2 µs (the time to buffer 128 bytes) to the
routing delay per hop. Thus the latency would be
1 µs + 2 µs + 6 × 2.2 µs = 16.2 µs at 64 nodes and
1 µs + 2 µs + 10 × 2.2 µs = 25 µs at 1,024 nodes
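A quick numeric check of the problem, including the store-and-forward variant, where each hop additionally buffers the whole packet:

```python
def latency(hops, n_bytes=128, bandwidth=64e6, overhead=1e-6,
            router_delay=200e-9, store_and_forward=False):
    """Transfer time for Prob 7.1's parameters (defaults from the problem).

    With store-and-forward switching, each hop also pays the time
    to buffer the full packet, n_bytes / bandwidth.
    """
    per_hop = router_delay + (n_bytes / bandwidth if store_and_forward else 0.0)
    return overhead + n_bytes / bandwidth + hops * per_hop

# Cut-through: 4.2 us at 64 nodes (6 hops), 5.0 us at 1,024 nodes (10 hops)
# Store-and-forward: 16.2 us and 25 us respectively
```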
14. Cost Scaling:
• Cost may be viewed as a fixed cost for the system
infrastructure plus an incremental cost of
adding processors and memory to the system:
Cost(p, m) = FixedCost + IncrementalCost(p, m)
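A sketch of this cost model with illustrative (made-up) dollar figures; the ratio cost(p)/cost(1), sometimes called the costup, grows much more slowly than p when the fixed infrastructure cost dominates:

```python
def system_cost(p, fixed=50_000.0, per_node=2_000.0):
    """Cost(p) = fixed infrastructure cost + p * incremental cost per node.

    The fixed and per-node dollar figures are hypothetical placeholders,
    not values from the text.
    """
    return fixed + p * per_node

costup = system_cost(64) / system_cost(1)   # far less than 64x
```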
15. Realizing Programming Models
• Here we examine what is required to
implement programming models on large
distributed-memory machines
• These machines have been most strongly
associated with message-passing
programming models
• Shared address space programming models
have become increasingly important and
well represented
16. • Recall the concept of a communication abstraction, which defines
the set of communication primitives provided to the user
• These can be realized directly in hardware, via system
software, or through some combination of the two, as shown
in fig below
17. • In large-scale parallel machines the
programming model is realized in a similar
manner, except that the primitive events are
transactions across the network, that is,
network transactions rather than bus
transactions
• A network transaction is a one-way transfer of
information from an output buffer at the
source to an input buffer at the destination
that causes some kind of action at the
destination, the occurrence of which is not
directly visible at the source, as shown in fig
(next slide)
18. [Figure: a network transaction, a one-way transfer of information from an output buffer at the source to an input buffer at the destination]
19. • Primitive Network Transactions
• Before a bus transaction starts, a protection
check has already been performed as part of the
virtual-to-physical address translation
• The format of information in a bus transaction
is determined by the physical wires of the bus,
i.e. the data lines, address lines and command
lines
• The information to be transferred onto the
bus is held in special output registers viz.,
address, command and data registers until it
can be driven onto the bus
20. • A bus transaction begins with arbitration for
the medium
• Most buses employ a global arbitration
scheme where a processor requesting a
transaction asserts a bus request line and
waits for the corresponding bus grant
• The destination of the transaction is implicit in
the address
• Each module on the bus is configured to
respond to a set of physical addresses
21. • All modules examine the address and one
responds to the transaction
• If none responds, the bus controller detects the
time-out and aborts the transaction
• Each module includes a set of input registers,
capable of buffering any request to which it might
respond
• Each bus transaction involves a request followed
by a response
• In the case of a read, the response is the data and
an associated completion signal
• For a write it is just the completion
acknowledgement
22. • In either case, both the source and destination
are informed of the completion of the
transaction
• In split-transaction buses, the response phase
of the transaction may require rearbitration
and may be performed in a different order
than the requests
• Care is required to avoid deadlock with split
transactions because a module on the bus
may be both requesting and servicing
transactions
23. • The module must continue servicing bus
requests and accept replies while it is
attempting to present its own request
• The bus design ensures that, for any
transaction that might be placed on the bus,
sufficient input buffering exists to accept the
transaction at the destination
• This can be accomplished by providing enough
resources or by adding a negative
acknowledgement signal (NACK)
24. Issues present in a network transaction
• Protection: As the number of components
grows, the coupling between components
becomes looser and the individual
components more complex, which limits
how much each component can trust the
others to operate correctly. In a scalable
system, individual components will often
perform checks on the network transaction so
that an errant program or faulty hardware
component cannot corrupt other components
of the system
25. Format: Most network links are narrow, so the
information associated with a transaction is
transferred as a serial stream. Typical links are a
few (1 to 16) bits wide. The format of the
transaction is dictated by how the information is
serialized onto the link. Thus there is a great deal
of flexibility in this aspect of design. The
information in a network transaction is an
envelope with more information inside. The
envelope includes information pertaining to the
physical network to get the packet from its
source to its destination port. Some networks
are designed to deliver only fixed-size packets;
others can deliver variable-size packets.
26. Output Buffering: The source must provide
storage to hold information that is to be
serialized onto the link, either in registers,
FIFOs or memory. Since network transactions
are one-way and can potentially be pipelined,
it may be desirable to provide a queue of
output registers. If the packet format is
variable up to some moderate size, a similar
approach may be adopted where each entry
in the output buffer is of variable size. If a
packet can be quite long, then typically the
output controller contains a buffer of
descriptors, pointing to the data in memory.
27. Media arbitration: There is no global arbitration
for access to the network and many network
transactions can be initiated simultaneously.
Initiation of the network transaction places an
implicit claim on resources in the
communication path from the source to the
destination as well as on resources at the
destination. These resources are potentially
shared with other transactions. Local
arbitration is performed at the source to
determine whether or not to initiate the
transaction. The resources are allocated
incrementally as the message moves forward.
28. Destination name and routing:
The source must be able to specify enough
information to cause the transaction to be
routed to the appropriate destination. There
are many variations in how routing is specified
and performed, but basically the source
performs a translation from some logical
name for the destination to some form of
physical address.
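As one concrete example of turning a logical destination name into a route, dimension-order routing on a 2D mesh (this particular scheme is an illustrative assumption; the text only says some translation is performed) maps a destination's (x, y) coordinates to a fixed hop sequence:

```python
def dimension_order_route(src, dst):
    """Route first along x, then along y, on a 2D mesh.

    src and dst are (x, y) node coordinates; returns the list of hops.
    """
    (sx, sy), (dx, dy) = src, dst
    hops = []
    hops += ["X+" if dx > sx else "X-"] * abs(dx - sx)
    hops += ["Y+" if dy > sy else "Y-"] * abs(dy - sy)
    return hops

route = dimension_order_route((0, 0), (2, 1))   # ["X+", "X+", "Y+"]
```

Because every source computes the same deterministic route for a given destination, no global arbitration is needed, matching the media-arbitration discussion above.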
29. • Input buffering: At the destination, the
information in the network transaction must
be transferred from the physical link into some
storage element. This may be simple registers
or a queue or it may be delivered directly into
memory. The input buffer is in some sense a
shared resource used by many remote
processors.
• Action: The action taken at the destination
may be very simple or complex. In either case,
it may involve initiating a response.
30. • Completion detection: The source has an
indication that the transaction has been delivered
into the network but usually no indication that it
has arrived at its destination. This completion
must be inferred from a response, an
acknowledgement or some additional
transaction.
• Transaction ordering: In a network the ordering
is quite weak. Some networks ensure that a
sequence of transactions from a given source to a
single destination will be seen in order at the
destination; others will not even provide this
assurance. In either case no node can perceive
the global order.
31. • Deadlock avoidance: Most modern networks are
deadlock free as long as the modules on the
network continue to accept transactions. Within
the network, this may require restrictions on
permissible routes or other special precautions.
• Delivery guarantees: A fundamental decision in
the design of a scalable network is the behavior
when the destination buffer is full. This is clearly
an issue on an end-to-end basis since it is
necessary for the source to know whether the
destination input buffer is available when it is
attempting to initiate a transaction. It is also an
issue on a link-by-link basis within the network
itself.
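One common end-to-end policy when the destination input buffer may be full is NACK-and-retry; a minimal sketch (the specific policy here is an assumption, the text only names the design issue):

```python
def try_deliver(input_buffer, capacity, packet):
    """Accept the packet if the destination input buffer has room;
    otherwise return a negative acknowledgement so the source retries."""
    if len(input_buffer) < capacity:
        input_buffer.append(packet)
        return "ACK"
    return "NACK"

buf = []
acks = [try_deliver(buf, 2, p) for p in ("a", "b", "c")]
# acks == ["ACK", "ACK", "NACK"]; the source must resend "c" later
```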
33. • Realizing the shared address space
communication abstraction requires a two-
way request-response protocol, as shown in
fig (previous slide)
• A global address is decomposed into a module
number and a local address.
• For a read operation, a request is sent to the
designated module requesting a load of the
desired address and specifying enough
information to allow the result to be returned
to the requestor through a response network
transaction.
34. • A write is similar, except that the data is
conveyed with the address and command to
the designated module and the response is
merely an acknowledgement to the requestor
that the write has been performed. The
response informs the source that the request
has been received or serviced, depending on
whether it is generated before or after the
remote action.
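The read and write request-response transactions can be mimicked in a toy simulation: the global address splits into a module number and a local address, reads return data, and writes return only an acknowledgement (field names and the 16-bit local-address width are illustrative assumptions):

```python
BITS = 16                                   # assumed local-address width

def decompose(global_addr):
    """Split a global address into (module number, local address)."""
    return global_addr >> BITS, global_addr & ((1 << BITS) - 1)

def handle(memories, req):
    """Destination side: service a read or write request transaction
    and build the response transaction sent back to the requestor."""
    module, local = decompose(req["addr"])
    if req["op"] == "read":
        return {"to": req["src"], "data": memories[module].get(local, 0)}
    memories[module][local] = req["data"]   # perform the write
    return {"to": req["src"], "ack": True}  # merely acknowledge

mems = {0: {}, 1: {}}
handle(mems, {"op": "write", "src": 7, "addr": (1 << BITS) | 4, "data": 42})
resp = handle(mems, {"op": "read", "src": 7, "addr": (1 << BITS) | 4})
# resp == {"to": 7, "data": 42}
```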
36. • A send/receive pair in the message-passing
model is conceptually a one-way transfer
from a source area specified by the source
user process to a destination area specified by
the destination user process.
• In addition, it embodies a pairwise
synchronization event between the two
processes.
• The Message Passing Interface (MPI) distinguishes
the notion of when a call to a send or receive
function returns from when a message
operation completes.
37. • A synchronous send completes once the
matching receive has executed, the source
data buffer can be reused and the data is
ensured of arriving in the destination receive
buffer.
• A buffered send completes as soon as the
source data buffer can be reused,
independent of whether the matching receive
has been issued; the data may have been
transmitted or it may be buffered somewhere
in the system.
38. • Buffered send completion is asynchronous
with respect to the receiver process
• A receive completes when the message data is
present in the receive destination buffer.
• A blocking function, send or receive, returns
only after the message operation completes
• A non-blocking function returns immediately,
regardless of message completion and
additional calls to a probe function are used to
detect completion
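The completion rules above can be mimicked with a toy channel: a synchronous send returns only after the matching receive has executed, while a buffered send copies the data and returns at once. This is a sketch of the semantics under simplifying assumptions (one sender, one receiver), not an MPI implementation:

```python
import threading
import queue

class Channel:
    def __init__(self):
        self.buf = queue.Queue()            # models system/destination buffering
        self.received = threading.Event()   # pairwise synchronization point

    def synchronous_send(self, data):
        self.buf.put(data)
        self.received.wait()                # completes once the receive executes

    def buffered_send(self, data):
        self.buf.put(list(data))            # copy out of the source buffer,
                                            # then return immediately

    def receive(self):
        data = self.buf.get()               # blocks until a message is present
        self.received.set()                 # lets a synchronous sender complete
        return data

ch = Channel()
ch.buffered_send([1, 2, 3])                 # returns with no receive posted
data = ch.receive()                         # data == [1, 2, 3]
```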
39. • The protocols are concerned only with
message operation and completion, regardless
of whether the functions are blocking