Parallel Architectures

MICHAEL J. FLYNN AND KEVIN W. RUDD
Stanford University; ⟨flynn@Umunhum.Stanford.edu⟩, ⟨kevin@Umunhum.Stanford.edu⟩

Copyright © 1996, CRC Press. ACM Computing Surveys, Vol. 28, No. 1, March 1996.



PARALLEL ARCHITECTURES

Parallel or concurrent operation has many different forms within a computer system. Using a model based on the different streams used in the computation process, we represent some of the different kinds of parallelism available. A stream is a sequence of objects such as data, or of actions such as instructions. Each stream is independent of all other streams, and each element of a stream can consist of one or more objects or actions. We thus have four combinations that describe most familiar parallel architectures:

(1) SISD: single instruction, single data stream. This is the traditional uniprocessor [Figure 1(a)].

(2) SIMD: single instruction, multiple data stream. This includes vector processors as well as massively parallel processors [Figure 1(b)].

(3) MISD: multiple instruction, single data stream. These are typically systolic arrays [Figure 1(c)].

(4) MIMD: multiple instruction, multiple data stream. This includes traditional multiprocessors as well as the newer networks of workstations [Figure 1(d)].

Each of these combinations characterizes a class of architectures and a corresponding type of parallelism.

SISD

The SISD class of processor architecture is the most familiar class and has the least obvious concurrency of any of the models, yet a good deal of concurrency can be present. Pipelining is a straightforward approach that is based on concurrently performing different phases of processing an instruction. This does not achieve concurrency of execution (with multiple actions being taken on objects) but does achieve a concurrency of processing, an improvement in efficiency upon which almost all processors depend today.

Techniques that exploit concurrency of execution, often called instruction-level parallelism (ILP), are also common. Two architectures that exploit ILP are superscalar and VLIW (very long instruction word). These techniques schedule different operations to execute concurrently based on analyzing the dependencies between the operations within the instruction stream: dynamically at run time in a superscalar processor, and statically at compile time in a VLIW processor. Both ILP approaches trade off adaptability against complexity; the superscalar processor is adaptable but complex, whereas the VLIW processor is not adaptable but simple. Both superscalar and VLIW processors use the same compiler techniques to achieve high performance.

The current trend for SISD processors is towards superscalar designs, in order to exploit available ILP as well as existing object code. In the marketplace there are few VLIW designs, due to code compatibility issues, although advances in compiler technology may cause this to change. However, research in all aspects of ILP is fundamental to the development of improved architectures in all classes because of the frequent use of SISD architectures as the processor elements in most implementations.
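
To make the dependence analysis concrete, here is a minimal sketch in C (our illustration, not from the original article). Operations 1 and 2 are independent, so an ILP machine can issue them in the same cycle; operation 3 has a read-after-write dependence on both and must wait:

    #include <stdio.h>

    int main(void) {
        double a = 1.0, b = 2.0, c = 3.0, d = 4.0;
        double x = a * b;  /* op 1: no dependence on op 2            */
        double y = c + d;  /* op 2: no dependence on op 1            */
        double z = x + y;  /* op 3: reads x and y (read-after-write),
                              so it must issue after ops 1 and 2     */
        printf("z = %f\n", z);
        return 0;
    }

A superscalar processor would discover this schedule dynamically in hardware as the instructions arrive; a VLIW compiler would instead encode ops 1 and 2 into one long instruction word at compile time.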








                                     Figure 1. The stream model.



SIMD

The SIMD class of processor architecture includes both array and vector processors. This class is a natural response to the use of certain regular data structures such as vectors and matrices. Two different architectures, array processors and vector processors, have been developed to address these structures.

An array processor has many processor elements operating in parallel on many data elements. A vector processor has a single processor element that operates in sequence on many data elements. Both types of processors use a single operation to perform many actions. An array processor depends on the massive size of the data sets to achieve its efficiency (and thus is often referred to as a massively parallel processor), with a typical array processor consisting of hundreds to tens of thousands of relatively simple processors operating together. A vector processor depends on the same regularity of action as an array processor but on smaller data sets, and it relies on extreme pipelining and high clock rates to reduce the overall latency of the operation.

There have not been a significant number of array architectures developed, due to a limited application base and market demand. There has been a trend towards more and more complex processor elements due to increases in chip density, and recent array architectures blur the distinction between SIMD and MIMD configurations. On the other hand, many different kinds of vector processors have evolved dramatically through the years.

Starting with simple memory-based vector processors, modern vector processors have developed into high-performance multiprocessors capable of addressing both SIMD and MIMD parallelism.
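
As a sketch (ours, with a made-up array size), the kind of regular computation both architectures target is an element-wise loop such as the following. An array processor would assign one iteration to each processor element and perform them all at once; a vector processor would issue a single vector add that streams the elements through a deep pipeline:

    #include <stdio.h>
    #define N 8  /* illustrative size; real machines handle far more */

    int main(void) {
        double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) {  /* set up sample operands */
            a[i] = (double)i;
            b[i] = 2.0 * (double)i;
        }
        /* One logical operation applied across many data elements:
           c = a + b, element-wise. */
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];
        printf("c[%d] = %f\n", N - 1, c[N - 1]);
        return 0;
    }
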
MISD

Although it is easy to both envision and design MISD processors, there has been little interest in this type of parallel architecture. The reason, so far anyway, is that no ready programming constructs easily map programs into the MISD organization.

Abstractly, the MISD is a pipeline of multiple independently executing functional units operating on a single stream of data, forwarding results from one functional unit to the next. On the microarchitecture level, this is exactly what the vector processor does. However, in the vector pipeline the operations are simply fragments of an assembly-level operation, as distinct from being complete operations in themselves. Surprisingly, some of the earliest attempts at computers in the 1940s could be seen as embodying the MISD concept. They used plug boards for programs, where data in the form of a punched card was introduced into the first stage of a multistage processor. A sequential series of actions was taken in which the intermediate results were forwarded from stage to stage until, at the final stage, a result was punched into a new card.
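
The plug-board organization can be sketched as a chain of distinct functions applied to each datum in a single stream, each stage forwarding its result to the next. The following C sketch is a hypothetical three-stage pipeline of our own devising (it runs the stages sequentially; a real MISD machine would run them concurrently, each stage working on a different datum in flight):

    #include <stdio.h>

    /* Three independent "functional units", each a different operation. */
    static double stage1(double x) { return x * 2.0; }  /* scale  */
    static double stage2(double x) { return x + 1.0; }  /* bias   */
    static double stage3(double x) { return x * x;   }  /* square */

    int main(void) {
        double stream[] = {1.0, 2.0, 3.0, 4.0};  /* one data stream */
        int n = (int)(sizeof stream / sizeof stream[0]);
        for (int i = 0; i < n; i++) {
            /* Each datum passes through every stage in order, the
               result of one stage forwarded to the next. */
            double r = stage3(stage2(stage1(stream[i])));
            printf("result[%d] = %f\n", i, r);
        }
        return 0;
    }
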
MIMD

The MIMD class of parallel architecture is the most familiar and possibly the most basic form of parallel processor: it consists of multiple interconnected processor elements. Unlike the SIMD processor, each processor element executes completely independently (although typically running the same program). Although there is no requirement that all processor elements be identical, most MIMD configurations are homogeneous, with all processor elements identical.

When communications between processor elements are performed through a shared memory address space (either global or distributed between processor elements, the latter called distributed shared memory to distinguish it from distributed memory), two significant problems arise. The first is maintaining memory consistency: the programmer-visible ordering effects of memory references, both within a processor element and between different processor elements. The second is maintaining cache coherency: the programmer-invisible mechanism that ensures all processor elements see the same value for a given memory location. The memory consistency problem is usually solved through a combination of hardware and software techniques; the cache coherency problem is usually solved exclusively through hardware techniques.

There have been many configurations of MIMD processors, ranging from the traditional multiprocessor described in this section to loosely coupled processors based on networking commodity workstations through a local area network. These configurations differ primarily in the interconnection network between processor elements, which ranges from on-chip arbitration between multiple processor elements on one chip to wide-area networks spanning continents; the tradeoffs are between the latency of communications and the size limitations on the system.
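
To see why the shared-memory problems matter, here is a small sketch using POSIX threads (our example; the article does not prescribe any particular API). Without the mutex, the two "processor elements" could each read a stale value of the shared counter and lose updates; the lock imposes an ordering that keeps the shared location consistent:

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;  /* shared memory location */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* enforce an ordering */
            counter++;                    /* read-modify-write   */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;  /* two independently executing elements */
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);  /* 200000 with the lock */
        return 0;
    }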



LOOKING FORWARD

We are celebrating the first 50 years of electronic digital computers; the past, as it were, is history, and it is instructive to change our perspective and look forward, considering not what has been done but what must be done. Just as in the past, there will be larger, faster, more complex computers with more memory, more storage, and more complications. We cannot expect that processors will be limited to the “simple” uniprocessors, multiprocessors, array processors, and vector processors we have today. We cannot expect that programming environments will be limited to the simple imperative programming languages and tools that we have today.

As before, we can expect that memory cost (on a per-bit basis) will continue its decline, so that systems will contain larger and larger memory spaces. We are already seeing this effect in the latest processor designs, which have doubled the “standard” address size, yielding an increase from 4,294,967,296 addresses (with 32 bits) to 18,446,744,073,709,551,616 addresses (with 64 bits). We can expect that interconnection networks will continue to grow in both scale and performance. The growth of the Internet in the last few years has been phenomenal, and the increasing use of optics in the interconnection network has made this growth at least feasible.
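
As a quick check of the address counts quoted above (simple arithmetic, not from the article), both figures are exact powers of two:

\[
2^{32} = 4{,}294{,}967{,}296 \approx 4.3 \times 10^{9},
\qquad
2^{64} = 18{,}446{,}744{,}073{,}709{,}551{,}616 \approx 1.8 \times 10^{19}.
\]

Doubling the address width thus squares the number of addressable locations, since \(2^{64} = (2^{32})^{2}\).
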
However, we cannot expect that the ease of programming these improved configurations will advance: as the available parallelism of computer systems increases, exploiting this parallelism becomes the limiting factor. There are two aspects to this problem: finding large degrees of parallelism (typically an algorithmic or partitioning problem) and efficiently managing the available parallelism to achieve high performance (typically a scheduling or placement problem). Of course, all solutions to these problems must ensure that correctness is satisfied; it does not matter how fast the program runs if it does not produce the correct result. Solving these problems will require many developments and changes, few of which are foreseeable.

Although not satisfying, we can certainly say that programming paradigms, compiler techniques, algorithm designs, and operating systems are all fair game, but these are likely only pieces of the puzzle. Indeed, broad new approaches to the representation of physical problems may be required. The good news from all this is that there is no dearth of work to be done in this area. Although improvements can certainly be made to a single processor element, the performance benefits of these improvements are complementary and at this point are nowhere near the scale of performance available through exploiting parallelism. Clearly, providing parallelism of order n is much easier than increasing the execution rate (for example, the clock speed) by a factor of n.

The continued drive for higher- and higher-performance systems thus leads us to one simple conclusion: the future is parallel. The first electronic computers provided a speedup of 10,000 compared to the mechanical computers of 50 years ago. The challenge for the future is to realize parallel processors that provide a similar speedup over a broad range of applications. There is much work to be done here... let us be on with it!

REFERENCES

There are thousands of references dealing with the many aspects of parallel architectures. These references comprise only a very small subset of accessible publications, but they provide the interested reader with jumping-off points for further exploration.

FLYNN, M. J. 1995. Computer Architecture: Pipelined and Parallel Processor Design. Jones and Bartlett, Boston.
HOCKNEY, R. W. AND JESSHOPE, C. R. 1981. Parallel Computers: Architecture, Programming and Algorithms. Adam Hilger, Bristol.
HOCKNEY, R. W. AND JESSHOPE, C. R. 1988. Parallel Computers 2: Architecture, Programming and Algorithms, 2nd ed. Adam Hilger, Bristol.
HWANG, K. 1993. Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, New York.
IBBETT, R. N. AND TOPHAM, N. P. 1989a. Architecture of High Performance Computers, Vol. I: Uniprocessors and Vector Processors. Springer-Verlag, New York.
IBBETT, R. N. AND TOPHAM, N. P. 1989b. Architecture of High Performance Computers, Vol. II: Array Processors and Multiprocessor Systems. Springer-Verlag, New York.
KUHN, R. H. AND PADUA, D. A., EDS. 1981. Tutorial on Parallel Processing. IEEE Computer Society Press, Los Alamitos, CA.
WOLFE, M. J. 1996. High Performance Compilers for Parallel Computing. Addison-Wesley, Reading, MA.



