Parallel Architectures

MICHAEL J. FLYNN AND KEVIN W. RUDD
Stanford University; ⟨flynn@Umunhum.Stanford.edu⟩, ⟨kevin@Umunhum.Stanford.edu⟩

Copyright © 1996, CRC Press. ACM Computing Surveys, Vol. 28, No. 1, March 1996.



PARALLEL ARCHITECTURES

Parallel or concurrent operation has many different forms within a computer system. Using a model based on the different streams used in the computation process, we represent some of the different kinds of parallelism available. A stream is a sequence of objects such as data, or of actions such as instructions. Each stream is independent of all other streams, and each element of a stream can consist of one or more objects or actions. We thus have four combinations that describe most familiar parallel architectures:

(1) SISD: single instruction, single data stream. This is the traditional uniprocessor [Figure 1(a)].

(2) SIMD: single instruction, multiple data stream. This includes vector processors as well as massively parallel processors [Figure 1(b)].

(3) MISD: multiple instruction, single data stream. These are typically systolic arrays [Figure 1(c)].

(4) MIMD: multiple instruction, multiple data stream. This includes traditional multiprocessors as well as the newer networks of workstations [Figure 1(d)].

Each of these combinations characterizes a class of architectures and a corresponding type of parallelism.

SISD

The SISD class of processor architecture is the most familiar class and has the least obvious concurrency of any of the models, yet a good deal of concurrency can be present. Pipelining is a straightforward approach that is based on concurrently performing different phases of processing an instruction. This does not achieve concurrency of execution (with multiple actions being taken on objects) but does achieve a concurrency of processing, an improvement in efficiency upon which almost all processors depend today.

Techniques that exploit concurrency of execution, often called instruction-level parallelism (ILP), are also common. Two architectures that exploit ILP are superscalar and VLIW (very long instruction word). These techniques schedule different operations to execute concurrently based on analyzing the dependencies between the operations within the instruction stream: dynamically at run time in a superscalar processor, and statically at compile time in a VLIW processor. Both ILP approaches trade off adaptability against complexity; the superscalar processor is adaptable but complex, whereas the VLIW processor is not adaptable but simple. Both superscalar and VLIW processors use the same compiler techniques to achieve high performance.

The current trend for SISD processors is towards superscalar designs, in order to exploit available ILP as well as existing object code. In the marketplace there are few VLIW designs, due to code compatibility issues, although advances in compiler technology may cause this to change. However, research in all aspects of ILP is fundamental to the development of improved architectures in all classes because of the frequent use of SISD architectures as the processor elements in most implementations.
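
To make the dependence analysis concrete, here is a minimal sketch in C (our illustration, not from the original article). Operations 1 and 2 are independent, so an ILP machine can issue them in the same cycle; operation 3 has a read-after-write dependence on both and must wait:

    #include <stdio.h>

    int main(void) {
        double a = 1.0, b = 2.0, c = 3.0, d = 4.0;
        double x = a * b;  /* op 1: no dependence on op 2            */
        double y = c + d;  /* op 2: no dependence on op 1            */
        double z = x + y;  /* op 3: reads x and y (read-after-write),
                              so it must issue after ops 1 and 2     */
        printf("z = %f\n", z);
        return 0;
    }

A superscalar processor would discover this schedule dynamically in hardware as the instructions arrive; a VLIW compiler would instead encode ops 1 and 2 into one long instruction word at compile time.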








                                     Figure 1. The stream model.



SIMD

The SIMD class of processor architecture includes both array and vector processors. This class is a natural response to the use of certain regular data structures such as vectors and matrices. Two different architectures, array processors and vector processors, have been developed to address these structures.

An array processor has many processor elements operating in parallel on many data elements. A vector processor has a single processor element that operates in sequence on many data elements. Both types of processors use a single operation to perform many actions. An array processor depends on the massive size of the data sets to achieve its efficiency (and thus is often referred to as a massively parallel processor), with a typical array processor consisting of hundreds to tens of thousands of relatively simple processors operating together. A vector processor depends on the same regularity of action as an array processor but on smaller data sets, and it relies on extreme pipelining and high clock rates to reduce the overall latency of the operation.

There have not been a significant number of array architectures developed, due to a limited application base and market demand. There has been a trend towards more and more complex processor elements due to increases in chip density, and recent array architectures blur the distinction between SIMD and MIMD configurations. On the other hand, many different kinds of vector processors have evolved dramatically through the years.

Starting with simple memory-based vector processors, modern vector processors have developed into high-performance multiprocessors capable of addressing both SIMD and MIMD parallelism.
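
As a sketch (ours, with a made-up array size), the kind of regular computation both architectures target is an element-wise loop such as the following. An array processor would assign one iteration to each processor element and perform them all at once; a vector processor would issue a single vector add that streams the elements through a deep pipeline:

    #include <stdio.h>
    #define N 8  /* illustrative size; real machines handle far more */

    int main(void) {
        double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) {  /* set up sample operands */
            a[i] = (double)i;
            b[i] = 2.0 * (double)i;
        }
        /* One logical operation applied across many data elements:
           c = a + b, element-wise. */
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];
        printf("c[%d] = %f\n", N - 1, c[N - 1]);
        return 0;
    }
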
MISD

Although it is easy to both envision and design MISD processors, there has been little interest in this type of parallel architecture. The reason, so far anyway, is that no ready programming constructs easily map programs into the MISD organization.

Abstractly, the MISD is a pipeline of multiple independently executing functional units operating on a single stream of data, forwarding results from one functional unit to the next. On the microarchitecture level, this is exactly what the vector processor does. However, in the vector pipeline the operations are simply fragments of an assembly-level operation, as distinct from being complete operations in themselves. Surprisingly, some of the earliest attempts at computers in the 1940s could be seen as embodying the MISD concept. They used plug boards for programs, where data in the form of a punched card was introduced into the first stage of a multistage processor. A sequential series of actions was taken in which the intermediate results were forwarded from stage to stage until, at the final stage, a result was punched into a new card.
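
The plug-board organization can be sketched as a chain of distinct functions applied to each datum in a single stream, each stage forwarding its result to the next. The following C sketch is a hypothetical three-stage pipeline of our own devising (it runs the stages sequentially; a real MISD machine would run them concurrently, each stage working on a different datum in flight):

    #include <stdio.h>

    /* Three independent "functional units", each a different operation. */
    static double stage1(double x) { return x * 2.0; }  /* scale  */
    static double stage2(double x) { return x + 1.0; }  /* bias   */
    static double stage3(double x) { return x * x;   }  /* square */

    int main(void) {
        double stream[] = {1.0, 2.0, 3.0, 4.0};  /* one data stream */
        int n = (int)(sizeof stream / sizeof stream[0]);
        for (int i = 0; i < n; i++) {
            /* Each datum passes through every stage in order, the
               result of one stage forwarded to the next. */
            double r = stage3(stage2(stage1(stream[i])));
            printf("result[%d] = %f\n", i, r);
        }
        return 0;
    }
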
MIMD

The MIMD class of parallel architecture is the most familiar and possibly the most basic form of parallel processor: it consists of multiple interconnected processor elements. Unlike the SIMD processor, each processor element executes completely independently (although typically running the same program). Although there is no requirement that all processor elements be identical, most MIMD configurations are homogeneous, with all processor elements identical.

When communications between processor elements are performed through a shared memory address space (either global or distributed between processor elements, the latter called distributed shared memory to distinguish it from distributed memory), two significant problems arise. The first is maintaining memory consistency: the programmer-visible ordering effects of memory references, both within a processor element and between different processor elements. The second is maintaining cache coherency: the programmer-invisible mechanism that ensures all processor elements see the same value for a given memory location. The memory consistency problem is usually solved through a combination of hardware and software techniques; the cache coherency problem is usually solved exclusively through hardware techniques.

There have been many configurations of MIMD processors, ranging from the traditional multiprocessor described in this section to loosely coupled processors based on networking commodity workstations through a local area network. These configurations differ primarily in the interconnection network between processor elements, which ranges from on-chip arbitration between multiple processor elements on one chip to wide-area networks spanning continents; the tradeoffs are between the latency of communications and the size limitations on the system.
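
To see why the shared-memory problems matter, here is a small sketch using POSIX threads (our example; the article does not prescribe any particular API). Without the mutex, the two "processor elements" could each read a stale value of the shared counter and lose updates; the lock imposes an ordering that keeps the shared location consistent:

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;  /* shared memory location */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* enforce an ordering */
            counter++;                    /* read-modify-write   */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;  /* two independently executing elements */
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);  /* 200000 with the lock */
        return 0;
    }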



LOOKING FORWARD

We are celebrating the first 50 years of electronic digital computers; the past, as it were, is history, and it is instructive to change our perspective and look forward, considering not what has been done but what must be done. Just as in the past, there will be larger, faster, more complex computers with more memory, more storage, and more complications. We cannot expect that processors will be limited to the “simple” uniprocessors, multiprocessors, array processors, and vector processors we have today. We cannot expect that programming environments will be limited to the simple imperative programming languages and tools that we have today.

As before, we can expect that memory cost (on a per-bit basis) will continue its decline, so that systems will contain larger and larger memory spaces. We are already seeing this effect in the latest processor designs, which have doubled the “standard” address size, yielding an increase from 4,294,967,296 addresses (with 32 bits) to 18,446,744,073,709,551,616 addresses (with 64 bits). We can expect that interconnection networks will continue to grow in both scale and performance. The growth of the Internet in the last few years has been phenomenal, and the increasing use of optics in the interconnection network has made this growth at least feasible.
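
As a quick check of the address counts quoted above (simple arithmetic, not from the article), both figures are exact powers of two:

\[
2^{32} = 4{,}294{,}967{,}296 \approx 4.3 \times 10^{9},
\qquad
2^{64} = 18{,}446{,}744{,}073{,}709{,}551{,}616 \approx 1.8 \times 10^{19}.
\]

Doubling the address width thus squares the number of addressable locations, since \(2^{64} = (2^{32})^{2}\).
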
However, we cannot expect that the ease of programming these improved configurations will advance: as the available parallelism of computer systems increases, exploiting this parallelism becomes the limiting factor. There are two aspects to this problem: finding large degrees of parallelism (typically an algorithmic or partitioning problem) and efficiently managing the available parallelism to achieve high performance (typically a scheduling or placement problem). Of course, all solutions to these problems must ensure that correctness is satisfied; it does not matter how fast the program runs if it does not produce the correct result. Solving these problems will require many developments and changes, few of which are foreseeable.

Although not satisfying, we can certainly say that programming paradigms, compiler techniques, algorithm designs, and operating systems are all fair game, but these are likely only pieces of the puzzle. Indeed, broad new approaches to the representation of physical problems may be required. The good news from all this is that there is no dearth of work to be done in this area. Although improvements can certainly be made to a single processor element, the performance benefits of these improvements are complementary and at this point are nowhere near the scale of performance available through exploiting parallelism. Clearly, providing parallelism of order n is much easier than increasing the execution rate (for example, the clock speed) by a factor of n.

The continued drive for higher- and higher-performance systems thus leads us to one simple conclusion: the future is parallel. The first electronic computers provided a speedup of 10,000 compared to the mechanical computers of 50 years ago. The challenge for the future is to realize parallel processors that provide a similar speedup over a broad range of applications. There is much work to be done here... let us be on with it!

REFERENCES

There are thousands of references dealing with the many aspects of parallel architectures. These references comprise only a very small subset of accessible publications, but they provide the interested reader with jumping-off points for further exploration.

FLYNN, M. J. 1995. Computer Architecture: Pipelined and Parallel Processor Design. Jones and Bartlett, Boston.
HOCKNEY, R. W. AND JESSHOPE, C. R. 1981. Parallel Computers: Architecture, Programming and Algorithms. Adam Hilger, Bristol.
HOCKNEY, R. W. AND JESSHOPE, C. R. 1988. Parallel Computers 2: Architecture, Programming and Algorithms, 2nd ed. Adam Hilger, Bristol.
HWANG, K. 1993. Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, New York.
IBBETT, R. N. AND TOPHAM, N. P. 1989a. Architecture of High Performance Computers, Vol. I: Uniprocessors and Vector Processors. Springer-Verlag, New York.
IBBETT, R. N. AND TOPHAM, N. P. 1989b. Architecture of High Performance Computers, Vol. II: Array Processors and Multiprocessor Systems. Springer-Verlag, New York.
KUHN, R. H. AND PADUA, D. A., EDS. 1981. Tutorial on Parallel Processing. IEEE Computer Society Press, Los Alamitos, CA.
WOLFE, M. J. 1996. High Performance Compilers for Parallel Computing. Addison-Wesley, Reading, MA.



