Heterogeneous Computing: Challenges and Opportunities

Ashfaq A. Khokhar, Viktor K. Prasanna, Muhammad E. Shaaban, and Cho-Li Wang
University of Southern California
Anytime you work with oranges and apples, you'll need a number of schemes to organize total performance. This article surveys the challenges posed by heterogeneous computing and discusses some approaches to opening up its opportunities.

Homogeneous computing, which uses one or more machines of the same type, has provided adequate performance for many applications in the past. Many of these applications had more than one type of embedded parallelism, such as single instruction, multiple data (SIMD) and multiple instruction, multiple data (MIMD). Most of the current parallel machines are suited only for homogeneous computing. However, numerous applications that have more than one type of embedded parallelism are now being considered for parallel implementation. On the other hand, as the amount of homogeneous parallelism in applications decreases, homogeneous systems cannot offer the desired speedups. To exploit the heterogeneity in computations, researchers are investigating a suite of heterogeneous architectures.

Heterogeneous computing (HC) is the well-orchestrated and coordinated effective use of a suite of diverse high-performance machines (including parallel machines) to provide superspeed processing for computationally demanding tasks with diverse computing needs. An HC system includes heterogeneous machines, high-speed networks, interfaces, operating systems, communication protocols, and programming environments, all combining to produce a positive impact on ease of use and performance. Figure 1 shows an example HC environment.

Heterogeneous computing should be distinguished from network computing or high-performance distributed computing, which have generally come to mean either clusters of workstations or ad hoc connectivity among computers using little more than opportunistic load balancing. HC is a plausible, novel technique for solving computationally intensive problems that have several types of embedded parallelism. HC also helps to reduce design risks by incorporating proven technology and existing designs instead of developing them from scratch. However, several issues and problems arise from employing this technique, which we discuss.

In the past few years, several technical meetings have addressed many of these issues. There is also a growing interest in using this paradigm to solve Grand Challenges problems. Richard Freund has organized the Heterogeneous Processing Workshops held each year at the IEEE International Parallel Processing Symposiums.

COMPUTER, June 1993. 0018-9162/93/0600-0018$03.00 © 1993 IEEE
Another related yearly meeting is the IEEE International Symposium on High-Performance Distributed Computing.

Heterogeneous systems

The quest for higher computational power suitable for a wide range of applications at a reasonable cost has exposed several inherent limitations of homogeneous systems. Replacing such systems with yet more powerful homogeneous systems is not feasible. Moreover, this approach does not improve the versatility of the system. HC offers a novel cost-effective approach to these problems; instead of replacing existing multiprocessor systems at high cost, HC proposes using existing systems in an integrated environment.

Limitations of homogeneous systems. Conventional homogeneous systems usually use one mode of parallelism in a given machine (like SIMD, MIMD, or vector processing) and thus cannot adequately meet the requirements of applications that require more than one type of parallelism.

Glossary

Analytical benchmarking: A procedure to analyze the relative effectiveness of machines on various computational types.

Code-type profiling: A code-specific function to identify various types of parallelism present in code and to estimate the execution times of each code type.

Cross-machine debuggers: Debuggers available within the heterogeneous computing environment to help debug application code that executes over multiple machines.

Cross-over overhead: The overhead incurred in transferring data from one machine to another, including the data-format-conversion overhead between the two machines.

Cross-parallel compiler: An intelligent compiler that can generate intermediate code executable on different parallel machines.

Heterogeneous computing (HC): A well-orchestrated, coordinated, effective use of a suite of diverse high-performance machines (including parallel machines) to provide fast processing for computationally demanding tasks that have diverse computing needs.

Metacomputations: Computations exhibiting coarse-grained heterogeneity in terms of embedded parallelism.

Mixed-mode computations: Computations exhibiting fine-grained heterogeneity in terms of embedded parallelism.

Multiple instruction, multiple data (MIMD): A mode in which code stored in each processor's local memory is executed independently.

Single instruction, multiple data (SIMD): A mode in which all processors execute the same instruction synchronously on data stored in their local memory.

[Figure 1. An example heterogeneous computing environment: user workstations connected over a network to a MasPar MP-2, a Cray Y-MP, a Connection Machine CM-5, a Massively Parallel Processor (MPP), and an Image-Understanding Architecture (IUA).]
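The example environment of Figure 1 can be written down as a small table of machines and the parallelism each supports — a representation that the later discussions of profiling and mapping can operate on. The machine names come from the figure; the classification of each machine is an illustrative assumption, as is the dictionary layout.

```python
# The Figure 1 suite as a simple table: machine -> mode of parallelism.
# Machine names are from the figure; each machine's classification here
# is an illustrative assumption.
suite = {
    "MasPar MP-2": "SIMD",
    "Cray Y-MP": "vector",
    "Connection Machine CM-5": "MIMD",
    "Massively Parallel Processor (MPP)": "SIMD",
    "Image-Understanding Architecture (IUA)": "special purpose",
}

def machines_supporting(suite, mode):
    """Return the machines in the suite that support a given mode."""
    return [name for name, m in suite.items() if m == mode]

print(machines_supporting(suite, "SIMD"))
# ['MasPar MP-2', 'Massively Parallel Processor (MPP)']
```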
As a result, any single type of machine often spends its time executing code for which it is poorly suited. Moreover, many applications need to process information at more than one level concurrently, with different types of parallelism at each level. Image understanding, a Grand Challenges problem, is one such application.

At the lowest level of computer vision, image-processing operations are applied to the raw image. These computations have massive SIMD-type parallelism. In contrast, the participants in the DARPA Image-Understanding Benchmark exercises observed that high-level image-understanding computations exhibit coarse-grained MIMD-type characteristics. For such applications, users of a conventional multiprocessor system must either settle for degraded performance on the existing hardware or acquire more powerful (and expensive) machines.

Each type of homogeneous system suffers from inherent limitations. For example, vector machines employ interleaved memory with a pipelined arithmetic logic unit, leading to performance in the high millions of floating-point operations per second (Mflops). If the data distribution of an application and the resulting computations cannot exploit these features, the performance degrades severely.

Consider an application code having mixed types of embedded parallelism. Assume that the code, when executed on a serial machine, spends 100 units of time. When this code is executed on a vector machine, the vector portion of the code is executed rapidly, while other portions of the code still have relatively higher execution times. Similarly, the same code, when executed on a suite of heterogeneous machines (so that each portion of the code is executed on its matching machine type), is likely to achieve speedups. Figure 2 illustrates a possible scenario (the numbers are execution times in terms of basic units).

[Figure 2. Execution of example code using various systems: total time of 100 units on a serial machine, 50 units on a vector machine, and 4 units plus communication overhead on a suite of vector, MIMD, SIMD, and special-purpose machines.]

Heterogeneous computing. Heterogeneity in computing systems is not an entirely new concept. Several types of special-purpose processors have been used to provide specific services for improving system throughput. One of the most common is I/O handling. Attaching floating-point processors to host computers is yet another heterogeneous approach to enhance system performance. In high-performance computers, the concept of heterogeneity manifests itself at the instruction level in the form of several types of functional units, such as vector arithmetic pipelines and fast scalar processors. However, current multiprocessor systems remain mostly homogeneous as far as the type of parallelism supported by them.

[Figure 3. User-directed approach: algorithm design, followed by partitioning and mapping, down to the programming environment for a suite of machines in HC environments.]
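The totals in the Figure 2 scenario can be reproduced with a few lines. The split of the 100 serial units across code portions below is a hypothetical assumption (the figure gives only the totals); what matters is how each total arises.

```python
# The Figure 2 scenario, made concrete. Per-portion times are assumptions
# chosen to reproduce the figure's totals, in the article's "basic units."
serial = {"vector": 55, "SIMD": 20, "MIMD": 15, "special": 10}
print(sum(serial.values()))          # serial machine: 100 units

# Vector machine: only the vector portion is accelerated (assume 55 -> 5);
# the remaining portions keep their serial costs.
vector_machine = dict(serial, vector=5)
print(sum(vector_machine.values()))  # 50 units

# Heterogeneous suite: each portion runs on its matching machine type
# (assume 1 unit each), at the price of cross-machine communication.
suite = {portion: 1 for portion in serial}
print(sum(suite.values()))           # 4 units + communication overhead
```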
Such systems have traditionally been classified according to the number of instruction and data streams.

An HC environment must contain the following components:

- a set of heterogeneous machines,
- an intelligent high-speed network connecting all machines, and
- a (user-friendly) programming environment.

HC lets a given system be adapted to a wide range of applications by augmenting it with specific functional or performance capabilities without requiring a complete redesign. Since HC comprises several autonomous computers, overall system fault tolerance and longevity are likely to improve.

Issues

We consider two approaches to using the HC paradigm. The first analyzes an application to explore embedded heterogeneous parallelism. Researchers must devise new algorithms or modify existing ones to exploit the heterogeneity present in the application. Based on these algorithms, users develop the code to be executed by the machines. In the second approach, an existing parallel code of the application is taken as input. To run this code in an HC environment, users must profile the types of heterogeneous parallelism embedded in the code. For this purpose, code-type profilers need to be designed. Figures 3 and 4 illustrate these approaches. Both approaches, however, need strategies for partitioning, mapping, scheduling, and synchronization. New tools and metrics for performance evaluation are also required, and parallel programming environments are needed to orchestrate the effective use of the computing resources.

Algorithm design. Heterogeneous computing opens new opportunities for developing parallel algorithms. In this section, we identify the efforts needed to devise suitable algorithms. The designer must consider

(1) the types of machines available and their inherent computing characteristics,
(2) alternate solutions to various subproblems of the application, and
(3) the costs of performing the communication over the network.

Computations in HC can be classified into two classes.

Metacomputing. Computations in this class fall into the category of coarse-grained heterogeneity. Instructions belonging to a particular class of parallelism are grouped to form a module; each module is then executed on a suitable parallel machine. Metacomputing refers to heterogeneity at the module level.

Mixed-mode computing. In this fine-grained heterogeneity, almost every alternate parallel instruction belongs to a different class of parallel computation. Programs exhibiting this type of heterogeneity are not suitable for execution on a suite of heterogeneous machines, because the communication overhead due to frequent exchange of information between machines can become a bottleneck. However, these programs can be executed efficiently on a single machine, such as PASM (Partitionable SIMD/MIMD), that incorporates heterogeneous modes of computation. Mixed-mode computing refers to heterogeneity at the instruction level.

[Figure 4. Compiler-directed approach: code analysis identifies vector, MIMD, SIMD, and special-purpose code types, which are then handled through the programming environment.]
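The distinction between the two classes can be illustrated with a toy classifier that looks at how often consecutive parallel instructions switch parallelism class. The switching-rate heuristic and the 0.5 threshold are illustrative assumptions, not a method from this article.

```python
# Toy classifier for the grain of embedded heterogeneity. The switching-rate
# heuristic and the threshold value are illustrative assumptions.
def heterogeneity_grain(classes, switch_threshold=0.5):
    """classes: parallelism class of each parallel instruction, in order."""
    if len(classes) < 2:
        return "metacomputing"
    switches = sum(a != b for a, b in zip(classes, classes[1:]))
    rate = switches / (len(classes) - 1)
    # Frequent class switches -> mixed-mode: better executed on one
    # mode-switching machine (such as PASM) than shipped between machines.
    return "mixed-mode" if rate > switch_threshold else "metacomputing"

# Long runs of one class (module-level heterogeneity): metacomputing.
print(heterogeneity_grain(["SIMD"] * 40 + ["MIMD"] * 40))  # metacomputing
# Alternating classes (instruction-level heterogeneity): mixed-mode.
print(heterogeneity_grain(["SIMD", "MIMD"] * 40))          # mixed-mode
```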
Mixed-mode machines can achieve large speedups for fine-grained heterogeneity by using the mixed-mode processing available in a single machine. A mixed-mode machine, for example, can use its mode-switching capability to support SIMD/MIMD parallelism and hardware-barrier synchronization, thus improving its performance over a machine operating in SIMD or MIMD mode only.

Code-type profiling. Fast parallel execution of the code in a heterogeneous computing environment requires identifying and profiling the embedded parallelism. Traditional program profiling involves testing a program, assumed to consist of several modules, by executing it on suitable test data. The profiler monitors the execution of the program and gathers statistics, including the execution time of each program module. This information is then used to modify the modules to improve the overall execution time.

In HC, profiling is done not only to estimate the code's execution time on a particular machine but also to analyze the code's type. This is achieved by code-type profiling. As introduced by Freund, this code-specific function is an off-line procedure; the statistics to be gathered include the types of parallelism of the various modules in the code and the estimated execution time of each module on the machines available in the environment. Code types that can be identified include vectorizable, SIMD/MIMD parallel, scalar, and special purpose (such as fast Fourier transform).

Analytical benchmarking. This test measures how well the available machines perform on a given code type. While code-type profiling identifies the type of code, analytical benchmarking ranks the available machines in terms of their efficiency in executing a given code type. Thus, analytical benchmarking techniques permit researchers to determine the relative effectiveness of a given parallel machine on various types of computation.

This benchmarking is also an off-line process and is more rigorous than previous benchmarking techniques, which simply looked at the overall result of running an entire benchmark code on a processor. Some experimental results obtained by analytical benchmarking show that SIMD machines are well suited for operations such as matrix computations and low-level image processing. MIMD machines, on the other hand, are most efficient when an application can be partitioned into a number of tasks that have limited intercommunication. Note that analytical benchmark results are used in partitioning and mapping.

Partitioning and mapping. Problems that occur in these areas of a homogeneous parallel environment have been widely studied. The partitioning problem can be divided into two subproblems. Parallelism detection determines the parallelism present in a given program. Clustering combines several operations into a program module and thus partitions the application into several modules. These two subproblems can be handled by the user, the compiler, or the machine at runtime.

In HC, parallelism detection is not the only objective; code classification based on the type of parallelism is also required. This is accomplished by code-type profiling, which also poses additional constraints on clustering.

Mapping (allocating) program modules to processors has been addressed by many researchers. Informally, in homogeneous environments, the mapping problem can be defined as assigning program modules to processors so that the total execution time (including the communication costs) is minimized. Several other costs, such as the interference cost, have also been considered. In HC, however, other objectives, such as matching the code type to the machine type, result in additional constraints. If such a mapping has to be performed at runtime for load-balancing purposes (or due to machine failure), the mapping problem becomes more complex because of the overhead associated with code and data-format conversions. Various approaches to optimal and approximate partitioning and mapping in HC have been studied.

Mapping in HC can be performed conceptually at two levels: system (or macro) and machine (or micro). In system-level mapping, each module is assigned to one or more machines in the system so that the parallelism embedded in the module matches the machine type. Machine-level mapping assigns portions of the module to individual processors in the machine. The most common goal of the mapping process is to accomplish these assignments such that the overall runtime of the task is minimized.

Chen et al. proposed a heuristic mapping methodology based on the Cluster-M model, which facilitates the design of portable software. Only one algorithm is required for a given application, regardless of the underlying architecture. Various types of parallelism present in the application are identified. In addition, all communication and computation requirements of the application are preserved in an intermediate specification of the code. The architecture of each machine in the environment is modeled in the system representation, which captures the interconnections of the architecture. The four components of this approach are

- an intermediate model to provide an architecture-independent algorithm specification of the application;
- languages to support the specification in the intermediate model (such languages should be machine-independent and allow a certain amount of abstraction of the computations);
- a tool that lets users specify topologies of the machines employed in the environment; and
- a mapping module to match the problem specification and the system representation.

Figure 5 illustrates this methodology.

Machine selection. An interesting problem appears in the design of HC environments: How can one find the most appropriate suite of heterogeneous machines for a given collection of application tasks, subject to a given constraint such as cost or execution time? Freund has proposed the Optimal Selection Theory (OST) to choose an optimal configuration of machines for executing an application task on a heterogeneous suite of computers, under the assumption that the number of machines available is unlimited. It is also assumed that machines matching the given set of code types are available and that the application code is decomposed into equal-sized modules.

Wang et al.'s Augmented Optimal Selection Theory (AOST) incorporates the performance of code segments on nonoptimal machine choices, assuming that the number of available machines for each code type is limited.
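As a sketch of how code-type profiling and analytical benchmarking feed into mapping, consider the following: profiling tags each module with a code type and an estimated serial time, benchmarking supplies each machine's speedup on each code type, and the mapper sends each module to the machine that minimizes its estimated runtime. All machine names, speedups, and module times are hypothetical.

```python
# Analytical-benchmark results (hypothetical): speedup of each machine
# over serial execution, per code type.
benchmark = {
    "vector-machine": {"vectorizable": 20.0, "SIMD": 2.0, "MIMD": 1.0},
    "SIMD-machine":   {"vectorizable": 4.0, "SIMD": 25.0, "MIMD": 1.0},
    "MIMD-machine":   {"vectorizable": 2.0, "SIMD": 3.0, "MIMD": 15.0},
}

# Code-type profile (hypothetical): module, code type, estimated serial time.
profile = [
    ("convolve", "SIMD", 60.0),
    ("solve", "vectorizable", 30.0),
    ("match", "MIMD", 45.0),
]

def map_modules(profile, benchmark):
    """System-level mapping: pick the machine minimizing each module's time."""
    mapping = {}
    for module, code_type, t_serial in profile:
        mapping[module] = min(benchmark,
                              key=lambda m: t_serial / benchmark[m][code_type])
    return mapping

print(map_modules(profile, benchmark))
# {'convolve': 'SIMD-machine', 'solve': 'vector-machine', 'match': 'MIMD-machine'}
```

A real mapper would also charge for cross-over overhead (data transfer and format conversion) between consecutive modules assigned to different machines, which this sketch ignores.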
In this approach, the program module most suitable for one type of machine is assigned to another type of machine. In the formulation of OST and AOST, it has been assumed that the execution of all program modules of a given application code is totally ordered in time. In reality, however, different execution interdependencies can exist among program modules. Also, parallelism can be present inside a module, resulting in further decomposition of program modules. Furthermore, the effect of different mappings on the different machines available for a program module has not been considered in the formulation of these selection theories.

The Heterogeneous Optimal Selection Theory (HOST) extends AOST in two ways. It incorporates the effect of the various mapping techniques available on different machines for executing a program module. Also, the dependencies between the program modules are specified as a directed graph. (Note that OST and AOST assume a linear ordering of program modules.) In the formulation of HOST, an application code is assumed to consist of subtasks to be executed serially. Each subtask contains a collection of program modules. Each program module is further decomposed into blocks of parallel instructions, called code blocks.

To find an optimal set of machines, we have to assign the program modules to the machines so that the total execution time

    Σᵢ Tᵢ

is minimal, while

    Σᵢ Cᵢ ≤ Cmax,

where Tᵢ is the time to execute program module i, Cᵢ is the cost of the machine on which program module i is to be executed, and Cmax is an overall constraint on the cost of the machines. The cost Cᵢ and execution time Tᵢ corresponding to the assignment under consideration can be obtained by using code-type profiling and/or by analyzing the algorithms.

Iqbal presented a selection scheme that finds an assignment of program modules to machines in HC so that the total processing time is minimized, while the total cost of the machines employed in the solution does not exceed an upper bound. The scheme can also find a solution to the dual of the above problem, that is, finding a least expensive set of machines to solve a given application subject to a maximal execution time constraint. This scheme is applicable to all of the above selection theories. The accuracy of the scheme, however, depends upon the method used to assign the program modules to the machines. Iqbal also shows that for applications in which the program modules communicate in a restrictive manner, one can find exact algorithms for selecting an optimal set of machines. If, however, the program modules communicate in an arbitrary fashion, the selection problem is NP-complete.

Scheduling. In homogeneous environments, a scheduler assigns each program module to a processor to achieve the desired performance in terms of processor utilization and throughput. Designers usually employ three scheduling levels. High-level scheduling, also called job scheduling, selects a subset of all submitted jobs competing for the available resources. Intermediate-level scheduling responds to short-term fluctuations in the system load by temporarily suspending and activating processes to achieve smooth system operation. Low-level scheduling determines the next ready process to be assigned to a processor for a certain duration. Different scheduling policies, such as FIFO, round-robin, shortest-job-first, and shortest-remaining-time, can be employed at each level of scheduling.

While all three levels of scheduling can reside in each machine in an HC environment, a fourth level is needed to perform scheduling at the system level. This scheduler maintains a balanced system-wide workload by monitoring the progress of all program modules. In addition, the scheduler needs to know the different module types and the available machine types in the environment, since modules may have to be reassigned when the system configuration changes or overload situations occur. Communication bottlenecks and queueing delays incurred due to the heterogeneity of the hardware add constraints on the scheduler.

Synchronization. This process provides mechanisms to control execution sequencing and to supervise interprocess cooperation. It refers to three distinct but related problems:

- synchronization between the sender and receiver of a message,
- specification and control of the shared activities of cooperating processes, and
- serialization of concurrent accesses to shared objects by multiple processes.

[Figure 5. Cluster-M-based heuristic mapping methodology: a problem-specification tool produces an architecture-independent specification that a mapping module matches against the representation of the heterogeneous architecture.]
A variety of synchronization methods have been proposed in the past: semaphores, conditional critical regions, monitors, and pass expressions, among others. In addition, some multiprocessors include hardware synchronization primitives. In general, synchronization can be implemented by using shared variables or by message-passing.

In heterogeneous computing, the synchronization problem resembles that of distributed systems. In both cases, a global clock and shared memory are absent, and (unpredictable) network delays and a variety of operating systems and programming environments complicate the process.

Several techniques used in distributed systems are again useful for solving HC synchronization problems. Two approaches are available: centralized (one machine is designated as a control node) and distributed (decision-making is distributed across the entire system). The correct choice depends on the topology, reliability, speed, and bandwidth of the network, in addition to the types and number of machines in the environment. However, reducing synchronization overhead is important to achieving large speedups in HC. Because several autonomous machines may operate concurrently in the environment, application-code performance in HC is more sensitive to synchronization overheads. Frequent handshaking for synchronization may expend most of the available network bandwidth.

Interconnection requirements. Current local area networks (LANs) are not suitable for HC, because networks of higher bandwidth and lower latency are needed. The bandwidth of commercially available LANs is limited to about 10 megabits per second. In HC, on the other hand, assuming machines operating at 40 megahertz and 20 million instructions per second with a 32-bit word length, a bandwidth on the order of 1 gigabit per second is required to match the computation and communication speeds.

Even if higher bandwidth networks were available, three main sources of inefficiency would persist in current networks. First, application interfaces incur excessive overhead due to context switching and data copying between the user process and the machine's operating system. Second, each machine must incur the overhead of executing the high-level protocols that ensure reliable communication between program modules. Also, the network interface burdens the machine with interrupt handling and header processing for each packet. This suggests incorporating additional network-interface hardware in each machine.

Nectar is an example of a network backplane for heterogeneous multicomputers. It consists of a high-speed fiber-optic network, large crossbar switches, and powerful network-interface processors. Protocol processing is off-loaded to these interface processors. A networking standard called Hippi (ANSI X3T9.3 High-Performance Parallel Interface) is being implemented for realizing heterogeneous computing environments at various research sites. Hippi is an open standard that defines the physical and logical link layers of a 100-Mbyte/second network.

In HC, hardware modules from various vendors share physical interconnections. Differing communication protocols may make network-management problems complex. The following general approaches for dealing with network heterogeneity have been discussed in the literature:

(1) treat the heterogeneous network as a partitioned network, with each partition employing a uniform set of protocols;
(2) have a single "visible" network-management console; and
(3) integrate the heterogeneous management functions at a single management console.

The IEEE Computer Society Technical Committee on Parallel Processing, the Technical Committee on Mass Storage, and several research sites are working together to define interface standards.

Some academic sites

A number of academic sites are developing HC environments and applications (this list is not exhaustive).

Systems and architectures

- Distributed High-speed Computing (DHSC) project at Pittsburgh Supercomputing Center, University of Pittsburgh
- Image-Understanding Architecture, University of Massachusetts at Amherst
- Mentat, University of Virginia
- Nectar-Based Heterogeneous System, Carnegie Mellon University
- Northeast Parallel Architecture Center (NPAC), Syracuse University
- Partitionable SIMD/MIMD (PASM), Purdue University

Institutes and departments

- Beckman Institute, University of Illinois at Urbana-Champaign
- Department of Biological Sciences, University of California at Los Angeles
- Department of Computer Science, Kent State University
- Department of Computer Science, University of California at San Diego
- Department of Computer and Information Sciences, New Jersey Institute of Technology
- Department of Electrical Engineering-Systems, University of Southern California
- Department of Math and Computer Science, Emory University
- Minnesota Supercomputer Center (MSC), University of Minnesota at Minneapolis
- Supercomputer Computations Institute (SCI), Florida State University
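The gigabit-per-second figure cited above follows from simple arithmetic, under the deliberately crude assumption that each instruction moves one 32-bit word across the network:

```python
# Required bandwidth if a 20-MIPS machine with 32-bit words streams one
# word per instruction over the network (a crude worst-case assumption).
instructions_per_second = 20e6
bits_per_word = 32
required_bits_per_second = instructions_per_second * bits_per_word
print(required_bits_per_second / 1e9)  # 0.64, i.e., on the order of 1 gigabit/second
```

Against the roughly 10 megabits per second of the LANs of the day, this is a shortfall of well over an order of magnitude.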
8. parallel languages, intelligent compil-
ers, parallel debuggers, syntax-directed
editors. configuration-management I I
tools, and other programming aids.
In homogeneous computing, intelli-
gent compilers detect parallelism in
sequential code and translate it into
parallel machine code. Parallel program-
ming languages have been developed to
support parallel programming, such as
MPL for MasPar machines, and Lisp
and C for the Connection Machine. In
addition, several parallel programming
environments and models have been
designed, such as Code, Faust, Sched-
ule, and Linda.
H C requires machine-independent
and portable parallel programming lan-
guages and tools. This requirement cre-
ates the need for designing cross-paral-
le1 compilers for all machines in the
environment, and parallel debuggers for
debugging cross-machine code. Several
programming models and environments 1
have been developed in the past for
heterogeneous computing.R.'J-16
The Parallel Virtual Machine (PVM)
I Programming environment
I
system.16 evolved over the past three Figure 6. An overview of the Parallel Virtual Machine system.
years, consists of software that provides
a virtual concurrent computing envi-
ronment on general-purpose networks work, presenting a virtual concurrent in the environment. The inherent con-
of heterogeneous machines. It is com- computing environment to users. currency in a distributed computing
posed of a set of user-interface primi- environment, the lack of total ordering
tives and supporting software that en- Performance evaluation.Performance of events on different machines, and the
able concurrent computing on a loosely tools are used to summarize the run- nondeterministic nature of the commu-
coupled network of high-performance time behavior of an application, includ- nication delays between the processes
machines. It can be implemented on a ing analyzingresource use and the cause make the problem of evaluating perfor-
hardware base consisting of different architectures, including single-CPU systems, vector machines, and multiprocessors (see Figure 6).

Application programs view the PVM system as a general and flexible parallel computing resource that supports shared-memory, message-passing, and hybrid models of computation. A heterogeneous application can be decomposed into several subtasks based on the embedded types of computation and then executed by using PVM subroutines on different matching machines available on the network. The PVM primitives are provided in the form of libraries linked to application programs written in imperative languages. They support process initiation and management, message passing, synchronization, and other housekeeping facilities. Support software provided by the PVM system executes on a set of user-specified computing elements on a network.
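The master/worker usage of these primitives can be pictured with a sketch in the flavor of the PVM 3 C interface. The worker program name is hypothetical, the call names follow the PVM 3 library (which postdates the PVM paper cited here, so details vary by version), and the fragment needs a configured PVM virtual machine, so treat it as pseudocode rather than a runnable program:

```c
#include "pvm3.h"   /* PVM library header, assumed installed */

int main(void) {
    int tids[4];
    pvm_mytid();    /* enroll this process in the virtual machine */

    /* Spawn four copies of a (hypothetical) worker program; with
     * PvmTaskDefault, PVM picks suitable hosts in the machine suite. */
    int n = pvm_spawn("worker", NULL, PvmTaskDefault, "", 4, tids);

    for (int i = 0; i < n; i++) {   /* send each worker its rank */
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&i, 1, 1);
        pvm_send(tids[i], 1);       /* message tag 1: work */
    }

    for (int i = 0; i < n; i++) {   /* collect one result per worker */
        int result;
        pvm_recv(-1, 2);            /* tag 2: result, from any task */
        pvm_upkint(&result, 1, 1);
    }
    pvm_exit();                     /* leave the virtual machine */
    return 0;
}
```

Each spawned worker would symmetrically unpack its rank, run the subtask matched to its host machine, and send its result back under message tag 2.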
Depending on its design, a performance tool can describe program behaviors at many levels of detail and help identify any performance bottleneck. The two most common levels are the intraprocess and interprocess levels. Intraprocess performance tools, such as the gprof facility on BSD Unix, the HP Sampler/3000, and the Mesa Spy, provide information about individual processes. Performance tools for distributed computing systems concentrate on the interactions between the processes. Integrated performance models that observe the status and the performance events at all levels can be found in the PIE (Programming and Instrumentation Environment) project.17

Designing performance-evaluation tools for distributed computing systems involves collecting, interpreting, and evaluating performance information from application programs, the operating system, the communication network, and the other hardware modules employed; this diversity makes measuring performance more complex. The impact of the code type must be considered. Thus, performance metrics such as processor utilization, speedup, and efficiency are difficult to compute. Indeed, these metrics must be carefully defined to make a reasonable performance evaluation.
Image understanding

Intrinsic parallelism in image processing and the variety of heuristics available for problems in image understanding make computer vision an ideal vehicle for studying heterogeneous computing. From a computational perspective, vision processing is usually organized as follows:

June 1993 25
• Early processing of the raw image (often called low-level processing). At this level, the input is an image. The output image is approximately the same size. Convolutions are performed on each pixel in parallel. The data communication among the pixels is local to each pixel.

• Interfacing between low-level and image-understanding problems (often termed intermediate-level processing). The operations performed on each data item can be nonlocal. The communication is also irregular as compared with that of low-level processing.

• Image understanding. By this we mean using the acquired data from the above processing (for example, geometric features such as shape, orientation, and moments) to infer semantic attributes of an image. Processing at this level can be classified as knowledge and/or symbolic processing. Search-based techniques are widely used at this level.
As evident in the preliminary results from the 1988 DARPA Image-Understanding Benchmark,18 each level in computer vision exhibits a different type of parallelism. Therefore, at each level a suitable type of parallel machine must be employed. Corresponding to each of the above classes of problems, a suitable class of architecture was proposed:3

• SIMD machines. Machines in this class are well suited for computations in low-level and in some intermediate-level computer vision problems because of the regular dataflow and iconic operations in these two levels. For example, two-dimensional cellular arrays and mesh-connected computers have been proposed for a large class of geometric and graph-based problems in image processing. Parallel machines such as the MasPar MP-series and the Connection Machine CM-200 fall in this category. Pipelined parallel machines (like the Carnegie Mellon University Warp machine) are also well suited for low- and intermediate-level vision computations.

• Medium-grained MIMD machines. Various intermediate- and high-level vision tasks are computationally intensive with irregular dataflow. Moreover, the size of the input is smaller than the input image size. Parallel systems having a set of powerful processors are suitable for performing computations in intermediate- and high-level vision tasks. The Connection Machine CM-5, Vista,12 Alliant FX-80, and Sequent Symmetry 81 are some examples.

• Coarse-grained MIMD machines. High-level vision tasks such as image understanding/recognition and symbolic processing employ complex data structures. Many of the proposed algorithms for such problems are nondeterministic, and architectural requirements for these problems demand coarse-grained MIMD machines. Parallel machines such as the Aspex ASP and Vista13 are well suited for this class of problems.

Another approach is to build machines having multiple computational capabilities embedded in a single system. These architectures consist of several levels. Typically, the lower levels operate in SIMD mode and the higher levels operate in MIMD mode. In the Image-Understanding Architecture,19 the lowest level has bit-serial processors, and the intermediate level consists of digital signal processors. The highest level consists of general-purpose microprocessors operating in MIMD mode.
An example vision task. We present an example vision task and identify the different types of parallelism. We have chosen the DARPA Integrated Image-Understanding Benchmark4 as an example task. The overall task performed by this benchmark is the recognition of an approximately specified two-and-a-half-dimensional “mobile” sculpture in a cluttered environment, given images from intensity and range sensors.

Steps in the benchmark can be identified by the vision-task classifications. First, low-level operations such as connected component labeling and corner extraction are performed. Then, grouping the corners (an intermediate-level vision operation) results in the extraction of candidate rectangles. Finally, partial matching of the candidate rectangles is followed by confirmed matching (a high-level vision task). The results obtained on several different parallel machines were reported at the 1988 Image-Understanding Workshop. Details of the benchmark results can be found in Weems et al.18
As they describe, directly interpreting these results would be unfair, since there were many undefined factors in the benchmark description. However, the benchmark does give pointers to how different machines can be classified with respect to their suitability for performing operations at different levels of vision. Overall, the simulation results show that the (heterogeneous) Image-Understanding Architecture performs better than any single machine considered. These results support the suitability of a heterogeneous environment for computer vision applications.

Heterogeneous computing offers new challenges and opportunities to several research communities. To support this paradigm, the following areas of research must be investigated:

• Designing tools to identify heterogeneous parallelism embedded in applications.
• Studying issues in high-speed networking, including available technologies and specialized hardware for networking.
• Designing communication protocols to reduce the cross-over overheads that occur when different machines communicate in the same environment.
• Developing standards for parallel interfaces between various machines.
• Designing efficient partitioning and mapping strategies to exploit heterogeneous parallelism embedded in applications.
• Designing user interfaces and user-friendly programming environments to program diverse machines in the same environment.
• Developing algorithms for applications with heterogeneous computing requirements.

Indeed, HC provides an opportunity to bring together research from various disciplines of computer science and engineering to develop a feasible approach for applications in the Grand Challenges problem set.
Acknowledgments

We thank Richard Freund and Ashraf Iqbal for many helpful discussions. This research was partly supported by the National Science Foundation under Grant No. IRI-9145810.
26 COMPUTER
References

1. R. Freund and D. Conwell, "Superconcurrency: A Form of Distributed Heterogeneous Supercomputing," Supercomputing Review, Oct. 1990, pp. 47-50.

2. Newsletter of the IEEE Computer Society Technical Committee on Parallel Processing (TCPP), Vol. 1, No. 1, Oct. 1992.

3. V.K. Prasanna Kumar, Parallel Algorithms and Architectures for Image Understanding, Academic Press, Boston, 1991.

4. C. Weems et al., "An Integrated Image-Understanding Benchmark: Recognition of a 2-1/2D Mobile," Proc. DARPA Image-Understanding Workshop, Morgan Kaufmann Publishers, San Mateo, Calif., 1988, pp. 111-126.

5. T. Berg and H.J. Siegel, "Instruction Execution Trade-offs for SIMD vs. MIMD vs. Mixed-Mode Parallelism," Proc. Int'l Parallel Processing Symp. (IPPS), IEEE CS Press, Los Alamitos, Calif., Order No. 2167, 1991, pp. 301-308.

6. A. Khokhar et al., "Heterogeneous Supercomputing: Problems and Issues," Proc. Workshop on Heterogeneous Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 2702, 1992, pp. 3-12.

7. R. Freund, "Optimal Selection Theory for Superconcurrency," Proc. Supercomputing 89, IEEE CS Press, Los Alamitos, Calif., Order No. M2021 (microfiche), 1989, pp. 13-17.

8. G. Agha and R. Panwar, "An Actor-Based Framework for Heterogeneous Computing Systems," Proc. Workshop on Heterogeneous Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 2702, 1992, pp. 35-42.

9. S. Chen et al., "A Selection Theory and Methodology for Heterogeneous Supercomputing," Proc. Workshop on Heterogeneous Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 3532-02, 1993.

10. M. Wang et al., "Augmenting the Optimal Selection Theory for Superconcurrency," Proc. Workshop on Heterogeneous Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 2702, 1992, pp. 13-22.

11. M. Iqbal, "Partitioning Problems for Heterogeneous Computer Systems," tech. report, Dept. of Electrical Engineering-Systems, Univ. of Southern California, Los Angeles, 1993.

12. E. Arnould et al., "The Design of Nectar: A Network Backplane for Heterogeneous Multicomputers," Proc. Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS III), IEEE CS Press, Los Alamitos, Calif., Order No. M1936 (microfiche), 1989, pp. 205-216.

13. ANSI X3T9.3, "High-Performance Parallel Interface: HIPPI-PH, HIPPI-SC, HIPPI-FP, HIPPI-LE, and HIPPI-MI," Working Draft Proposed American National Standard for Information Systems, American Nat'l Standards Inst., New York, Jan.-Apr. 1991.

14. C. de Castro and S. Yalamanchili, "Partitioning Signal Flow Graphs for Execution on Heterogeneous Signal Processing Architectures," Proc. Workshop on Heterogeneous Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 2702, 1992, pp. 81-86.

15. J. Potter, "Heterogeneous Associative Computing," Proc. Workshop on Heterogeneous Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 3532-02, 1993.

16. V. Sunderam, "PVM: A Framework for Parallel Distributed Computing," Concurrency: Practice and Experience, Vol. 2, No. 4, Dec. 1990, pp. 315-339.

17. Z. Segall and L. Rudolph, "PIE: A Programming and Instrumentation Environment for Parallel Processing," IEEE Software, Vol. 2, No. 6, Nov. 1985, pp. 22-27.

18. C. Weems et al., "Preliminary Results from the DARPA Integrated Image-Understanding Benchmark," Parallel Architectures and Algorithms for Image Understanding, V.K. Prasanna, ed., Academic Press, Boston, 1991, pp. 399-499.

19. D. Shu, J. Nash, and C. Weems, "A Multiple-Level Heterogeneous Architecture for Image Understanding," Proc. Int'l Conf. Pattern Recognition, IEEE CS Press, Los Alamitos, Calif., Vol. 2, Order No. 2063, 1990.

Ashfaq A. Khokhar is a PhD candidate in the Department of Electrical Engineering-Systems at the University of Southern California, Los Angeles. His areas of research include parallel architectures and scalable algorithms, image understanding and parallel processing, VLSI computations, interconnection networks, and heterogeneous computing.

Khokhar received the BSc degree in electrical engineering from the University of Engineering and Technology, Lahore, Pakistan, in 1985 and the MS degree in computer engineering from Syracuse University in 1988. He is a student member of the Computer Society.

Viktor K. Prasanna (V.K. Prasanna Kumar) is an associate professor in the Department of Electrical Engineering-Systems, University of Southern California, Los Angeles. His research interests include parallel computation, computer architecture, VLSI computations, and computational aspects of image processing, vision, robotics, and neural networks.

Prasanna received the BS degree in electronics engineering from Bangalore University, the MS degree from the School of Automation, Indian Institute of Science, and the PhD in computer science from Pennsylvania State University in 1983. He serves as the symposium chair of the 1994 IEEE International Parallel Processing Symposium and is a subject area editor of the Journal of Parallel and Distributed Computing, IEEE Transactions on Computers, and IEEE Transactions on Signal Processing. He is the founding chair of the IEEE Computer Society Technical Committee on Parallel Processing and is a senior member of the Computer Society.

Muhammad E. Shaaban is a PhD candidate in the Department of Electrical Engineering-Systems, University of Southern California. His areas of research include parallel optical interconnection networks, parallel algorithms for image processing, and heterogeneous computing.

Shaaban received the BS and MS degrees in electrical engineering from the University of Petroleum and Minerals, Dhahran, Saudi Arabia, in 1984 and 1986, respectively. He recently served as a session chair at the International Parallel Processing Symposium. He is a student member of the Computer Society.

Cho-Li Wang is a PhD candidate in the Department of Electrical Engineering-Systems, University of Southern California, Los Angeles. His areas of research include computer architectures and algorithms, image understanding and parallel processing, image compression, and heterogeneous computing.

Wang received the BS degree in computer science and information engineering from National Taiwan University, Taiwan, in 1985 and the MS degree in computer engineering from the University of Southern California in 1990.

Readers can contact Viktor K. Prasanna at the School of Engineering, Department of Electrical Engineering-Systems, University of Southern California, University Park, Los Angeles, CA 90089-2562.