Heterogeneous Computing: Challenges and Opportunities

Ashfaq A. Khokhar, Viktor K. Prasanna, Muhammad E. Shaaban, and Cho-Li Wang
University of Southern California

Anytime you work with oranges and apples, you'll need a number of schemes to organize total performance. This article surveys the challenges posed by heterogeneous computing and discusses some approaches to opening up its opportunities.

Homogeneous computing, which uses one or more machines of the same type, has provided adequate performance for many applications in the past. Many of these applications had more than one type of embedded parallelism, such as single instruction, multiple data (SIMD) and multiple instruction, multiple data (MIMD). Most of the current parallel machines are suited only for homogeneous computing. However, numerous applications that have more than one type of embedded parallelism are now being considered for parallel implementation. On the other hand, as the amount of homogeneous parallelism in applications decreases, homogeneous systems cannot offer the desired speedups. To exploit the heterogeneity in computations, researchers are investigating a suite of heterogeneous architectures.

Heterogeneous computing (HC) is the well-orchestrated and coordinated effective use of a suite of diverse high-performance machines (including parallel machines) to provide superspeed processing for computationally demanding tasks with diverse computing needs. An HC system includes heterogeneous machines, high-speed networks, interfaces, operating systems, communication protocols, and programming environments, all combining to produce a positive impact on ease of use and performance. Figure 1 shows an example HC environment. Heterogeneous computing should be distinguished from network computing or high-performance distributed computing, which have generally come to mean either clusters of workstations or ad hoc connectivity among computers using little more than opportunistic load-balancing.
HC is a plausible, novel technique for solving computationally intensive problems that have several types of embedded parallelism. HC also helps to reduce design risks by incorporating proven technology and existing designs instead of developing them from scratch. However, several issues and problems arise from employing this technique, which we discuss.

In the past few years, several technical meetings have addressed many of these issues. There is also a growing interest in using this paradigm to solve Grand Challenges problems. Richard Freund has organized the Heterogeneous Processing Workshops held each year at the IEEE International Parallel Processing Symposiums.1 Another related yearly meeting is the IEEE International Symposium on High-Performance Distributed Computing.2

0018-9162/93/0600-0018$03.00 © 1993 IEEE

Glossary

Analytical benchmarking: A procedure to analyze the relative effectiveness of machines on various computational types.
Code-type profiling: A code-specific function to identify various types of parallelism present in code and to estimate the execution times of each code type.
Cross-machine debuggers: Debuggers available within the heterogeneous computing environment to help debug application code that executes over multiple machines.
Cross-over overhead: The overhead incurred in transferring data from one machine to another. It also includes data-format-conversion overhead between the two machines.
Cross-parallel compiler: An intelligent compiler that can generate intermediate code executable on different parallel machines.
Heterogeneous computing (HC): A well-orchestrated, coordinated, effective use of a suite of diverse high-performance machines (including parallel machines) to provide fast processing for computationally demanding tasks that have diverse computing needs.
Metacomputations: Computations exhibiting coarse-grained heterogeneity in terms of embedded parallelism.
Mixed-mode computations: Computations exhibiting fine-grained heterogeneity in terms of embedded parallelism.
Multiple instruction, multiple data (MIMD): A mode in which code stored in each processor's local memory is executed independently.
Single instruction, multiple data (SIMD): A mode in which all processors execute the same instruction synchronously on data stored in their local memory.

Heterogeneous systems

The quest for higher computational power suitable for a wide range of applications at a reasonable cost has exposed several inherent limitations of homogeneous systems. Replacing such systems with yet more powerful homogeneous systems is not feasible. Moreover, this approach does not improve the versatility of the system. HC offers a novel cost-effective approach to these problems; instead of replacing existing multiprocessor systems at high cost, HC proposes using existing systems in an integrated environment.

Limitations of homogeneous systems. Conventional homogeneous systems usually use one mode of parallelism in a given machine (like SIMD, MIMD, or vector processing) and thus cannot adequately meet the requirements of applications that require more than one

Figure 1. An example heterogeneous computing environment: user workstations networked with a MasPar MP-2, a Cray Y-MP, a Connection Machine CM-5, a Massively Parallel Processor (MPP), and an Image-Understanding Architecture (IUA).
type of parallelism. As a result, any single type of machine often spends its time executing code for which it is poorly suited. Moreover, many applications need to process information at more than one level concurrently, with different types of parallelism at each level. Image understanding, a Grand Challenges problem, is one such application. At the lowest level of computer vision, image-processing operations are applied to the raw image. These computations have a massive SIMD-type parallelism. In contrast, the participants in the DARPA Image-Understanding Benchmark exercises observed that high-level image-understanding computations exhibit coarse-grained MIMD-type characteristics. For such applications, users of a conventional multiprocessor system must either settle for degraded performance on the existing hardware or acquire more powerful (and expensive) machines.

Each type of homogeneous system suffers from inherent limitations. For example, vector machines employ interleaved memory with a pipelined arithmetic logic unit, leading to performance in the high millions of floating-point operations per second (Mflops). If the data distribution of an application and the resulting computations cannot exploit these features, the performance degrades severely.

Consider an application code having mixed types of embedded parallelism. Assume that the code when executed on a serial machine spends 100 units of time. When this code is executed on a vector machine, the vector portion of the code is executed rapidly, while other portions of the code still have relatively higher execution times. Similarly, the same code when executed on a suite of heterogeneous machines (so that each portion of the code is executed on its matching machine type) is likely to achieve speedups. Figure 2 illustrates a possible scenario (the numbers are execution times in terms of basic units).

Heterogeneous computing. Heterogeneity in computing systems is not an entirely new concept. Several types of special-purpose processors have been used to provide specific services for improving system throughput. One of the most common is I/O handling. Attaching floating-point processors to host computers is yet another heterogeneous approach to enhance system performance. In high-performance computers, the concept of heterogeneity manifests itself at the instruction level in the form of several types of functional units, such as vector arithmetic pipelines and fast scalar processors. However, current multiprocessor systems remain

Figure 2. Execution of example code using various systems: the vector, MIMD, SIMD, and special-purpose portions total 100 units on a serial machine; a vector machine cuts the total to 50 units; a matched heterogeneous suite cuts it to 4 units plus communication overhead.

Figure 3. User-directed approach: algorithm design for HC environments, followed by partitioning and mapping, feeding the programming environment.
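The arithmetic behind the Figure 2 scenario can be reproduced with a short sketch. The per-portion serial times and the matched-machine speedup below are assumptions chosen to reproduce the figure's totals (100 units serially, 4 units on a matched heterogeneous suite); the article reports only the totals.

```python
# Four portions of the example code, each with a different type of embedded
# parallelism; serial execution times in basic units (assumed equal split).
portions = {"vector": 25, "MIMD": 25, "SIMD": 25, "special-purpose": 25}

# Assumed speedup when a portion runs on its matching machine type.
MATCHED_SPEEDUP = 25

serial_total = sum(portions.values())
hetero_total = sum(t / MATCHED_SPEEDUP for t in portions.values())

print(serial_total)  # 100 units on a serial machine
print(hetero_total)  # 4.0 units on a matched suite, plus communication overhead
```

In practice the communication time between machines, shown in Figure 2, must be added to the heterogeneous total; if that overhead approaches the time saved, the matched assignment loses its advantage.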
mostly homogeneous as far as the type of parallelism supported by them. Such systems have been traditionally classified according to the number of instruction and data streams.

An HC environment must contain the following components:

- a set of heterogeneous machines,
- an intelligent high-speed network connecting all machines, and
- a (user-friendly) programming environment.

HC lets a given system be adapted to a wide range of applications by augmenting it with specific functional or performance capabilities without requiring a complete redesign. Since HC comprises several autonomous computers, overall system fault tolerance and longevity are likely to improve.

Issues

We consider two approaches to using the HC paradigm. The first one analyzes an application to explore embedded heterogeneous parallelism. Researchers must devise new algorithms or modify existing ones to exploit the heterogeneity present in the application. Based on these algorithms, users develop the code to be executed by the machines. In the second approach, an existing parallel code of the application is taken as input. To run this code in an HC environment, users must profile the types of heterogeneous parallelism embedded in the code. For this purpose, code-type profilers need to be designed. Figures 3 and 4 illustrate these approaches. However, both approaches need strategies for partitioning, mapping, scheduling, and synchronization. New tools and metrics for performance evaluation are also required. Parallel programming environments are needed to orchestrate the effective use of the computing resources.

Algorithm design. Heterogeneous computing opens new opportunities for developing parallel algorithms. In this section, we identify the efforts needed to devise suitable algorithms. The following issues must be considered by the designer:

(1) the types of machines available and their inherent computing characteristics,
(2) alternate solutions to various subproblems of the application, and
(3) the costs of performing the communication over the network.

Computations in HC can be classified into two classes:

Metacomputing. Computations in this class fall into the category of coarse-grained heterogeneity. Instructions belonging to a particular class of parallelism are grouped to form a module; each module is then executed on a suitable parallel machine. Metacomputing refers to heterogeneity at the module level.

Mixed-mode computing. In this fine-grained heterogeneity, almost every alternate parallel instruction belongs to a different class of parallel computation. Programs exhibiting this type of heterogeneity are not suitable for execution on a suite of heterogeneous machines because the communication overhead due to frequent exchange of information between machines can become a bottleneck. However, these programs can be executed efficiently on a single machine such as PASM (Partitionable SIMD/MIMD), which incorporates heterogeneous modes of computation. Mixed-mode computing refers to heterogeneity at the instruction level.

Figure 4. Compiler-directed approach: code analysis identifies vector, MIMD, SIMD, and special-purpose sections, which feed the programming environment.
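The module formation that metacomputing relies on — grouping adjacent instructions of the same parallelism class into a module destined for a matching machine — can be sketched as follows (the instruction stream and its tags are hypothetical):

```python
from itertools import groupby

# A toy instruction stream tagged by parallelism class.
stream = [("SIMD", "convolve-row"), ("SIMD", "threshold"),
          ("MIMD", "match-features"), ("MIMD", "prune-hypotheses"),
          ("vector", "solve-system")]

# Coarse-grained (metacomputing) decomposition: adjacent instructions of the
# same class become one module, to be run on a machine of the matching type.
modules = [(cls, [op for _, op in group])
           for cls, group in groupby(stream, key=lambda tagged: tagged[0])]

print(modules)
```

A mixed-mode stream, by contrast, would change class almost every instruction, producing many single-instruction modules — exactly the case where inter-machine communication overhead dominates and a single mixed-mode machine such as PASM is preferable.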
Mixed-mode machines can achieve large speedups for fine-grained heterogeneity by using the mixed-mode processing available in a single machine. A mixed-mode machine, for example, can use its mode-switching capability to support SIMD/MIMD parallelism and hardware-barrier synchronization, thus improving its performance over a machine operating in SIMD or MIMD mode only.

Code-type profiling. Fast parallel execution of the code in a heterogeneous computing environment requires identifying and profiling the embedded parallelism. Traditional program profiling involves testing a program assumed to consist of several modules by executing it on suitable test data. The profiler monitors the execution of the program and gathers statistics, including the execution time of each program module. This information is then used to modify the modules to improve the overall execution time.

In HC, profiling is done not only to estimate the code's execution time on a particular machine but also to analyze the code's type. This is achieved by code-type profiling. As introduced by Freund, this code-specific function is an off-line procedure; the statistics to be gathered include the types of parallelism of various modules in the code and the estimated execution time of each module on the machines available in the environment. Code types that can be identified include vectorizable, SIMD/MIMD parallel, scalar, and special purpose (such as fast Fourier transform).

Analytical benchmarking. This test measures how well the available machines perform on a given code type. While code-type profiling identifies the type of code, analytical benchmarking ranks the available machines in terms of their efficiency in executing a given code type. Thus, analytical benchmarking techniques permit researchers to determine the relative effectiveness of a given parallel machine on various types of computation.

This benchmarking is also an off-line process and is more rigorous than previous benchmarking techniques, which simply looked at the overall result of running an entire benchmark code on a processor. Some experimental results obtained by analytical benchmarking show that SIMD machines are well suited for operations such as matrix computations and low-level image processing. MIMD machines, on the other hand, are most efficient when an application can be partitioned into a number of tasks that have limited intercommunication. Note that analytical benchmark results are used in partitioning and mapping.

Partitioning and mapping. Problems that occur in these areas of a homogeneous parallel environment have been widely studied. The partitioning problem can be divided into two subproblems. Parallelism detection determines the parallelism present in a given program. Clustering combines several operations into a program module and thus partitions the application into several modules. These two subproblems can be handled by the user, the compiler, or the machine at runtime.

In HC, parallelism detection is not the only objective; code classification based on the type of parallelism is also required. This is accomplished by code-type profiling, which also poses additional constraints on clustering.

Mapping (allocating) program modules to processors has been addressed by many researchers. Informally, in homogeneous environments, the mapping problem can be defined as assigning program modules to processors so that the total execution time (including the communication costs) is minimized. Several other costs, such as the interference cost, have also been considered. In HC, however, other objectives, such as matching the code type to the machine type, result in additional constraints. If such a mapping has to be performed at runtime for load-balancing purposes (or due to machine failure), the mapping problem becomes more complex due to the overhead associated with the code and data-format conversions. Various approaches to optimal and approximate partitioning and mapping in HC have been studied.

Mapping in HC can be performed conceptually at two levels: system (or macro) and machine (or micro). At the system-level mapping, each module is assigned to one or more machines in the system so that the parallelism embedded in the module matches the machine type. Machine-level mapping assigns portions of the module to individual processors in the machine. The most common goal of the mapping process is to accomplish these assignments such that the overall runtime of the task is minimized.

Chen et al. proposed a heuristic mapping methodology based on the Cluster-M model, which facilitates the design of portable software. Only one algorithm is required for a given application, regardless of the underlying architecture. Various types of parallelism present in the application are identified. In addition, all communication and computation requirements of the application are preserved in an intermediate specification of the code. The architecture of each machine in the environment is modeled in the system representation, which captures the interconnections of the architecture. The four components of this approach are

- an intermediate model to provide an architecture-independent algorithm specification of the application,
- languages to support the specification in the intermediate model (such languages should be machine-independent and allow a certain amount of abstraction of the computations),
- a tool that lets users specify topologies of the machines employed in the HC environment, and
- a mapping module to match the problem specification and the system representation.

Figure 5 illustrates this methodology.

Machine selection. An interesting problem appears in the design of HC environments: How can one find the most appropriate suite of heterogeneous machines for a given collection of application tasks, subject to a given constraint such as cost and execution time? Freund has proposed the Optimal Selection Theory (OST) to choose an optimal configuration of machines for executing an application task on a heterogeneous suite of computers, with the assumption that the number of machines available is unlimited. It is also assumed that machines matching the given set of code types are available and that the application code is decomposed into equal-sized modules.

Wang et al.'s Augmented Optimal Selection Theory (AOST) incorporates the performance of code segments on nonoptimal machine choices, assuming that the number of available machines
for each code type is limited. In this approach, the program module most suitable for one type of machine may be assigned to another type of machine. In the formulation of OST and AOST, it has been assumed that the execution of all program modules of a given application code is totally ordered in time. In reality, however, different execution interdependencies can exist among program modules. Also, parallelism can be present inside a module, resulting in further decomposition of program modules. Furthermore, the effect of different mappings on different machines available for a program module has not been considered in the formulation of these selection theories.

The Heterogeneous Optimal Selection Theory (HOST) extends AOST in two ways. It incorporates the effect of various mapping techniques available on different machines for executing a program module. Also, the dependencies between the program modules are specified as a directed graph. Note that OST and AOST assume linear ordering of program modules. In the formulation of HOST, an application code is assumed to consist of subtasks to be executed serially. Each subtask contains a collection of program modules. Each program module is further decomposed into blocks of parallel instructions, called code blocks.

To find an optimal set of machines, we have to assign the program modules to the machines so that

    Σi Ti

is minimal, while

    Σi Ci ≤ Cmax,

where Ti is the time to execute program module i, Ci is the cost of the machine on which program module i is to be executed, and Cmax is an overall constraint on the cost of the machines. The cost Ci and execution time Ti corresponding to the assignment under consideration can be obtained by using code-type profiling and/or by analyzing the algorithms.

Iqbal presented a selection scheme that finds an assignment of program modules to machines in HC so that the total processing time is minimized, while the total cost of machines employed in the solution does not exceed an upper bound. The scheme can also find a solution to the dual of the above problem, that is, finding a least expensive set of machines to solve a given application subject to a maximal execution time constraint. This scheme is applicable to all of the above selection theories. The accuracy of the scheme, however, depends upon the method used to assign the program modules to the machines. Iqbal also shows that for applications in which the program modules communicate in a restrictive manner, one can find exact algorithms for selecting an optimal set of machines. If, however, the program modules communicate in an arbitrary fashion, the selection problem is NP-complete.

Figure 5. Cluster-M-based heuristic mapping methodology: a problem-specification tool and a representation of the heterogeneous architecture feed the mapping module.

Scheduling. In homogeneous environments, a scheduler assigns each program module to a processor to achieve desired performance in terms of processor utilization and throughput. Designers usually employ three scheduling levels. High-level scheduling, also called job scheduling, selects a subset of all submitted jobs competing for the available resources. Intermediate-level scheduling responds to short-term fluctuations in the system load by temporarily suspending and activating processes to achieve smooth system operation. Low-level scheduling determines the next ready process to be assigned to a processor for a certain duration. Different scheduling policies, such as FIFO, round-robin, shortest-job-first, and shortest-remaining-time, can be employed at each level of scheduling.

While all three levels of scheduling can reside in each machine in an HC environment, a fourth level is needed to perform scheduling at the system level. This scheduler maintains a balanced system-wide workload by monitoring the progress of all program modules. In addition, the scheduler needs to know the different module types and available machine types in the environment, since modules may have to be reassigned when the system configuration changes or overload situations occur. Communication bottlenecks and queueing delays incurred due to the heterogeneity of the hardware add constraints on the scheduler.

Synchronization. This process provides mechanisms to control execution sequencing and to supervise interprocess cooperation. It refers to three distinct but related problems:

- synchronization between the sender and receiver of a message,
- specification and control of the shared activities of cooperating processes, and
- serialization of concurrent accesses to shared objects by multiple processes.
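The first of these problems — synchronizing a message's sender and receiver — can be sketched with a blocking queue, a minimal message-passing model (Python threads stand in here purely for illustration; an HC system would synchronize processes on different machines over the network):

```python
import threading
import queue

# A blocking queue models a message channel: the receiver blocks until the
# sender's message arrives, which synchronizes the two parties.
channel = queue.Queue()
received = []

def sender():
    channel.put("partial result")   # send: deposit the message

def receiver():
    received.append(channel.get())  # receive: blocks until a message exists

t_recv = threading.Thread(target=receiver)
t_send = threading.Thread(target=sender)
t_recv.start()      # receiver starts first and blocks on the empty channel
t_send.start()
t_recv.join()
t_send.join()
print(received)     # ['partial result']
```

The same pattern implemented across machines is where the cross-over and hand-shaking overheads discussed in this article arise.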
A variety of synchronization methods have been proposed in the past: semaphores, conditional critical regions, monitors, and pass expressions, among others. In addition, some multiprocessors include hardware synchronization primitives. In general, synchronization can be implemented by using shared variables or by message-passing.

In heterogeneous computing, the synchronization problem resembles that of distributed systems. In both cases, a global clock and shared memory are absent, and (unpredictable) network delays and a variety of operating systems and programming environments complicate the process.

Several techniques used in distributed systems are again useful for solving HC synchronization problems. Two approaches are available: centralized (one machine is designated as a control node) and distributed (decision-making is distributed across the entire system). The correct choice depends on the topology, reliability, speed, and bandwidth of the network, in addition to the types and number of machines in the environment. However, reducing synchronization overhead is important to achieving large speedups in HC. Due to the possibility of several concurrently operating autonomous machines in the environment, application-code performance in HC is more sensitive to synchronization overheads. Frequent hand-shaking for synchronization may expend most of the available network bandwidth.

Interconnection requirements. Current local area networks (LANs) are not suitable for HC because higher bandwidth and lower latency networks are needed. The bandwidth of commercially available LANs is limited to about 10 megabits per second. On the other hand, in HC, assuming machines operating at 40 megahertz and 20 million instructions per second with a 32-bit word length, a bandwidth on the order of 1 gigabit per second is required to match the computation and communication speeds (20 × 10⁶ instructions per second moving 32-bit words generates roughly 640 megabits per second of traffic per machine).

Even if higher bandwidth networks were available, three main sources of inefficiency would persist in current networks. First, application interfaces incur excessive overhead due to context switching and data copying between the user process and the machine's operating system. Second, each machine must incur the overhead of executing the high-level protocols that ensure reliable communication between program modules. Also, the network interface burdens the machine with interrupt handling and header processing for each packet. This suggests incorporating additional network-interface hardware in each machine.

Nectar is an example of a network backplane for heterogeneous multicomputers. It consists of a high-speed fiber-optic network, large crossbar switches, and powerful network-interface processors. Protocol processing is off-loaded to these interface processors. A networking standard called Hippi (ANSI X3T9.3 High-Performance Parallel Interface) is being implemented for realizing heterogeneous computing environments at various research sites. Hippi is an open standard that defines the physical and logical link layers of a 100-Mbyte/second network.

In HC, hardware modules from various vendors share physical interconnections. Differing communication protocols may make network-management problems complex. The following general approaches for dealing with network heterogeneity have been discussed in the literature:

(1) treat the heterogeneous network as a partitioned network, with each partition employing a uniform set of protocols;
(2) have a single "visible" network management console; and
(3) integrate the heterogeneous management functions at a single management console.

The IEEE Computer Society Technical Committee on Parallel Processing, the Technical Committee on Mass Storage, and several research sites are working together to define interface standards.

Some academic sites

A number of academic sites are developing HC environments and applications (this list is not exhaustive).

Systems and architectures
- Distributed High-speed Computing (DHSC) project at Pittsburgh Supercomputing Center, University of Pittsburgh
- Image-Understanding Architecture, University of Massachusetts at Amherst
- Mentat, University of Virginia
- Nectar-Based Heterogeneous System, Carnegie Mellon University
- Northeast Parallel Architecture Center (NPAC), Syracuse University
- Partitionable SIMD/MIMD (PASM), Purdue University

Institutes and departments
- Beckman Institute, University of Illinois at Urbana-Champaign
- Department of Biological Sciences, University of California at Los Angeles
- Department of Computer Science, Kent State University
- Department of Computer Science, University of California at San Diego
- Department of Computer and Information Sciences, New Jersey Institute of Technology
- Department of Electrical Engineering-Systems, University of Southern California
- Department of Math and Computer Science, Emory University
- Minnesota Supercomputer Center (MSC), University of Minnesota at Minneapolis
- Supercomputer Computations Institute (SCI), Florida State University

Programming environments. A parallel programming environment includes
parallel languages, intelligent compilers, parallel debuggers, syntax-directed editors, configuration-management tools, and other programming aids.

In homogeneous computing, intelligent compilers detect parallelism in sequential code and translate it into parallel machine code. Parallel programming languages have been developed to support parallel programming, such as MPL for MasPar machines, and Lisp and C for the Connection Machine. In addition, several parallel programming environments and models have been designed, such as Code, Faust, Schedule, and Linda.

HC requires machine-independent and portable parallel programming languages and tools. This requirement creates the need for designing cross-parallel compilers for all machines in the environment, and parallel debuggers for debugging cross-machine code. Several programming models and environments have been developed in the past for heterogeneous computing.

The Parallel Virtual Machine (PVM) system,16 which evolved over the past three years, consists of software that provides a virtual concurrent computing environment on general-purpose networks of heterogeneous machines. It is composed of a set of user-interface primitives and supporting software that enable concurrent computing on a loosely coupled network of high-performance machines. It can be implemented on a hardware base consisting of different architectures, including single-CPU systems, vector machines, and multiprocessors (see Figure 6).

Application programs view the PVM system as a general and flexible parallel computing resource that supports shared-memory, message-passing, and hybrid models of computation. A heterogeneous application can be decomposed into several subtasks based on the embedded types of computation and then executed by using PVM subroutines on different matching machines available on the network. The PVM primitives are provided in the form of libraries linked to application programs written in imperative languages. They support process initiation and management, message-passing, synchronization, and other housekeeping facilities.

Support software provided by the PVM system executes on a set of user-specified computing elements on a network, presenting a virtual concurrent computing environment to users.

Figure 6. An overview of the Parallel Virtual Machine system.

Performance evaluation. Performance tools are used to summarize the runtime behavior of an application, including analyzing resource use and the cause of any performance bottleneck. Depending on its design, a performance tool can describe program behaviors at many levels of detail. The two most common are the intraprocess and interprocess levels. Intraprocess performance tools, such as the gprof facility on BSD Unix, the HP sampler/3000, and the Mesa Spy, provide information about individual processes.

Performance tools for distributed computing systems concentrate on the interactions between the processes. Integrated performance models that observe the status and the performance events at all levels can be found in the PIE (Programming and Instrumentation Environment) project.17

Designing performance-evaluation tools for distributed computing systems involves collecting, interpreting, and evaluating performance information from application programs, the operating system, the communication network, and other hardware modules employed in the environment. The inherent concurrency in a distributed computing environment, the lack of total ordering of events on different machines, and the nondeterministic nature of the communication delays between the processes make the problem of evaluating performance more complex.

The impact of the code type must be considered. Thus, performance metrics such as processor utilization, speedup, and efficiency are difficult to compute. Indeed, these metrics must be carefully defined to make a reasonable performance evaluation.

Image understanding

Intrinsic parallelism in image processing and the variety of heuristics available for problems in image understanding make computer vision an ideal vehicle for studying heterogeneous computing. From a computational perspective, vision processing is usually organized as follows:

Early processing of the raw image (often called low-level processing). At this level, the input is an image. The output image is approximately the same
Convolutions are performed on each pixel in parallel. The data communication among the pixels is local to each pixel.

• Interfacing between low-level and image-understanding problems (often termed intermediate-level processing). The operations performed on each data item can be nonlocal. The communication is also irregular as compared with that of low-level processing.

• Image understanding. By this we mean using the acquired data from the above processing (for example, geometric features such as shape, orientation, and moments) to infer semantic attributes of an image. Processing at this level can be classified as knowledge and/or symbolic processing. Search-based techniques are widely used at this level.

As evident in the preliminary results from the 1988 DARPA Image-Understanding Benchmark,18 each level in computer vision exhibits a different type of parallelism. Therefore, at each level a suitable type of parallel machine must be employed. Corresponding to each of the above classes of problems, a suitable class of architecture was proposed:3

• SIMD machines. Machines in this class are well suited for computations in low-level and in some intermediate-level computer vision problems because of the regular dataflow and iconic operations in these two levels. For example, two-dimensional cellular arrays and mesh-connected computers have been proposed for a large class of geometric and graph-based problems in image processing. Parallel machines such as the MasPar MP-series and the Connection Machine CM-2000 fall in this category. Pipelined parallel machines (like the Carnegie Mellon University Warp machine) are also well suited for low- and intermediate-level vision computations.

• Medium-grained MIMD machines. Various intermediate- and high-level vision tasks are computationally intensive with irregular dataflow. Moreover, the size of the input is smaller than the input image size. Parallel systems having a set of powerful processors are suitable for performing computations in intermediate- and high-level vision tasks. The Connection Machine CM-5, Vista,12 Alliant FX-80, and Sequent Symmetry 81 are some examples.

• Coarse-grained MIMD machines. High-level vision tasks such as image understanding/recognition and symbolic processing employ complex data structures. Many of the proposed algorithms for such problems are nondeterministic, and architectural requirements for these problems demand coarse-grained MIMD machines. Parallel machines such as the Aspex ASP and Vista13 are well suited for this class of problems.

Another approach is to build machines having multiple computational capabilities embedded in a single system. These architectures consist of several levels. Typically, the lower levels operate in SIMD mode and the higher levels operate in MIMD mode. In the Image-Understanding Architecture,19 the lowest level has bit-serial processors, and the intermediate level consists of digital signal processors. The highest level consists of general-purpose microprocessors operating in MIMD mode.

An example vision task. We present an example vision task and identify the different types of parallelism. We have chosen the DARPA Integrated Image-Understanding Benchmark4 as an example task. The overall task performed by this benchmark is the recognition of an approximately specified two-and-a-half-dimensional "mobile" sculpture in a cluttered environment, given images from intensity and range sensors.

Steps in the benchmark can be identified by the vision-task classifications. First, low-level operations such as connected component labeling and corner extraction are performed. Then, grouping the corners (an intermediate-level vision operation) results in the extraction of candidate rectangles. Finally, partial matching of the candidate rectangles is followed by confirmed matching (a high-level vision task). The results obtained on several different parallel machines were reported at the 1988 Image-Understanding Workshop. Details of the benchmark results can be found in Weems et al.18

As they describe, directly interpreting these results would be unfair, since there were many undefined factors in the benchmark description. However, the benchmark does give pointers to how different machines can be classified with respect to their suitability for performing operations at different levels of vision. Overall, the simulation results show that the (heterogeneous) Image-Understanding Architecture performs better than any single machine considered. These results support the suitability of a heterogeneous environment for computer vision applications.

Heterogeneous computing offers new challenges and opportunities to several research communities. To support this paradigm, the following areas of research must be investigated:

• Designing tools to identify heterogeneous parallelism embedded in applications.
• Studying issues in high-speed networking, including available technologies and specialized hardware for networking.
• Designing communication protocols to reduce the cross-over overheads that occur when different machines communicate in the same environment.
• Developing standards for parallel interfaces between various machines.
• Designing efficient partitioning and mapping strategies to exploit heterogeneous parallelism embedded in applications.
• Designing user interfaces and user-friendly programming environments to program diverse machines in the same environment.
• Developing algorithms for applications with heterogeneous computing requirements.

Indeed, HC provides an opportunity to bring together research from various disciplines of computer science and engineering to develop a feasible approach for applications in the Grand Challenges problem set.

Acknowledgments

We thank Richard Freund and Ashraf Iqbal for many helpful discussions. This research was partly supported by the National Science Foundation under Grant No. IRI-9145810.

References

1. R. Freund and D. Conwell, "Superconcurrency: A Form of Distributed Heterogeneous Supercomputing," Supercomputing Review, Oct. 1990, pp. 47-50.

2. Newsletter of the IEEE Computer Society Technical Committee on Parallel Processing (TCPP), Vol. 1, No. 1, Oct. 1992.
3. V.K. Prasanna Kumar, Parallel Algorithms and Architectures for Image Understanding, Academic Press, Boston, 1991.

4. C. Weems et al., "An Integrated Image-Understanding Benchmark: Recognition of a 2-1/2D Mobile," Proc. DARPA Image-Understanding Workshop, Morgan Kaufmann Publishers, San Mateo, Calif., 1988, pp. 111-126.

5. T. Berg and H.J. Siegel, "Instruction Execution Trade-offs for SIMD vs. MIMD vs. Mixed-Mode Parallelism," Proc. Int'l Parallel Processing Symp. (IPPS), IEEE CS Press, Los Alamitos, Calif., Order No. 2167, 1991, pp. 301-308.

6. A. Khokhar et al., "Heterogeneous Supercomputing: Problems and Issues," Proc. Workshop on Heterogeneous Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 2702, 1992, pp. 3-12.

7. R. Freund, "Optimal Selection Theory for Superconcurrency," Proc. Supercomputing 89, IEEE CS Press, Los Alamitos, Calif., Order No. M2021 (microfiche), 1989, pp. 13-17.

8. G. Agha and R. Panwar, "An Actor-Based Framework for Heterogeneous Computing Systems," Proc. Workshop on Heterogeneous Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 2702, 1992, pp. 35-42.

9. S. Chen et al., "A Selection Theory and Methodology for Heterogeneous Supercomputing," Proc. Workshop on Heterogeneous Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 3532-02, 1993.

10. M. Wang et al., "Augmenting the Optimal Selection Theory for Superconcurrency," Proc. Workshop on Heterogeneous Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 2702, 1992, pp. 13-22.

11. M. Iqbal, "Partitioning Problems for Heterogeneous Computer Systems," tech. report, Dept. of Electrical Engineering-Systems, Univ. of Southern California, Los Angeles, 1993.

12. E. Arnould et al., "The Design of Nectar: A Network Backplane for Heterogeneous Multicomputers," Proc. Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS III), IEEE CS Press, Los Alamitos, Calif., Order No. M1936 (microfiche), 1989, pp. 205-216.

13. ANSI X3T9.3, "High-Performance Parallel Interface: Hippi-PH, Hippi-SC, Hippi-FP, Hippi-LE, and Hippi-MI," Working Draft Proposed American National Standard for Information Systems, American Nat'l Standards Inst., New York, Jan.-Apr. 1991.

14. C. de Castro and S. Yalamanchili, "Partitioning Signal Flow Graphs for Execution on Heterogeneous Signal Processing Architectures," Proc. Workshop on Heterogeneous Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 2702, 1992, pp. 81-86.

15. J. Potter, "Heterogeneous Associative Computing," Proc. Workshop on Heterogeneous Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 3532-02, 1993.

16. V. Sunderam, "PVM: A Framework for Parallel Distributed Computing," Concurrency: Practice and Experience, Vol. 2, No. 4, Dec. 1990, pp. 315-339.

17. Z. Segall and L. Rudolph, "PIE: A Programming and Instrumentation Environment for Parallel Processing," IEEE Software, Vol. 2, No. 6, Nov. 1985, pp. 22-27.

18. C. Weems et al., "Preliminary Results from the DARPA Integrated Image-Understanding Benchmark," Parallel Architectures and Algorithms for Image Understanding, V.K. Prasanna, ed., Academic Press, Boston, 1991, pp. 399-499.

19. D. Shu, J. Nash, and C. Weems, "A Multiple-Level Heterogeneous Architecture for Image Understanding," Proc. Int'l Conf. Pattern Recognition, IEEE CS Press, Los Alamitos, Calif., Vol. 2, Order No. 2063, 1990.

Ashfaq A. Khokhar is a PhD candidate in the Department of Electrical Engineering-Systems at the University of Southern California, Los Angeles. His areas of research include parallel architectures and scalable algorithms, image understanding and parallel processing, VLSI computations, interconnection networks, and heterogeneous computing. Khokhar received the BSc degree in electrical engineering from the University of Engineering and Technology, Lahore, Pakistan, in 1985 and the MS degree in computer engineering from Syracuse University in 1988. He is a student member of the Computer Society.

Viktor K. Prasanna (V.K. Prasanna Kumar) is an associate professor in the Department of Electrical Engineering-Systems, University of Southern California, Los Angeles. His research interests include parallel computation, computer architecture, VLSI computations, and computational aspects of image processing, vision, robotics, and neural networks. Prasanna received the BS degree in electronics engineering from Bangalore University, the MS degree from the School of Automation, Indian Institute of Science, and the PhD in computer science from Pennsylvania State University in 1983. He serves as the symposium chair of the 1994 IEEE International Parallel Processing Symposium and is a subject area editor of the Journal of Parallel and Distributed Computing, IEEE Transactions on Computers, and IEEE Transactions on Signal Processing. He is the founding chair of the IEEE Computer Society Technical Committee on Parallel Processing and is a senior member of the Computer Society.

Muhammad E. Shaaban is a PhD candidate in the Department of Electrical Engineering-Systems, University of Southern California. His areas of research include parallel optical interconnection networks, parallel algorithms for image processing, and heterogeneous computing. Shaaban received the BS and MS degrees in electrical engineering from the University of Petroleum and Minerals, Dhahran, Saudi Arabia, in 1984 and 1986, respectively. He recently served as a session chair at the International Parallel Processing Symposium. He is a student member of the Computer Society.

Cho-Li Wang is a PhD candidate in the Department of Electrical Engineering-Systems, University of Southern California, Los Angeles. His areas of research include computer architectures and algorithms, image understanding and parallel processing, image compression, and heterogeneous computing. Wang received the BS degree in computer science and information engineering from National Taiwan University, Taiwan, in 1985 and the MS degree in computer engineering from the University of Southern California in 1990.

Readers can contact Viktor K. Prasanna at the School of Engineering, Department of Electrical Engineering-Systems, University of Southern California, University Park, Los Angeles, CA 90089-2562.

COMPUTER, June 1993
