M. S. Ramaiah School of Advanced Studies
CSN2502 ACA Presentation
Anshuman Biswal
PT 2012 Batch, Reg. No.: CJB0412001
M. Sc. (Engg.) in Computer Science and Networking
Module Leader: Padma Priya Dharishini P.
Module Name: Advanced Computer Architecture
Module Code: CSN2502
Array Processor

Presentation Outline
• History of Array Processor
• Array Processor
• How Array Processor can help?
• Array Processor classification
• Array Processor architecture
• Performance and scalability of array processor
• Why use the array processor?
• When to use and when not to use the array processor?

History of Array Processor
• Array processor development began in the early 1960s at Westinghouse, in their Solomon project.
• Solomon's goal was to improve math performance by using a large number of math co-processors under the control of a single CPU.
• The CPU fed a single common instruction to all of the arithmetic logic units (ALUs), one per "cycle", but with a different data point for each one to work on.
• This allowed the Solomon machine to apply a single algorithm to a large data set, fed in the form of an array.
• In 1962, Westinghouse cancelled the project, but the effort was restarted at the University of Illinois as the ILLIAC IV.
• The ILLIAC IV was finally delivered in 1972, and until the 1990s its basic design underlay the fastest machines.

Array Processor
• An array processor is a synchronous parallel computer with multiple ALUs, called processing elements (PEs), that operate in parallel in lock-step fashion.
• It is composed of N identical PEs under the control of a single control unit, together with a number of memory modules.
• Array processors also frequently use a form of parallel computation called pipelining, where an operation is divided into smaller steps and the steps are performed simultaneously.
• It can greatly improve performance on certain workloads, mainly in numerical simulation.
• These machines appeared in the 1970s and dominated supercomputer design from the 1970s into the 1990s, notably on the various Cray platforms.
• The rapid fall in the price-to-performance ratio of conventional microprocessor designs led to the vector supercomputer's demise in the later 1990s.

How Array Processor can help?
• In general terms, CPUs are able to manipulate one or two pieces of data at a time. For instance, most CPUs have an instruction that essentially says "add A to B and put the result in C". The data for A, B and C could be, in theory at least, encoded directly into the instruction. However, in an efficient implementation things are rarely that simple. The data is rarely sent in raw form, and is instead "pointed to" by passing in an address to a memory location that holds the data. Decoding this address and getting the data out of the memory takes some time.
• In order to reduce the amount of time this takes, most modern CPUs use a technique known as instruction pipelining, in which the instructions pass through several sub-units in turn.
• Array processors take this concept one step further. Instead of pipelining just the instructions, they also pipeline the data itself. This allows for significant savings in decoding time.

How Array Processor can help?: An Example
• Consider the simple task of adding two groups of 10 numbers together. In a normal programming language you might do something like:
– execute this loop 10 times
– read the next instruction and decode it
– fetch this number
– fetch that number
– add them
– put the result here
– end loop
• To an array processor, the same task looks like:
– read instruction and decode it
– fetch these 10 numbers
– fetch those 10 numbers
– add them
– put the results here

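The two listings above can be sketched in Python. This is only an illustrative model: the function names and the `decodes` counter are hypothetical, and a Python list comprehension stands in for what real hardware would do with one vector instruction.

```python
# Illustrative model only: counts how often an "instruction" is read and
# decoded in each style. Function names and counters are hypothetical.

def scalar_add(a, b):
    """Add element by element, decoding one instruction per element."""
    result = []
    decodes = 0
    for x, y in zip(a, b):
        decodes += 1          # read the next instruction and decode it
        result.append(x + y)  # fetch both numbers, add them, store the result
    return result, decodes


def array_add(a, b):
    """Add whole arrays under a single decoded instruction."""
    decodes = 1                             # read and decode once
    result = [x + y for x, y in zip(a, b)]  # all element pairs, one instruction
    return result, decodes


a = list(range(10))       # first group of 10 numbers
b = list(range(10, 20))   # second group of 10 numbers
scalar_result, scalar_decodes = scalar_add(a, b)
array_result, array_decodes = array_add(a, b)
```

Both functions return the same sums; only the number of modelled decode steps differs (ten for the loop, one for the array form).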
How Array Processor can help?
• There are several savings inherent in this approach (based on the example in the previous slide):
A. Only two address translations are needed.
B. Fetching and decoding the instruction is done only once instead of ten times.
C. The code itself is also smaller, which can lead to more efficient memory use.
D. It improves performance by avoiding stalls.

Array Processor Classification
• SIMD (Single Instruction Multiple Data): an array processor with a single-instruction, multiple-data organization. It executes vector instructions by means of multiple functional units responding to a common instruction. Examples: ILLIAC-IV, CM-2 (Connection Machine), MP-1 (MasPar-1), BSP (Burroughs Scientific Processor).
• Attached array processor: an auxiliary processor attached to a general-purpose computer. Its intent is to improve the performance of the host computer in specific numeric calculation tasks.

Array Processor Architecture - SIMD
• SIMD has two basic configurations:
– a. Array processors using RAM, also known as dedicated memory organization
• ILLIAC-IV, CM-2, MP-1
– b. Associative processors using content-addressable memory, also known as global memory organization
• BSP

SIMD Architecture - Array Processor using RAM
• Here we have a control unit (CU) and multiple synchronized PEs.
• The control unit controls all the PEs below it.
• The control unit decodes all the instructions given to it and decides where each decoded instruction should be executed.
• The vector instructions are broadcast to all the PEs. This broadcasting achieves spatial parallelism through duplicate PEs.
• The scalar instructions are executed directly inside the CU.

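The split between scalar and vector instructions can be illustrated with a toy model. Everything here is an assumption made for the sketch: the `(kind, op, operand)` instruction format, the class names and the two supported operations do not come from any real machine.

```python
# Hypothetical toy model of the control unit (CU) described above.
# The instruction format and all names are assumptions for illustration.

class ProcessingElement:
    def __init__(self, value):
        self.acc = value  # accumulator holding this PE's local data

    def execute(self, op, operand):
        if op == "add":
            self.acc += operand
        elif op == "mul":
            self.acc *= operand
        else:
            raise ValueError(f"unknown vector op: {op}")


class ControlUnit:
    def __init__(self, pes):
        self.pes = pes
        self.scalar_reg = 0  # a single scalar register inside the CU

    def run(self, program):
        for kind, op, operand in program:
            if kind == "scalar":
                # Scalar instructions execute inside the CU itself.
                if op == "set":
                    self.scalar_reg = operand
                elif op == "add":
                    self.scalar_reg += operand
                else:
                    raise ValueError(f"unknown scalar op: {op}")
            else:
                # Vector instructions are broadcast: every PE gets the same op.
                for pe in self.pes:
                    pe.execute(op, operand)


pes = [ProcessingElement(v) for v in (1, 2, 3, 4)]
cu = ControlUnit(pes)
cu.run([
    ("scalar", "set", 5),
    ("vector", "add", 10),
    ("scalar", "add", 1),
    ("vector", "mul", 2),
])
pe_values = [pe.acc for pe in pes]
```

The Python loop over `self.pes` is sequential; in the hardware described above the PEs would all perform the broadcast operation in the same cycle.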
SIMD Architecture - Array Processor using RAM: Control Unit
• A simple CPU
• Can execute instructions without PE intervention
• Coordinates all PEs
• 64 64-bit registers, D0-D63
• 4 64-bit accumulators, A0-A3
• Ops:
– Integer ops
– Shifts
– Boolean
– Loop control
– Index memory
[Figure: control unit (CU) with registers D0-D63, accumulators A0-A3 and an ALU]

SIMD Architecture - Array Processor using RAM: Processing Element
• A PE consists of an ALU with working registers and a local memory PMEMi, which is used to store distributed data.
• All PEs perform the same function synchronously, in lock-step fashion, under the supervision of the CU.
• Before execution in a PE, the vector operands should be loaded into its PMEM.
• Data can be loaded into the PMEM from an external source or by the CU.
• When executing an instruction, not all PEs have to work; only the enabled PEs do. Several masking schemes can be used to enable and disable a PE during the execution of an instruction.
[Figure: PEi with registers A, B, R, S, X and D, an ALU and local memory PMEMi, connected to PEi-1, PEi+1, PEi-8 and PEi+8]
• A PE consists of the following:
– 64-bit registers:
– A: accumulator
– B: second operand for binary operations
– R: routing, for inter-PE communication
– S: status register
– X: index for PMEM, 16 bits
– D: mode, 8 bits
• Communication:
– PMEM is accessed only from the local PE
– Among PEs via R

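The neighbor links shown for PEi (PEi-1, PEi+1, PEi-8, PEi+8) can be written as a small index calculation. This sketch assumes 64 PEs arranged as an 8 x 8 array with wrap-around at the ends, as in ILLIAC-IV; the function name and parameters are hypothetical.

```python
NUM_PES = 64  # assumption: an 8 x 8 array of PEs, as in ILLIAC-IV


def neighbors(i, n=NUM_PES, row=8):
    """Return the PE indices that PE i can reach in one routing step.

    Indices wrap around modulo n, so PE 0 is linked to PE n-1.
    """
    return [(i - 1) % n, (i + 1) % n, (i - row) % n, (i + row) % n]


first = neighbors(0)
middle = neighbors(10)
last = neighbors(63)
```

With these assumptions every PE has exactly four direct neighbors, and data bound for any other PE has to be passed along in several routing steps.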
SIMD Architecture - Array Processor using RAM: Interconnection Network and Host Computer
• IN: All communication between PEs is done through the interconnection network (IN). It performs all the routing and manipulation functions. The interconnection network is under the control of the CU.
• Host computer: The array processor is interfaced to the host computer through the control unit. The host computer handles resource management and supervises peripherals and I/O.

SIMD Architecture - Masking and data routing organization
[Figure: components of PEi for i = 0, 1, 2, ..., N-1: registers Ai, Bi, Di, Ii, Ri, Si and Xi, an ALU and memory PEMi, with links to other PEs via the interconnection network and to the CU]

SIMD Architecture - Masking and data routing organization
• One PE is connected to another PE via its routing register R.
• When one PE communicates with another PE, it is the contents of the R register that are transferred.
• All inputs and outputs go through this register; the inputs and outputs are isolated by master-slave flip-flops.
• The D register is the address register; it stores the 8-bit address of the PE.
• During an instruction cycle, only the enabled PEs accept the operands sent to them, while the other PEs discard them. For an enabled PE the status register S = 1, and for a masked PE S = 0.
• A = accumulator, B = second operand of binary operations.

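A minimal sketch of masking and routing, assuming a ring of four PEs. The `PE` class, the `masked_add` and `route_right` helpers and the choice to route for every PE regardless of its mask bit are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PE:
    a: int      # accumulator A
    b: int      # second operand B
    r: int = 0  # routing register R
    s: int = 1  # status bit S: 1 = enabled, 0 = masked


def masked_add(pes):
    """Broadcast 'A <- A + B'; only PEs with S = 1 carry it out."""
    for pe in pes:
        if pe.s == 1:
            pe.a += pe.b


def route_right(pes):
    """Every PE sends its R register to PE i+1, wrapping at the end."""
    outgoing = [pe.r for pe in pes]          # latch all outputs first
    for i, pe in enumerate(pes):
        pe.r = outgoing[(i - 1) % len(pes)]  # receive from the left neighbor


pes = [PE(a=i, b=10, r=i) for i in range(4)]
pes[1].s = 0  # mask PE 1
pes[3].s = 0  # mask PE 3
masked_add(pes)
route_right(pes)
accumulators = [pe.a for pe in pes]
routed = [pe.r for pe in pes]
```

Copying all R values into `outgoing` before any PE is overwritten plays the role of the master-slave isolation mentioned above: each PE receives the value its neighbor held before the transfer started.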
SIMD Architecture - Associative processor using content-addressable memory
• In this configuration the PEs do not have private memory. The memories attached to the PEs are replaced by parallel memory modules shared by all PEs via an alignment network.
• The alignment network does path switching between the PEs and the parallel memory.
• PE-to-PE communication is also via the alignment network.
• The alignment network is controlled by the CU.
• The number of PEs (N) and the number of memory modules (K) may not be equal; in fact they are chosen to be relatively prime.
• An alignment network should allow conflict-free access to the shared memories by as many PEs as possible.
[Figure: host computer, CU, PEs and shared memory modules connected through the alignment network]

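Why the module count matters can be seen in a small experiment. This sketch assumes that word address `a` lives in module `a mod K` (low-order interleaving) and that PE i accesses address `base + i * stride`; the sizes of 16 PEs with 17 or 16 modules are chosen only for illustration, and the function names are hypothetical.

```python
def modules_hit(stride, num_pes, num_modules, base=0):
    """Memory module addressed by each PE for a strided vector access.

    Assumes address a is stored in module a % num_modules.
    """
    return [(base + i * stride) % num_modules for i in range(num_pes)]


def is_conflict_free(stride, num_pes, num_modules):
    """True when no two PEs address the same memory module."""
    hits = modules_hit(stride, num_pes, num_modules)
    return len(set(hits)) == len(hits)


# Assumed sizes: 16 PEs sharing 17 modules, versus 16 PEs sharing 16 modules.
ok_with_17 = [s for s in range(1, 17) if is_conflict_free(s, 16, 17)]
ok_with_16 = [s for s in range(1, 17) if is_conflict_free(s, 16, 16)]
```

Under these assumptions every stride from 1 to 16 is conflict-free with 17 modules, while with 16 modules only the odd strides are. An even stride, such as reading every second word of an array, makes several PEs collide on the same module.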
Attached Array Processor
• In this configuration the attached array processor has an input/output interface to a common processor and another interface with a local memory.
• The local memory connects to the main memory through a high-speed memory bus.

Performance and Scalability of Array Processor

Why use the array processor?
• The principal reason for using the array processor is speed.
• The design of most array processors optimizes performance for repetitive arithmetic operations, making them much faster at vector arithmetic than the host CPU. Since most array processors operate asynchronously from the host CPU, they constitute a co-processor which increases the capacity of the system.
• The second advantage is that the AP has its own local memory. On systems with limited physical memory, or address space, this can be an important consideration.

When to use and not to use the array processor?
• The AP (array processor) is most efficient at repetitive operations such as FFTs and multiplying large vectors. Its efficiency degrades for non-repetitive operations, or operations requiring a great number of decisions based on the results of computations.
• Since APs have their own program and data memory, the AP instructions and data must be transferred to the AP, and the results transferred back from it. These I/O operations may cost more CPU time than the amount saved by using the array processor.
• As a general rule, using the AP is more efficient than using the CPU when multiple or complex operations (such as FFTs), which are highly repetitious, are going to be done on a relatively large amount of data (thousands of words or more). In other cases, using the AP will not help much and will keep other processes from using a valuable resource.

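The rule of thumb above can be turned into a rough break-even estimate. This is a sketch under stated assumptions: the cost model (a fixed setup cost, a per-word transfer cost in each direction, and a per-operation cost on each processor) and every number in the example are hypothetical, not measurements of any real system.

```python
def ap_pays_off(n_words, ops_per_word, cpu_time_per_op, ap_time_per_op,
                transfer_time_per_word, setup_time):
    """Compare a CPU-only run with an AP run that includes transfer cost.

    All times are in the same arbitrary unit. Returns a tuple
    (ap_is_faster, cpu_total, ap_total).
    """
    cpu_total = n_words * ops_per_word * cpu_time_per_op
    ap_total = (setup_time
                + 2 * n_words * transfer_time_per_word  # data in, results out
                + n_words * ops_per_word * ap_time_per_op)
    return ap_total < cpu_total, cpu_total, ap_total


# Hypothetical costs: the AP is 10x faster per operation, transfers are slow.
costs = dict(cpu_time_per_op=10, ap_time_per_op=1,
             transfer_time_per_word=20, setup_time=1000)

small_simple = ap_pays_off(n_words=100, ops_per_word=1, **costs)
large_simple = ap_pays_off(n_words=10_000, ops_per_word=1, **costs)
large_complex = ap_pays_off(n_words=10_000, ops_per_word=50, **costs)
```

With these made-up costs the AP loses on the small job and also on the large job with only one operation per word, because the transfers dominate. It wins only when many operations are applied to each transferred word, which matches the "large data and repetitive, complex operations" rule.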
Conclusion
• Although an array processor can improve performance, not all problems can be attacked with this sort of solution. Instructions that process an array of data at a time necessarily add complexity to the core CPU. That complexity typically makes other instructions run slower. The more complex instructions also add to the complexity of the decoders, which might slow down the decoding of more common instructions such as a normal add.
• So array processors work best only when there are large amounts of data to be worked on. For this reason, these sorts of CPUs were found primarily in supercomputers, as the supercomputers themselves were, in general, found in places such as weather prediction centres and physics labs, where huge amounts of data are "crunched".
• This architecture relies on a single instruction acting on all the data sets at once. However, if these data sets depend on each other, parallel processing cannot be applied. For example, if data A has to be processed before data B, then A and B cannot be processed simultaneously. This dependency is what makes parallel processing difficult to implement, and it is why sequential machines are extremely common.

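The dependency problem can be seen by comparing two small loops. The function names are hypothetical, and the running sum is used only in its direct form, where each step reads the result of the previous one.

```python
def elementwise_add(a, b):
    """c[i] depends only on a[i] and b[i], so every i could run on its own PE."""
    return [x + y for x, y in zip(a, b)]


def running_sum(a):
    """out[i] needs out[i-1]: written this way, step i must wait for step i-1."""
    out = []
    total = 0
    for x in a:
        total += x         # reads the total produced by the previous step
        out.append(total)
    return out


independent = elementwise_add([1, 2, 3], [10, 20, 30])
dependent = running_sum([1, 2, 3, 4])
```

In the first function the elements can be processed in any order or all at once. In the second, the loop as written forces one element after another, which is the "A before B" situation described above.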