Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Like this presentation? Why not share!

5,899 views

Published on

No Downloads

Total views

5,899

On SlideShare

0

From Embeds

0

Number of Embeds

3

Shares

0

Downloads

278

Comments

0

Likes

5

No embeds

No notes for slide

- 1. MODULE - 5 INTRODUCTION TO PARALLEL PROCESSING
- 2. Contents:- <ul><li>Parallel Processing </li></ul><ul><li>Architectural Classification </li></ul><ul><li>Pipeline Computers </li></ul><ul><li>Arithmetic Pipeline </li></ul><ul><li>Instruction Pipeline </li></ul><ul><li>Array Processors </li></ul><ul><li>Vector Processing </li></ul><ul><li>Multiprocessors </li></ul><ul><li>Comparison of RISC & CISC. </li></ul>
- 3. 5.1. Parallel Processing <ul><li>Parallel processing is an efficient form of information processing which emphasizes the exploitation of concurrent events in the computing process. </li></ul><ul><li>Concurrency implies : </li></ul><ul><li>1 . Parallelism </li></ul><ul><li>2. Simultaneity </li></ul><ul><li>3. Pipelining . </li></ul><ul><ul><li>Parallel events may occur in multiple resources during the same time interval; </li></ul></ul><ul><ul><li>simultaneous events may occur at the same time instant; and </li></ul></ul><ul><ul><li>pipelined events may occur in overlapped time spans. </li></ul></ul><ul><li>Parallel processing demands concurrent execution of many programs in the computer. </li></ul>
- 4. <ul><li>The purpose of parallel processing is to speed up the computer processing capability and increase its throughput, </li></ul><ul><li>Through put means the amount of processing that can be accomplished during a given interval of time. </li></ul><ul><li>The amount of hardware increases with parallel processing, and so the cost of the system increases. </li></ul>Adv and Disadv
- 5. Processor registers Adder-Subtractor Integer multiply Logic unit Shift Unit Incrementer Floating-point multiply Floating-point add-subtract Floating-point divide To memory Processor with multiple functional units operating in parallel
- 6. <ul><li>The operands in the registers are applied to one of the units depending on the operation specified by the instruction associated with the operands. </li></ul><ul><li>The operation performed in each functional unit is indicated in each block of the diagram. </li></ul><ul><li>The adder and integer multiplier perform the arithmetic operations with integer numbers. </li></ul><ul><li>The floating point operations are separated into 3 circuits operating in parallel. </li></ul><ul><li>The logic, shift, and increment operations can be performed concurrently on different data. </li></ul><ul><li>All units are independent of each other, so one no: can be shifted while another number is being incremented. </li></ul>
- 7. 5.2 Architectural classification schemes <ul><li>Three computer architectural classification schemes: </li></ul><ul><ul><li>Flynn’s classification is based on the multiplicity of instruction stream and data stream in a computer system. </li></ul></ul><ul><ul><li>Feng’s scheme is based on serial versus parallel processing. </li></ul></ul><ul><ul><li>Handler’s classification is determined by the degree of parallelism and pipelining in various levels. </li></ul></ul>
- 8. Flynn’s Classification
- 9. <ul><li>The four categories: </li></ul><ul><li>Single instruction stream-single data stream (SISD) </li></ul><ul><li>Single instruction stream-multiple data stream (SIMD) </li></ul><ul><li>Multiple instruction stream-single data stream (MISD) </li></ul><ul><li>Multiple instruction stream-multiple data stream (MIMD) </li></ul>
- 10. 5.2.1. SISD computer organization <ul><li>This represents the organization of a single computer containing a CU, a Processor Unit, and a MU. </li></ul><ul><li>Instructions are executed sequentially and the system may or may not have internal parallel processing capabilities. </li></ul><ul><li>Parallel processing may be achieved by means of multiple functional units or by pipeline processing. </li></ul>CU PU MM IS IS DS
- 11. 5.2.2. SIMD computer organization <ul><li>This class corresponds to array processors. </li></ul><ul><li>There are multiple processing elements supervised by the same control unit. </li></ul><ul><li>All PEs receive the same instruction broadcast from the CU but operate on different data sets from distinct data streams. </li></ul><ul><li>The shared memory subsystem may contain multiple modules. </li></ul>
- 12. CU PU1 PU2 PUn Shared Memory MM1 MM2 MMn IS DS1 DS2 DSn
- 13. 5.2.3. MISD Computer organization <ul><li>There are n processor units, each receiving distinct instructions operating over the same data stream and its derivatives. </li></ul><ul><li>The result of one processor become the input of the next processor in the macro pipe . </li></ul><ul><li>MISD structure is only of theoretical interest since no practical system has been constructed using this organization. </li></ul>
- 14. CU1 CU2 CUn PU1 PU2 PUn MM1 MM2 MMm IS1 IS2 ISn IS1 IS2 ISn DS SM DS
- 15. 5.2.4. MIMD computer organization <ul><li>MIMD organization refers to a computer system capable of processing several programs at the same time. </li></ul><ul><li>Most multiprocessor systems and multiple computer systems can be classified into this category. </li></ul><ul><li>An intrinsic MIMD computer implies interactions among the n processors because all memory streams are derived from the same data space shared by all processors. </li></ul>
- 16. CU1 CU2 CUn PU1 PU2 PUn MM1 MM2 SM MMm DS1 DS2 DSn IS1 IS2 ISn IS1 IS2 ISn IS1 IS2 ISn
- 17. Parallel Computer Structures <ul><li>Parallel computers are those systems that emphasize parallel processing. Parallel computers are divided into three architectural configurations: </li></ul><ul><ul><ul><li>Pipeline computers </li></ul></ul></ul><ul><ul><ul><li>Array processors </li></ul></ul></ul><ul><ul><ul><li>Multiprocessor systems </li></ul></ul></ul><ul><ul><li>A pipeline computer performs overlapped computations to exploit temporal parallelism. </li></ul></ul><ul><ul><li>An array processor uses multiple synchronized arithmetic logic units to achieve spatial parallelism. </li></ul></ul><ul><ul><li>A multiprocessor system achieves asynchronous parallelism through a set of interactive processors with shared resources. </li></ul></ul><ul><li>The diff. bet. an array processor and a multiprocessor is that the processing elements in an array processor operate synchronously but in multiprocessor system it may operate asynchronously. </li></ul>
- 18. 5.3 Pipeline Computers <ul><li>Pipelining is a technique of decomposing a sequential process into sub operations, with each sub process being executed in a special dedicated segment that operates concurrently with all other segments. </li></ul><ul><li>Each segment performs partial processing dictated by the way the task is partitioned. </li></ul><ul><li>The result obtained from the computation in each segment is transferred to the next segment in the pipeline. </li></ul><ul><li>The final result is obtained after the data have passed through all segments. </li></ul><ul><li>The overlapping of computation is made possible by associating a register with each segment in the pipeline. </li></ul>
- 19. Basic Ideas <ul><li>Parallel processing </li></ul><ul><li>Pipelined processing </li></ul>a1 a2 a3 a4 b1 b2 b3 b4 c1 c2 c3 c4 d1 d2 d3 d4 a1 b1 c1 d1 a2 b2 c2 d2 a3 b3 c3 d3 a4 b4 c4 d4 P1 P2 P3 P4 P1 P2 P3 P4 time Colors: different types of operations performed a, b, c, d: different data streams processed Less inter-processor communication Complicated processor hardware time More inter-processor communication Simpler processor hardware
- 20. Data Dependence <ul><li>Parallel processing requires NO data dependence between processors </li></ul><ul><li>Pipelined processing will involve inter-processor communication </li></ul>P1 P2 P3 P4 P1 P2 P3 P4 time time
- 21. <ul><li>The process of executing an instruction involves 4 major steps: </li></ul><ul><li>In a nonpipelined computer, these 4 steps must be completed before the next instruction can be issued. </li></ul><ul><li>In a pipelined computer, successive instructions are executed in an overlapped fashion. </li></ul>IF ID OF EX S1 S2 S3 S4 ( stages) Pipelined processor
- 22. <ul><li>The general structure of a 4 segment pipeline: </li></ul><ul><li>The operands pass through all 4 segments in a fixed sequence. Each segment consists of a combinational circuit Si that performs a sub operation over the data stream flowing through the pipe. </li></ul><ul><li>The segments are separated by Ri that hold the intermediate results between the stages. </li></ul><ul><li>The behavior of a pipeline can be illustrated with a space time diagram. </li></ul>S1 R1 S2 R2 S3 R3 S4 R4 Input Clock
- 23. <ul><li>Suppose we want to perform </li></ul><ul><ul><li>Ai * Bi + Ci for I = 1,2,3,…,7 </li></ul></ul><ul><li>Each sub operation is to be implemented in a segment within a pipeline. </li></ul><ul><li>Each segment has one or two registers and a combinational circuit. </li></ul><ul><li>R1 through R5 are registers that receive new data with every clock pulse. </li></ul><ul><li>The sub operations performed in each segment of the pipeline are: </li></ul><ul><li>R1 Ai ; R2 Bi, </li></ul><ul><li>R3 R1 * R2 ; R4 Ci ; </li></ul><ul><li>R5 R3 + R4 </li></ul>Example: R1 R2 Multiplier R3 R4 Adder R5 Ai Bi Ci
- 24. <ul><li>The first clock pulse transfers A1 & B1 into R1 & R2. </li></ul><ul><li>The second clock pulse transfers the product of R1 & R2 into R3 and C1 into R4. The same clock pulse transfers A2 and B2 into R1 & R2. </li></ul><ul><li>The third clock pulse operates on all 3 segments simultaneously. It places A3 & B3 into R1 & R2, transfers R1 * R2 into R3, transfers C2 into R4, and places the sum of R3 & R4 into R5. </li></ul><ul><li>It takes 3 clock pulses to fill up the pipe and retrieve the first output from R5. From there on, each clock produces a new output and moves the data one step down the pipeline. </li></ul>
- 25. I1 I2 I3 I1 I2 I3 I1 I2 I3 I1 I2 I3 I4 EX OF ID IF a) Pipelined stages O/P I1 I2 I3 I4 I5 I1 I2 I3 I4 I5 I1 I2 I3 I4 I5 I1 I2 I3 I4 I5 EX OF ID IF 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 10 11 12 13 O/P O/P O/P b) NonPipelined stages
- 26. <ul><li>An instruction cycle consists of multiple pipeline cycles. </li></ul><ul><li>Pipeline cycle can be set equal to the delay of the slowest stage. </li></ul><ul><li>The flow of data from stage to stage is triggered by a common clock of the pipeline. </li></ul><ul><li>For the nonpipelined computer, it takes four pipeline cycles to complete one instruction. </li></ul><ul><li>Once a pipeline is filled up, an output result is produced from the pipeline on each cycle. </li></ul><ul><li>Instruction cycle is reduced to one-fourth of the original cycle time. </li></ul><ul><li>Due to overlapped instruction and arithmetic execution, the pipeline machines are better tuned to perform the same operations repeatedly. </li></ul>
- 27. <ul><li>To complete n tasks with clock time tp using a k-segment pipeline requires k+(n-1) clock cycles. </li></ul><ul><li>For eg:, consider 4 segments and 5 tasks. The time required to complete all the operations is 4 + (5-1) = 8 clock cycles. </li></ul><ul><li>Consider a nonpipeline unit that performs the same operation and takes a time tn to complete each task. The total time required for n tasks is ntn. </li></ul><ul><li>The speed up of a pipeline processing over an equivalent nonpipeline processing is defined as </li></ul><ul><li> S = ntn / ((k+n-1)tp) </li></ul>
- 28. 5.4 Arithmetic Pipeline <ul><li>An arithmetic pipeline divides an arithmetic operation into sub operations for execution in the pipeline segments. </li></ul><ul><li>They are used to implement floating point operations, multiplication of fixed-point numbers, and similar computations encountered in scientific problems. </li></ul>
- 29. Example <ul><li>The inputs to the floating-point adder pipeline are two normalized floating-point binary numbers: </li></ul><ul><ul><ul><li>X = A x 2 a </li></ul></ul></ul><ul><ul><ul><li>Y = B x 2 b </li></ul></ul></ul><ul><li>A & B are two fractions that represent the mantissas and a & b are the exponents. The floating point addition and subtraction can be performed in 4 segments. </li></ul><ul><li>The registers labeled R are placed between the segments to store intermediate results. </li></ul><ul><li>The sub operations that are performed in the 4 segments are: </li></ul><ul><ul><ul><li>Compare the exponents </li></ul></ul></ul><ul><ul><ul><li>Align the mantissas </li></ul></ul></ul><ul><ul><ul><li>Add or subtract the mantissas. </li></ul></ul></ul><ul><ul><ul><li>Normalize the result. </li></ul></ul></ul>
- 30. <ul><li>Exponents are compares by subtracting them to determine their difference. </li></ul><ul><li>The larger exponent is chosen as the exponent of the result. </li></ul><ul><li>The exponent difference determines how many times the mantissa associated with the smaller exponent must be shifted to the right. This produces an alignment of the 2 mantissas. </li></ul><ul><li>The 2 mantissas are added or subtracted in segment 3. The result is normalized in segment 4. </li></ul><ul><li>When an overflow occurs, the mantissa is shifted right and the exponent incremented by one </li></ul><ul><li>If an underflow occurs, the no: of leading zeros in the mantissa determines the no: of left shifts in the mantissa and the number that must be subtracted from the exponent </li></ul>
- 31. R R Compare Exponents By subtraction R Choose exponent R R Align Mantissas Normalize result R Add or Subtract mantissas R Adjust exponent R Seg 1: Seg 2: Seg 3: Seg 4: Difference Exponents Mantissas a b A B
- 32. <ul><li>Consider 2 normalized floating-point numbers: </li></ul><ul><ul><li>X = 0.9504 x 10 3 ; Y = 0.8200 x 10 2 </li></ul></ul><ul><li>The 2 exponents are subtracted in the segment 1 to obtain 3 -2 = 1. The larger exponent 3 is chosen as the exponent of the result. </li></ul><ul><li>The next segment shifts the mantissa of Y to the right to obtain </li></ul><ul><ul><li>X = 0.9504 x 10 3 ; Y = 0.0820 x 10 3 </li></ul></ul><ul><li>This aligns the 2 mantissa under the same exponent. </li></ul><ul><li>The addition of the 2 mantissas in segment 3 produces the sum </li></ul><ul><ul><li>Z = 1.0324 x 10 3 </li></ul></ul><ul><li>The sum is adjusted by normalizing the result so that it has a fraction with a nonzero first digit. This is done by shifting the mantissa once to the right and incrementing the exponent by one to obtain the normalized sum. </li></ul><ul><ul><li>Z = 0.10324 x 10 4 </li></ul></ul>
- 33. 5.5 Instruction Pipeline <ul><li>An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput (the number of instructions that can be executed in a unit of time). </li></ul><ul><li>The fundamental idea is to split the processing of a computer instruction into a series of independent steps, with storage at the end of each step. </li></ul><ul><li>An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments. </li></ul><ul><li>This causes the instruction fetch and execute phases to overlap and perform simultaneous operations. </li></ul>
- 34. <ul><li>In the most general case, the computer needs to process each instruction with the following sequence of steps: </li></ul><ul><ul><ul><li>Fetch the instruction from memory </li></ul></ul></ul><ul><ul><ul><li>Decode the instruction </li></ul></ul></ul><ul><ul><ul><li>Calculate the effective address </li></ul></ul></ul><ul><ul><ul><li>Fetch the operands from memory </li></ul></ul></ul><ul><ul><ul><li>Execute the instruction </li></ul></ul></ul><ul><ul><ul><li>Store the result in the proper place. </li></ul></ul></ul><ul><li>Two or more segments may require memory access at the same time, causing one segment to wait until another is finished with the memory. </li></ul><ul><li>Memory access conflicts can be resolved by using 2 memory buses for accessing data & instructions. </li></ul>
- 35. Example <ul><li>While an instruction is being executed in segment 4, the next instruction in sequence is busy fetching an operand from memory in segment 3. </li></ul><ul><li>The EA may be calculated in a separate arithmetic circuit for the 3 rd instruction, and whenever the memory is available, the 4 th and all subsequent instructions can be fetched and placed in an instruction FIFO. </li></ul><ul><li>Thus up to 4 sub operations in the instruction cycle can overlap and up to 4 different instructions can be in progress of being processed at the same time. </li></ul>
- 36. Fetch instruction from memory Decode instruction & calculate EA Branch? Fetch Operand from memory Execute instruction Interrupt? Empty pipe Update PC Interrupt Handling Yes Yes No No Seg 1: Seg 2: Seg 3: Seg 4: Four segment CPU pipeline
- 37. Timing of instruction pipeline INSTRUCTION BRANCH Steps 1 2 3 4 5 6 7 8 9 10 11 12 13 1 FI DA FO EX 2 FI DA FO EX 3 FI DA FO EX 4 FI - - FI DA FO EX 5 - - - FI DA FO EX 6 FI DA FO EX 7 FI DA FO EX
- 38. <ul><li>It is assumed that processor has separate instruction and data memories so that the operation in FI & FO can proceed at the same time. </li></ul><ul><li>In the absence of a branch instruction, each segment operates on different instructions. </li></ul><ul><li>Thus in step 4, instruction 1 is being executed in Seg 4; the operand for instruction 2 is fetched in Seg 3; instruction 3 is being decoded in Seg 2; and instruction 4 is being fetched from memory in Seg 2. </li></ul><ul><li>Assume that instruction 3 is a branch instruction. </li></ul><ul><ul><li>As soon as this is decoded in seg 2 in step 4, the transfer from FI to DA of other instructions is halted until the branch instruction is executed in step 6. </li></ul></ul><ul><ul><li>If the branch is not taken, a new instruction is fetched in step 7. </li></ul></ul><ul><ul><li>If the branch is not taken, the instruction fetched previously in step 4 can be used. </li></ul></ul><ul><ul><li>The pipeline then continues until a new branch instruction is encountered. </li></ul></ul>
- 39. Major Difficulties <ul><li>Resource conflicts </li></ul><ul><ul><li>Access to memory by two segments at the same time. </li></ul></ul><ul><ul><li>Soln: Separate memory for Instruction and Data </li></ul></ul><ul><li>Data dependency </li></ul><ul><ul><li>An instruction depends on the result of previous instruction </li></ul></ul><ul><li>Branch difficulty </li></ul>
- 40. Data Dependency <ul><li>It occurs when an instruction needs data that are not yet available </li></ul><ul><li>For eg:, an instruction in the FO segment may need to fetch an operand that is being generated at the same time by the previous instruction in segment EX. Therefore, the second instruction must wait for data to become available by the first instruction. </li></ul>
- 41. Solutions <ul><li>Hardware Interlocks </li></ul><ul><ul><li>Circuit that detects instructions whose source operands are destinations of instructions farther up in the pipeline </li></ul></ul><ul><ul><li>Detection of this situation causes the instruction whose source is not available to be delayed by enough clock cycles to resolve the conflict. </li></ul></ul><ul><li>Operand forwarding </li></ul><ul><ul><li>Special hardware to detect a conflict and then avoid by routing data through special paths between pipeline segments </li></ul></ul><ul><ul><li>Eg: ALU result forward into ALU input location </li></ul></ul><ul><li>Delayed Load </li></ul><ul><ul><li>Problem is solved in the compilation process itself </li></ul></ul><ul><ul><li>Delay the data loading by inserting NO-Operation instruction </li></ul></ul>
- 42. Handling Branch Instruction <ul><li>Pre-fetch target instruction – </li></ul><ul><ul><li>Pre-fetch the target instruction in addition to the instruction following the branch. </li></ul></ul><ul><ul><li>Both are saved until the branch is executed. </li></ul></ul><ul><ul><li>if branch condition is successful, pipeline continues from branch target instruction. </li></ul></ul><ul><li>Branch target buffer or BTB </li></ul><ul><ul><li>It is an associative memory included in the fetch segment of the pipeline. </li></ul></ul><ul><ul><li>Each entry consists of the address of a previously executed branch instruction and the target instruction for the branch. </li></ul></ul><ul><ul><li>It also stores the next few instructions after the branch target instruction. </li></ul></ul><ul><ul><li>Adv is branch instructions that have occurred previously are readily available in the pipeline without interruption </li></ul></ul>
- 43. <ul><li>Loop buffer </li></ul><ul><ul><li>Small very high speed register file maintained by the instruction fetch segment of pipeline. </li></ul></ul><ul><ul><li>When a loop is detected, it is stored in the loop buffer in its entirety, including all the branches. </li></ul></ul><ul><ul><li>The loop can be executed directly without having to access memory until the loop is removed </li></ul></ul>
- 44. 5.6 Vector Processing <ul><li>Applies to scientific and engineering data processing which require vast number of computations </li></ul><ul><li>Vector processing can apply to </li></ul><ul><ul><li>Long-range weather forecasting </li></ul></ul><ul><ul><li>Image processing </li></ul></ul><ul><ul><li>Medical Diagnostics </li></ul></ul><ul><ul><li>Flight simulations </li></ul></ul><ul><ul><li>AI & Expert Systems </li></ul></ul><ul><ul><li>Petroleum Explorations </li></ul></ul>
- 45. <ul><li>Vector Operations </li></ul><ul><ul><li>A Vector is an ordered set of a 1D array of data items </li></ul></ul><ul><ul><li>It is represented as row vector by V = [V1, V2, ….. Vn] of length ‘n’ </li></ul></ul><ul><ul><li>Operations can broken down into single computations with subscripted variables </li></ul></ul><ul><ul><li>The element Vi of vector V is written as V(I) and the index I refers to a memory address or register where the no. is stored. </li></ul></ul><ul><ul><li>Vector processing eliminates the overhead of fetch and execution of each step </li></ul></ul>
- 46. <ul><li>Vector processor allows operations to be specified with single vector instruction. </li></ul><ul><li>The vector instruction includes the initial address of the operands, the length of the vectors, and the operation to be performed, all in one composite instruction </li></ul><ul><li>Instruction Format: </li></ul><ul><li>This is a 3 address instruction with 3 fields specifying the base address of the operands and an additional field that gives the length of the data items in the vectors. This assumes that the vector operands reside in memory </li></ul>Operation Code Base Address Source 1 Base Address Source 2 Base Address Destination Vector Length
- 47. Matrix Multiplication <ul><li>Computational intensive operations performed in computers with vector processors. </li></ul><ul><li>Multiplication two n X n matrix consists n 2 inner products or n 3 multiply-add operations. </li></ul><ul><li>N x m matrix may be considered as constituting a set of n row vectors or a set of m column vectors </li></ul><ul><li>Each multiplication and addition can implement by using floating-point pipeline </li></ul>
- 48. <ul><li>Consider the product matrix C ( 3 x 3) as A x B whose elements are related to elements of A & B by the inner product: </li></ul><ul><li>Cij = ∑ a ik x b kj </li></ul><ul><li>For eg: for i=1 & j=1 then C 11 = a 11 b 11 + a 12 b 21 + a 13 b 31 </li></ul><ul><li>This requires 3 multiplications and 3 additions. </li></ul><ul><li>The total number of multiplications or additions required to compute the matrix product is 9 x 3= 27 </li></ul><ul><li>If we consider the linked multiply-add operation c + a x b as a cumulative operation, the product of two n x n matrices requires n 3 multiply-add operations. The computation consists of n 2 inner products, with each inner product requiring n multiply-add operations. </li></ul><ul><li>In general, the inner product consists of sum of k product terms of the form: </li></ul><ul><li>C = A 1 B 1 + A 2 B 2 + A 3 B 3 +……. + A k B k </li></ul>3 k=1
- 49. Pipeline for calculating an inner product <ul><li>Values of A & B are either in memory or in processor registers. </li></ul><ul><li>The floating point multiplier pipeline and the floating point adder pipeline are assumed to have 4 segments each. </li></ul><ul><li>All segment registers in the multiplier and adder are initialized to 0. </li></ul><ul><li>Therefore, the output of the adder is 0 for the first 8 cycles until both pipes are full. </li></ul><ul><li>Ai & Bi pairs are brought in and multiplied at a rate of one pair per cycle. </li></ul>Source A Source B Multiplier Pipeline Adder Pipeline
- 50. <ul><li>After the first 4 cycles, the products begin to be added to the output of the adder. </li></ul><ul><li>During the next 4 cycles 0 is added to the products entering the adder pipeline. </li></ul><ul><li>At the end of the 8 th cycle, the first 4 products A 1 B 1 through A 4 B 4 are in the 4 adder segments, and the next 4 products, A 5 B 5 through A 8 B 8 , are in the multiplier segments. </li></ul><ul><li>At the beginning of the 9 th cycle, the output of the adder is A 1 B 1 and of multiplier is A 5 B 5 and thus it starts addition A 1 B 1 + A 5 B 5 in the adder pipeline. </li></ul><ul><li>The 10th cycle starts the addition A 2 B 2 + A 6 B 6 and so on </li></ul>
- 51. <ul><li>ie </li></ul><ul><li>C = A 1 B 1 + A 5 B 5 + A 9 B 9 + A 13 B 13 + …… </li></ul><ul><li>+ A 2 B 2 + A 6 B 6 + A 10 B 10 + A 14 B 14 + …… </li></ul><ul><li>+ A 3 B 3 + A 7 B 7 + A 11 B 11 + A 15 B 15 + …… </li></ul><ul><li> + A 4 B 4 + A 8 B 8 + A 12 B 12 + A 16 B 16 + …… </li></ul><ul><li>When there are no more product terms to be added, system inserts 4 zeros into the multiplier pipeline. </li></ul>
- 52. <ul><li>Memory Interleaving </li></ul><ul><ul><li>Pipeline and vector processors require simultaneous access to memory from 2 or more sources </li></ul></ul><ul><ul><li>Instead of using 2 memory buses for simultaneous access, the memory can be partitioned into modules connected to a common address and data bus </li></ul></ul><ul><ul><li>Memory Module is a memory array with Address Register and Data Register </li></ul></ul><ul><ul><li>Multiple memory operations are possible in each memory module </li></ul></ul>
- 53. Memory Array Memory Array Memory Array Memory Array AR AR AR AR DR DR DR DR Address Bus Data Bus
- 54. <ul><li>Each memory has its own AR & DR. The AR receives information from a common address bus and DR communicates with a bidirectional data bus. </li></ul><ul><li>The two least significant bits of the address can be used to distinguish between the 4 modules. </li></ul><ul><li>In an interleaved memory, different sets of addresses are assigned to different memory modules. For eg:, in a 2 module memory system, the even addresses may be in one module and the odd addresses in the other. </li></ul><ul><li>When the number of modules is a power of 2, the LSB of the address select a memory module and the remaining bits designate the specific location to be accessed within the selected module. </li></ul><ul><li>A vector processor that uses an n-way interleaved memory can fetch n operands from n different modules. </li></ul>
- 55. <ul><li>Super computers </li></ul><ul><ul><li>A computer with Vector instructions and pipelined floating-point arithmetic operations </li></ul></ul><ul><ul><li>Internal components are packed tightly together to minimize the distance </li></ul></ul><ul><ul><li>Special techniques to remove heat from circuit </li></ul></ul><ul><ul><li>Performance is measured in terms of number of floating-point operations per second (FLOPS) </li></ul></ul><ul><ul><li>The first supercomputer is the CRAY-1 and it uses vector processing with 12 distinct functional units in parallel. </li></ul></ul>
- 56. 5.7 Array Processors <ul><li>Array processor is a synchronous parallel computer with multiple ALUs, called processing elements (PE), that can operate in a lock step fashion. </li></ul><ul><li>The PEs are synchronized to perform the same function at the same time. </li></ul><ul><li>Scalar and control type instructions are directly executed in the CU. </li></ul><ul><li>Each PE consists of an ALU with registers and a local memory. </li></ul>
- 57. SIMD Array Processor <ul><li>An SIMD array processor is a computer with multiple processing units operating in parallel. </li></ul><ul><li>The PUs are synchronized to perform the same operation under the control of a common CU, thus providing a single instruction stream, multiple data stream (SIMD) organization. </li></ul><ul><li>It contains a set of PEs each having a local memory M. Each PE includes an ALU, a floating point arithmetic unit, and working registers. </li></ul><ul><li>The main memory is used for storage of the program. </li></ul><ul><li>The CU controls the operations in the PEs. It decodes the instructions and determine how the instruction is to be executed. </li></ul><ul><li>Vector instructions are broadcast to all PEs simultaneously. Each PE uses operands stored in its local memory. Vector operands are distributed to the local memories prior to the parallel execution of the instruction. </li></ul>
- 58. Master control Unit Main Memory PE1 M1 PE2 M2 PE3 M3 M n PE n . . . . . .
- 59. <ul><li>Eg: C = A + B </li></ul><ul><ul><li>The control unit first stores the ith components ai & bi of A & B in local memory Mi for i = 1,2,3,...n. It then broadcasts the floating point add instruction ci = ai + bi to all PEs, causing the addition to take place simultaneously. </li></ul></ul><ul><ul><li>The components of Ci are stored in fixed locations in each local memory. This produces the desired vector sum in one add cycle. </li></ul></ul><ul><li>The best known SIMD array processor is the ILLIAC IV computer. </li></ul><ul><li>SIMD processors are highly specialized computers which are suited primarily for numerical problems that can be expressed in vector or matrix form. </li></ul>
- 60. 5.8. Multiprocessor systems <ul><li>The system contains two or more processors of appropriately comparable capabilities. </li></ul><ul><li>All processors share access to common sets of memory modules, I/O channels ,and peripheral devices. </li></ul><ul><li>The entire system must be controlled by a single integrated OS providing interactions between processors and their programs. </li></ul><ul><li>Each processor has its own local memory and I/O devices. </li></ul><ul><li>A multiprocessor system is an interconnection of 2 or more CPUs with memory and input-output devices. </li></ul><ul><li>Multiprocessors are MIMD systems. </li></ul>
- 61. <ul><li>If a fault causes one processor to fail, a second processor can be assigned to perform the functions of the disabled processor. </li></ul><ul><li>The system as a whole can continue to function correctly with perhaps some loss in efficiency. </li></ul><ul><li>the adv from a multiprocessor organization is an improved system performance and reliability </li></ul>
- 62. Architectural Aspects <ul><li>The architecture of multiprocessor system contains a number of CPU’s and a number of Input Output Processors with so many i/o devices and a memory unit connected and it needs a no of network policies to be considered to get the maximum performance with minimized complexity. </li></ul><ul><li>Three different interconnections : </li></ul><ul><ul><ul><li>Time-shared common bus </li></ul></ul></ul><ul><ul><ul><li>Crossbar switch network </li></ul></ul></ul><ul><ul><ul><li>Multiport memories. </li></ul></ul></ul><ul><li> </li></ul>
- 63. <ul><li>Time-shared common bus </li></ul><ul><li>A common bus multiprocessor system consists of a number of processors connected through a common path to a memory unit. </li></ul><ul><li>In a time shared common bus, only one processor can communicate with the memory or another processor at any given time. </li></ul>Memory Unit IOP 1 IOP 1 CPU 1 CPU 1 CPU 1 Common Bus
- 64. <ul><li>Transfer operations are conducted by the processor that is in control of the bus at the time. </li></ul><ul><li>Any other processor wishing to initiate a transfer must first determine the availability status of the bus, and only after the bus becomes available can the processor address the destination unit to initiate the transfer. </li></ul><ul><li>A command is issued to inform the destination unit what operation is to be performed. </li></ul><ul><li>The receiving unit recognizes its address in the bus and responds to the control signals from the sender, after which the transfer is initiated. </li></ul><ul><li>Single common bus system is restricted to one transfer at a time. </li></ul>
- 65. Multiport Memory <ul><li>A multiport memory system employs separate buses between each memory module and each CPU. </li></ul>CPU 1 CPU 3 CPU 4 MM 1 MM 2 MM 3 MM 4 CPU 2 CPU 1
- 66. <ul><li>Each processor bus is connected to each MM. </li></ul><ul><li>A processor bus consists of address, data and control lines required to communicate with memory. </li></ul><ul><li>The MM is said to have 4 ports and each port accommodates one of the buses. </li></ul><ul><li>The module must have internal control logic to determine which port will have access to memory at any given time. </li></ul><ul><li>Memory access conflicts are resolved by assigning fixed priorities to each memory port. </li></ul><ul><li>The priority for memory access associated with each processor may be established by the physical port position that its bus occupies in each module. </li></ul><ul><li>Thus CPU 1 will have priority over CPU2 and CPU 4 will have the lowest priority </li></ul>
- 67. <ul><li>The advantage is the high transfer rate that can be achieved because of the multiple paths between processors and memory. </li></ul><ul><li>The disadvantage is that it requires expensive memory control logic and a large no of cables and connectors. </li></ul>
- 68. Crossbar Switch Organization <ul><li>the organization consists of a number of crosspoints that are placed at intersections between processor buses and memory module paths. </li></ul><ul><li>The small square in each crosspoint is a switch that determines the path from a processor to a memory module </li></ul><ul><li>Each switch point has control logic to set up the transfer path between a processor and memory. </li></ul><ul><li>It examines the address that is placed in the bus to determine whether its particular module is being addressed. </li></ul><ul><li>It also resolves multiple requests for access to the same memory module on a predetermined priority basis. </li></ul>
- 69. CPU 1 CPU 2 CPU 3 MM 1 MM 2 MM 3 Crossbar Switch
- 70. <ul><li>the circuit consists of multiplexers that select the data, address and control from one CPU for communication with the memory module. </li></ul><ul><li>priority levels are established by the arbitration logic to select one CPU when two or more CPUs attempt to access the same memory. </li></ul><ul><li>A crossbar switch organization supports simultaneous transfers from all memory modules because there is a separate path associated with each module. </li></ul>
- 71. Comparison of RISC and CISC <ul><li>Multiplying Two Numbers in Memory </li></ul><ul><li>The main memory is divided into locations numbered from (row) 1: (column) 1 to (row) 6: (column) 4. </li></ul><ul><li>The execution unit is responsible for carrying out all computations. </li></ul><ul><li>However, the execution unit can only operate on data that has been loaded into one of the six registers (A, B, C, D, E, or F). </li></ul><ul><li>to find the product of two numbers - one stored in location 2:3 and another stored in location 5:2 - and then store the product back in the location 2:3. </li></ul>
- 72. The CISC (Complex Instruction Set Computers) Approach <ul><li>The primary goal of CISC architecture is to complete a task in as few lines of assembly as possible. </li></ul><ul><li>This is achieved by building processor hardware that is capable of understanding and executing a series of operations. </li></ul><ul><li>For a particular task, a CISC processor would come prepared with a specific instruction (say "MULT"). </li></ul><ul><li>When executed, this instruction loads the two values into separate registers, multiplies the operands in the execution unit, and then stores the product in the appropriate register. </li></ul><ul><li>Thus, the entire task of multiplying two numbers can be completed with one instruction: </li></ul><ul><li>MULT 2:3, 5:2 </li></ul>
- 73. <ul><li>MULT is what is known as a "complex instruction." </li></ul><ul><li>It operates directly on the computer's memory banks and does not require the programmer to explicitly call any loading or storing functions. </li></ul><ul><li>It closely resembles a command in a higher level language. </li></ul><ul><li>For instance, if we let "a" represent the value of 2:3 and "b" represent the value of 5:2, then this command is identical to the C statement "a = a * b." </li></ul>
- 74. <ul><li>One of the primary advantages of this system is that the compiler has to do very little work to translate a high-level language statement into assembly. </li></ul><ul><li>Because the length of the code is relatively short, very little RAM is required to store instructions. </li></ul><ul><li>The emphasis is put on building complex instructions directly into the hardware. </li></ul>
- 75. The RISC (reduced instruction set computer ) Approach <ul><li>RISC processors only use simple instructions that can be executed within one clock cycle. </li></ul><ul><li>Thus, the "MULT" command described above could be divided into three separate commands: "LOAD," which moves data from the memory bank to a register, "PROD," which finds the product of two operands located within the registers, and "STORE," which moves data from a register to the memory banks. </li></ul><ul><li>In order to perform the exact series of steps described in the CISC approach, a programmer would need to code four lines of assembly: </li></ul><ul><li>LOAD A, 2:3 LOAD B, 5:2 PROD A, B STORE 2:3, A </li></ul>
- 76. <ul><li>At first, this may seem like a much less efficient way of completing the operation. </li></ul><ul><ul><li>Because there are more lines of code, more RAM is needed to store the assembly level instructions. </li></ul></ul><ul><ul><li>The compiler must also perform more work to convert a high-level language statement into code of this form. </li></ul></ul><ul><li>However, the RISC strategy also brings some very important advantages. </li></ul><ul><ul><li>Because each instruction requires only one clock cycle to execute, the entire program will execute in approximately the same amount of time as the multi-cycle "MULT" command. </li></ul></ul><ul><ul><li>These RISC "reduced instructions" require less transistors of hardware space than the complex instructions, leaving more room for general purpose registers. </li></ul></ul><ul><ul><li>Because all of the instructions execute in a uniform amount of time (i.e. one clock), pipelining is possible. </li></ul></ul>
- 77. <ul><li>Separating the "LOAD" and "STORE" instructions actually reduces the amount of work that the computer must perform. </li></ul><ul><li>After a CISC-style "MULT" command is executed, the processor automatically erases the registers. If one of the operands needs to be used for another computation, the processor must re-load the data from the memory bank into a register. </li></ul><ul><li>In RISC, the operand will remain in the register until another value is loaded in its place. </li></ul>
- 78. <ul><li>The Performance Equation: </li></ul><ul><li>The CISC approach attempts to minimize the number of instructions per program, sacrificing the number of cycles per instruction. </li></ul><ul><li>RISC does the opposite, reducing the cycles per instruction at the cost of the number of instructions per program. </li></ul>
- 79. What is RISC? <ul><li>RISC, or Reduced Instruction Set Computer is a type of microprocessor architecture that utilizes a small, highly-optimized set of instructions , rather than a more specialized set of instructions often found in other types of architectures. </li></ul>
- 80. CISC versus RISC Multiple register sets Single register set Fixed format instructions variable format instructions Spends more transistors on memory registers Transistors used for storing complex instructions Low cycles per second, large code sizes Small code sizes, high cycles per second Register to register: "LOAD" and "STORE" are independent instructions Memory-to-memory: "LOAD" and "STORE" incorporated in instructions Single-clock, reduced instruction only Includes multi-clock complex instructions Emphasis on software Emphasis on hardware RISC CISC
- 81. RISC Pipelines <ul><li>A RISC processor pipeline operates in much the same way, although the stages in the pipeline are different. </li></ul><ul><li>The stages are: </li></ul><ul><ul><ul><li>fetch instructions from memory </li></ul></ul></ul><ul><ul><ul><li>read registers and decode the instruction </li></ul></ul></ul><ul><ul><ul><li>execute the instruction or calculate an address </li></ul></ul></ul><ul><ul><ul><li>access an operand in data memory </li></ul></ul></ul><ul><ul><ul><li>write the result into a register </li></ul></ul></ul>
- 82. <ul><li>Because RISC instructions are simpler than those used in CISC processors , they are more conducive to pipelining. </li></ul><ul><li>While CISC instructions varied in length, RISC instructions are all the same length and can be fetched in a single operation. </li></ul><ul><li>Ideally, each of the stages in a RISC processor pipeline should take 1 clock cycle so that the processor finishes an instruction each clock cycle and averages one cycle per instruction (CPI). </li></ul><ul><li>Pipeline Problems </li></ul><ul><li>In practice, however, RISC processors operate at more than one cycle per instruction. The processor might occasionally stall a a result of data dependencies and branch instructions. </li></ul>
- 83. <ul><li>Eg: 3-segment instruction pipeline: </li></ul><ul><li>The instruction cycle can be divided into 3 sub operations and implemented in 3 segments: </li></ul><ul><ul><ul><li>Instruction Fetch </li></ul></ul></ul><ul><ul><ul><li>ALU Operation </li></ul></ul></ul><ul><ul><ul><li>Execute instruction </li></ul></ul></ul><ul><li>Segment I fetches the instruction from program memory. </li></ul><ul><li>The instruction is decoded and an ALU operation is performed in the A segment. </li></ul><ul><li>Segment E directs the output of the ALU to one of 3 destinations (register file: result of ALU operation, Memory: transfers the EA for loading or storing, PC: transfers the branch address), depending on the decoded instruction </li></ul>
- 84. <ul><li>Delayed Load: </li></ul><ul><li>Consider </li></ul><ul><ul><ul><li>LOAD: R1<- M[address1] </li></ul></ul></ul><ul><ul><ul><li>LOAD: R2<- M[address2] </li></ul></ul></ul><ul><ul><ul><li>ADD: R3<- R1 + R2 </li></ul></ul></ul><ul><ul><ul><li>STORE : M[address3]<- R3 </li></ul></ul></ul><ul><li>If the 3 segment proceeds without interruptions, there will be a data conflict in instruction 3 because the operand in R2 is not yet available in A segment </li></ul>
- 85. Pipeline timing with delayed load Pipeline timing with data conflict E A I 4. Store R3 E A I 3. Add R1+R2 E A I 2; LOAD R2 E A I 1; LOAD R1 6 5 4 3 2 1 Clock Cycles A E 6 E I 5. Store R3 A I 4. Add R1+R2 E A I 3. No-operation E A I 2; LOAD R2 E A I 1; LOAD R1 7 5 4 3 2 1 Clock Cycles
- 86. <ul><li>E segment in clock cycle 4 is in a process of placing the memory data into R2. </li></ul><ul><li>A segment in clock cycle 4 is using the data from R2, but the value in R2 will not be the correct value since it has not yet been transferred from memory. </li></ul><ul><li>It is up to the compiler to make sure that the instruction following the load instruction uses the data fetched from the memory. </li></ul><ul><li>If the compiler cannot find a useful instruction to put after the load, it inserts a no-operation instruction and thus waiting a clock cycle. </li></ul><ul><li>This concept of delaying the use of the data loaded from memory is referred to as delayed load. </li></ul>

No public clipboards found for this slide

Be the first to comment