Parallel Processing
Processing instructions in parallel requires three major tasks:
1. checking dependencies between instructions to determine which instructions can be grouped together for parallel execution;
2. assigning instructions to the functional units on the hardware;
3. determining when instructions are initiated (in a VLIW machine, when they are placed together into a single instruction word).
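As a rough illustration of the first task, the sketch below checks register dependencies between instructions and greedily groups neighbouring independent instructions. The `Instr` record, its field names, and the `group_independent` helper are hypothetical names invented for this example; real hardware or compilers also track memory dependencies and functional-unit types, which are ignored here.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    name: str
    dest: str     # register written by the instruction
    srcs: tuple   # registers read by the instruction

def depends_on(later: Instr, earlier: Instr) -> bool:
    """True if `later` cannot be placed in the same group as `earlier`."""
    raw = earlier.dest in later.srcs   # read-after-write (true dependence)
    war = later.dest in earlier.srcs   # write-after-read (anti-dependence)
    waw = later.dest == earlier.dest   # write-after-write (output dependence)
    return raw or war or waw

def group_independent(program, width):
    """Greedy in-order grouping: close the current group when it is full or
    when the next instruction depends on an instruction already in it."""
    groups, current = [], []
    for ins in program:
        if len(current) == width or any(depends_on(ins, e) for e in current):
            groups.append(current)
            current = []
        current.append(ins)
    if current:
        groups.append(current)
    return groups

# Hypothetical four-instruction program for a 2-wide machine.
program = [
    Instr("i1", "r1", ("r8",)),
    Instr("i2", "r2", ("r9",)),
    Instr("i3", "r3", ("r1", "r2")),
    Instr("i4", "r4", ("r3",)),
]
groups = group_independent(program, width=2)
names = [[i.name for i in g] for g in groups]
print(names)
```

Here `i1` and `i2` share a group because neither reads or writes the other's registers, while `i3` and `i4` each end up alone because of the chain of true dependencies.
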
Major Categories
• VLIW – Very Long Instruction Word
• EPIC – Explicitly Parallel Instruction Computing
Superscalar Processors
Superscalar processors are designed to exploit more instruction-level parallelism in user programs. Only independent instructions can be executed in parallel without causing a wait state. The amount of instruction-level parallelism varies widely depending on the type of code being executed.
Pipelining in Superscalar Processors
To fully utilise a superscalar processor of degree m, m instructions must be executable in parallel. This condition may not hold in every clock cycle; when it does not, some of the pipelines stall in a wait state. In a superscalar processor, a simple operation should have a latency of only one cycle, as in the base scalar processor.
Superscalar Implementation
• Simultaneous fetching of multiple instructions
• Logic to determine true dependencies involving register values
• Mechanisms to communicate these values
• Mechanisms to initiate multiple instructions in parallel
• Resources for parallel execution of multiple instructions
• Mechanisms for committing process state in the correct order
Some Architectures
• PowerPC 604
  – six independent execution units: branch execution unit, load/store unit, 3 integer units, floating-point unit
  – in-order issue
  – register renaming
• PowerPC 620
  – provides out-of-order issue in addition to the features of the 604
• Pentium
  – three independent execution units: 2 integer units, floating-point unit
  – in-order issue
VLIW
Very Long Instruction Word (VLIW) architectures execute more than one basic instruction at a time. These processors contain multiple functional units; they fetch from the instruction cache a very long instruction word containing several basic instructions and dispatch the entire word for parallel execution. Compilers exploit these capabilities by generating code in which independent primitive instructions that can execute in parallel are grouped together. VLIW has been described as a natural successor to RISC (Reduced Instruction Set Computing), because it moves complexity from the hardware to the compiler, allowing simpler, faster processors. VLIW eliminates the complicated instruction scheduling and parallel dispatch hardware found in most modern microprocessors.
Why VLIW?
The key to higher performance in microprocessors for a broad range of applications is the ability to exploit fine-grain, instruction-level parallelism. Some methods for exploiting fine-grain parallelism include:
• Pipelining
• Multiple processors
• Superscalar implementation
• Specifying multiple independent operations per instruction
Architecture Comparison: CISC, RISC & VLIW

INSTRUCTION SIZE
  CISC: Varies
  RISC: One size, usually 32 bits
  VLIW: One size

INSTRUCTION FORMAT
  CISC: Field placement varies
  RISC: Regular, consistent placement of fields
  VLIW: Regular, consistent placement of fields

INSTRUCTION SEMANTICS
  CISC: Varies from simple to complex; possibly many dependent operations per instruction
  RISC: Almost always one simple operation
  VLIW: Many simple, independent operations

REGISTERS
  CISC: Few, sometimes special
  RISC: Many, general-purpose
  VLIW: Many, general-purpose

MEMORY REFERENCES
  CISC: Bundled with operations in many different types of instructions
  RISC: Not bundled with operations, i.e., load/store architecture
  VLIW: Not bundled with operations, i.e., load/store architecture

HARDWARE DESIGN FOCUS
  CISC: Exploit microcoded implementations
  RISC: Exploit implementations with one pipeline and no microcode
  VLIW: Exploit implementations with multiple pipelines, no microcode and no complex dispatch logic

PICTURES OF FIVE TYPICAL INSTRUCTIONS
  [Figure: instruction layouts for each architecture]
Advantages of VLIW
• VLIW processors rely on the compiler that generates the VLIW code to explicitly specify parallelism.
• Relying on the compiler has advantages.
• VLIW architecture reduces hardware complexity.
• VLIW simply moves complexity from hardware into software.
What is ILP?
• Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously.
• A system is said to embody ILP if multiple instructions run on it at the same time.
• ILP can have a significant effect on performance, which is critical to embedded systems.
• ILP provides a form of power saving: a processor that completes more instructions per cycle can run at a slower clock for the same throughput.
What Do We Intend to Do with ILP?
We use micro-architectural techniques to exploit ILP. These techniques include:
• Instruction pipelining, which overlaps the execution of successive instructions and depends on CPU caches to keep the pipeline supplied.
• Register renaming, a technique used to avoid unnecessary serialization of program operations imposed by the reuse of registers by those operations.
• Speculative execution, which reduces pipeline stalls due to control dependencies.
• Branch prediction, which is used to keep the pipeline full.
• Superscalar execution, in which multiple execution units are used to execute multiple instructions in parallel.
• Out-of-order execution, which reduces pipeline stalls due to operand dependencies.
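Of the techniques listed, register renaming is perhaps the easiest to show in a few lines. The following is a minimal sketch, assuming instructions are simple `(dest, srcs)` pairs and that physical registers `p0, p1, ...` are unlimited; the `rename` function and the register names are hypothetical, and real renaming hardware also has to recycle physical registers.

```python
from itertools import count

def rename(program):
    """Give every register write a fresh physical register.

    program: list of (dest, srcs) pairs using architectural register names.
    Registers that are read before any write in the program keep their
    architectural name (they are live-in values).
    """
    fresh = (f"p{n}" for n in count())
    mapping = {}   # architectural register -> latest physical register
    renamed = []
    for dest, srcs in program:
        new_srcs = tuple(mapping.get(s, s) for s in srcs)  # read current mapping
        mapping[dest] = next(fresh)                        # allocate new target
        renamed.append((mapping[dest], new_srcs))
    return renamed

program = [
    ("r1", ("r2", "r3")),   # r1 = r2 op r3
    ("r4", ("r1", "r5")),   # r4 = r1 op r5  (true dependence on the first write)
    ("r1", ("r6", "r7")),   # reuses r1: anti- and output dependence on the above
]
renamed = rename(program)
print(renamed)
```

After renaming, the third instruction writes `p2` instead of reusing the first instruction's register, so only the true dependence between the first two instructions remains and the third can execute in parallel with them.
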
Algorithms for Scheduling
Some of the instruction scheduling algorithms in use are:
• List scheduling
• Trace scheduling
• Software pipelining (modulo scheduling)
List Scheduling
List scheduling proceeds in steps:
1. Construct a dependence graph of the basic block. (The edges are weighted with the latency of the instruction.)
2. Use the dependence graph to determine which instructions can execute; insert them on a list, called the Ready list.
3. Use the dependence graph and the Ready list to schedule an instruction that causes the smallest possible stall; update the Ready list. Repeat until all instructions are scheduled.
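The three steps can be sketched as follows for a single-issue machine. This is an illustrative sketch, not a production scheduler: the function name `list_schedule`, the tie-breaking rule (longest latency-weighted path, then original order), and the latencies (2 cycles for a load, 1 cycle otherwise) are all assumptions made for the example. The input is the eight-instruction block for a = b + c; d = e - f.

```python
def list_schedule(latency, edges):
    """latency: {instr: cycles until its result can be used}
    edges: (producer, consumer) pairs of the dependence graph.
    Returns [(issue_cycle, instr), ...] for a single-issue machine."""
    # Step 1: dependence graph as predecessor/successor lists.
    preds = {i: [] for i in latency}
    succs = {i: [] for i in latency}
    for p, c in edges:
        preds[c].append(p)
        succs[p].append(c)

    # Tie-break priority: longest latency-weighted path to the end of the block.
    prio = {}
    def path_len(i):
        if i not in prio:
            prio[i] = latency[i] + max((path_len(s) for s in succs[i]), default=0)
        return prio[i]
    for i in latency:
        path_len(i)

    issue = {}  # instr -> cycle in which it was issued
    def earliest(i):
        return max((issue[p] + latency[p] for p in preds[i]), default=0)

    # Step 2: the Ready list starts with instructions that have no predecessors.
    ready = [i for i in latency if not preds[i]]
    schedule, cycle = [], 0
    while ready:
        # Step 3: pick the instruction that causes the smallest stall.
        best = min(ready, key=lambda i: (max(0, earliest(i) - cycle), -prio[i], i))
        cycle = max(cycle, earliest(best))  # wait here if a stall is unavoidable
        issue[best] = cycle
        schedule.append((cycle, best))
        cycle += 1
        ready.remove(best)
        for s in succs[best]:               # update the Ready list
            if all(p in issue for p in preds[s]):
                ready.append(s)
    return schedule

# Instructions 1, 2, 5, 6 are loads; 3 is add, 7 is sub; 4 and 8 are stores.
latency = {1: 2, 2: 2, 3: 1, 4: 1, 5: 2, 6: 2, 7: 1, 8: 1}
edges = [(1, 3), (2, 3), (3, 4), (5, 7), (6, 7), (7, 8)]
schedule = list_schedule(latency, edges)
order = [instr for _, instr in schedule]
cycles = [c for c, _ in schedule]
print(order, cycles)
```

With these assumed latencies the eight instructions issue in eight consecutive cycles, so the schedule contains no stalls. The exact order depends on the tie-breaking rule, so a different but equally stall-free interleaving of the two load pairs is just as valid.
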
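Code Representation for List Scheduling
Source statements: a = b + c; d = e - f

1. load R1, b
2. load R2, c
3. add R2, R1
4. store a, R2
5. load R3, e
6. load R4, f
7. sub R3, R4
8. store d, R3

Dependence graph (nodes are instruction numbers):
1, 2 -> 3 -> 4
5, 6 -> 7 -> 8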
Code Representation for List Scheduling
Source statements: a = b + c; d = e - f

Original order        Scheduled order
1. load R1, b         1. load R1, b
2. load R2, c         5. load R3, e
3. add R2, R1         2. load R2, c
4. store a, R2        6. load R4, f
5. load R3, e         3. add R2, R1
6. load R4, f         7. sub R3, R4
7. sub R3, R4         4. store a, R2
8. store d, R3        8. store d, R3

Dependence graph (nodes are instruction numbers):
1, 2 -> 3 -> 4
5, 6 -> 7 -> 8

Now we have a schedule that requires no stalls and no NOPs.
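The claim that the reordered code needs no stalls can be checked with a small in-order issue model. This is a sketch under stated assumptions: a single-issue machine, a 2-cycle load latency, 1 cycle for every other instruction, and a hypothetical helper named `count_stalls`.

```python
def count_stalls(order, latency, edges):
    """Count stall cycles when instructions issue in the given order,
    one per cycle, each waiting until its operands are available.
    Assumes every producer appears before its consumers in `order`."""
    preds = {}
    for producer, consumer in edges:
        preds.setdefault(consumer, []).append(producer)
    issue = {}
    cycle = 0
    stalls = 0
    for instr in order:
        ready_at = max((issue[p] + latency[p] for p in preds.get(instr, [])),
                       default=0)
        if ready_at > cycle:        # operands not ready yet: insert stall cycles
            stalls += ready_at - cycle
            cycle = ready_at
        issue[instr] = cycle
        cycle += 1
    return stalls

# Assumed latencies: loads (1, 2, 5, 6) take 2 cycles, everything else 1.
latency = {1: 2, 2: 2, 3: 1, 4: 1, 5: 2, 6: 2, 7: 1, 8: 1}
edges = [(1, 3), (2, 3), (3, 4), (5, 7), (6, 7), (7, 8)]

original_stalls = count_stalls([1, 2, 3, 4, 5, 6, 7, 8], latency, edges)
scheduled_stalls = count_stalls([1, 5, 2, 6, 3, 7, 4, 8], latency, edges)
print(original_stalls, scheduled_stalls)
```

Under these assumptions the original order stalls once before the add and once before the sub, while the scheduled order fills both slots with an independent load and incurs no stall cycles.
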
Problem and Solution
Register allocation conflict: use of the same register creates anti-dependencies that restrict scheduling.
• Register allocation before scheduling – prevents good scheduling
• Scheduling before register allocation – spills destroy the schedule
Solution: schedule the abstract assembly, allocate registers, then schedule again.
Trace Scheduling
Steps involved in trace scheduling:
• Trace selection
  – Find the most common trace of basic blocks.
• Trace compaction
  – Combine the basic blocks in the trace and schedule them as one block.
  – Create clean-up code in case execution goes off-trace.
• Parallelism across IF branches vs. LOOP branches
• Can provide a speedup if static prediction is accurate
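Trace selection can be sketched as a walk along the most probable edges of the control-flow graph. The version below is deliberately simplified: it starts at the entry block and only grows forward, and the block names and branch probabilities are made up for illustration (in practice they would come from profiling or static prediction). The `select_trace` name is hypothetical.

```python
def select_trace(cfg, entry):
    """cfg: {block: {successor: probability of taking that edge}}.
    Follows the most likely successor from `entry` until the path ends
    or would revisit a block (for example, a loop back edge)."""
    trace, seen = [], set()
    block = entry
    while block is not None and block not in seen:
        trace.append(block)
        seen.add(block)
        succs = cfg.get(block, {})
        block = max(succs, key=succs.get) if succs else None
    return trace

# Hypothetical if-then-else diamond: B1 branches to B2 (90%) or B3 (10%).
cfg = {
    "B1": {"B2": 0.9, "B3": 0.1},
    "B2": {"B4": 1.0},
    "B3": {"B4": 1.0},
    "B4": {},
}
trace = select_trace(cfg, "B1")
off_trace = [b for b in cfg if b not in trace]
print(trace, off_trace)
```

The blocks in `trace` would then be compacted and scheduled as one block, while blocks in `off_trace` are where clean-up (compensation) code is needed if execution leaves the trace.
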
How Trace Scheduling Works
• Look for the highest-priority path and trace its blocks.
• After tracing the priority blocks, schedule them first and schedule the rest in parallel with them.
• The blocks are traced according to their priority.
• Create large extended basic blocks by duplication, then schedule the larger blocks.
• In its final stage, the block diagram shows the parallelism across the branches.
[Figures: control-flow block diagrams for each step, including the creation of extended basic blocks]
Limitations of Trace Scheduling
The optimization depends on the traces being the dominant paths in the program's control flow. Therefore, the following two things should be true:
– Programs should exhibit skewed branch behavior at run time for typical mixes of input data.
– We should have access to this information at compile time, which is not so easy.
Software Pipelining
• In software pipelining, iterations of a loop in the source program are continuously initiated at constant intervals, before the preceding iterations complete, thus taking advantage of the parallelism in the data path.
• It can also be described as scheduling the operations within an iteration such that the iterations can be pipelined to yield optimal throughput.
• The sequence of instructions before the steady state is called the PROLOG, and the sequence after the steady state is called the EPILOG.
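The prolog, steady state, and epilog can be made concrete with a two-stage pipeline for the loop `sum += a[i]`, where stage 1 loads an element and stage 2 adds it. The sketch below is only a model: Python runs sequentially, so the "overlap" is conceptual and is recorded in a log, and `pipelined_sum` is a hypothetical name.

```python
def pipelined_sum(a):
    """Two-stage software pipeline for `sum += a[i]`.
    Stage 1 loads a[i]; stage 2 adds the value loaded one iteration earlier."""
    n = len(a)
    total = 0
    log = []
    if n == 0:
        return total, log

    # PROLOG: start iteration 0 (load only, nothing to add yet).
    loaded = a[0]
    log.append("prolog: load a[0]")

    # STEADY STATE: the add of iteration i-1 overlaps the load of iteration i.
    for i in range(1, n):
        next_loaded = a[i]
        total += loaded
        log.append(f"kernel: add a[{i-1}] | load a[{i}]")
        loaded = next_loaded

    # EPILOG: finish the last iteration (add only, nothing left to load).
    total += loaded
    log.append(f"epilog: add a[{n-1}]")
    return total, log

total, log = pipelined_sum([3, 1, 4, 1, 5])
for line in log:
    print(line)
```

Each kernel step pairs work from two different iterations, which is what lets a machine with separate load and add units keep both busy; the prolog and epilog are the extra code that fills and drains the pipeline.
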
Software Pipelining Example
• Source code:
    for (i = 0; i < n; i++) sum += a[i]
• Loop body in assembly:
    r1 = L r0
    ---             ; stall
    r2 = Add r2,r1
    r0 = Add r0,4
• Unroll loop & allocate registers:
    r1 = L r0
    ---             ; stall
    r2 = Add r2,r1
    r0 = Add r0,12
    r4 = L r3
    ---             ; stall
    r2 = Add r2,r4
    r3 = Add r3,12
    r7 = L r6
    ---             ; stall
    r2 = Add r2,r7
    r6 = Add r6,12
    r10 = L r9
    ---             ; stall
    r2 = Add r2,r10
    r9 = Add r9,12
Constraints in Software Pipelining
• Recurrence constraints: determined by loop-carried data dependencies.
• Resource constraints: determined by total resource requirements.
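In the usual formulation of modulo scheduling (not spelled out on the slide), these two constraints give lower bounds on the initiation interval: a resource bound, often written ResMII, and a recurrence bound, RecMII, with the minimum initiation interval being the larger of the two. The sketch below computes them; the function names, the resource names, and the example loop description are assumptions for illustration.

```python
from math import ceil

def res_mii(uses, units):
    """Resource bound: for each resource, operations per iteration divided
    by the number of units of that resource, rounded up."""
    return max(ceil(uses[r] / units[r]) for r in uses)

def rec_mii(recurrences):
    """Recurrence bound: for each loop-carried dependence cycle, total
    latency around the cycle divided by its iteration distance, rounded up."""
    return max((ceil(lat / dist) for lat, dist in recurrences), default=0)

def min_initiation_interval(uses, units, recurrences):
    return max(res_mii(uses, units), rec_mii(recurrences))

# Assumed description of the loop `sum += a[i]`:
# one load and two ALU operations (the add and the pointer increment) per
# iteration; both recurrences (sum and pointer) have latency 1, distance 1.
uses = {"mem": 1, "alu": 2}
recurrences = [(1, 1), (1, 1)]

mii_one_alu = min_initiation_interval(uses, {"mem": 1, "alu": 1}, recurrences)
mii_two_alus = min_initiation_interval(uses, {"mem": 1, "alu": 2}, recurrences)
print(mii_one_alu, mii_two_alus)
```

With a single ALU the loop is resource-bound and a new iteration can start at best every 2 cycles; adding a second ALU lowers the bound to 1 cycle, at which point the recurrence on `sum` becomes the limit.
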
Remarks on Software Pipelining
• Innermost loops, loops with larger trip counts, and loops without conditionals can be software pipelined.
• Code size increases due to the prolog and epilog.
• Code size increases due to unrolling for MVE (Modulo Variable Expansion).
• Software-pipelined loops call for their own register allocation strategies.
• Loops with conditionals can be software pipelined if predicated execution is supported.
  – Higher resource requirement, but an efficient schedule