INSTRUCTION LEVEL
PARALLELISM
PRESENTED BY KAMRAN ASHRAF
13-NTU-4009
INTRODUCTION
 Instruction-level parallelism (ILP) is a
measure of how many operations in a
computer program can be performed
in parallel at the same time.
WHAT IS A PARALLEL INSTRUCTION?
 Parallel instructions are a set of instructions that do not depend on each other
to be executed.
 Hierarchy
 Bit level Parallelism
• 16 bit add on 8 bit processor
 Instruction level Parallelism
 Loop level Parallelism
• for (i=1; i<=1000; i= i+1)
x[i] = x[i] + y[i];
 Thread level Parallelism
• multi-core computers
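The loop-level case above can be sketched directly: each iteration of the slide's loop reads and writes only index i, so the iterations are independent and can run in any order. A minimal sketch in Python (the concrete values of x and y here are made up for illustration):

```python
# Loop-level parallelism sketch: each iteration of
#   for (i=1; i<=1000; i=i+1) x[i] = x[i] + y[i];
# touches only index i, so iterations do not depend on each other.
x = list(range(1001))
y = [2 * i for i in range(1001)]

# Running the iterations in program order...
seq = x[:]
for i in range(1, 1001):
    seq[i] = seq[i] + y[i]

# ...and in reverse order gives the same result, which is what makes
# the loop safe to split across parallel execution units.
rev = x[:]
for i in range(1000, 0, -1):
    rev[i] = rev[i] + y[i]

assert seq == rev
```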
EXAMPLE
Consider the following program:
1. e = a + b
2. f = c + d
3. g = e * f
 Operation 3 depends on the results of "e" and "f" which are calculated from operations 1 and
2, so "g" cannot be calculated until both of "e" and "f" are computed.
 However, operations 1 and 2 do not depend on any other operation, so they can be
computed simultaneously.
 If we assume that each operation completes in one unit of time, then these three
instructions finish in a total of two units of time, giving an ILP factor of 3/2;
that is, 1.5 times faster than without ILP.
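The scheduling argument above can be sketched in a few lines: an operation can start one time unit after the latest of its inputs is ready, and the ILP factor is the number of operations divided by the total time.

```python
# Dependency scheduling sketch for the 3-operation program:
#   e = a + b ; f = c + d ; g = e * f
# g depends on e and f; e and f depend on nothing.
deps = {"e": [], "f": [], "g": ["e", "f"]}

step = {}
for op in ["e", "f", "g"]:  # listed in dependency order
    # An op finishes one unit after all of its inputs are available.
    step[op] = 1 + max((step[d] for d in deps[op]), default=0)

total_time = max(step.values())       # 2 time units
ilp_factor = len(deps) / total_time   # 3 ops / 2 units = 1.5
```

So e and f both finish at step 1, g at step 2, recovering the 3/2 figure from the slide.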
WHY ILP?
 One of the goals of compilers and processors designers is to use as much ILP as
possible.
 Ordinary programs are written to execute instructions in sequence: one after the other, in
the order written by the programmer.
 ILP allows the compiler and the processor to overlap the execution of multiple
instructions or even to change the order in which instructions are executed.
ILP TECHNIQUES
Micro-architectural techniques that use ILP include:
 Instruction pipelining
 Superscalar
 Out-of-order execution
 Register renaming
 Speculative execution
 Branch prediction
INSTRUCTION PIPELINE
 An instruction pipeline is a technique
used in the design of modern
microprocessors, microcontrollers and
CPUs to increase their instruction
throughput (the number of instructions
that can be executed in a unit of time).
PIPELINING
 The main idea is to divide the processing of a CPU instruction
into a series of independent steps ("microinstructions") with
storage at the end of each step.
 This allows the CPU's control logic to issue instructions at the
processing rate of the slowest step, which is much faster than
the time needed to process each instruction as a single step.
EXAMPLE
 For example, the RISC pipeline is broken into five stages with a set of flip-flops between
each stage, as follows:
 Instruction fetch
 Instruction decode & register fetch
 Execute
 Memory access
 Register write back
 In the classic pipeline diagram, the vertical axis is successive instructions and the
horizontal axis is time. In any given clock cycle (one column), the earliest instruction
is in the WB stage and the latest instruction is undergoing instruction fetch.
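The diagram described above follows a simple rule: instruction k enters the pipeline k cycles after instruction 0, so its stage in any cycle is fixed by the offset between the two. A minimal sketch, assuming an ideal pipeline with no stalls:

```python
# Ideal 5-stage RISC pipeline: which stage is each instruction in
# at a given cycle? (Assumes one instruction issued per cycle, no stalls.)
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_of(instr, cycle):
    """Stage of instruction `instr` (0-based) at `cycle` (1-based),
    or None if the instruction is not in the pipeline that cycle."""
    idx = cycle - 1 - instr
    return STAGES[idx] if 0 <= idx < len(STAGES) else None

# In cycle 5 the earliest instruction is writing back while the
# fifth instruction is being fetched, as in the slide's diagram.
assert stage_of(0, 5) == "WB"
assert stage_of(4, 5) == "IF"
```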
SUPERSCALAR
 A superscalar CPU architecture
implements ILP inside a single processor
which allows faster CPU throughput at the
same clock rate.
WHY SUPERSCALAR?
 A superscalar processor executes more than one instruction during a clock
cycle.
 It does this by simultaneously dispatching multiple instructions to multiple redundant
functional units built inside the processor.
 Each functional unit is not a separate CPU core but an execution resource
inside the CPU such as an arithmetic logic unit, floating point unit (FPU), a
bit shifter, or a multiplier.
EXAMPLE
 Simple superscalar pipeline. By fetching and dispatching two instructions at a time, a
maximum of two instructions per cycle can be completed.
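The throughput gain can be sketched with simple arithmetic: after the pipeline fills, a w-wide machine completes up to w instructions per cycle. A back-of-the-envelope sketch, assuming a 5-stage pipeline and no hazards (an idealized assumption):

```python
import math

def cycles_to_complete(n_instructions, issue_width, pipeline_depth=5):
    """Idealized completion time: the first group of instructions
    finishes after `pipeline_depth` cycles, then `issue_width`
    instructions complete per cycle (no stalls or dependencies)."""
    return pipeline_depth + math.ceil(n_instructions / issue_width) - 1

# 10 independent instructions: a 2-wide superscalar pipeline finishes
# well ahead of a scalar one at the same clock rate.
assert cycles_to_complete(10, 1) == 14
assert cycles_to_complete(10, 2) == 9
```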
OUT-OF-ORDER EXECUTION
 Out-of-order execution (OoOE) is a technique
used in most high-performance microprocessors.
 The key concept is to allow the processor to
avoid a class of delays that occur when the data
needed to perform an operation are unavailable.
 Most modern CPU designs include support for out-of-order execution.
STEPS
 Out-of-order processors break up the processing of instructions into these steps:
 Instruction fetch.
 Instruction dispatch to an instruction queue (also called an instruction buffer).
 The instruction waits in the queue until its input operands are available.
 The instruction is issued to the appropriate functional unit and executed by that unit.
 The results are queued in a re-order buffer (ROB).
 A result is written back to the register file only after all older instructions
have written back their results.
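The steps above can be sketched as a tiny scheduler: each instruction starts as soon as its operands are ready, so an independent instruction can issue before an earlier, stalled one. The three-instruction program and latencies below are hypothetical values chosen for illustration:

```python
# Out-of-order issue sketch: (operation, destination, source registers).
program = [
    ("load", "r1", []),      # a slow load (hypothetical 3-cycle latency)
    ("add",  "r2", ["r1"]),  # depends on the load, must wait
    ("mul",  "r3", []),      # independent: can issue before the add
]
latency = {"load": 3, "add": 1, "mul": 1}

ready_at = {}   # cycle at which each register's value becomes available
start = {}      # cycle at which each instruction can begin executing
for op, dest, srcs in program:
    s = max((ready_at[r] for r in srcs), default=0)
    start[dest] = s
    ready_at[dest] = s + latency[op]

# The mul issues before the add, even though it comes later in program
# order; the re-order buffer still retires r1, r2, r3 in program order.
issue_order = sorted(start, key=lambda d: start[d])
assert issue_order == ["r1", "r3", "r2"]
```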
OTHER ILP TECHNIQUES
 Register renaming is a technique used to avoid unnecessary serialization of
program operations caused by the reuse of registers by those operations, in order to
enable out-of-order execution.
 Speculative execution allows the execution of complete instructions or parts of
instructions before it is known whether this execution is required.
 Branch prediction is used to avoid delays caused by waiting for control dependencies to be
resolved. Branch prediction determines whether a conditional branch (jump) in the
instruction flow of a program is likely to be taken or not.
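Register renaming, the first technique above, can be sketched in a few lines: each write is given a fresh physical register, so a later instruction that reuses an architectural register no longer conflicts with an earlier reader. The two-instruction program here is a hypothetical example:

```python
# Register renaming sketch. Before renaming, instruction 2 writes r2,
# which instruction 1 reads: a write-after-read (WAR) hazard that would
# force serialization in an out-of-order machine.
program = [
    ("r1", ["r2", "r3"]),   # r1 = r2 + r3
    ("r2", ["r4", "r5"]),   # r2 = r4 + r5   (WAR hazard on r2)
]

rename = {}        # architectural register -> current physical register
next_phys = 0
renamed = []
for dest, srcs in program:
    srcs_p = [rename.get(s, s) for s in srcs]  # read current mappings
    phys = f"p{next_phys}"                     # fresh register per write
    next_phys += 1
    rename[dest] = phys
    renamed.append((phys, srcs_p))

# After renaming, instruction 2 writes p1 instead of r2, so the two
# instructions share no registers and can execute in either order.
assert renamed == [("p0", ["r2", "r3"]), ("p1", ["r4", "r5"])]
```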
THANKS
