1
 Very long instruction word or VLIW refers to a processor
architecture designed to take advantage of instruction level
parallelism(ILP).
 Whereas conventional processors mostly only allow programs that
specify instructions to be executed one after another.
 A VLIW processor allows programs that can explicitly specify
instructions to be executed at the same time (i.e. in parallel).
 This type of processor architecture is intended to allow higher
performance without the inherent complexity of some other
approaches.
 The term VLIW, and the concept of VLIW architecture itself, were
invented by Josh Fisher in his research group at Yale University in
the early 1980s.
Very long instruction word
2
 Explicitly parallel instruction computing (EPIC) is a term coined in
1997 by the HP–Intel alliance to describe a computing paradigm that
researchers had been investigating since the early 1980s.
 This paradigm is also called Independence architectures. It was the
basis for Intel and HP development of the Intel Itanium architecture,
and HP later asserted that "EPIC" was merely an old term for the
Itanium architecture.
 EPIC permits microprocessors to execute software instructions in
parallel by using the compiler, rather than complex on-die circuitry,
to control parallel instruction execution.
 This was intended to allow simple performance scaling without
resorting to higher clock frequencies.
Explicitly parallel instruction computing
3
 VLIW (very long instruction word):
- Compiler packs a fixed number of instructions into a single VLIW
instruction.
- The instructions within a VLIW instruction are issued and executed in
parallel
Example: High-end signal processors (TMS320C6201)
 EPIC (explicit parallel instruction computing):
Evolution of VLIW
Example: Intel’s IA-64, exemplified by the Itanium processor
VLIW or EPIC
4
VLIW
 VLIW (very long instruction word)
processors use a long instruction word that
contains a usually fixed number of
operations that are fetched, decoded,
issued, and executed synchronously.
 All operations specified within a VLIW
instruction must be independent of one
another.
5
VLIW
 Some of the key issues of a (V)LIW processor:
• (very) long instruction word (up to 1 024 bits per
instruction),
• each instruction consists of multiple independent
parallel operations,
• each operation requires a statically known number
of cycles to complete,
• a central controller that issues a long instruction
word every cycle,
• multiple FUs connected through a global shared
register file.
6
VLIW
 sequential stream of long instruction words
 instructions scheduled statically by the
compiler
 number of simultaneously issued
instructions is fixed during compile-time
 instruction issue is less complicated than in
a superscalar processor
7
VLIW
 Disadvantage: VLIW processors cannot react on dynamic
events,
e.g. cache misses, with the same flexibility like superscalars.
 The number of instructions in a VLIW instruction word is
usually fixed.
 Padding VLIW instructions with no-ops is needed in case
the full issue bandwidth is not be met. This increases code
size. More recent VLIW architectures use a denser code
format which allows to remove the no-ops.
 VLIW is an architectural technique, whereas superscalar is
a microarchitecture technique.
 VLIW processors take advantage of spatial parallelism.
8
EPIC: a paradigm shift
 Superscalar RISC solution
• Based on sequential execution semantics
• Compiler’s role is limited by the instruction set architecture
• Superscalar hardware identifies and exploits parallelism
 EPIC solution – (the evolution of VLIW)
• Based on parallel execution semantics
• EPIC ISA enhancements support static parallelization
• Compiler takes greater responsibility for exploiting parallelism
• Compiler / hardware collaboration often resembles superscalar
9
EPIC: a paradigm shift
 Advantages of pursuing EPIC architectures
• Make wide issue & deep latency less expensive in
hardware
• Allow processor parallelism to scale with additional
VLSI density
 Architect the processor to do well with in-order execution
• Enhance the ISA to allow static parallelization
• Use compiler technology to parallelize program
• However, a purely static VLIW is not appropriate for
general-purpose use
10
The fusion of VLIW and superscalar techniques
 Superscalars need improved support for static parallelization
• Static scheduling
• Limited support for predicated execution
 VLIWs need improved support for dynamic parallelization
• Caches introduce dynamically changing memory latency
• Compatibility: issue width and latency may change with new
hardware
• Application requirements - e.g. object oriented programming with
dynamic binding
 EPIC processors exhibit features derived from both
• Interlock & out-of-order execution hardware are compatible with
EPIC (but not required!)
• EPIC processors can use dynamic translation to parallelize in
software
11
Many EPIC features are taken from VLIWs
Minisupercomputer products stimulated VLIW research (FPS,
Multiflow, Cydrome)
Minisupercomputers were specialized, costly, and short-lived
Traditional VLIWs not suited to general purpose computing
VLIW resurgence in single chip DSP & media processors
Minisupercomputers exaggerated forward-looking challenges:
Long latency
Wide issue
Large number of architected registers
Compile-time scheduling to exploit exotic amounts of parallelism
EPIC exploits many VLIW techniques
12
Shortcomings of early VLIWs
 Expensive multi-chip implementations
 No data cache
 Poor "scalar" performance
 No strategy for object code compatibility
13
EPIC design challenges
 Develop architectures applicable to general-purpose
computing
• Find substantial parallelism in “difficult to parallelize”
scalar programs
• Provide compatibility across hardware generations
• Support emerging applications (e.g. multimedia)
 Compiler must find or create sufficient ILP
 Combine the best attributes of VLIW & superscalar RISC
(incorporated best concepts from all available sources)
 Scale architectures for modern single-chip implementation
14
EPIC Processors, Intel's IA-64 ISA and Itanium
 Joint R&D project by Hewlett-Packard and Intel
(announced in June 1994)
 This resulted in explicitly parallel instruction
computing (EPIC) design style:
• specifying ILP explicit in the machine code, that is, the
parallelism is encoded directly into the instructions
similarly to VLIW;
• a fully predicated instruction set;
• an inherently scalable instruction set (i.e., the ability to
scale to a lot of FUs);
• many registers;
• speculative execution of load instructions
15
IA-64 Architecture
 Unique architecture features & enhancements
• Explicit parallelism and templates
• Predication, speculation, memory support, and
others
• Floating-point and multimedia architecture
 IA-64 resources available to applications
• Large, application visible register set
• Rotating registers, register stack, register stack
engine
 IA-32 & PA-RISC compatibility models
16
Today’s Architecture Challenges
 Performance barriers :
• Memory latency
• Branches
• Loop pipelining and call / return overhead
 Headroom constraints :
• Hardware-based instruction scheduling
- Unable to efficiently schedule parallel execution
• Resource constrained
- Too few registers
- Unable to fully utilize multiple execution units
 Scalability limitations :
• Memory addressing efficiency
17
Intel's IA-64 ISA
 Intel 64-bit Architecture (IA-64) register model:
• 128, 64-bit general purpose registers GR0-GR127
to hold values for integer and multimedia computations
- each register has one additional NaT (Not a Thing) bit to
indicate whether the value stored is valid,
• 128, 82-bit floating-point registers FR0-FR127
- registers f0 and f1 are read-only with values +0.0 and +1.0,
• 64, 1-bit predicate registers P0-PR63
- the first register p0 is read-only and always reads 1 (true)
• 8, 64-bit branch registers BR0-BR7 to specify the target
addresses of indirect branches
18
IA-64’s Large Register File
BR7
BR0
Branch
Registers
63 0
96 Stacked, Rotating
GR1
GR31
GR127
GR32
GR0
NaT 32 Static
0
Integer Registers
63 0
Predicate
Registers
1
PR1
PR63
PR0
PR15
PR16
48 Rotating
16 Static
bit 0
96 Rotating
GR1
GR31
GR127
GR32
GR0
32 Static
0.0
Floating-Point
Registers
81 0
19
Intel's IA-64 ISA
• IA-64 instructions are 41-bit (previously stated 40 bit) long and consist of
- op-code,
- predicate field (6 bits),
- two source register addresses (7 bits each),
- destination register address (7 bits), and
- special fields (includes integer and floating-point arithmetic).
• The 6-bit predicate field in each IA-64 instruction refers to a set of 64 predicate
registers.
• 6 types of instructions
- A: Integer ALU ==> I-unit or M-unit
- I: Non-ALU integer ==> I-unit
- M: Memory ==> M-unit
- B: Branch ==> B-unit
- F: Floating-point ==> F-unit
- L: Long Immediate ==> I-unit
• IA-64 instructions are packed by compiler into bundles.
20
IA-64 Bundles
 A bundle is a 128-bit long instruction word (LIW) containing three 41-bit IA-64
instructions along with a so-called 5-bit template that contains instruction
grouping information.
 IA-64 does not insert no-op instructions to fill slots in the bundles.
 The template explicitly indicates :
• first 4 bits: types of instructions
• last bit (stop bit): whether the bundle can be executed in parallel with the
next bundle
• (previous literature): whether the instructions in the bundle can be executed
in parallel or if one or more must be executed serially.
 Bundled instructions don't have to be in their original program order, and they
can even represent entirely different paths of a branch.
 Also, the compiler can mix dependent and independent instructions together in a
bundle, because the template keeps track of which is which.
21
IA-64 : Explicitly Parallel Architecture
 IA-64 template specifies
• The type of operation for each instruction
- MFI, MMI, MII, MLI, MIB, MMF, MFB, MMB,
MBB, BBB
• Intra-bundle relationship
- M / MI or MI / I
• Inter-bundle relationship
 Most common combinations covered by templates
• Headroom for additional templates
 Simplifies hardware requirements
 Scales compatibly to future generations
Instruction 2
41 bits
Instruction 1
41 bits
Instruction 0
41 bits
Template
5 bits
128 bits (bundle)
M=Memory
F=Floating-point
I=Integer
L=Long Immediate
B=Branch
(MMI)Memory (M) Memory (M) Integer (I)

Vliw or epic

  • 1.
    1  Very longinstruction word or VLIW refers to a processor architecture designed to take advantage of instruction level parallelism(ILP).  Whereas conventional processors mostly only allow programs that specify instructions to be executed one after another.  A VLIW processor allows programs that can explicitly specify instructions to be executed at the same time (i.e. in parallel).  This type of processor architecture is intended to allow higher performance without the inherent complexity of some other approaches.  The term VLIW, and the concept of VLIW architecture itself, were invented by Josh Fisher in his research group at Yale University in the early 1980s. Very long instruction word
  • 2.
    2  Explicitly parallelinstruction computing (EPIC) is a term coined in 1997 by the HP–Intel alliance to describe a computing paradigm that researchers had been investigating since the early 1980s.  This paradigm is also called Independence architectures. It was the basis for Intel and HP development of the Intel Itanium architecture, and HP later asserted that "EPIC" was merely an old term for the Itanium architecture.  EPIC permits microprocessors to execute software instructions in parallel by using the compiler, rather than complex on-die circuitry, to control parallel instruction execution.  This was intended to allow simple performance scaling without resorting to higher clock frequencies. Explicitly parallel instruction computing
  • 3.
    3  VLIW (verylong instruction word): - Compiler packs a fixed number of instructions into a single VLIW instruction. - The instructions within a VLIW instruction are issued and executed in parallel Example: High-end signal processors (TMS320C6201)  EPIC (explicit parallel instruction computing): Evolution of VLIW Example: Intel’s IA-64, exemplified by the Itanium processor VLIW or EPIC
  • 4.
    4 VLIW  VLIW (verylong instruction word) processors use a long instruction word that contains a usually fixed number of operations that are fetched, decoded, issued, and executed synchronously.  All operations specified within a VLIW instruction must be independent of one another.
  • 5.
    5 VLIW  Some ofthe key issues of a (V)LIW processor: • (very) long instruction word (up to 1 024 bits per instruction), • each instruction consists of multiple independent parallel operations, • each operation requires a statically known number of cycles to complete, • a central controller that issues a long instruction word every cycle, • multiple FUs connected through a global shared register file.
  • 6.
    6 VLIW  sequential streamof long instruction words  instructions scheduled statically by the compiler  number of simultaneously issued instructions is fixed during compile-time  instruction issue is less complicated than in a superscalar processor
  • 7.
    7 VLIW  Disadvantage: VLIWprocessors cannot react on dynamic events, e.g. cache misses, with the same flexibility like superscalars.  The number of instructions in a VLIW instruction word is usually fixed.  Padding VLIW instructions with no-ops is needed in case the full issue bandwidth is not be met. This increases code size. More recent VLIW architectures use a denser code format which allows to remove the no-ops.  VLIW is an architectural technique, whereas superscalar is a microarchitecture technique.  VLIW processors take advantage of spatial parallelism.
  • 8.
    8 EPIC: a paradigmshift  Superscalar RISC solution • Based on sequential execution semantics • Compiler’s role is limited by the instruction set architecture • Superscalar hardware identifies and exploits parallelism  EPIC solution – (the evolution of VLIW) • Based on parallel execution semantics • EPIC ISA enhancements support static parallelization • Compiler takes greater responsibility for exploiting parallelism • Compiler / hardware collaboration often resembles superscalar
  • 9.
    9 EPIC: a paradigmshift  Advantages of pursuing EPIC architectures • Make wide issue & deep latency less expensive in hardware • Allow processor parallelism to scale with additional VLSI density  Architect the processor to do well with in-order execution • Enhance the ISA to allow static parallelization • Use compiler technology to parallelize program • However, a purely static VLIW is not appropriate for general-purpose use
  • 10.
    10 The fusion ofVLIW and superscalar techniques  Superscalars need improved support for static parallelization • Static scheduling • Limited support for predicated execution  VLIWs need improved support for dynamic parallelization • Caches introduce dynamically changing memory latency • Compatibility: issue width and latency may change with new hardware • Application requirements - e.g. object oriented programming with dynamic binding  EPIC processors exhibit features derived from both • Interlock & out-of-order execution hardware are compatible with EPIC (but not required!) • EPIC processors can use dynamic translation to parallelize in software
  • 11.
    11 Many EPIC featuresare taken from VLIWs Minisupercomputer products stimulated VLIW research (FPS, Multiflow, Cydrome) Minisupercomputers were specialized, costly, and short-lived Traditional VLIWs not suited to general purpose computing VLIW resurgence in single chip DSP & media processors Minisupercomputers exaggerated forward-looking challenges: Long latency Wide issue Large number of architected registers Compile-time scheduling to exploit exotic amounts of parallelism EPIC exploits many VLIW techniques
  • 12.
    12 Shortcomings of earlyVLIWs  Expensive multi-chip implementations  No data cache  Poor "scalar" performance  No strategy for object code compatibility
  • 13.
    13 EPIC design challenges Develop architectures applicable to general-purpose computing • Find substantial parallelism in “difficult to parallelize” scalar programs • Provide compatibility across hardware generations • Support emerging applications (e.g. multimedia)  Compiler must find or create sufficient ILP  Combine the best attributes of VLIW & superscalar RISC (incorporated best concepts from all available sources)  Scale architectures for modern single-chip implementation
  • 14.
    14 EPIC Processors, Intel'sIA-64 ISA and Itanium  Joint R&D project by Hewlett-Packard and Intel (announced in June 1994)  This resulted in explicitly parallel instruction computing (EPIC) design style: • specifying ILP explicit in the machine code, that is, the parallelism is encoded directly into the instructions similarly to VLIW; • a fully predicated instruction set; • an inherently scalable instruction set (i.e., the ability to scale to a lot of FUs); • many registers; • speculative execution of load instructions
  • 15.
    15 IA-64 Architecture  Uniquearchitecture features & enhancements • Explicit parallelism and templates • Predication, speculation, memory support, and others • Floating-point and multimedia architecture  IA-64 resources available to applications • Large, application visible register set • Rotating registers, register stack, register stack engine  IA-32 & PA-RISC compatibility models
  • 16.
    16 Today’s Architecture Challenges Performance barriers : • Memory latency • Branches • Loop pipelining and call / return overhead  Headroom constraints : • Hardware-based instruction scheduling - Unable to efficiently schedule parallel execution • Resource constrained - Too few registers - Unable to fully utilize multiple execution units  Scalability limitations : • Memory addressing efficiency
  • 17.
    17 Intel's IA-64 ISA Intel 64-bit Architecture (IA-64) register model: • 128, 64-bit general purpose registers GR0-GR127 to hold values for integer and multimedia computations - each register has one additional NaT (Not a Thing) bit to indicate whether the value stored is valid, • 128, 82-bit floating-point registers FR0-FR127 - registers f0 and f1 are read-only with values +0.0 and +1.0, • 64, 1-bit predicate registers P0-PR63 - the first register p0 is read-only and always reads 1 (true) • 8, 64-bit branch registers BR0-BR7 to specify the target addresses of indirect branches
  • 18.
    18 IA-64’s Large RegisterFile BR7 BR0 Branch Registers 63 0 96 Stacked, Rotating GR1 GR31 GR127 GR32 GR0 NaT 32 Static 0 Integer Registers 63 0 Predicate Registers 1 PR1 PR63 PR0 PR15 PR16 48 Rotating 16 Static bit 0 96 Rotating GR1 GR31 GR127 GR32 GR0 32 Static 0.0 Floating-Point Registers 81 0
  • 19.
    19 Intel's IA-64 ISA •IA-64 instructions are 41-bit (previously stated 40 bit) long and consist of - op-code, - predicate field (6 bits), - two source register addresses (7 bits each), - destination register address (7 bits), and - special fields (includes integer and floating-point arithmetic). • The 6-bit predicate field in each IA-64 instruction refers to a set of 64 predicate registers. • 6 types of instructions - A: Integer ALU ==> I-unit or M-unit - I: Non-ALU integer ==> I-unit - M: Memory ==> M-unit - B: Branch ==> B-unit - F: Floating-point ==> F-unit - L: Long Immediate ==> I-unit • IA-64 instructions are packed by compiler into bundles.
  • 20.
    20 IA-64 Bundles  Abundle is a 128-bit long instruction word (LIW) containing three 41-bit IA-64 instructions along with a so-called 5-bit template that contains instruction grouping information.  IA-64 does not insert no-op instructions to fill slots in the bundles.  The template explicitly indicates : • first 4 bits: types of instructions • last bit (stop bit): whether the bundle can be executed in parallel with the next bundle • (previous literature): whether the instructions in the bundle can be executed in parallel or if one or more must be executed serially.  Bundled instructions don't have to be in their original program order, and they can even represent entirely different paths of a branch.  Also, the compiler can mix dependent and independent instructions together in a bundle, because the template keeps track of which is which.
  • 21.
    21 IA-64 : ExplicitlyParallel Architecture  IA-64 template specifies • The type of operation for each instruction - MFI, MMI, MII, MLI, MIB, MMF, MFB, MMB, MBB, BBB • Intra-bundle relationship - M / MI or MI / I • Inter-bundle relationship  Most common combinations covered by templates • Headroom for additional templates  Simplifies hardware requirements  Scales compatibly to future generations Instruction 2 41 bits Instruction 1 41 bits Instruction 0 41 bits Template 5 bits 128 bits (bundle) M=Memory F=Floating-point I=Integer L=Long Immediate B=Branch (MMI)Memory (M) Memory (M) Integer (I)