E&CE 327: Digital Systems Engineering
           Course Notes
          (with Solutions)


                  Mark Aagaard
                  2011t1 (Winter)

               University of Waterloo
    Dept of Electrical and Computer Engineering
Contents

I Course Notes                                                                                                   1
1 VHDL                                                                                                            3
  1.1 Introduction to VHDL . . . . . . . . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .    3
      1.1.1 Levels of Abstraction . . . . . . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .    3
      1.1.2 VHDL Origins and History . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .    4
      1.1.3 Semantics . . . . . . . . . . . . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .    6
      1.1.4 Synthesis of a Simulation-Based Language . . . . . . . .         .   .   .   .   .   .   .   .   .    7
      1.1.5 Solution to Synthesis Sanity . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .    7
      1.1.6 Standard Logic 1164 . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .    8
  1.2 Comparison of VHDL to Other Hardware Description Languages             .   .   .   .   .   .   .   .   .    9
      1.2.1 VHDL Disadvantages . . . . . . . . . . . . . . . . . . .         .   .   .   .   .   .   .   .   .    9
      1.2.2 VHDL Advantages . . . . . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .    9
      1.2.3 VHDL and Other Languages . . . . . . . . . . . . . . . .         .   .   .   .   .   .   .   .   .   10
              1.2.3.1 VHDL vs Verilog . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   10
              1.2.3.2 VHDL vs System Verilog . . . . . . . . . . . .         .   .   .   .   .   .   .   .   .   10
              1.2.3.3 VHDL vs SystemC . . . . . . . . . . . . . . .          .   .   .   .   .   .   .   .   .   10
              1.2.3.4 Summary of VHDL Evaluation . . . . . . . . .           .   .   .   .   .   .   .   .   .   11
  1.3 Overview of Syntax . . . . . . . . . . . . . . . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   11
      1.3.1 Syntactic Categories . . . . . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   11
      1.3.2 Library Units . . . . . . . . . . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   11
      1.3.3 Entities and Architecture . . . . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   12
      1.3.4 Concurrent Statements . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   14
      1.3.5 Component Declaration and Instantiations . . . . . . . . .       .   .   .   .   .   .   .   .   .   16
      1.3.6 Processes . . . . . . . . . . . . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   16
      1.3.7 Sequential Statements . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   17
      1.3.8 A Few More Miscellaneous VHDL Features . . . . . . .             .   .   .   .   .   .   .   .   .   18
  1.4 Concurrent vs Sequential Statements . . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   18
      1.4.1 Concurrent Assignment vs Process . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   18
      1.4.2 Conditional Assignment vs If Statements . . . . . . . . .        .   .   .   .   .   .   .   .   .   18
      1.4.3 Selected Assignment vs Case Statement . . . . . . . . . .        .   .   .   .   .   .   .   .   .   19
      1.4.4 Coding Style . . . . . . . . . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   19
  1.5 Overview of Processes . . . . . . . . . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   20
      1.5.1 Combinational Process vs Clocked Process . . . . . . . .         .   .   .   .   .   .   .   .   .   22
      1.5.2 Latch Inference . . . . . . . . . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   23


          1.5.3 Combinational vs Flopped Signals . . . . . . . . . .         .   .   .   .   .   .   .   .   .   .   .   .   25
     1.6 Details of Process Execution . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   25
          1.6.1 Simple Simulation . . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   25
          1.6.2 Temporal Granularities of Simulation . . . . . . . .         .   .   .   .   .   .   .   .   .   .   .   .   26
          1.6.3 Intuition Behind Delta-Cycle Simulation . . . . . .          .   .   .   .   .   .   .   .   .   .   .   .   27
          1.6.4 Definitions and Algorithm . . . . . . . . . . . . . .         .   .   .   .   .   .   .   .   .   .   .   .   27
                 1.6.4.1 Process Modes . . . . . . . . . . . . . . .         .   .   .   .   .   .   .   .   .   .   .   .   27
                 1.6.4.2 Simulation Algorithm . . . . . . . . . . .          .   .   .   .   .   .   .   .   .   .   .   .   28
                 1.6.4.3 Delta-Cycle Definitions . . . . . . . . . .          .   .   .   .   .   .   .   .   .   .   .   .   30
          1.6.5 Example 1: Process Execution (Bamboozle) . . . . .           .   .   .   .   .   .   .   .   .   .   .   .   31
          1.6.6 Example 2: Process Execution (Flummox) . . . . .             .   .   .   .   .   .   .   .   .   .   .   .   40
          1.6.7 Example: Need for Provisional Assignments . . . .            .   .   .   .   .   .   .   .   .   .   .   .   42
          1.6.8 Delta-Cycle Simulations of Flip-Flops . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   44
     1.7 Register-Transfer-Level Simulation . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   50
          1.7.1 Overview . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   50
          1.7.2 Technique for Register-Transfer Level Simulation . .         .   .   .   .   .   .   .   .   .   .   .   .   52
          1.7.3 Examples of RTL Simulation . . . . . . . . . . . . .         .   .   .   .   .   .   .   .   .   .   .   .   53
                 1.7.3.1 RTL Simulation Example 1 . . . . . . . .            .   .   .   .   .   .   .   .   .   .   .   .   53
     1.8 VHDL and Hardware Building Blocks . . . . . . . . . . . .           .   .   .   .   .   .   .   .   .   .   .   .   58
          1.8.1 Basic Building Blocks . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   58
          1.8.2 Deprecated Building Blocks for RTL . . . . . . . .           .   .   .   .   .   .   .   .   .   .   .   .   59
                 1.8.2.1 An Aside on Flip-Flops and Latches . . .            .   .   .   .   .   .   .   .   .   .   .   .   59
                 1.8.2.2 Deprecated Hardware . . . . . . . . . . .           .   .   .   .   .   .   .   .   .   .   .   .   59
          1.8.3 Hardware and Code for Flops . . . . . . . . . . . .          .   .   .   .   .   .   .   .   .   .   .   .   60
                 1.8.3.1 Flops with Waits and Ifs . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   60
                 1.8.3.2 Flops with Synchronous Reset . . . . . .            .   .   .   .   .   .   .   .   .   .   .   .   60
                 1.8.3.3 Flops with Chip-Enable . . . . . . . . . .          .   .   .   .   .   .   .   .   .   .   .   .   61
                 1.8.3.4 Flop with Chip-Enable and Mux on Input .            .   .   .   .   .   .   .   .   .   .   .   .   61
                 1.8.3.5 Flops with Chip-Enable, Muxes, and Reset            .   .   .   .   .   .   .   .   .   .   .   .   62
          1.8.4 An Example Sequential Circuit . . . . . . . . . . .          .   .   .   .   .   .   .   .   .   .   .   .   62
     1.9 Arrays and Vectors . . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   66
     1.10 Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .   .   .   67
          1.10.1 Arithmetic Packages . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   68
          1.10.2 Shift and Rotate Operations . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   68
          1.10.3 Overloading of Arithmetic . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   68
          1.10.4 Different Widths and Arithmetic . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   69
          1.10.5 Overloading of Comparisons . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   69
          1.10.6 Different Widths and Comparisons . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   69
          1.10.7 Type Conversion . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   70
     1.11 Synthesizable vs Non-Synthesizable Code . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   71
          1.11.1 Unsynthesizable Code . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   72
                 1.11.1.1 Initial Values . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   72
                 1.11.1.2 Wait For . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   72
                 1.11.1.3 Different Wait Conditions . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   72
               1.11.1.4 Multiple “if rising_edge” in Process . . . .       .   .   .   .   .   .   .   .   .   .   .   .   73
               1.11.1.5 “if rising_edge” and “wait” in Same Process .        .   .   .   .   .   .   .   .   .   .    73
               1.11.1.6 “if rising_edge” with “else” Clause . . . . . .      .   .   .   .   .   .   .   .   .   .    74
               1.11.1.7 “if rising_edge” Inside a “for” Loop . . . . . .     .   .   .   .   .   .   .   .   .   .    74
              1.11.1.8 “wait” Inside of a “for loop” . . . . . . . . .      .   .   .   .   .   .   .   .   .   .    75
       1.11.2 Synthesizable, but Bad Coding Practices . . . . . . . . .     .   .   .   .   .   .   .   .   .   .    76
              1.11.2.1 Asynchronous Reset . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .    76
              1.11.2.2 Combinational “if-then” Without “else” . . .         .   .   .   .   .   .   .   .   .   .    77
              1.11.2.3 Bad Form of Nested Ifs . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .    77
              1.11.2.4 Deeply Nested Ifs . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .    77
       1.11.3 Synthesizable, but Unpredictable Hardware . . . . . . .       .   .   .   .   .   .   .   .   .   .    78
  1.12 Synthesizable VHDL Coding Guidelines . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .    78
       1.12.1 Signal Declarations . . . . . . . . . . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .    78
       1.12.2 Flip-Flops and Latches . . . . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   .    79
       1.12.3 Inputs and Outputs . . . . . . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   .    79
       1.12.4 Multiplexors and Tri-State Signals . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   .    79
       1.12.5 Processes . . . . . . . . . . . . . . . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .    80
       1.12.6 State Machines . . . . . . . . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   .    80
       1.12.7 Reset . . . . . . . . . . . . . . . . . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .    81
  1.13 VHDL Problems . . . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .    83
       P1.1 IEEE 1164 . . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .    83
       P1.2 VHDL Syntax . . . . . . . . . . . . . . . . . . . . . .         .   .   .   .   .   .   .   .   .   .    83
       P1.3 Flops, Latches, and Combinational Circuitry . . . . . .         .   .   .   .   .   .   .   .   .   .    85
       P1.4 Counting Clock Cycles . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .    86
       P1.5 Arithmetic Overflow . . . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .    88
       P1.6 Delta-Cycle Simulation: Pong . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .    89
       P1.7 Delta-Cycle Simulation: Baku . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .    89
       P1.8 Clock-Cycle Simulation . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .    91
       P1.9 VHDL — VHDL Behavioural Comparison: Teradactyl .                .   .   .   .   .   .   .   .   .   .    92
       P1.10 VHDL — VHDL Behavioural Comparison: Ichtyostega                .   .   .   .   .   .   .   .   .   .    93
       P1.11 Waveform — VHDL Behavioural Comparison . . . . .               .   .   .   .   .   .   .   .   .   .    95
       P1.12 Hardware — VHDL Comparison . . . . . . . . . . . .             .   .   .   .   .   .   .   .   .   .    97
       P1.13 8-Bit Register . . . . . . . . . . . . . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .    98
              P1.13.1 Asynchronous Reset . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .    98
              P1.13.2 Discussion . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .    98
              P1.13.3 Testbench for Register . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .    98
       P1.14 Synthesizable VHDL and Hardware . . . . . . . . . . .          .   .   .   .   .   .   .   .   .   .    99
       P1.15 Datapath Design . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   101
              P1.15.1 Correct Implementation? . . . . . . . . . . .         .   .   .   .   .   .   .   .   .   .   101
              P1.15.2 Smallest Area . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   104
              P1.15.3 Shortest Clock Period . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   104
2 RTL Design with VHDL                                                                                105
  2.1 Prelude to Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    .   .   .   105
      2.1.1 A Note on EDA for FPGAs and ASICs . . . . . . . . . . . . . . . .             .   .   .   105
  2.2 FPGA Background and Coding Guidelines . . . . . . . . . . . . . . . . . . .         .   .   .   106
      2.2.1 Generic FPGA Hardware . . . . . . . . . . . . . . . . . . . . . . . .         .   .   .   106
              2.2.1.1 Generic FPGA Cell . . . . . . . . . . . . . . . . . . . . .         .   .   .   106
      2.2.2 Area Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   106
              2.2.2.1 Interconnect for Generic FPGA . . . . . . . . . . . . . . .         .   .   .   112
              2.2.2.2 Blocks of Cells for Generic FPGA . . . . . . . . . . . . .          .   .   .   112
              2.2.2.3 Clocks for Generic FPGAs . . . . . . . . . . . . . . . . .          .   .   .   114
              2.2.2.4 Special Circuitry in FPGAs . . . . . . . . . . . . . . . . .        .   .   .   114
      2.2.3 Generic-FPGA Coding Guidelines . . . . . . . . . . . . . . . . . . .          .   .   .   115
  2.3 Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   .   .   .   116
      2.3.1 Generic Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   116
      2.3.2 Implementation Flows . . . . . . . . . . . . . . . . . . . . . . . . .        .   .   .   117
      2.3.3 Design Flow: Datapath vs Control vs Storage . . . . . . . . . . . . .         .   .   .   118
              2.3.3.1 Classes of Hardware . . . . . . . . . . . . . . . . . . . . .       .   .   .   118
              2.3.3.2 Datapath-Centric Design Flow . . . . . . . . . . . . . . .          .   .   .   119
              2.3.3.3 Control-Centric Design Flow . . . . . . . . . . . . . . . .         .   .   .   120
              2.3.3.4 Storage-Centric Design Flow . . . . . . . . . . . . . . . .         .   .   .   120
  2.4 Algorithms and High-Level Models . . . . . . . . . . . . . . . . . . . . . .        .   .   .   120
      2.4.1 Flow Charts and State Machines . . . . . . . . . . . . . . . . . . . .        .   .   .   121
      2.4.2 Data-Dependency Graphs . . . . . . . . . . . . . . . . . . . . . . .          .   .   .   121
      2.4.3 High-Level Models . . . . . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   122
  2.5 Finite State Machines in VHDL . . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   123
      2.5.1 Introduction to State-Machine Design . . . . . . . . . . . . . . . . .        .   .   .   123
              2.5.1.1 Mealy vs Moore State Machines . . . . . . . . . . . . . .           .   .   .   123
              2.5.1.2 Introduction to State Machines and VHDL . . . . . . . . .           .   .   .   123
              2.5.1.3 Explicit vs Implicit State Machines . . . . . . . . . . . . .       .   .   .   124
      2.5.2 Implementing a Simple Moore Machine . . . . . . . . . . . . . . . .           .   .   .   125
              2.5.2.1 Implicit Moore State Machine . . . . . . . . . . . . . . . .        .   .   .   126
              2.5.2.2 Explicit Moore with Flopped Output . . . . . . . . . . . .          .   .   .   127
              2.5.2.3 Explicit Moore with Combinational Outputs . . . . . . . .           .   .   .   128
              2.5.2.4 Explicit-Current+Next Moore with Concurrent Assignment              .   .   .   129
              2.5.2.5 Explicit-Current+Next Moore with Combinational Process              .   .   .   130
      2.5.3 Implementing a Simple Mealy Machine . . . . . . . . . . . . . . . .           .   .   .   131
              2.5.3.1 Implicit Mealy State Machine . . . . . . . . . . . . . . . .        .   .   .   132
              2.5.3.2 Explicit Mealy State Machine . . . . . . . . . . . . . . . .        .   .   .   133
              2.5.3.3 Explicit-Current+Next Mealy . . . . . . . . . . . . . . . .         .   .   .   134
      2.5.4 Reset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     .   .   .   135
      2.5.5 State Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   137
              2.5.5.1 Constants vs Enumerated Type . . . . . . . . . . . . . . .          .   .   .   137
              2.5.5.2 Encoding Schemes . . . . . . . . . . . . . . . . . . . . . .        .   .   .   138
  2.6 Dataflow Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   139
      2.6.1 Dataflow Diagrams Overview . . . . . . . . . . . . . . . . . . . . .           .   .   .   139
       2.6.2 Dataflow Diagrams, Hardware, and Behaviour . . . . .             .   .   .   .   .   .   .   .   .   .   142
       2.6.3 Dataflow Diagram Execution . . . . . . . . . . . . . . .         .   .   .   .   .   .   .   .   .   .   143
       2.6.4 Performance Estimation . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   144
       2.6.5 Area Estimation . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   144
       2.6.6 Design Analysis . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   145
       2.6.7 Area / Performance Tradeoffs . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   145
  2.7 Design Example: Massey . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   148
       2.7.1 Requirements . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   148
       2.7.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   149
       2.7.3 Initial Dataflow Diagram . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   149
       2.7.4 Dataflow Diagram Scheduling . . . . . . . . . . . . . .          .   .   .   .   .   .   .   .   .   .   150
       2.7.5 Optimize Inputs and Outputs . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   152
       2.7.6 Input/Output Allocation . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   154
       2.7.7 Register Allocation . . . . . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   156
       2.7.8 Datapath Allocation . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   157
       2.7.9 Datapath for DP+Ctrl Model . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   158
       2.7.10 Peephole Optimizations . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   160
  2.8 Design Example: Vanier . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   162
       2.8.1 Requirements . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   163
       2.8.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   163
       2.8.3 Initial Dataflow Diagram . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   164
       2.8.4 Reschedule to Meet Requirements . . . . . . . . . . . .         .   .   .   .   .   .   .   .   .   .   164
       2.8.5 Optimize Resources . . . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   165
       2.8.6 Assign Names to Registered Values . . . . . . . . . . .         .   .   .   .   .   .   .   .   .   .   167
       2.8.7 Input/Output Allocation . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   168
       2.8.8 Tangent: Combinational Outputs . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   170
       2.8.9 Register Allocation . . . . . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   171
       2.8.10 Datapath Allocation . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   173
       2.8.11 Hardware Block Diagram and State Machine . . . . . .           .   .   .   .   .   .   .   .   .   .   173
              2.8.11.1 Control for Registers . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   173
              2.8.11.2 Control for Datapath Components . . . . . . .         .   .   .   .   .   .   .   .   .   .   174
              2.8.11.3 Control for State . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   175
              2.8.11.4 Complete State Machine Table . . . . . . . .          .   .   .   .   .   .   .   .   .   .   175
       2.8.12 VHDL Code with Explicit State Machine . . . . . . . .          .   .   .   .   .   .   .   .   .   .   176
       2.8.13 Peephole Optimizations . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   179
       2.8.14 Notes and Observations . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   182
  2.9 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .   183
       2.9.1 Introduction to Pipelining . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   183
       2.9.2 Partially Pipelined . . . . . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   186
       2.9.3 Terminology . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   187
  2.10 Design Example: Pipelined Massey . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   188
  2.11 Memory Arrays and RTL Design . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   192
       2.11.1 Memory Operations . . . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   192
       2.11.2 Memory Arrays in VHDL . . . . . . . . . . . . . . . .          .   .   .   .   .   .   .   .   .   .   193
              2.11.2.1 Using a Two-Dimensional Array for Memory              .   .   .   .   .   .   .   .   .   .   193
                  2.11.2.2 Memory Arrays in Hardware . . . . . . . .           .   .   .   .   .   .   .   .   .   .   .   194
                  2.11.2.3 VHDL Code for Single-Port Memory Array              .   .   .   .   .   .   .   .   .   .   .   195
                  2.11.2.4 Using Library Components for Memory . .             .   .   .   .   .   .   .   .   .   .   .   196
                  2.11.2.5 Build Memory from Slices . . . . . . . . .          .   .   .   .   .   .   .   .   .   .   .   197
                  2.11.2.6 Dual-Ported Memory . . . . . . . . . . . .          .   .   .   .   .   .   .   .   .   .   .   199
          2.11.3 Data Dependencies . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   199
          2.11.4 Memory Arrays and Dataflow Diagrams . . . . . . . .            .   .   .   .   .   .   .   .   .   .   .   201
          2.11.5 Example: Memory Array and Dataflow Diagram . . .               .   .   .   .   .   .   .   .   .   .   .   204
     2.12 Input / Output Protocols . . . . . . . . . . . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .   .   206
     2.13 Example: Moving Average . . . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   207
          2.13.1 Requirements and Environmental Assumptions . . . .            .   .   .   .   .   .   .   .   .   .   .   207
          2.13.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   207
          2.13.3 Pseudocode and Dataflow Diagrams . . . . . . . . . .           .   .   .   .   .   .   .   .   .   .   .   210
          2.13.4 Control Tables and State Machine . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   216
          2.13.5 VHDL Code . . . . . . . . . . . . . . . . . . . . . .         .   .   .   .   .   .   .   .   .   .   .   219
     2.14 Design Problems . . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   221
          P2.1 Synthesis . . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   221
                  P2.1.1    Data Structures . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   221
                  P2.1.2    Own Code vs Libraries . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   221
          P2.2 Design Guidelines . . . . . . . . . . . . . . . . . . .         .   .   .   .   .   .   .   .   .   .   .   221
          P2.3 Dataflow Diagram Optimization . . . . . . . . . . . .            .   .   .   .   .   .   .   .   .   .   .   222
                  P2.3.1    Resource Usage . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   222
                  P2.3.2    Optimization . . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   223
          P2.4 Dataflow Diagram Design . . . . . . . . . . . . . . .            .   .   .   .   .   .   .   .   .   .   .   223
                  P2.4.1    Maximum Performance . . . . . . . . . . .          .   .   .   .   .   .   .   .   .   .   .   223
                   P2.4.2    Minimum Area . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   224
          P2.5 Michener: Design and Optimization . . . . . . . . . .           .   .   .   .   .   .   .   .   .   .   .   224
          P2.6 Dataflow Diagrams with Memory Arrays . . . . . . .               .   .   .   .   .   .   .   .   .   .   .   224
                  P2.6.1    Algorithm 1 . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   225
                  P2.6.2    Algorithm 2 . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   225
           P2.7 2-Bit Adder . . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   225
                  P2.7.1    Generic Gates . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   225
                  P2.7.2    FPGA . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   226
          P2.8 Sketches of Problems . . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   226
3 Performance Analysis and Optimization                                                                      227
  3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   .   .   .   .   .   .   227
  3.2 Defining Performance . . . . . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   227
  3.3 Comparing Performance . . . . . . . . . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   228
       3.3.1 General Equations . . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   228
       3.3.2 Example: Performance of Printers . . . . . . . . . . . . . . . .        .   .   .   .   .   .   229
  3.4 Clock Speed, CPI, Program Length, and Performance . . . . . . . . . .          .   .   .   .   .   .   233
       3.4.1 Mathematics . . . . . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   233
       3.4.2 Example: CISC vs RISC and CPI . . . . . . . . . . . . . . . .           .   .   .   .   .   .   233
       3.4.3 Effect of Instruction Set on Performance . . . . . . . . . . . . .      .   .   .   .   .   .   235
       3.4.4 Effect of Time to Market on Relative Performance . . . . . . .          .   .   .   .   .   .   237
       3.4.5 Summary of Equations . . . . . . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   238
  3.5 Performance Analysis and Dataflow Diagrams . . . . . . . . . . . . . .          .   .   .   .   .   .   239
       3.5.1 Dataflow Diagrams, CPI, and Clock Speed . . . . . . . . . . .            .   .   .   .   .   .   239
       3.5.2 Examples of Dataflow Diagrams for Two Instructions . . . . . .           .   .   .   .   .   .   240
              3.5.2.1 Scheduling of Operations for Different Clock Periods           .   .   .   .   .   .   241
              3.5.2.2 Performance Computation for Different Clock Periods            .   .   .   .   .   .   241
              3.5.2.3 Example: Two Instructions Taking Similar Time . . .            .   .   .   .   .   .   242
              3.5.2.4 Example: Same Total Time, Different Order for A . .            .   .   .   .   .   .   243
       3.5.3 Example: From Algorithm to Optimized Dataflow . . . . . . .              .   .   .   .   .   .   244
  3.6 General Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   252
       3.6.1 Strength Reduction . . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   252
              3.6.1.1 Arithmetic Strength Reduction . . . . . . . . . . . .          .   .   .   .   .   .   252
              3.6.1.2 Boolean Strength Reduction . . . . . . . . . . . . . .         .   .   .   .   .   .   252
       3.6.2 Replication and Sharing . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   253
              3.6.2.1 Mux-Pushing . . . . . . . . . . . . . . . . . . . . .          .   .   .   .   .   .   253
              3.6.2.2 Common Subexpression Elimination . . . . . . . . .             .   .   .   .   .   .   253
              3.6.2.3 Computation Replication . . . . . . . . . . . . . . .          .   .   .   .   .   .   253
       3.6.3 Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   254
  3.7 Retiming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     .   .   .   .   .   .   254
  3.8 Performance Analysis and Optimization Problems . . . . . . . . . . . .         .   .   .   .   .   .   256
       P3.1 Farmer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   256
       P3.2 Network and Router . . . . . . . . . . . . . . . . . . . . . . .         .   .   .   .   .   .   257
              P3.2.1    Maximum Throughput . . . . . . . . . . . . . . . . .         .   .   .   .   .   .   257
              P3.2.2    Packet Size and Performance . . . . . . . . . . . . .        .   .   .   .   .   .   257
       P3.3 Performance Short Answer . . . . . . . . . . . . . . . . . . . .         .   .   .   .   .   .   257
       P3.4 Microprocessors . . . . . . . . . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   257
              P3.4.1    Average CPI . . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   258
              P3.4.2    Why not you too? . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   258
              P3.4.3    Analysis . . . . . . . . . . . . . . . . . . . . . . . .     .   .   .   .   .   .   258
       P3.5 Dataflow Diagram Optimization . . . . . . . . . . . . . . . . .           .   .   .   .   .   .   258
       P3.6 Performance Optimization with Memory Arrays . . . . . . . .              .   .   .   .   .   .   259
       P3.7 Multiply Instruction . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   260
              P3.7.1    Highest Performance . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   260
              P3.7.2    Performance Metrics . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   261
4 Functional Verification                                                                               263
  4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   .   .   .   263
      4.1.1 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   263
  4.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     .   .   .   263
      4.2.1 Terminology: Validation / Verification / Testing . . . . . . . . . . . .        .   .   .   264
      4.2.2 The Difficulty of Designing Correct Chips . . . . . . . . . . . . . . .         .   .   .   265
              4.2.2.1 Notes from Kenn Heinrich (UW E&CE grad) . . . . . . .                .   .   .   265
              4.2.2.2 Notes from Aart de Geus (Chairman and CEO of Synopsys)               .   .   .   265
  4.3 Test Cases and Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   266
      4.3.1 Test Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   266
      4.3.2 Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   267
      4.3.3 Floating Point Divider Example . . . . . . . . . . . . . . . . . . . .         .   .   .   268
  4.4 Testbenches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    .   .   .   270
      4.4.1 Overview of Test Benches . . . . . . . . . . . . . . . . . . . . . . .         .   .   .   271
      4.4.2 Reference Model Style Testbench . . . . . . . . . . . . . . . . . . .          .   .   .   272
      4.4.3 Relational Style Testbench . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   272
      4.4.4 Coding Structure of a Testbench . . . . . . . . . . . . . . . . . . . .        .   .   .   273
      4.4.5 Datapath vs Control . . . . . . . . . . . . . . . . . . . . . . . . . .        .   .   .   273
      4.4.6 Verification Tips . . . . . . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   274
  4.5 Functional Verification for Datapath Circuits . . . . . . . . . . . . . . . . . .     .   .   .   274
      4.5.1 A Spec-Less Testbench . . . . . . . . . . . . . . . . . . . . . . . . .        .   .   .   275
      4.5.2 Use an Array for Test Vectors . . . . . . . . . . . . . . . . . . . . .        .   .   .   276
      4.5.3 Build Spec into Stimulus . . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   277
      4.5.4 Have Separate Specification Entity . . . . . . . . . . . . . . . . . . .        .   .   .   278
      4.5.5 Generate Test Vectors Automatically . . . . . . . . . . . . . . . . . .        .   .   .   280
      4.5.6 Relational Specification . . . . . . . . . . . . . . . . . . . . . . . .        .   .   .   280
  4.6 Functional Verification of Control Circuits . . . . . . . . . . . . . . . . . . .     .   .   .   281
      4.6.1 Overview of Queues in Hardware . . . . . . . . . . . . . . . . . . .           .   .   .   281
      4.6.2 VHDL Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . .          .   .   .   283
              4.6.2.1 Package . . . . . . . . . . . . . . . . . . . . . . . . . . .        .   .   .   283
              4.6.2.2 Other VHDL Coding . . . . . . . . . . . . . . . . . . . .            .   .   .   283
      4.6.3 Code Structure for Verification . . . . . . . . . . . . . . . . . . . . .       .   .   .   283
      4.6.4 Instrumentation Code . . . . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   284
      4.6.5 Coverage Monitors . . . . . . . . . . . . . . . . . . . . . . . . . . .        .   .   .   284
      4.6.6 Assertions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     .   .   .   287
      4.6.7 VHDL Coding Tips . . . . . . . . . . . . . . . . . . . . . . . . . . .         .   .   .   288
      4.6.8 Queue Specification . . . . . . . . . . . . . . . . . . . . . . . . . .         .   .   .   289
      4.6.9 Queue Testbench . . . . . . . . . . . . . . . . . . . . . . . . . . . .        .   .   .   290
  4.7 Example: Microwave Oven . . . . . . . . . . . . . . . . . . . . . . . . . . .        .   .   .   291
  4.8 Functional Verification Problems . . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   296
      P4.1 Carry Save Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . .        .   .   .   296
      P4.2 Traffic Light Controller . . . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   296
              P4.2.1    Functionality . . . . . . . . . . . . . . . . . . . . . . . . .    .   .   .   296
              P4.2.2    Boundary Conditions . . . . . . . . . . . . . . . . . . . .        .   .   .   296
              P4.2.3    Assertions . . . . . . . . . . . . . . . . . . . . . . . . . .     .   .   .   296
         P4.3   State Machines and Verification . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   297
                P4.3.1    Three Different State Machines      .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   297
                P4.3.2    State Machines in General . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   298
         P4.4   Test Plan Creation . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   298
                P4.4.1    Early Tests . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   299
                P4.4.2    Corner Cases . . . . . . . . . .    .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   299
         P4.5   Sketches of Problems . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   299

5 Timing Analysis                                                                                                                     301
  5.1 Delays and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . .                               .   .   .   .   .   .   301
      5.1.1 Background Definitions . . . . . . . . . . . . . . . . . . . . .                                   .   .   .   .   .   .   301
      5.1.2 Clock-Related Timing Definitions . . . . . . . . . . . . . . . .                                   .   .   .   .   .   .   302
             5.1.2.1 Clock Skew . . . . . . . . . . . . . . . . . . . . . .                                   .   .   .   .   .   .   302
             5.1.2.2 Clock Latency . . . . . . . . . . . . . . . . . . . . .                                  .   .   .   .   .   .   303
             5.1.2.3 Clock Jitter . . . . . . . . . . . . . . . . . . . . . . .                               .   .   .   .   .   .   303
      5.1.3 Storage-Related Timing Definitions . . . . . . . . . . . . . . .                                   .   .   .   .   .   .   304
             5.1.3.1 Flops and Latches . . . . . . . . . . . . . . . . . . .                                  .   .   .   .   .   .   304
             5.1.3.2 Timing Parameters for a Flop . . . . . . . . . . . . .                                   .   .   .   .   .   .   305
             5.1.3.3 Hold Time . . . . . . . . . . . . . . . . . . . . . . .                                  .   .   .   .   .   .   305
             5.1.3.4 Clock-to-Q Time . . . . . . . . . . . . . . . . . . . .                                  .   .   .   .   .   .   305
      5.1.4 Propagation Delays . . . . . . . . . . . . . . . . . . . . . . . .                                .   .   .   .   .   .   306
             5.1.4.1 Load Delays . . . . . . . . . . . . . . . . . . . . . .                                  .   .   .   .   .   .   306
             5.1.4.2 Interconnect Delays . . . . . . . . . . . . . . . . . .                                  .   .   .   .   .   .   306
      5.1.5 Summary of Delay Factors . . . . . . . . . . . . . . . . . . . .                                  .   .   .   .   .   .   307
      5.1.6 Timing Constraints . . . . . . . . . . . . . . . . . . . . . . . .                                .   .   .   .   .   .   307
             5.1.6.1 Minimum Clock Period . . . . . . . . . . . . . . . .                                     .   .   .   .   .   .   308
             5.1.6.2 Hold Constraint . . . . . . . . . . . . . . . . . . . .                                  .   .   .   .   .   .   309
             5.1.6.3 Example Timing Violations . . . . . . . . . . . . . .                                    .   .   .   .   .   .   309
  5.2 Timing Analysis of Latches and Flip Flops . . . . . . . . . . . . . . . .                               .   .   .   .   .   .   311
      5.2.1 Simple Multiplexer Latch . . . . . . . . . . . . . . . . . . . .                                  .   .   .   .   .   .   311
             5.2.1.1 Structure and Behaviour of Multiplexer Latch . . . .                                     .   .   .   .   .   .   311
             5.2.1.2 Strategy for Timing Analysis of Storage Devices . . .                                    .   .   .   .   .   .   313
             5.2.1.3 Clock-to-Q Time of a Multiplexer Latch . . . . . . .                                     .   .   .   .   .   .   314
             5.2.1.4 Setup Timing of a Multiplexer Latch . . . . . . . . .                                    .   .   .   .   .   .   315
             5.2.1.5 Hold Time of a Multiplexer Latch . . . . . . . . . . .                                   .   .   .   .   .   .   323
             5.2.1.6 Example of a Bad Latch . . . . . . . . . . . . . . . .                                   .   .   .   .   .   .   326
      5.2.2 Timing Analysis of Transmission-Gate Latch . . . . . . . . . .                                    .   .   .   .   .   .   326
             5.2.2.1 Structure and Behaviour of a Transmission Gate . . .                                     .   .   .   .   .   .   327
             5.2.2.2 Structure and Behaviour of Transmission-Gate Latch                                       .   .   .   .   .   .   327
             5.2.2.3 Clock-to-Q Delay for Transmission-Gate Latch . . .                                       .   .   .   .   .   .   328
             5.2.2.4 Setup and Hold Times for Transmission-Gate Latch .                                       .   .   .   .   .   .   328
      5.2.3 Falling Edge Flip Flop . . . . . . . . . . . . . . . . . . . . . .                                .   .   .   .   .   .   328
             5.2.3.1 Structure and Behaviour of Flip-Flop . . . . . . . . .                                   .   .   .   .   .   .   329
             5.2.3.2 Clock-to-Q of Flip-Flop . . . . . . . . . . . . . . . .                                  .   .   .   .   .   .   330
             5.2.3.3 Setup of Flip-Flop . . . . . . . . . . . . . . . . . . .                                 .   .   .   .   .   .   331
                   5.2.3.4 Hold of Flip-Flop . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   332
          5.2.4 Timing Analysis of FPGA Cells . . . . . . . . . . . . . . . .          .   .   .   .   .   .   .   332
                   5.2.4.1 Standard Timing Equations . . . . . . . . . . . . .         .   .   .   .   .   .   .   333
                   5.2.4.2 Hierarchical Timing Equations . . . . . . . . . . .         .   .   .   .   .   .   .   333
                   5.2.4.3 Actel Act 2 Logic Cell . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   333
                   5.2.4.4 Timing Analysis of Actel Sequential Module . . . .          .   .   .   .   .   .   .   335
          5.2.5 Exotic Flop . . . . . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   336
    5.3   Critical Paths and False Paths . . . . . . . . . . . . . . . . . . . . . .   .   .   .   .   .   .   .   336
          5.3.1 Introduction to Critical and False Paths . . . . . . . . . . . .       .   .   .   .   .   .   .   336
                   5.3.1.1 Example of Critical Path in Full Adder . . . . . . .        .   .   .   .   .   .   .   338
                   5.3.1.2 Preliminaries for Critical Paths . . . . . . . . . . .      .   .   .   .   .   .   .   340
                   5.3.1.3 Longest Path and Critical Path . . . . . . . . . . .        .   .   .   .   .   .   .   340
                   5.3.1.4 Timing Simulation vs Static Timing Analysis . . . .         .   .   .   .   .   .   .   343
          5.3.2 Longest Path . . . . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   343
          5.3.3 Detecting a False Path . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   345
                   5.3.3.1 Preliminaries for Detecting a False Path . . . . . .        .   .   .   .   .   .   .   345
                   5.3.3.2 Almost-Correct Algorithm to Detect a False Path . .         .   .   .   .   .   .   .   349
                   5.3.3.3 Examples of Detecting False Paths . . . . . . . . .         .   .   .   .   .   .   .   349
          5.3.4 Finding the Next Candidate Path . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   354
                   5.3.4.1 Algorithm to Find Next Candidate Path . . . . . . .         .   .   .   .   .   .   .   354
                   5.3.4.2 Examples of Finding Next Candidate Path . . . . .           .   .   .   .   .   .   .   355
          5.3.5 Correct Algorithm to Find Critical Path . . . . . . . . . . . .        .   .   .   .   .   .   .   362
                   5.3.5.1 Rules for Late Side Inputs . . . . . . . . . . . . . .      .   .   .   .   .   .   .   362
                   5.3.5.2 Monotone Speedup . . . . . . . . . . . . . . . . .          .   .   .   .   .   .   .   364
                   5.3.5.3 Analysis of Side-Input-Causes-Glitch Situation . .          .   .   .   .   .   .   .   365
                   5.3.5.4 Complete Algorithm . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   366
                   5.3.5.5 Complete Examples . . . . . . . . . . . . . . . . .         .   .   .   .   .   .   .   367
          5.3.6 Further Extensions to Critical Path Analysis . . . . . . . . . .       .   .   .   .   .   .   .   374
          5.3.7 Increasing the Accuracy of Critical Path Analysis . . . . . . .        .   .   .   .   .   .   .   375
    5.4   Elmore Timing Model . . . . . . . . . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   375
          5.4.1 RC-Networks for Timing Analysis . . . . . . . . . . . . . . .          .   .   .   .   .   .   .   375
          5.4.2 Derivation of Analog Timing Model . . . . . . . . . . . . . .          .   .   .   .   .   .   .   380
                   5.4.2.1 Example Derivation: Equation for Voltage at Node 3          .   .   .   .   .   .   .   382
                   5.4.2.2 General Derivation . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   383
          5.4.3 Elmore Timing Model . . . . . . . . . . . . . . . . . . . . .          .   .   .   .   .   .   .   385
          5.4.4 Examples of Using Elmore Delay . . . . . . . . . . . . . . .           .   .   .   .   .   .   .   387
                   5.4.4.1 Interconnect with Single Fanout . . . . . . . . . . .       .   .   .   .   .   .   .   387
                   5.4.4.2 Interconnect with Multiple Gates in Fanout . . . . .        .   .   .   .   .   .   .   389
    5.5   Practical Usage of Timing Analysis . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   392
          5.5.1 Speed Binning . . . . . . . . . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   393
                   5.5.1.1 FPGAs, Interconnect, and Synthesis . . . . . . . .          .   .   .   .   .   .   .   394
          5.5.2 Worst Case Timing . . . . . . . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   394
                   5.5.2.1 Fanout Delay . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   394
                   5.5.2.2 Derating Factors . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   394
    5.6   Timing Analysis Problems . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   396
         P5.1   Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                            .   .   .   396
         P5.2   Hold Time Violations . . . . . . . . . . . . . . . . . . . . . . . . . .                           .   .   .   396
                P5.2.1     Cause . . . . . . . . . . . . . . . . . . . . . . . . . . . .                           .   .   .   396
                P5.2.2     Behaviour . . . . . . . . . . . . . . . . . . . . . . . . . .                           .   .   .   397
                P5.2.3     Rectification . . . . . . . . . . . . . . . . . . . . . . . . .                          .   .   .   397
         P5.3   Latch Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                           .   .   .   397
         P5.4   Critical Path and False Path . . . . . . . . . . . . . . . . . . . . . .                           .   .   .   398
         P5.5   Critical Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                          .   .   .   399
                P5.5.1     Longest Path . . . . . . . . . . . . . . . . . . . . . . . . .                          .   .   .   399
                P5.5.2     Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                         .   .   .   399
                P5.5.3     Missing Factors . . . . . . . . . . . . . . . . . . . . . . .                           .   .   .   399
                P5.5.4     Critical Path or False Path? . . . . . . . . . . . . . . . . .                          .   .   .   399
         P5.6   YACP: Yet Another Critical Path . . . . . . . . . . . . . . . . . . . .                            .   .   .   400
         P5.7   Timing Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                            .   .   .   401
         P5.8   Short Answer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                           .   .   .   402
                P5.8.1     Wires in FPGAs . . . . . . . . . . . . . . . . . . . . . . .                            .   .   .   402
                P5.8.2     Age and Time . . . . . . . . . . . . . . . . . . . . . . . .                            .   .   .   402
                P5.8.3     Temperature and Delay . . . . . . . . . . . . . . . . . . .                             .   .   .   402
         P5.9   Worst Case Conditions and Derating Factor . . . . . . . . . . . . . .                              .   .   .   402
                P5.9.1     Worst-Case Commercial . . . . . . . . . . . . . . . . . . .                             .   .   .   402
                P5.9.2     Worst-Case Industrial . . . . . . . . . . . . . . . . . . . .                           .   .   .   402
                P5.9.3     Worst-Case Industrial, Non-Ambient Junction Temperature                                 .   .   .   402

6 Power Analysis and Power-Aware Design                                                                                        403
  6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   403
      6.1.1 Importance of Power and Energy . . . . . . . .         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   403
      6.1.2 Industrial Names and Products . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   403
      6.1.3 Power vs Energy . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   404
      6.1.4 Batteries, Power and Energy . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   405
              6.1.4.1 Do Batteries Store Energy or Power?          .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   405
              6.1.4.2 Battery Life and Efficiency . . . . .         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   405
              6.1.4.3 Battery Life and Power . . . . . . .         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   406
  6.2 Power Equations . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   409
      6.2.1 Switching Power . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   410
      6.2.2 Short-Circuited Power . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   411
      6.2.3 Leakage Power . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   411
      6.2.4 Glossary . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   412
      6.2.5 Note on Power Equations . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   413
  6.3 Overview of Power Reduction Techniques . . . . . . .         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   414
  6.4 Voltage Reduction for Power Reduction . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   415
  6.5 Data Encoding for Power Reduction . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   416
      6.5.1 How Data Encoding Can Reduce Power . . . .             .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   416
      6.5.2 Example Problem: Sixteen Pulser . . . . . . .          .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   419
              6.5.2.1 Problem Statement . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   419
              6.5.2.2 Additional Information . . . . . . .         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   420
                   6.5.2.3 Answer . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   420
      6.6   Clock Gating . . . . . . . . . . . . . . . . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .   .   424
            6.6.1 Introduction to Clock Gating . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   424
            6.6.2 Implementing Clock Gating . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   425
            6.6.3 Design Process . . . . . . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   426
            6.6.4 Effectiveness of Clock Gating . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   427
            6.6.5 Example: Reduced Activity Factor with Clock Gating           .   .   .   .   .   .   .   .   .   .   .   429
            6.6.6 Clock Gating with Valid-Bit Protocol . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   431
                   6.6.6.1 Valid-Bit Protocol . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   431
                   6.6.6.2 How Many Clock Cycles for Module? . . .             .   .   .   .   .   .   .   .   .   .   .   433
                   6.6.6.3 Adding Clock-Gating Circuitry . . . . . . .         .   .   .   .   .   .   .   .   .   .   .   434
            6.6.7 Example: Pipelined Circuit with Clock-Gating . . . .         .   .   .   .   .   .   .   .   .   .   .   437
      6.7   Power Problems . . . . . . . . . . . . . . . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .   .   439
            P6.1 Short Answers . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   439
                   P6.1.1   Power and Temperature . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   439
                   P6.1.2   Leakage Power . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   439
                   P6.1.3   Clock Gating . . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   439
                   P6.1.4   Gray Coding . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   439
            P6.2 VLSI Gurus . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   439
                   P6.2.1   Effect on Power . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   439
                   P6.2.2   Critique . . . . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   440
            P6.3 Advertising Ratios . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   440
            P6.4 Vary Supply Voltage . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   440
            P6.5 Clock Speed Increase Without Power Increase . . . .           .   .   .   .   .   .   .   .   .   .   .   441
                   P6.5.1   Supply Voltage . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   441
                   P6.5.2   Supply Voltage . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   441
            P6.6 Power Reduction Strategies . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   441
                   P6.6.1   Supply Voltage . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   441
                   P6.6.2   Transistor Sizing . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   .   .   441
                   P6.6.3   Adding Registers to Inputs . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   441
                   P6.6.4   Gray Coding . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   441
            P6.7 Power Consumption on New Chip . . . . . . . . . . .           .   .   .   .   .   .   .   .   .   .   .   442
                   P6.7.1   Hypothesis . . . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   442
                   P6.7.2   Experiment . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   442
                   P6.7.3   Reality . . . . . . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   .   .   442
7 Fault Testing and Testability                                                                            443
  7.1 Faults and Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   .   .   .   .   .   443
       7.1.1 Overview of Faults and Testing . . . . . . . . . . . . . . . . . .        .   .   .   .   .   443
               7.1.1.1 Faults . . . . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   443
               7.1.1.2 Causes of Faults . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   443
               7.1.1.3 Testing . . . . . . . . . . . . . . . . . . . . . . . . . .     .   .   .   .   .   444
               7.1.1.4 Burn In . . . . . . . . . . . . . . . . . . . . . . . . . .     .   .   .   .   .   444
               7.1.1.5 Bin Sorting . . . . . . . . . . . . . . . . . . . . . . . .     .   .   .   .   .   444
               7.1.1.6 Testing Techniques . . . . . . . . . . . . . . . . . . .        .   .   .   .   .   445
               7.1.1.7 Design for Testability (DFT) . . . . . . . . . . . . . .        .   .   .   .   .   445
       7.1.2 Example Problem: Economics of Testing . . . . . . . . . . . . .           .   .   .   .   .   446
       7.1.3 Physical Faults . . . . . . . . . . . . . . . . . . . . . . . . . . .     .   .   .   .   .   447
               7.1.3.1 Types of Physical Faults . . . . . . . . . . . . . . . . .      .   .   .   .   .   447
               7.1.3.2 Locations of Faults . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   447
               7.1.3.3 Layout Affects Locations . . . . . . . . . . . . . . . .        .   .   .   .   .   448
               7.1.3.4 Naming Fault Locations . . . . . . . . . . . . . . . . .        .   .   .   .   .   448
       7.1.4 Detecting a Fault . . . . . . . . . . . . . . . . . . . . . . . . . .     .   .   .   .   .   448
               7.1.4.1 Which Test Vectors will Detect a Fault? . . . . . . . .         .   .   .   .   .   449
       7.1.5 Mathematical Models of Faults . . . . . . . . . . . . . . . . . .         .   .   .   .   .   450
               7.1.5.1 Single Stuck-At Fault Model . . . . . . . . . . . . . .         .   .   .   .   .   450
       7.1.6 Generate Test Vector to Find a Mathematical Fault . . . . . . . .         .   .   .   .   .   451
               7.1.6.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   451
               7.1.6.2 Example of Finding a Test Vector . . . . . . . . . . . .        .   .   .   .   .   452
       7.1.7 Undetectable Faults . . . . . . . . . . . . . . . . . . . . . . . . .     .   .   .   .   .   452
               7.1.7.1 Redundant Circuitry . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   452
               7.1.7.2 Curious Circuitry and Fault Detection . . . . . . . . .         .   .   .   .   .   454
  7.2 Test Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    .   .   .   .   .   455
       7.2.1 A Small Example . . . . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   455
       7.2.2 Choosing Test Vectors . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   455
               7.2.2.1 Fault Domination . . . . . . . . . . . . . . . . . . . .        .   .   .   .   .   456
               7.2.2.2 Fault Equivalence . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   457
               7.2.2.3 Gate Collapsing . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   457
               7.2.2.4 Node Collapsing . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   458
               7.2.2.5 Fault Collapsing Summary . . . . . . . . . . . . . . .          .   .   .   .   .   458
       7.2.3 Fault Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   459
       7.2.4 Test Vector Generation and Fault Detection . . . . . . . . . . . .        .   .   .   .   .   459
       7.2.5 Generate Test Vectors for 100% Coverage . . . . . . . . . . . . .         .   .   .   .   .   459
               7.2.5.1 Collapse the Faults . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   460
               7.2.5.2 Check for Fault Domination . . . . . . . . . . . . . . .        .   .   .   .   .   462
               7.2.5.3 Required Test Vectors . . . . . . . . . . . . . . . . . .       .   .   .   .   .   463
               7.2.5.4 Faults Not Covered by Required Test Vectors . . . . . .         .   .   .   .   .   463
               7.2.5.5 Order to Run Test Vectors . . . . . . . . . . . . . . . .       .   .   .   .   .   464
               7.2.5.6 Summary of Technique to Find and Order Test Vectors             .   .   .   .   .   465
               7.2.5.7 Complete Analysis . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   466
       7.2.6 One Fault Hiding Another . . . . . . . . . . . . . . . . . . . . .        .   .   .   .   .   467
      7.3   Scan Testing in General . . . . . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   .   .   .   .   467
            7.3.1 Structure and Behaviour of Scan Testing . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   468
            7.3.2 Scan Chains . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   468
                    7.3.2.1 Circuitry in Normal and Scan Mode . .            .   .   .   .   .   .   .   .   .   .   .   .   .   468
                    7.3.2.2 Scan in Operation . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   469
                    7.3.2.3 Scan in Operation with Example Circuit           .   .   .   .   .   .   .   .   .   .   .   .   .   470
            7.3.3 Summary of Scan Testing . . . . . . . . . . . . .          .   .   .   .   .   .   .   .   .   .   .   .   .   475
            7.3.4 Time to Test a Chip . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   476
                    7.3.4.1 Example: Time to Test a Chip . . . . . .         .   .   .   .   .   .   .   .   .   .   .   .   .   476
      7.4   Boundary Scan and JTAG . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   477
            7.4.1 Boundary Scan History . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   477
            7.4.2 JTAG Scan Pins . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   478
            7.4.3 Scan Registers and Cells . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   478
            7.4.4 Scan Instructions . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   479
            7.4.5 TAP Controller . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   479
            7.4.6 Other descriptions of JTAG/IEEE 1149.1 . . . . .           .   .   .   .   .   .   .   .   .   .   .   .   .   480
      7.5   Built In Self Test . . . . . . . . . . . . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .   .   .   .   481
            7.5.1 Block Diagram . . . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   481
                    7.5.1.1 Components . . . . . . . . . . . . . . .         .   .   .   .   .   .   .   .   .   .   .   .   .   481
                    7.5.1.2 Linear Feedback Shift Register (LFSR) .          .   .   .   .   .   .   .   .   .   .   .   .   .   483
                    7.5.1.3 Maximal-Length LFSR . . . . . . . . .            .   .   .   .   .   .   .   .   .   .   .   .   .   484
            7.5.2 Test Generator . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   485
            7.5.3 Signature Analyzer . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   486
            7.5.4 Result Checker . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   486
            7.5.5 Arithmetic over Binary Fields . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   487
            7.5.6 Shift Registers and Characteristic Polynomials . .         .   .   .   .   .   .   .   .   .   .   .   .   .   487
                    7.5.6.1 Circuit Multiplication . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   489
            7.5.7 Bit Streams and Characteristic Polynomials . . . .         .   .   .   .   .   .   .   .   .   .   .   .   .   489
            7.5.8 Division . . . . . . . . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   489
            7.5.9 Signature Analysis: Math and Circuits . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   490
            7.5.10 Summary . . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   491
      7.6   Scan vs Self Test . . . . . . . . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   .   .   .   .   496
      7.7   Problems on Faults, Testing, and Testability . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   497
            P7.1 Based on Smith q14.9: Testing Cost . . . . . . . .          .   .   .   .   .   .   .   .   .   .   .   .   .   497
            P7.2 Testing Cost and Total Cost . . . . . . . . . . . .         .   .   .   .   .   .   .   .   .   .   .   .   .   497
            P7.3 Minimum Number of Faults . . . . . . . . . . . .            .   .   .   .   .   .   .   .   .   .   .   .   .   498
            P7.4 Smith q14.10: Fault Collapsing . . . . . . . . . .          .   .   .   .   .   .   .   .   .   .   .   .   .   498
            P7.5 Mathematical Models and Reality . . . . . . . . .           .   .   .   .   .   .   .   .   .   .   .   .   .   498
            P7.6 Undetectable Faults . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   498
            P7.7 Test Vector Generation . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   498
                    P7.7.1     Choice of Test Vectors . . . . . . . . . .    .   .   .   .   .   .   .   .   .   .   .   .   .   499
                    P7.7.2     Number of Test Vectors . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   499
            P7.8 Time to do a Scan Test . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   499
            P7.9 BIST . . . . . . . . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   499
                    P7.9.1     Characteristic Polynomials . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   499
                 P7.9.2     Test Generation . . . . . . . . . . . . . . . . . . . . . . . . .                                               .   500
                 P7.9.3     Signature Analyzer . . . . . . . . . . . . . . . . . . . . . . .                                                .   500
                 P7.9.4     Probability of Catching a Fault . . . . . . . . . . . . . . . . .                                             .   500
                 P7.9.5     Probability of Catching a Fault . . . . . . . . . . . . . . . . .                                             .   500
                 P7.9.6     Detecting a Specific Fault . . . . . . . . . . . . . . . . . . . .                                               .   500
                 P7.9.7     Time to Run Test . . . . . . . . . . . . . . . . . . . . . . . .                                                .   500
         P7.10   Power and BIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                                 .   500
         P7.11   Timing Hazards and Testability . . . . . . . . . . . . . . . . . . . . . .                                                 .   501
         P7.12   Testing Short Answer . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                               .   501
                 P7.12.1 Are there any physical faults that are detectable by scan testing
                            but not by built-in self testing? . . . . . . . . . . . . . . . . .                                             . 501
                 P7.12.2 Are there any physical faults that are detectable by built-in self
                            testing but not by scan testing? . . . . . . . . . . . . . . . . .                                              .   501
         P7.13   Fault Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                                              .   501
                 P7.13.1 Design test generator . . . . . . . . . . . . . . . . . . . . . .                                                  .   502
                 P7.13.2 Design signature analyzer . . . . . . . . . . . . . . . . . . . .                                                  .   502
                 P7.13.3 Determine if a fault is detectable . . . . . . . . . . . . . . . .                                                 .   502
                 P7.13.4 Testing time . . . . . . . . . . . . . . . . . . . . . . . . . . .                                                 .   502

8 Review                                                                                                                                        503
  8.1 Overview of the Term . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   503
  8.2 VHDL . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   504
       8.2.1 VHDL Topics . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   504
       8.2.2 VHDL Example Problems . . .            .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   504
  8.3 RTL Design Techniques . . . . . . . .         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   505
       8.3.1 Design Topics . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   505
       8.3.2 Design Example Problems . . .          .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   505
  8.4 Functional Verification . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   506
       8.4.1 Verification Topics . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   506
       8.4.2 Verification Example Problems .         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   506
  8.5 Performance Analysis and Optimization         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   507
       8.5.1 Performance Topics . . . . . .         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   507
       8.5.2 Performance Example Problems           .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   507
  8.6 Timing Analysis . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   508
       8.6.1 Timing Topics . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   508
       8.6.2 Timing Example Problems . . .          .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   508
  8.7 Power . . . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   509
       8.7.1 Power Topics . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   509
       8.7.2 Power Example Problems . . .           .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   509
  8.8 Testing . . . . . . . . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   510
       8.8.1 Testing Topics . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   510
       8.8.2 Testing Example Problems . . .         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   510
  8.9 Formulas to be Given on Final Exam . .        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   511
II Solutions to Assignment Problems                                                                                                            1
1 VHDL Problems                                                                                                                                 3
  P1.1 IEEE 1164 . . . . . . . . . . . . . . . . . . . . . . . . .                     .   .   .   .   .   .   .   .   .   .   .   .   .   .    3
  P1.2 VHDL Syntax . . . . . . . . . . . . . . . . . . . . . . .                       .   .   .   .   .   .   .   .   .   .   .   .   .   .    4
  P1.3 Flops, Latches, and Combinational Circuitry . . . . . . .                       .   .   .   .   .   .   .   .   .   .   .   .   .   .    7
  P1.4 Counting Clock Cycles . . . . . . . . . . . . . . . . . .                       .   .   .   .   .   .   .   .   .   .   .   .   .   .    9
  P1.5 Arithmetic Overflow . . . . . . . . . . . . . . . . . . .                        .   .   .   .   .   .   .   .   .   .   .   .   .   .   11
  P1.6 Delta-Cycle Simulation: Pong . . . . . . . . . . . . . .                        .   .   .   .   .   .   .   .   .   .   .   .   .   .   13
  P1.7 Delta-Cycle Simulation: Baku . . . . . . . . . . . . . .                        .   .   .   .   .   .   .   .   .   .   .   .   .   .   14
  P1.8 Clock-Cycle Simulation . . . . . . . . . . . . . . . . . .                      .   .   .   .   .   .   .   .   .   .   .   .   .   .   17
  P1.9 VHDL — VHDL Behavioural Comparison: Teradactyl .                                .   .   .   .   .   .   .   .   .   .   .   .   .   .   20
  P1.10 VHDL — VHDL Behavioural Comparison: Ichtyostega                              .   .   .   .   .   .   .   .   .   .   .   .   .   21
  P1.11 Waveform — VHDL Behavioural Comparison . . . . .                             .   .   .   .   .   .   .   .   .   .   .   .   .   23
  P1.12 Hardware — VHDL Comparison . . . . . . . . . . .                             .   .   .   .   .   .   .   .   .   .   .   .   .   25
  P1.13 8-Bit Register . . . . . . . . . . . . . . . . . . . . . .                   .   .   .   .   .   .   .   .   .   .   .   .   .   27
       P1.13.1 Asynchronous Reset . . . . . . . . . . . . . . .                        .   .   .   .   .   .   .   .   .   .   .   .   .   .   27
       P1.13.2 Discussion . . . . . . . . . . . . . . . . . . . .                      .   .   .   .   .   .   .   .   .   .   .   .   .   .   28
       P1.13.3 Testbench for Register . . . . . . . . . . . . . .                      .   .   .   .   .   .   .   .   .   .   .   .   .   .   28
  P1.14 Synthesizable VHDL and Hardware . . . . . . . . . .                          .   .   .   .   .   .   .   .   .   .   .   .   .   30
  P1.15 Datapath Design . . . . . . . . . . . . . . . . . . . . .                    .   .   .   .   .   .   .   .   .   .   .   .   .   32
       P1.15.1 Correct Implementation? . . . . . . . . . . . . .                       .   .   .   .   .   .   .   .   .   .   .   .   .   .   32
       P1.15.2 Smallest Area . . . . . . . . . . . . . . . . . . .                     .   .   .   .   .   .   .   .   .   .   .   .   .   .   36
       P1.15.3 Shortest Clock Period . . . . . . . . . . . . . .                       .   .   .   .   .   .   .   .   .   .   .   .   .   .   37

2 Design Problems                                                                                                                              39
  P2.1 Synthesis . . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   39
       P2.1.1 Data Structures . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   39
       P2.1.2 Own Code vs Libraries . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   39
  P2.2 Design Guidelines . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   39
  P2.3 Dataflow Diagram Optimization . . . . .          .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   42
       P2.3.1 Resource Usage . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   42
       P2.3.2 Optimization . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   43
  P2.4 Dataflow Diagram Design . . . . . . . .          .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   44
       P2.4.1 Maximum Performance . . . . .            .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   44
       P2.4.2 Minimum area . . . . . . . . . .         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   46
  P2.5 Michener: Design and Optimization . . .         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   47
  P2.6 Dataflow Diagrams with Memory Arrays             .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   48
       P2.6.1 Algorithm 1 . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   49
       P2.6.2 Algorithm 2 . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   51
  P2.7 2-bit adder . . . . . . . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   52
       P2.7.1 Generic Gates . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   52
       P2.7.2 FPGA . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   52
  P2.8 Sketches of Problems . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   53
3 Functional Verification Problems                                                                                                            55
  P3.1 Carry Save Adder . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    55
  P3.2 Traffic Light Controller . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    55
       P3.2.1 Functionality . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    55
       P3.2.2 Boundary Conditions . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    56
       P3.2.3 Assertions . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    56
  P3.3 State Machines and Verification . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    57
       P3.3.1 Three Different State Machines . .        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    57
               P3.3.1.1 Number of Test Scenarios        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    57
               P3.3.1.2 Length of Test Scenario .       .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    58
               P3.3.1.3 Number of Flip Flops . .        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    58
       P3.3.2 State Machines in General . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    59
  P3.4 Test Plan Creation . . . . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    59
       P3.4.1 Early Tests . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    60
       P3.4.2 Corner Cases . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    61
  P3.5 Sketches of Problems . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    62

4 Performance Analysis and Optimization Problems                                                                                             63
  P4.1 Farmer . . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    63
  P4.2 Network and Router . . . . . . . . . . . . . . .         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    64
       P4.2.1 Maximum Throughput . . . . . . . . .              .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    65
       P4.2.2 Packet Size and Performance . . . . . .           .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    66
  P4.3 Performance Short Answer . . . . . . . . . . .           .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    66
  P4.4 Microprocessors . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    67
       P4.4.1 Average CPI . . . . . . . . . . . . . .           .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    67
       P4.4.2 Why not you too? . . . . . . . . . . . .          .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    68
       P4.4.3 Analysis . . . . . . . . . . . . . . . .          .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    69
  P4.5 Dataflow Diagram Optimization . . . . . . . .             .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    70
  P4.6 Performance Optimization with Memory Arrays              .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    70
  P4.7 Multiply Instruction . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    75
       P4.7.1 Highest Performance . . . . . . . . . .           .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    76
       P4.7.2 Performance Metrics . . . . . . . . . .           .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .    78
5 Timing Analysis Problems                                                                                         79
  P5.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   79
  P5.2 Hold Time Violations . . . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   80
       P5.2.1 Cause . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   80
       P5.2.2 Behaviour . . . . . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   80
       P5.2.3 Rectification . . . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   80
  P5.3 Latch Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   81
  P5.4 Critical Path and False Path . . . . . . . . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   83
  P5.5 Critical Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   .   .   .   .   .   .   .   .   83
       P5.5.1 Longest Path . . . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   84
       P5.5.2 Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   84
       P5.5.3 Missing Factors . . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   84
       P5.5.4 Critical Path or False Path? . . . . . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   85
  P5.6 YACP: Yet Another Critical Path . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   86
  P5.7 Timing Models . . . . . . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   87
  P5.8 Short Answer . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   89
       P5.8.1 Wires in FPGAs . . . . . . . . . . . . . . . . . . . . . . .         .   .   .   .   .   .   .   .   89
       P5.8.2 Age and Time . . . . . . . . . . . . . . . . . . . . . . . . .       .   .   .   .   .   .   .   .   89
       P5.8.3 Temperature and Delay . . . . . . . . . . . . . . . . . . . .        .   .   .   .   .   .   .   .   89
  P5.9 Worst Case Conditions and Derating Factor . . . . . . . . . . . . .         .   .   .   .   .   .   .   .   90
       P5.9.1 Worst-Case Commercial . . . . . . . . . . . . . . . . . . .          .   .   .   .   .   .   .   .   90
       P5.9.2 Worst-Case Industrial . . . . . . . . . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   90
       P5.9.3 Worst-Case Industrial, Non-Ambient Junction Temperature .            .   .   .   .   .   .   .   .   90
6 Power Problems                                                                                                                         91
  P6.1 Short Answers . . . . . . . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   91
       P6.1.1 Power and Temperature . . . . . . .        .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   91
       P6.1.2 Leakage Power . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   92
       P6.1.3 Clock Gating . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   92
       P6.1.4 Gray Coding . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   92
  P6.2 VLSI Gurus . . . . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   93
       P6.2.1 Effect on Power . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   93
       P6.2.2 Critique . . . . . . . . . . . . . . . .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   93
  P6.3 Advertising Ratios . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   94
  P6.4 Vary Supply Voltage . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   94
  P6.5 Clock Speed Increase Without Power Increase       .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   95
       P6.5.1 Supply Voltage . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   95
       P6.5.2 Supply Voltage . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   96
  P6.6 Power Reduction Strategies . . . . . . . . . .    .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   96
       P6.6.1 Supply Voltage . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   97
       P6.6.2 Transistor Sizing . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   97
       P6.6.3 Adding Registers to Inputs . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   97
       P6.6.4 Gray Coding . . . . . . . . . . . . .      .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   98
  P6.7 Power Consumption on New Chip . . . . . .         .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   98
       P6.7.1 Hypothesis . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   98
       P6.7.2 Experiment . . . . . . . . . . . . . .     .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   99
       P6.7.3 Reality . . . . . . . . . . . . . . . .    .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   99


7 Problems on Faults, Testing, and Testability                                                        101
  P7.1 Based on Smith q14.9: Testing Cost . . . . . . . . . . . . . . . . . . . . . . . .         .   101
  P7.2 Testing Cost and Total Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      .   103
  P7.3 Minimum Number of Faults . . . . . . . . . . . . . . . . . . . . . . . . . . . .           .   104
  P7.4 Smith q14.10: Fault Collapsing . . . . . . . . . . . . . . . . . . . . . . . . . . .       .   105
  P7.5 Mathematical Models and Reality . . . . . . . . . . . . . . . . . . . . . . . . .          .   105
  P7.6 Undetectable Faults . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      .   105
  P7.7 Test Vector Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       .   106
       P7.7.1 Choice of Test Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . .        .   106
       P7.7.2 Number of Test Vectors . . . . . . . . . . . . . . . . . . . . . . . . . .          .   106
  P7.8 Time to do a Scan Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       .   106
  P7.9 BIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     .   107
       P7.9.1 Characteristic Polynomials . . . . . . . . . . . . . . . . . . . . . . . . .        .   107
       P7.9.2 Test Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       .   108
       P7.9.3 Signature Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        .   110
        P7.9.4 Probability of Catching a Fault . . . . . . . . . . . . . . . . . . . . . . .      .   113
        P7.9.5 Probability of Catching a Fault . . . . . . . . . . . . . . . . . . . . . . .      .   114
       P7.9.6 Detecting a Specific Fault . . . . . . . . . . . . . . . . . . . . . . . . .         .   114
       P7.9.7 Time to Run Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        .   115
   P7.10 Power and BIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       .   116
   P7.11 Timing Hazards and Testability . . . . . . . . . . . . . . . . . . . . . . . . . .       .   116
   P7.12 Testing Short Answer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .       .   118
       P7.12.1 Are there any physical faults that are detectable by scan testing but not by
               built-in self testing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   . 118
       P7.12.2 Are there any physical faults that are detectable by built-in self testing but
               not by scan testing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     .   118
   P7.13 Fault Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .      .   119
       P7.13.1 Design test generator . . . . . . . . . . . . . . . . . . . . . . . . . . . .      .   119
       P7.13.2 Design signature analyzer . . . . . . . . . . . . . . . . . . . . . . . . .        .   119
       P7.13.3 Determine if a fault is detectable . . . . . . . . . . . . . . . . . . . . . .     .   120
       P7.13.4 Testing time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     .   120
Part I

Course Notes




Chapter 1

VHDL: The Language

1.1 Introduction to VHDL
1.1.1 Levels of Abstraction

There are many different levels of abstraction for working with hardware:


   • Quantum: Schrödinger’s equation describes the movement of electrons and holes through
     material.

   • Energy band: 2-dimensional diagrams that capture essential features of Schrödinger’s
     equation. Energy-band diagrams are commonly used in nano-scale engineering.

   • Transistor: Signal values and time are continuous (analog). Each transistor is modeled by a
     resistor-capacitor network. Overall behaviour is defined by differential equations in terms of
     the resistors and capacitors. Spice is a typical simulation tool.

   • Switch: Time is continuous, but voltage may be either continuous or discrete. Linear equa-
     tions are used, rather than differential equations. A rising edge may be modeled as a linear
     rise over some range of time, or the time between a definite low value and a definite high
     value may be modeled as having an undefined or rising value.

   • Gate: Transistors are grouped together into gates (e.g. AND, OR, NOT). Voltages are discrete
     values such as pure Boolean (0 or 1) or IEEE Standard Logic 1164, which has representations
     for different types of unknown or undefined values. Time may be continuous or may be
     discrete. If discrete, a common unit is the delay through a single inverter (e.g. a NOT gate
     has a delay of 1 and an AND gate has a delay of 2).

   • Register transfer level: The essential characteristic of the register transfer level is that the
     behaviour of hardware is modeled as assignments to registers and combinational signals.
      Equations are written where a register signal is a function of other signals (e.g. c = a
      and b;). The assignments may be either combinational or registered. Combinational
      assignments happen instantaneously and registered assignments take exactly one clock cycle.
       There are variations on the pure register-transfer level. For example, time may be measured
       in clock phases rather than clock cycles, so as to allow assignments on either the rising or
       falling edge of a clock. Another variation is to have multiple clocks that run at different
       speeds — a clock on a bus might run at half the speed of the primary clock for the chip.

    • Transaction level: The basic unit of computation is a transaction, such as executing an in-
       struction on a microprocessor, transferring data across a bus, or accessing memory. Time
       is usually measured as an estimate (e.g. a memory write requires 15 clock cycles, or a
       bus transfer requires 250 ns). The building blocks of the transaction level are processors,
       controllers, memory arrays, busses, and intellectual property (IP) blocks (e.g. UARTs). The
       behaviour of the building blocks is described with software-like models, often written in
      behavioural VHDL, SystemC, or SystemVerilog. The transaction level has many similarities
      to a software model of a distributed system.

    • Electronic-system level: Looks at an entire electronic system, with both hardware and soft-
      ware.


In this course, we will focus on the register-transfer level. In the second half of the course, we will
look at how analog phenomena, such as timing and power, affect the register-transfer level. In
these chapters we will occasionally dip down into the transistor, switch, and gate levels.
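The distinction between registered and combinational assignments at the register-transfer level can be sketched in VHDL as follows (entity and signal names are invented for this illustration):

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity rtl_sketch is
  port (
    clk, a, b : in  std_logic;
    c, d      : out std_logic
  );
end rtl_sketch;

architecture main of rtl_sketch is
begin
  -- registered assignment: c gets a AND b one clock cycle later
  process (clk)
  begin
    if rising_edge(clk) then
      c <= a and b;
    end if;
  end process;

  -- combinational assignment: d tracks a OR b "instantaneously"
  d <= a or b;
end main;
```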


1.1.2 VHDL Origins and History


VHDL = VHSIC Hardware Description Language
   VHSIC = Very High Speed Integrated Circuit


       The VHSIC Hardware Description Language (VHDL) is a formal notation intended
       for use in all phases of the creation of electronic systems. Because it is both machine
       readable and human readable, it supports the development, verification, synthesis and
       testing of hardware designs, the communication of hardware design data, and the
       maintenance, modification, and procurement of hardware.

       Language Reference Manual (IEEE Design Automation Standards Committee,
       1993a)

• development
• verification
• synthesis
• testing
• hardware designs
• communication
• maintenance
• modification
• procurement



        VHDL is a lot more than synthesis of digital hardware

VHDL History      ....................................................................... .

• Developed by the United States Department of Defense as part of the very high speed integrated
  circuit (VHSIC) program in the early 1980s.
• The Department of Defense intended VHDL to be used for the documentation, simulation and
  verification of electronic systems.
• Goals:
  – improve design process over schematic entry
  – standardize design descriptions amongst multiple vendors
  – portable and extensible
• Inspired by the ADA programming language
  – large: 97 keywords, 94 syntactic rules
  – verbose (designed by committee)
  – static type checking, overloading
  – complicated syntax: parentheses are used for both expression grouping and array indexing
    Example:
    a <= b * (3 + c);           -- integer
    a <= (3 + c);               -- 1-element array of integers


• Standardized by IEEE in 1987 (IEEE 1076-1987), revised in 1993, 2000.
• In 1993 the IEEE standard VHDL package for model interoperability, STD_LOGIC_1164
  (IEEE Standard 1164-1993), was developed.
  – std_logic_1164 defines 9 different values for signals
• In 1997 the IEEE standard packages for arithmetic over std_logic and bit signals were
  defined (IEEE Standard 1076.3–1997).
  – numeric_std defines arithmetic over std_logic vectors and integers.
              Note:      This is the package that you should use for arithmetic. Don’t
              use std_logic_arith — it has less uniform support for mixed
              integer/signal arithmetic and has a greater tendency for differences
              between tools.
  – numeric_bit defines arithmetic over bit vectors and integers. We won’t use bit
    signals in this course, so you don’t need to worry about this package.
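As an illustrative sketch of arithmetic with numeric_std (the entity and signal names here are invented, not from the course examples):

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;  -- arithmetic over std_logic vectors and integers

entity add8 is
  port (
    a, b : in  std_logic_vector(7 downto 0);
    s    : out std_logic_vector(7 downto 0)
  );
end add8;

architecture main of add8 is
begin
  -- interpret the vectors as unsigned, add them (mixing in an integer
  -- literal is also allowed), then convert back to std_logic_vector
  s <= std_logic_vector(unsigned(a) + unsigned(b) + 1);
end main;
```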


1.1.3 Semantics

The original goal of VHDL was to simulate circuits. The semantics of the language define circuit
behaviour.
                  [Figure: simulating c <= a AND b; produces waveforms for a, b, and c]



But now, VHDL is used in simulation and synthesis. Synthesis is concerned with the structure of
the circuit.

Synthesis: converts one type of description (behavioural) into another, lower level, description
(usually a netlist).
                  [Figure: synthesizing c <= a AND b; produces an AND gate with inputs a, b and output c]


Synthesis is a computer-aided design (CAD) technique that transforms a designer’s concise, high-
level description of a circuit into a structural description of a circuit.


CAD Tools     ............................................................................


CAD Tools allow designers to automate lower-level design processes in implementing the desired
functionality of a system.

NOTE: EDA = Electronic Design Automation. In digital hardware design EDA = CAD.


Synthesis vs Simulation     ................................................................


For synthesis, we want the code we write to define the structure of the hardware that is generated.
                  [Figure: synthesizing c <= a AND b; produces an AND gate with inputs a, b and output c]


The VHDL semantics define the behaviour of the hardware that is generated, not the structure
of the hardware. The scenario below complies with the semantics of VHDL, because the two
synthesized circuits produce the same behaviour. If the two synthesized circuits had different
behaviour, then the scenario would not comply with the VHDL Standard.
       [Figure: c <= a AND b; synthesized two ways gives two circuits with different structure;
       simulating both gives the same behaviour (identical waveforms for a, b, and c)]


1.1.4 Synthesis of a Simulation-Based Language
• Not all of VHDL is synthesizable
  – c <= a AND b; (synthesizable)
  – c <= a AND b AFTER 2 ns; (NOT synthesizable)
    ∗ how do you build a circuit with exactly 2 ns of delay through an AND gate?
    ∗ more examples of non-synthesizable code are in section 1.11
• Different synthesis tools support different subsets of VHDL
• Some tools generate erroneous hardware for some code
  – behaviour of hardware differs from VHDL semantics
• Some tools generate unpredictable hardware (Hardware that has the correct behaviour, but un-
  desirable or weird structure).
• There is an IEEE standard (1076.6) for a synthesizable subset of VHDL, but tool vendors don’t
  yet conform to it. (Most vendors still don’t have full support for the 1993 extensions to VHDL!).
  For more info, see http://www.vhdl.org/siwg/.


1.1.5 Solution to Synthesis Sanity
•   Pick a high-quality synthesis tool and study its documentation thoroughly
•   Learn the idioms of the tool
•   Different VHDL code with same behaviour can result in very different circuits
•   Be careful if you have to port VHDL code from one tool to another


• KISS: Keep It Simple Stupid
  – VHDL examples will illustrate reliable coding techniques for the synthesis tools from Synop-
    sys, Mentor Graphics, Altera, Xilinx, and most other companies as well.
  – Follow the coding guidelines and examples from lecture
  – As you write VHDL, think about the hardware you expect to get.
             Note:     If you can’t predict the hardware, then the hardware probably
             won’t be very good (small, fast, correct, etc)


1.1.6 Standard Logic 1164

At the core of VHDL is a package named STANDARD that defines a type named bit with values
of ’0’ and ’1’. For simulation, it is helpful to have additional values, such as “undefined” and
“high impedance”. Many companies created their own (incompatible) definitions of signal types
for simulation. To regain compatibility amongst packages from different companies, the IEEE
defined std_logic_1164 to be the standard type for signal values in VHDL simulation.
    ’U’    uninitialized
    ’X’    strong unknown
    ’0’    strong 0
    ’1’    strong 1
    ’Z’    high impedance
    ’W’    weak unknown
    ’L’    weak 0
    ’H’    weak 1
    ’-’    don’t care
The most common values are: ’U’, ’X’, ’0’, ’1’.

If you see ’X’ in a simulation, it usually means that there is a mistake in your code.

Every VHDL file that you write should begin with: library ieee;
                                                      use ieee.std_logic_1164.all;

            Note: std_logic vs boolean    The std_logic values ’1’ and ’0’ are not
            the same as the boolean values true and false. For example, you must
            write if a = ’1’ then .... The code if a then ... will not
            typecheck if a is of type std_logic.

From a VLSI perspective, a weak value will come from a smaller gate. One aspect of VHDL that
we don’t touch on in ece327 is “resolution”, which describes how to determine the value of a signal
if the signal is driven by more than one process. (In ece327, we restrict ourselves to having
each signal be driven by (be the target of) exactly one process.) The std_logic_1164 library
provides a resolution function to deal with the situation where different processes drive the same
signal with different values. In this situation, a strong value (e.g. ’1’) will overpower a weak value
(e.g. ’L’). If two processes drive the signal with different strong values (e.g. ’1’ and ’0’), the signal
resolves to a strong unknown (’X’). If a signal is driven with two different weak values (e.g. ’H’
and ’L’), the signal resolves to a weak unknown (’W’).
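For illustration only (remember that in ece327 each signal has exactly one driver), a sketch of two concurrent drivers on one resolved signal:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity res_demo is
end res_demo;

architecture main of res_demo is
  signal s : std_logic;  -- std_logic is a resolved type
begin
  s <= '1';  -- one driver: strong 1
  s <= 'L';  -- a second driver: weak 0
  -- the resolution function picks the strong value, so s simulates as '1';
  -- driving '1' and '0' instead would resolve to 'X'
end main;
```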



1.2 Comparison of VHDL to Other Hardware Description Languages
1.2.1 VHDL Disadvantages
•   Some VHDL programs cannot be synthesized
•   Different tools support different subsets of VHDL.
•   Different tools generate different circuits for the same code
•   VHDL is verbose
    – Many characters to say something simple
• VHDL is complicated and confusing
  – Many different ways of saying the same thing
  – Constructs that have similar purpose have very different syntax (case vs. select)
  – Constructs that have similar syntax have very different semantics (variables vs signals)
• Hardware that is synthesized is not always obvious (when is a signal a flip-flop vs latch vs
  combinational)
  – The infamous latch inference problem (See section 1.5.2 for more information)


1.2.2 VHDL Advantages
• VHDL supports unsynthesizable constructs that are useful in writing high-level models, test-
  benches and other non-hardware or non-synthesizable artifacts that we need in hardware design.
  VHDL can be used throughout a large portion of the design process in different capacities, from
  specification to implementation to verification.
• VHDL has static typechecking — many errors can be caught before synthesis and/or simulation.
  (In this respect, it is more similar to Java than to C.)
• VHDL has a rich collection of datatypes
• VHDL is a full-featured language with a good module system (libraries and packages).
• VHDL has a well-defined standard.


1.2.3 VHDL and Other Languages

1.2.3.1 VHDL vs Verilog
• Verilog is a “simpler” language: smaller language, simple circuits are easier to write
• VHDL has more features than Verilog
  – richer set of data types and strong type checking
  – VHDL offers more flexibility and expressivity for constructing large systems.
• The VHDL Standard is more standard than the Verilog Standard
  – VHDL and Verilog have simulation-based semantics
  – Simulation vendors generally conform to VHDL standard
  – Some Verilog constructs give different behaviours in simulation and synthesis
•    VHDL is used more than Verilog in Europe and Japan
•    Verilog is used more than VHDL in North America
•    VHDL is used more in FPGAs than in ASICs
•    South-East Asia, India, South America: ?????


1.2.3.2 VHDL vs System Verilog
• System Verilog is a superset of Verilog. It extends Verilog to make it a full object-oriented
  hardware modelling language
• Syntax is based on Verilog and C++.
• As of 2007, System Verilog is used almost exclusively for test benches and simulation. Very
  few people are trying to use it to do hardware design.
• System Verilog grew out of Superlog, a proposed language that was based on Verilog and C.
  The basic core came from Verilog; C-like extensions were included to make the language more
  expressive and powerful. It was developed originally by the company Co-Design Automation and
  then standardized by Accellera, an organization aimed at standardizing EDA languages. Co-Design
  was purchased by Synopsys, and now Synopsys is the leading proponent of System Verilog.


1.2.3.3 VHDL vs SystemC
• SystemC looks like C — familiar syntax
• C is often used in algorithmic descriptions of circuits, so why not try to use it for synthesizable
  code as well?
• If you think VHDL is hard to synthesize, try C....
• SystemC simulation is slower than advertised


1.2.3.4 Summary of VHDL Evaluation
• VHDL is far from perfect and has lots of annoying characteristics
• VHDL is a better language for education than Verilog because the static typechecking enforces
  good software engineering practices
• The richness of VHDL will be useful in creating concise high-level models and powerful test-
  benches



1.3 Overview of Syntax
This section is just a brief overview of the syntax of VHDL, focusing on the constructs that are
most commonly used. For more information, read a book on VHDL and use online resources.
(Look for “VHDL” under the “Documentation” tab in the E&CE 327 web pages.)


1.3.1 Syntactic Categories

There are five major categories of syntactic constructs.
           (There are many, many minor categories and subcategories of constructs.)
• Library units (section 1.3.2)
  – Top-level constructs (packages, entities, architectures)
• Concurrent statements (section 1.3.4)
  – Statements executed at the same time (in parallel)
• Sequential statements (section 1.3.7)
  – Statements executed in series (one after the other)
• Expressions
  – Arithmetic (section 1.10), Boolean, Vectors, etc.
• Declarations
  – Components, signals, variables, types, functions, ....


1.3.2 Library Units

Library units are the top-level syntactic constructs in VHDL. They are used to define and include
libraries, declare and implement interfaces, define packages of declarations and otherwise bind
together VHDL code.
• Package body
   – define the contents of a library
• Packages
  – determine which parts of the library are externally visible


• Use clause
  – use a library in an entity/architecture or another package
  – technically, use clauses are part of entities and packages, but they precede the entity/package
    keyword, so we list them as top-level constructs
• Entity (section 1.3.3)
  – define interface to circuit
• Architecture (section 1.3.3)
  – define internal signals and gates of circuit


1.3.3 Entities and Architecture

Each hardware module is described with an Entity/Architecture pair


            [Figure: two views of a module — the entity is the interface (outer box);
            the architecture is the internals, drawn inside the entity]
                                 Figure 1.1: Entity and Architecture


• Entity: interface
  – names, modes (in / out), types of externally visible signals of circuit
• Architecture: internals
  – structure and behaviour of module



library ieee;
use ieee.std_logic_1164.all;

entity and_or is
  port (
    a, b, c : in std_logic ;
    z       : out std_logic
  );
end and_or;
                                  Figure 1.2: Example of an entity


The syntax of VHDL is defined using a variation on Backus-Naur form (BNF).



[ { use_clause } ]
entity ENTITYID is
  [ port (
      { SIGNALID : (in | out) TYPEID [ := expr ] ; }
    );
  ]
  [ { declaration } ]
[ begin
    { concurrent_statement } ]
end [ entity ] ENTITYID ;
                          Figure 1.3: Simplified grammar of entity



architecture main of and_or is
  signal x : std_logic;
begin
  x <= a AND b;
  z <= x OR (a AND c);
end main;
                            Figure 1.4: Example of architecture



[ { use_clause } ]
architecture ARCHID of ENTITYID is
  [ { declaration } ]
begin
  [ { concurrent_statement } ]
end [ architecture ] ARCHID ;
                       Figure 1.5: Simplified grammar of architecture


1.3.4 Concurrent Statements
• Architectures contain concurrent statements
• Concurrent statements execute in parallel (Figure 1.6)
  – Concurrent statements make VHDL fundamentally different from most software languages.
  – Hardware (gates) naturally execute in parallel — VHDL mimics the behaviour of real hard-
    ware.
  – At each infinitesimally small moment of time, each gate:
      1. samples its inputs
      2. computes the value of its output
      3. drives the output


           architecture main of bowser is             architecture main of bowser is
           begin                                      begin
             x1 <= a AND b;                             z <= NOT x2;
             x2 <= NOT x1;                              x2 <= NOT x1;
             z <= NOT x2;                               x1 <= a AND b;
           end main;                                  end main;




                  [Circuit for Figure 1.6: x1 <= a AND b; x2 <= NOT x1; z <= NOT x2]




                Figure 1.6: The order of concurrent statements doesn’t matter




 conditional assignment      . . . <= . . . when . . . else . . . ;
                       • if-then-else style of assignment (uses when)
                       • a plain assignment (. . . <= . . .) is the unconditional case
       c <= a+b when sel='1' else a+c when sel='0' else "0000";

 selected assignment         with . . . select
                                   . . . <= . . . when . . . | . . . ,
                                             . . . when . . . | . . . ,
                                             ...
                                             . . . when . . . | . . . ;
                       • case/switch style of assignment
       with color select d <= "00" when red, "01" when . . . ;

 component instantiation     . . . : . . . port map ( . . . => . . . , . . . );
                       • use an existing circuit (section 1.3.5)
       add1 : adder port map( a => f, b => g, s => h, co => i );

 for-generate                . . . : for . . . in . . . generate
                                         ...
                                     end generate;
                       • replicate some hardware
       bgen : for i in 1 to 7 generate b(i) <= a(7-i); end generate;

 if-generate                 . . . : if . . . generate
                                         ...
                                     end generate;
                       • conditionally create some hardware
       okgen : if optgoal /= fast generate
         result <= ((a and b) or (d and not e)) or g;
       end generate;
       fastgen : if optgoal = fast generate
         result <= '1';
       end generate;

 process                     process . . . begin
                                 ...
                             end process;
                       • the body of a process is executed sequentially (Sections 1.3.6, 1.6)



                 Figure 1.7: The most commonly used concurrent statements


1.3.5 Component Declaration and Instantiations

There are two different syntaxes for component declaration and instantiation. The VHDL-93 syn-
tax is much more concise than the VHDL-87 syntax.

Not all tools support the VHDL-93 syntax. For E&CE 327, some of the tools that we use do not
support the VHDL-93 syntax, so we are stuck with the VHDL-87 syntax.
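As a sketch of the two styles (the adder entity, its ports, and the top entity are invented for this example; the adder itself is assumed to be defined elsewhere):

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity top is
end top;

architecture main of top is
  -- VHDL-87 style: declare the component before instantiating it
  component adder
    port ( a, b  : in  std_logic;
           s, co : out std_logic );
  end component;
  signal f, g, h, i : std_logic;
begin
  add1 : adder port map( a => f, b => g, s => h, co => i );
  -- VHDL-93 style skips the component declaration and instantiates directly:
  --   add1 : entity work.adder port map( a => f, b => g, s => h, co => i );
end main;
```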


1.3.6 Processes


•    Processes are used to describe complex and potentially unsynthesizable behaviour
•    A process is a concurrent statement (Section 1.3.4).
•    The body of a process contains sequential statements (Section 1.3.7)
•    Processes are the most complex and difficult to understand part of VHDL (Sections 1.5 and 1.6)


process (a, b, c)                                     process
begin                                                 begin
  y <= a AND b;                                         y <= a AND b;
  if (a = ’1’) then                                     z <= ’0’;
    z1 <= b AND c;                                      wait until rising_edge(clk);
    z2 <= NOT c;                                        if (a = ’1’) then
  else                                                    z <= ’1’;
    z1 <= b OR c;                                         y <= ’0’;
    z2 <= c;                                              wait until rising_edge(clk);
  end if;                                               else
end process;                                              y <= a OR b;
                                                        end if;
                                                      end process;

                                Figure 1.8: Examples of processes


• Processes must have either a sensitivity list or at least one wait statement on each execution path
  through the process.
• Processes cannot have both a sensitivity list and a wait statement.


Sensitivity List     ....................................................................... .


The sensitivity list contains the signals that are read in the process.

A process is executed when a signal in its sensitivity list changes value.


An important coding guideline to ensure consistent synthesis and simulation results is to include
all signals that are read in the sensitivity list. If you forget some signals, you will either end up
with unpredictable hardware and simulation results (different results from different programs) or
undesirable hardware (latches where you expected purely combinational hardware). For more on
this topic, see sections 1.5.2 and 1.6.
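For example, a combinational process sketch (signal names a, b, sel, y are invented) in which every signal that is read appears in the sensitivity list:

```vhdl
-- all signals read in the body (a, b, sel) are in the sensitivity list,
-- so this synthesizes to purely combinational hardware (a multiplexer)
process (a, b, sel)
begin
  if sel = '1' then
    y <= a;
  else
    y <= b;
  end if;
end process;
```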

There is one exception to this rule: for a process that implements a flip-flop with an if rising_edge
statement, it is acceptable to include only the clock signal in the sensitivity list — other signals
may be included, but are not needed.
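A sketch of this exception (signal names clk, d, q are invented): a flip-flop process where only the clock is needed in the sensitivity list, even though d is also read:

```vhdl
-- d is read in the body but need not appear in the sensitivity list,
-- because q only changes on a rising edge of clk
process (clk)
begin
  if rising_edge(clk) then
    q <= d;
  end if;
end process;
```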



[ PROCLAB : ] process ( sensitivity_list )
  [ { declaration } ]
begin
  { sequential_statement }
end process [ PROCLAB ] ;
                           Figure 1.9: Simplified grammar of process



1.3.7 Sequential Statements

Used inside processes and functions.


    wait                 wait until . . . ;
    signal assignment    . . . <= . . . ;
    if-then-else         if . . . then . . . elsif . . . end if;
    case                 case . . . is
                            when . . . | . . . => . . . ;
                            when . . . => . . . ;
                         end case;
    loop                 loop . . . end loop;
    while loop           while . . . loop . . . end loop;
    for loop             for . . . in . . . loop . . . end loop;
    next                 next . . . ;


                  Figure 1.10: The most commonly used sequential statements


1.3.8 A Few More Miscellaneous VHDL Features

Some constructs that are useful and will be described in later chapters and sections:
report : print a message on stderr while simulating
assert : assertions about behaviour of signals, very useful with report statements.
generics : parameters to an entity that are defined at elaboration time.
attributes : predefined functions for different datatypes. For example: high and low indices of a
      vector.
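
A brief sketch of how these features might appear together (illustrative code, not an example
from the notes):

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity check_bounds is
  generic ( WIDTH : natural := 8 );                    -- generic: set at elaboration
  port ( v : in std_logic_vector(WIDTH-1 downto 0) );
end check_bounds;

architecture main of check_bounds is
begin
  process (v) begin
    -- attributes: v'high and v'low are the highest and lowest indices of v
    assert v'high = WIDTH - 1                          -- assertion about v
      report "unexpected upper index";                 -- message printed while simulating
  end process;
end main;
```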



1.4 Concurrent vs Sequential Statements
All concurrent assignments can be translated into sequential statements, but not all sequential
statements can be translated into concurrent statements.




1.4.1 Concurrent Assignment vs Process

The two code fragments below have identical behaviour:

architecture main of tiny is                          architecture main of tiny is
begin                                                 begin
  b <= a;                                               process (a) begin
end main;                                                 b <= a;
                                                        end process;
                                                      end main;




1.4.2 Conditional Assignment vs If Statements

The two code fragments below have identical behaviour:

         Concurrent Statements                                   Sequential Statements
                                                      if <cond> then
t <=   <val1> when <cond>                               t <= <val1>;
  else <val2>;                                        else
                                                        t <= <val2>;
                                                      end if;


1.4.3 Selected Assignment vs Case Statement

The two code fragments below have identical behaviour:

         Concurrent Statements                                        Sequential Statements
with <expr> select                                              case <expr> is
t <= <val1> when <choices1>,                                      when <choices1> =>
     <val2> when <choices2>,                                        t <= <val1>;
     <val3> when <choices3>;                                      when <choices2> =>
                                                                    t <= <val2>;
                                                                  when <choices3> =>
                                                                    t <= <val3>;
                                                                end case;
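
For example, the two templates could be instantiated as follows (a sketch with hypothetical
signals sel, a, and b):

```vhdl
-- concurrent form
with sel select
  t <= a   when "00",
       b   when "01" | "10",
       '0' when others;

-- equivalent sequential form (inside a process sensitive to sel, a, b)
case sel is
  when "00"        => t <= a;
  when "01" | "10" => t <= b;
  when others      => t <= '0';
end case;
```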




1.4.4 Coding Style

Code that’s easy to write with sequential statements, but difficult with concurrent:

          Sequential Statements                                 Concurrent Statements

case <expr> is                                      Overall structure:
  when <choice1> =>                                 with <expr> select
    if <cond> then                                    t <= ... when <choice1>,
        o <= <expr1>;                                      ... when <choice2>;
    else
        o <= <expr2>;                               Failed attempt:
    end if;                                         with <expr> select
  when <choice2> =>                                   t <= -- want to write:
    ...                                                    --   <val1> when <cond>
end case;                                                  --   else <val2>
                                                           -- but conditional assignment
                                                           -- is illegal here
                                                           when c1,
                                                           ...
                                                           when c2;



      Concurrent statement with correct behaviour, but messy:
        t <=   <expr1> when (<expr> = <choice1> AND     <cond>)
          else <expr2> when (<expr> = <choice1> AND NOT <cond>)
          else . . . ;


1.5 Overview of Processes
Processes are the most difficult VHDL construct to understand. This section gives an overview of
processes. Section 1.6 gives the details of the semantics of processes.
• Within a process, statements are executed almost sequentially
• Among processes, execution is done in parallel
• Remember: a process is a concurrent statement!


entity ENTITYID is
  interface declarations
end ENTITYID;

architecture ARCHID of ENTITYID is
begin
     concurrent statements ⇐=
     process begin
       sequential statements ⇐=
     end process;
  concurrent statements   ⇐=
end ARCHID;
                       Figure 1.11: Sequential statements in a process

Key concepts in VHDL semantics for processes:
• VHDL mimics hardware
• Hardware (gates) execute in parallel
• Processes execute in parallel with each other
• All possible orders of executing processes must produce the same simulation results (wave-
  forms)
• If a signal is not assigned a value, then it holds its previous value

 All orders of executing concurrent statements must
            produce the same waveforms
It doesn’t matter whether you are running on a single-threaded operating system, on a multi-
threaded operating system, on a massively parallel supercomputer, or on a special hardware emu-
lator with one FPGA chip per VHDL process — all simulations must be the same.

These concepts are the motivation for the semantics of executing processes in VHDL (Section 1.6)
and lead to the phenomenon of latch-inference (Section 1.5.2).




      architecture                 Three execution sequences:
       procA: process
          stmtA1;                    single-threaded, procA before procB:  A1 A2 A3 B1 B2
          stmtA2;                    single-threaded, procB before procA:  B1 B2 A1 A2 A3
          stmtA3;                    multithreaded, procA and procB
       end process;                    in parallel:                        A1,B1  A2,B2  A3
       procB: process
          stmtB1;
          stmtB2;
       end process;

                         Figure 1.12: Different process execution sequences

           [Figure 1.13: All execution orders must have same behaviour]

Sections 1.5.1–1.5.3 discuss the hardware generated by processes.

Sections 1.6–1.6.7 discuss the behaviour and execution of processes.


1.5.1 Combinational Process vs Clocked Process

Each well-written synthesizable process is either combinational or clocked. Some synthesizable
processes that do not conform to our coding guidelines are both combinational and clocked. For
example, in a flip-flop with an asynchronous reset, the output is a combinational function of the
reset signal and a clocked function of the data input signal. We will deal only with processes
that follow our coding conventions, and so we will continue to say that each process is either
combinational xor clocked.
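
For reference, the asynchronous-reset flip-flop mentioned above is typically coded roughly as
follows (a sketch; this style does not follow the course's combinational-xor-clocked convention):

```vhdl
process (clk, rst)
begin
  if rst = '1' then              -- q responds to rst without waiting for a
    q <= '0';                    -- clock edge: combinational in rst
  elsif rising_edge(clk) then
    q <= d;                      -- q is clocked with respect to d
  end if;
end process;
```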


Combinational process:       . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..

•    Executing the process takes part of one clock cycle
•    Target signals are outputs of combinational circuitry
•    A combinational process must have a sensitivity list
•    A combinational process must not have any wait statements
•    A combinational process must not have any rising_edge or falling_edge calls
•    The hardware for a combinational process is just combinational circuitry


Clocked process:      ..................................................................... .

•    Executing the process takes one (or more) clock cycles
•    Target signals are outputs of flip-flops
•    Process contains one or more wait or if rising_edge statements
•    Hardware contains combinational circuitry and flip-flops


           Note:         Clocked processes are sometimes called “sequential processes”,
           but this can be easily confused with “sequential statements”, so in E&CE 327
           we’ll refer to synthesizable processes as either “combinational” or “clocked”.


Example Processes       ................................................................... .


         Combinational Process

process (a,b,c)
begin
  p1 <= a;
  if (b = c) then
    p2 <= b;
  else
    p2 <= a;
  end if;
end process;


                   Clocked Processes
process
begin
  wait until rising_edge(clk);
  b <= a;
end process;
process (clk)
begin
  if rising_edge(clk) then
    b <= a;
  end if;
end process;


1.5.2 Latch Inference

The semantics of VHDL require that if a signal is assigned a value on some passes through a
process and not on other passes, then on a pass through the process when the signal is not assigned
a value, it must maintain its value from the previous pass.


process (a, b, c)
begin
  if (a = '1') then
    z1 <= b;
    z2 <= b;
  else
    z1 <= c;
  end if;
end process;

   [Waveforms for a, b, c, z1, and z2]

                             Figure 1.14: Example of latch inference

When a signal’s value must be stored, VHDL infers a latch or a flip-flop in the hardware to store
the value.

If you want a latch or a flip-flop for the signal, then latch inference is good.

If you want combinational circuitry, then latch inference is bad.


Loop, Latch, Flop      .................................................................... .


   [Circuits, left to right: a combinational loop (a mux selects between b and the
    fed-back output z, controlled by a); a latch (data b, enable a, output z); and a
    flip-flop (D input b, clock a, output Q driving z).]


     Question:       Write VHDL code for each of the above circuits


     Answer:


        combinational loop
              if a = ’1’ then
                z <= b;
              else
                z <= z;
              end if;
        latch
              if a = ’1’ then
                z <= b;
              end if;
        flop
              if rising_edge(a) then
                z <= b;
              end if;



Causes of Latch Inference           ............................................................ .


Usually, “latch inference” refers to the unintentional creation of latches.

The most common cause of unintended latch inference is missing assignments to signals in if-then-
else and case statements.

Latch inference happens during elaboration. When using the Synopsys tools, look for:
                              Inferred memory devices

in the output or log files.
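
One common way to avoid unintended latch inference is to give every target signal a default
assignment at the top of the process, so that each signal is assigned on every pass; a sketch
(signal names are illustrative):

```vhdl
process (a, b, c)
begin
  z1 <= c;          -- defaults: both targets assigned on every pass
  z2 <= c;
  if a = '1' then
    z1 <= b;        -- a later assignment in the same pass overrides
    z2 <= b;        -- the default; no storage is inferred
  end if;
end process;
```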


1.5.3 Combinational vs Flopped Signals

Signals assigned to in combinational processes are combinational.

Signals assigned to in clocked processes are outputs of flip-flops.



1.6 Details of Process Execution
In this section we go through the detailed semantics of how processes execute. These semantics
form the foundation for the simulation and synthesis of VHDL. The semantics define the simulation
behaviour, and the duty of synthesis is to produce hardware that has the same behaviour as the
simulation of the original VHDL code.


1.6.1 Simple Simulation

Before diving into the details of processes, we briefly review gate-level simulation with a simple
example, which we will then explore in excruciating detail through the semantics of VHDL.

With knowledge of just basic gate-level behaviour, we simulate the circuit below with waveforms
for a and b and calculate the behaviour for c, d, and e.
   [Circuit: c = a AND b;  d = NOT c;  e = b AND d.
    Waveforms: stimulus for a and b with events at 0 ns, 10 ns, 12 ns, and 15 ns;
    the waveforms for c, d, and e are to be derived.]

Different Programs, Same Behaviour          . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..


There are many different VHDL programs that will synthesize to this circuit. Three examples are:


process (a,b)                    process (a,b,c,d)                 process (a,b)
begin                            begin                             begin
  c <= a and b;                    c <= a and b;                     c <= a and b;
end process;                       d <= not c;                     end process;
process (b,c,d)                    e <= b and d;                   process (c)
begin                            end process;                      begin
  d <= not c;                                                        d <= not c;
  e <= b and d;                                                    end process;
end process;                                                       process (b,d)
                                                                   begin
                                                                     e <= b and d;
                                                                   end process;

The goal of the VHDL semantics is that all of these programs will have the same behaviour.
The two main challenges in making this happen are that a value change on a signal must propagate
instantaneously and that all gates must operate in parallel. We will return to these points in
section 1.6.3.


1.6.2 Temporal Granularities of Simulation

There are several different granularities of time to analyze VHDL behaviour. In this course, we
will discuss three major granularities: clock cycles, timing simulation, and “delta cycles”.
clock-cycle
       • smallest unit of time is a clock cycle
       • combinational logic has zero delay
       • flip-flops have a delay of one clock cycle
       • used for simulation early in the design cycle
       • fastest simulation run times
timing simulation
     • smallest unit of time is a nanosecond, picosecond, or femtosecond
     • combinational logic and wires have delays as computed by timing analysis tools
     • flip-flops have setup, hold, and clock-to-Q timing parameters
     • used for simulation when fine-tuning the design and confirming that timing constraints
       are satisfied
     • slow simulation run times for large circuits
delta cycles
     • units of time are artifacts of VHDL semantics and simulation software
     • simulation cycles, delta cycles, and simulation steps are infinitesimally small amounts of
        time
     • VHDL semantics are defined in terms of these concepts

In assignments and exams, you will need to be able to simulate VHDL code at each of the three
different levels of temporal granularity. In the laboratories and project, you will use simulation


programs for both clock-cycle simulation and timing simulation. We don’t have access to a pro-
gram that will produce delta-cycle waveforms, but if anyone is looking for a challenging co-op job
or fourth-year design project....

For the remainder of section 1.6, we’ll look at only the delta cycle view of the world.


1.6.3 Intuition Behind Delta-Cycle Simulation

Zero-delay simulation might appear to be simpler than simulation with delays through gates
(timing simulation), but in reality, zero-delay simulation algorithms are more complicated than
algorithms for timing simulation. The reason is that in zero-delay simulation, a sequence of
dependent events must appear to happen instantaneously (in zero time). In particular, the effect
of an event must propagate instantaneously through the combinational circuitry.

Two fundamental rules for zero-delay simulation:

   1. events appear to propagate through combinational circuitry instantaneously.
   2. all of the gates appear to operate in parallel

To make it appear that events propagate instantaneously, VHDL introduces an artificial unit of
time, the delta cycle, to represent an infinitesimally small amount of time. In each delta cycle,
every gate in the circuit samples its inputs, computes its result, and drives its output signal
with the result.

Because software executes serially, a simulator cannot simulate multiple gates in parallel.
Instead, the simulator must simulate the gates one at a time, while making the waveforms appear
as if all of the gates were simulated in parallel. In each delta cycle, the simulator simulates
every gate whose input changed in the previous delta cycle. To preserve the illusion that the
gates ran in parallel, the effect of simulating a gate remains invisible until the end of the
delta cycle.


1.6.4 Definitions and Algorithm

1.6.4.1 Process Modes

An architecture contains a set of processes. Each process is in one of the following modes: active,
suspended, or postponed.

         Note: “postponed”     This use of the word “postponed” differs from that in
         the VHDL Standard; we won’t be using postponed processes as defined in the
         Standard. Our “postponed” corresponds to what some operating systems call
         “ready”: a process that is ready to execute.



                    activate              suspend
      postponed ------------> active ------------> suspended
          ^                                            |
          +--------------------------------------------+
                             resume

• Active
  – Currently executing
  – A process stays active until it hits a wait statement or its sensitivity
    list, at which point it suspends
• Suspended
  – Nothing to currently execute
  – A process stays suspended until the event that it is waiting for occurs:
    either a change in a signal on its sensitivity list or the condition in
    a wait statement
• Postponed
  – Wants to execute, but not currently active
  – A process stays postponed until the simulator chooses it from the pool
    of postponed processes

                                   Figure 1.15: Process modes



1.6.4.2 Simulation Algorithm

The algorithm presented here is a simplification of the actual algorithm in Section 12.6 of the
VHDL Standard. The most significant simplification is that this algorithm does not support
delayed assignments. To support delayed assignments, each signal's provisional value would be
generalized to an event wheel: a list containing the times and values of multiple provisional
assignments in the future.

A somewhat ironic note: only six of the two hundred pages in the VHDL Standard are devoted to
the semantics of executing processes.


The Algorithm      ....................................................................... .


Simulations start at step 1 with all processes postponed and all signals at their default value
(e.g., 'U' for std_logic).


  1. While there are postponed processes:
      (a) Pick one or more postponed processes to execute (become active).
      (b) As a process executes, assignments to signals are provisional: new values do
          not become visible until step 3.
      (c) A process executes until it hits its sensitivity list or a wait statement, at which
          point it suspends. At a wait statement, the process will suspend even if the
          condition is true during the current simulation cycle.
      (d) Processes that become suspended stay suspended until there are no more
          postponed or active processes.

  2. Each process looks at the signals that changed value (provisional value differs from
     visible value) and at the simulation time. If a signal in a process's sensitivity list
     changed value, or if the wait condition on which a process is suspended became true,
     then the process resumes (becomes postponed).
  3. Each signal that changed value is updated with its provisional value (the provisional
     value becomes visible).
  4. If there are no postponed processes, then increment simulation time to the next
     scheduled event.
        Note: Parallel execution       In n-threaded execution, at most n processes are
        active at a time.


1.6.4.3 Delta-Cycle Definitions

     Definition simulation step: Executing one sequential assignment or process mode
       change.


     Definition simulation cycle: The operations that occur in one iteration of the simulation
       algorithm.


     Definition delta cycle: A simulation cycle that does not advance simulation time.
       Equivalently: A simulation cycle with zero-delay assignments where the assignment
       causes a process to resume.


     Definition simulation round: A sequence of simulation cycles that all have the same
       simulation time. Equivalently: a contiguous sequence of zero or more delta cycles
       followed by a simulation cycle that increments time (i.e., the simulation cycle is not a
       delta cycle).


           Note: Official and unofficial terminology      Simulation cycle and delta cycle
           are official definitions in the VHDL Standard. Simulation step and simulation
           round are not standard definitions. They are used in E&CE 327 because we
           need words to associate with the concepts that they describe.


1.6.5 Example 1: Process Execution (Bamboozle)

This example (Bamboozle) and the next example (Flummox, section 1.6.6) are very similar. The
VHDL code for the circuit is slightly different, but the hardware that is generated is the same.
The stimuli for signals a and b also differ.



library ieee;
use ieee.std_logic_1164.all;

entity bamboozle is
end bamboozle;

architecture main of bamboozle is
  signal a, b, c, d, e : std_logic;
begin
  procA : process (a, b) begin
    c <= a AND b;
  end process;
  procB : process (b, c, d)
  begin
    d <= NOT c;
    e <= b AND d;
  end process;
  procC : process
  begin
    a <= '0';
    b <= '1';
    wait for 10 ns;
    a <= '1';
    wait for 2 ns;
    b <= '0';
    wait for 3 ns;
    a <= '0';
    wait for 20 ns;
  end process;
end main;
                 Figure 1.16: Example bamboozle circuit for process execution


Initial conditions (Shown in slides, not in notes)
Step 1(a): Activate procA (Shown in slides, not in notes)

   [Trace diagram, Step 1(a): Activate procA. procA is active; procB and procC are
    postponed; all signals are 'U'; simulation time is 0 ns.]

Step 1(c): Suspend procA (Shown in slides, not in notes)
Step 1(a): Activate procC (Shown in slides, not in notes)
Step 1(b): Provisional assignment to a (Shown in slides, not in notes)
Step 1(b): Provisional assignment to b (Shown in slides, not in notes)

   [Trace diagram, Step 1(b): Provisional assignment to b. procA is suspended; procC
    is active; a and b hold provisional values '0' and '1'.]


Step 1(a): Activate procB (Shown in slides, not in notes)
Step 1(b): Provisional assignment to d (Shown in slides, not in notes)
Step 1(b): Provisional assignment to e (Shown in slides, not in notes)
Step 1(c): Suspend procB (Shown in slides, not in notes)

   [Trace diagram, Step 1(c): Suspend procB. procA, procB, and procC are all suspended;
    provisional values have been assigned to c, d, and e.]


   [Trace diagram: All processes suspended; the simulation cycle ends.]

Step 3: Update signal values (Shown in slides, not in notes)



   [Trace diagram, Step 3: Update signal values. a and b become '0' and '1'; procA and
    procB resume (become postponed).]


[Delta-cycle chart: Step 4: simulation time remains at 0 ns (delta cycle)]
1.6.5 Example 1: Process Execution (Bamboozle)                                                      35


Step 1(a): Activate procA (Shown in slides, not in notes)
Step 1(b): Provisional assignment to c (Shown in slides, not in notes)
Step 1(c): Suspend procA (Shown in slides, not in notes)
Step 1(a): Activate procB (Shown in slides, not in notes)
Step 1(b): Provisional assignment to d (Shown in slides, not in notes)
Step 1(b): Provisional assignment to e (Shown in slides, not in notes)
Step 1(c): Suspend procB (Shown in slides, not in notes)

[Delta-cycle chart: all processes suspended]

Step 3: Update signal values (Shown in slides, not in notes)
Step 4: Simulation time remains at 0ns — delta cycle (Shown in slides, not in notes)
Compact simulation cycle (Shown in slides, not in notes)
Begin next simulation cycle (Shown in slides, not in notes)
Step 1(a): Activate procB (Shown in slides, not in notes)
Step 1(b): Provisional assignment to d (Shown in slides, not in notes)
Step 1(b): Provisional assignment to e (Shown in slides, not in notes)
Step 1(c): Suspend procB (Shown in slides, not in notes)
All processes suspended (Shown in slides, not in notes)
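The step captions above can be sketched as a small Python model of one simulation round. This is an illustrative model, not a real VHDL simulator: the process bodies, sensitivity lists, and initial values follow the Bamboozle example, signals are modelled as 0/1 integers, and the exact bookkeeping inside a cycle is simplified.

```python
def simulation_round(processes, sensitivity, signals, resumed):
    """Run delta cycles until all processes are suspended."""
    while resumed:
        # Step 1: execute each resumed process; assignments are provisional
        # and do not become visible until Step 3.
        provisional = {}
        for name in resumed:
            provisional.update(processes[name](signals))
        # Step 2: check sensitivity lists; a process resumes only if a
        # signal it is sensitive to will actually change value.
        changed = {s for s, v in provisional.items() if signals[s] != v}
        resumed = [p for p, sl in sensitivity.items() if changed & set(sl)]
        # Step 3: update signal values.
        signals.update(provisional)
        # Step 4: if any process resumed, repeat as a delta cycle
        # (simulation time does not advance).
    return signals

# Bamboozle-style processes: c <= a AND b;  d <= NOT c;  e <= b AND d;
procs = {
    'procA': lambda s: {'c': s['a'] & s['b']},
    'procB': lambda s: {'d': int(not s['c']), 'e': s['b'] & s['d']},
}
sens = {'procA': ('a', 'b'), 'procB': ('b', 'c', 'd')}
# State just after procC has driven a <= '0' and b <= '1' at time 0.
result = simulation_round(procs, sens,
                          {'a': 0, 'b': 1, 'c': 0, 'd': 0, 'e': 0},
                          ['procA', 'procB'])
```

Running this settles in three delta cycles, matching the charts: d becomes 1, then e becomes 1, and since no process is sensitive to e, all processes stay suspended.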



[Delta-cycle chart: all processes suspended]

Step 3: Update signal values (Shown in slides, not in notes)

[Delta-cycle chart: Step 3: update signal values]

Compact simulation cycle (Shown in slides, not in notes)
Begin next simulation cycle (Shown in slides, not in notes)
Step 1(a): Activate procB (Shown in slides, not in notes)
Step 1(b): Provisional assignment to d (Shown in slides, not in notes)
Step 1(b): Provisional assignment to e (Shown in slides, not in notes)
Step 1(c): Suspend procB (Shown in slides, not in notes)



[Delta-cycle chart: Step 1(c): suspend procB]

Step 3: Update signal values (Shown in slides, not in notes)

[Delta-cycle chart: Step 3: update signal values]

Compact simulation cycle (Shown in slides, not in notes)


Begin next simulation cycle (Shown in slides, not in notes)
Step 1: No postponed processes (Shown in slides, not in notes)

[Delta-cycle chart: Step 1: no postponed processes]

Compact simulation cycle (Shown in slides, not in notes)


Begin next simulation cycle (Shown in slides, not in notes)
Step 1(a): Activate procC (Shown in slides, not in notes)
Step 1(b): Provisional assignment to a (Shown in slides, not in notes)
Step 1(c): Suspend procC (Shown in slides, not in notes)
Step 2: Check sensitivity list; resume processes (Shown in slides, not in notes)
Step 3: Update signal values (Shown in slides, not in notes)

[Delta-cycle chart: Step 3: update signal values]

Compact simulation cycle (Shown in slides, not in notes)


1.6.6 Example 2: Process Execution (Flummox)

This example is a variation of the Bamboozle example from section 1.6.5.



library ieee;
use ieee.std_logic_1164.all;

entity flummox is
end flummox;

architecture main of flummox is
  signal a, b, c, d, e : std_logic;
begin
  proc1 : process (a, b, c) begin
    c <= a AND b;
    d <= NOT c;
  end process;
  proc2 : process (b, d)
  begin
    e <= b AND d;
  end process;
  proc3 : process
  begin
    a <= '1';
    b <= '0';
    wait for 3 ns;
    b <= '1';
    wait for 99 ns;
  end process;
end main;
                              Figure 1.17: Example flummox circuit for process execution

[Delta-cycle chart for flummox: simulation rounds at 0 ns (+1δ, +2δ, +3δ), 3 ns, and 102 ns, showing proc1, proc2, and proc3 activity and the signals a through e]


To get a more natural view of the behaviour of the signals, we draw just the waveforms and use a
timescale of nanoseconds plus delta cycles:
[Waveforms of a through e on a timescale of nanoseconds plus delta cycles, at 0 ns and 3 ns (+1δ, +2δ, +3δ) and 102 ns]




Finally, we draw the behaviour of the signals using the standard time scale of nanoseconds. Notice
that the delta cycles within a simulation round all collapse to the left, so the signals change value
exactly at the nanosecond boundaries. Also, the glitch on e disappears.


   Answer:


[Waveforms of a through e on the standard nanosecond time scale, 0 ns to 102 ns]




Note and Questions      ..........................................................................


         Note:     If a signal is updated with the same value it had in the previous sim-
         ulation cycle, then it does not change, and therefore does not trigger processes
         to resume.
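This rule can be sketched in a few lines of Python (an illustrative model; the function name is invented for this sketch):

```python
def events_from(current, provisional):
    """Signals whose provisional value differs from their current value;
    only these generate events and can resume sensitive processes."""
    return {s for s, v in provisional.items() if current[s] != v}

# d is re-assigned its current value, so it creates no event;
# only e, which changes from 0 to 1, does.
events = events_from({'d': 1, 'e': 0}, {'d': 1, 'e': 1})
```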


   Question: What are the different granularities of time that occur when doing
     delta-cycle simulation?


     Answer:
        simulation step, delta cycle, simulation cycle, simulation round


     Question: What is the order of granularity, from finest to coarsest, amongst the
       different granularities related to delta-cycle simulation?


     Answer:
   Same order as listed just above. Note: delta cycles have a finer granularity
   than simulation cycles, because delta cycles do not advance time, while
   simulation cycles that are not delta cycles do advance time.


1.6.7 Example: Need for Provisional Assignments

This is an example of processes where updating signals during a simulation cycle leads to different
results for different process execution orderings.


architecture main of swindle is
  signal a, b, c, d : std_logic;
begin
  p_c: process (a, b) begin
    c <= a AND b;
  end process;
  p_d: process (a, c) begin
    d <= a XOR c;
  end process;
end main;


                 Figure 1.18: Circuit to illustrate need for provisional assignments

     1. Start with all signals at ’0’.
     2. Simultaneously change to a = ’1’ and b = ’1’.




If assignments are not visible within the same simulation cycle (correct: i.e., provisional assignments are used)



[Timing diagrams for both scheduling orders with provisional assignments]

If p_c is scheduled before p_d, then d will have a '1' pulse. If p_d is scheduled
before p_c, then d will also have a '1' pulse.


                 If assignments are visible within same simulation cycle (incorrect)



[Timing diagrams for both scheduling orders without provisional assignments]

If p_c is scheduled before p_d, then d will stay constant at '0'. If p_d is
scheduled before p_c, then d will have a '1' pulse.
With provisional assignments, both orders of scheduling processes result in the same behaviour
on all signals. Without provisional assignments, different scheduling orders result in different
behaviour.
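The effect of the two policies can be checked with a short Python sketch of the swindle circuit (an illustrative model; `run_cycle` and its arguments are invented for this sketch, and signals are 0/1 integers):

```python
def run_cycle(signals, order, provisional=True):
    """Execute p_c and p_d once each, in the given order.

    With provisional=True, both processes read the values the signals had
    at the start of the cycle; with provisional=False, an assignment is
    visible to processes that run later in the same cycle."""
    s = dict(signals)
    pending = {}
    for proc in order:
        read = signals if provisional else s
        val = ({'c': read['a'] & read['b']} if proc == 'p_c'
               else {'d': read['a'] ^ read['c']})
        if provisional:
            pending.update(val)    # deferred to the end of the cycle
        else:
            s.update(val)          # visible immediately
    s.update(pending)
    return s

# The scenario: all signals start at '0', then a and b change to '1'.
start = {'a': 1, 'b': 1, 'c': 0, 'd': 0}
prov_cd = run_cycle(start, ['p_c', 'p_d'])
prov_dc = run_cycle(start, ['p_d', 'p_c'])
imm_cd = run_cycle(start, ['p_c', 'p_d'], provisional=False)
imm_dc = run_cycle(start, ['p_d', 'p_c'], provisional=False)
```

With provisional assignments, both orders give the same result (d rises to 1 in this cycle, matching the '1' pulse); without them, the two orders disagree on d.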


1.6.8 Delta-Cycle Simulations of Flip-Flops

This example illustrates the delta-cycle simulation of a flip-flop. Notice how the delta-cycle simu-
lation captures the expected behaviour of the flip-flop: the signal q changes at the same time (10 ns)
as the rising edge of the clock.
p_clk : process begin
          clk <= '0';
          wait for 10 ns;
          clk <= '1';
          wait for 10 ns;
        end process;

p_a : process begin
        a <= '0';
        wait for 15 ns;
        a <= '1';
        wait for 20 ns;
      end process;

flop : process ( clk ) begin
         if rising_edge( clk ) then
           q <= a;
         end if;
       end process;
[Delta-cycle chart: p_a, p_clk, and flop activity with signals a, clk, and q, from 0 ns to 35 ns]

      Redraw with Normal Time Scale                    .......................................................


      To clarify the behaviour, we redraw the same simulation using a normal time scale.



[Waveforms of a, clk, and q on a normal time scale, 0 ns to 35 ns]


Back-to-Back Flops              .................................................................. .


In the previous simulation, the input to the flip-flop (a) changed several nanoseconds before the
rising edge of the clock. In zero-delay simulation, the output of a flip-flop changes exactly on
the rising edge of the clock. This means that the input to the next flip-flop will change at exactly
the same time as a rising edge. This example illustrates how delta-cycle simulation handles this
situation correctly.
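The key point, that both registers compute from their pre-edge input values, can be sketched in Python (an illustrative model of the flops process below, not VHDL; `clock_edge` is invented for this sketch):

```python
def clock_edge(signals):
    """One rising edge of flops: q1 <= a; q2 <= q1;
    both right-hand sides read the values from before the edge."""
    provisional = {'q1': signals['a'], 'q2': signals['q1']}
    return {**signals, **provisional}

# a changed to 1 before the edge; on this edge q1 captures a,
# while q2 captures q1's old value, just as in real hardware.
s1 = clock_edge({'a': 1, 'q1': 0, 'q2': 0})   # q1 = 1, q2 = 0
s2 = clock_edge(s1)                           # one edge later, q2 = 1
```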
p_clk : process begin
          clk <= '0';
          wait for 10 ns;
          clk <= '1';
          wait for 10 ns;
        end process;

p_a : process begin
        a <= '0';
        wait for 15 ns;
        a <= '1';
        wait for 20 ns;
      end process;

flops : process ( clk ) begin
          if rising_edge( clk ) then
            q1 <= a;
            q2 <= q1;
          end if;
        end process;

[Delta-cycle chart: p_a, p_clk, and flops activity with signals a, clk, q1, and q2, from 10 ns to 35 ns]

Redraw with Normal Time Scale                    .......................................................


To clarify the behaviour, we redraw the same simulation using a normal time scale.
[Waveforms of a, clk, q1, and q2 on a normal time scale, 0 ns to 35 ns]


External Inputs and Flops     ............................................................ .


In our work so far with delta-cycle simulation, we have worked through the mechanics of simula-
tion. This example applies knowledge of delta-cycle simulation at a conceptual level. We could
answer the question by thinking about the semantics of delta-cycle simulation or by mechanically
doing the simulation.


     Question:   Do the signals b1 and b2 have the same behaviour from 20–30 ns?


architecture mathilde of sauvé is
  signal clk, a1, a2, b1, b2 : std_logic;
begin
  proc_clk : process begin
    clk <= '1';
    wait for 10 ns;
    clk <= '0';
    wait for 10 ns;
  end process;
  proc_a1 : process begin
    wait for 20 ns;
    a1 <= '1';
  end process;
  proc_a2 : process begin
    wait until rising_edge(clk);
    a2 <= '1';
  end process;
  proc_b : process begin
    wait until rising_edge( clk );
    b1 <= a1;
    b2 <= a2;
  end process;
end architecture;


     Answer:


   The signals b1 and b2 will have the same behaviour if a1 and a2 have the
   same behaviour. The difference in the code between a1 and a2 is that a1 is
   waiting for 20 ns and a2 is waiting until a rising edge of the clock. There is a
   rising edge of the clock at 20 ns, so we might be tempted to conclude
   (incorrectly) that both a1 and a2 transition from ’U’ to 1 at exactly 20 ns and
   therefore have exactly the same behaviour.


     The difference between the behaviour of a1 and a2 is that in the first
     simulation cycle for 20 ns, the process for a1 becomes postponed, while the
     process for a2 becomes postponed only after the rising edge of clock.

     The signal a1 is waiting for 20ns, so in the first simulation cycle for 20ns, the
     process for a1 becomes postponed. In the second simulation cycle for 20ns,
     the clock toggles from 0 to 1 and a1 toggles from ’U’ to 1. The rising edge
     on the clock causes the processes for a2, b1, and b2 to become postponed.

     In the third simulation cycle for 20ns:
     • a2 toggles from ’U’ to 1.
     • b1 sees the value of 1 for a1, because a1 became 1 in the first simulation
        cycle.
     • b2 sees the old value of ’U’ for a2, because the process for a2 did not run
        in the second simulation cycle.

     [Waveform: delta-cycle simulation from 0 ns to 30 ns, showing the sim
     rounds, sim cycles, and delta cycles at 20 ns, 20 ns+1δ, and 20 ns+2δ;
     the modes (P = postponed, A = active, S = suspended) of proc_clk,
     proc_a1, proc_a2, and proc_b; and the signals clk, a1, a2, b1, b2.]


  Testbenches and Clock Phases                  ........................................................ .


  env :        process begin
                 a   <= '1';
                 clk <= '0';
                 wait for 10 ns;
                 a   <= '0';
                 clk <= '1';
                 wait for 10 ns;
               end process;

  flop : process ( clk ) begin
           if rising_edge( clk ) then
             q1 <= a;
           end if;
         end process;
  [Waveform: delta-cycle simulation from 0 ns to 20 ns, showing the sim
  rounds, sim cycles, and delta cycles; the modes of env, flop1, and flop2;
  and the signals a, clk, q1.]




  Redraw with Normal Time Scale                   .......................................................

  [Waveform: a, clk, and q1 from 0 ns to 20 ns, redrawn at the normal time
  scale without delta cycles.]


        Note: Testbench signals       For consistent results across different simulators,
        across simulation scripts vs. test benches, and across timing simulation vs.
        zero-delay simulation, do not change signals in your testbench or script at
        the same time as the clock changes.


     [Three waveforms of a, clk, and q1 from 0 ns to 60 ns:
      1. a is the output of a clocked or combinational process.
      2. a is the output of a timed process (testbench or environment) and
         changes at the same time as the clock: POOR DESIGN.
      3. a is the output of a timed process (testbench or environment) and
         changes away from the clock edge: GOOD DESIGN.]


1.7 Register-Transfer-Level Simulation
Delta-cycle simulation is very tedious for both humans and computers. For many circuits, the
complexity of delta-cycle simulation is not needed and register-transfer-level simulation, which is
much simpler, can be used instead.

The major complexities of delta-cycle simulation come from running a process multiple times
within a single simulation round and from keeping track of the modes of the processes. Register-
transfer-level simulation avoids both of these complexities. By evaluating each signal only once
per simulation round, an entire simulation round can be reduced to a single column in a timing
diagram. The disadvantage of register-transfer-level simulation is that it does not work for all
VHDL programs; in particular, it does not support combinational loops.
[Side-by-side waveforms of the signals a, b, c, d, and e: on the left, the
delta-cycle simulation, with simulation cycles at 0 ns, 0 ns+1δ, 0 ns+2δ,
3 ns, 3 ns+1δ, 3 ns+2δ, 3 ns+3δ, and 102 ns; on the right, the register-
transfer-level simulation, in which each moment of real time is a single
column.]


1.7.1 Overview

In delta-cycle simulations, we often simulated the same process multiple times within the same
simulation round. Looking at the circuit, though, we can mentally calculate the output value
by evaluating each gate only once per simulation round. For both humans and computers (or the
humans waiting for results from computers), it is desirable to avoid the wasted work of simulating
a gate when the output will remain at 'U' or will change again later in the same simulation round.

In register-transfer-level simulation, we evaluate each gate only once per simulation round. Register-
transfer-level simulation is simpler and faster than delta-cycle simulation, because it avoids delta
cycles and provisional assignments.

In delta-cycle simulation, we evaluate a gate multiple times in a single simulation round if the
process that drives the gate is active in multiple simulation cycles, which happens when the process
is triggered in multiple simulation cycles. To avoid this, we must evaluate a signal only after all of
the signals that it depends on have stable values, that is, the signals will not change value later in
the simulation round.

A combinational loop is a circuit that contains a cyclic path through the circuit that includes only
combinational gates. Combinational loops can cause signals to oscillate, which, in delta-cycle
simulation with zero-delay assignments, corresponds to an infinite sequence of delta cycles. We
immediately see that when doing zero-delay simulation of a combinational loop such as
a <= not(a);, the change on a will trigger the process to re-run and re-evaluate a an infinite
number of times. Hence, register-transfer-level simulation does not support combinational loops.

To make register-transfer-level simulation work, we preprocess the VHDL program and transform
it so that each process is dependent upon only those processes that appear before it. This dependency
ordering is called topological ordering. If a circuit has combinational loops, we cannot sort the
processes into a topological order.
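Topological ordering, and its failure on a combinational loop, can be sketched with a small
Python routine (a hypothetical helper, not part of any VHDL tool): map each combinational
target signal to the set of signals its process reads, then repeatedly emit every target whose
sources are already stable.

```python
def topo_order(reads):
    """reads: combinational target signal -> set of signals its process reads.
    Signals that are not targets (primary inputs, flip-flop outputs) are
    treated as already stable."""
    order, placed = [], set()
    remaining = dict(reads)
    while remaining:
        ready = sorted(t for t, srcs in remaining.items()
                       if all(s in placed or s not in reads for s in srcs))
        if not ready:                    # every remaining target waits on another
            raise ValueError("combinational loop: no topological order exists")
        for t in ready:
            order.append(t)
            placed.add(t)
            del remaining[t]
    return order

# c <= a AND b;  d <= NOT c;  e <= b AND d
assert topo_order({'c': {'a', 'b'}, 'd': {'c'}, 'e': {'b', 'd'}}) == ['c', 'd', 'e']

# a <= not(a); has no topological order
try:
    topo_order({'a': {'a'}})
except ValueError as err:
    print(err)
```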

The register-transfer level is a coarser level of temporal abstraction than the delta-cycle level.
In delta-cycle simulation, many delta cycles can elapse without an increment in real time (e.g.,
nanoseconds). In register-transfer-level simulation, all of the events that take place in the same
moment of real time take place at the same moment in the simulation. In other words, all of the
events that take place at the same time are drawn in the same column of the waveform diagram.

Register-transfer-level simulation can be done for any legal VHDL code, synthesizable or
unsynthesizable, so long as the code does not contain combinational loops. For any piece of VHDL
code without combinational loops, the register-transfer-level simulation and the delta-cycle
simulation will have the same value for each signal at the end of each simulation round.

By sorting the processes in topological order, when we execute a process, all of the signals that the
process depends on will have already been evaluated, and so we know that we are reading the final,
stable values that each signal will have for that moment in time. This is good, because for most
processes, we want to read the most recent values of signals. The exceptions are timed processes
that are dependent upon other timed processes running at the same moment in time and clocked
processes that are dependent upon other clocked processes.

process begin
  a <= '0';
  wait for 10 ns;
  a <= '1';
  ...
end process;

process begin
  b <= '0';
  wait for 10 ns;
  b <= a;
  ...
end process;

     Question:   In this code, what value should b have at 10 ns?

     Answer:
       Both processes will execute in the same simulation cycle at 10 ns. The
       statement b <= a; will see the value of a from the previous simulation
       cycle, which is before a <= '1'; is evaluated. The signal b will be
       '0' at 10 ns.
As the above example illustrates, if a process reads the values of signals from processes that
resume at the same time, it must read the previous value of those signals. Similarly, if a
clocked process reads the values of signals from processes that are sensitive to the same clock,
those processes will all resume in the same simulation cycle, the cycle immediately after the
rising edge of the clock (assuming that the processes use if rising_edge or wait until
rising_edge statements). Because the processes run in the same simulation cycle, they all read
the previous values of the signals that they depend on. If this were not the case, then the VHDL
code for a pair of back-to-back flip-flops would not operate correctly, because the output of the
first flip-flop would appear immediately at the output of the second flip-flop.

Simulation rounds begin with incrementing time, which triggers timed processes. Therefore, the
first processes in the topological order are the timed processes. Timed processes may be run in any
order, and they read the previous values of signals that they depend on. This gives the same effect
as in delta-cycle simulation, where the timed processes would run in the same simulation cycle and
read the values that signals had before the simulation cycle began.

We then sort the clocked and combinational processes based on their dependencies, so that each
process appears (is run) after all of the processes on which it depends.

Although a clocked process may read many signals, we say that a clocked process is dependent
upon only its clock signal. It is the change in the clock signal that causes the process to resume.
So, as long as the process is run after the clock signal is stable, we can be sure that it will not need
to be run again at this time step. Clocked processes may be run in any order. They read the current
value of their clock signal and the previous value of the other signals that they depend on. As
with timed processes, this gives the same effect as in delta-cycle simulation, where the clock edge
would trigger the clocked processes to run in the same simulation cycle and the processes would
read the values that signals had before the simulation cycle began.


1.7.2 Technique for Register-Transfer Level Simulation
     1. Pre-processing

         (a) Separate processes into combinational and non-combinational (clocked and timed)
         (b) Decompose each combinational process into separate processes with one target signal
             per process
         (c) Sort processes into topological order based on dependencies

     2. For each clock cycle or unit of time:

         (a) Run non-combinational processes in any order. Non-combinational assignments read
             from earlier clock cycle / time step, except that clocked processes read the current value
             of the clock signal.
         (b) Run combinational processes in topological order. Combinational assignments read
             from current clock cycle / time step.
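The two per-cycle steps can be sketched in Python. As a minimal illustration (with hypothetical
signal values), one clock cycle for a flip-flop q <= d feeding an inverter n <= not q:

```python
# Step 2(a): clocked assignments read values from the previous clock cycle.
# Step 2(b): combinational assignments, in topological order, read current values.
prev = {'d': 1, 'q': 0, 'n': 1}          # values at the end of the last cycle

cur = dict(prev)
cur['q'] = prev['d']                     # clocked:        q <= d (previous d)
cur['n'] = 1 - cur['q']                  # combinational:  n <= not q (current q)
assert cur == {'d': 1, 'q': 1, 'n': 0}
```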


Combinational Process Decomposition             ................................................ .


  Original code:

  proc(a,b,c)
    if a = '1' then
      d <= b;
      e <= c;
    else
      d <= not b;
      e <= b and c;
    end if;
  end process;

  After decomposition:

  proc(a,b,c)
    if a = '1' then
      d <= b;
    else
      d <= not b;
    end if;
  end process;

  proc(a,b,c)
    if a = '1' then
      e <= c;
    else
      e <= b and c;
    end if;
  end process;


1.7.3 Examples of RTL Simulation

1.7.3.1 RTL Simulation Example 1

We revisit an earlier example from delta-cycle simulation, but change the code slightly and do
register-transfer-level simulation.


  1. Original code:

     proc1: process (a, b, c) begin
         d <= NOT c;
         c <= a AND b;
       end process;

     proc2: process (b, d) begin
         e <= b AND d;
       end process;

     proc3: process begin
         a <= '1';
         b <= '0';
         wait for 3 ns;
         b <= '1';
         wait for 99 ns;
       end process;


  2. Decompose combinational processes into single-target processes:


     Decomposed:

     proc1d: process (c) begin
         d <= NOT c;
       end process;

     proc1c: process (a, b) begin
         c <= a AND b;
       end process;

     proc2: process (b, d) begin
         e <= b AND d;
       end process;

     Sorted:

     proc1c: process (a, b) begin
         c <= a AND b;
       end process;

     proc1d: process (c) begin
         d <= NOT c;
       end process;

     proc2: process (b, d) begin
         e <= b AND d;
       end process;
     3. To sort combinational processes into topological order, move proc1d after proc1c, be-
        cause d depends on c.
     4. Run timed process (proc3) until suspend at wait for 3 ns;.
        • The signal a gets ’1’ from 0 to 3 ns.
        • The signal b gets ’0’ from 0 to 3 ns.
     5. Run proc1c
        • The signal c gets a AND b (1 AND 0 = '0') from 0 to 3 ns.
     6. Run proc1d
        • The signal d gets NOT c (NOT 0 = ’1’) from 0 to 3 ns.
     7. Run proc2
        • The signal e gets b AND d (0 AND 1 = ’0’) from 0 to 3 ns.
     8. Run the timed process until suspend at wait for 99 ns;, which takes us from 3ns to
        102ns.
     9. Run combinational processes in topological order to calculate values on c, d, e from 3ns to
        102ns.
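Steps 5–9 amount to evaluating the combinational assignments once each, in topological order.
A quick Python check of the values (AND and NOT modelled on 0/1 values):

```python
def eval_comb(a, b):
    c = a & b        # proc1c: c <= a AND b
    d = 1 - c        # proc1d: d <= NOT c
    e = b & d        # proc2:  e <= b AND d
    return c, d, e

assert eval_comb(1, 0) == (0, 1, 0)   # 0 ns to 3 ns:   a = '1', b = '0'
assert eval_comb(1, 1) == (1, 0, 0)   # 3 ns to 102 ns: b <= '1' at 3 ns
```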


Waveforms       . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

                                0 ns – 3 ns    3 ns – 102 ns
                         a           1               1
                         b           0               1
                         c           0               1
                         d           1               0
                         e           0               0

     (Each signal is 'U' before its first assignment at 0 ns.)


  Question:          Draw the RTL waveforms that correspond to the delta-cycle waveform
    below.

   [Delta-cycle waveform for proc1, proc2, proc3 and the signals a–e (the same
   simulation as above): at 0 ns, a and b get '1' and '0', then c gets '0',
   then d gets '1' and e gets '0' in later delta cycles; at 3 ns, b gets '1',
   then c gets '1' and e briefly gets '1', then d gets '0' and e returns to
   '0'; the next simulation round is at 102 ns.]




  Answer:

                                0 ns – 3 ns    3 ns – 102 ns
                         a           1               1
                         b           0               1
                         c           0               1
                         d           1               0
                         e           0               0


      Example: Communicating State Machines           . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..




                   huey: process
                     begin
                       clk <= '0';
                       wait for 10 ns;
                       clk <= '1';
                       wait for 10 ns;
                     end process;

                   dewey: process
                     begin
                       a <= to_unsigned(0,4);
                       wait until re(clk);
                       while (a < 4) loop
                          a <= a + 1;
                          wait until re(clk);
                       end loop;
                     end process;

                   louie: process
                     begin
                       d <= '1';
                       wait until re(clk);
                       if (a >= 2) then
                          d <= '0';
                          wait until re(clk);
                       end if;
                     end process;



               (clk rises at 10, 30, 50, 70, 90, and 110 ns)

               time (ns):   0–10   10–30   30–50   50–70   70–90   90–110   110–120
                       a:    0       1       2       3       4       0        1
                       d:    1       1       1       0       1       0        1
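The cycle-by-cycle behaviour above can be reproduced with Python generators (a hypothetical
model, not a VHDL tool: each yield stands for wait until re(clk); at every rising edge
all processes resume reading the previously committed signal values, and their writes are
committed only after every process has run):

```python
def dewey(read, write):
    while True:                          # a VHDL process body repeats forever
        write('a', 0)
        yield                            # wait until re(clk)
        while read('a') < 4:
            write('a', read('a') + 1)
            yield

def louie(read, write):
    while True:
        write('d', 1)
        yield
        if read('a') >= 2:
            write('d', 0)
            yield

sigs, pending = {'a': None, 'd': None}, {}
read = lambda name: sigs[name]
write = lambda name, value: pending.update({name: value})

procs = [dewey(read, write), louie(read, write)]
for p in procs:                          # time 0: run each body to its first wait
    next(p)
sigs.update(pending); pending.clear()    # a = 0, d = 1 from 0 ns

trace = []
for edge in (10, 30, 50, 70, 90, 110):   # rising edges of clk, from huey
    for p in procs:                      # both resume, reading pre-edge values
        next(p)
    sigs.update(pending); pending.clear()
    trace.append((edge, sigs['a'], sigs['d']))

# matches the waveform: a counts 1,2,3,4 then restarts; d dips at 50 and 90 ns
assert trace == [(10, 1, 1), (30, 2, 1), (50, 3, 0),
                 (70, 4, 1), (90, 0, 0), (110, 1, 1)]
```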


      A Related Simulation     ..................................................................


      Small changes to the code can cause significant changes to the behaviour.
                 riri: process
                   begin
                     clk <= '1';
                     wait for 10 ns;
                     clk <= '0';
                     wait for 10 ns;
                   end process;

                 fifi: process
                   begin
                     a <= to_unsigned(0,4);
                     wait until re(clk);
                     while (a < 4) loop
                        a <= a + 1;
                        wait until re(clk);
                     end loop;
                   end process;

                 loulou: process
                   begin
                     wait until re(clk);
                     d <= '1';
                     if (a < 2) then
                        d <= '0';
                        wait until re(clk);
                     end if;
                   end process;

          [Blank waveform for clk, a, and d from 0 ns to 120 ns, to be filled
          in as an exercise.]


1.8 VHDL and Hardware Building Blocks
This section outlines the building blocks for register transfer level design and how to write VHDL
code for the building blocks.


1.8.1 Basic Building Blocks




   [Schematic symbols for the basic building blocks: a 2:1 mux (also: n-to-1
   muxes); a D flip-flop with reset (R), set (S), and clock enable (CE); a
   single-port memory (WE, A, DI, DO); and a dual-port memory (WE, A0/DI0/DO0,
   A1/DO1).]
              Hardware                              VHDL
  AND, OR, NAND, NOR, XOR, XNOR          and, or, nand, nor, xor, xnor
  multiplexer                            if-then-else, case statement, selected
                                         assignment, conditional assignment
  adder, subtracter, negater             +, - (binary), - (unary)
  shifter, rotater                       sll, srl, sla, sra, rol, ror
  flip-flop                              wait until, if-then-else, rising_edge
  memory array, register file, queue     2-d array or library component
                                      Figure 1.19: RTL Building Blocks


1.8.2 Deprecated Building Blocks for RTL

Some of the common building blocks you have encountered in previous courses should be avoided
when synthesizing register-transfer-level hardware, particularly if FPGAs are the implementation
technology.


1.8.2.1 An Aside on Flip-Flops and Latches
flip-flop Edge sensitive: output only changes on rising (or falling) edge of clock
latch Level sensitive: output changes whenever clock is high (or low)
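The difference can be sketched as next-state functions (a Python sketch with hypothetical
helper names):

```python
def latch_q(en, d, q_prev):
    # level sensitive: transparent while the enable (clock) is high
    return d if en == 1 else q_prev

def flop_q(rising_edge, d, q_prev):
    # edge sensitive: captures d only on a rising edge of the clock
    return d if rising_edge else q_prev

assert latch_q(1, 0, 1) == 0     # latch follows d whenever en is high
assert latch_q(0, 0, 1) == 1     # latch holds when en is low
assert flop_q(False, 0, 1) == 1  # flop holds between clock edges
```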

A common implementation of a flip-flop is a pair of latches (Master/Slave flop).

Latches are sometimes called “transparent latches”, because they are transparent (input directly
connected to output) when the clock is high.

The clock to a latch is sometimes called the “enable” line.

There is more information in the course notes on timing analysis for storage devices (Section 5.2).


1.8.2.2 Deprecated Hardware
Latches
    • Use flops, not latches
    • Latch-based designs are susceptible to timing problems
    • The transparent phase of a latch can let a signal “leak” through a latch — causing the
       signal to affect the output one clock cycle too early
    • It’s possible for a latch-based circuit to simulate correctly, but not work in real hardware,
       because the timing delays on the real hardware don’t match those predicted in synthesis
T, JK, SR, etc. flip-flops
     • Limit yourself to D-type flip-flops
     • Some FPGA and ASIC cell libraries include only D-type flip-flops. Others, such as
        Altera's APEX FPGAs, can be configured as D, T, JK, or SR flip-flops.
Tri-State Buffers
     • Use multiplexers, not tri-state buffers
     • Tri-state designs are susceptible to stability and signal integrity problems
     • Getting tri-state designs to simulate correctly is difficult; some library components don't
        support tri-state signals
     • Tri-state designs rely on the code never letting two signals drive the bus at the same time
     • It can be difficult to check that bus arbitration will always work correctly


       • Manufacturing and environmental variability can make real hardware not work correctly
         even if it simulates correctly
       • Typical industrial practice is to avoid use of tri-state signals on a chip, but allow tri-state
         signals at the board level
                Note:      Unfortunately and surprisingly, PalmChip has been awarded a
                US patent for using uni-directional busses (i.e. multiplexers) for system-
                on-chip designs. The patent was filed in 2000, so all fourth-year design
                projects since 2000 that use muxes on FPGAs will need to pay royalties to
                PalmChip.


1.8.3 Hardware and Code for Flops

1.8.3.1 Flops with Waits and Ifs

The two code fragments below synthesize to identical hardware (flops).
                       If                                                     Wait
     process (clk)                                       process
     begin                                               begin
       if rising_edge(clk) then                            wait until rising_edge(clk);
         q <= d;                                           q <= d;
       end if;                                           end process;
     end process;



1.8.3.2 Flops with Synchronous Reset

The two code fragments below synthesize to identical hardware (flops with synchronous reset).
Notice that the synchronous reset is really nothing more than an AND gate on the input.
                       If                                                     Wait

     process (clk)                                       process
     begin                                               begin
       if rising_edge(clk) then                            wait until rising_edge(clk);
         if (reset = '1') then                             if (reset = '1') then
           q <= '0';                                         q <= '0';
         else                                              else
           q <= d;                                           q <= d;
         end if;                                           end if;
       end if;                                           end process;
     end process;


1.8.3.3 Flops with Chip-Enable

The two code fragments below synthesize to identical hardware (flops with chip-enable lines).
                       If

     process (clk)
     begin
       if rising_edge(clk) then
         if (ce = '1') then
           q <= d;
         end if;
       end if;
     end process;

                       Wait

     process
     begin
       wait until rising_edge(clk);
       if (ce = '1') then
         q <= d;
       end if;
     end process;




1.8.3.4 Flop with Chip-Enable and Mux on Input

The two code fragments below synthesize to identical hardware (flops with chip-enable lines and
muxes on inputs).

                       If

     process (clk)
     begin
       if rising_edge(clk) then
         if (ce = '1') then
           if (sel = '1') then
             q <= d1;
           else
             q <= d0;
           end if;
         end if;
       end if;
     end process;

                       Wait

     process
     begin
       wait until rising_edge(clk);
       if (ce = '1') then
         if (sel = '1') then
           q <= d1;
         else
           q <= d0;
         end if;
       end if;
     end process;


1.8.3.5 Flops with Chip-Enable, Muxes, and Reset

The two code fragments below synthesize to identical hardware (flops with chip-enable lines,
muxes on inputs, and synchronous reset). Notice that the synchronous reset is really nothing
more than a mux, or an AND gate on the input.

           Note:      The specific combination and order of tests is important to guarantee
           that the circuit synthesizes to a flop with a chip enable, as opposed to a level-
           sensitive latch testing the chip enable and/or reset followed by a flop.

           Note:      The chip-enable pin on the flop is connected to both ce and reset.
           If the chip-enable pin was not connected to reset, then the flop would ignore
           reset unless chip-enable was asserted.

                       If

     process (clk)
     begin
       if rising_edge(clk) then
         if (ce = '1' or reset = '1') then
           if (reset = '1') then
             q <= '0';
           elsif (sel = '1') then
             q <= d1;
           else
             q <= d0;
           end if;
         end if;
       end if;
     end process;

                       Wait

     process
     begin
       wait until rising_edge(clk);
       if (ce = '1' or reset = '1') then
         if (reset = '1') then
           q <= '0';
         elsif (sel = '1') then
           q <= d1;
         else
           q <= d0;
         end if;
       end if;
     end process;


1.8.4 An Example Sequential Circuit

There are many ways to write VHDL code that synthesizes to the schematic in figure 1.20. The
major choices are:
     1. Categories of signals
         (a) All signals are outputs of flip-flops or inputs (no combinational signals)
         (b) Signals include both flopped and combinational

     2. Number of flopped signals per process
         (a) All flopped signals in a single process
         (b) Some processes with multiple flopped signals
         (c) Each flopped signal in its own process

     3. Style of flop code


           (a) Flops use if statements
           (b) Flops use wait statements

    Some examples of these different options are shown in figures 1.21–1.24.


    [Schematic: a registered circuit with inputs sel, reset, and clk, internal signal a, and output c]

    entity and_not_reg is
      port (
        reset,
        clk,
        sel    : in  std_logic;
        c      : out std_logic
      );
    end;

    Schematic and entity for examples of different code organizations in Figures 1.21–1.24

                         Figure 1.20: Schematic and entity for and_not_reg



    One Process, Flops, Wait     . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..




    architecture one_proc of and_not_reg is
      signal a : std_logic;
    begin
      process begin
        wait until rising_edge(clk);
        if (reset = ’1’) then
          a <= ’0’;
        elsif (sel = ’1’) then
          a <= NOT a;
        else
          a <= a;
        end if;
        c <= NOT a;
      end process;
    end one_proc;
Figure 1.21: Implementation of Figure 1.20: all signals are flops, all flops in one process, flops use waits


  Two Processes, Flops, Wait     . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..




  architecture two_proc_wait of and_not_reg is
    signal a : std_logic;
  begin
    process begin
      wait until rising_edge(clk);
      if (reset = ’1’) then
        a <= ’0’;
      elsif (sel = ’1’) then
        a <= NOT a;
      else
        a <= a;
      end if;
    end process;
    process begin
      wait until rising_edge(clk);
      c <= NOT a;
    end process;
  end two_proc_wait;
Figure 1.22: Implementation of Figure 1.20: all signals are flops, one flop per process, flops use waits


     Two Processes with If-Then-Else       .......................................................




     architecture two_proc_if of and_not_reg is
       signal a : std_logic;
     begin
       process (clk)
       begin
         if rising_edge(clk) then
           if (reset = ’1’) then
             a <= ’0’;
           elsif (sel = ’1’) then
             a <= NOT a;
           else
             a <= a;
           end if;
         end if;
       end process;
       process (clk)
       begin
         if rising_edge(clk) then
           c <= NOT a;
         end if;
       end process;
     end two_proc_if;
Figure 1.23: Implementation of Figure 1.20: all signals are flops, one flop per process, flops use if-then-else


             Concurrent Statements         ................................................................




             architecture comb of and_not_reg is
               signal a, b, d : std_logic;
             begin
               process (clk) begin
                 if rising_edge(clk) then
                   if (reset = ’1’) then
                     a <= ’0’;
                   else
                     a <= d;
                   end if;
                 end if;
               end process;
               process (clk) begin
                 if rising_edge(clk) then
                   c <= NOT a;
                 end if;
               end process;
               d <=   b when (sel = ’1’) else a;
               b <= NOT a;
             end comb;
Figure 1.24: Implementation of Figure 1.20: flopped and combinational signals, one flop per process, flops use if-then-else



             1.9 Arrays and Vectors
VHDL supports multidimensional arrays over elements of any type. The most common array is an
array of std_logic signals, which has a predefined type: std_logic_vector. Throughout
the rest of this section, we will discuss only std_logic_vector, but the rules apply to arrays
of any type.

             VHDL supports reading from and assigning to slices (aka “discrete subranges”) of vectors. The
rules for working with slices of vectors are listed below and illustrated in figure 1.25.

                  1. The ranges on both sides of the assignment must be the same.
                  2. The direction (downto or to) of each slice must match the direction of the signal declara-
                     tion.
                  3. The direction of the target and expression may be different.




Declarations

----------------------------------------------------
a, b       : in std_logic_vector(15 downto 0);
c, d, e    : out std_logic_vector(15 downto 0);
----------------------------------------------------
ax, bx     : in std_logic_vector(0 to 15);
cx, dx, ex : out std_logic_vector(0 to 15);
----------------------------------------------------
m, n       : in unsigned(15 downto 0);
p, q, r    : out unsigned(15 downto 0);
----------------------------------------------------
w, x       : in signed(15 downto 0);
y, z       : out signed(15 downto 0)
----------------------------------------------------
Legal code

c(3 downto 0)      <=   a(15 downto 12);
cx(0 to   3)       <=   a(15 downto 12);
(e(3), e(4))       <=   bx(12 to 13);
(e(5), e(6))       <=   b(13 downto 12);
Illegal code

d(0 to   3)        <=   a(15   to 12);       -- slice dirs must be same as decl
e(3) & e(2)        <=   b(12   to 13);       -- syntax error on &
p(3 downto 0)      <=   (m +   n)( 3 downto 0); -- syntax error on )(
z(3 downto 0)      <=   m(15   downto 12);   -- types on lhs and rhs must match
                     Figure 1.25: Illustration of Rules for Slices of Vectors



1.10 Arithmetic
VHDL includes all of the common arithmetic and logical operators.

Use the VHDL arithmetic operators and let the synthesis tool choose the better implementation for
you. It is almost impossible for a hand-coded implementation to beat vendor-supplied arithmetic
libraries.

To use the operators, you must choose which arithmetic package you wish to use (section 1.10.1).
The arithmetic operators are overloaded, and you can usually use any mixture of constants and sig-
nals of different types that you need (Section 1.10.3). However, you might need to convert a signal
from one type (e.g. std logic vector) to another type (e.g. integer) (Section 1.10.7).


1.10.1 Arithmetic Packages

Rushton Chapter 7 covers arithmetic packages. Rushton Appendix A.5 has the code listing for the
numeric_std package.

To do arithmetic with signals, use the numeric_std package. This package defines the types
signed and unsigned, which are vectors of std_logic on which you can do signed or
unsigned arithmetic.

numeric_std supersedes earlier arithmetic packages, such as std_logic_arith.

Use only one arithmetic package; otherwise the different definitions will clash and you can get
strange error messages.


1.10.2 Shift and Rotate Operations

Shift and rotate operations are described with three-character acronyms:

                           shift/rotate   left/right   arithmetic/logical

The shift right arithmetic (sra) operation preserves the sign of the operand, by copying the most
significant bit into lower bit positions.
The shift left arithmetic (sla) does the analogous operation, except that the least significant bit is
copied.
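As a sketch, using the numeric_std package (the signal names here are hypothetical): the
function shift_right on a signed operand is an arithmetic shift that replicates the sign bit,
while shift_left fills the vacated low bits with '0'.

     signal w    : signed(7 downto 0);
     signal y, z : signed(7 downto 0);
     ...
     y <= shift_right(w, 2); -- arithmetic: sign bit replicated into the top bits
     z <= shift_left(w, 2);  -- vacated low bits filled with '0'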


1.10.3 Overloading of Arithmetic

The arithmetic operators +, -, and * are overloaded on signed vectors, unsigned vectors, and
integers. Tables 1.1–1.4 show the different combinations of target and source types and widths that
can be used.


                    Table 1.1: Overloading of Arithmetic Operations (+, -)

                         target   src1/2   src2/1
                        unsigned unsigned integer OK
                           —     unsigned signed fails in analysis


In these tables "—" means "don't care". Also, src1/2 and src2/1 mean first or second operand, and
respectively second or first operand. The first line of the table means that either the first operand is
unsigned and the second is an integer, or the second operand is unsigned and the first is an integer.
Or, more concisely: one of the operands is unsigned and the other is an integer.
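Using the declarations from figure 1.25, a few concrete cases from Table 1.1:

     p <= m + 5;      -- OK: unsigned + integer
     q <= m + n;      -- OK: unsigned + unsigned
     -- r <= m + w;   -- fails in analysis: unsigned + signed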


1.10.4 Different Widths and Arithmetic


            Table 1.2: Different Vector Widths and Arithmetic Operations (+, -)

                       target src1/2 src2/1
                      narrow wide      —    fails in elaboration
                        wide narrow    int  fails in elaboration
                        wide   wide    —    OK
                      narrow narrow narrow OK
                      narrow narrow    int  OK


                                  Example vectors
                          wide   unsigned(7 downto 0)
                          narrow unsigned(4 downto 0)



1.10.5 Overloading of Comparisons


            Table 1.3: Overloading of Comparison Operations (=, /=, >=, >, <)

                            src1/2   src2/1
                           unsigned integer OK
                            signed integer OK
                           unsigned signed fails in analysis



1.10.6 Different Widths and Comparisons


      Table 1.4: Different Vector Widths and Comparison Operations (=, /=, >=, >, <)

                                  src1/2 src2/1
                                   wide    —    OK
                                  narrow   —    OK


1.10.7 Type Conversion

The functions unsigned, signed, to_integer, to_unsigned, and to_signed are used
to convert between integers, std_logic vectors, signed vectors, and unsigned vectors.

If you convert between two types of the same width, then no additional hardware will be generated.

The listing below summarizes the types of these functions.

unsigned( val : std_logic_vector )                             return unsigned;
signed( val : std_logic_vector )                               return signed;

to_integer( val : signed )                                     return integer;
to_integer( val : unsigned )                                   return integer;

to_unsigned( val : integer; width : natural)                    return unsigned;

to_signed( val : integer; width : natural)                     return signed;
The most common need to convert between two types arises when using a signal as an index into
an array. To use a signal as an index into an array, you must convert the signal into an integer
using the function to_integer (Figure 1.26).

signal i : unsigned( 3 downto 0);
signal a : std_logic_vector(15 downto 0);
...
... a(i) ...               -- BAD: won’t typecheck
... a( to_integer(i) ) ... -- OK
Avoid converting a signal into an integer and then performing arithmetic on it (or at least take care
when you do). The default size for integers is 32 bits, so when a signal is converted into an integer,
the resulting signals can end up 32 bits wide.
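As a sketch (the signals u and n are hypothetical): keep the arithmetic on the sized types where
possible; if an integer result is unavoidable, size it back explicitly with to_unsigned so the
width stays 8 bits rather than defaulting to 32.

     signal u, n : unsigned(7 downto 0);
     ...
     n <= u + 1;                                 -- preferred: width stays 8 bits
     -- n <= to_unsigned( to_integer(u) + 1, 8 ); -- explicit width if converting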



library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
...
  signal bit_sig : std_logic;
  signal uns_sig : unsigned(7 downto 0);
  signal vec_sig : std_logic_vector(255 downto 0);
  ...
  bit_sig <= vec_sig( to_integer(uns_sig) );
...


                   Figure 1.26: Using an unsigned signal as an index to an array


To convert a std_logic_vector signal into an integer, you must first say whether the signal
should be interpreted as signed or unsigned. As illustrated in figure 1.27, this is done by:


   1. Convert the std_logic_vector signal to signed or unsigned, using the function
      signed or unsigned

   2. Convert the signed or unsigned signal into an integer, using to_integer




library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
...
  signal bit_sig : std_logic;
  signal std_sig : std_logic_vector(7 downto 0);
  signal vec_sig : std_logic_vector(255 downto 0);
  ...
  bit_sig <= vec_sig( to_integer( unsigned( std_sig ) ) );
...


               Figure 1.27: Using a std_logic_vector as an index to an array



1.11 Synthesizable vs Non-Synthesizable Code
Synthesis is done by matching VHDL code against templates or patterns. It's important to use
idioms that your synthesis tool recognizes. If you aren't careful, you could write code that has
the same behaviour as one of the idioms, but which results in inefficient or incorrect hardware.
Section 1.8 described common idioms and the resulting hardware.

Most synthesis tools agree on a large set of idioms, and will reliably generate hardware for these
idioms. This section is based on the idioms that Synopsys, Xilinx, Altera, and Mentor Graphics are
able to synthesize. One exception is that Altera’s Quartus does not support implicit state machines
(as of v5.0).

Section 1.11.1 gives rules for unsynthesizable VHDL code. Section 1.11.2 gives rules for code
that is synthesizable, but violates the ece327 guidelines for good practices. The ece327 coding
guidelines are designed to produce circuits suitable for FPGAs. Bad code for FPGAs produces
circuits with the following features:
• latches
• asynchronous resets
• combinational loops
• multiple drivers for a signal


• tri-state buffers
We limit our definition of bad practice to code that produces undesirable hardware. Coding styles
that lead to inefficient hardware might be useful in the early stages of the design process, when the
focus is on functionality and not optimality. As such, inefficient code is not considered bad prac-
tice. Poor coding styles that do not affect the hardware, for example, including extraneous signals
in a sensitivity list, should certainly be avoided, but fall into the general realm of programming
guidelines and will not be discussed.


1.11.1 Unsynthesizable Code

1.11.1.1 Initial Values

Initial values on signals (UNSYNTHESIZABLE)

signal bad_signal : std_logic := ’0’;

Reason: In most implementation technologies, when a circuit powers up, the values on signals
are completely random. Some FPGAs are an exception to this. For some FPGAs, when a chip is
powered up, all flip flops will be ’0’. For other FPGAs, the initial values can be programmed.


1.11.1.2 Wait For

Wait for length of time (UNSYNTHESIZABLE)

wait for 10 ns;

Reason: Delays through circuits are dependent upon both the circuit and its operating environment,
particularly supply voltage and temperature.


1.11.1.3 Different Wait Conditions

wait statements with different conditions in a process (UNSYNTHESIZABLE)
-- different clock signals                       -- different clock edges
process                                          process
begin                                            begin
  wait until rising_edge(clk1);                    wait until rising_edge(clk);
  x <= a;                                          x <= a;
  wait until rising_edge(clk2);                    wait until falling_edge(clk);
  x <= a;                                          x <= a;
end process;                                     end process;
Reason: processes with multiple wait statements are turned into finite state machines. The wait
statements denote transitions between states. The target signals in the process are outputs of flip


flops. Using different wait conditions would require the flip flops to use different clock signals
at different times. Multiple clock signals for a single flip flop would be difficult to synthesize,
inefficient to build, and fragile to operate.


1.11.1.4 Multiple “if rising edge” in Process

Multiple if rising edge statements in a process (UNSYNTHESIZABLE)

process (clk)
begin
  if rising_edge(clk) then
     q0 <= d0;
  end if;
  if rising_edge(clk) then
     q1 <= d1;
  end if;
end process;
Reason: The idioms for synthesis tools generally expect just a single if rising edge state-
ment in each process. The simpler the VHDL code is, the easier it is to synthesize hardware.
Programmers of synthesis tools make idiomatic restrictions to make their jobs simpler.


1.11.1.5 “if rising edge” and “wait” in Same Process

An if rising edge statement and a wait statement in the same process (UNSYNTHESIZ-
ABLE)

process (clk)
begin
  if rising_edge(clk) then
     q0 <= d0;
  end if;
  wait until rising_edge(clk);
  q0 <= d1;
end process;
Reason: The idioms for synthesis tools generally expect just a single type of flop-generating state-
ment in each process.


1.11.1.6 “if rising edge” with “else” Clause

The if statement has a rising edge condition and an else clause (UNSYNTHESIZABLE).

process (clk)
begin
  if rising_edge(clk) then
    q0 <= d0;
  else
    q0 <= d1;
  end if;
end process;
Reason: Generally, an if-then-else statement synthesizes to a multiplexer. The condition that is
tested in the if-then-else becomes the select signal for the multiplexer. In an if rising edge
with else, the select signal would need to detect a rising edge on clk, which isn’t feasible to
synthesize.


1.11.1.7 “if rising edge” Inside a “for” Loop

An if rising edge statement in a for-loop (UNSYNTHESIZABLE-Synopsys)

     process (clk) begin
       for i in 0 to 7 loop
         if rising_edge(clk) then
           q(i) <= d;
         end if;
       end loop;
     end process;
Reason: just an idiom of the synthesis tool.

Some loop statements are synthesizable (Rushton Section 8.7). For-loops in general are de-
scribed in Ashenden. Examples of for loops in E&CE will appear when describing testbenches for
functional verification (Chapter 4).


Synthesizable Alternative    ..............................................................


A synthesizable alternative to an if rising edge statement in a for-loop is to put the if-rising-
edge outside of the for loop.

  process (clk) begin
    if rising_edge(clk) then
      for i in 0 to 7 loop
        q(i) <= d;
      end loop;
    end if;
  end process;


1.11.1.8 “wait” Inside of a “for loop”

wait statements in a for loop (UNSYNTHESIZABLE)

  process
  begin
    for i in 0 to 7 loop
      wait until rising_edge(clk);
      x <= to_unsigned(i,4);
    end loop;
  end process;
Reason: Unknown. while-loops with the same behaviour are synthesizable.

         Note: Combinational for-loops Combinational for-loops are usually
         synthesizable. They are often used to build a combinational circuit for each
         element of an array.

         Note: Clocked for-loops        Clocked for-loops are not synthesizable,
         but are very useful in simulation, particularly to generate test vectors for test
         benches.


Synthesizable Alternative to Wait-Inside-For         .......................................... .


while loop (synthesizable)

This is the synthesizable alternative to the wait statement in a for loop above.

     process
     begin
       -- output values from 0 to 4 on i
       -- sending one value out each clock cycle
       i <= to_unsigned(0,4);
       wait until rising_edge(clk);
       while (4 > i) loop
         i <= i + 1;
         wait until rising_edge(clk);
       end loop;
     end process;


1.11.2 Synthesizable, but Bad Coding Practices

         Note:     Some of the results in this section are highly dependent upon
         the synthesis tool that you use and the target technology library.


1.11.2.1 Asynchronous Reset

In an asynchronous reset, the test for reset occurs outside of the test for the clock edge.

      process (reset, clk)
      begin
        if (reset = ’1’) then
          q <= ’0’;
        elsif rising_edge(clk) then
          q <= d1;
        end if;
      end process;
Asynchronous resets are bad, because if a reset occurs very close to a clock edge, some parts of
the circuit might be reset in one clock cycle and some in the subsequent clock cycle. This can lead
the circuit to be out of sync as it goes through the reset sequence, potentially causing erroneous
internal state and output values.


1.11.2.2 Combinational “if-then” Without “else”

     process (a, b)
     begin
       if (a = ’1’) then
         c <= b;
       end if;
     end process;
Reason: This code synthesizes c to be a latch, and latches are undesirable.
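One sketch of a fix, keeping the same signals: assign c in an else branch (or give c a default
value before the if), so that every path through the process assigns c and the code synthesizes
to a multiplexer instead of a latch. The default value '0' here is an arbitrary choice.

     process (a, b)
     begin
       if (a = '1') then
         c <= b;
       else
         c <= '0';  -- c is assigned on every path, so no latch
       end if;
     end process;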


1.11.2.3 Bad Form of Nested Ifs

if rising edge statement inside another if (BAD HARDWARE)

In Synopsys, with some target libraries, this design results in a level-sensitive latch whose input is
a flop.

     process (ce, clk)
     begin
       if (ce = ’1’) then
         if rising_edge(clk) then
            q <= d1;
         end if;
       end if;
     end process;


1.11.2.4 Deeply Nested Ifs

Deeply chained if-then-else statements can lead to long chains of dependent gates, rather
than checking different cases in parallel.

               Slow (maybe)

if cond1 then
  stmts1
elsif cond2 then
  stmts2
elsif cond3 then
  stmts3
elsif cond4 then
  stmts4
end if;

               Fast (hopefully)

If only one of the conditions can be true at a time, then try using a case statement or some
other technique that allows the conditions to be evaluated in parallel.
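As a sketch of the parallel alternative (the selector sel and the signals a, b, c, d, and y are
hypothetical; this assumes the four mutually exclusive conditions can be encoded as one
two-bit selector):

     process (sel, a, b, c, d)
     begin
       case sel is
         -- each alternative is checked in parallel, rather than
         -- rippling through a chain of elsif tests
         when "00"   => y <= a;
         when "01"   => y <= b;
         when "10"   => y <= c;
         when others => y <= d;
       end case;
     end process;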


1.11.3 Synthesizable, but Unpredictable Hardware

Some coding styles are synthesizable and might produce desirable hardware with a particular syn-
thesis tool, but either be unsynthesizable or produce undesirable hardware with another tool.
• variables
• level-sensitive wait statements
• missing signals in sens list
If you are using a single synthesis tool for an extended period of time, and want to get the full
power of the tool, then it can be advantageous to write your code in a way that works for your tool,
but might produce undesirable results with other tools.



1.12 Synthesizable VHDL Coding Guidelines
This section gives guidelines for building robust, portable, and synthesizable VHDL code. Porta-
bility is both for different simulation and synthesis tools and for different implementation tech-
nologies.

Remember, there is a world of difference between getting a design to work in simulation and
getting it to work on a real FPGA. And there is also a huge difference between getting a design
to work in an FPGA for a few minutes of testing and getting thousands of products to work for
months at a time in thousands of different environments around the world.

The coding guidelines here are designed both for helping you to get your E&CE 327 project to
work as well as all of the subsequent industrial designs.

Finally, note that there are exceptions to every rule. You might find yourself in a circumstance
where your particular situation (e.g. choice of tool, target technology, etc) would benefit from
bending or breaking a guideline here. Within E&CE 327, of course, there won’t be any such
circumstances.


1.12.1 Signal Declarations
• Use signals, do not use variables
  reason The intention of the creators of VHDL was for signals to be wires and variables to be
       just for simulation. Some synthesis tools allow some uses of variables, but when using
       variables, it is easy to create a design that works in simulation but not in real hardware.
• Use std_logic signals, do not use bit or Boolean
  reason std_logic is the most commonly used signal type across synthesis tools, simulation
       tools, and cell libraries
• Use in or out, do not use inout
  reason inout signals are tri-state.


   note If you have an output signal that you also want to read from, you might be tempted to
       declare the mode of the signal to be inout. A better solution is to create a new, internal,
       signal that you both read from and write to. Then, your output signal can just read from
       the internal signal.
• Declare the primary inputs and outputs of chips as either std_logic or std_logic_vector.
  Do not use signed or unsigned for primary inputs or outputs.
  reason Both the Altera tool Quartus and the Xilinx tool ngd2vhdl convert signed and unsigned
        vectors in entities into std_logic_vectors. If you want your same testbench to work for both
       functional simulation and timing simulation, you must not use signed or unsigned signals
       in the top-level entity of your chip.
  note Signed and unsigned signals are fine inside testbenches, for non-top-level entities, and
       inside architectures. It is only the top-level entity that should not use signed or unsigned
       signals.
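The earlier note about avoiding inout can be sketched as follows (the entity name counter and
signal names here are hypothetical): count_int is an internal signal that is both read and
written, and the output port count only reads from it.

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  entity counter is
    port (
      clk   : in  std_logic;
      count : out std_logic_vector(3 downto 0)
    );
  end;

  architecture main of counter is
    signal count_int : unsigned(3 downto 0);  -- internal signal: read and write
  begin
    process begin
      wait until rising_edge(clk);
      count_int <= count_int + 1;             -- reads and writes the internal signal
    end process;
    count <= std_logic_vector(count_int);     -- output port just reads from it
  end main;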


1.12.2 Flip-Flops and Latches
• Use flops, not latches (see section 1.8.2).
• Use D-flops, not T, JK, etc (see section 1.8.2).
• For every signal in your design, know whether it should be a flip-flop or combinational. Before
  simulating your design, examine the log file e.g. LOG/dc_shell.log to see if the flip
  flops in your circuit match your expectations, and to check that you don’t have any latches in
  your design.
• Do not assign a signal to itself (e.g. a <= a; is bad). If the signal is a flop, use a chip enable
  to cause the signal to hold its value. If the signal is combinational, then assigning a signal to
  itself will cause combinational loops, which are bad.


1.12.3 Inputs and Outputs
• Put flip flops on primary inputs and outputs of a chip
  reason Creates more robust implementations. Signal delays between chips are unpredictable.
       Signal integrity can be a problem (remember transmission lines from E&CE 324?). Putting
       flip flops on inputs and outputs of chip provides clean boundaries between circuits.
  note This only applies to primary inputs and outputs of a chip (the signals in the top-level
       entity). Within a chip, you should adopt a standard of putting flip-flops on either inputs or
       outputs of modules. Within a chip, you do not need to put flip-flops on both inputs and
       outputs.


1.12.4 Multiplexors and Tri-State Signals
• Use multiplexors, not tri-state buffers (see section 1.8.2).
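For example, instead of two tri-state drivers sharing a wire, a 2-to-1 multiplexor selects between the sources (sel, a, b, and z are assumed to be declared std_logic signals):

```vhdl
-- concurrent conditional assignment: a multiplexor, not a shared bus
z <= a when (sel = '1') else b;
```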
80                                                                            CHAPTER 1. VHDL


1.12.5 Processes
• For a combinational process, the sensitivity list should contain all of the signals that are read in
  the process.
   reason Gives consistent results across different tools. Many synthesis tools will implicitly
        include all signals that a process reads in its sensitivity list. This differs from the VHDL
        Standard. A tool that adheres to the standard will introduce latches if not all signals that
        are read from are included in the sensitivity list.
   exception In a clocked process that uses if rising_edge, it is acceptable to have only the
        clock in the sensitivity list.
• For a combinational process, every signal that is assigned to must be assigned to in every
  branch of if-then and case statements.
  reason If a signal is not assigned a value in a path through a combinational process, then that
        signal will be a latch.
   note For a clocked process, if a signal is not assigned a value in a clock cycle, then the flip-flop
        for that signal will have a chip-enable pin. Chip-enable pins are fine; they are available on
        flip-flops in essentially every cell library.
• Each signal should be assigned to in only one process.
  reason Multiple processes driving the same signal is the same as having multiple gates driving
       the same wire. This can cause contention, short circuits, and other bad things.
  exception Multiple drivers are acceptable for tri-state busses or if your implementation tech-
       nology has wired-ANDs or wired-ORs. FPGAs don’t have wired-ANDs or wired-ORs.
• Separate unrelated signals into different processes
  reason Grouping assignments to unrelated signals into a single process can complicate the
       control circuitry for that process. Each branch in a case statement or if-then-else adds a
       multiplexor or chip-enable circuitry.
   reason Synthesis tools generally optimize each process individually; the larger a process is, the
        longer it will take the synthesis program to optimize the process. Also, larger processes
       tend to be more complicated and can cause synthesis programs to miss helpful optimiza-
       tions that they would notice in smaller processes.
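A minimal combinational process that follows the first two guidelines (sel, a, b, and z are assumed to be declared std_logic signals):

```vhdl
-- every signal that is read (sel, a, b) is in the sensitivity list,
-- and z is assigned in every branch, so no latch is generated
process (sel, a, b) begin
  if (sel = '1') then
    z <= a;
  else
    z <= b;
  end if;
end process;
```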


1.12.6 State Machines
• In a state machine, illegal and unreachable states should transition to the reset state.
   reason Creates more robust implementations. In the field, your circuit will be subjected to
        illegal inputs, voltage spikes, temperature fluctuations, clock speed variations, etc. At
        some point in time, something weird will happen that will cause it to jump into an illegal
        state. Having a system reset and reboot is much better than having it generate incorrect
        outputs that aren’t detected.
• If your state machine has fewer than 16 states, use a one-hot encoding.


   reason For n states, a one-hot encoding uses n flip-flops, while a binary encoding uses ⌈log2 n⌉
       flip-flops. One-hot signals are simpler to decode, because only one bit must be checked to
       determine if the circuit is in a particular state. For small values of n, a one-hot signal results
       in a smaller and faster circuit. For large values of n, the number of signals required for a
       one-hot design is too great of a penalty to compensate for the simplicity of the decoding
       circuitry.
   note Using an enumerated type for states allows the synthesis tool to choose state encodings
       that it thinks will work well to balance area and clock speed. Quartus uses a “modified
       one-hot” encoding, where the bit that denotes the reset state is inverted. That is, when the
       reset bit is ’0’, the system is in the reset state and when the reset bit is a ’1’ the system
       is not in the reset state. The other bits have the normal polarity. The result is that when the
       system is in the reset state, all bits are ’0’ and when the system is in a non-reset state, two
       bits are ’1’.
   note Using your own encoding allows you to leverage knowledge about your design that the
       synthesis tool might not be able to deduce.
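A sketch of the recovery idiom with an enumerated state type (the states and transitions here are illustrative):

```vhdl
type state_ty is (s_reset, s_run, s_done);
signal state : state_ty;
...
process (clk) begin
  if rising_edge(clk) then
    case state is
      when s_reset => state <= s_run;
      when s_run   => state <= s_done;
      when s_done  => state <= s_reset;
      when others  => state <= s_reset;  -- any illegal hardware encoding
    end case;                            -- recovers to the reset state
  end if;
end process;
```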


1.12.7 Reset
• Include a reset signal in all clocked circuits.
   reason For most implementation technologies, when you power-up the circuit, you do not
       know what state it will start in. You need a reset signal to get the circuit into a known state.
   reason If something goes wrong while the circuit is running, you need a way to get it into a
       known state.
• For implicit state machines (section 2.5.1.3), check for reset after every wait statement.
  reason Missing a reset check after a wait statement means that your circuit might not notice
       a reset signal, or different signals could reset in different clock cycles, causing your
       circuit to get out of synch.
• Connect reset to the important control signals in the design, such as the state signal. Do not reset
  every flip flop.
   reason Using reset adds area and delay to a circuit. The fewer signals that need reset, the
       faster and smaller your design will be.
   note Connect the reset signal to critical flip-flops, such as the state signal. Datapath signals
        rarely need to be reset. You do not need to reset every signal.
• Use synchronous, not asynchronous, reset.
  reason Creates more robust implementations. Signal propagation delays mean that asyn-
       chronous resets cause different parts of the circuit to be reset at different times. This can
       lead to glitches, which then might cause the circuit to move to an illegal state.
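A sketch combining these guidelines: a synchronous reset that touches only the state signal, while a datapath flop is left unreset (state, next_state, d_in, and d_out are assumed to be declared):

```vhdl
process (clk) begin
  if rising_edge(clk) then
    if (reset = '1') then
      state <= s_reset;   -- synchronous reset of the critical control signal
    else
      state <= next_state;
    end if;
    d_out <= d_in;        -- datapath flip-flop: no reset needed
  end if;
end process;
```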


Covering All Cases    ....................................................................


When writing case statements or selected assignments that test the value of std_logic signals,
you will get an error unless you include a provision for values other than ’1’ and ’0’.

For example:



      signal t : std_logic;
      ...
      case t is
        when ’1’ => ...
        when ’0’ => ...
      end case;


will result in an error message about missing cases. You must provide for t being ’H’, ’U’, etc.
The simplest thing to do is to make the last alternative when others.
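The example above becomes legal once the final alternative is when others:

```vhdl
signal t : std_logic;
...
case t is
  when '1'    => ...
  when '0'    => ...
  when others => ...  -- covers 'U', 'X', 'Z', 'W', 'H', 'L', and '-'
end case;
```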


1.13 VHDL Problems
P1.1 IEEE 1164

For each of the values in the list below, answer whether or not it is defined in the ieee.std_logic_1164
package. If it is part of the package, write a 2–3 word description of the value.

Values: ’-’, ’#’, ’0’, ’1’, ’A’, ’h’, ’H’, ’L’, ’Q’, ’X’, ’Z’.


P1.2 VHDL Syntax

Answer whether each of the VHDL code fragments q2a through q2f is legal VHDL code.


 NOTES: 1)      “...” represents a fragment of legal VHDL code.
        2)      For full marks, if the code is illegal, you must explain why.
        3)      The code has been written so that, if it is illegal, then it is illegal for both
                simulation and synthesis.



q2a architecture main of anchiceratops is
       signal a, b, c : std_logic;
     begin
       process begin
         wait until rising_edge(c);
         a <= if (b = ’1’) then
                ...
              else
                ...
              end if;
       end process;
     end main;


q2b architecture main of tulerpeton is
     begin
       lab: for i in 15 downto 0 loop
         ...
       end loop;
     end main;


q2c architecture main of metaxygnathus is
       signal a : std_logic;
     begin
       lab: if (a = ’1’) generate
         ...
       end generate;
     end main;

q2d architecture main of temnospondyl is
       component compa
         port (
           a : in std_logic;
           b : out std_logic
         );
       end component;
       signal p, q : std_logic;
     begin
       coma_1 : compa
         port map (a => p, b => q);
       ...
     end main;



q2e architecture main of pachyderm is
       function inv(a : std_logic)
         return std_logic is
       begin
         return(NOT a);
       end inv;
       signal p, b : std_logic;
     begin
       p <= inv(b => a);
       ...
     end main;

q2f architecture main of apatosaurus is
       type state_ty is (S0, S1, S2);
       signal st : state_ty;
       signal p : std_logic;
     begin
       case st is
         when S0 | S1 => p <= ’0’;
         when others => p <= ’1’;
       end case;
     end main;


P1.3 Flops, Latches, and Combinational Circuitry

For each of the signals p...z in the architecture main of montevido, answer whether the signal
is a latch, combinational gate, or flip-flop.

entity montevido is
  port (
    a, b0, b1, c0, c1, d0, d1, e0, e1 : in std_logic;
    l : in std_logic_vector (1 downto 0);
    p, q, r, s, t, u, v, w, x, y, z : out std_logic
  );
end montevido;

architecture main of montevido is
  signal i, j : std_logic;
begin
  i <= c0 XOR c1;
  j <= c0 XOR c1;

  process (a, i, j) begin
    if (a = ’1’) then
      p <= i AND j;
    else
      p <= NOT i;
    end if;
  end process;

  process (a, b0, b1) begin
    if rising_edge(a) then
      q <= b0 AND b1;
    end if;
  end process;

  process
    (a, c0, c1, d0, d1, e0, e1)
  begin
    if (a = ’1’) then
      r <= c0 OR c1;
      s <= d0 AND d1;
    else
      r <= e0 XOR e1;
    end if;
  end process;

  process begin
    wait until rising_edge(a);
    t <= b0 XOR b1;
    u <= NOT t;
    v <= NOT x;
  end process;

  process begin
    case l is
      when "00" =>
        wait until rising_edge(a);
        w <= b0 AND b1;
        x <= ’0’;
      when "01" =>
        wait until rising_edge(a);
        w <= ’-’;
        x <= ’1’;
      when "1-" =>
        wait until rising_edge(a);
        w <= c0 XOR c1;
        x <= ’-’;
    end case;
  end process;

  y <= c0 XOR c1;
  z <= x XOR w;
end main;


P1.4 Counting Clock Cycles

This question refers to the VHDL code shown below.

NOTES:
  1. “...” represents a legal fragment of VHDL code
  2. assume all signals are properly declared
  3. the VHDL code is intended to be legal, synthesizable code
  4. all signals are initially ’U’


entity bigckt is
  port (
    a, b : in std_logic;
    c    : out std_logic
  );
end bigckt;

architecture main of bigckt is
begin
  process (a, b)
  begin
    if (a = ’0’) then
      c <= ’0’;
    else
      if (b = ’1’) then
        c <= ’1’;
      else
        c <= ’0’;
      end if;
    end if;
  end process;
end main;

entity tinyckt is
  port (
    clk : in std_logic;
    i   : in std_logic;
    o   : out std_logic
  );
end tinyckt;

architecture main of tinyckt is
  component bigckt ( ... );
  signal ... : std_logic;
begin
  p0 : process begin
    wait until rising_edge(clk);
    p0_a <= i;
    wait until rising_edge(clk);
  end process;

  p1 : process begin
    wait until rising_edge(clk);
    p1_b <= p1_d;
    p1_c <= p1_b;
    p1_d <= s2_k;
  end process;

  p2 : process (p1_c, p3_h, p4_i, clk) begin
    if rising_edge(clk) then
      p2_e <= p3_h;
      p2_f <= p1_c = p4_i;
    end if;
  end process;

  p3 : process (i, s4_m) begin
    p3_g <= i;
    p3_h <= s4_m;
  end process;

  p4 : process (clk, i) begin
    if (clk = ’1’) then
      p4_i <= i;
    else
      p4_i <= ’0’;
    end if;
  end process;

  huge : bigckt
    port map (a => p2_e, b => p1_d, c => h_y);
  s1_j <= s3_l;
  s2_k <= p1_b XOR i;
  s3_l <= p2_f;
  s4_m <= p2_f;
end main;


For each of the pairs of signals below, what is the minimum length of time between when a change
occurs on the source signal and when that change affects the destination signal?


 src     dst    Num clock cycles
 i       p0_a
 i       p1_b
 i       p1_b
 i       p1_c
 i       p2_e
 i       p3_g
 i       p4_i
 s4_m    h_y
 p1_b    p1_d
 p2_f    s1_j
 p2_f    s2_k


P1.5 Arithmetic Overflow

Implement a circuit to detect overflow in 8-bit signed addition.

An overflow in addition happens when the carry into the most significant bit is different from the
carry out of the most significant bit.

When performing addition, for overflow to happen, both operands must have the same sign. Pos-
itive overflow occurs when adding two positive operands results in a negative sum. Negative
overflow occurs when adding two negative operands results in a positive sum.
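One possible solution, using the sign-bit formulation from the paragraph above (the entity name ovf8 is illustrative):

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity ovf8 is
  port (
    a, b     : in  signed(7 downto 0);
    sum      : out signed(7 downto 0);
    overflow : out std_logic
  );
end ovf8;

architecture main of ovf8 is
  signal s : signed(7 downto 0);
begin
  s   <= a + b;
  sum <= s;
  -- overflow: both operands have the same sign and the sum's sign differs
  overflow <= (    a(7) and     b(7) and not s(7))   -- negative overflow
           or (not a(7) and not b(7) and     s(7));  -- positive overflow
end main;
```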


P1.6 Delta-Cycle Simulation: Pong

Perform a delta-cycle simulation of the following VHDL code by drawing a waveform diagram.
INSTRUCTIONS:

  1. The simulation is to be done at the granularity of simulation-steps.
  2. Show all changes to process modes and signal values.
  3. Each column of the timing diagram corresponds to a simulation step that changes a signal or
     process.
  4. Clearly show the beginning and end of each simulation cycle, delta cycle, and simulation
     round by writing in the appropriate row a B at the beginning and an E at the end of the cycle
     or round.
  5. End your simulation just before 20 ns.

architecture main of pong_machine is
  signal ping_i, ping_n, pong_i, pong_n : std_logic;
begin
  reset_proc: process
  begin
    reset <= ’1’;
    wait for 10 ns;
    reset <= ’0’;
    wait for 100 ns;
  end process;

  clk_proc: process
  begin
    clk <= ’0’;
    wait for 10 ns;
    clk <= ’1’;
    wait for 10 ns;
  end process;

  next_proc: process (clk)
  begin
    if rising_edge(clk) then
      ping_n <= ping_i;
      pong_n <= pong_i;
    end if;
  end process;

  comb_proc: process (pong_n, ping_n, reset)
  begin
    if (reset = ’1’) then
      ping_i <= ’1’;
      pong_i <= ’0’;
    else
      ping_i <= pong_n;
      pong_i <= ping_n;
    end if;
  end process;

end main;


P1.7 Delta-Cycle Simulation: Baku

Perform a delta-cycle simulation of the following VHDL code by drawing a waveform diagram.
INSTRUCTIONS:


     1. The simulation is to be done at the granularity of simulation-steps.
     2. Show all changes to process modes and signal values.
     3. Each column of the timing diagram corresponds to a simulation step.
     4. Clearly show the beginning and end of each simulation cycle, delta cycle, and simulation
        round by writing in the appropriate row a B at the beginning and an E at the end of the cycle
        or round.
     5. Write “t=5ns” and “t=10ns” at the top of columns where time advances to 5 ns and 10 ns.
     6. Begin your simulation at 5 ns (i.e. after the initial simulation cycles that initialize the signals
        have completed).
     7. End your simulation just before 15 ns;

entity baku is
  port (
    clk, a, b : in std_logic;
    f         : out std_logic
  );
end baku;

architecture main of baku is
  signal c, d, e : std_logic;
begin
  proc_clk: process
  begin
    clk <= ’0’;
    wait for 10 ns;
    clk <= ’1’;
    wait for 10 ns;
  end process;

  proc_extern : process
  begin
    a <= ’0’;
    b <= ’0’;
    wait for 5 ns;
    a <= ’1’;
    b <= ’1’;
    wait for 15 ns;
  end process;

  proc_1 : process (a, b, c)
  begin
    c <= a and b;
    d <= a xor c;
  end process;

  proc_2 : process
  begin
    e <= d;
    wait until rising_edge(clk);
  end process;

  proc_3 : process (c, e) begin
    f <= c xor e;
  end process;
end main;


 P1.8 Clock-Cycle Simulation

 Given the VHDL code for anapurna and waveform diagram below, answer what the values of
 the signals y, z, and p will be at the given times.

 entity anapurna is
   port (
     clk, reset, sel : in std_logic;
     a, b : in unsigned(15 downto 0);
     p    : out unsigned(15 downto 0)
   );
 end anapurna;

 architecture main of anapurna is
   type state_ty is (mango, guava, durian, papaya);
   signal y, z : unsigned(15 downto 0);
   signal state : state_ty;
 begin


proc_herzog: process
begin
  top_loop: loop
    wait until (rising_edge(clk));
    next top_loop when (reset = ’1’);
    state <= durian;
    wait until (rising_edge(clk));
    state <= papaya;
    while y < z loop
      wait until (rising_edge(clk));
      if sel = ’1’ then
        wait until (rising_edge(clk));
        next top_loop when (reset = ’1’);
        state <= mango;
      end if;
      state <= papaya;
    end loop;
  end loop;
end process;

proc_hillary: process (clk)
begin
  if rising_edge(clk) then
    if (state = durian) then
      z <= a;
    else
      z <= z + 2;
    end if;
  end if;
end process;

y <= b;
p <= y + z;
end main;


P1.9 VHDL — VHDL Behavioural Comparison: Teradactyl

For each of the VHDL architectures q3a through q3c, does the signal v have the same behaviour
as it does in the main architecture of teradactyl?


     NOTES: 1)   For full marks, if the code has different behaviour, you must explain
                 why.
            2)   Ignore any differences in behaviour in the first few clock cycles that are
                 caused by initialization of flip-flops, latches, and registers.
            3)   All code fragments in this question are legal, synthesizable VHDL code.

entity teradactyl is                              architecture q3a of teradactyl is
  port (                                            signal b, c, d : std_logic;
    a : in std_logic;                             begin
    v : out std_logic                               b <= a;
  );                                                c <= b;
end teradactyl;                                     d <= c;
architecture main of teradactyl is                  v <= d;
  signal m : std_logic;                           end q3a;
begin
  m <= a;
  v <= m;
end main;


architecture q3b of teradactyl is                 architecture q3c of teradactyl is
  signal m : std_logic;                             signal m : std_logic;
begin                                             begin
  process (a, m) begin                              process (a) begin
    v <= m;                                           m <= a;
    m <= a;                                         end process;
  end process;                                      process (m) begin
end q3b;                                              v <= m;
                                                    end process;
                                                  end q3c;


P1.10 VHDL — VHDL Behavioural Comparison: Ichthyostega

For each of the VHDL architectures q4a through q4c, does the signal v have the same behaviour
as it does in the main architecture of ichthyostega?


  NOTES: 1)      For full marks, if the code has different behaviour, you must explain
                 why.
            2)   Ignore any differences in behaviour in the first few clock cycles that are
                 caused by initialization of flip-flops, latches, and registers.
            3)   All code fragments in this question are legal, synthesizable VHDL code.

entity ichthyostega is
  port (
    clk : in std_logic;
    b, c : in signed(3 downto 0);
    v    : out signed(3 downto 0)
  );
end ichthyostega;

architecture main of ichthyostega is
  signal bx, cx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx <= b;
    cx <= c;
  end process;
  process begin
    wait until (rising_edge(clk));
    if (cx > 0) then
      v <= bx;
    else
      v <= to_signed(-1, 4);
    end if;
  end process;
end main;

architecture q4a of ichthyostega is
  signal bx, cx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx <= b;
    cx <= c;
  end process;
  process begin
    if (cx > 0) then
      wait until (rising_edge(clk));
      v <= bx;
    else
      wait until (rising_edge(clk));
      v <= to_signed(-1, 4);
    end if;
  end process;
end q4a;



architecture q4b of ichthyostega is
  signal bx, cx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx <= b;
    cx <= c;
    wait until (rising_edge(clk));
    if (cx > 0) then
      v <= bx;
    else
      v <= to_signed(-1, 4);
    end if;
  end process;
end q4b;

architecture q4c of ichthyostega is
  signal bx, cx, dx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx <= b;
    cx <= c;
  end process;
  process begin
    wait until (rising_edge(clk));
    v <= dx;
  end process;
  dx <=   bx when (cx > 0)
     else to_signed(-1, 4);
end q4c;


P1.11 Waveform — VHDL Behavioural Comparison

Answer whether each of the VHDL code fragments q3a through q3f has the same behaviour as
the timing diagram.


 NOTES: 1)     “Same behaviour” means that the signals a, b, and c have the same values at
               the end of each clock cycle in steady-state simulation (ignore any irregularities
               in the first few clock cycles).
          2)   For full marks, if the code does not match, you must explain why.
          3)   Assume that all signals, constants, variables, types, etc are properly defined
               and declared.
          4)   All of the code fragments are legal, synthesizable VHDL code.

[Timing diagram: waveforms for clk, a, b, and c]


q3a
architecture q3a of q3 is
begin
  process begin
    a <= ’1’;
    loop
      wait until rising_edge(clk);
      a <= NOT a;
    end loop;
  end process;
  b <= NOT a;
  c <= NOT b;
end q3a;

q3b
architecture q3b of q3 is
begin
  process begin
    b <= ’0’;
    a <= ’1’;
    wait until rising_edge(clk);
    a <= b;
    b <= a;
    wait until rising_edge(clk);
  end process;
  c <= a;
end q3b;


q3c
architecture q3c of q3 is
begin
  process begin
    a <= ’0’;
    b <= ’1’;
    wait until rising_edge(clk);
    b <= a;
    a <= b;
    wait until rising_edge(clk);
  end process;
  c <= NOT b;
end q3c;

q3d
architecture q3d of q3 is
begin
  process (b, clk) begin
    a <= NOT b;
  end process;
  process (a, clk) begin
    b <= NOT a;
  end process;
  c <= NOT b;
end q3d;

q3e
architecture q3e of q3 is
begin
  process
  begin
    b <= ’0’;
    a <= ’1’;
    wait until rising_edge(clk);
    a <= c;
    b <= a;
    wait until rising_edge(clk);
  end process;
  c <= not b;
end q3e;

q3f
architecture q3f of q3 is
begin
  process begin
    a <= ’1’;
    b <= ’0’;
    c <= ’1’;
    wait until rising_edge(clk);
    a <= c;
    b <= a;
    c <= NOT b;
    wait until rising_edge(clk);
  end process;
end q3f;


P1.12 Hardware — VHDL Comparison

For each of the circuits q2a–q2d, answer       entity q2 is
whether the signal d has the same behaviour      port (
as it does in the main architecture of q2.         a, clk, reset : in std_logic;
                                                   d             : out std_logic
                                                 );
                                               end q2;
                                               architecture main of q2 is
                                                 signal b, c : std_logic;
                                               begin
                                                 b <=   ’0’ when (reset = ’1’)
                                                   else a;
                                                 process (clk) begin
                                                   if rising_edge(clk) then
                                                     c <= b;
                                                     d <= c;
                                                   end if;
                                                 end process;
                                               end main;

[Figure: four circuit diagrams, labelled q2a, q2b, q2c, and q2d. Each is built from
the signals a, reset, and clk, a constant ’0’, and drives the output d.]


P1.13 8-Bit Register

Implement an 8-bit register that has:
• clock signal clk
• input data vector d
• output data vector q
• synchronous active-high input reset
• synchronous active-high input enable
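One possible answer is sketched below (the entity name reg8 is illustrative); reset takes priority over enable:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity reg8 is
  port (
    clk, reset, enable : in  std_logic;
    d                  : in  std_logic_vector(7 downto 0);
    q                  : out std_logic_vector(7 downto 0)
  );
end reg8;

architecture main of reg8 is
begin
  process (clk) begin
    if rising_edge(clk) then
      if (reset = '1') then          -- synchronous, active-high reset
        q <= (others => '0');
      elsif (enable = '1') then      -- synchronous, active-high enable
        q <= d;
      end if;                        -- otherwise the register holds
    end if;
  end process;
end main;
```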


P1.13.1 Asynchronous Reset

Modify your design so that the reset signal is asynchronous, rather than synchronous.


P1.13.2 Discussion

Describe the tradeoffs in using synchronous versus asynchronous reset in a circuit implemented on
an FPGA.


P1.13.3 Testbench for Register

Write a test bench to validate the functionality of the 8-bit register with synchronous reset.


P1.14 Synthesizable VHDL and Hardware

For each of the fragments of VHDL q4a...q4f, answer whether the code is synthesizable. If the
code is synthesizable, draw the circuit most likely to be generated by synthesizing the datapath of
the code. If the code is not synthesizable, explain why.


    process begin
      wait until rising_edge(a);
      e <= d;
q4a
      wait until rising_edge(b);
      e <= NOT d;
    end process;


    process begin
      while (c /= ’1’) loop
        if (b = ’1’) then
          wait until rising_edge(a);
          e <= d;
q4b     else
          e <= NOT d;
        end if;
      end loop;
      e <= b;
    end process;


   process (a, d) begin
     e <= d;
   end process;
   process (a, e) begin
q4c if rising_edge(a) then
       f <= NOT e;
     end if;
   end process;



q4d

    process (a) begin
      if rising_edge(a) then
        if b = '1' then
          e <= '0';
        else
          e <= d;
        end if;
      end if;
    end process;

q4e

    process (a,b,c,d) begin
      if rising_edge(a) then
        e <= c;
      else
        if (b = '1') then
          e <= d;
        end if;
      end if;
    end process;


q4f

    process (a,b,c) begin
      if (b = '1') then
        e <= '0';
      else
        if rising_edge(a) then
          e <= c;
        end if;
      end if;
    end process;


P1.15 Datapath Design

Each of the three VHDL fragments q4a–q4c is intended to be the datapath for the same circuit.
The circuit is intended to perform the following sequence of operations (not all operations are
required to use a clock cycle):
• read in source and destination addresses from i_src1, i_src2, i_dst
• read operands op1 and op2 from memory
• compute sum of operands sum
• write sum to memory at destination address dst
• write sum to output o_result

[Figure: circuit block with inputs clk, i_src1, i_src2, i_dst and output o_result]


P1.15.1 Correct Implementation?

For each of the three fragments of VHDL q4a–q4c, answer whether it is a correct implementation
of the datapath. If the datapath is not correct, explain why. If the datapath is correct, answer in
which cycle you need load = '1'.

NOTES:
  1. You may choose the number of clock cycles required to execute the sequence of operations.
   2. The cycle in which the addresses are on i_src1, i_src2, and i_dst is cycle #0.
   3. The control circuitry that controls the datapath will output a signal load, which will be '1'
      when the sum is to be written into memory.
   4. The code fragment with the signal declarations, connections for inputs and outputs, and the
      instantiation of memory is to be used for all three code fragments q4a–q4c.
   5. The memory has registered inputs and combinational (unregistered) outputs.
   6. All of the VHDL is legal, synthesizable code.


  -- This code is to be used for
  -- all three code fragments q4a--q4c.
  signal state : std_logic_vector(3 downto 0);
  signal src1, src2, dst, op1, op2, sum,
         mem_in_a, mem_out_a, mem_out_b,
         mem_addr_a, mem_addr_b
         : unsigned(7 downto 0);
  ...
  process (clk)
  begin
    if rising_edge(clk) then
      src1     <= i_src1;
      src2     <= i_src2;
      dst      <= i_dst;
      o_result <= sum;
    end if;
  end process;
  mem : ram256x16d
    port map (clk      => clk,
              i_addr_a => mem_addr_a,
              i_addr_b => mem_addr_b,
              i_we_a   => mem_we,
              i_data_a => mem_in_a,
              o_data_a => mem_out_a,
              o_data_b => mem_out_b);


q4a

  op1        <= mem_out_a when state = "0010"
           else (others => '0');
  op2        <= mem_out_b when state = "0010"
           else (others => '0');
  sum        <= op1 + op2 when state = "0100"
           else (others => '0');
  mem_in_a   <= sum when state = "1000"
           else (others => '0');
  mem_addr_a <= dst when state = "1000"
           else src1;
  mem_we     <= '1' when state = "1000"
           else '0';
  mem_addr_b <= src2;
  process (clk)
  begin
    if rising_edge(clk) then
      if (load = '1') then
        state <= "1000";
      else
        -- rotate state vector one bit to left
        state <= state(2 downto 0) & state(3);
      end if;
    end if;
  end process;
q4b

  process (clk) begin
    if rising_edge(clk) then
      op1 <= mem_out_a;
      op2 <= mem_out_b;
    end if;
  end process;
  sum        <= op1 + op2;
  mem_in_a   <= sum;
  mem_we     <= load;
  mem_addr_a <= dst when load = '1'
           else src1;
  mem_addr_b <= src2;


q4c

  process
  begin
    wait until rising_edge(clk);
    op1      <= mem_out_a;
    op2      <= mem_out_b;
    sum      <= op1 + op2;
    mem_in_a <= sum;
  end process;
  process (load, dst, src1) begin
    if load = '1' then
      mem_addr_a <= dst;
    else
      mem_addr_a <= src1;
    end if;
  end process;
  mem_addr_b <= src2;


P1.15.2 Smallest Area

Of all of the circuits (q4a–q4c), including both correct and incorrect circuits, predict which will
have the smallest area.

If you don’t have sufficient information to predict the relative areas, explain what additional infor-
mation you would need to predict the area prior to synthesizing the designs.


P1.15.3 Shortest Clock Period

Of all of the circuits (q4a–q4c), including both correct and incorrect circuits, predict which will
have the shortest clock period.

If you don’t have sufficient information to predict the relative periods, explain what additional
information you would need to predict the period prior to performing any synthesis or timing
analysis of the designs.
Chapter 2

RTL Design with VHDL: From
Requirements to Optimized Code

2.1 Prelude to Chapter
2.1.1 A Note on EDA for FPGAs and ASICs

The following is from John Cooley’s column The Industry Gadfly from 2003/04/30. The title of
this article is: “The FPGA EDA Slums”.


     For 2001, Dataquest reported that the ASIC market was US$16.6 billion while the
     FPGA market was US$2.6 billion.

     What’s more interesting is that the 2001 ASIC EDA market was US$2.2 billion while
     the FPGA EDA market was US$91.1 million. Nope, that’s not a mistake. It’s ASIC
     EDA and billion versus FPGA EDA and million. Do the math and you’ll see that for
     every dollar spent on an ASIC project, roughly 12 cents of it goes to an EDA vendor.
     For every dollar spent on a FPGA project, roughly 3.4 cents goes to an EDA vendor.
     Not good.

     It’s the old free milk and a cow story according to Gary Smith, the Senior EDA
     Analyst at Dataquest. “Altera and Xilinx have fouled their own nest. Their free tools
     spoil the FPGA EDA market,” says Gary. “EDA vendors know that there’s no money
     to be made in FPGA tools.”






2.2 FPGA Background and Coding Guidelines
2.2.1 Generic FPGA Hardware

2.2.1.1 Generic FPGA Cell

 “Cell”   =   “Logic Element” (LE) in Altera
          =   “Configurable Logic Block” (CLB) in Xilinx

[Figure: generic FPGA cell. Inputs: carry_in, comb_data_in, flop_data_in, ctrl_in. A
combinational lookup table (comb) drives comb_data_out and the D input of a flip-flop with
clock enable (CE), reset (R), and set (S); the flip-flop drives flop_data_out. Output:
carry_out.]

2.2.2 Area Estimation

To estimate the number of FPGA cells that will be required to implement a circuit, recall that an
FPGA lookup-table can implement any function with up to four inputs and one output.

We will describe two methods to estimate the area (number of FPGA cells) required to implement
a gate-level circuit:


   1. Rough estimate based simply upon the number of flip-flops and primary inputs that are in
      the fanin of each flip-flop.

   2. A more accurate estimate, based upon greedily including as many gates as possible into each
      FPGA cell.


Allocating gates to FPGA cells is a form of technology mapping: moving from the implementation
technology of generic gates to the implementation technology of FPGA cells.

As with almost all other design tasks, allocating gates to cells is an NP-complete problem: the only
way to ensure that we get the smallest design possible is to try all possible designs. To deal with
NP-complete problems, design tools use heuristics or search techniques to explore efficiently a
subset of the options and hopefully produce a design that is close to the absolute smallest. Because


different synthesis tools use different heuristics and search algorithms, different tools will give
different results.

The circuitry for any flip-flop signal with up to four source flip-flops can be implemented on a
single FPGA cell. If a flip-flop signal is dependent upon five source flip-flops, then two FPGA
cells are required.

 Source flops/inputs Minimum cells
         1               1
         2               1
         3               1
         4               1
         5               2
         6               2
         7               2
         8               3
         9               3
         10              3
         11              4

For a single target signal, this technique gives a lower bound on the number of cells needed. For
example, some functions of seven inputs require more than two cells. As a particular example, a
four-to-one multiplexer has six inputs and requires three cells.

When dealing with multiple target signals, this technique might be an overestimate, because a
single cell can drive several other cells (common subexpression elimination).
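
The pattern in the table can be captured in a small formula. The first cell absorbs up to 4 source signals; each additional cell chains the previous cell's output into one of its four inputs, leaving room for 3 new sources, so c cells cover at most 3c + 1 sources:

```latex
\mathit{cells}(n) =
  \begin{cases}
    1 & n \le 4 \\[4pt]
    \left\lceil \dfrac{n-1}{3} \right\rceil & n > 4
  \end{cases}
```

For example, n = 11 gives ceil(10/3) = 4 cells, matching the last row of the table.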


PLA and Flop for Different Functions       ..................................................

[Figure: generic FPGA cell diagram, as in Section 2.2.1.1]


PLA and Flop for Same Function   ..................................................... .

[Figure: generic FPGA cell diagram, as in Section 2.2.1.1]


PLA and Flop for Same Function   ......................................................

[Figure: generic FPGA cell diagram, as in Section 2.2.1.1]


Estimate Area for Circuit        ..............................................................


To have a more accurate estimate of the area of a circuit, we begin with each flip-flop and output,
then traverse backward through the fanin gathering as much combinational circuitry as possible
into the FPGA cell. Usually, this means that we continue as long as we have four or fewer inputs
to the cell. However, when traversing through some circuits, we will temporarily have five or
more signals as input; then, further back in the fanin, the circuit will collapse back to fewer than
five signals.

Once we can no longer include more circuitry into an FPGA cell, we start with a fresh FPGA cell
and continue to traverse backward through the fanin.

Many signals have more than one target, so many FPGA cells will be connected to multiple des-
tinations. When choosing whether to include a gate in an FPGA cell, consider whether the gate
drives multiple targets. There are two options: include the gate in an FPGA cell that drives both
targets, or duplicate the gate and incorporate it into two FPGA cells. The choice of which option
will lead to the smaller circuit is dependent on the details of the design.


    Question:    Map the combinational circuits below onto generic FPGA cells.

[Figure: combinational circuits with inputs a, b, c, d, internal signals e, f, g, h, i, and
target signals z, y, x, w, each to be mapped onto generic FPGA cells]


2.2.2.1 Interconnect for Generic FPGA

         Note:    In these slides, the space between tightly grouped wires sometimes
         disappears, making a group of wires appear to be a single large wire.




There are two types of wires that connect a cell to the rest of the chip:
• General purpose interconnect (configurable, slow)
• Carry chains and cascade chains (vertically adjacent cells, fast)


2.2.2.2 Blocks of Cells for Generic FPGA

Cells are organized into blocks. There is a great deal of interconnect (wires) between cells within
a single block. In large FPGAs, the blocks are organized into larger blocks. These large blocks
might themselves be organized into even larger blocks. Think of an FPGA as a bunch of nested
for-generate statements that replicate a single component (cell) hundreds of thousands of
times.




Cells not used for computation can be used as “wires” to shorten the length of a path between cells.


2.2.2.3 Clocks for Generic FPGAs

Characteristics of clock signals:
• High fanout (drive many gates)
• Long wires (destination gates scattered all over chip)
Characteristics of FPGAs:
• Very few gates that are large (strong) enough to support a high fanout.
• Very few wires that traverse entire chip and can be connected to every flip-flop.


2.2.2.4 Special Circuitry in FPGAs

Memory       ............................................................................. .


For more than five years, FPGAs have had special circuits for RAM and ROM. In Altera FPGAs,
these circuits are called ESBs (Embedded System Blocks). These special circuits are possible
because many FPGAs are fabricated on the same processes as SRAM chips. So, the FPGAs simply
contain small chunks of SRAM.


Microprocessors      . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..


A new feature to appear in FPGAs in 2001 and 2002 is hardwired microprocessors on the same
chip as programmable hardware.

                                              Hard                                                           Soft
      Altera                         Arm 922T with 200 MIPs                                           Nios with ?? MIPs
      Xilinx: Virtex-II Pro        Power PC 405 with 420 D-MIPs                                   Microblaze with 100 D-MIPs

The Xilinx Virtex-II Pro has 4 PowerPCs and enough programmable hardware to implement the
first-generation Intel Pentium microprocessor.


Arithmetic Circuitry          ................................................................. .


A new feature to appear in FPGAs in 2001 and 2002 is hardwired circuits for multipliers and
adders.

                                 Altera: Mercury                           16 × 16 at 130MHz
                                 Xilinx: Virtex-II Pro                     18 × 18 at ???MHz

Using these resources can significantly improve both the area and performance of a design.


Input / Output     ....................................................................... .


Recently, high-end FPGAs have started to include special circuits to increase the bandwidth of
communication with the outside world.

                                  Altera: True-LVDS (1 Gbps)
                                  Xilinx: Rocket I/O (3 Gbps)


2.2.3 Generic-FPGA Coding Guidelines
• Flip-flops are almost free in FPGAs
  reason In FPGAs, the area consumed by a design is usually determined by the amount of
       combinational circuitry, not by the number of flip-flops.
• Aim for using 80–90% of the cells on a chip.
  reason If you use more than 90% of the cells on a chip, then the place-and-route program
      might not be able to route the wires to connect the cells.
  reason If you use less than 80% of the cells, then probably:
          there are optimizations that will increase performance and still allow the design to fit
           on the chip;
       or you spent too much human effort on optimizing for low area;
       or you could use a smaller (cheaper!) chip.
   exception In E&CE 327 (unlike in real life), the mark is based on the actual number of cells
       used.
• Use just one clock signal
  reason If all flip-flops use the same clock, then the clock does not impose any constraints on
       where the place-and-route tool puts flip-flops and gates. If different flip-flops used different
       clocks, then flip-flops that are near each other would probably be required to use the same
       clock.
• Use only one edge of the clock signal
  reason There are two ways to use both rising and falling edges of a clock signal: have rising-
       edge and falling-edge flip flops, or have two different clock signals that are inverses of
       each other. Most FPGAs have only rising-edge flip flops. Thus, using both edges of a
       clock signal is equivalent to having two different clock signals, which is deprecated by the
       preceding guideline.


2.3 Design Flow
2.3.1 Generic Design Flow

Most people agree on the general terminology and process for a digital hardware design flow.
However, each book and course has its own particular way of presenting the ideas. Here we will
lay out the consistent set of definitions that we will use in E&CE 327. This might be different from
what you have seen in other courses or on a work term. Focus on the ideas and you will be fine
both now and in the future.

The design flow presented here focuses on the artifacts that we work with, rather than the opera-
tions that are performed on the artifacts. This is because the same operations can be performed at
different points in the design flow, while the artifacts each have a unique purpose.


                 Requirements
                      |
                 Algorithm          <-> Modify/Analyze
                      |
                 High-Level Model   <-> Modify/Analyze
                      |   (dp/ctrl specific)
                 DP+Ctrl Code       <-> Modify/Analyze
                      |
                 Opt. RTL Code      <-> Modify/Analyze
                      |
                 Implementation     <-> Modify/Analyze
                      |
                 Hardware

                                  Figure 2.1: Generic Design Flow



                            Table 2.1: Artifacts in the Design Flow

    Requirements            Description of what the customer wants
    Algorithm               Functional description of computation. Probably not
                            synthesizable. Could be a flowchart, software, diagram,
                            mathematical equation, etc.
    High-Level Model        HDL code that is not necessarily synthesizable, but
                            divides the algorithm into signals and clock cycles.
                            Possibly mixes datapath and control. In VHDL, could be
                            a single process that captures the behaviour of the
                            algorithm. Usually synthesizable; resulting hardware is
                            usually big and slow compared to optimized RTL code.
    Dataflow Diagram        A picture that depicts the datapath computation over
                            time, clock-cycle by clock-cycle (Section 2.6)
    Hardware Block Diagram  A picture that depicts the structure of the datapath:
                            the components and the connections between the
                            components (e.g., netlist or schematic)
    State Machine           A picture that depicts the behaviour of the control
                            circuitry over time (Section 2.5)
    DP+Ctrl RTL Code        Synthesizable HDL code that separates the datapath and
                            control into separate processes and assignments.
    Optimized RTL Code      HDL code that has been written to meet design goals
                            (high performance, low power, small area, etc.)
    Implementation Code     A collection of files that include all of the
                            information needed to build the circuit: HDL code
                            targeted for a particular implementation technology
                            (e.g. a specific FPGA chip), constraint files, script
                            files, etc.


         Note: Recommendation Spend the time up front to plan a good design on
         paper. Use dataflow diagrams and state machines to predict performance and
         area. The E&CE 327 project might appear to be sufficiently small and simple
         that you can go straight to RTL code. However, you will probably produce
         a better design with less effort if you explore high-level optimizations
         with dataflow diagrams and state machines.


2.3.2 Implementation Flows

Synopsys Design Compiler and FPGA Compiler are general-purpose synthesis programs. They
have very few, if any, technology-specific algorithms. Instead, they rely on libraries to describe
technology-specific parameters of the primitive building blocks (e.g. the delay and area of individ-
ual gates, PLAs, CLBs, flops, memory arrays).


Mentor Graphics’ product Leonardo Spectrum, Cadence’s product BuildGates, and Synplicity’s
product Synplify are similar. In comparison, Avant! (now owned by Synopsys) and Cadence sell
separate tools that do place-and-route and other low-level (physical design) tasks.

These general-purpose synthesis tools do not (generally) do the final stages of the design, such as
place-and-route and timing analysis, which are very specific to a given implementation technology.
The implementation-technology-specific tools generally also produce a VHDL file that accurately
models the chip. We will refer to this file as the “implementation VHDL code”.

With Synopsys and the Altera tool Quartus, we compile the VHDL code into an EDIF file for
the netlist and a TCL file for the commands to Quartus. Quartus then generates a sof (SRAM
Object File), which can be downloaded to an Altera SRAM-based FPGA. The extension of the
implementation VHDL file is often .vho, for “VHDL output”.

With the Synopsys and Xilinx tools, we compile VHDL code into a Xilinx-specific design file
(xnf — Xilinx netlist file). We then use the Xilinx tools to generate a bit file, which can be
downloaded to a Xilinx FPGA. The name of the implementation VHDL file is often suffixed with
routed.vhd.


Terminology: “Behavioural” and “Structural”         . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..


         Note: behavioural and structural models The phrases “behavioural model”
         and “structural model” are commonly used for what we’ll call “high-level
         models” and “synthesizable models”. In most cases, what people call struc-
         tural code contains both structural and behavioural code. The technically cor-
         rect definition of a structural model is an HDL program that contains only
         component instantiations and generate statements. Thus, even a program with
         c <= a AND b; is, strictly speaking, behavioural.
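
As an illustration (a sketch: the entity my_nand with inputs a, b and output z, and the gate-level entities and2 and inv, are all hypothetical):

```vhdl
-- Strictly structural: only component instantiations.
architecture structural of my_nand is
  signal c_int : std_logic;
begin
  u1 : entity work.and2 port map (a => a, b => b, c => c_int);
  u2 : entity work.inv  port map (a => c_int, y => z);
end architecture structural;

-- Behavioural, even though it describes a single gate.
architecture behavioural of my_nand is
begin
  z <= NOT (a AND b);
end architecture behavioural;
```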


2.3.3 Design Flow: Datapath vs Control vs Storage

2.3.3.1 Classes of Hardware

Each circuit tends to be dominated by either its datapath, control (state machine), or storage
(memory).
• Datapath
  – Purpose: compute output data based on input data
  – Each “parcel” of input produces one “parcel” of output
  – Examples: arithmetic, decoders


• Storage
  – Purpose: hold data for future use
  – Data is not modified while stored
  – Examples: register files, FIFO queues
• Control
  – Purpose: modify internal state based on inputs, compute outputs from state and inputs
  – Mostly individual signals, few data (vectors)
  – Examples: bus arbiters, memory-controllers
All three classes of circuits (datapath, control, and storage) follow the same generic design flow
(Figure 2.1) and use dataflow diagrams, hardware block diagrams, and state machines. The
differences in the design flows appear in the relative amount of effort spent on each type of
description and the order in which the different descriptions are used. The differences are most
pronounced in the transition from the high-level model to the model that separates the datapath
and control circuitry.


2.3.3.2 Datapath-Centric Design Flow




  High-Level Model
        |
  Dataflow Diagram                    <-> Modify/Analyze
        |
  Block Diagram  and  State Machine   <-> Modify/Analyze
        |
  DP+Ctrl RTL Code


                              Figure 2.2: Datapath-Centric Design Flow


2.3.3.3 Control-Centric Design Flow




  High-Level Model
        |
  State Machine       <-> Modify/Analyze
        |
  Dataflow Diagram    <-> Modify/Analyze
        |
  Block Diagram       <-> Modify/Analyze
        |
  DP+Ctrl RTL Code


                               Figure 2.3: Control-Centric Design Flow



2.3.3.4 Storage-Centric Design Flow

In E&CE 327, we won’t be discussing storage-centric design. Storage-centric design differs from
datapath- and control-centric design in that storage-centric design focusses on building many
replicated copies of small cells.
Storage-centric designs include a wide range of circuits, from simple memory arrays to
complicated circuits such as register files, translation lookaside buffers, and caches. The
complicated circuits can contain large and very intricate state machines, which would benefit
from some of the techniques for control-centric circuits.



2.4 Algorithms and High-Level Models
For designs with significant control flow, algorithms can be described in software languages,
flowcharts, abstract state machines, algorithmic state machines, etc.
For designs with trivial control flow (e.g. every parcel of input data undergoes the same
computation), data-dependency graphs (Section 2.4.2) are a good way to describe the algorithm.
For designs with a small amount of control flow (e.g. a microprocessor, where a single decision is
made based upon the opcode), a set of data-dependency graphs is often a good choice.


                Software executes in series;
                   hardware executes in parallel
When creating an algorithmic description of your hardware design, think about how you can
represent parallelism in the algorithmic notation that you are using, and how you can exploit
parallelism to improve the performance of your design.


2.4.1 Flow Charts and State Machines

Flow charts and various flavours of state machines are covered well in many courses. Generally,
everything that you’ve learned about these forms of description is also applicable in hardware
design.
In addition, you can exploit parallelism in state machine design to create communicating finite state
machines. A single complex state machine can be factored into multiple simple state machines that
operate in parallel and communicate with each other.


2.4.2 Data-Dependency Graphs

In software, the expression: (((((a + b) + c) + d) + e) + f) takes the same amount
of time to execute as: (a + b) + (c + d) + (e + f).
But, remember: hardware runs in parallel. In algorithmic descriptions, parentheses can guide
parallel vs serial execution.
Data-dependency graphs capture the algorithms of datapath-centric designs.
Datapath-centric designs have few, if any, control decisions: every parcel of input data undergoes
the same computation.

[Figure: the two expressions drawn as adder trees.]
    Serial (((((a+b)+c)+d)+e)+f): 5 adders on the longest path (slower); 5 adders used.
    Parallel (a+b)+(c+d)+(e+f): 3 adders on the longest path (faster); 5 adders used (equal area).
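As a sketch of how parenthesization can steer synthesis toward the balanced tree (assuming unsigned operands from numeric_std; the entity and signal names are illustrative, and many synthesis tools will also rebalance arithmetic on their own):

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity sum6 is
  port (
    a, b, c, d, e, f : in  unsigned(7 downto 0);
    z                : out unsigned(7 downto 0)   -- 8-bit modular sum
  );
end sum6;

architecture tree of sum6 is
begin
  -- Parentheses describe a balanced adder tree: 3 adders on the
  -- longest path instead of 5.
  z <= (a + b) + ((c + d) + (e + f));
end tree;
```

Writing the tree explicitly documents the intent even when the tool could infer it.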
122                                                   CHAPTER 2. RTL DESIGN WITH VHDL


2.4.3 High-Level Models

There are many different types of high-level models, depending upon the purpose of the model
and the characteristics of the design that the model describes. Some models may capture power
consumption, others performance, others data functionality.

High-level models are used to estimate the most important design metrics very early in the design
cycle. If power consumption is more important than performance, then you might write high-
level models that can predict the power consumption of different design choices, but which have
no information about the number of clock cycles that a computation takes, or which predict
latency only inaccurately. Conversely, if performance is important, you might write clock-cycle-accurate
high-level models that do not contain any information about power consumption.

Conventionally, performance has been the primary design metric. Hence, high-level models that
predict performance are more prevalent and better understood than other types of high-level
models. There are many research and entrepreneurial opportunities for people who can develop
tools and/or languages for high-level models that estimate power, area, maximum clock speed,
etc.

In E&CE 327 we will limit ourselves to the well-understood area of high-level models for perfor-
mance prediction.


2.5 Finite State Machines in VHDL
2.5.1 Introduction to State-Machine Design

2.5.1.1 Mealy vs Moore State Machines

Moore Machines       ..................................................................... .

• Outputs are dependent upon only the state
• No combinational paths from inputs to outputs

[State diagram: each state is labelled state/output. From s0/0, input a leads to s1/1 and !a leads
to s2/0; both s1 and s2 lead to s3/0, which returns to s0.]

Mealy Machines      ..................................................................... .

• Outputs are dependent upon both the state and the inputs
• Combinational paths from inputs to outputs

[State diagram: edges are labelled input/output. From s0, a/1 leads to s1 and !a/0 leads to s2;
both s1 and s2 lead to s3 with output 0, and s3 returns to s0.]

2.5.1.2 Introduction to State Machines and VHDL

A state machine is generally written as a single clocked process, or as a pair of processes, where
one is clocked and one is combinational.


Design Decisions      ..................................................................... .

•   Moore vs Mealy (Sections 2.5.2 and 2.5.3)
•   Implicit vs Explicit (Section 2.5.1.3)
•   State values in explicit state machines: Enumerated type vs constants (Section 2.5.5.1)
•   State values for constants: encoding scheme (binary, gray, one-hot, ...) (Section 2.5.5)


VHDL Constructs for State Machines           ..................................................


The following VHDL control constructs are useful to steer the transition from state to state:

•   if ... then ... else
•   case
•   for ... loop
•   while ... loop
•   loop
•   next
•   exit


2.5.1.3 Explicit vs Implicit State Machines

There are two broad styles of writing state machines in VHDL: explicit and implicit. “Explicit”
and “implicit” refer to whether there is an explicit state signal in the VHDL code. Explicit state
machines have a state signal in the VHDL code. Implicit state machines do not contain a state
signal. Instead, they use VHDL processes with multiple wait statements to control the execution.
In the explicit style of writing state machines, each process has at most one wait statement. For
the explicit style of writing state machines, there are two sub-categories: “current state” and “cur-
rent+next state”.
In the explicit-current style of writing state machines, the state signal represents the current state
of the machine and the signal is assigned its next value in a clocked process.
In the explicit-current+next style, there is a signal for the current state and another signal for the
next state. The next-state signal is assigned its value in a combinational process or concurrent state-
ment and is dependent upon the current state and the inputs. The current-state signal is assigned
its value in a clocked process and is just a flopped copy of the next-state signal.
For the implicit style of writing state machines, the synthesis program adds an implicit register to
hold the state signal and combinational circuitry to update the state signal. In Synopsys synthesis
tools, the state signal defined by the synthesizer is named multiple_wait_state_reg; in Mentor
Graphics tools, it is named STATE_VAR.
We can think of the VHDL code for implicit state machines as having zero state signals, explicit-
current state machines as having one state signal (state), and explicit-current+next state ma-
chines as having two state signals (state and state_nxt).


As with all topics in E&CE 327, there are tradeoffs between these different styles of writing state
machines. Most books teach only the explicit-current+next style. This style is the closest to
the hardware, which makes such machines more amenable to optimization through human interven-
tion, rather than relying on a synthesis tool for optimization. The advantage of the implicit style is
that it is concise and readable for control flows consisting of nested loops and branches (e.g.
the type of control flow that appears in software). For control flows that have less structure, it
can be difficult to write an implicit state machine. Very few books or synthesis manuals describe
multiple-wait-statement processes, but they are relatively well supported among synthesis tools.

Because implicit state machines are written with loops, if-then-elses, cases, etc., it is difficult to
write some state machines with complicated control flows in an implicit style. The state machine
below illustrates the point.

[State diagram: states s0/0, s1/1, s2/0, and s3/0, with transitions on a and !a forming an
unstructured control flow that does not map cleanly onto nested loops and branches.]


         Note:      The terminology of “explicit” and “implicit” is somewhat standard,
         in that some descriptions of processes with multiple wait statements describe
         the processes as having “implicit state machines”.
         There is no standard terminology to distinguish between the two explicit styles:
         explicit-current+next and explicit-current.


2.5.2 Implementing a Simple Moore Machine


[State diagram of the simple Moore machine: from s0/0, input a leads to s1/1 and !a leads to
s2/0; both lead to s3/0, which returns to s0.]

entity simple is
  port (
    a, clk : in std_logic;
    z      : out std_logic
  );
end simple;


2.5.2.1 Implicit Moore State Machine

Flops: 3   Gates: 2   Delay: 1 gate

architecture moore_implicit_v1a of simple is
begin
  process
  begin
    z <= '0';
    wait until rising_edge(clk);
    if (a = '1') then
      z <= '1';
    else
      z <= '0';
    end if;
    wait until rising_edge(clk);
    z <= '0';
    wait until rising_edge(clk);
  end process;
end moore_implicit_v1a;






2.5.2.2 Explicit Moore with Flopped Output

Flops: 3   Gates: 10   Delay: 3 gates

architecture moore_explicit_v1 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if (a = '1') then
            state <= s1;
            z     <= '1';
          else
            state <= s2;
            z     <= '0';
          end if;
        when s1 | s2 =>
          state <= s3;
          z     <= '0';
        when s3 =>
          state <= s0;
          z     <= '0';
      end case;
    end if;
  end process;
end moore_explicit_v1;


2.5.2.3 Explicit Moore with Combinational Outputs

Flops: 2   Gates: 7   Delay: 4 gates

architecture moore_explicit_v2 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if (a = '1') then
            state <= s1;
          else
            state <= s2;
          end if;
        when s1 | s2 =>
          state <= s3;
        when s3 =>
          state <= s0;
      end case;
    end if;
  end process;
  z <=   '1' when (state = s1)
    else '0';
end moore_explicit_v2;


2.5.2.4 Explicit-Current+Next Moore with Concurrent Assignment

Flops: 2   Gates: 7   Delay: 4 gates

architecture moore_explicit_v3 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state, state_nxt : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      state <= state_nxt;
    end if;
  end process;
  state_nxt <=   s1 when (state = s0) and (a = '1')
            else s2 when (state = s0) and (a = '0')
            else s3 when (state = s1) or (state = s2)
            else s0;
  z <=   '1' when (state = s1)
    else '0';
end moore_explicit_v3;

The hardware synthesized from this architecture is the same as that synthesized from
moore_explicit_v2, which is written in the explicit-current style.


2.5.2.5 Explicit-Current+Next Moore with Combinational Process

For this architecture, we change the conditional assignment to state into a combinational process
using a case statement. The hardware synthesized from this architecture is the same as that
synthesized from moore_explicit_v2 and moore_explicit_v3.

Flops: 2   Gates: 7   Delay: 4 gates

architecture moore_explicit_v4 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state, state_nxt : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      state <= state_nxt;
    end if;
  end process;
  process (state, a)
  begin
    case state is
      when s0 =>
        if (a = '1') then
          state_nxt <= s1;
        else
          state_nxt <= s2;
        end if;
      when s1 | s2 =>
        state_nxt <= s3;
      when s3 =>
        state_nxt <= s0;
    end case;
  end process;
  z <=   '1' when (state = s1)
    else '0';
end moore_explicit_v4;


2.5.3 Implementing a Simple Mealy Machine

Mealy machines have a combinational path from inputs to outputs, which often violates good
coding guidelines for hardware. Thus, Moore machines are much more common. You should
know how to write a Mealy machine if needed, but most of the state machines that you design will
be Moore machines.

This is the same entity as for the simple Moore state machine. The behaviour of the Mealy machine
is the same as the Moore machine, except for the timing relationship between the output (z) and
the input (a).


[State diagram of the simple Mealy machine: edges are labelled input/output. From s0, a/1 leads
to s1 and !a/0 leads to s2; both lead to s3 with output 0, and s3 returns to s0.]

entity simple is
  port (
    a, clk : in std_logic;
    z      : out std_logic
  );
end simple;


2.5.3.1 Implicit Mealy State Machine

         Note:     An implicit Mealy state machine is nonsensical.

In an implicit state machine, we do not have a state signal. But, as the example below illustrates,
to create a Mealy state machine we must have a state signal.

An implicit style is a nonsensical choice for Mealy state machines. Because the output is depen-
dent upon the input in the current clock cycle, the output cannot be a flop. For the output to be
combinational and dependent upon both the current state and the current input, we must create a
state signal that we can read in the assignment to the output. Creating a state signal negates the
advantages of using an implicit style of state machine.

Flops: 4   Gates: 8   Delay: 2 gates

architecture mealy_implicit of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process
  begin
    state <= s0;
    wait until rising_edge(clk);
    if (a = '1') then
      state <= s1;
    else
      state <= s2;
    end if;
    wait until rising_edge(clk);
    state <= s3;
    wait until rising_edge(clk);
  end process;
  z <=   '1' when (state = s0) and a = '1'
    else '0';
end mealy_implicit;


2.5.3.2 Explicit Mealy State Machine

Flops: 2   Gates: 7   Delay: 3 gates

architecture mealy_explicit of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if (a = '1') then
            state <= s1;
          else
            state <= s2;
          end if;
        when s1 | s2 =>
          state <= s3;
        when others =>
          state <= s0;
      end case;
    end if;
  end process;
  z <=   '1' when (state = s0) and a = '1'
    else '0';
end mealy_explicit;


2.5.3.3 Explicit-Current+Next Mealy

Flops: 2   Gates: 4   Delay: 3 gates

architecture mealy_explicit_v2 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state, state_nxt : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      state <= state_nxt;
    end if;
  end process;
  state_nxt <=   s1 when (state = s0) and a = '1'
            else s2 when (state = s0) and a = '0'
            else s3 when (state = s1) or (state = s2)
            else s0;
  z <=   '1' when (state = s0) and a = '1'
    else '0';
end mealy_explicit_v2;

For the Mealy machine, the explicit-current+next style is smaller than the explicit-current style.
In contrast, for the Moore machine, the two styles produce exactly the same hardware.


2.5.4 Reset

All circuits should have a reset signal that puts the circuit back into a good initial state. However,
not all flip-flops within the circuit need to be reset. In a circuit that has a datapath and a state
machine, the state machine will probably need to be reset, but the datapath may not.

There are standard ways to add a reset signal to both explicit and implicit state machines.

It is important that reset be tested on every clock cycle; otherwise, a reset might not be noticed, or
your circuit will be slow to react to reset and could generate illegal outputs after reset is asserted.


Reset with Implicit State Machine        ......................................................


With an implicit state machine, we need to insert a loop in the process and test for reset after each
wait statement.

Here is the implicit Moore machine from section 2.5.2.1 with reset code added (marked with comments).

architecture moore_implicit of simple is
begin
  process
  begin
    init : loop                      -- outermost loop
      z <= '0';
      wait until rising_edge(clk);
      next init when (reset = '1'); -- test for reset
      if (a = '1') then
        z <= '1';
      else
        z <= '0';
      end if;
      wait until rising_edge(clk);
      next init when (reset = '1'); -- test for reset
      z <= '0';
      wait until rising_edge(clk);
      next init when (reset = '1'); -- test for reset
    end loop init;
  end process;
end moore_implicit;


Reset with Explicit State Machine       ......................................................


Reset is often easier to include in an explicit state machine, because we need only put a test for
reset = '1' in the clocked process for the state.
The pattern for an explicit-current style of machine is:
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= s0;
      else
        if ... then
          state <= ...;
        elsif ... then
          ... -- more tests and assignments to state
        end if;
      end if;
    end if;
  end process;
Applying this pattern to the explicit Moore machine from section 2.5.2.3 produces:
architecture moore_explicit_v2 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if (reset = '1') then
        state <= s0;
      else
        case state is
          when s0 =>
            if (a = '1') then
              state <= s1;
            else
              state <= s2;
            end if;
          when s1 | s2 =>
            state <= s3;
          when s3 =>
            state <= s0;
        end case;
      end if;
    end if;
  end process;
  z <=   '1' when (state = s1)
    else '0';
end moore_explicit_v2;


The pattern for an explicit-current+next style is:

  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state_cur <= s0;  -- or whichever state is the reset state
      else
        state_cur <= state_nxt;
      end if;
    end if;
  end process;


2.5.5 State Encoding

When working with explicit state machines, we must address the issue of state encoding: which
bit-vector value to associate with each state.

With implicit state machines, we do not need to worry about state encoding. The synthesis program
determines the number of states and the encoding for each state.


2.5.5.1 Constants vs Enumerated Type

Using an enumerated type, the synthesis tool chooses the encoding:

  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
Using constants, we choose the encoding:

  subtype state_ty is std_logic_vector(1 downto 0);
  constant s0 : state_ty := "11";
  constant s1 : state_ty := "10";
  constant s2 : state_ty := "00";
  constant s3 : state_ty := "01";
  signal state : state_ty;


Providing Encodings for Enumerated Types             ............................................


Many synthesizers allow the user to provide hints on how to encode the states, or to provide the
desired encoding explicitly. These hints are given either through VHDL attributes or through
special comments in the code.
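For example, Synopsys-derived tools recognize the enum_encoding attribute (the exact attribute name and level of support vary by tool; check your synthesizer's manual):

```vhdl
  type state_ty is (s0, s1, s2, s3);
  attribute enum_encoding : string;
  -- Request a specific encoding for the four states (tool-dependent).
  attribute enum_encoding of state_ty : type is "11 10 00 01";
  signal state : state_ty;
```

This keeps the readability of an enumerated type while pinning down the bit-level encoding.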


Simulation     ............................................................................


When doing functional simulation with enumerated types, simulators often display waveforms
with "pretty-printed" values rather than bits (e.g. s0 and s1 rather than 11 and 10). However,
when simulating a design that has been mapped to gates, the enumerated type disappears and you
are left with just bits. If you don't know the encoding that the synthesis tool chose, it can be very
difficult to debug the design.

A related pitfall: catching unlisted states with a when others branch opens you up to potential
bugs if the enumerated type grows to include more values, which then end up unintentionally
executing your when others branch, rather than having a special branch of their own in the case
statement.


Unused Values       ....................................................................... .


If the number of values you have in your datatype is not a power of two, then you will have some
representable values that are unused.

For example:

  subtype state_ty is std_logic_vector(2 downto 0);
  constant s0 : state_ty := "011";
  constant s1 : state_ty := "000";
  constant s2 : state_ty := "001";
  constant s3 : state_ty := "010";
  constant s4 : state_ty := "101";
  signal state : state_ty;
This type only needs five unique values, but can represent eight different values. What should we
do with the three representable values that we don't need? The safest thing to do is to code your
design so that if an illegal value is encountered, the machine resets or enters an error state.
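A minimal sketch of this defensive style, using the constants above (the transition structure and the choice of recovering to s0 are illustrative, not from a particular design):

```vhdl
  process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when s0 => state <= s1;
        when s1 => state <= s2;
        when s2 => state <= s3;
        when s3 => state <= s4;
        when s4 => state <= s0;
        -- Any of the three unused encodings: recover to the reset state.
        when others => state <= s0;
      end case;
    end if;
  end process;
```

The when others branch guarantees that a corrupted state register cannot leave the machine stuck in an unreachable state.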


2.5.5.2 Encoding Schemes
• Binary: Conventional binary counting order.
• One-hot: Exactly one bit is asserted at any time.
• Modified one-hot: Altera's Quartus synthesizer generates an almost-one-hot encoding in which the
  bit representing the reset state is inverted. This means that the reset state is all '0's and every other
  state has two '1's: one for the reset state and one for the current state.
• Gray: A transition between adjacent values requires exactly one bit flip.
• Custom: Choose the encoding to simplify the combinational logic for a specific task.


Tradeoffs in Encoding Schemes        ....................................................... .

• Gray is good for low-power applications where consecutive data values typically differ by 1 (e.g.
  no random jumps).
• One-hot usually has less combinational logic and runs faster than binary for machines with up
  to a dozen or so states. With more than a dozen states, the extra flip-flops required by one-hot
  encoding become too expensive.
• Custom is great if you have lots of time and are incredibly intelligent, or have deep insight into
  the guts of your design.
         Note: Don't-care values When we don't care what the value of a signal is, we
         assign the signal '-', which is "don't care" in VHDL. This should allow the
         synthesis tool to use whatever value is most helpful in simplifying the Boolean
         equations for the signal (e.g. Karnaugh maps). In the past, some groups in
         E&CE 327 have used '-' quite successfully to decrease the area of their design.
         However, a few groups found that using '-' increased the size of their design,
         when they were expecting it to decrease the size. So, if you are tweaking your
         design to squeeze out the last few unneeded FPGA cells, pay close attention
         to whether using '-' hurts or helps.
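As a sketch, an output assignment can mark a state's output as don't-care (treating s3 here as a state whose output is never observed is purely illustrative):

```vhdl
  z <=   '1' when (state = s1)
    else '-' when (state = s3)  -- don't care: let the synthesizer choose
    else '0';
```

If the don't-care lets the tool merge this equation with nearby logic, area goes down; measure rather than assume.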


2.6 Dataflow Diagrams
2.6.1 Dataflow Diagrams Overview
• Dataflow diagrams are data-dependency graphs where the computation is divided into clock
  cycles.
• Purpose:
  – Provide a disciplined approach for designing datapath-centric circuits
  – Guide the design from algorithm, through high-level models, and finally to register transfer
    level code for the datapath and control circuitry.
  – Estimate area and performance
  – Make tradeoffs between different design options
• Background
  – Based on techniques from high-level synthesis tools
  – Some similarity between high-level synthesis and software compilation
  – Each dataflow diagram corresponds to a basic block in software compiler terminology.


[Figure: data-dependency graph for z = a + b + c + d + e + f, drawn as a chain of five adders
producing intermediate values x1, x2, x3, x4, and finally z.]

[Figure: dataflow diagram for z = a + b + c + d + e + f: the same chain of five adders, now
divided into clock cycles.]



[Figure: the dataflow diagram for z = a + b + c + d + e + f, with horizontal lines marking clock
cycle boundaries between the adders.]

The use of memory arrays in dataflow diagrams is described in section 2.11.4.


2.6.2 Dataflow Diagrams, Hardware, and Behaviour

Primary Input      ....................................................................... .

                                                                                             Behaviour
                                                                                              clk

Dataflow Diagram                                          Hardware                                 i              α                β

   i                                                                                              x              −               α
                                                     i                    x


   x


Register Input    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..


Dataflow Diagram                                                                              Behaviour
                                                                                              clk
 i                                                     Hardware
                                                       i                                          i              α                β
                                                                            x
                                                                                                  x              −               α
      x


Register Signal    ........................................................................

Dataflow Diagram                                          Hardware                            Behaviour
 i1 i2                                                                                        clk
                                        i1
                                                                                        x       i1               α                β               γ
                                                                +




      +                                                                                         i2               α                β               γ
                                        i2
                                                                                                  x              −                −               α
       x


Combinational-Component Output                            ................................................... .


[Figure: dataflow diagram, hardware, and behaviour for a combinational-component output: x = i1 + i2 with no register on x]


2.6.3 Dataflow Diagram Execution

Execution with Registers on Both Inputs and Outputs                   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..



[Figure: one-add-per-cycle dataflow diagram for z = a + b + c + d + e + f with registers on inputs and outputs, and its execution waveform: inputs registered in cycle 0, one sum (x1 through x5) computed in each of cycles 1 through 5, and the registered output z available after cycle 6]




Execution Without Output Registers                    ...................................................



[Figure: the same dataflow diagram without a register on the output: z is driven directly by the last adder and is available one cycle earlier]
144                                                                   CHAPTER 2. RTL DESIGN WITH VHDL


2.6.4 Performance Estimation

Performance Equations       . . . . .. . . . . .. . . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . ..


                          Performance ∝ 1 / TimeExec

                          TimeExec = Latency × ClockPeriod


   Definition Latency: The number of clock cycles from inputs to outputs. A combinational
     circuit has a latency of zero. A single register has a latency of one. A chain of n
     registers has a latency of n.


There is much more information on performance in Chapter 3, which is devoted to the topic.
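These equations are easy to exercise numerically. The following sketch (Python, with made-up delay values purely for illustration) computes execution time from latency and clock period:

```python
def time_exec(latency, clock_period):
    """TimeExec = Latency x ClockPeriod; performance is proportional to 1/TimeExec."""
    return latency * clock_period

# Illustrative numbers only (not from a real technology library):
# a latency of 6 cycles at a 2.0 ns clock period.
print(time_exec(6, 2.0))  # 12.0

# Halving the latency at the same clock period halves TimeExec,
# which doubles performance.
print(time_exec(3, 2.0))  # 6.0
```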


Performance of Dataflow Diagrams                  ................................................... .

• Latency: count the horizontal lines (clock-cycle boundaries) in the diagram
• Minimum clock period (maximum clock speed) is limited by the longest path within a clock cycle


2.6.5 Area Estimation
• The maximum number of instances of a component used in any one clock cycle is the total
   number of that component that is needed
• The maximum number of signals that cross any one cycle boundary is the total number of
   registers that are needed
• The maximum number of unconnected signal tails in any one clock cycle is the total number
   of inputs that are needed
• The maximum number of unconnected signal heads in any one clock cycle is the total number
   of outputs that are needed
The information above is only for estimating the number of components that are needed. In fact,
these estimates give lower bounds. There might be constraints on your design that will force you
to use more components (e.g., you might need to read all of your inputs at the same time).
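The counting rules above can be mechanized. The sketch below (Python; the schedule encoding is my own invention, not from the notes) computes the lower bounds for the one-add-per-cycle adder chain analyzed in section 2.6.6:

```python
def lower_bounds(ops_per_cycle, signals_across_boundary,
                 inputs_per_cycle, outputs_per_cycle):
    """Lower-bound resource estimates from a dataflow-diagram schedule.

    Each argument is a per-cycle (or per-boundary) count; the estimate for
    each resource is the maximum count over all cycles/boundaries.
    """
    return {
        "adders":    max(ops_per_cycle),
        "registers": max(signals_across_boundary),
        "inputs":    max(inputs_per_cycle),
        "outputs":   max(outputs_per_cycle),
    }

# One-add-per-cycle chain for z = a+b+c+d+e+f, registers on inputs and
# outputs: cycle 0 reads all six inputs, cycles 1-5 each do one addition.
est = lower_bounds(
    ops_per_cycle=[0, 1, 1, 1, 1, 1],
    signals_across_boundary=[6, 5, 4, 3, 2, 1],  # all 6 inputs cross boundary 0
    inputs_per_cycle=[6, 0, 0, 0, 0, 0],
    outputs_per_cycle=[0, 0, 0, 0, 0, 1],
)
print(est)  # {'adders': 1, 'registers': 6, 'inputs': 6, 'outputs': 1}
```

These match the design-analysis table for that schedule: one adder, six registers, six inputs, one output.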

Implementation-technology factors, such as the relative size of registers, multiplexers, and datapath
components, might force you to make tradeoffs that increase the number of datapath components
to decrease the overall area of the circuit.

Of particular relevance to FPGAs:
• With some FPGA chips, a 2:1 multiplexer has the same area as an adder.
• With some FPGA chips, a 2:1 multiplexer can be combined with an adder into one FPGA cell
  per bit.
• In FPGAs, registers are usually “free”, in that the area consumed by a circuit is limited by the
   amount of combinational logic, not the number of flip-flops.
In comparison, with ASICs and custom VLSI, 2:1 multiplexers are much smaller than adders, and
registers are quite expensive in area.


2.6.6 Design Analysis
[Figure: one-add-per-cycle dataflow diagram for z = a + b + c + d + e + f, with registers on inputs and outputs]

  num inputs           6
  num outputs          1
  num registers        6
  num adders           1
  min clock period     delay through flop and one adder
  latency              6 clock cycles




2.6.7 Area / Performance Tradeoffs

[Figure: side-by-side dataflow diagrams: one add per clock cycle (registers x1 through x5, latency 6) versus two adds per clock cycle (latency 4)]

                    Note: In the “two-add” design, half of the last clock cycle is wasted.


Two Adds per Clock Cycle                                                    .............................................................

[Figure: two-adds-per-cycle dataflow diagram and its execution waveform: inputs registered in cycle 0, two additions per cycle, and the output z available after cycle 4]


Design Comparison           .................................................................. .

[Figure: one-add and two-add dataflow diagrams side by side]

                One add per clock cycle    Two adds per clock cycle
 inputs                   6                           6
 outputs                  1                           1
 registers                6                           6
 adders                   1                           2
 clock period        flop + 1 add                flop + 2 add
 latency                  6                           4


  Question:     Under what circumstances would each design option be fastest?


  Answer:
     time = latency * clock period

     compare execution times for both options
       T1 = 6 × (Tf + Ta)
       T2 = 4 × (Tf + 2 × Ta)

     One-add will be faster when T1 < T2:
       6 × (Tf + Ta)   <   4 × (Tf + 2 × Ta)
          6Tf + 6Ta    <   4Tf + 8Ta
                2Tf    <   2Ta
                  Tf   <   Ta

     Sanity check: If add is slower than flop, then want to minimize the number of
     adds. One-add has fewer adds, so one-add will be faster when add is slower
     than flop.
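The crossover is easy to confirm numerically (a Python sketch; the delay values are illustrative, not from a real technology library):

```python
def t_one_add(tf, ta):
    # One add per cycle: 6 cycles, each a flop delay plus one adder delay.
    return 6 * (tf + ta)

def t_two_add(tf, ta):
    # Two adds per cycle: 4 cycles, each a flop delay plus two adder delays.
    return 4 * (tf + 2 * ta)

print(t_one_add(1.0, 2.0), t_two_add(1.0, 2.0))    # 18.0 20.0 -> one-add wins (Tf < Ta)
print(t_one_add(2.0, 1.0), t_two_add(2.0, 1.0))    # 18.0 16.0 -> two-add wins (Tf > Ta)
print(t_one_add(1.5, 1.5) == t_two_add(1.5, 1.5))  # True      -> tie at Tf = Ta
```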


2.7 Design Example: Massey
We’ll go through the following artifacts:

   1. requirements
   2. algorithm
   3. dataflow diagram
   4. high-level models
   5. hardware block diagram
   6. RTL code for datapath
   7. state machine
   8. RTL code for control


Design Process     ....................................................................... .


   1. Scheduling (allocate operations to clock cycles)
   2. I/O allocation
   3. First high-level model
   4. Register allocation
   5. Datapath allocation
   6. Connect datapath components, insert muxes where needed
   7. Design implicit state machine
   8. Optimize
   9. Design explicit-current state machine
 10. Optimize


2.7.1 Requirements
  Functional requirements:
     • Compute the sum of six 8-bit numbers: output = a + b + c + d + e + f
     • Use registers on both inputs and outputs
  Performance requirements:
     • Maximum clock period: unlimited
     • Maximum latency: four
  Cost requirements:
     • Maximum of two adders
      • Small miscellaneous hardware (e.g., muxes) is unlimited
      • Maximum of three inputs and one output
      • Design effort is unlimited

                    Note: In reality, multiplexers are not free. In FPGAs, a 2:1 mux is more
                    expensive than a full-adder: a 2:1 mux has three inputs, while an adder has
                    only two (the carry-in and carry-out signals usually use the special
                    “vertical” connections on the FPGA cell). In FPGAs, sharing an adder
                    between two signals can be more expensive than having two adders. In a
                    “generic-gate” technology, a multiplexer contains three two-input gates,
                    while a full-adder contains fourteen two-input gates.


2.7.2 Algorithm

We’ll use parentheses to group operations so as to maximize our opportunities to perform the work
in parallel:
  z = (a + b) + (c + d) + (e + f)

This results in the following data-dependency graph:
[Figure: data-dependency graph — a+b, c+d, and e+f computed in parallel; then (a+b)+(c+d); then the final sum]


2.7.3 Initial Dataflow Diagram
[Figure: initial dataflow diagram for the parallel algorithm: a+b and c+d in the first cycle (four inputs); their sum and e+f in the second; the final sum z in the third]

This dataflow diagram violates the requirement to use at most three inputs.


2.7.4 Dataflow Diagram Scheduling

We can potentially optimize the inputs, outputs, area, and performance of a dataflow diagram by
rescheduling the operations, that is, by allocating the operations to different clock cycles.

Parallel algorithms have higher performance and greater scheduling flexibility than serial
algorithms.

Serial algorithms tend to have less area than parallel algorithms.


                       Serial                           Parallel
               (((((a+b)+c)+d)+e)+f)               (a+b)+(c+d)+(e+f)

[Figure: dataflow diagrams for the serial algorithm (a chain of five dependent additions) and the parallel algorithm (a three-level tree of additions)]
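The latency difference comes from the depth of the data-dependency graph: a serial chain over n values needs n − 1 dependent additions, while a balanced parallel tree needs only ceil(log2 n) adder levels. A small sketch (Python, illustrative only):

```python
import math

def serial_depth(n):
    # (((((a+b)+c)+d)+e)+f): every addition depends on the previous one.
    return n - 1

def parallel_depth(n):
    # Balanced tree of two-input adders over n values.
    return math.ceil(math.log2(n))

print(serial_depth(6), parallel_depth(6))  # 5 3
```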


Scheduling to Optimize Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: the original parallel dataflow diagram (three adders in the first cycle) and the rescheduled parallel diagram (two adders, with e and f read one cycle later)]

                  Original parallel    Parallel after scheduling
 inputs                   6                       4
 outputs                  1                       1
 registers                6                       4
 adders                   3                       2
 clock period        flop + 1 add            flop + 1 add
 latency                  3                       3


Scheduling to Optimize Inputs                       .........................................................

[Figure: parallel dataflow diagram rescheduled to read two inputs per cycle: a and b first, then c and d, then e and f]

Rescheduling the dataflow diagram from the parallel algorithm reduced the area from three adders
to two. However, it still violates the restriction of a maximum of three inputs. We can reschedule
the operations to keep the same area but reduce the number of inputs.

The tradeoff is that reducing the number of inputs causes an increase in the latency from four
to five.
A latency of five violates the design requirement of a maximum latency of four clock cycles. In
comparing the dataflow diagram above with the design requirements, we notice that the requirements
allow a clock cycle that includes two additions and three inputs.

It appears that the parallel algorithm will not lead us to a design that satisfies the
requirements. We revisit the algorithm and try a serial algorithm:

  z = ((((a + b) + c) + d) + e) + f

The corresponding dataflow diagram:

[Figure: serial dataflow diagram with at most three inputs per cycle: x1 = a + b, x2 = x1 + c, x3 = x2 + d, x4 = x3 + e, z = x4 + f, with inputs read as they are needed]


2.7.5 Optimize Inputs and Outputs

When we rescheduled the parallel algorithm, we rescheduled the input values. This requires
renegotiating the schedule of input values with our environment. Sometimes the environment of our
circuit will be willing to reschedule the inputs, but in other situations the environment will
impose a non-negotiable schedule upon us.

If you are currently storing all inputs and can change the environment's behaviour to delay
sending some inputs, then you can reduce the number of inputs and registers.

We will illustrate this on both the one-add and the two-add designs.
[Figure: one-add dataflow diagram before I/O optimization (all six inputs read in the first cycle) and after (a and b in the first cycle, then one new input per cycle)]

                One-add before I/O opt    One-add after I/O opt
 inputs                   6                        2
 regs                     6                        2


[Figure: two-add dataflow diagram before I/O optimization (all six inputs read in the first cycle) and after (a, b, c in the first cycle; d and e in the second; f in the third)]

                Two-add before I/O opt    Two-add after I/O opt
 inputs                   6                        3
 regs                     6                        3


Design Comparison Between One and Two Add                                               ....................................... .

[Figure: one-add and two-add dataflow diagrams after I/O optimization, side by side]

                One-add after I/O opt    Two-add after I/O opt
 inputs                   2                        3
 outputs                  1                        1
 registers                2                        3
 adders                   1                        2
 clock period        flop + 1 add             flop + 2 add
 latency                  6                        4


Hardware Recipe for Two-Add                           . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..

We return now to the two-add design, with the dataflow diagram:

[Figure: two-add dataflow diagram — a, b, c read in the first cycle; x1 = a + b; then x2 (with d and e read); x3; x4 (with f read); registered output z]

Based on the dataflow diagram, we can determine the hardware resources required for the datapath.

   Table 2.2: Hardware Recipe for Two-Add

  inputs                                 3
  adders                                 2
  registers                              3
  output                                 1
  registered inputs                     YES
  registered outputs                    YES
  clock cycles from inputs to outputs    4




2.7.6 Input/Output Allocation

Our first step after settling on a hardware recipe is I/O allocation, because that determines the
interface between our circuit and the outside world.

From the hardware recipe, we know that we need only three inputs and one output. However, we
have six different input values. We need to allocate these input values to input signals before we
can write a high-level model that performs the computation of our design.

Based on the input and output information in the hardware recipe, we can define our entity:

  entity massey is
    port (
      clk        : in std_logic;
      i1, i2, i3 : in unsigned(7 downto 0);
      o1         : out unsigned(7 downto 0)
    );
  end massey;




[Left: the dataflow diagram with each value annotated with its allocated port (a from i1; b, d, f from i2; c, e from i3; z to o1). Right: the hardware block diagram with inputs i1, i2, i3, two adders, and output o1.]
      Figure 2.4: Dataflow diagram and hardware block diagram with I/O port allocation

Based upon the dataflow diagram after I/O allocation, we can write our first high-level model (hlm_v1).

In the high-level model, the entire circuit is implemented in a single process. For larger circuits, it may be beneficial to have separate processes for different groups of signals.

In the high-level model, the code between wait statements describes the work that is done in a clock cycle.

The hlm_v1 architecture uses an implicit state machine.

Because the process is clocked, all of the signals that are assigned to in the process are registers. Combinational signals would need to be done using concurrent assignments or combinational processes.

  architecture hlm_v1 of massey is
    ...internal signal decls...
  begin
    process begin
      wait until rising_edge(clk);
      a <= i1;
      b <= i2;
      c <= i3;
      wait until rising_edge(clk);
      x2 <= (a + b) + c;
      d <= i2;
      e <= i3;
      wait until rising_edge(clk);
      x4 <= (x2 + d) + e;
      f <= i2;
      wait until rising_edge(clk);
      z <= (x4 + f);
    end process;
    o1 <= z;
  end hlm_v1;


2.7.7 Register Allocation

The next step after I/O allocation could be either register allocation or datapath allocation. The benefit of doing register allocation first is that it is possible to write VHDL code after register allocation is done but before datapath allocation is done, whereas the inverse (datapath allocation done but register allocation not done) cannot be expressed in a hardware description language. In this example, we will do register allocation before datapath allocation, and show the resulting VHDL code.


[Left: the dataflow diagram annotated with register allocation (r1 holds a, x2, x4; r2 holds b, d, f; r3 holds c, e). Right: the hardware block diagram with registers r1, r2, r3, two adders, and output o1.]

  I/O Allocation        i1  a
                        i2  b, d, f
                        i3  c, e
                        o1  z
  Register Allocation   r1  a, x2, x4
                        r2  b, d, f
                        r3  c, e

  architecture hlm_v2 of massey is
    ...internal signal decls...
  begin
    process begin
      wait until rising_edge(clk);
      r1 <= i1;
      r2 <= i2;
      r3 <= i3;
      wait until rising_edge(clk);
      r1 <= (r1 + r2) + r3;
      r2 <= i2;
      r3 <= i3;
      wait until rising_edge(clk);
      r1 <= (r1 + r2) + r3;
      r2 <= i2;
      wait until rising_edge(clk);
      r3 <= (r1 + r2);
    end process;
    o1 <= r3;
  end hlm_v2;

                         Figure 2.5: Block diagram after I/O and register allocation


2.7.8 Datapath Allocation

In datapath allocation, we allocate each of the data operations in the dataflow diagram to one of
the datapath components in the hardware block diagram.


[Left: the dataflow diagram annotated with datapath allocation (adder a1 computes x1, x3, and z; adder a2 computes x2 and x4). Right: the hardware block diagram with registers r1, r2, r3 feeding adders a1 and a2, and output o1.]

  I/O Allocation        i1  a
                        i2  b, d, f
                        i3  c, e
                        o1  z
  Register Allocation   r1  a, x2, x4
                        r2  b, d, f
                        r3  c, e
  Datapath Allocation   a1  x1, x3, z
                        a2  x2, x4

  architecture hlm_dp of massey is
    ...internal signal decls...
  begin
    process begin
      wait until rising_edge(clk);
      r1 <= i1;
      r2 <= i2;
      r3 <= i3;
      wait until rising_edge(clk);
      r1 <= a2;
      r2 <= i2;
      r3 <= i3;
      wait until rising_edge(clk);
      r1 <= a2;
      r2 <= i2;
      wait until rising_edge(clk);
      r3 <= a1;
    end process;
    a1 <= r1 + r2;
    a2 <= a1 + r3;
    o1 <= r3;
  end hlm_dp;

                  Figure 2.6: Block diagram after I/O, register, and datapath allocation


2.7.9 Datapath for DP+Ctrl Model

We will now evolve from an implicit state machine to an explicit state machine. The first step is to
label the states in the dataflow diagram and then construct tables to find the values for chip-enable
and mux-select signals.
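Each row of these tables maps onto one registered value with a chip-enable and an input mux. As a minimal sketch of that mapping (the entity name ce_mux_reg and its port names are illustrative, not part of the design example), a single such register could be written as:

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical sketch: one registered value with a chip-enable (ce)
-- and a 2-input mux (sel) on its data input, corresponding to one
-- column of the chip-enable / mux-select tables.
entity ce_mux_reg is
  port (
    clk, ce, sel : in  std_logic;
    d0, d1       : in  unsigned(7 downto 0);
    q            : out unsigned(7 downto 0)
  );
end ce_mux_reg;

architecture rtl of ce_mux_reg is
begin
  process (clk) begin
    if rising_edge(clk) then
      if ce = '1' then           -- chip enable: hold the value when ce = '0'
        if sel = '1' then
          q <= d1;               -- mux select chooses the data source
        else
          q <= d0;
        end if;
      end if;
    end if;
  end process;
end rtl;
```

A "don't care" entry in the table means the corresponding ce or sel signal may take any value in that state, which is what gives us freedom to simplify later.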

[Dataflow diagram annotated with states S0-S3: S0 registers the inputs a, b, c; S1 computes x1 and x2 and registers new values from i2 and i3; S2 computes x3 and x4 and registers a new value from i2; S3 computes z, which is driven onto o1 in the following S0.]




Datapath for DP+Ctrl Model                      ......................................................... .


         r1            r2            r3
  S0   ce=1, d=i1   ce=1, d=i2   ce=1, d=i3
  S1   ce=1, d=a2   ce=1, d=i2   ce=1, d=i3
  S2   ce=1, d=a2   ce=1, d=i2   ce=–, d=–
  S3   ce=–, d=–    ce=–, d=–    ce=1, d=a1

            a1                    a2
  S0   src1=–,  src2=–      src1=–,  src2=–
  S1   src1=r1, src2=r2     src1=a1, src2=r3
  S2   src1=r1, src2=r2     src1=a1, src2=r3
  S3   src1=r1, src2=r2     src1=–,  src2=–


Choose Don’t-Care Values                   .............................................................


         r1            r2            r3
  S0   ce=1, d=i1   ce=1, d=i2   ce=1, d=i3
  S1   ce=1, d=a2   ce=1, d=i2   ce=1, d=i3
  S2   ce=1, d=a2   ce=1, d=i2   ce=1, d=i3
  S3   ce=1, d=a2   ce=1, d=i2   ce=1, d=a1

            a1                    a2
  S0   src1=r1, src2=r2     src1=a1, src2=r3
  S1   src1=r1, src2=r2     src1=a1, src2=r3
  S2   src1=r1, src2=r2     src1=a1, src2=r3
  S3   src1=r1, src2=r2     src1=a1, src2=r3


Simplify   ............................................................................. .

         r1    r2 = i2    r3
  S0   d=i1              d=i3
  S1   d=a2              d=i3
  S2   d=a2              d=i3
  S3   d=a2              d=a1

            a1                    a2
       src1=r1, src2=r2     src1=a1, src2=r3


VHDL Code      ......................................................................... .


architecture explicit_v1 of massey is
  subtype state_ty is std_logic_vector(3 downto 0);
  constant s0 : state_ty := "0001"; constant s1 : state_ty := "0010";
  constant s2 : state_ty := "0100"; constant s3 : state_ty := "1000";
  signal state : state_ty;
begin


  ----------------------
  -- r_1
  process (clk) begin
    if rising_edge(clk) then
      if state = S0 then
        r_1 <= i_1;
      else
        r_1 <= a_2;
      end if;
    end if;
  end process;

  ----------------------
  -- r_2
  process (clk) begin
    if rising_edge(clk) then
      r_2 <= i_2;
    end if;
  end process;

  ----------------------
  -- r_3
  process (clk) begin
    if rising_edge(clk) then
      if state = S3 then
        r_3 <= a_1;
      else
        r_3 <= i_3;
      end if;
    end if;
  end process;

  ----------------------
  -- combinational datapath
  a_1 <= r_1 + r_2;
  a_2 <= a_1 + r_3;
  o_1 <= r_3;

  ----------------------
  -- state machine
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= S0;
      else
        case state is
          when S0     => state <= S1;
          when S1     => state <= S2;
          when S2     => state <= S3;
          when others => state <= S0;
        end case;
      end if;
    end if;
  end process;
end explicit_v1;



2.7.10 Peephole Optimizations

Peephole optimizations are localized optimizations to code, in that they affect only a few lines of
code. In hardware design, peephole optimizations are usually done to decrease the clock period,
although some optimizations might also decrease area. There are many different types of opti-
mizations, and many optimizations that designers do by hand are things that you might expect a
synthesis tool to do automatically.

In a comparison such as: state = S0, when we use a one-hot state encoding, we need to compare
only one of the bits of the state. The comparison can be simplified to: state(0) = '1'.
Without this optimization, many synthesis tools will produce hardware that tests all of the bits of
the state signal. This increases the area, because more bits are required as inputs to the compari-
son, and increases the clock period because the wider comparison leads to a tree-like structure of
combinational logic, or an increased number of FPGA cells.


In this example, we will take advantage of our state encoding to optimize the code for r_1, r_3,
and the state machine.
-- r_1
process (clk) begin
  if rising_edge(clk) then
    if state = S0 then
      r_1 <= i_1;
    else
      r_1 <= a_2;
    end if;
  end if;
end process;

-- r_1 (optimized)
process (clk) begin
  if rising_edge(clk) then
    if state(0) = '1' then
      r_1 <= i_1;
    else
      r_1 <= a_2;
    end if;
  end if;
end process;


The code for r_2 remains unchanged.
-- r_3
process (clk) begin
  if rising_edge(clk) then
    if state = S3 then
      r_3 <= a_1;
    else
      r_3 <= i_3;
    end if;
  end if;
end process;

-- r_3 (optimized)
process (clk) begin
  if rising_edge(clk) then
    if state(3) = '1' then
      r_3 <= a_1;
    else
      r_3 <= i_3;
    end if;
  end if;
end process;


-- state machine
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      state <= S0;
    else
      case state is
        when S0     => state <= S1;
        when S1     => state <= S2;
        when S2     => state <= S3;
        when others => state <= S0;
      end case;
    end if;
  end if;
end process;

-- state machine (optimized)
-- NOTE: "st" = "state"
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      st <= S0;
    else
      for i in 0 to 3 loop
        st((i+1) mod 4) <= st(i);
      end loop;
    end if;
  end if;
end process;


The hardware-block diagram that corresponds to the tables and VHDL code is:

[Hardware block diagram: a one-hot state register State(0)-State(3) with reset, whose bits drive the chip-enable and mux-select signals of registers r1, r2, r3; adders a1 and a2; inputs i1, i2, i3; output o1.]




2.8 Design Example: Vanier
We’ll go through the following artifacts:

   1. requirements
   2. algorithm
   3. dataflow diagram
   4. high-level models
   5. hardware block diagram
   6. RTL code for datapath
   7. state machine
   8. RTL code for control


Design Process      ....................................................................... .


   1. Scheduling (allocate operations to clock cycles)
   2. I/O allocation
   3. First high-level model


   4. Register allocation
   5. Datapath allocation
   6. Connect datapath components, insert muxes where needed
   7. Design implicit state machine
   8. Optimize
   9. Design explicit-current state machine
 10. Optimize


2.8.1 Requirements
• Functional requirements: compute the following formula:
  output = (a × d) + c + (d × b) + b
• Performance requirement:
  – Max clock period: flop plus (2 adds or 1 multiply)
  – Max latency: 4
• Cost requirements
  – Maximum of two adders
  – Maximum of two multipliers
  – Unlimited registers
  – Maximum of three inputs and one output
  – Maximum of 5000 student-minutes of design effort
• Registered inputs and outputs


2.8.2 Algorithm

Create a data-dependency graph for the algorithm.

NOTE: if we draw the data-dependency graph in alphabetical order, it's ugly. The lesson is to think about layout, and possibly re-do the layout to make the graph simple and easy to understand before proceeding.

[Data-dependency graph for output = (a × d) + c + (d × b) + b, drawn with the inputs in the order a, d, b, c.]


2.8.3 Initial Dataflow Diagram

Schedule operations into clock cycles. Use an "as soon as possible" schedule, obeying the performance requirement of a maximum clock period of one multiply or two additions. In this initial diagram, we ignore the resource requirements. This allows us to establish a lower bound on the latency, which gives us the maximum performance that we can hope to achieve.

[Initial dataflow diagram: inputs a, d, b, c all in the first clock cycle; the operations are scheduled as soon as possible, producing z.]


2.8.4 Reschedule to Meet Requirements

We have four inputs, but the requirements allow a maximum of three. We need to move one input
into the second clock cycle. We want to choose an input that can be delayed by one clock cycle
without violating a requirement and with minimal degradation of performance (clock period and
latency).

If delaying an input by a clock cycle causes a requirement to be violated, we can often reschedule
the operations to remove the violation. So, we sometimes create an intermediate dataflow diagram
that violates a requirement, then reschedule the operations to bring the dataflow diagram back into
compliance.

The critical path is from d and b, through a multiplier, the middle adder, the final adder, and then
out through z. Because the inputs d and b are on the critical path, it would be preferable to choose
another input (either a or c) as the input to move into the second clock cycle.

If we move c, we will move the first addition into the second clock cycle, which will force us to
use three adders, violating our resource requirement of a maximum of two adders.

By process of elimination, we have settled on a as the input to be delayed. This causes one of the multiply operations to be moved into the second clock cycle, which is good, because it reduces our resources from two multipliers to just one.

[Dataflow diagram: a moved into the second clock cycle; d, b, and c remain in the first.]


Moving a into the second clock cycle has caused a clock period violation, because our clock period is now a register, a multiply, and an add. This forces us to add an additional clock cycle, which gives us a latency of four.

[Dataflow diagram: a third clock cycle added to resolve the clock period violation.]


2.8.5 Optimize Resources

We can exploit the additional clock cycle to reschedule our operations to reduce the number of inputs from three to two. The disadvantage is that we have increased the number of registers from four to five.

[Dataflow diagram: c moved into the second clock cycle alongside a, leaving only d and b in the first clock cycle.]
Two side comments:
• Moving the second addition from the third clock cycle to the second will not improve the per-
   formance or the area. The number of adders will remain at two, the number of registers will
   remain at five, and the clock period will remain at the maximum of a multiply or two additions.
• In hindsight, if we had originally chosen to move c, rather than a, into the second clock cycle,
   we would likely have produced this same dataflow diagram. After moving c, we would see
   the resource violation of three adders in the second clock cycle. This violation would cause us
   to add a third clock cycle, which would have given us an opportunity to move a into the second clock cycle.
   The lesson is that there are usually several different ways to approach a design problem, and it
   is infeasible to predict which approach will result in the best design. At best, we have many
   heuristics, or “rules of thumb”, that give us guidelines for techniques that usually work well.
Having finalized our input/output scheduling, we can write our entity. Note: we will add a reset
signal later, when we design the state machine to control the datapath.


entity vanier is
  port (
    clk      : in std_logic;
    i_1, i_2 : in std_logic_vector(15 downto 0);
    o_1      : out std_logic_vector(15 downto 0)
  );
end vanier;


2.8.6 Assign Names to Registered Values

We must assign a name to each registered value. Optionally, we may also assign names to com-
binational values. Registers require names, because in VHDL each register (except implicit state
registers) is associated with a named signal. Combinational signals do not require names, be-
cause VHDL allows anonymous (unnamed) combinational signals. For example, in the expression
(a+b)+c we do not need to provide a name for the sum of a and b.
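As a minimal sketch of this point (the entity sum3 and its port names are illustrative, not part of the design examples), the registered result must be a named signal, while the intermediate sum a+b never needs a name:

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical sketch: (a + b) is an anonymous combinational value,
-- but the registered result needs the name r_sum, because in VHDL
-- each register is associated with a named signal.
entity sum3 is
  port (
    clk     : in  std_logic;
    a, b, c : in  unsigned(7 downto 0);
    o       : out unsigned(7 downto 0)
  );
end sum3;

architecture rtl of sum3 is
  signal r_sum : unsigned(7 downto 0);  -- named: becomes a register
begin
  process (clk) begin
    if rising_edge(clk) then
      r_sum <= (a + b) + c;  -- the sum a+b stays anonymous
    end if;
  end process;
  o <= r_sum;
end rtl;
```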

If a single value spans multiple clock cycles, it only needs to be named once. In our example, x_1, x_2, and x_4 each cross two register boundaries.

[Dataflow diagram with named registered values: x_1 (d) and x_2 (b) in the first clock cycle; x_3 (a), x_4 (x_1 × x_2), and x_5 (c) in the second; x_6 and x_7 in the third; x_8 (z) in the fourth.]


2.8.7 Input/Output Allocation

Now that we have names for all of our registered signals, we can allocate input and output ports to
signals.

After the input and output ports have been allocated to signals, we can write our first model. We
use an implicit state machine and define only the registered values. In each state, we define the
values of the registered values that are computed in that state.

[Dataflow diagram with I/O port allocation: i_1 feeds d and a; i_2 feeds b and c; o_1 is driven by z.]

architecture hlm_v1 of vanier is
  signal x_1, x_2, x_3, x_4, x_5, x_6,
         x_7, x_8 : unsigned(15 downto 0);
begin
  process begin
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    x_1 <= unsigned(i_1);
    x_2 <= unsigned(i_2);
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    x_3 <= unsigned(i_1);
    x_4 <= x_1(7 downto 0) * x_2(7 downto 0);
    x_5 <= unsigned(i_2);
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    x_6 <= x_3(7 downto 0) * x_1(7 downto 0);
    x_7 <= x_2 + x_5;
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    x_8 <= x_6 + (x_4 + x_7);
  end process;
  o_1 <= std_logic_vector(x_8);
end hlm_v1;


[Two timing charts over clock cycles 0-5: the first shows when i1, i2, and each named value x1-x8 is live; the second shows the lifetimes of the inputs and the five registers r1-r5 after register allocation.]

The model hlm_v1 is synthesizable. If we are happy with the clock speed and area, we can stop
now! The remaining steps of the design process seek to optimize the design by reducing the area
and clock period. For area, we will reduce the number of registers, datapath components, and
multiplexers. Reducing the clock period will occur as we reduce the number of multiplexers and
potentially perform peephole (localized) optimizations, such as Boolean simplification.


2.8.8 Tangent: Combinational Outputs

To demonstrate a high-level model with a combinational output, we modify hlm v1 so that the
output is combinational, rather than a register (see hlm v1c). To make the output combinational,
we move the computation of x 8 out of the main clocked process and into a concurrent statement.
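The timing consequence of this change can be sketched with a small Python model (illustrative only, not part of the course code): a registered output lags its operands by one clock cycle, while a combinational output changes in the same cycle its operands do.

```python
# Hypothetical model of the final adds of the dataflow diagram,
# o_1 = x6 + (x4 + x7), computed either through a register (hlm_v1)
# or combinationally (hlm_v1c).

def simulate(x4, x6, x7, registered):
    """x4, x6, x7: per-cycle operand values. Returns per-cycle o_1."""
    out = []
    reg = 0  # the register's initial contents (assumed zero here)
    for t in range(len(x4)):
        comb = x6[t] + (x4[t] + x7[t])  # the concurrent sum
        out.append(reg if registered else comb)
        reg = comb                      # captured at the next clock edge
    return out

x4, x6, x7 = [1, 2, 3], [10, 20, 30], [100, 200, 300]
print(simulate(x4, x6, x7, registered=False))  # [111, 222, 333]
print(simulate(x4, x6, x7, registered=True))   # [0, 111, 222]
```

The combinational version produces each result one cycle earlier, at the cost of the final adders sitting on the output path.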

architecture hlm_v1c of vanier is
  signal x_1, x_2, x_3, x_4, x_5, x_6, x_7
            : unsigned(15 downto 0);
begin
  process begin
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    x_1 <= unsigned(i_1);
    x_2 <= unsigned(i_2);
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    x_3 <= unsigned(i_1);
    x_4 <= x_1(7 downto 0) * x_2(7 downto 0);
    x_5 <= unsigned(i_2);
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    x_6 <= x_3(7 downto 0) * x_1(7 downto 0);
    x_7 <= x_2 + x_5;
  end process;
  o_1 <= std_logic_vector(x_6 + (x_4 + x_7));
end hlm_v1c;

[Figure: dataflow diagram for hlm_v1c. Inputs i1 (d, then a) and i2 (b, then c); x4 = x1*x2, x6 = x3*x1, x7 = x2 + x5; the output z = x6 + (x4 + x7) is computed combinationally.]


2.8.9 Register Allocation

Our previous model (hlm v1) uses eight registers (x 1. . . x 8). However, our analysis of the
dataflow diagrams says that we can implement the diagram with just five registers. Also, the code
for hlm v1 contains two occurrences of the multiplication symbol (*) and three occurrences of the
addition symbol (+). Our analysis of the dataflow diagram showed that we need only one multiply
and two adds. In hlm v1 we are relying on the synthesis tool to recognize that even though the
code contains two multiplies and three adds, the hardware needs only one multiply and two adds.

Register allocation is the task of assigning each of our registered values to a register signal. Dat-
apath allocation is the task of assigning each datapath operation to a datapath component. Only
high-level synthesis tools (and software compilers) do register allocation. So, as hardware design-
ers, we are stuck with the task of doing register allocation ourselves if we want to further optimize
our design. Some register-transfer-level synthesis tools do datapath allocation. If your synthesis
tool does datapath allocation, it is important to learn the idioms and limitations of the tool so that
you can write your code in a style that allows the tool to do a good job of allocation and optimiza-
tion. In most cases where area or clock speed is an important design metric, design engineers do
datapath allocation by hand or with ad-hoc software and spreadsheets.

We will now step through the tasks of register allocation and datapath allocation. In our eight-
register model, each register holds a unique value — we do not reuse registers. To reduce the
number of registers from eight to five, we will need to reuse registers, so that a register potentially
holds different values in different clock cycles.
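The five-register claim can be checked with a quick lifetime count. The intervals below are read off the dataflow diagram, so treat the exact numbers as illustrative; the minimum register count is the maximum number of values that must be held simultaneously.

```python
# Hypothetical Python sketch of the lifetime argument (not from the notes).
# Each interval is (first cycle the value sits in a register, last cycle
# it is read).
held = {
    "x1": (2, 3), "x2": (2, 3), "x3": (3, 3), "x4": (3, 4),
    "x5": (3, 3), "x6": (4, 4), "x7": (4, 4), "x8": (5, 5),
}

# Minimum register count = maximum number of simultaneously held values.
min_registers = max(
    sum(1 for (lo, hi) in held.values() if lo <= t <= hi)
    for t in range(1, 6)
)
print(min_registers)  # 5
```

The peak occurs in the third cycle, when x1 through x5 are all live, which is why five registers suffice even though there are eight values.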

When doing register allocation, we assign a register to each signal that crosses a clock-cycle
boundary. When creating the hardware block diagram, we will need to add multiplexers to the
inputs of modules that are connected to multiple registers. To reduce the number of multiplexers,
we try to allocate the same registers to the same inputs of the same type of module. For example,
because x 7 is an input to an adder, we allocate r 5 to x 7: r 5 was also an input to an adder in
another clock cycle. Similarly, in the third clock cycle we allocate r 2 to x 6, because in the second
clock cycle the inputs to an adder were r 2 and r 5. In the last clock cycle, we allocate r 5 to x 8,
because r 5 previously held the output of r 2 + r 5.

We update our model to reflect register allocation by replacing the signals for registered values
(x 1 . . . x 8) with the registers r 1 . . . r 5.


[Figure: dataflow diagram for hlm_v2 with register allocation: x1→r1, x2→r2, x3→r3, x4→r4, x5→r5, x6→r2, x7→r5, x8→r5.]

architecture hlm_v2 of vanier is
  signal r_1, r_2, r_3, r_4, r_5
          : unsigned(15 downto 0);
begin
  process begin
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    r_1 <= unsigned(i_1);
    r_2 <= unsigned(i_2);
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    r_3 <= unsigned(i_1);
    r_4 <= r_1(7 downto 0) * r_2(7 downto 0);
    r_5 <= unsigned(i_2);
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    r_2 <= r_3(7 downto 0) * r_1(7 downto 0);
    r_5 <= r_2 + r_5;
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    r_5 <= r_2 + (r_4 + r_5);
  end process;
  o_1 <= std_logic_vector(r_5);
end hlm_v2;
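As a sanity check on the register reuse, the schedule of hlm v2 can be mimicked in Python. The input ordering is an assumption read off the schedule: i 1 carries d then a, and i 2 carries b then c.

```python
def hlm_v2_schedule(d, b, a, c):
    """Illustrative Python mimic of hlm_v2's four clock cycles
    (widths and type conversions of the VHDL are ignored)."""
    # cycle 1: sample the first pair of inputs
    r1, r2 = d, b
    # cycle 2: sample the second pair; first multiply
    r3, r4, r5 = a, r1 * r2, c
    # cycle 3: r2 and r5 are reused; tuple assignment mimics the
    # simultaneous (non-blocking) register updates
    r2, r5 = r3 * r1, r2 + r5
    # cycle 4: final sum
    r5 = r2 + (r4 + r5)
    return r5

# output = (a*d) + (d*b) + b + c
print(hlm_v2_schedule(d=2, b=3, a=5, c=7))  # 10 + 6 + 3 + 7 = 26
```

Despite r2 and r5 each holding two different values over the four cycles, the result still matches the functional requirement.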

Both of our models so far (hlm v1 and hlm v2) have used implicit state machines. The optimiza-
tion from hlm v1 to hlm v2 was done to reduce the number of registers by performing register
allocation. Most of the remaining optimizations require an explicit state machine. We will con-
struct an explicit state machine using a methodical procedure that gradually adds more information
to the dataflow diagram. The first step in this procedure is datapath allocation, which is similar
to register allocation, except that we allocate datapath components to datapath operations, rather
than registers to names.

To control the datapath, we need to provide the following signals for registers and datapath com-
ponents:
registers: chip-enable and mux-select signals
datapath components: instruction (e.g., add, sub, etc. for ALUs) and mux-select signals

After we determine the chip-enable, mux-select, and instruction signals, and calculate what
value each signal needs in each clock cycle, we can build the explicit state machine to control the
datapath.

After we build the state machine, we will add a reset to the design.


2.8.10 Datapath Allocation

In datapath allocation, we allocate an adder (either a1 or a2) to each addition operation and a
multiplier (either m1 or m2) to each multiplication operation. As with register allocation, we
attempt to reduce the number of multiplexers that will be required by connecting the same data-
path component to the same register in multiple clock cycles.

[Figure: dataflow diagram with register and datapath allocation. The multiplier m1 computes both multiplications (x4 in the second clock cycle, x6 in the third); adder a1 computes x7 and the final sum x8; adder a2 computes x4 + x7.]



2.8.11 Hardware Block Diagram and State Machine

To build an explicit state machine, we first determine what states we need. In this circuit, we need
four states, one for each clock cycle in the dataflow diagram. If our algorithmic description had
included control flow, such as loops and branches, it would be more difficult to determine the
states that are needed.

We will use four states: S0..S3, where S0 corresponds to the first clock cycle (during which the
input is read) and S3 corresponds to the last clock cycle.


2.8.11.1 Control for Registers

To determine the chip enable and mux select signals for the registers, we build a table where each
state corresponds to a row and each register corresponds to a column.

For each register and each state, we note whether the register loads in a new value (ce) and what
signal is the source of the loaded data (d).


                 r1          r2          r3          r4          r5
               ce    d     ce    d     ce    d     ce    d     ce    d
        S0      1   i1      1   i2      –    –      –    –      –    –
        S1      0    –      0    –      1   i1      1   m1      1   i2
        S2      –    –      1   m1      –    –      0    –      1   a1
        S3      –    –      –    –      –    –      –    –      1   a1


We now eliminate unnecessary chip enables and muxes.
• A chip enable is needed only if a register must hold its value for multiple clock cycles (some
  state has ce=0).
• A multiplexer is needed only if a register loads in values from different sources in different
  clock cycles.
The register simplifications are as follows:
r1 Chip-enable, because S1 has ce=0. No multiplexer, because i1 is the only input.
r2 Chip-enable, because S1 has ce=0. Multiplexer to choose between i2 and m1.
r3 No chip enable, no multiplexer. The register r3 simplifies to be just r3=i1 without a mul-
    tiplexer or chip-enable, because there is only one state where we care about its behaviour
    (S1) — all of the other states are don’t cares for both chip enable and mux.
r4 Chip-enable, because S2 has ce=0. No multiplexer, because m1 is the only input.
r5 No chip-enable, because we do not have any states with ce=0. Multiplexer between i2 and a1.
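These two rules can be applied mechanically. The sketch below (illustrative Python, not from the notes) encodes the ce/d table above and derives, for each register, whether a chip enable and a mux are needed.

```python
# Illustrative Python encoding of the ce/d table (None marks a don't-care
# source; states where a register is entirely don't-care are omitted).
table = {  # register: {state: (ce, data source)}
    "r1": {"S0": (1, "i1"), "S1": (0, None)},
    "r2": {"S0": (1, "i2"), "S1": (0, None), "S2": (1, "m1")},
    "r3": {"S1": (1, "i1")},
    "r4": {"S1": (1, "m1"), "S2": (0, None)},
    "r5": {"S1": (1, "i2"), "S2": (1, "a1"), "S3": (1, "a1")},
}

def needs(reg):
    """Return (chip enable needed?, mux needed?) for one register."""
    entries = table[reg].values()
    ce_needed = any(ce == 0 for ce, _ in entries)   # must hold a value
    sources = {src for ce, src in entries if ce == 1}
    return ce_needed, len(sources) > 1              # >1 distinct source

for reg in table:
    print(reg, needs(reg))
# r1 (True, False), r2 (True, True), r3 (False, False),
# r4 (True, False), r5 (False, True)
```

The output matches the case-by-case analysis above: r3 needs neither, r1 and r4 need only a chip enable, r5 needs only a mux, and r2 needs both.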

The simplified register table is shown below. For registers that do not have multiplexers, we show
their input on the top row. For registers that need neither a chip enable nor a mux (e.g. r3), we
write the assignment in the first row and leave the other rows blank.


                           r1=i1              r2        r3=i1 r4=m1        r5
                             ce        ce          d            ce         d
                      S0     1         1           i2            –          –
                      S1     0         0            –            1         i2
                      S2     –         1           m1            0         a1
                      S3     –         –            –            –         a1


The chip-enable and mux-select signals that are needed for the registers are: r1 ce, r2 ce,
r2 sel, r4 ce, and r5 sel.


2.8.11.2 Control for Datapath Components

Analogous to the table for registers, we build a table for the datapath components. Each of our
components has two inputs (src1 and src2). Each component performs a single operation (either
addition or multiplication), so we do not need to define operation or instruction signals for the
datapath components.


                                       a1              a2          m1
                                src1        src2   src1 src2   src1 src2
                           S0     –           –      –    –      –     –
                           S1     –           –      –    –     r1    r2
                           S2    r2          r5      –    –     r3    r1
                           S3    r2          a2     r4    r5     –     –


Based on the table above, the adder a1 will need a multiplexer for src2. The multiplier m1 will
need two multiplexers: one for each input.

Note that addition and multiplication are commutative, so we can choose which signal goes to
src1 and which to src2 so as to minimize the need for multiplexers.

We notice that for m1, we can reduce the number of multiplexers from two to one by swapping the
operands in S2. This makes r1 the only source of operands for the src1 input. This optimization
is reflected in the table below.
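The saving can be counted directly from the operand pairs (an illustrative Python sketch, not from the notes): a mux is needed on an input only when that input sees more than one distinct source.

```python
# Mux count for m1 before and after swapping its S2 operands
# (multiplication is commutative, so the swap is free).
def mux_count(schedule):
    """schedule: (src1, src2) operand pairs for the states where m1 is used."""
    src1 = {a for a, _ in schedule}
    src2 = {b for _, b in schedule}
    return (len(src1) > 1) + (len(src2) > 1)

before = [("r1", "r2"), ("r3", "r1")]   # S1, S2 as first tabulated
after  = [("r1", "r2"), ("r1", "r3")]   # S2 operands swapped
print(mux_count(before), mux_count(after))  # 2 1
```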


                                         a1              a2          m1
                                  src1        src2   src1 src2   src1 src2
                             S0     –           –      –    –      –     –
                             S1     –           –      –    –     r1    r2
                             S2    r2          r5      –    –     r1    r3
                             S3    r2          a2     r4    r5     –     –


The mux-select signals for the datapath components are: a1 src2 sel and m1 src2 sel.


2.8.11.3 Control for State

We need to control the transition from one state to the next. For this example, the transition is very
simple, each state transitions to its successor: S0 → S1 → S2 → S3 → S0....


2.8.11.4 Complete State Machine Table

The state machine table is shown below. Note that the state signal is a register; the table shows the
next value of the signal.


        r1 ce  r2 ce  r2 sel  r4 ce  r5 sel  a1 src2 sel  m1 src2 sel  state
   S0     1      1      i2      –       –         –             –        S1
   S1     0      0       –      1      i2         –             r2       S2
   S2     –      1      m1      0      a1         r5            r3       S3
   S3     –      –       –      –      a1         a2            –        S0


We now choose instantiations for the don’t care values so as to simplify the circuitry. Different
state encodings will lead to different simplifications. For fully-encoded states, Karnaugh maps are
helpful in doing simplifications. For a one-hot state encoding, it is usually better to create situations
where conditions are based upon a single state. The reason for this heuristic with one-hot encodings
will be clear when we get to explicit v2.


r1 ce We first choose 0 as the don’t care instantiation, because that leaves just one state where
    we need to load. Additionally, it is conceptually cleaner to do an assignment in just the one
    clock cycle where we care about the value, rather than not do an assignment in the one clock
    cycle where we must hold the value. (At the end of the don’t care allocation, we’ll revisit
    this decision and change our mind.)
r2 ce We choose 1 for S3, so that we have just one state where we do not do a load. If we
    had chosen 0 for r2 ce in S3, we would have two states where we do a load and two where
    we do not load. If we were using fully-encoded states, this even separation might have left
    us with a very nice Karnaugh map; or it might have left us with a Karnaugh map that has a
    checkerboard pattern, which would not simplify. This helps illustrate why state encoding is
    a difficult problem.
r2 sel We choose m1 arbitrarily. The choice of i2 would have also resulted in three assign-
    ments from one signal and one assignment from the other signal.
r4 ce We choose 0 as we did for r1 ce.
r5 sel Choose a1 so that we have three assignments from the same signal and just one assign-
    ment from the other signal.
a1 src2 Choose a2 arbitrarily.
m1 src2 Choose r3 arbitrarily.
r1 ce (again) We examine r1 ce and r2 ce and see that if we choose 1 for the don’t care
    instantiation of r1 ce, we will have the same choices for both chip enables. This will
    simplify our state machine. Also, r4 ce is the negation of r2 ce, so we can use just an
    inverter to control r4 ce.

      r1 ce r2 ce r2 sel r4 ce                r5 sel a1 src2 sel m1 src2 sel state
 S0     1     1     i2     0                    a1        a2          r3      S1
 S1     0     0     m1     1                    i2        a2          r2      S2
 S2     1     1     m1     0                    a1        r5          r3      S3
 S3     1     1     m1     0                    a1        a2          r3      S0
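The relationships claimed for the chip enables can be spot-checked in a few lines of Python (illustrative only, transcribing the completed table above).

```python
# After the don't-care choices, r1_ce matches r2_ce in every state, and
# r4_ce is the complement of r2_ce, so an inverter suffices for r4_ce.
r1_ce = {"S0": 1, "S1": 0, "S2": 1, "S3": 1}
r2_ce = {"S0": 1, "S1": 0, "S2": 1, "S3": 1}
r4_ce = {"S0": 0, "S1": 1, "S2": 0, "S3": 0}

assert all(r1_ce[s] == r2_ce[s] for s in r2_ce)
assert all(r4_ce[s] == 1 - r2_ce[s] for s in r2_ce)
print("r1_ce = r2_ce and r4_ce = not r2_ce in every state")
```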


2.8.12 VHDL Code with Explicit State Machine

VHDL code can be written directly from the tables and the dataflow diagram that shows register
allocation, input allocation, and datapath allocation. As a simplification, rather than write explicit
signals for the chip-enable and mux-select signals, we use select and conditional assignment state-
ments that test the state in the condition.

We chose a one-hot encoding of the state, which usually results in small and fast hardware for state
machines with sixteen or fewer states.


architecture explicit_v1 of vanier is
  signal r_1, r_2, r_3, r_4, r_5 : unsigned(15 downto 0);
  signal a_1, a_2, m_1, a1_src2, m1_src2 : unsigned(15 downto 0);
  subtype state_ty is std_logic_vector(3 downto 0);
  constant S0 : state_ty := "0001";
  constant S1 : state_ty := "0010";
  constant S2 : state_ty := "0100";
  constant S3 : state_ty := "1000";
  signal state : state_ty;
begin
  ----------------------
  -- r_1
  process (clk) begin
    if rising_edge(clk) then
      if state /= S1 then
        r_1 <= unsigned(i_1);
      end if;
    end if;
  end process;
  ----------------------
  -- r_2
  process (clk) begin
    if rising_edge(clk) then
      if state /= S1 then
        if state = S0 then
          r_2 <= unsigned(i_2);
        else
          r_2 <= m_1;
        end if;
      end if;
    end if;
  end process;
  ----------------------
  -- r_3
  process (clk) begin
    if rising_edge(clk) then
      r_3 <= unsigned(i_1);
    end if;
  end process;
  ----------------------
  -- r_4
  process (clk) begin
    if rising_edge(clk) then
      if state = S1 then
        r_4 <= m_1;
      end if;
    end if;
  end process;
  ----------------------
  -- r_5
  process (clk) begin
    if rising_edge(clk) then
      if state = S1 then
        r_5 <= unsigned(i_2);
      else
        r_5 <= a_1;
      end if;
    end if;
  end process;
  ----------------------
  -- combinational datapath
  with state select
    a1_src2 <= r_5 when S2,
               a_2 when others;
  with state select
    m1_src2 <= r_2 when S1,
               r_3 when others;
  a_1 <= r_2 + a1_src2;
  a_2 <= r_4 + r_5;
  m_1 <= r_1(7 downto 0) * m1_src2(7 downto 0);
  o_1 <= std_logic_vector(r_5);
  ----------------------
  -- state machine
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= S0;
      else
        case state is
          when S0     => state <= S1;
          when S1     => state <= S2;
          when S2     => state <= S3;
          when S3     => state <= S0;
          when others => state <= S0;
        end case;
      end if;
    end if;
  end process;
  ----------------------
end explicit_v1;

The hardware-block diagram that corresponds to the tables and VHDL code is:


[Figure: hardware block diagram built from the tables. Registers r1 to r5, with muxes on the inputs of r2 and r5; multiplier m1 with a mux on src2; adders a1 (mux on src2) and a2; output o1 driven by r5. The dataflow diagram, annotated with states S0 to S3, is shown alongside.]



2.8.13 Peephole Optimizations

We will illustrate several peephole optimizations that take advantage of our state encoding.
-- r_1
process (clk) begin
  if rising_edge(clk) then
    if state /= S1 then
      r_1 <= i_1;
    end if;
  end if;
end process;

-- r_1 (optimized)
process (clk) begin
  if rising_edge(clk) then
    if state(1) = '0' then
      r_1 <= i_1;
    end if;
  end if;
end process;
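The optimization is sound because, in a one-hot encoding, each bit of the state vector is set in exactly one state; testing one bit therefore replaces the full 4-bit comparison. An illustrative Python check (not from the notes):

```python
# With the one-hot constants used here, bit 1 is set in S1 and in no
# other state, so "state(1) = '0'" is equivalent to "state /= S1".
S0, S1, S2, S3 = 0b0001, 0b0010, 0b0100, 0b1000
for state in (S0, S1, S2, S3):
    assert (((state >> 1) & 1) == 0) == (state != S1)
print("state(1) = '0'  is equivalent to  state /= S1")
```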


Analogous optimizations can be used when comparing against multiple states:

-- r_2
process (clk) begin
  if rising_edge(clk) then
    if state /= S1 then
      if state = S0 then
        r_2 <= i_2;
      else
        r_2 <= m_1;
      end if;
    end if;
  end if;
end process;

-- r_2 (optimized)
process (clk) begin
  if rising_edge(clk) then
    if state(1) = '0' then
      if state(0) = '1' then
        r_2 <= i_2;
      else
        r_2 <= m_1;
      end if;
    end if;
  end if;
end process;


Next-state assignment for a one-hot state machine can be done with a simple shift register:
-- state machine
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      state <= S0;
    else
      case state is
        when S0     => state <= S1;
        when S1     => state <= S2;
        when S2     => state <= S3;
        when S3     => state <= S0;
        when others => state <= S0;
      end case;
    end if;
  end if;
end process;

-- state machine (optimized)
-- NOTE: "st" = "state"
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      st <= S0;
    else
      for i in 0 to 3 loop
        st( (i+1) mod 4 ) <= st( i );
      end loop;
    end if;
  end if;
end process;
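The shift-register formulation works because rotating a one-hot vector left by one bit advances the machine to the next state. A quick Python model (illustrative only):

```python
# Circular left shift of the one-hot vector steps S0 -> S1 -> S2 -> S3 -> S0.
def next_state(st):
    """st: four bits, st[0] is the S0 bit; mimics st((i+1) mod 4) <= st(i)."""
    return [st[(i - 1) % 4] for i in range(4)]

st = [1, 0, 0, 0]        # reset state S0 = "0001"
visited = []
for _ in range(5):
    visited.append(st.index(1))  # index of the one hot bit
    st = next_state(st)
print(visited)  # [0, 1, 2, 3, 0]
```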


    The resulting optimized code (explicit v2) is shown below.

architecture explicit_v2 of vanier is
  signal r_1, r_2, r_3, r_4, r_5 : unsigned(15 downto 0);
  signal a_1, a_2, m_1, a1_src2, m1_src2 : unsigned(15 downto 0);
  subtype state_ty is std_logic_vector(3 downto 0);
  constant S0 : state_ty := "0001"; constant S1 : state_ty := "0010";
  constant S2 : state_ty := "0100"; constant S3 : state_ty := "1000";
  signal state : state_ty;
begin
  ----------------------
  -- r_1
  process (clk) begin
    if rising_edge(clk) then
      if state(1) = '0' then
        r_1 <= unsigned(i_1);
      end if;
    end if;
  end process;
  ----------------------
  -- r_2
  process (clk) begin
    if rising_edge(clk) then
      if state(1) = '0' then
        if state(0) = '1' then
          r_2 <= unsigned(i_2);
        else
          r_2 <= m_1;
        end if;
      end if;
    end if;
  end process;
  ----------------------
  -- r_3
  process (clk) begin
    if rising_edge(clk) then
      r_3 <= unsigned(i_1);
    end if;
  end process;
  ----------------------
  -- r_4
  process (clk) begin
    if rising_edge(clk) then
      if state(1) = '1' then
        r_4 <= m_1;
      end if;
    end if;
  end process;
  ----------------------
  -- r_5
  process (clk) begin
    if rising_edge(clk) then
      if state(1) = '1' then
        r_5 <= unsigned(i_2);
      else
        r_5 <= a_1;
      end if;
    end if;
  end process;
  ----------------------
  -- combinational datapath
  a1_src2 <=   r_5 when state(2) = '1'
          else a_2;
  m1_src2 <=   r_2 when state(1) = '1'
          else r_3;
  a_1 <= r_2 + a1_src2;
  a_2 <= r_4 + r_5;
  m_1 <= r_1(7 downto 0) * m1_src2(7 downto 0);
  o_1 <= std_logic_vector(r_5);
  ----------------------
  -- state machine
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= S0;
      else
        for i in 0 to 3 loop
          state( (i+1) mod 4 ) <= state(i);
        end loop;
      end if;
    end if;
  end process;
  ----------------------
end explicit_v2;


2.8.14 Notes and Observations

Our functional requirements were written as:

output = (a × d) + (d × b) + b + c

Alternatively, we could have achieved exactly the same functionality with the functional require-
ments written as (the two statements are mathematically equivalent):

output = (a × d) + b + (d × b) + c
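The equivalence is pure commutativity and associativity of addition, which a spot check confirms (illustrative Python, not from the notes):

```python
# The two formulations always compute the same value; only the shape of
# the data dependency graph differs.
for (a, b, c, d) in [(1, 2, 3, 4), (9, 8, 7, 6), (0, 5, 0, 5)]:
    assert (a*d) + (d*b) + b + c == (a*d) + b + (d*b) + c
print("formulations agree")
```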

The naive data dependency graph for the alternative formulation is much messier than the data
dependency graph for the original formulation:

[Data dependency graphs for the two formulations: the original, (a × d) + (d × b) + b + c, and the messier alternative, (a × d) + c + (d × b) + b.]
An observation: it can be helpful to explore several equivalent formulations of the mathematical
equations while constructing the data dependency graph. A mathematical formulation that places
occurrences of the same identifier close to each other often results in a simpler data dependency
graph. The simpler the data dependency graph, the easier it will be to identify helpful optimizations
and efficient schedules.


2.9 Pipelining
Pipelining is one of the most common and most effective performance optimizations in hardware.
Pipelining is used in systems ranging from simple signal-processing filters to high-performance
microprocessors. Pipelining increases performance by overlapping the execution of multiple
instructions or parcels of data, analogous to the way that multiple cars flow through an automobile
assembly line.
Pipelines are difficult to design and verify, because subtle bugs can arise from the interactions
between instructions flowing through the pipeline. There are intended interactions, which must
happen correctly, and there might be unintended interactions, which constitute bugs. Computer
architects categorize the interactions between instructions into three classes: structural hazards,
control hazards, and data hazards. Our examples will all be pure datapath pipelines without any
data or control dependencies between parcels of data, which eliminates most of the complexities
of implementing pipelines correctly.


2.9.1 Introduction to Pipelining

Review of unpipelined dataflow diagram                             .............................................. .


As a quick review of an unpipelined (also called “sequential”) dataflow diagram, we revisit the
one-add example from section 2.6.3.
[Unpipelined dataflow diagram: registers r1 and r2 and adder add1 are reused in each of clock
cycles 0 through 5, consuming the additional inputs c, d, e, and f along the way. The accompanying
waveform shows parcel α occupying r1 for five consecutive cycles before parcel β enters, with z
produced once per parcel.]

The key feature to notice, in comparison to a pipelined dataflow diagram, is that the second parcel
(β) begins execution only after the first parcel (α) has finished executing.


Pipelined dataflow diagram                          ............................................................


In a pipeline, each stage is a separate circuit, in that we cannot reuse the same component in
multiple stages. When drawing a pipelined dataflow diagram, we effectively have multiple dataflow
diagrams: one for each stage. As a notational shorthand to avoid drawing multiple dataflow
diagrams, we introduce a new bit of notation: a double line denotes a boundary between stages.
We perform scheduling, resource allocation, and all of the other design steps individually for each
stage.

Our first example of a pipelined dataflow diagram is a fully pipelined version of the previous
example. In a fully pipelined dataflow diagram, each clock cycle becomes a separate stage. Notationally,
we simply replace the single-line clock cycle boundaries with double-line stage boundaries.
[Fully pipelined dataflow diagram: five stages, one per clock cycle, with registers r1/r2, r3/r4,
r5/r6, r7/r8, and r9/r10 and adders add1 through add5. The waveform shows parcels α and β
advancing one stage per cycle, with β exactly one cycle behind α.]



Sequential (Unpipelined) Hardware                                                                        ....................................................


The hardware for the unpipelined dataflow diagram contains two registers, one adder, a multiplexer,
and a state machine to control the multiplexer. When data is produced by the adder at the end
of each clock cycle, it is fed back through the multiplexer as an input value for the next clock cycle.
[Hardware diagram: the unpipelined circuit; a five-state machine (State(0) through State(4),
initialized by reset) controls a multiplexer that selects between the external input i1 and the
adder's result feeding register r1, with i2 feeding r2 and add1 driving the output o1.]




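The notes give VHDL code only for the pipelined circuit. As a rough sketch (all signal names and the one-hot state encoding here are assumptions, not taken from the notes), the unpipelined hardware could be coded as:

```vhdl
-- Hypothetical sketch of the unpipelined hardware: one adder, two
-- registers, a multiplexer on r1's input, and a five-state machine.
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      state <= "00001";
    else
      -- rotate the one-hot state: State(0), State(1), ..., State(4)
      state <= state(3 downto 0) & state(4);
    end if;
  end if;
end process;

process (clk) begin
  if rising_edge(clk) then
    if state(0) = '1' then
      r1 <= i1;        -- first cycle: load the first external input
    else
      r1 <= r1 + r2;   -- later cycles: recirculate add1's result
    end if;
    r2 <= i2;          -- the environment supplies a new operand each cycle
  end if;
end process;

o1 <= r1 + r2;         -- combinational output from add1
```

The multiplexer in the diagram corresponds to the if/else on r1's input; the state machine exists only because the adder is reused across clock cycles.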
Pipelined Hardware and VHDL Code                                                                           ..................................................


The hardware for the pipelined dataflow diagram contains two registers and one adder for each
stage. The registers and adders do the same thing in each clock cycle, so there is no need for
chip-enables, multiplexers, or a state machine.
[Hardware diagram: the fully pipelined circuit; stages 1 through 5 each contain two registers and
one adder (r1/r2 and add1, r3/r4 and add2, r5/r6 and add3, r7/r8 and add4, r9/r10 and add5), with
additional inputs i3 through i6 entering at successive stages and o1 produced combinationally
from the final adder.]

-- stage 1
process begin
  wait until rising_edge(clk);
  r1 <= i1;        r2 <= i2;
end process;
-- stage 2
process begin
  wait until rising_edge(clk);
  r3 <= r1 + r2;   r4 <= i3;
end process;
-- stage 3
process begin
  wait until rising_edge(clk);
  r5 <= r3 + r4;   r6 <= i4;
end process;
-- stage 4
process begin
  wait until rising_edge(clk);
  r7 <= r5 + r6;   r8 <= i5;
end process;
-- stage 5
process begin
  wait until rising_edge(clk);
  r9 <= r7 + r8;   r10 <= i6;
end process;
-- output
o1 <= r9 + r10;
The VHDL code above is designed to be easy to read by matching the structure of the hardware.
An alternative style is to be more concise by grouping all of the registered assignments in a single
clocked process as shown below. The two styles are equivalent with respect to simulation and
synthesis.

-- group all registered assignments into a single process
process begin
  wait until rising_edge(clk);
  r1 <= i1;        r2 <= i2;
  r3 <= r1 + r2;   r4 <= i3;
  r5 <= r3 + r4;   r6 <= i4;
  r7 <= r5 + r6;   r8 <= i5;
  r9 <= r7 + r8;   r10 <= i6;
end process;
o1 <= r9 + r10;


2.9.2 Partially Pipelined

The previous section illustrated a fully pipelined circuit, which means that the circuit can accept
a new parcel every clock cycle. Sometimes we want to sacrifice performance (throughput) to
reduce area. We can do this by reusing some hardware, which gives a throughput of less than one
parcel per clock cycle. A pipeline with a throughput of less than one is said to be partially
pipelined.

If a pipeline is essentially two pipelines running in parallel, then it is said to be superscalar and
will usually have a throughput that is more than one parcel per clock cycle. A superscalar pipeline
that has n pipelines in parallel is said to be n-way superscalar and has a maximum throughput of n
parcels per clock cycle.
[Partially pipelined dataflow diagram: three stages, each spanning two clock cycles. Stage 1
reuses add1 to add c and then d to the sum of a and b; stage 2 reuses add2 to add e and then f;
stage 3 uses add3 to produce z. The waveform shows each parcel spending two cycles per stage, so
β enters two cycles behind α and the throughput is one parcel every two clock cycles.]



Hardware for Partially Pipelined                         . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..


[Hardware diagram: the partially pipelined circuit; a two-state machine (State(0) and State(1),
initialized by reset) controls multiplexers in stages 1 and 2, which reuse add1 and add2
respectively for two clock cycles per parcel, while stage 3 contains add3 driving the output o1.
Registers r1/r2, r3/r4, and r5/r6 separate the stages.]




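The notes give only the hardware diagram for the partially pipelined circuit. A rough VHDL sketch of the state machine and stage 1 (signal names, the state encoding, and the input sequencing are all illustrative assumptions) might look like:

```vhdl
-- Hypothetical sketch: a two-state machine shared by all stages, and
-- stage 1, whose adder is reused in both cycles of the stage.
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      state <= "01";
    else
      state <= state(0) & state(1);  -- alternate between the two states
    end if;
  end if;
end process;

process (clk) begin
  if rising_edge(clk) then
    if state(0) = '1' then
      r1 <= i1;        -- first cycle of the stage: start a new parcel
    else
      r1 <= r1 + r2;   -- second cycle: recirculate the partial sum
    end if;
    r2 <= i2;          -- the environment supplies the next operand
  end if;
end process;
```

Stage 2 would follow the same pattern with its own registers and adder; stage 3 needs no multiplexer because its adder is used only once per parcel.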
2.9.3 Terminology

  Definition Depth: The depth of a pipeline is the number of stages on the longest path
    through the pipeline.


  Definition Latency: The latency of a pipeline is measured the same as for an
    unpipelined circuit: the number of clock cycles from inputs to outputs.


  Definition Throughput: The number of parcels consumed or produced per clock cycle.


  Definition Upstream/downstream: Because parcels flow through the pipeline
    analogously to water in a stream, the terms upstream and downstream are used
    respectively to refer to earlier and later stages in the pipeline. For example, stage1 is
    upstream from stage2.


  Definition Bubble: When a pipe stage is empty (contains invalid data), it is said to
    contain a “bubble”.


   Question:     How do we know whether the output of the pipeline is a bubble or is valid
     data?


   Answer:
      Add one register per stage to hold a valid bit. If valid = '0', then the pipe stage
     contains a bubble.



2.10 Design Example: Pipelined Massey
In this section, we revisit the Massey example from section 2.7, but now do it with a pipelined
implementation. To allow us to implement a pipelined design, we need to relax our resource
requirements. Originally, we were allowed two adders and three inputs. For the pipeline, we will
allow ourselves six inputs and five adders. There are six input values and five additions in the
dataflow diagram, so these requirements will enable us to build a fully pipelined implementation.
If we were forced to reuse a component (e.g., a maximum of two adders), then we would need to
build a partially pipelined circuit.

To stay within the normal design rules for pipelines, we will register our inputs but not our outputs.

In summary, the requirements are:


Requirements      .........................................................................

  Functional requirements:
     • Compute the sum of six 8-bit numbers: output = a + b + c + d + e + f
     • Registered inputs, combinational outputs
  Performance requirements:
     • Maximum clock period: unlimited
     • Maximum latency: four
  Cost requirements:
     • Maximum of five adders
     • Small miscellaneous hardware (e.g. muxes) is unlimited
     • Maximum of six inputs and one output
     • Design effort is unlimited


Initial Dataflow Diagrams             ............................................................ .


Our goal is to first maximize performance and then minimize area within the bounds of the require-
ments. To maximize performance, we want a throughput of one and a minimum clock period.
Revisiting the dataflow diagrams from the unpipelined Massey, we find the two diagrams below to be
promising candidates for the pipelined Massey.

[Dataflow diagrams: the original dataflow, a balanced tree that adds a+b and c+d in the first
clock cycle, then sums those results and adds e+f, then produces the final sum z; and the final
unpipelined dataflow, a chain that accumulates the inputs a through f over successive clock
cycles, reading at most three inputs per cycle.]

For the unpipelined design, we rejected the original dataflow diagram because it violated the
resource requirement of a maximum of three inputs. If we fully pipeline the design, both dataflow
diagrams will use six inputs and five adders. The first diagram uses ten registers, while the second
uses eight (remember, there is no reuse of components in a fully pipelined design). However, the
first dataflow diagram has a shorter clock period, and so will lead to higher performance. Because
our primary goal is to maximize performance, we will pursue the first dataflow diagram.


Dataflow Diagram Exploration              . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..


As a variation of the first dataflow diagram, we reschedule all of the inputs to be read in the first
clock cycle.


[Variation on the original dataflow: all six inputs a through f are read in the first clock cycle
and feed three adders (a+b, c+d, e+f), followed by two further levels of addition producing z.]

The variation has the disadvantage of using one additional register. However, it has the potential
advantage of a simpler interface to the upstream environment, because all of the inputs are
provided at the same time. Conversely, this rescheduling would be a disadvantage if the upstream
environment were optimized to take advantage of the fact that e and f are read one clock cycle
later than the other values. We do not know anything about the upstream environment, and so will
reject this variation, because it increases the number of registers that we need.

As we said before, to maximize performance, we will fully pipeline the design, so every clock cycle
boundary becomes a stage boundary. At this time, we also add a valid bit to keep track of whether
a stage has a bubble or a valid parcel.

Pipelined dataflow diagram
[Pipelined dataflow diagram: the original dataflow with double-line stage boundaries and a valid
bit; i_valid enters with a, b, c, and d, travels alongside the parcel through the three stages,
and emerges as o_valid with z.]




VHDL Code          ......................................................................... .


For this simple example, there are no further optimizations, and we can write the VHDL code directly
from the dataflow diagram.


-- stage 1
process begin
  wait until rising_edge(clk);
  r1 <= i1; r2 <= i2; r3 <= i3;   r4 <= i4;   v1 <= i_valid;
end process;
a1 <= r1 + r2; a2 <= r3 + r4;
-- stage 2
process begin
  wait until rising_edge(clk);
  r5 <= a1; r6 <= a2; r7 <= i5;   r8 <= i6;   v2 <= v1;
end process;
a3 <= r5 + r6; a4 <= r7 + r8;
-- stage 3
process begin
  wait until rising_edge(clk);
  r9 <= a3; r10 <= a4;                        v3 <= v2;
end process;
a5 <= r9 + r10;
-- outputs
z       <= a5;
o_valid <= v3;


2.11 Memory Arrays and RTL Design
2.11.1 Memory Operations

Read of Memory with Registered Inputs                    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..

[Hardware and waveform: a memory M with registered inputs (ports WE, A, DI, DO). For a read,
address αa is presented on a before the clock edge; M(αa) holds αd, and do delivers αd in the
following clock cycle.]




Write to Memory with Registered Inputs                       ...............................................
[Hardware and waveform: a write with registered inputs. With we asserted, address αa on a and
data αd on di are captured at the clock edge; M(αa) becomes αd in the next cycle, while do is
unknown (U).]




Dual-Port Memory with Registered Inputs                       ............................................ .
[Hardware and waveform: a dual-ported memory with registered inputs (ports WE, A0, DI0, DO0, A1,
DO1). Port 0 writes αd to address αa while port 1 reads address βa; in the following cycle M(αa)
holds αd and do1 delivers βd, while do0 is unknown (U).]
2.11.2 Memory Arrays in VHDL                                                                 193


Sequence of Memory Operations        .......................................................
[Waveform: a sequence of memory operations. Port 0 writes αd to αa and then γd2 to γa, while
port 1 reads βa and then θa. A read of an address in the same cycle that it is written returns
the old data: do0 delivers γd1, the value at γa before γd2 is written.]




2.11.2 Memory Arrays in VHDL

2.11.2.1 Using a Two-Dimensional Array for Memory

A memory array can be written in VHDL as a two-dimensional array:

  subtype data is std_logic_vector(7 downto 0);
  type data_vector is array( natural range <> ) of data;
  signal mem : data_vector(31 downto 0);
These two-dimensional arrays can be useful in high-level models and in specifications. However,
it is possible to write code using a two-dimensional array that cannot be synthesized. Also, some
synthesis tools (including Synopsys Design Compiler and FPGA Compiler) will synthesize two-
dimensional arrays very inefficiently.

The example below illustrates four constructs that prevent synthesis as a memory block: the lack
of an interface protocol, a combinational write, multiple write ports, and multiple read ports.

architecture main of mem_not_hw is
  subtype data is std_logic_vector(7 downto 0);
  type data_vector is array( natural range <> ) of data;
  signal mem : data_vector(31 downto 0);
begin
  y <= mem( a );               -- read port #1
  mem( a ) <= b;               -- combinational write
  process (clk) begin
    if rising_edge(clk) then
      mem( c ) <= w;           -- write port #1
    end if;
  end process;
  process (clk) begin
    if rising_edge(clk) then
      mem( d ) <= v;           -- write port #2
    end if;
  end process;
  u <= mem( e );               -- read port #2
end main;


2.11.2.2 Memory Arrays in Hardware

Most simple memory arrays are single- or dual-ported, support just one write operation at a time,
and have an interface protocol using a clock and write-enable.

[Diagrams: a single-ported memory with ports WE, A, DI, and DO, and a dual-ported memory with
ports WE, DI0, A0, DO0, A1, and DO1.]


2.11.2.3 VHDL Code for Single-Port Memory Array

library ieee;
use ieee.std_logic_1164.all;

package mem_pkg is
  subtype data is std_logic_vector(7 downto 0);
  type data_vector is array( natural range <> ) of data;
end;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.mem_pkg.all;

entity mem is
  port (
    clk : in std_logic;
    we   : in std_logic;               --               write enable
    a    : in unsigned(4 downto 0);    --               address
    di   : in data;                    --               data_in
    do   : out data                    --               data_out
  );
end mem;
architecture main of mem is
  signal mem : data_vector(31 downto 0);
begin
  do <= mem( to_integer( a ) );
  process (clk) begin
    if rising_edge(clk) then
      if we = '1' then
        mem( to_integer( a ) ) <= di;
      end if;
    end if;
  end process;
end main;
The VHDL code above is accurate in its behaviour and interface, but might be synthesized as
distributed memory (a large number of flip flops in FPGA cells), which will be very large and very
slow in comparison to a block of memory.

Synopsys synthesis tools implement each bit in a two-dimensional array as a flip-flop.

Each FPGA and ASIC vendor supplies libraries of memory arrays that are smaller and faster than a
two-dimensional array of flip flops. These libraries exploit specialized hardware on the chips to
implement the memory.

         Note:     To synthesize a reasonable implementation of a memory array with
         Synopsys, you must instantiate a vendor-supplied memory component.

Some other synthesis tools, such as Xilinx XST, can infer memory arrays from two-dimensional
arrays and synthesize efficient implementations.


Recommended Design Process with Memory                                       .......................................... .


   1. high-level model with two-dimensional array
   2. two-dimensional array packaged inside memory entity/architecture
   3. vendor-supplied component


2.11.2.4 Using Library Components for Memory

Altera    ............................................................................... .


Altera uses “MegaFunctions” to implement RAM in VHDL. A MegaFunction is a black-box description
of hardware on the FPGA. There are tools in Quartus to generate VHDL code for RAM
components of different sizes. In E&CE 327 we will provide you with the VHDL code for the
RAM components that you will need in Lab-3 and the Project.

The APEX20KE chips that we are using have dedicated SRAM blocks called Embedded System
Blocks (ESB). Each ESB can store 2048 bits and can be configured in any of the following sizes:


                                         Number of Elements Word Size (bits)
                                                      2048                 1
                                                      1024                 2
                                                       512                 4
                                                       256                 8
                                                       128               16


Xilinx   . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . ..


Use component instantiation to get these components:

 ram16x1s 16 × 1 single ported memory
 ram16x1d 16 × 1 dual-ported memory
Other sizes are also available, consult the datasheet for your chip.


2.11.2.5 Build Memory from Slices

If the vendor’s libraries of memory components do not include one that is the correct size for your
needs, you can construct your own component from smaller ones.


[Figure omitted: two N×W memory components share WriteEn, Addr, and Clk; DataIn[W-1..0] and
DataIn[2W-1..W] feed their DI ports, and their DO ports drive DataOut[W-1..0] and
DataOut[2W-1..W].]

                   Figure 2.7: An N×2W memory from N×W components


[Figure omitted: the high address bit Addr[logN] gates WriteEn to select which of two N×W
components is written; both components are addressed by Addr[logN-1..0] and share DataIn and
Clk, and a 2-to-1 multiplexer controlled by Addr[logN] selects which DO drives DataOut.]

                    Figure 2.8: A 2N×W memory from N×W components


A 16×4 Memory from 16×1 Components   ............................................. .

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity ram16x4s is
  port (
    clk, we : in std_logic;
    data_in : in std_logic_vector(3 downto 0);
    addr     : in unsigned(3 downto 0);
    data_out : out std_logic_vector(3 downto 0)
  );
end ram16x4s;


architecture main of ram16x4s is
  component ram16x1s
    port (d               : in std_logic;   -- data in
          a3, a2, a1, a0 : in std_logic;    -- address
          we              : in std_logic;   -- write enable
          wclk            : in std_logic;   -- write clock
          o               : out std_logic   -- data out
     );
  end component;
begin
  mem_gen:
  for i in 0 to 3 generate
    ram : ram16x1s
      port map (
        we   => we,
        wclk => clk,
        a3 => addr(3),    a2 => addr(2),
        a1 => addr(1),    a0 => addr(0),
        ----------------------------------------------
        -- d and o are dependent on i
        d => data_in(i),
        o => data_out(i)
        ----------------------------------------------
      );
   end generate;
end main;
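The generate loop wires bit i of data_in and data_out to slice i, while all four slices share the address and write enable. As a quick sanity check of the bit-slicing idea, here is a small behavioral model in Python (a sketch only, not the course's VHDL; the class names are invented, and it models behavior, not timing):

```python
# Behavioral model of building a 16x4 memory from four 16x1 slices.
# Each slice stores one bit position; all slices share the address and
# write enable, mirroring the VHDL generate loop above.

class Ram16x1:
    """One bit-slice: 16 words of 1 bit."""
    def __init__(self):
        self.bits = [0] * 16

    def access(self, we, addr, d):
        if we:
            self.bits[addr] = d
        return self.bits[addr]

class Ram16x4:
    """Four bit-slices in parallel form a 16x4 memory."""
    def __init__(self):
        self.slices = [Ram16x1() for _ in range(4)]

    def access(self, we, addr, data):
        out = 0
        for i, s in enumerate(self.slices):
            # Slice i sees only bit i of the data bus.
            bit = s.access(we, addr, (data >> i) & 1)
            out |= bit << i
        return out

ram = Ram16x4()
ram.access(True, 5, 0b1010)     # write 10 to address 5
print(ram.access(False, 5, 0))  # read back -> 10
```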
2.11.3 Data Dependencies                                                                      199


2.11.2.6 Dual-Ported Memory

Dual-ported memory is similar to single-ported memory, except that it allows two simultaneous
reads, or a simultaneous read and write.

When doing a simultaneous read and write to the same address, the read will usually not see the
data currently being written.


  Question:     Why do dual-ported memories usually not support writes on both ports?


  Answer:
     What should your memory do if you write different values to the same
    address in the same clock cycle?
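A behavioral sketch of the read-during-write behavior described above, in Python rather than VHDL (the names are invented; real devices vary, so consult your vendor's datasheet for the actual read-during-write behavior):

```python
# Behavioral sketch of a dual-ported memory (one write port, two read
# ports). A read in the same cycle as a write to the same address
# returns the OLD data, modeling read-before-write behavior.
class DualPortRam:
    def __init__(self, depth):
        self.mem = [0] * depth

    def cycle(self, we, waddr, wdata, raddr_a, raddr_b):
        # Reads sample the array before the write takes effect.
        out_a = self.mem[raddr_a]
        out_b = self.mem[raddr_b]
        if we:
            self.mem[waddr] = wdata
        return out_a, out_b

ram = DualPortRam(16)
ram.cycle(True, 3, 99, 0, 0)         # write 99 to address 3
a, b = ram.cycle(True, 3, 77, 3, 3)  # read addr 3 while writing 77
print(a, b)                          # both reads see the old value: 99 99
print(ram.mem[3])                    # after the cycle, 77 is stored
```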


2.11.3 Data Dependencies

Definition of Three Types of Dependencies       ..............................................


There are three types of data dependencies. The names come from pipeline terminology in computer
architecture.


                  M[i] := ...        M[i] := ...           ...  := M[i]
                  ...  := ...        ...  := ...           ...  := ...
                  ...  := M[i]      M[i]  := ...          M[i]  := ...
                Read after Write   Write after Write    Write after Read
               (True dependency)  (Output dependency)   (Anti dependency)


Instructions in a program can be reordered, so long as the data dependencies are preserved.
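The reordering rule can be stated compactly: two operations must keep their relative order whenever they access the same location and at least one of them is a write. A small Python sketch of that check (function names are invented):

```python
# An op is (kind, address) with kind 'R' or 'W'. Two ops conflict (must
# keep their original order) when they touch the same address and at
# least one is a write: this covers RAW, WAW, and WAR.
def conflict(op1, op2):
    (k1, a1), (k2, a2) = op1, op2
    return a1 == a2 and 'W' in (k1, k2)

def reorder_is_valid(ops, perm):
    """ops: list of ('R'|'W', addr); perm[i]: new position of op i."""
    for i in range(len(ops)):
        for j in range(i + 1, len(ops)):
            # A conflicting pair must stay in its original order.
            if conflict(ops[i], ops[j]) and perm[i] > perm[j]:
                return False
    return True

prog = [('W', 2), ('W', 3), ('R', 2), ('R', 0)]
# Swapping the two reads of different addresses is fine:
print(reorder_is_valid(prog, [0, 1, 3, 2]))  # True
# Moving the read of M[2] before the write of M[2] breaks a RAW:
print(reorder_is_valid(prog, [1, 2, 0, 3]))  # False
```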


Purpose of Dependencies      ............................................................. .


                W0:   R3  := ......
                W1:   R3  := ......        (producer)
                R1:   ... := ... R3 ...    (consumer)
                W2:   R3  := ......

    WAW ordering prevents W0 from happening after W1.
    RAW ordering prevents R1 from happening before W1.
    WAR ordering prevents W2 from happening before R1.

Each of the three types of memory dependencies (RAW, WAW, and WAR) serves a specific purpose
in ensuring that producer-consumer relationships are preserved.


Ordering of Memory Operations         .......................................................

                   Initial values:  M[3]=30  M[2]=20  M[1]=10  M[0]=0

      Initial Program            Valid Modification     Valid (or Bad?) Modification
      M[2] := 21                 M[2] := 21             M[2] := 21
      M[3] := 31                 B    := M[0]           B    := M[0]
      A    := M[2]               A    := M[2]           A    := M[2]
      B    := M[0]               M[3] := 31             M[3] := 31
      M[3] := 32                 M[3] := 32             C    := M[3]
      M[0] := 01                 M[0] := 01             M[3] := 32
      C    := M[3]               C    := M[3]           M[0] := 01


  Answer:
     Bad modification: M[3] := 32 must happen before C := M[3].
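One way to convince yourself is to execute the three sequences and compare the final state. A small Python interpreter for these straight-line programs (a sketch; statement encoding is invented):

```python
# Executes a straight-line sequence of memory reads and writes and
# returns the final state. Statements are ('w', addr, value) for a
# write and ('r', var, addr) for a read into a register variable.
def run(program, init):
    mem = dict(init)
    regs = {}
    for stmt in program:
        if stmt[0] == 'w':
            _, addr, val = stmt
            mem[addr] = val
        else:
            _, var, addr = stmt
            regs[var] = mem[addr]
    return mem, regs

init = {0: 0, 1: 10, 2: 20, 3: 30}
original = [('w', 2, 21), ('w', 3, 31), ('r', 'A', 2), ('r', 'B', 0),
            ('w', 3, 32), ('w', 0, 1), ('r', 'C', 3)]
valid = [('w', 2, 21), ('r', 'B', 0), ('r', 'A', 2), ('w', 3, 31),
         ('w', 3, 32), ('w', 0, 1), ('r', 'C', 3)]
bad = [('w', 2, 21), ('r', 'B', 0), ('r', 'A', 2), ('w', 3, 31),
       ('r', 'C', 3), ('w', 3, 32), ('w', 0, 1)]

print(run(original, init) == run(valid, init))  # True
print(run(original, init) == run(bad, init))    # False: C gets 31, not 32
```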


2.11.4 Memory Arrays and Dataflow Diagrams

Legend for Dataflow Diagrams          .........................................................


     name            input port
     name            output port
     name            state signal
     name (rd)       array read
     name (wr)       array write


Basic Memory Operations          .............................................................


      data := mem[addr];               mem[addr] := data;
        Memory Read                      Memory Write

(Diagrams: the read consumes mem and addr, producing data, and has an
anti-dependency arrow producing mem; the write consumes mem, addr, and
data, producing a new mem.)
Dataflow diagrams show the dependencies between operations. The basic memory operations are
similar, in that each arrow represents a data dependency.

There are a few aspects of the basic memory operations that are potentially surprising:
• The anti-dependency arrow producing mem on a read.
• Reads and writes are dependent upon the entire previous value of the memory array.
• The write operation appears to produce an entire memory array, rather than just updating an
  individual element of an existing array.
Normally, we think of a memory array as stationary: to do a read, an address is given to the array
and the corresponding data is produced. In dataflow diagrams, it may be somewhat surprising to
see the read and write operations consuming and producing memory arrays.

Our goal is to support memory operations in dataflow diagrams. We want to model memory oper-
ations similarly to datapath operations. When we do a read, the data that is produced is dependent
upon the contents of the memory array and the address. For write operations, the apparent depen-
dency on, and production of, an entire memory array is because we do not know which address
in the array will be read from or written to. The antidependency for memory reads is related to
Write-after-Read dependencies, as discussed in Section 2.11.3. There are optimizations that can
be performed when we know the address (Section 2.11.4).


Dataflow Diagrams and Data Dependencies           . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..



Read after Write:

    Algo: mem[wr_addr] := data_in;
          data_out     := mem[rd_addr];

    (Dataflow diagram: mem(wr) consumes mem, data_in, and wr_addr; mem(rd)
    consumes the resulting mem and rd_addr, producing data_out. Optimization
    when rd_addr = wr_addr: mem(rd) takes data_in directly, so the read no
    longer depends on the mem produced by the write.)

Write after Write:

    Algo: mem[wr1_addr] := data1;
          mem[wr2_addr] := data2;

    (Dataflow diagram: the second mem(wr) consumes the mem produced by the
    first. A scheduling option exists when wr1_addr = wr2_addr.)

Write after Read:

    Algo: rd_data      := mem[rd_addr];
          mem[wr_addr] := wr_data;

    (Dataflow diagram: mem(wr) has an anti-dependency on mem(rd). Optimization
    when rd_addr = wr_addr: mem(rd) and mem(wr) both consume the original mem
    in parallel, producing rd_data and the new mem.)
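The Read-after-Write optimization (when rd_addr = wr_addr, the read can take data_in directly and drop its dependency on the memory array) can be sketched behaviorally in Python (function names are invented):

```python
# Without the optimization, the read depends on the whole updated array.
def raw_no_forwarding(mem, wr_addr, data_in, rd_addr):
    mem = list(mem)
    mem[wr_addr] = data_in
    return mem, mem[rd_addr]

# With the optimization: when the addresses are known to be equal, the
# read result is just data_in (forwarding), so it no longer depends on
# the memory array produced by the write.
def raw_forwarding(mem, wr_addr, data_in, rd_addr):
    mem = list(mem)
    mem[wr_addr] = data_in
    data_out = data_in if rd_addr == wr_addr else mem[rd_addr]
    return mem, data_out

m = [5, 6, 7, 8]
print(raw_no_forwarding(m, 2, 99, 2) == raw_forwarding(m, 2, 99, 2))  # True
```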


2.11.5 Example: Memory Array and Dataflow Diagram


  1   M[2] := 21
  2   M[3] := 31
  3   A    := M[2]
  4   B    := M[0]
  5   M[3] := 32
  6   M[0] := 01
  7   C    := M[3]

(Dataflow diagram: writes 1, 2, 5, and 6 form a chain of M(wr) operations
on the memory array; reads 3, 4, and 7 are M(rd) operations attached to
that chain, producing A, B, and C, with anti-dependency arrows from each
read to the following write.)

              Figure 2.9: Memory array example code and initial dataflow diagram

The dependency and anti-dependency arrows in the dataflow diagram in Figure 2.9 are based solely
upon whether an operation is a read or a write. The arrows do not take into account the address
that is read from or written to.

In Figure 2.10, we have used knowledge about which addresses we are accessing to remove un-
needed dependencies. These are the real dependencies and match those shown in the code fragment
for Figure 2.9. In Figure 2.11 we have placed an ordering on the read operations and an ordering on
the write operations. The ordering is derived by obeying data dependencies and then rearranging
the operations to perform as many operations in parallel as possible.




(Diagrams: the same reads and writes as Figure 2.9, first with only the real
dependencies between operations, then with an ordering imposed on the reads
and on the writes.)

Figure 2.10: Memory array with minimal dependencies    Figure 2.11: Memory array with orderings

(Diagram: the operations of Figure 2.9 scheduled into clock cycles, with each
cycle performing a read and a write of M in parallel.)

                         Figure 2.12: Final version of Figure 2.9

    Put as many parallel operations into the same clock cycle as allowed by resources. Preserve
    dependencies by putting dependent operations in separate clock cycles.


2.12 Input / Output Protocols
An important aspect of hardware design is choosing an input/output protocol that is easy to im-
plement and suits both your circuit and your environment. Here are a few simple and common
protocols.



(Waveform: rdy, data, and ack. The producer asserts rdy when data is valid;
the consumer asserts ack after taking the data; then rdy and ack return to
zero before the next transfer.)

                        Figure 2.13: Four-phase handshaking protocol

Used when the timing of communication between producer and consumer is unpredictable. The
disadvantage is that it is cumbersome to implement and slow to execute.
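A behavioral sketch of the four-phase sequence in Python (signal names follow Figure 2.13; the state machines and delay parameters are invented, with arbitrary delays standing in for the unpredictable timing):

```python
# Producer and consumer step once per "clock cycle", communicating only
# through rdy, ack, and data. Phases: 1) producer raises rdy with data,
# 2) consumer latches data and raises ack, 3) producer drops rdy,
# 4) consumer drops ack; then the next transfer may begin.
def simulate(values, prod_delay=2, cons_delay=3):
    rdy = ack = 0
    data = None
    received = []
    to_send = list(values)
    pwait = cwait = 0
    for _ in range(200):                     # bounded simulation
        # Producer side.
        if not rdy and not ack and to_send:
            if pwait < prod_delay:
                pwait += 1                   # unpredictable delay
            else:
                data, rdy, pwait = to_send.pop(0), 1, 0   # phase 1
        elif rdy and ack:
            rdy = 0                          # phase 3
        # Consumer side.
        if rdy and not ack:
            if cwait < cons_delay:
                cwait += 1                   # unpredictable delay
            else:
                received.append(data)        # latch data
                ack, cwait = 1, 0            # phase 2
        elif ack and not rdy:
            ack = 0                          # phase 4
        if not to_send and not rdy and not ack:
            break
    return received

print(simulate([7, 8, 9]))  # [7, 8, 9]
```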



(Waveform: clk, valid, and data. Data is sampled on clock edges where valid
is asserted.)

                               Figure 2.14: Valid-bit protocol

A low overhead (both in area and performance) protocol. Consumer must always be able to accept
incoming data. Often used in pipelined circuits. More complicated versions of the protocol can
handle pipeline stalls.



(Waveform: clk, start, data_in, done, and data_out. The environment asserts
start with data_in for one cycle; some cycles later, the circuit asserts
done when data_out is valid.)

                              Figure 2.15: Start/Done protocol

A low overhead (both in area and performance) protocol. Useful when a circuit works on one piece
of data at a time and the time to compute the result is unpredictable.


2.13 Example: Moving Average
In this section we will design a circuit that performs a moving average as it receives a stream of
data. When each new data item is received, the output is the average of the four most recently
received data.
                          Time     0  1  2  3  4  5  6  7  8  9  10
                          i_data   2  3  5  6  6  0  2  2  5  3  1
                          o_avg             4  5  4  3




2.13.1 Requirements and Environmental Assumptions
   1. Input data is sent sporadically, with at least 2 clock cycles of bubbles (invalid data) between
      valid data.
   2. When the input data is valid, the signal i_valid is asserted for exactly one clock cycle.
   3. Input data will be 8-bit signed numbers.
   4. When output data is ready, o_valid shall be asserted.
   5. The output data (o_avg) shall be the average of the four most recently received input data.
      Output numbers shall be truncated to integer values.
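The requirements can be captured as a short executable reference model in Python (a sketch, ignoring bubbles and timing; it uses floor division, which matches truncation for the non-negative example data but would differ from truncation toward zero for negative sums):

```python
# Reference model of the moving-average requirement: once four values
# have arrived, each new value produces the truncated average of the
# four most recent values.
def moving_average(stream):
    window, out = [], []
    for x in stream:
        window.append(x)
        if len(window) > 4:
            window.pop(0)                  # evict the oldest value
        if len(window) == 4:
            out.append(sum(window) // 4)   # truncate to an integer
    return out

i_data = [2, 3, 5, 6, 6, 0, 2, 2, 5, 3, 1]
print(moving_average(i_data)[:4])  # [4, 5, 4, 3], matching the table
```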


2.13.2 Algorithm

We begin by exploring the mathematical behaviour of the system. To simplify the analysis at this
abstract level, we ignore bubbles and time and focus only on the valid data. If we have an input
stream of data x_i (i.e., x_i is the value of the i-th valid datum on i_data), the equation for the
output is:


                         avg_i = (x_{i-3} + x_{i-2} + x_{i-1} + x_i) / 4


To simplify our analysis of the equation, we decompose the computation into computing the sum
of the four most recent data and dividing the sum by four:


                         sum_i = x_{i-3} + x_{i-2} + x_{i-1} + x_i
                         avg_i = sum_i / 4

We look at the equation of sum over several iterations to try to identify patterns that we can use to
optimize our design:




                         sum_5 = x_2 + x_3 + x_4 + x_5
                         sum_6 = x_3 + x_4 + x_5 + x_6
                         sum_7 = x_4 + x_5 + x_6 + x_7

We see that part of the calculations that are done for index i are the same as those for i + 1:


                         sum_5 = x_2 + (x_3 + x_4 + x_5)
                         sum_6 = (x_3 + x_4 + x_5) + x_6
                               = sum_5 - x_2 + x_6

We check a few more samples and conclude that we can generalize the above for index i as:


                         sum_i = sum_{i-1} - x_{i-4} + x_i
                         avg_i = sum_i / 4

The equation for sum_i depends on x_i and x_{i-4}; therefore, we need the current input value and we
need to store the four most recent input data. These four most recent data form a sliding window:
each time we receive valid data, we remove the oldest data value (x_{i-4}) and insert the new data (x_i).

Summary of system behaviour deduced from exploring requirements and algorithm:

   1. Define a signal new for the value of i_data each time that i_valid is '1'.
   2. Define a memory array M to store a sliding window of the four most recent values of i_data.
   3. Define a signal old for the oldest data value from the sliding window.
   4. Update sum_i with sum_{i-1} - old_i + new_i.
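The recurrence is easy to check against the direct four-term sum with a few lines of Python (a sketch; the function names are invented):

```python
# Checks that the recurrence sum_i = sum_{i-1} - x_{i-4} + x_i gives
# the same results as summing the four most recent values directly.
def direct_sums(xs):
    # Window sums computed from scratch for each index i >= 3.
    return [sum(xs[i - 3:i + 1]) for i in range(3, len(xs))]

def recurrence_sums(xs):
    out = [sum(xs[0:4])]                    # first full window
    for i in range(4, len(xs)):
        out.append(out[-1] - xs[i - 4] + xs[i])   # the recurrence
    return out

xs = [2, 3, 5, 6, 6, 0, 2, 2, 5, 3, 1]
print(direct_sums(xs) == recurrence_sums(xs))  # True
```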


Sliding Window       .......................................................................


There are two principal ways to implement a sliding window:
shift-register Each time new data is loaded, every register is loaded with the data from the
      register to its right or left, and the leftmost or rightmost register is loaded with the new
      data: R[0] = new and R[i] = R[i-1].
circular buffer Once a data value is loaded into the buffer, the data remains in the same location
      until it is overwritten. When a new value is loaded, it overwrites the oldest value in the
      buffer; none of the other elements change. A state machine keeps track of the position
      (address) of the oldest piece of data. The state machine increments to point to the next
      register, which now holds the oldest piece of data.


                 Shift register                        Circular buffer
     old   M[3]  M[2]  M[1]  M[0]  new         old     M[0..3]      new
      α     α     β     γ     δ     ε           α      α β γ δ       ε
      β     β     γ     δ     ε     ζ           β      ε β γ δ       ζ
      γ     γ     δ     ε     ζ     η           γ      ε ζ γ δ       η
      δ     δ     ε     ζ     η     ι           δ      ε ζ η δ       ι
      ε     ε     ζ     η     ι     κ           ε      ε ζ η ι       κ
      ζ     ζ     η     ι     κ     λ           ζ      κ ζ η ι       λ
The circular buffer design is usually preferable, because only one element changes value per clock
cycle. This allows the buffer to be implemented with a memory array rather than a set of registers.
Also, by having only one element change value, power consumption is reduced (fewer capacitors
charging and discharging).
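The behavioural equivalence of the two implementations can be modelled in a few lines of Python (a sketch with invented names; numbers stand in for the Greek letters in the table above):

```python
# Models the two sliding-window implementations. Both must evict the
# same (oldest) value on each insertion; the circular buffer changes
# only one location per insertion.
def shift_register_insert(regs, new):
    # Every register loads its neighbour; the oldest value falls off.
    old = regs[-1]
    return [new] + regs[:-1], old

def circular_buffer_insert(buf, idx, new):
    # Only buf[idx] changes; idx advances to the next-oldest slot.
    old = buf[idx]
    buf = list(buf)
    buf[idx] = new
    return buf, (idx + 1) % len(buf), old

regs = [4, 3, 2, 1]          # regs[0] newest, regs[-1] oldest
buf, idx = [1, 2, 3, 4], 0   # buf[idx] is the oldest element
for new in [5, 6, 7]:
    regs, old_s = shift_register_insert(regs, new)
    buf, idx, old_c = circular_buffer_insert(buf, idx, new)
    print(old_s == old_c)    # both evict the same oldest value: True
```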

We have only four items to store, so we will use registers rather than a memory array. For fewer
than sixteen items, registers are generally cheaper. For sixteen items, the choice between registers
and a memory array is highly dependent on the design goals (e.g., speed vs area) and implementation
technology.

Now that we have designed the storage module, we see that rather than a write-enable and address
signal, the actual signals we need are four chip-enable signals. This suggests that we should use a
one-hot encoding for the index of the oldest element in the circular buffer.

Because we have a one-hot encoding for the index, we do not use normal multiplexers to select
which register to read from; normal multiplexers take a binary-encoded select signal. Instead, we
will use a 4:1 decoded mux, which is just four AND gates followed by a 4-input OR gate. Because
the data is 8 bits wide, each of the AND gates and the OR gate is 8 bits wide.
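The decoded mux is easy to model: with a one-hot select, each input is ANDed with its select bit and the results are ORed together. A Python sketch (names invented):

```python
# Sketch of a 4:1 decoded multiplexer: with a one-hot select, each
# input is masked (ANDed) by its select bit and the masked values are
# ORed together, so no binary decoder is needed.
def decoded_mux(inputs, onehot_sel):
    out = 0
    for value, sel in zip(inputs, onehot_sel):
        # AND stage: unselected inputs are masked to zero.
        masked = value if sel else 0
        out |= masked                # OR stage combines the masked inputs
    return out

regs = [0x11, 0x22, 0x33, 0x44]      # four 8-bit registers
print(hex(decoded_mux(regs, [0, 0, 1, 0])))  # 0x33
```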


(Diagram: four 8-bit registers M[0]..M[3], each with a chip enable ce[i]
formed by ANDing the write enable we with the one-hot index bit idx[i]; all
register data inputs share the 8-bit input d, and the register outputs feed
the 4:1 decoded multiplexer, producing the 8-bit output q.)

                  Register array with chip-enables and decoded multiplexer


2.13.3 Pseudocode and Dataflow Diagrams

There are three different notations that we use to describe the behaviour of hardware systems
abstractly: mathematical equations (for datapath centric designs), state machines (for control-
dominated designs), and pseudocode (for algorithms or designs with memory). Our pseudocode is
similar to three-address assembly code: each line of code has a target variable, an operation, and
one or two operand variables (e.g., C = A + B). The name “three address” comes from the fact
that there are three addresses, or variables, in each line.

We use the three-address style of pseudocode, because each line of pseudocode then corresponds
to a single datapath operation in the dataflow diagram. This gives us greater flexibility to optimize
the pseudocode by rescheduling operations.

From the three-address pseudocode, we will construct dataflow diagrams.

As an aside, in contrast to three-address languages, some assembly languages for extremely small
processors are limited to two addresses. The target must be the same as one of the operands (e.g.,
A = A + B).


First Pseudocode       ......................................................................


For the first pseudocode, we do not restrict ourselves to three addresses. In the second version of
the code, we decompose the first line into two separate lines that obey the three-address restriction.

      Pseudo pseudocode                  Real 3-address pseudocode
      new     =  i_data                  new     =  i_data
      old     =  M[idx]                  old     =  M[idx]
      sum     =  sum - old + new         tmp     =  sum - old
      M[idx]  =  new                     sum     =  tmp + new
      idx     =  idx rol 1               M[idx]  =  new
      o_avg   =  sum/4                   idx     =  idx rol 1
                                         o_avg   =  sum/4



Data-Dependency Graph           ............................................................. .


To begin to understand what the hardware might be, we draw a data-dependency graph for the
pseudocode.

(Data-dependency graph: sum, M, idx, and i_data enter at the top; i_data
becomes new; a read of M[idx] produces old; a subtraction of old from sum
produces tmp; an addition of tmp and new produces the new sum; new is
written to M[idx]; idx is rotated left by 1 (a wired shift); and o_avg is
sum/4.)


Optimizing the Data-Dependency Graph              .............................................. .


In our design work so far, we have ignored bubbles and time. As we evolve from the pseudocode
to a data-dependency graph and then to a dataflow graph, we will include the effect of the bubbles
in our analysis.


In the data-dependency graph we observe that we have two arithmetic operations: subtraction and
addition. The requirements guarantee that there are at least two clock cycles of bubbles between
each parcel of valid data, so we have the ability to reuse hardware.

In contrast, we would not be able to reuse hardware if either we had to accept new data in each
clock cycle or we needed a fully pipelined circuit. If we had to accept new data in each clock
cycle, and were not pipelined, then the work would need to be completed in a single clock cycle. If
the design was to be fully pipelined, then each parcel of data would stay in each stage for exactly
one clock cycle: there would be no opportunity for a parcel to visit a stage twice, and hence no
opportunity for reuse.

For our design, where we are attempting to reuse hardware, we hypothesize that a single adder/subtracter
is cheaper than a separate adder and a subtracter. We would like to combine the two lines:

tmp         = sum - old
sum         = tmp + new
Looking at the data-dependency graph, we see that old is coming from memory and new is
coming from either a register or combinational logic. We cannot allocate new and old to the
same hardware, because they do not come from the same type of hardware: old is read out of an
array of registers, while new is an ordinary register. So, we will need a multiplexer for the second
operand, to choose between reading from old and new. A multiplexer might also be required for
the first operand, to choose between sum and tmp. But both of these signals are regular signals,
so we might be able to allocate both sum and tmp to the same register or datapath output, and
hence avoid a multiplexer for the first operand. We will decide how to deal with the first operand
when we do register and datapath allocation.

We remove the need for a multiplexer for the second operand by reading new from memory. To
accomplish this, we re-write the pseudocode so that we first write i_data to memory, and then
read new from memory. The three versions of the pseudocode below show the transformations.
The data-dependency graph is for the third version of the pseudocode.


Remove intermediate signal old:

    new     =  i_data
    tmp     =  sum - M[idx]
    sum     =  tmp + new
    M[idx]  =  new
    idx     =  idx rol 1
    o_avg   =  sum/4

Optimize code by reading new from memory:

    tmp     =  sum - M[idx]
    M[idx]  =  i_data
    new     =  M[idx]
    sum     =  tmp + new
    idx     =  idx rol 1
    o_avg   =  sum/4

Remove intermediate signal new:

    tmp     =  sum - M[idx]
    M[idx]  =  i_data
    sum     =  tmp + M[idx]
    idx     =  idx rol 1
    o_avg   =  sum/4

(Data-dependency graph for the third version: the first read of M[idx] feeds
the subtraction producing tmp; i_data is written to M[idx]; a second read of
M[idx] feeds the addition producing sum; idx is rotated left by 1 (a wired
shift); and o_avg is sum/4.)
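To check that the rewrites preserve behaviour, we can model one iteration of the first and final versions in Python and compare (a sketch; M is the four-entry buffer, and idx is a binary index here rather than one-hot):

```python
# One iteration of the first pseudocode version: read old, then
# subtract, add, and write new to memory.
def version1(M, idx, s, i_data):
    new = i_data
    tmp = s - M[idx]                       # subtract the oldest value
    s = tmp + new                          # add the newest value
    M = M[:idx] + [new] + M[idx + 1:]      # write new to memory
    return M, s

# One iteration of the final version: write i_data to memory first,
# then read it back for the addition (no separate signal new).
def version3(M, idx, s, i_data):
    tmp = s - M[idx]                       # read old before overwriting
    M = M[:idx] + [i_data] + M[idx + 1:]   # write i_data to memory
    s = tmp + M[idx]                       # read new back out of memory
    return M, s

M, idx, s = [2, 3, 5, 6], 0, 16
print(version1(list(M), idx, s, 7) == version3(list(M), idx, s, 7))  # True
```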



Dataflow Diagram        .....................................................................


To construct a dataflow diagram, we divide the data-dependency graph into clock cycles. Because
we are using registers rather than a memory array, we can schedule the first read and first write
operations in the same clock cycle, even though they use the same address. In contrast, with
memory arrays it generally is risky to rely on the value of the output data in the clock cycle in
which we are doing a write (Section 2.11.1).

We need a second clock cycle for the second read from memory.

We now explore two options: with and without a third clock cycle; both are shown below. The
difference between the two options is whether the signals idx and sum refer to the outputs of
registers or of the combinational datapath units (sum being the output of the adder/subtracter and
idx being the output of a rotation). With a latency of three clock cycles, idx is a registered
signal. With a latency of two clock cycles, idx and sum are combinational.

It is a bit misleading to describe the rotate-left unit for idx as combinational, because it is simply
a wire connecting one flip-flop to another. However, conceptually and for correct behaviour, it
is helpful to think of the rotation unit as a block of combinational circuitry. This allows us to
distinguish between the output of the idx register and the input to the register (which is the output
of the rotation unit). Without this distinction, we might read the wrong value of idx and be
out-of-sync by one clock cycle.


         Latency of three clock cycles         Latency of two clock cycles

(Dataflow diagrams: both schedule the write to M and the first read in S0
and the second read in S1. With a latency of three, the adder/subtracter
output and the rotated idx are registered in S2 before the next parcel;
with a latency of two, sum and idx are the combinational outputs of the
adder/subtracter and the rotation (a wired shift) at the end of S1.)

From a performance point of view, a latency of two is somewhat preferable. By keeping our latency
low, there may be another module that will benefit by having an additional clock cycle in which to
do its work. The counter argument is that we have two clock cycles of bubbles, which means that
we can tolerate a latency of up to three without a need to pipeline. We’ll be efficient engineers and
try to achieve a latency of two.

The two dataflow diagrams appear to be very similar, but in the dataflow diagram with a latency of
two, a multiplexer will be needed for the address signal of the circular buffer. In S0, the address
input to the circular buffer is the output of the rotator. In S1, the address is the output of a register.

To eliminate the need for a multiplexer on the address input to the circular buffer, we move the
rotation from S0 to S1, so that the address is always a registered signal.


                     Latency of two clock cycles with registered address

(Dataflow diagram: the rotation of idx is moved from S0 to S1, so the
address input to the circular buffer is always a registered signal; S0
performs the write and first read of M, and S1 performs the second read,
the arithmetic, and the wired-shift rotation.)


Register and Datapath Allocation      ......................................................


Register allocation is simple: on the first clock-cycle boundary, idx and sum are each allocated
to a register with the same name. On the second boundary, we again allocate idx to the register
idx, which leaves the register sum for the output of the adder/subtracter.


Datapath Allocation      ...................................................................


Datapath allocation is even simpler than register allocation: we have one adder/subtracter (as1)
and one rotate-left (rol).


 [Dataflow diagram with register and datapath allocation: idx and sum are held
 in registers of the same names at both clock-cycle boundaries; the
 adder/subtracter is allocated to as1 and the rotate-left to rol; the wired
 shift of sum produces o_avg.]


2.13.4 Control Tables and State Machine

From the dataflow diagram, we construct a control table. For memory (M) we need: write enable,
address, and data input columns. For registers (idx, sum) we need chip enable and data input
columns. For datapath components we need data inputs, plus a control signal to determine whether
as1 does addition or subtraction. We name this signal as1.sub, where true means subtract and
false means add.

We proceed in two steps, first ignoring bubbles, then extending our design to handle bubbles.

               Register control table
           M                idx          sum
       we  addr  d        ce  d        ce  d
   S0   1  idx   x         0  –         1  as1
   S1   0  idx   –         1  rol       1  as1

               Datapath control table
             as1                  rol
       sub  src1  src2       src1  src2
   S0    0   M    sum          –     –
   S1    1   sum  M           idx    1

               Optimized control table
           M    idx   as1
           we   ce    sub
   S0       1    0     0
   S1       0    1     1

        Static assignments in control table
   M.addr   = idx
   M.d      = x
   idx.d    = rol
   sum.d    = as1
   as1.src1 = sum
   as1.src2 = M


Control Table and Bubbles                . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..


If the circuit always had valid parcels arriving in every other clock cycle, then we could proceed
directly from our dataflow diagram and optimized control table to VHDL code. However, the
indeterminate number of bubbles complicates the design of our state machine.

We add an idle mode to our state machine. The circuit is in idle mode when there is not a valid
parcel in the circuit. By “idle”, we mean that all write-enable and chip-enable signals are turned
off and the state machine does not change state. The state machine must resume in state S0 when
i_valid becomes true.

In the optimized control table, sum does not need a chip enable, but with the addition of idle
mode, we will need a chip enable on sum.

The multiplexers for the datapath components are unaffected by the addition of idle mode. When
the circuit is in idle mode, the registers do not load new data, and so the behaviour of the datapath
components is unconstrained.

The final control table is below.

           Almost final control table
            M    idx   sum   as1
            we   ce    ce    sub
    S0       1    0     1     0
    S1       0    1     1     1
    idle     0    0     0     –

           Static assignments
    M.addr   = idx
    M.d      = x
    idx.d    = rol
    sum.d    = as1
    as1.src1 = sum
    as1.src2 = M

           Final control table
            M    idx   sum   as1
            we   ce    ce    sub
    S0       1    0     1     0
    S1       0    1     1     1
    idle     0    0     0     0


State Machine       . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . . .. . . . . ...


The state machine starts in idle, transitions to S0 when i_valid is true, then goes to S1 in the
next clock cycle, and then returns to idle.

We will use a modified one-hot encoding and use the valid-bit signals to hold the state. From the
dataflow diagram we see that the latency through the circuit is two clock cycles. We need two
valid-bit registers and will have three valid-bit signals: i_valid (input, no register needed),
valid1 (register), and o_valid (register). For the state encoding, we will use i_valid and
valid1.


                                        i_valid  valid1
                                    S0     1        0
                                    S1     0        1
                                   idle    0        0

Updating the control table to show the state encoding gives us:

                             Final control table with state encoding

                                state            M    idx   sum   as1
                           i_valid  valid1       we   ce    ce    sub
                    S0        1        0          1    0     1     0
                    S1        0        1          0    1     1     1
                   idle       0        0          0    0     0     0

Using the state encoding and the final control table, we write equations for the write-enable signals,
chip-enable signals, and the adder/subtracter control signal.

M.we       =   i_valid
idx.ce     =   valid1
sum.ce     =   i_valid OR valid1
as1.sub    =   valid1
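These equations can be cross-checked against the final control table. A quick Python sketch (my own check, not part of the notes):

```python
# A check (mine, not from the notes) that the control-signal equations
# reproduce the final control table for each state encoding.
states = {"S0": (1, 0), "S1": (0, 1), "idle": (0, 0)}

table = {}
for name, (i_valid, valid1) in states.items():
    table[name] = {
        "M.we":    i_valid,            # M.we    = i_valid
        "idx.ce":  valid1,             # idx.ce  = valid1
        "sum.ce":  i_valid | valid1,   # sum.ce  = i_valid OR valid1
        "as1.sub": valid1,             # as1.sub = valid1
    }

print(table["S0"])  # → {'M.we': 1, 'idx.ce': 0, 'sum.ce': 1, 'as1.sub': 0}
```

Each row matches the corresponding line of the final control table with state encoding.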


2.13.5 VHDL Code


-- valid bits
process begin
  wait until rising_edge(clk);
  valid1  <= i_valid;
  o_valid <= valid1;
end process;

-- idx
process begin
  wait until rising_edge(clk);
  if reset = '1' then
    idx <= "0001";
  else
    if valid1 = '1' then
      idx <= idx rol 1;
    end if;
  end if;
end process;

-- sliding window
process begin
  wait until rising_edge(clk);
  for i in 3 downto 0 loop
    if (i_valid = '1') and (idx(i) = '1') then
      M(i) <= i_data;
    end if;
  end loop;
end process;

mem_out <=   M(0) when idx(0) = '1'
        else M(1) when idx(1) = '1'
        else M(2) when idx(2) = '1'
        else M(3);

-- add sub
add_sub <=   sum - mem_out when valid1 = '1'
        else sum + mem_out;

-- sum
process begin
  wait until rising_edge(clk);
  if i_valid = '1' or valid1 = '1' then
    sum <= add_sub;
  end if;
end process;
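For reference, the behaviour this datapath implements can be sketched in software. The Python model below (names are mine, not the notes') keeps the last four samples and updates the sum incrementally, just as the hardware adds the new sample and subtracts the evicted one; the wired shift becomes an integer divide by 4:

```python
from collections import deque

def sliding_avg(stream, n=4):
    """Behavioural model of the sliding-window average: keep the last n
    samples, update the running sum incrementally (add the new sample,
    subtract the evicted one), and divide by n via an integer shift."""
    window = deque([0] * n)
    total = 0
    for x in stream:
        total += x - window.popleft()  # as1: add new, subtract evicted
        window.append(x)
        yield total // n               # wired shift: divide by 4

print(list(sliding_avg([4, 8, 12, 16])))  # → [1, 3, 6, 10]
```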


Hardware   .............................................................................


 [Hardware diagram: i_valid and valid1 form the valid-bit pipeline; i_data
 feeds the memory M, whose address comes from idx (updated by a wired shift);
 M's output and sum feed the add/sub unit; the chip enables on idx and sum are
 driven by the valid bits; a wired shift of sum produces o_avg, qualified by
 o_valid.]


2.14 Design Problems
P2.1 Synthesis

This question is about using VHDL to implement memory structures on FPGAs.


P2.1.1 Data Structures

If you have to write your own code (i.e. you do not have a library of memory components or a
special component-generation tool such as LogiBlox or CoreGen), what data structures in VHDL
would you use when creating a register file?


P2.1.2 Own Code vs Libraries

When using VHDL for an FPGA, under what circumstances is it better to write your own VHDL
code for memory, rather than instantiate memory components from a library?


P2.2 Design Guidelines

While you are grocery shopping you encounter your co-op supervisor from last year. She’s now
forming a startup company in Waterloo that will build digital circuits. She’s writing up the design
guidelines that all of their projects will follow. She asks for your advice on some potential
guidelines.

What is your response to each question?
What is your justification for your answer?
What are the tradeoffs between the two options?


   0. Sample Should all projects use silicon chips, or should all use biological chips, or should
      each project choose its own technique?
      Answer: All projects should use silicon based chips, because biological chips don’t
      exist yet. The tradeoff is that if biological chips existed, they would probably con-
      sume less power than silicon chips.

   1. Should all projects use an asynchronous reset signal, or should all use a synchronous reset
      signal, or should each project choose its own technique?

   2. Should all projects use latches, or should all projects use flip-flops, or should each project
      choose its own technique?


  3. Should all chips have registers on the inputs and outputs or should chips have the inputs
     and outputs directly connected to combinational circuitry, or should each project choose
     its own technique? By “register” we mean either flip-flops or latches, based upon your
     answer to the previous question. If your answer is different for inputs and outputs, explain
     why.

  4. Should all circuit modules on all chips have flip-flops on the inputs and outputs, or should
     modules have the inputs and outputs directly connected to combinational circuitry, or
     should each project choose its own technique? If your answer is different for inputs and
     outputs, explain why.

  5. Should all projects use tri-state buffers, or should all projects use multiplexors, or should
     each project choose its own technique?


P2.3 Dataflow Diagram Optimization

Use the dataflow diagram below to answer problems P2.3.1 and P2.3.2.

 [Dataflow diagram for P2.3 (not reproduced): inputs a, b, c, d, and e flow
 through f and g components and registers over several clock cycles.]
P2.3.1 Resource Usage

List the number of items for each resource used in the dataflow diagram.


P2.3.2 Optimization

Draw an optimized dataflow diagram that improves the performance and produces the same output
values. Or, if the performance cannot be improved, describe the limiting factor on the performance.

NOTES:

   • you may change the times when signals are read from the environment

   • you may not increase the resource usage (input ports, registers, output ports, f components,
     g components)

   • you may not increase the clock period


P2.4 Dataflow Diagram Design

Your manager has given you the task of implementing the following pseudocode in an FPGA:

if is_odd(a + d)
  p = (a + d)*2 + ((b + c) - 1)/4;
else
  p = (b + c)*2 + d;



   NOTES: 1) You must use registers on all input and output ports.
          2) p, a, b, c, and d are to be implemented as 8-bit signed signals.
          3) A 2-input 8-bit ALU that supports both addition and subtraction
             takes 1 clock cycle.
          4) A 2-input 8-bit multiplier or divider takes 4 clock cycles.
          5) A small amount of additional circuitry (e.g. a NOT gate, an AND
             gate, or a MUX) can be squeezed into the same clock cycle(s) as an
             ALU operation, multiply, or divide.
          6) You can require that the environment provides the inputs in any
             order and that it holds the input signals at the same value for
             multiple clock cycles.
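As a sanity check, the pseudocode behaves like the following Python sketch (the function name is mine; Python integers stand in for the 8-bit signed signals, so overflow wrapping is not modelled, and the division is taken as an integer divide):

```python
def p_of(a, b, c, d):
    # P2.4 pseudocode; "/4" is read as an integer division, matching a
    # hardware wired shift.
    if (a + d) % 2 == 1:                         # is_odd(a + d)
        return (a + d) * 2 + ((b + c) - 1) // 4
    else:
        return (b + c) * 2 + d

print(p_of(1, 2, 3, 4), p_of(2, 2, 3, 4))  # → 11 14
```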


P2.4.1 Maximum Performance

What is the minimum number of clock cycles needed to implement the pseudocode with a circuit
that has two input ports?

What is the minimum number of ALUs, multipliers, and dividers needed to achieve the minimum
number of clock cycles that you just calculated?


P2.4.2 Minimum area

What is the minimum number of datapath storage registers (8, 6, 4, and 1 bit) and clock cycles
needed to implement the pseudocode if the circuit can have at most one ALU, one multiplier, and
one divider?


P2.5 Michener: Design and Optimization

Design a circuit named michener that performs the following operation: z = (a+d) + ((b -
c) - 1)

NOTES:
  1. Optimize your design for area.
  2. You may schedule the inputs to arrive at any time.
  3. You may do algebraic transformations of the specification.



P2.6 Dataflow Diagrams with Memory Arrays

 Component                          Delay
 Register                            5 ns
 Adder                              25 ns
 Subtracter                         30 ns
 ALU with +, −, >, =, AND, XOR      40 ns
 Memory read                        60 ns
 Memory write                       60 ns
 Multiplication                     65 ns
 2:1 Multiplexor                     5 ns
NOTES:
  1. The inputs of the algorithms are a and b.
  2. The outputs of the algorithms are p and q.
  3. You must register both your inputs and outputs.
  4. You may choose to read your input data values at any time and produce your outputs at any
     time. For your inputs, you may read each value only once (i.e. the environment will not send
     multiple copies of the same value).
  5. Execution time is measured from when you read your first input until the later of producing
     your last output or completing the write of a result to memory.
  6. M is an internal memory array, which must be implemented as dual-ported memory with one
     read/write port and one read port.
  7. M supports synchronous write and asynchronous read.


  8. Assume all memory address and other arithmetic calculations are within the range of repre-
     sentable numbers (i.e. no overflows occur).
  9. If you need a circuit not on the list above, assume that its delay is 30 ns.
 10. You may sacrifice area efficiency to achieve high performance, but marks will be deducted
     for extra hardware that does not contribute to performance.


P2.6.1 Algorithm 1

Algorithm

q    = M[b];
M[a] = b;
p    = M[b+1] * a;
Assuming a ≤ b, draw a dataflow diagram that is optimized for the fastest overall execution
time.


P2.6.2 Algorithm 2

q    = M[b];
M[a] = q;
p    = (M[b-1] * b) + M[b];
Assuming a > b, draw a dataflow diagram that is optimized for the fastest overall execution
time.


P2.7 2-bit adder

This question compares an FPGA and a generic-gates implementation of a 2-bit full adder.


P2.7.1 Generic Gates

Show the implementation of a 2-bit adder using NAND, NOR, and NOT gates.


P2.7.2 FPGA

Show the implementation of a 2-bit adder using generic FPGA cells; show the equations for the
lookup tables.
 [Figure: two generic FPGA cells. In the first cell, a[0], b[0], and c_in enter
 the lookup table (comb), which feeds a D flip-flop (with CE, R, and S inputs)
 producing sum[0]; the cell also produces carry_1. In the second cell, a[1],
 b[1], and carry_1 similarly produce sum[1] and c_out.]




P2.8 Sketches of Problems
  1. calculate resource usage for a dataflow diagram (input ports, output ports, registers, datapath
     components)
  2. calculate performance data for a dataflow diagram (clock period and number of cycles to
     execute (CPI))
  3. given a dataflow diagram, calculate the clock period that will result in the optimum perfor-
     mance
  4. given an algorithm, design a dataflow diagram
  5. given a dataflow diagram, design the datapath and finite state machine
  6. optimize a dataflow diagram to improve performance or reduce resource usage
  7. given an FSM diagram, pick the VHDL code that “best” implements the diagram — correct
     behaviour, simple, fast hardware — or critique hardware
Chapter 3

Performance Analysis and Optimization

3.1 Introduction
Hennessy and Patterson’s Computer Architecture: A Quantitative Approach (textbook for E&CE
429) has good information on performance. We will use some of the same definitions and formulas
as Hennessy and Patterson, but we will move away from generic definitions of performance for
computer systems and focus on performance for digital circuits.



3.2 Defining Performance

                                                    Work
                                   Performance =
                                                    Time

You can double your performance by:

  doing twice the work in the same amount of time

OR doing the same amount of work in half the time


Benchmarking      ....................................................................... .


Measuring time is easy, but how do we accurately measure work?

The game of benchmarketing is finding a definition of work that makes your system appear to get
the most work done in the least amount of time.






              Measure of Work      Measure of Performance
              clock cycle          MHz
              instruction          MIPS
              synthetic program    Whetstone, Dhrystone, D-MIPS (Dhrystone MIPS)
              real program         SPEC
              travel 1/4 mile      drag race


The SPEC benchmarks are among the most respected and accurate predictors of real-world
performance.


  Definition SPEC: Standard Performance Evaluation Corporation. Mission: “To
     establish, maintain, and endorse a standardized set of relevant benchmarks
     and metrics for performance evaluation of modern computer systems.”
     (http://www.spec.org)


The SPEC organization has different benchmarks for integer software, floating-point software,
web-serving software, etc.



3.3 Comparing Performance
3.3.1 General Equations

Equation for “Big is n% greater than Small”:

                                               Big − Small
                                       n% =
                                                  Small

For the above equation, it can be difficult to remember whether the denominator is the larger
number or the smaller number. To see why Small is the only sensible choice, consider the situation
where a is 100% greater than b. This means that the difference between a and b is 100% of
something. Our only variables are a and b. It would be nonsensical for the difference to be a,
because that would mean: a − b = a. However, if a − b = b, then for a to be 100% greater than b
simply means that a = 2b.

Using the “n% greater” formula, the phrase “The performance of A is n% greater than the performance
of B” is:

                                   Performance A − Performance B
                            n% =
                                          Performance B


Performance is inversely proportional to time:
                                                          1
                                      Performance =
                                                        Time

Substituting the above equation into the equation for “the performance of A is n% greater than the
performance of B” gives:

                                             TimeB − TimeA
                                      n% =
                                                TimeA

In general, the equation for a fast system to be “n%” faster than a slow system is:

                                             TSlow − TFast
                                      n% =
                                                 TFast

Another useful formula is the average time to do one of k different tasks, where task i happens
%i of the time and takes time Ti each time it is done.

                                                  k
                                       TAvg = ∑ (%i)(Ti )
                                                 i=1


We can measure the performance of practically anything (cars, computers, vacuum cleaners,
printers, ...).


3.3.2 Example: Performance of Printers

            Black and White   Colour
 printer1        9 ppm         6 ppm
 printer2       12 ppm         4 ppm


   Question:     Which printer is faster at B&W and how much faster is it?


   Answer:


      BW Performance    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..

                       n% faster = (TSlow − TFast) / TFast

                       BW1 = 1 / (9 ppm)  = 0.1111 min/page
                       BW2 = 1 / (12 ppm) = 0.0833 min/page

                       BWFaster = (TSlow − TFast) / TFast
                                = (BW1 − BW2) / BW2
                                = (0.1111 − 0.0833) / 0.0833
                                = 33% faster
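The “n% faster” and average-time formulas are easy to check in software; a small Python sketch (the helper names are mine, not the notes'):

```python
def pct_faster(t_slow, t_fast):
    """n% faster = (T_slow - T_fast) / T_fast, as a percentage."""
    return 100.0 * (t_slow - t_fast) / t_fast

def t_avg(mix):
    """Average time per task over (fraction, time) pairs."""
    return sum(frac * t for frac, t in mix)

bw1, bw2 = 1 / 9, 1 / 12            # min/page at 9 ppm and 12 ppm
print(round(pct_faster(bw1, bw2)))  # → 33
```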


Performance for Different Tasks           ...................................................... .



  Question: If average workload is 90% BW and 10% Colour, which printer is faster
    and how much faster is it?


  Answer:

                       TAvg1 = %BW × BW1 + %C × C1
                             = (0.90 × 0.1111) + (0.10 × 0.1667)
                             = 0.1167 min/page

                       TAvg2 = %BW × BW2 + %C × C2
                             = (0.90 × 0.0833) + (0.10 × 0.2500)
                             = 0.1000 min/page

                       AvgFaster = (TSlow − TFast) / TFast
                                 = (TAvg1 − TAvg2) / TAvg2
                                 = (0.1167 − 0.1000) / 0.1000
                                 = 16.7% faster


Optimizing Performance    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..



  Question: If we want to optimize printer1 to match performance of printer2, should
    we optimize BW or Colour printing?


  Answer:


      Colour printing is slower, so it appears that we can save more time by
      optimizing colour printing.

      However, look at the extreme case of optimizing colour printing to be
      instantaneous for P1:


      [Bar chart: average time per page for P1 and P2 on a scale of 0.000 to
      0.150 min/page, with P1’s colour time set to zero.]


      Even if we made colour printing instantaneous for printer 1 and kept
      printer 2 the same, printer 1 would not be measurably faster.

      Amdahl’s law “Make the common case fast.”


       Optimizations need to take into account both run time and frequency of
       occurrence.
      We should optimize black and white printing.


  Question: If you have to fire all of the engineers because your stock price plummeted,
    how can you get printer1 to be faster than printer2?


         Note:     This question was actually humorous during the high-tech bubble of
         2000...


  Answer:


      Hire more marketing people!

      Notice that colour printing on printer 1 is faster than on printer 2. So,
      marketing suggests that people are increasing the percentage of printing that
      is done in colour.


  Question: Revised question: what percentage of printing must be done in colour for
    printer1 to beat printer2?


  Answer:


                              TAvg1 ≤ TAvg2

              %BW × BW1 + %C × C1 ≤ %BW × BW2 + %C × C2


                              %BW = 1 − %C


            (1 − %C) × BW1 + %C × C1 ≤ (1 − %C) × BW2 + %C × C2

             BW1 + %C × (C1 − BW1) ≤ BW2 + %C × (C2 − BW2)

                    %C ≥ (BW1 − BW2) / (BW1 − BW2 + C2 − C1)

                    %C ≥ (0.1111 − 0.0833) / (0.1111 − 0.0833 + 0.2500 − 0.1667)

                    %C ≥ 0.25
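A quick Python check of the crossover point (variable names are mine):

```python
# Times in minutes per page, from the printer table.
bw1, c1 = 1 / 9, 1 / 6     # printer 1: 9 ppm B&W, 6 ppm colour
bw2, c2 = 1 / 12, 1 / 4    # printer 2: 12 ppm B&W, 4 ppm colour

# %C >= (BW1 - BW2) / (BW1 - BW2 + C2 - C1)
pc = (bw1 - bw2) / (bw1 - bw2 + c2 - c1)
print(round(pc, 2))  # → 0.25
```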


3.4 Clock Speed, CPI, Program Length, and Performance
3.4.1 Mathematics

                        CPI             Cycles per instruction
                        NumInsts        Number of instructions
                        ClockSpeed      Clock speed
                        ClockPeriod     Clock period



                      Time = NumInsts × CPI × ClockPeriod

                      Time = (NumInsts × CPI) / ClockSpeed


3.4.2 Example: CISC vs RISC and CPI

                                         Clock Speed SPECint
                      AMD Athlon              1.1GHz     409
                      Fujitsu SPARC64        675MHz      443


The AMD Athlon is a CISC microprocessor (it uses the IA-32 instruction set). The Fujitsu
SPARC64 is a RISC microprocessor (it uses Sun’s Sparc instruction set). Assume that it requires
20% more instructions to write a program in the Sparc instruction set than the same program
requires in IA-32.


  Question:     Which of the two processors has higher performance?


  Answer:
      SPECint, SPECfp, and SPEC are measures of performance. Therefore, the
      higher the SPEC number, the higher the performance. The Fujitsu SPARC64
      has higher performance.


  Question:     What is the ratio between the CPIs of the two microprocessors?


  Answer:


      We will use a as the subscript for the Athlon and s as the subscript for the
      Sparc.


                Time = (NumInsts × CPI) / ClockSpeed

                CPI = (Time × ClockSpeed) / NumInsts

                CPI = ClockSpeed / (Perf × NumInsts)

                CPI_A / CPI_S = ClockSpeed_A / (Perf_A × NumInsts_A)
                                × (Perf_S × NumInsts_S) / ClockSpeed_S

                ClockSpeed_A = 1.1 GHz
                ClockSpeed_S = 0.675 GHz
                Perf_A       = 409
                Perf_S       = 443
                NumInsts_S   = 1.2 × NumInsts_A

                CPI_A / CPI_S = (1.1 / (409 × NumInsts_A))
                                × ((443 × 1.2 × NumInsts_A) / 0.675)
                              = 2.1
                              = 110% more


      Executing the average Athlon instruction requires 110% more clock cycles
      than executing the average Sparc instruction.

      Stated more awkwardly: executing the average Athlon instruction requires
      210% of the clock cycles required to execute the average Sparc instruction.
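The ratio can be checked numerically; a Python sketch (variable names are mine):

```python
clock_a, clock_s = 1.1e9, 675e6  # clock speeds in Hz
perf_a, perf_s = 409, 443        # SPECint scores
insts_ratio = 1.2                # NumInsts_S = 1.2 × NumInsts_A

# CPI = ClockSpeed / (Perf × NumInsts); the NumInsts_A factor cancels
# out of the ratio CPI_A / CPI_S.
ratio = (clock_a / perf_a) * (perf_s * insts_ratio / clock_s)
print(round(ratio, 1))  # → 2.1
```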


   Question:     Can you determine the absolute (actual) CPI of either microprocessor?


   Answer:
      To determine the absolute CPI, we would need to know the actual number of
      instructions executed by at least one of the processors.


3.4.3 Effect of Instruction Set on Performance

Your group designs a microprocessor and you are considering adding a fused multiply-accumulate
to the instruction set. (A fused multiply accumulate is a single instruction that does both a multiply
and an addition. It is often used in digital signal processing.)

Your studies have shown that, on average, half of the multiply operations are followed by an add
instruction that could be done with a fused multiply-add.

Additionally, you know:

            CPI               % of instructions
  ADD       0.8 × CPIavg      15%
  MUL       1.2 × CPIavg       5%
  Other     1.0 × CPIavg      80%
You have three options:


option 1 : no change.

option 2 : add the MAC instruction; the clock period increases by 20%, and a MAC has the same
     CPI as a MUL.

option 3 : add the MAC instruction; the clock period stays the same, and the CPI of a MAC is
     50% greater than that of a MUL.


   Question:     Which option will result in the highest overall performance?


  Answer:


                                        NumInsts × CPI
                                Time =
                                         ClockSpeed
                                         ClockSpeed
                                 Perf =
                                        NumInsts × CPI


      We need to find NumInsts, CPI, and ClockSpeed for each of the three
      options. Option 1 is the baseline, so we will define values for variables in
      Options 2 and 3 in terms of the Option 1 variables.
      Options 2 and 3 will have the same number of instructions. Half of the
      multiply instructions are followed by an add that can be fused.
      In questions that involve changing both CPI and NumInsts, it is often easiest
      to work with the product of CPI and NumInsts, which represents the total
      number of clock cycles needed to execute the program. Additionally, set the
      problem up with an imaginary program of 100 instructions on the baseline
      system.

                       NumMAC2 =        0.5 × NumMUL1
                               =        0.5 × 5
                               =        2.5
                       NumMUL2 =        0.5 × NumMUL1
                               =        0.5 × 5
                               =        2.5
                       NumADD2 =        NumADD1 − 0.5 × NumMUL1
                               =        15 − 0.5 × 5
                               =        12.5

      Find the total number of clock cycles for each option.

      Cycles1 =     NumMUL1 × CPIMUL + NumADD1 × CPIADD + NumOth1 × CPIOth
              =     (5 × 1.2) + (15 × 0.8) + (80 × 1.0)
              =     98
      Cycles2 =     (NumMAC2 × CPIMAC ) + (NumMUL2 × CPIMUL )
                    +(NumADD2 × CPIADD ) + (NumOth2 × CPIOth )
                =   (2.5 × 1.2) + (2.5 × 1.2) + (12.5 × 0.8) + (80 × 1.0)
                =   96
      Cycles3   =   (NumMAC3 × CPIMAC ) + (NumMUL3 × CPIMUL )
                    +(NumADD3 × CPIADD ) + (NumOth3 × CPIOth )
                =   (2.5 × (1.5 × 1.2)) + (2.5 × 1.2) + (12.5 × 0.8) + (80 × 1.0)
                =   97.5
3.4.4 Effect of Time to Market on Relative Performance                                     237


     Calculate performance for each option using the formula:

                                                      1
                           Performance =
                                            Cycles × ClockPeriod


                              Performance1 =       1/(98 × 1)
                                           =       1/98
                              Performance2 =       1/(96 × 1.2)
                                           =       1/115
                              Performance3 =       1/(97.5 × 1)
                                           =       1/97.5

     The third option is the fastest.
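      The cycle counts and the performance comparison can be verified with a small
      script (a sketch using the solution's imaginary 100-instruction baseline
      program; all names are ours):

      ```python
      # Compare the three MAC options on an imaginary 100-instruction baseline
      # program (instruction counts taken from the solution above).

      def total_cycles(classes):
          """Sum NumInsts × CPI over the instruction classes (CPI in units of CPIavg)."""
          return sum(n * cpi for n, cpi in classes)

      c1 = total_cycles([(5, 1.2), (15, 0.8), (80, 1.0)])                  # baseline
      c2 = total_cycles([(2.5, 1.2), (2.5, 1.2), (12.5, 0.8), (80, 1.0)])  # MAC CPI = MUL CPI
      c3 = total_cycles([(2.5, 1.8), (2.5, 1.2), (12.5, 0.8), (80, 1.0)])  # MAC CPI = 1.5 × MUL

      # Performance is proportional to 1 / (cycles × clock period);
      # option 2 stretches the clock period by 20%.
      perf = {1: 1 / c1, 2: 1 / (c2 * 1.2), 3: 1 / c3}
      best = max(perf, key=perf.get)
      print(c1, c2, c3, best)   # 98.0 96.0 97.5 3
      ```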


3.4.4 Effect of Time to Market on Relative Performance

Assume that performance of the average product in your market segment doubles every 18 months.

You are considering an optimization that will improve the performance of your product by 7%.


  Question: If you add the optimization, how much can you allow your schedule to slip
    before the delay hurts your relative performance compared to not doing the
    optimization and launching the product according to your current schedule?


  Answer:

                           P(t) = performance at time t
                                = P0 × 2^(t/18)

                      From the problem statement:
                           P(t) = 1.07 × P0

                      Equate the two expressions for P(t), then solve for t:

                      1.07 × P0 = P0 × 2^(t/18)
                        2^(t/18) = 1.07
                            t/18 = log2 1.07
                               t = 18 × (log2 1.07)

                      Use: logb x = (log x) / (log b)

                               t = 18 × (log 1.07) / (log 2)
                                 = 1.76 months
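   The slack calculation is easy to reproduce (a sketch; `months` is our name):

   ```python
   import math

   # A 7% performance gain buys 18 × log2(1.07) months of schedule slip
   # when the market doubles performance every 18 months.
   months = 18 * math.log2(1.07)
   print(round(months, 2))   # 1.76
   ```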


3.4.5 Summary of Equations

Time to perform a task:

                                               NumInsts × CPI
                                    Time =
                                                ClockSpeed

Average time to do one of k different tasks:

                                                  k
                                       TAvg = ∑ (%i)(Ti )
                                                 i=1


Performance:

                                                        Work
                                     Performance =
                                                        Time

Speedup:

                                                       TSlow
                                       Speedup =
                                                       TFast

TFast is n% faster than TSlow:

                                                 TSlow − TFast
                                  n% faster =
                                                     TFast

Performance at time t if performance increases by factor of k every n units of time:


                                     Perf (t) = Perf (0) × kt/n
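As an illustration, the summary formulas can be exercised with made-up numbers (all
values below are hypothetical, chosen only to show the formulas in use):

```python
# Exercise the summary formulas with made-up numbers (purely illustrative).
t_slow, t_fast = 200.0, 160.0             # hypothetical task times, in ns

speedup = t_slow / t_fast                 # Speedup = TSlow / TFast
pct_faster = (t_slow - t_fast) / t_fast   # "n% faster" = (TSlow − TFast) / TFast

# Perf(t) = Perf(0) × k^(t/n): doubling (k = 2) every 18 months, for 36 months.
perf_growth = 1.0 * 2 ** (36 / 18)

print(speedup, pct_faster, perf_growth)   # 1.25 0.25 4.0
```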


3.5 Performance Analysis and Dataflow Diagrams
3.5.1 Dataflow Diagrams, CPI, and Clock Speed

One of the challenges in designing a circuit is to choose the clock speed. Increasing the clock
speed of a circuit might not improve its performance. In this section we will work through several
example dataflow diagrams to pick a clock speed for the circuit and schedule operations into clock
cycles.
When partitioning dataflow diagrams into clock cycles, we need to choose a clock period. Choos-
ing a clock period affects many aspects of the design, not just the overall performance. Different
design goals might put conflicting pressure on the clock period: some goals tend toward short
clock periods and some tend toward long clock periods. For performance, not only is the clock
period a poor indicator of the relative performance of two different systems, but even for the
same system, decreasing the clock period might not increase the performance.

  Goal                               Action                  Effect
  Minimize area                      decrease clock period   fewer operations per clock cycle, so
                                                             fewer datapath components and more
                                                             opportunities to reuse hardware
  Increase scheduling flexibility    increase clock period   more flexibility in grouping operations
                                                             into clock cycles
  Decrease percentage of clock       increase clock period   decreases the number of flops that
  cycle spent in flops (overhead:                            data traverses
  time in flops is not doing
  useful work)
  Decrease time to execute an        ????                    depends on dataflow diagram
  instruction

Our general plan to find the clock period for maximum performance is:

   1. Pick clock period to be delay through slowest component + delay through flop.
   2. For each instruction, for each operation, schedule the operation in the earliest clock cycle
      possible without violating clock-period timing constraints.
   3. Calculate average time to execute an instruction as:
                                  NumInsts × CPI
       Combine:       Time =
                                    ClockSpeed
                                   k
             and: CPIavg      =   ∑ %i × CPIi
                                  i=1
                                                   k
                                  NumInsts ×      ∑ %i × CPIi
                                                  i=1
       to derive:     Time =
                                           ClockSpeed


   4. If the maximum latency through the dataflow diagram is greater than 1, then increase the
      clock period by the minimum amount needed to decrease the latency by one clock cycle and
      return to Step 2.

   5. If the maximum latency through the dataflow diagram is 1, then the clock period for highest
      performance is the clock period resulting in the smallest Time.

   6. If possible, adjust the schedule of operations to reduce the maximum number of occurrences
      of a component per instruction per clock cycle without increasing latency for any instruction.
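Steps 1 and 2 of this plan amount to greedily packing each instruction's operations into
clock cycles. The sketch below is our simplification (it models an instruction as a simple
chain of operation delays, whereas real dataflow diagrams are graphs); the function name
and the example delays are ours:

```python
def cycles_for_chain(op_delays, clock_period, reg_delay=5):
    """Greedy earliest-cycle scheduling of a chain of dependent operations.

    Packs consecutive operations into a clock cycle as long as their combined
    delay, plus one register delay, fits in the clock period.  Returns the
    instruction's latency in clock cycles (its CPI).
    """
    budget = clock_period - reg_delay   # combinational time available per cycle
    if any(d > budget for d in op_delays):
        raise ValueError("clock period too short for the slowest operation")
    cycles, used = 1, 0
    for d in op_delays:
        if used + d > budget:           # operation does not fit: start a new cycle
            cycles += 1
            used = 0
        used += d
    return cycles

# Hypothetical instructions: A is a chain of 30, 50, 20, 50 ns operations,
# B is a chain of 40, 50 ns operations, with a 5 ns register delay.
print(cycles_for_chain([30, 50, 20, 50], 55))   # 4 cycles
print(cycles_for_chain([30, 50, 20, 50], 95))   # 2 cycles
print(cycles_for_chain([40, 50], 95))           # 1 cycle
```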


3.5.2 Examples of Dataflow Diagrams for Two Instructions

The circuit supports two instructions, A and B (e.g., multiply and divide). At any point in time,
the circuit is doing either A or B; it does not need to support doing A and B simultaneously.

The diagrams below show the flow for each instruction and the delay through the components
(f,g,h,i) that the instructions use.

The delay through a register is 5ns.

Each instruction (A and B) occurs 50% of the time.

Our goal is to find a clock period and dataflow diagram for the circuit that will give us the highest
overall performance.


                    [Figure: dataflow diagrams.
                     Instruction A: f (30ns) → g (50ns) → h (20ns) → g (50ns)
                     Instruction B: i (40ns) → g (50ns)]


3.5.2.1 Scheduling of Operations for Different Clock Periods

   [Figure: schedules of operations into clock cycles for candidate clock periods
    (register delay 5ns; brackets group the operations scheduled in one clock cycle):

       55ns:   A = [f][g][h][g]   (4 cycles)    B = [i][g]    (2 cycles)
       75ns:   A = [f][g][h,g]    (3 cycles)    B = [i][g]    (2 cycles)
       85ns:   A = [f,g][h,g]     (2 cycles)    B = [i][g]    (2 cycles)
       95ns:   A = [f,g][h,g]     (2 cycles)    B = [i,g]     (1 cycle)
      155ns:   A = [f,g,h,g]      (1 cycle)     B = [i,g]     (1 cycle)]

3.5.2.2 Performance Computation for Different Clock Periods

  Question:      Which clock speed will result in the highest overall performance?


  Answer:
      Clock Period       CPIA     CPIB                    Tavg
          55ns            4        2      55 × (0.5 × 4 + 0.5 × 2)        =   165
          75ns            3        2      75 × (0.5 × 3 + 0.5 × 2)        =   187.5
          85ns            2        2      85 × (0.5 × 2 + 0.5 × 2)        =   170
          95ns            2        1      95 × (0.5 × 2 + 0.5 × 1)        =   143 ←−
         155ns            1        1     155 × (0.5 × 1 + 0.5 × 1)        =   155
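
      The Tavg column can be recomputed mechanically (a sketch; names are ours;
      the 143 entry in the table is 142.5 rounded):

      ```python
      # Tavg = ClockPeriod × (0.5 × CPIA + 0.5 × CPIB) for each candidate period
      # (CPIs from the table above).
      cpis = {55: (4, 2), 75: (3, 2), 85: (2, 2), 95: (2, 1), 155: (1, 1)}

      t_avg = {p: p * (0.5 * ca + 0.5 * cb) for p, (ca, cb) in cpis.items()}
      best = min(t_avg, key=t_avg.get)
      print(best, t_avg[best])   # 95 142.5
      ```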


3.5.2.3 Example: Two Instructions Taking Similar Time

  Question: For the flow below, which clock speed will result in the highest overall
    performance?


  A    B
 30ns 40ns
 50ns 50ns
 20ns 40ns
 50ns


  Answer:


       [Figure: schedules for candidate clock periods (brackets group the
        operations scheduled in one clock cycle):

           55ns:   A = [f][g][h][g]   (4 cycles)    B = [i][g][i]   (3 cycles)
           75ns:   A = [f][g][h,g]    (3 cycles)    B = [i][g][i]   (3 cycles)
           85ns:   A = [f,g][h,g]     (2 cycles)    B = [i][g][i]   (3 cycles)
           95ns:   A = [f,g][h,g]     (2 cycles)    B = [i,g][i]    (2 cycles)
          105ns:   A = [f,g][h,g]     (2 cycles)    B = [i,g][i]    (2 cycles)
          135ns:   A = [f,g,h][g]     (2 cycles)    B = [i,g,i]     (1 cycle)
          155ns:   A = [f,g,h,g]      (1 cycle)     B = [i,g,i]     (1 cycle)]

      We should skip 105 ns, because it has the same latency as 95 ns.

      Clock Period        CPIA    CPIB    Tavg
         55ns              4       3       193
         75ns              3       3       225
         85ns              2       3       213
         95ns              2       2       190
         105ns             2       2     NO GAIN
         135ns             2       1       203
         155ns             1       1       155

     A clock period of 155 ns results in the highest performance.

     For a clock period of 105 ns, we did not calculate the performance, because
     we could see that it would be worse than the performance with a clock period
     of 95 ns. The dataflow diagram with a 105 ns clock period has the same
     latency as the diagram with a clock period of 95 ns. If the data flow diagram
     with the longer clock period has the same latency as the diagram with the
     shorter clock period, then the diagram with the longer clock period will have
     lower performance.
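
      The Tavg column can again be recomputed as a check (a sketch; names are
      ours; the 105 ns option is skipped because, as explained above, it cannot
      beat 95 ns):

      ```python
      # Tavg = ClockPeriod × (0.5 × CPIA + 0.5 × CPIB), CPIs from the table above.
      cpis = {55: (4, 3), 75: (3, 3), 85: (2, 3), 95: (2, 2), 135: (2, 1), 155: (1, 1)}

      t_avg = {p: p * (0.5 * ca + 0.5 * cb) for p, (ca, cb) in cpis.items()}
      best = min(t_avg, key=t_avg.get)
      print(best, t_avg[best])   # 155 155.0
      ```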


3.5.2.4 Example: Same Total Time, Different Order for A

  Question: For the flow below, which clock speed will result in the highest overall
    performance?


   A      B
  30ns   40ns
  20ns   50ns
  50ns   40ns
  50ns

   Answer:

                      Clock Period     CPIA    CPIB      Tavg
                          55ns          3       3       165ns
                          95ns          3       2       238ns
                         105ns          2       2       210ns
                         135ns          2       1       203ns
                         155ns          1       1       155ns

                                          A clock period of 155 ns results in lowest average
                                          execution time, and hence the highest
                                          performance.

                                          This is the same answer as the previous problem,
                                          but the total times for higher clock frequencies
                                          differ significantly between the two problems.


3.5.3 Example: From Algorithm to Optimized Dataflow

This question involves doing some of the design work for a circuit that implements InstP and InstQ
using the components described below.

 Instruction    Algorithm                              Frequency of Occurrence
 InstP          a × b × ((a × b) + (b × d) + e)        75%
 InstQ          (i + j + k + l) × m                    25%

 Component Delays:    2-input Mult    40ns
                      2-input Add     25ns
                      Register         5ns
NOTES
• There is a resource limitation of a maximum of 3 input ports. (There are no other resource
  limitations.)
• You must put registers on your inputs; you do not need to register your outputs.
• The environment will directly connect your outputs (its inputs) to registers.
• Each input value (a, b, c, d, e, i, j, k, l, m) can be input only once — if you need to use a value
  in multiple clock cycles, you must store it in a register.


   Question:     What clock period will result in the best overall performance?


   Answer:


      Algorithm Answers (InstP)

      [Figure: data-dependency graphs and a first dataflow diagram for InstP.
       - InstP data-dep graph: the expression a*b*((a*b) + (b*d) + e) written directly,
         with a*b computed twice.
       - InstP after common-subexpression elimination: a*b is computed once and reused
         by the final multiply, leaving the mults a*b and b*d, two adds, and the final
         mult.
       - InstP alternative data-dependency graph: the adds are reassociated so that
         (b*d) + e is computed first, then added to a*b.
       - InstP first dataflow diagram: clock=45ns, lat=4, T=200.]

      Both options have a critical path of 2 mults + 2 adds. The first option allows
      three operations to be done with just three inputs (a, b, d). The second option
      requires all four inputs to do three operations.


       [Figure: InstP dataflow diagrams for candidate clock periods.
        - InstP: clock=55ns, lat=3, T=165ns.
        - InstP: clock=70ns, lat=2, T=140ns.
        - InstP: illegal: requires 4 inputs (a, b, d, e) in one clock cycle.
        - InstP: dataflow diagram built from the alternative data-dep graph: adds a
          third clock cycle without any gain in clock speed.]

       From the diagrams, it is clear that it is better to put a*b in the first clock
       cycle and e in the second, because a*b can be done in parallel with b*d.

       The fastest option for InstP is a 70ns clock, which gives a total execution
       time of 140 ns.


      Algorithm Answers (InstQ)

      [Figure: data-dependency graphs for InstQ.
       - InstQ data-dep graph with maximum parallelism: (i + j) and (k + l) are added
         in parallel, their sum is formed, then multiplied by m.
       - InstQ alternative data-dep graph: a serial chain ((i + j) + k) + l, then × m.
         This graph is able to do two operations with three inputs, while the first
         graph required four inputs to do two operations. We are limited to three
         inputs, so choose this data-dep graph for the dataflow diagrams.]

      [Figure: InstQ dataflow diagrams.
       - InstQ: clock=50ns, lat=4, T=200ns.
       - InstQ: clock=55ns, lat=3, T=165ns.]


         [Figure: more InstQ dataflow diagrams.
          - InstQ: clock=70ns, lat=2, T=140ns.
          - InstQ: irrelevant: lat did not decrease.
          - InstQ: clock=120ns, lat=1, T=120ns.
          - InstQ: a 70ns schedule of the same graph.]

         The 120ns, lat=1 option is not legal: a single-cycle schedule must read all
         five inputs (i, j, k, l, m) in the same clock cycle, which violates the limit
         of three input ports.



      The fastest legal option for InstQ is a 70ns clock, which gives a total
      execution time of 140 ns.

      Both InstP and InstQ need a 70ns clock period to maximize their
      performance. So, use a 70ns clock, which gives a latency of 2 clock cycles for
      both instructions.
       Fastest execution time 140ns
       Clock period            70ns


  Question: Find a minimal set of resources that will achieve the performance you
    calculated.


  Answer:


      Final dataflow graphs for InstP and InstQ

      [Figure: final 70ns dataflow diagrams.
       - InstP: cycle 1 computes a*b, b*d, and (a*b) + (b*d); cycle 2 adds e and
         performs the final multiply. clock=70ns, lat=2, T=140ns.
       - InstQ: cycle 1 computes (i + j) + k; cycle 2 adds l and multiplies by m.
         clock=70ns, lat=2, T=140ns.]


      We need to do only one of InstP and InstQ at any time, so simply take the max
      of each resource.
                      InstP InstQ System
      Inputs              3     3      3
      Outputs             1     1      1
      Registers           3     3      3
      Adders              2     2      2
      Multipliers         2     1      2
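
      Taking the per-resource max can be expressed directly (a sketch; the
      dictionaries mirror the table above):

      ```python
      # Per-instruction resource needs.  The system executes only one of InstP
      # and InstQ at a time, so it needs the elementwise max, not the sum.
      instp = {"inputs": 3, "outputs": 1, "registers": 3, "adders": 2, "multipliers": 2}
      instq = {"inputs": 3, "outputs": 1, "registers": 3, "adders": 2, "multipliers": 1}

      system = {r: max(instp[r], instq[r]) for r in instp}
      print(system["multipliers"])   # 2 (InstP needs both multipliers)
      ```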


  Question:          Design the datapath and state machine for your design


  Answer:


       [Figure: datapath schedules for InstP and InstQ on the shared datapath
        (inputs i1–i3, registers r1–r3, multipliers m1–m2, adders a1–a2, output o1),
        as encoded in the control tables below.
        InstP: clock=70ns, lat=2, T=140ns.
        InstQ: clock=70ns, lat=2, T=140ns.]




      Control Tables
                              r1             r2             r3           m1                              m2                   a1                 a2
                      ce       mux     ce     mux      ce    mux     src1 src2                       src1 src2         src1     src2      src1        src2
       InstP S0        1         i1     1       i2     1       i3     r1    r2                        r3    a1           –       –        m1          m2
       InstP S1        1         a2     1       i2     1      m1      –     –                         r2    r3          r1       r2         –           –
       InstQ S0        1         i1     1       i2     1       i3     –     –                         a1    r3          r1       r2        a1          r3
       InstQ S1        1         a2     1       i2     1       i3     –     –                          –     –          r1       r2         –           –




      Optimized Control Table
                       r1         r2      r3             m1             m2                                  a1                a2
                      mux        mux     mux         src1 src2      src1 src2                        src1     src2     src1        src2
       InstP S0        i1         i2      i3          r1    r2       a1    r3                         r1       r2      m1          m2
       InstP S1       a2          i2     m1           r1    r2       r2    r3                         r1       r2      m1          m2
       InstQ S0        i1         i2      i3          r1    r2       a1    r3                         r1       r2       a1          r3
       InstQ S1       a2          i2      i3          r1    r2       r2    r3                         r1       r2       a1          r3
3.5.3 Example: From Algorithm to Optimized Dataflow                                                                                       251


     Write VHDL Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
     Use the optimized control table as the basis for the VHDL code.

         process (clk) begin
           if rising_edge(clk) then
             if state=S0 then
               r1 <= i1;
             else
               r1 <= a2;
             end if;
           end if;
         end process;

         process (clk) begin
           if rising_edge(clk) then
             r2 <= i2;
           end if;
         end process;

         process (clk) begin
           if rising_edge(clk) then
             if inst=instP and state=S1 then
               r3 <= m1;
             else
               r3 <= i3;
             end if;
           end if;
         end process;

         m1 <= r1 * r2;
         m2_src1 <=   a1 when state=S0
                 else r2;
         m2 <= m2_src1 * r3;
         a1 <= r1 + r2;
         a2 <= a2_src1 + a2_src2;
         process (inst, m1, m2, a1, r3) begin
           if inst=instP then
             a2_src1 <= m1;
             a2_src2 <= m2;
           else
             a2_src1 <= a1;
             a2_src2 <= r3;
           end if;
         end process;
252                         CHAPTER 3. PERFORMANCE ANALYSIS AND OPTIMIZATION


3.6 General Optimizations
3.6.1 Strength Reduction

Strength reduction replaces an expensive operation with an equivalent, simpler one.


3.6.1.1 Arithmetic Strength Reduction

 Multiply by a constant power of two     wired shift logical left (costs no logic)
 Multiply by a variable power of two     shift-logical-left circuit
 Divide by a constant power of two       wired shift logical right (costs no logic)
 Divide by a variable power of two       shift-logical-right circuit
 Multiply by 3                           wired shift and addition
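The table above can be checked with ordinary integer arithmetic. A quick sketch (the function names are invented for illustration):

```python
# Strength reduction: multiplication/division by a constant power of two
# is the same as shifting, and multiply-by-3 is a shift plus one addition.
def times8(x):      # multiply by constant power of two -> wired shift left
    return x << 3

def div4(x):        # divide by constant power of two -> wired shift right
    return x >> 2

def times3(x):      # multiply by 3 -> wired shift and one addition
    return (x << 1) + x

# Exhaustive check over small values.
for x in range(256):
    assert times8(x) == x * 8
    assert div4(x) == x // 4
    assert times3(x) == x * 3
```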


3.6.1.2 Boolean Strength Reduction

Boolean tests that can be implemented as wires
• is odd, is even : least significant bit
• is neg, is pos : most significant bit
• NOTE: write is odd(a) rather than a(0), so that the code states its intent and survives changes to the signal's representation
By choosing your encodings carefully, you can sometimes reduce a vector comparison to a wire.

For example, if your state uses a one-hot encoding, then the comparison state = S3 reduces
to state(3) = ’1’. You might expect a reasonable logic-synthesis tool to perform this reduction
automatically, but most tools do not.

When using encodings other than one-hot, Karnaugh maps are useful tools for optimizing vector
comparisons. For example, with a full binary encoding for 8 states, a careful choice of state
assignment allows the comparison:


              (state = S0 or state = S3 or state = S4) = ’1’


to be reduced from examining 3 bits to examining just 2 bits. If a condition is true for four
states, we can find an encoding that examines just 1 bit.
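As an illustration (the state codes below are invented): with a one-hot encoding, an equality test is a single wire, and with a binary encoding chosen so that four states of interest share one bit, a four-state condition is also a single wire.

```python
# One-hot: "state = S3" reduces to testing bit 3 of the state vector.
S3_ONEHOT = 1 << 3
def is_S3(state):                     # state is an 8-bit one-hot vector
    return ((state >> 3) & 1) == 1

assert is_S3(S3_ONEHOT)
assert not is_S3(1 << 2)              # some other one-hot state

# Binary: hypothetical assignment in which the four states of interest
# all have bit 2 set, so the four-way condition reduces to one wire.
GROUP = {0b100, 0b101, 0b110, 0b111}
for code in range(8):
    full_compare = code in GROUP                  # 3-bit comparison
    single_wire = ((code >> 2) & 1) == 1          # just bit 2
    assert full_compare == single_wire
```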


3.6.2 Replication and Sharing

3.6.2.1 Mux-Pushing

Pushing multiplexors into the fanin of a signal can reduce area.
Before:
  z <=   a + b when (w = ’1’)
    else a + c;

After:
  tmp <=   b when (w = ’1’)
      else c;
  z   <=   a + tmp;

The first circuit will have two adders, while the second will have one adder. Some synthesis tools
will perform this optimization automatically, particularly if all of the signals are combinational.
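A quick behavioural check of the transformation, with plain Python standing in for the VHDL above:

```python
# Mux-pushing: select between the adder inputs instead of between
# two adder outputs; the behaviour is unchanged.
def push_before(a, b, c, w):
    return a + b if w else a + c      # two adders, mux on their outputs

def push_after(a, b, c, w):
    tmp = b if w else c               # mux pushed into the fanin
    return a + tmp                    # a single adder

# Exhaustive check over small operand values.
for a in range(4):
    for b in range(4):
        for c in range(4):
            for w in (False, True):
                assert push_before(a, b, c, w) == push_after(a, b, c, w)
```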


3.6.2.2 Common Subexpression Elimination

Introduce new signals to capture subexpressions that occur multiple places in the code.
Before:
  y <=     a + b + c when (w = ’1’)
    else   d;
  z <=     a + c + d when (w = ’1’)
    else   e;

After:
  tmp <=   a + c;
  y   <=   b + tmp when (w = ’1’)
    else   d;
  z   <=   d + tmp when (w = ’1’)
    else   e;

         Note: Clocked subexpressions       Care must be taken when doing common
         subexpression elimination in a clocked process. Putting the “temporary” sig-
         nal in the clocked process will add a clock cycle to the latency of the com-
         putation, because the tmp signal will be a flip-flop. The tmp signal must be
         combinational to preserve the behaviour of the circuit.
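The same kind of behavioural check applies to common subexpression elimination; note that tmp is an ordinary combinational value, not stored state:

```python
# Common subexpression elimination: a + c is computed once and shared.
def cse_before(a, b, c, d, e, w):
    y = a + b + c if w else d
    z = a + c + d if w else e
    return y, z

def cse_after(a, b, c, d, e, w):
    tmp = a + c                  # shared subexpression, combinational
    y = b + tmp if w else d
    z = d + tmp if w else e
    return y, z

for vals in ((1, 2, 3, 4, 5), (9, 8, 7, 6, 5), (0, 0, 0, 0, 0)):
    for w in (False, True):
        assert cse_before(*vals, w) == cse_after(*vals, w)
```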


3.6.2.3 Computation Replication
• To improve performance
  – If same result is needed at two very distant locations and wire delays are significant, it might
    improve performance (increase clock speed) to replicate the hardware
• To reduce area
  – If same result is needed at two different times that are widely separated, it might be cheaper to
    reuse the hardware component to repeat the computation than to store the result in a register
         Note: Muxes are not free      Each time a component is reused, multiplexors
         are added to its inputs and/or outputs. Too much sharing of a component can
         cost more area in additional multiplexors than would be spent in replicating
         the component.


3.6.3 Arithmetic

VHDL is left-associative. The expression a + b + c + d is interpreted as (((a + b) +
c) + d), a chain of three adders. You can use parentheses to suggest parallelism: writing
(a + b) + (c + d) suggests a balanced, shallower tree of adders.

Perform arithmetic on the minimum number of bits needed. If you only need the lower 12 bits of a
result, but your input signals are 16 bits wide, trim your inputs to 12 bits. This results in a smaller
and faster design than computing all 16 bits of the result and trimming the result to 12 bits.
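Trimming is safe because addition and multiplication are congruences modulo 2^12: the low 12 bits of the result depend only on the low 12 bits of the inputs. A sketch:

```python
# Arithmetic mod 2**12: trimming the inputs to 12 bits before the
# operation gives the same low 12 bits as computing on full width.
MASK = (1 << 12) - 1   # keep the lower 12 bits

for a in (0x0000, 0x1234, 0xBEEF, 0xFFFF):
    for b in (0x0001, 0x0FFF, 0xA5A5):
        assert (a + b) & MASK == ((a & MASK) + (b & MASK)) & MASK
        assert (a * b) & MASK == ((a & MASK) * (b & MASK)) & MASK
```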



3.7 Retiming

[Figure: circuit in which sel (derived from state) selects between a and b; the mux output x
is added to c to form y, which is registered to produce z. Waveform over states S0 S1 S2 S3,
annotated with settling times: a: α, b: β, c: γ, sel: 1, x: α, y: α+γ, z: α+γ. The critical
path is α+γ.]
process begin
  wait until rising_edge(clk);
  if state = S1 then
    z <= a + c;
  else
    z <= b + c;
  end if;
end process;


Retimed Circuit and Waveform          ........................................................


[Figure: the same circuit with sel registered. Waveform over states S0 S1 S2 S3, annotated
with settling times: a: α, b: β, c: γ; the sel, x, y, and z annotations are left to be filled in.]
Original (sel computed combinationally):

process (state) begin
  if state = S1 then
    sel <= ’1’;
  else
    sel <= ’0’;
  end if;
end process;

process begin
  wait until rising_edge(clk);
  if sel = ’1’ then
    ... -- code for z
  end if;
end process;

Retimed (sel computed one cycle early and registered):

process begin
  wait until rising_edge(clk);
  if state =      then
    sel <= ’1’;
  else
    sel <= ’0’;
  end if;
end process;

process begin
  wait until rising_edge(clk);
  if sel = ’1’ then
    ... -- code for z
  end if;
end process;


3.8 Performance Analysis and Optimization Problems
P3.1 Farmer

A farmer is trying to decide which of his two trucks to use to transport his apples from his orchard
to the market.

Facts:
                 capacity of    speed when loaded   speed when unloaded
                 truck          with apples         (no apples)
  big truck      12 tonnes      15 kph              38 kph
  small truck     6 tonnes      30 kph              70 kph

  distance to market   120 km
  amount of apples      85 tonnes

NOTES:

   1. All of the loads of apples must be carried using the same truck

   2. Elapsed time is counted from the start of delivering the first load until the truck returns
      to the orchard after the last load.

   3. Ignore time spent loading and unloading apples, coffee breaks, refueling, etc.

   4. For each trip, a truck travels at either its fully-loaded speed or its empty speed.


   Question: Which truck will take the least amount of time, and by what percentage will it
     be faster than the other truck?


   Question: In planning ahead for next year, is there anything the farmer could do to
     decrease his delivery time with little or no additional expense? If so, what is it, if not,
     explain.


P3.2 Network and Router

In this question there is a network that runs a protocol called BigLan. You are designing a router
called the DataChopper that routes packets over the network running BigLan (i.e. they’re BigLan
packets).
The BigLan network protocol runs at a data rate of 160 Mbps (megabits per second). Each BigLan
packet contains 100 bytes of routing information and 1000 bytes of data.
You are working on the DataChopper router, which has the following performance numbers:
  75 MHz   clock speed
  4        clock cycles to process one byte of either data or header
  500      additional clock cycles to process the routing information
           for a packet


P3.2.1 Maximum Throughput

Which has a higher maximum throughput (as measured in data bits per second; that is, only the
payload bits count as useful work), the network or your router, and how much faster is it?


P3.2.2 Packet Size and Performance

Explain the effect of an increase in packet length on the performance of the DataChopper (as
measured in the maximum number of bits per second that it can process) assuming the header
remains constant at 100 bytes.


P3.3 Performance Short Answer

If performance doubles every two years, by what percentage does performance go up every month?
This question is similar to compound-growth problems from your economics class.
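The compound-growth relation needed here can be sketched in a few lines; the numbers below (a 10x gain over 5 periods) are made up and are deliberately not the answer to the question.

```python
# Compound growth: if performance multiplies by a total factor F over
# n periods, the per-period factor r satisfies r**n == F.
F, n = 10.0, 5                   # hypothetical: 10x gain over 5 periods
r = F ** (1 / n)                 # per-period growth factor
assert abs(r ** n - F) < 1e-9    # compounding recovers the total gain

pct = (r - 1) * 100              # growth per period, as a percentage
# r is about 1.585, i.e. roughly 58.5% growth per period
```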


P3.4 Microprocessors

The Yme microprocessor is very small and inexpensive. One performance sacrifice the designers
have made is to not include a multiply instruction. Multiplies must be written in software using
loops of shifts and adds.
The Yme currently ships at a clock frequency of 200MHz and has an average CPI of 4.
A competitor sells the Y!v1 microprocessor, which supports exactly the same instructions as the
Yme. The Y!v1 runs at 150MHz, and the average program is 10% faster on the Yme than it is on
the Y!v1.


P3.4.1 Average CPI

   Question:    What is the average CPI for the Y!v1? If you don’t have enough
     information to answer this question, explain what additional information you need
     and how you would use it.


A new version of the Y!, the Y!u2 has just been announced. The Y!u2 includes a multiply
instruction and runs at 180MHz. The Y!u2 publicity brochures claim that using their multiply
instruction, rather than shift/add loops, can eliminate 10% of the instructions in the average pro-
gram. The brochures also claim that the average performance of Y!u2 is 30% better than that of
the Y!v1.


P3.4.2 Why not you too?

   Question: Assuming the advertising claims are true, what is the average CPI for the
     Y!u2? If you don’t have enough information to answer this question, explain what
     additional information you need and how you would use it.


P3.4.3 Analysis

   Question:    Which of the following do you think is most likely, and why?


   1. the Y!u2 is basically the same as the Y!v1 except for the multiply

   2. the Y!u2 designers made performance sacrifices in their design in order to include a multiply
      instruction

   3. the Y!u2 designers performed other significant optimizations in addition to creating a mul-
      tiply instruction


P3.5 Dataflow Diagram Optimization

Draw an optimized dataflow diagram that improves the performance and produces the same output
values. Or, if the performance cannot be improved, describe the limiting factor on the performance.

NOTES:
• you may change the times when signals are read from the environment
• you may not increase the resource usage (input ports, registers, output ports, f components, g
  components)
• you may not increase the clock period


[Figure: dataflow diagrams labelled “Before Optimization” (left) and “After Optimization”
(right), built from f and g components operating on inputs a, b, c, d, e.]


P3.6 Performance Optimization with Memory Arrays

This question deals with the implementation and optimization of the algorithm shown below,
using the library of circuit components listed beside it.

                     Algorithm
q = M[b];
if (a > b) then
  M[a] = b;
  p = (M[b-1] * b) + M[b];
else
  M[a] = b;
  p = M[b+1] * a;
end;

 Component                            Delay
 Register                              5 ns
 Adder                                25 ns
 Subtracter                           30 ns
 ALU with +, −, >, =, −, AND, XOR     40 ns
 Memory read                          60 ns
 Memory write                         60 ns
 Multiplication                       65 ns
 2:1 Multiplexor                       5 ns
NOTES:
  1. 25% of the time, a > b
  2. The inputs of the algorithm are a and b.
  3. The outputs of the algorithm are p and q.
  4. You must register both your inputs and outputs.
  5. You may choose to read your input data values at any time and produce your outputs at any
     time. For your inputs, you may read each value only once (i.e. the environment will not send
     multiple copies of the same value).
  6. Execution time is measured from when you read your first input until the later of producing
     your last output or completing the write of a result to memory.


   7. M is an internal memory array, which must be implemented as dual-ported memory with one
      read/write port and one write port.
   8. Assume all memory address and other arithmetic calculations are within the range of repre-
      sentable numbers (i.e. no overflows occur).
   9. If you need a circuit not on the list above, assume that its delay is 30 ns.
 10. Your dataflow diagram must include circuitry for computing a > b and using the result to
     choose the value for p

Draw a dataflow diagram for each operation that is optimized for the fastest overall execution time.

NOTE: You may sacrifice area efficiency to achieve high performance, but marks will be deducted
for extra hardware that does not contribute to performance.


P3.7 Multiply Instruction

You are part of the design team for a microprocessor implemented on an FPGA. You currently im-
plement your multiply instruction completely on the FPGA. You are considering using a special-
ized multiply chip to do the multiplication. Your task is to evaluate the performance and optimality
tradeoffs between keeping the multiply circuitry on the FPGA or using the external multiplier chip.

If you use the multiplier chip, it will reduce the CPI of the multiply instruction, but will not
change the CPI of any other instruction. Using the multiplier chip will also force the FPGA to
run at a slower clock speed.

[Figure: the FPGA option is a single FPGA; the FPGA + MULT option adds an external MULT chip
beside the FPGA.]

                                     FPGA option    FPGA + MULT option
  average CPI                             5                ???
  % of instrs that are multiplies        10%               10%
  CPI of multiply                        20                  6
  Clock speed                         200 MHz            160 MHz
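Problems like this one rest on the relation MIPS = clock frequency / (CPI x 10^6) and on the weighted-average CPI. A sketch with invented numbers (deliberately not those in the table):

```python
# MIPS from clock frequency and average CPI.
def mips(clock_hz, cpi):
    return clock_hz / (cpi * 1e6)

# Hypothetical machine: 100 MHz with average CPI 2 gives 50 MIPS.
assert mips(100e6, 2) == 50.0

# Average CPI from a per-class breakdown: if a fraction frac_mul of
# instructions are multiplies with CPI cpi_mul, and the rest have
# CPI cpi_other, the average is the weighted sum.
def avg_cpi(frac_mul, cpi_mul, cpi_other):
    return frac_mul * cpi_mul + (1 - frac_mul) * cpi_other

# e.g. 10% multiplies at CPI 20, everything else at CPI 3:
assert abs(avg_cpi(0.1, 20, 3) - 4.7) < 1e-12
```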


P3.7.1 Highest Performance

Which option, FPGA or FPGA+MULT, gives the higher performance (as measured in MIPs), and
what percentage faster is the higher-performance option?


P3.7.2 Performance Metrics

Explain whether MIPs is a good choice for the performance metric when making this decision.
Chapter 4

Functional Verification

4.1 Introduction
4.1.1 Purpose

The purpose of this chapter is to illustrate techniques to quickly and reliably detect bugs in datapath
and control circuits.

Section 4.5 discusses verification of datapath circuits and introduces the notions of testbench, spec-
ification, and implementation. In section 4.6 we discuss techniques that are useful for debugging
control circuits.

The verification guild website:


             http://www.janick.bergeron.com/guild/default.htm


is a good source of information on functional verification.



4.2 Overview
The purpose of functional verification is to detect and correct errors that cause a system to produce
erroneous results. The terminology for validation, verification, and testing differs somewhat from
discipline to discipline. In this section we outline some of the terminology differences and describe
the terminology used in E&CE 327. We then describe some of the reasons that chips tend to work
incorrectly.




                                                 263


4.2.1 Terminology: Validation / Verification / Testing
functional validation
     Comparing the behaviour of a design against the customer’s expectations. In validation, the
     “specification” is the customer. There is no specification that can be used to evaluate the
     correctness of the design (implementation).
functional verification
     Comparing the behaviour of a design (e.g. RTL code) against a specification (e.g. high-level
     model) or collection of properties
     • usually treats combinational circuitry as having zero-delay
     • usually done by simulating circuit with test vectors
     • big challenges are simulation speed and test generation
formal verification
    checking that a design has the correct behaviour for every possible input and internal state

      • uses mathematics to reason about circuit, rather than checking individual vectors of 1s and
        0s
      • capacity problems: only usable on detailed models of small circuits or abstract models of
        large circuits
      • mostly a research topic, but some practical applications have been demonstrated
      • tools include model checking and theorem proving
      • formal verification is not a guarantee that the circuit will work correctly
performance validation
     checking that implementation has (at least) desired performance
power validation
    checking that implementation has (at most) desired power
equivalence verification (checking)
     checking that the design generated by a synthesis tool has same behaviour as RTL code.
timing verification
     checking that all of the paths in a circuit meet the timing constraints


Hardware vs Software Terminology         ....................................................


Note: in software “testing” refers to running programs with specific inputs and checking if the
program does the right thing. In hardware, “testing” usually means “manufacturing testing”, which
is checking the circuits that come off of the manufacturing line.


4.2.2 The Difficulty of Designing Correct Chips

4.2.2.1 Notes from Kenn Heinrich (UW E&CE grad)

“Everyone should get a lecture on why their first industrial design won’t work in the field.”

Here are a few reasons why getting a single system to work correctly for a few minutes in a
university lab is much easier than getting thousands of systems to work correctly for months at
a time in dozens of countries around the world.

   1. You forgot to make your “unreachable” states transition to the initial (reset) state. Clock
      glitches, power surges, etc. will occasionally cause your system to jump to a state that isn’t
      defined or produce an illegal data value. When this happens, your design should reset itself,
      rather than crash or generate illegal outputs.
   2. You have internal registers that you can’t access or test. If you can set a register you must
      have some way of reading the register from outside the chip.
   3. Another chip controls your chip, and the other chip is buggy. All of your external control
      lines should be able to be disabled, so that you can isolate the source of problems.
   4. Not enough decoupling capacitors on your board. The analog world is cruel and un-
      usual. Voltage spikes, current surges, crosstalk, etc. can all corrupt the integrity of digital
      signals. Trying to save a few cents on decoupling capacitors can cause headaches and sig-
      nificant financial costs in the future.
   5. You only tested your system in the lab, not in the real world. As a product, systems will
      need to run for months in the field; simulation and simple lab testing won’t catch all of the
      weirdness of the real world.
   6. You didn’t adequately test the corner cases and boundary conditions. Every corner case is as
      important as the main case. Even if some weird event happens only once every six months,
      if you do not handle it correctly, the bug can still make your system unusable and unsellable.


4.2.2.2 Notes from Aart de Geus (Chairman and CEO of Synopsys)

More than 60% of the ASIC designs that are fabricated have at least one error, issue, or problem
severe enough to force the design to be reworked.

Even experienced designers have difficulty building chips that function correctly on the first pass
(Figure 4.1).



                                       61% of new chip designs require at least one re-spin
       At least one
       error/issue/problem (61%)

    Functional logic error    (43%)
      Analog tuning issue     (20%)
     Signal integrity issue   (17%)
      Clock scheme error      (14%)
          Reliability issue   (12%)
    Mixed-signal problem      (11%)
    Uses too much power       (11%)
 Timing issue (slow paths)    (10%)
  Timing issue (fast paths)   (10%)
            IR drop issues    (7%)
           Firmware error     (4%)
           Other problem      (3%)

                                       10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

Source: Aart de Geus, Chairman and CEO of Synopsys. Keynote address. Synopsys Users’
Group Meeting, Sep 9 2003, Boston USA.

                 Figure 4.1: Problems found on first-spins of new chip designs



4.3 Test Cases and Coverage
4.3.1 Test Terminology
Test case / test vector :
     A combination of inputs and internal state values. Represents one possible test of the system.
Boundary conditions / corner cases :
    A test case that represents an unusual situation on input and/or internal state signals. Corner
    cases are likely to contain bugs.
Test scenario :
      A sequence of test vectors that, together, exercise a particular situation (scenario) on a circuit.
      For example, a scenario for an elevator controller might include a sequence of button pushes
      and movements between floors.
Test suite :
      A collection of test vectors that are run on a circuit.


4.3.2 Coverage

To be absolutely certain that an implementation is correct, we must check every combination of
values. This includes both input values and internal state (flip flops).

If we have ni bits of inputs and ns bits in flip-flops, we have to test 2^(ni+ns) different cases
when doing functional verification.
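The count 2^(ni+ns) is just exhaustive enumeration over every input and flip-flop bit. For a toy-sized circuit it is feasible to generate all the cases directly (the sizes below are invented):

```python
# Exhaustive verification case count: ni input bits and ns state bits
# give 2**(ni + ns) combinations to simulate.
from itertools import product

ni, ns = 3, 2
cases = list(product([0, 1], repeat=ni + ns))
assert len(cases) == 2 ** (ni + ns)   # 32 cases at this toy size

# Combinational signals add no cases: their values are functions of
# the inputs and flip-flops, so they are already determined.
```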


  Question:   If we have nc combinational signals, why don’t we have to test
    2^(ni+ns+nc) different cases?


  Answer:
     The value of each combinational signal is determined by the flip flops and
    inputs in its fanin. Once the values of the inputs and flip flops are known, the
    value of each combinational signal can be calculated. Thus, the
    combinational signals do not add additional cases that we need to consider.


  Definition Coverage: The coverage that a suite of tests achieves on a circuit is the
    percentage of cases that are simulated by the tests. 100% coverage means that the
    circuit has been simulated for all combinations of values for input signals and internal
    signals.


         Note: Coverage Terminology         There are many different types of coverage,
         which measure everything from percentage of cases that are exercised to num-
         ber of output values that are exercised.

There are many different commercial software programs that measure code and other types of
coverage.
   Company                    Tool                      Coverage
 Cadence           Affirma Coverage Analyzer
 Cadence           DAI Coverscan                 code, expressions, fsm
 Cadence           Codecover                     code, expressions, fsm
 Fintronic         FinCov                        code
 Summit Design     HDLScore                      code, events, variables
 Synopsys          CoverMeter                    code coverage (dead?)
 TransEDA          Verification Navigator         code and fsm
 Verisity          SureCov                       code, block, values, fsm
 Veritools         Express VCT, VeriCover        code, branch
 Aldec             Riviera                       code, block


4.3.3 Floating Point Divider Example

This example illustrates the difficulty of achieving significant coverage on realistic circuits.

Consider doing the functional simulation for a double precision (64-bit) floating-point divider.


                                  Given Information
             Data width                                                  64 bits
             Number of gates in circuit                                   10 000
             Assembly-language instructions to simulate one gate
             for one test case                                               100
             Clock cycles to execute one assembly-language
             instruction on the computer running the simulation              0.5
             Clock speed of the computer running the simulation      1 gigahertz


Number of Cases      ......................................................................



   Question:     How many cases must be considered?


   Answer:



                                  item   bits   num values
                                  src1    64    2^64 = 1.8E+19
                                  src2    64    2^64 = 1.8E+19


       NumTestsTot = NumInputCases × NumStateCases

                   = (2^64 × 2^64) × (2^0)

                   = 3.4E+38 cases
4.3.3 Floating Point Divider Example                                                                                                     269


Simulation Run Time     ................................................................... .



  Question: How long will it take to simulate all of the different possible cases using a
    single computer?


  Answer:


       1. Calculate number of seconds to simulate one test case
            TestTime1:1 = 10000 gates × 100 instrs/gate × 0.5 cycles/instr × 1E−9 secs/cycle
                        = 5E−4 secs
       2. Number of tests per year
            NumTests:1 = (60 secs/min × 60 mins/hour × 24 hours/day × 365.25 days/year)
                           / TestTime1:1
                       ≈ (SpeedOfLight in m/s) / TestTime1:1
                       = 3E+8 / 5E−4 secs
                       = 6E+11 cases/year
       3. Number of years to test all cases
            TestTimeTot = NumTestsTot / NumTests:1
                        = 3.4E+38 cases / 6E+11 cases/year
                        = 5.6E+26 years
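The first steps of this arithmetic are easy to check mechanically; a back-of-envelope sketch using the constants given above:

```python
# Per-case simulation time from the given constants.
gates = 10_000
instrs_per_gate = 100
cycles_per_instr = 0.5
secs_per_cycle = 1e-9                  # 1 GHz host clock

test_time = gates * instrs_per_gate * cycles_per_instr * secs_per_cycle
assert abs(test_time - 5e-4) < 1e-15   # 0.5 ms per test case

# Total case count: all (src1, src2) pairs of 64-bit operands.
cases = 2 ** 64 * 2 ** 64
assert abs(cases - 3.4e38) / 3.4e38 < 0.001   # about 3.4E+38 cases
```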


Coverage    ............................................................................ .



  Question: If you can run simulations non-stop for one year on ten computers, what
    coverage will you achieve?


  Answer:


        1. Number of tests per year using ten computers
             NumTests:10 = 10 × NumTests:1

                         = 10 × 6E+11 cases/year

                         = 6E+12 cases/year
        2. Calculate coverage achieved by running tests on ten computers for one
           year
             Covg = NumTestsRun / NumTestsTot
                  = NumTests:10 / NumTestsTot
                  = 6E+12 / 3.4E+38
                  = 2E−26
                  = 0.000000000000000000000002%

      The message is that, even with large amounts of computing resources, it is
      difficult to achieve numerically significant coverage for realistic circuits.

      An effective functional verification plan requires carefully chosen test cases,
      so that even the minuscule amount of coverage that is realistically achievable
      catches most (all?!?!) of the bugs in the design.


Simulation vs the Real World       ......................................................... .


From Validating the Intel(R) Pentium(R) 4 Microprocessor by Bob Bentley, Design Automation Con-
ference 2001. (Link on E&CE 327 web page.)
• Simulating the Pentium 4 Processor on a Pentium 3 Processor ran at about 15 MHz.
• By tapeout, over 200 billion simulation cycles had been run on a network of computers.
• All of these simulations represent less than two minutes of running a real processor.



4.4 Testbenches
A test bench (also known as a “test rig”, “test harness”, or “test jig”) is a collection of code used
to simulate a circuit and check if it works correctly.

Testbenches are not synthesized. You do not need to restrict yourself to the synthesizable subset of
VHDL. Use the full power of VHDL to make your testbenches concise and powerful.


4.4.1 Overview of Test Benches

                           testbench
                                            specification

                              stimulus                         check


                                           implementation




Implementation Circuit that you’re checking for bugs
    also known as: “design under test” or “unit under test”
Stimulus Generates test vectors
Specification Describes desired behaviour of implementation
Check Checks whether implementation obeys specification


Notes and observations      ............................................................... .

• Testbenches usually do not have any inputs or outputs.
  – Inputs are generated by stimulus
  – Outputs are analyzed by check and relevant information is printed using report statements
• Different circuits will use different stimuli, specifications, and checks.
• The roles of the specification and check are somewhat flexible.
  – Most circuits will have complex specifications and simple checks.
  – However, some circuits will have simple specifications and complex checks.
• If two circuits are supposed to have the same behaviour, then they can use the same stimuli,
  specification, and check.
• If two circuits are supposed to have the same behaviour, then one can be used as the specification
  for the other.
• Testbenches are restricted to stimulating only primary inputs and observing only primary out-
  puts. To check the behaviour of internal signals, use assertions.


4.4.2 Reference Model Style Testbench

 reference model testbench
                         specification

      stimulus


                        implementation




• Specification has same inputs and outputs as implementation.
• Specification is a clock-cycle accurate description of desired behaviour of implementation.
• Check is an equality test between outputs of specification and implementation.


Examples          ............................................................................ .

• Execution modules: output is sum, difference, product, quotient, etc. of inputs
• DSP filters
• Instruction decoders
         Note: “Functional specification” vs “Reference model”       Functional specifi-
         cation and reference model are often used interchangeably.


4.4.3 Relational Style Testbench

 relational testbench



      stimulus                           check



                        implementation




• Relational testbenches (or relational specifications) are used when we do not want to specify the
  exact output values that the implementation must produce.
• Instead, we want to check that some relationship holds between the outputs and the inputs, or
  that some relationship holds amongst the output values (independent of the values of the input
  signals).
• Specification is usually just wires to feed the input signals to the check.
• Check is the brains and encodes the desired behaviour of the circuit.


Examples     ............................................................................ .

• Carry-save adders: the two outputs together are the sum of the three inputs, but we do not
  specify the exact value of each individual output.
• Arbiters: every request is eventually granted, but do not specify in which order requests are
  granted.
• One-hot encoding: exactly one bit of vector is a ’1’, but do not specify which bit is a ’1’.
         Note: “Relational specification” vs “relational testbench” Relational speci-
         fication and relational testbench are often used interchangeably.
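As an illustration of the carry-save-adder relation above (sketched here in Python rather than VHDL, with hypothetical names), the check constrains only the relationship sum + 2·carry = a + b + c, not the individual outputs:

```python
def check_csa(a: int, b: int, c: int, sum_out: int, carry_out: int) -> bool:
    """Relational check: do not predict the individual outputs,
    only require that sum_out + 2*carry_out equals a + b + c."""
    return sum_out + 2 * carry_out == a + b + c

# A bitwise carry-save adder: sum = a xor b xor c, carry = majority(a, b, c).
a, b, c = 0b1011, 0b0110, 0b1100
sum_out = a ^ b ^ c
carry_out = (a & b) | (b & c) | (a & c)
print(check_csa(a, b, c, sum_out, carry_out))   # True
```

Any implementation that satisfies the relation passes, regardless of how it splits the value between its two outputs.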


4.4.4 Coding Structure of a Testbench

architecture main of athabasca_tb is
  component declaration for implementation;
  other declarations
begin
  implementation instantiation;
  stimulus process;
  specification process (or component instantiation);
  check process;
end main;


4.4.5 Datapath vs Control

Datapath and control circuits tend to use different styles of testbenches.

Datapath circuits tend to be well-suited to reference-model style testbenches:
• Each set of inputs generates one set of outputs
• Each set of outputs is a function of just one set of inputs

Control circuits often pose problems for testbenches:
• Many more internal signals than outputs.
• The behaviour of the outputs provides a view into only a fragment of the current state of the
  circuit.
• It may take many clock cycles from when a bug is exercised inside the circuit until it generates
  a deviation from the correct behaviour on the outputs.
• When the deviation on the outputs is observed, it is very difficult to pinpoint the precise cause
  of the deviation (the root cause of the bug).

Assertions can be used to check the behaviour of internal signals. Control circuits tend to use
assertions to check correctness and rely on testbenches only to stimulate inputs.


4.4.6 Verification Tips

Suggested order of simulation for functional verification.

   1. Write high-level model.
   2. Simulate high-level model until have correct functionality and latency.
   3. Write synthesizable model.
   4. Use zero-delay simulation (uw-sim) to check behaviour of synthesizable model against
      high-level model.
   5. Optimize the synthesizable model.
   6. Use zero-delay simulation (uw-sim) to check behaviour of optimized model against high-
      level model.
   7. Use timing-simulation (uw-timsim) to check behaviour of optimized model against high-
      level model.

Section 4.5 describes a series of testbenches that are particularly useful for debugging datapath
circuits in the early phases of the design cycle.



4.5 Functional Verification for Datapath Circuits
In this section we will incrementally develop a testbench for a very simple circuit: an AND gate.

Although the example circuit is trivial in size, the process scales well to very large circuits. The
process allows verification to begin as soon as a circuit is simulatable, even before a complete
specification has been written.


Implementation      .......................................................................


entity and2 is
  port (
    a, b : in std_logic;
    c    : out std_logic
  );
end and2;

architecture main of and2 is
begin
  c <=   ’1’ when (a = ’1’ AND b = ’1’)
    else ’0’;
end main;


4.5.1 A Spec-Less Testbench

(NOTE: this code has been reviewed manually but has not been simulated. The concepts are
illustrated correctly, but there might be typographical errors in the code.)

First, use waveform viewer to check that implementation generates reasonable outputs for a small
set of inputs.

entity and2_tb is
end and2_tb;

architecture main_tb of and2_tb is
  component and2
    port (
      a, b : in std_logic;
      c    : out std_logic
    );
   end component;
  signal ta, tb, tc_impl : std_logic;
  signal ok : boolean;
begin
  ---------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  ---------------------------------------------
  stimulus : process
  begin
    ta <= ’0’; tb <= ’0’;
    wait for 10 ns;
    ta <= ’1’; tb <= ’1’;
    wait for 10 ns;
  end process;
  ---------------------------------------------
end main_tb;
Use the spec-less testbench until the implementation generates solid Boolean values (no X or U
data) and you have checked that a few simple test cases generate correct outputs.


4.5.2 Use an Array for Test Vectors

Writing code to drive inputs and repetitively typing wait for 10 ns; can get tedious, so code
up test vectors in an array.

(NOTE: this code has not been checked for correctness)

architecture main_tb of and2_tb is
  ...
begin
  ...
  stimulus : process
    type test_datum_ty is record
      ra, rb : std_logic;
    end record;
    type test_vectors_ty is array(natural range <>) of test_datum_ty;
    constant test_vectors : test_vectors_ty :=
      --   a    b
      ( ( ’0’, ’0’),
        ( ’1’, ’1’)
      );
  begin
    for i in test_vectors’low to test_vectors’high loop
      ta <= test_vectors(i).ra;
      tb <= test_vectors(i).rb;
      wait for 10 ns;
    end loop;
  end process;
end main_tb;
Use this testbench until checking the correctness of the outputs by hand using waveform viewer
becomes difficult.


4.5.3 Build Spec into Stimulus

(NOTE: this code has not been checked for correctness)

After a few test vectors appear to be working correctly (via a manual check of waveforms on
simulation), begin automatically checking that outputs are correct.
• Add expected result to stimulus
• Add check process
architecture main_tb of and2_tb is
  ...
begin
  ------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  ------------------------------------------
  stimulus : process
    type test_datum_ty is record
      ra, rb, rc : std_logic;
    end record;
    type test_vectors_ty is array(natural range <>) of test_datum_ty;
    constant test_vectors : test_vectors_ty :=
      --   a, b: inputs
      --   c   : expected output
      --   a    b    c
      ( ( ’0’, ’0’, ’0’),
        ( ’0’, ’1’, ’0’),
        ( ’1’, ’1’, ’1’)
      );
  begin
    for i in test_vectors’low to test_vectors’high loop
      ta      <= test_vectors(i).ra;
      tb      <= test_vectors(i).rb;
      tc_spec <= test_vectors(i).rc;
      wait for 10 ns;
    end loop;
  end process;
  ------------------------------------------
  check : process (tc_impl, tc_spec)
  begin
    ok <= (tc_impl = tc_spec);
  end process;
  ------------------------------------------
end main_tb;

Use this testbench until it becomes tedious to calculate manually the correct result for each test
case.


4.5.4 Have Separate Specification Entity

Rather than write the specification as part of stimulus, create separate specification entity/architecture.
The specification component then calculates the expected output values.

(NOTE: if your simulation tool supports configurations, the spec and impl can share the same
entity; we’ll see this in Section 4.6.)


entity and2_spec is
  ...(same as and2 entity)...
end and2_spec;

architecture spec of and2_spec is
begin
  c <= a AND b;
end spec;


architecture main_tb of and2_tb is
  component and2 ...;
  component and2_spec ...;
  signal ta, tb, tc_impl, tc_spec : std_logic;
  signal ok : boolean;
begin
  ------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  spec : and2_spec port map (a => ta, b => tb, c => tc_spec);
  ------------------------------------------

  stimulus : process

    type test_datum_ty is record
      ra, rb : std_logic;
    end record;
    type test_vectors_ty is array(natural range <>) of test_datum_ty;
    constant test_vectors : test_vectors_ty :=
      --   a    b
      ( ( ’0’, ’0’),
        ( ’1’, ’1’)
      );
  begin
    for i in test_vectors’low to test_vectors’high loop
      ta      <= test_vectors(i).ra;
      tb      <= test_vectors(i).rb;
      wait for 10 ns;
    end loop;
  end process;
  ------------------------------------------
  check : process (tc_impl, tc_spec)
  begin
    ok <= (tc_impl = tc_spec);
  end process;
  ------------------------------------------
end main_tb;


4.5.5 Generate Test Vectors Automatically

When it becomes tedious to write out each test vector by hand, we can automatically compute them.
This example uses a pair of nested for loops to generate all four permutations of input values
for two signals.

architecture main_tb of and2_tb is
  ...
begin
  ...
  stimulus : process
    subtype std_test_ty is std_logic range ’0’ to ’1’;
  begin
    for va in std_test_ty’low to std_test_ty’high loop
      for vb in std_test_ty’low to std_test_ty’high loop
        ta <= va;
        tb <= vb;
        wait for 10 ns;
      end loop;
    end loop;
  end process;
  ...
end main_tb;


4.5.6 Relational Specification

architecture main_tb of and2_tb is
  ...
begin
  ------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  ------------------------------------------
  stimulus : process
    ...
  end process;
  ------------------------------------------
  check : process (ta, tb, tc_impl)
  begin
    ok <= NOT (tc_impl = ’1’ AND (ta = ’0’ OR tb = ’0’));
  end process;
  ------------------------------------------
end main_tb;


4.6 Functional Verification of Control Circuits
Control circuits are often more challenging to verify than datapath circuits.
• Control circuits have many internal signals. Testbenches are unable to access key information
   about the behaviour of a control circuit.
• Many clock cycles can elapse between when a bug causes an internal signal to have an incorrect
   value and when an output signal shows the effect of the bug.
In this section, we will explore the functional verification of state machines via a First-In First-Out
queue.

The VHDL code for the queue is on the web at:

               http://www.ece.uwaterloo.ca/~ece327/exs/queue


4.6.1 Overview of Queues in Hardware


                 [Figure 4.2: Structure of queue (data is written into one end
                  of the queue and read out of the other)]

                 [Figure 4.3: Write Sequence (queue contents after Empty,
                  Write 1, and Write 2)]

        [Figure 4.4: A Second Example Write]    [Figure 4.5: Example Read Sequence]

        [Figure 4.6: Write Illustrating Index Wrap]
        [Figure 4.7: Write Illustrating Full Queue]

        [Figure 4.8: Queue Signals (do_rd, do_wr, data_wr, data_rd, rd_idx,
         wr_idx, mem, and empty)]

        [Figure 4.9: Incomplete Queue Blocks (mem with write enable WE, write
         port A0/DI0/DO0, and read port A1/DO1 driving data_rd)]


Control circuitry not shown.


4.6.2 VHDL Coding

4.6.2.1 Package

Things to notice in queue package:

  1. separation of package and body

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package queue_pkg is
  subtype data is std_logic_vector(3 downto 0);
  function to_data(i : integer) return data;
end queue_pkg;

package body queue_pkg is
  function to_data(i : integer) return data is
  begin
    return std_logic_vector(to_unsigned(i, 4));
  end to_data;
end queue_pkg;


4.6.2.2 Other VHDL Coding

VHDL coding techniques to notice in queue implementation:

  1. type declaration for vectors
  2. attributes
       (a) ’low, ’high, ’length,

  3. functions (reduce overall implementation and maintenance effort)
       (a) reduce redundant code
       (b) hide implementation details
       (c) (just like software engineering....)


4.6.3 Code Structure for Verification

Verification things to notice in queue implementation:

  1. instrumentation code
  2. coverage monitors
  3. assertions


architecture ... is
  ...
begin
  ... normal implementation ...
  process (clk)
  begin
    if rising_edge(clk) then
      ... instrumentation code ...
      prev_signame <= signame;
    end if;
  end process;
  ... assertions ...
  ... coverage monitors ...
end;


4.6.4 Instrumentation Code
•   Added to implementation to support verification
•   Usually keeps track of previous values of signals
•   Does not create hardware (Optimized away during synthesis)
•   Does not feed any output signals
•   Must use synthesizable subset of VHDL
    process (clk) begin
      if rising_edge(clk) then
        prev_rd_idx <= rd_idx;
        prev_wr_idx <= wr_idx;
        prev_do_rd <= do_rd;
        prev_do_wr <= do_wr;
      end if;
    end process;

          Note: Naming convention for instrumentation       For assertions, signals
          are named prev_signame and signame, rather than next_signame
          and signame as is done for state machines. This is because for assertions
          we use the prev signals as history signals, to keep track of past events. In
          contrast, for state machines, we name the signals next, because the state
          machine computes the next values of signals.


4.6.5 Coverage Monitors

The goal of a coverage monitor is to check whether a certain event is exercised in a simulation run.
If a test suite does not trigger a coverage monitor, then we probably want to add a test vector that
will trigger the monitor.


For example, for a circuit used in a microwave oven controller, we might want to make sure that
we simulate the situation when the door is opened while the power is on.

  1. Identify important events, conditions, transitions
  2. Write instrumentation code to detect event
  3. Use report to write when event happens
  4. When running a simulation, the report statements will print when a coverage condition is
     detected
  5. Pipe simulation results to log file
  6. Examine log file and coverage monitors to find cases and transitions not tested by existing
     test vectors
  7. Add test vectors to exercise missing cases
  8. Idea: automate detection of missing cases using Perl script to find coverage messages in
     VHDL code that aren’t in log file
  9. Real world: most commercial synthesis tools come with add-on packages that provide dif-
     ferent types of coverage analysis
 10. Research/entrepreneurial idea: based on missing coverage cases, find new test vectors to
     exercise case
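The automation idea in step 8 can be sketched as a short script (written here in Python rather than Perl; it assumes coverage monitors print messages of the form report "coverage: ...", and the source and log contents below are hypothetical):

```python
import re

def missing_coverage(vhdl_text: str, log_text: str) -> list[str]:
    """Return the coverage messages that appear in the VHDL source
    (as report "coverage: ..." statements) but never show up in the
    simulation log."""
    declared = set(re.findall(r'"(coverage:[^"]*)"', vhdl_text))
    return sorted(msg for msg in declared if msg not in log_text)

# Hypothetical source and log contents:
vhdl = 'report "coverage: read caught write"; report "coverage: rd wraps";'
log  = '# coverage: rd wraps'
print(missing_coverage(vhdl, log))   # ['coverage: read caught write']
```

Each message the script reports is a coverage case for which a new test vector should be added.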


Coverage Events for Queue       ............................................................



          [Diagrams: three Prev/Now snapshots showing the relative positions of
           the wr and rd indices around the queue]


  Question:       What events should we monitor to estimate the coverage of our functional
    tests?


  Answer:

      • wr_idx and rd_idx are far apart
      • wr_idx and rd_idx are equal
      • wr_idx catches rd_idx
      • rd_idx catches wr_idx
      • rd_idx wraps
      • wr_idx wraps


Coverage Monitor Template         .......................................................... .


  process (signals read)
  begin
    if (condition) then
      report "coverage: message";
    elsif (condition) then
      report "coverage: message";
    else
      report "error: case fall through on message"
        severity warning;
    end if;
  end process;


Coverage Monitor Code        .............................................................. .


Events related to rd_idx equal to wr_idx.


  process (prev_rd_idx, prev_wr_idx, rd_idx, wr_idx)
  begin
    if (rd_idx = wr_idx) then
      if ( prev_rd_idx = prev_wr_idx ) then
        report "coverage: read = write both moved";
      elsif ( rd_idx /= prev_rd_idx ) then
        report "coverage: Read caught write";
      elsif ( wr_idx /= prev_wr_idx ) then
        report "coverage: Write caught read";
      else
        report "error: case fall through on rd/wr catching"
        severity warning;
      end if;
    end if;
  end process;
Events related to rd_idx wrapping.
  process (rd_idx)
  begin
    if (rd_idx = low_idx) then
      report "coverage: rd mv to low";
    elsif (rd_idx = high_idx) then
      report "coverage: rd mv to high";
    else
      report "coverage: rd mv normal";
    end if;
  end process;


4.6.6 Assertions

Assertions for Queue       ................................................................. .

  1. If rd_idx changes, then it increments or wraps.
  2. If rd_idx changes, then do_rd was ’1’, or reset is ’1’.
  3. If wr_idx changes, then it increments or wraps.
  4. If wr_idx changes, then do_wr was ’1’, or reset is ’1’.
  5. And many others....


Assertion Template     ....................................................................


  process (signals read) begin
    assert (required condition)
      report "error: message" severity warning;
  end process;


Assertions: Read Index      ................................................................


  process (rd_idx) begin
    assert ((rd_idx = prev_rd_idx + 1) or (rd_idx = low_idx))
      report "error: rd inc" severity warning;
    assert ((prev_do_rd = ’1’) or (reset = ’1’))
      report "error: rd imp do_rd" severity warning;
  end process;


Assertions: Write Index      .............................................................. .


  process (wr_idx) begin
    assert ((wr_idx = prev_wr_idx + 1) or (wr_idx = low_idx))
      report "error: wr inc" severity warning;
    assert ((prev_do_wr = ’1’) or (reset = ’1’))
      report "error: wr imp do_wr" severity warning;
  end process;


4.6.7 VHDL Coding Tips

Vector Type Declaration      ...............................................................


  type data_array_ty is array(natural range <>) of data;
  signal data_array     : data_array_ty(7 downto 0);


Functions    .............................................................................


  function to_idx
    (i : natural range data_array’low to data_array’high)
    return idx_ty
  is
  begin
    return to_unsigned(i, idx_ty’length);
  end to_idx;

                     Conversion to Index
         Without function:  rd_idx <= to_unsigned(5, 3);
         With function:     rd_idx <= to_idx(5);
The function code is verbose, but is very maintainable, because neither the function itself nor uses
of the function need to know the width of the index vector.


Attributes    ............................................................................


  function inc_idx (idx : idx_ty) return idx_ty is
  begin
    if idx < data_array’high then
      return (idx + 1);
    else
      return (to_idx(data_array’low));
    end if;
  end inc_idx;


Feedback Loops, and Functions                       ........................................................


Coding guideline: use functions. Don’t use procedures.

         Increment as a function:    wr_idx <= inc_idx(wr_idx);
         Increment as a procedure:   inc_idx(wr_idx);
Functions clearly distinguish between reading from a signal and writing to a signal. By examining
the use of a procedure, you cannot tell which signals are read from and which are written to. You
must examine the declaration or implementation of the procedure to determine modes of signals.

Modifying a signal within a procedure results in a tri-state signal. This is bad.


File I/O (textio package)              ...............................................................


TEXTIO defines the read, write, readline, and writeline procedures.

Described in:
• http://www.eng.auburn.edu/department/ee/mgc/vhdl.html#textio
These functions can be used to read test vectors from a file and write results to a file.


4.6.8 Queue Specification

Most bugs in queues are related to the queue becoming full, becoming empty, and/or wrap of
indices.

Specification should be “obviously correct”. Avoid bugs in specification by making specification
queue larger than the max number of writes that we will do in test suite. Thus, the specification
queue will never become full or wrap. However, the implementation queue will become full and
wrap.
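The "larger than ever needed" idea can be illustrated outside VHDL with a sketch (in Python; an illustration of the principle, not the course's specification code): storage grows on every write and indices only move forward, so the specification queue can never become full or wrap.

```python
class SpecQueue:
    """Obviously-correct reference queue: unbounded storage, indices
    only ever increase, so full and wrap bugs cannot occur here."""
    def __init__(self):
        self.mem = []       # one slot appended per write; slots never reused
        self.rd_idx = 0     # increments on every read; never wraps
    def write(self, datum):
        self.mem.append(datum)
    def read(self):
        datum = self.mem[self.rd_idx]   # reading past the writes raises an error
        self.rd_idx += 1
        return datum

q = SpecQueue()
q.write("A"); q.write("B")
print(q.read(), q.read())   # A B
```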


Write Index Update in Specification        ....................................................


We increment the write index on every write; we never wrap.
  process (clk) begin
    if rising_edge(clk) then
      if (reset = ’1’) then
        wr_idx <= 0;
      elsif (do_wr = ’1’) then
        wr_idx <= wr_idx + 1;
      end if;
    end if;
  end process;


Things to Notice    .......................................................................


Things to notice in queue specification:
  1. don’t care conditions (’-’)
  2. uninitialized data (hint: what is the value of rd_data when we do more reads than writes?)


Don’t Care    ............................................................................


  rd_data <=   data_array(rd_idx) when (do_rd =’1’)
          else (others => ’-’);


4.6.9 Queue Testbench

Things to notice in queue testbench:
  1. running multiple test sequences
  2. uninitialized data ’U’
  3. std_match to compare spec and impl data
                                      ’0’  ∼  ’0’
                                      ’0’  ∼  ’L’
                                      ’1’  ∼  ’1’
                                      ’1’  ∼  ’H’
                                      ’-’  ∼  everything
                              (all other combinations do not match)
      With the equality operator, comparing ’-’ with ’1’ is simply false, but we want to use ’-’ to
      mean “don’t care” in the specification. The solution is to use std_match, rather than =, to
      check implementation signals against the specification.
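The matching table can be modelled directly; this Python sketch mirrors std_match's per-bit rules (an illustration of the semantics, not the IEEE package source):

```python
def bit_match(a: str, b: str) -> bool:
    """Per-bit matching rules: '-' matches anything; '0'/'L' and
    '1'/'H' match within their group; U, X, Z, W match only '-'."""
    if a == '-' or b == '-':
        return True
    group = {'0': '0', 'L': '0', '1': '1', 'H': '1'}
    return a in group and b in group and group[a] == group[b]

def std_match(a: str, b: str) -> bool:
    """Vector version: equal lengths and every bit pair matches."""
    return len(a) == len(b) and all(bit_match(x, y) for x, y in zip(a, b))

print(std_match("1-0", "1X0"))   # True: '-' in the spec matches the X
print(std_match("10", "1X"))     # False: X only matches '-'
```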


Stimulus Process Structure        ........................................................... .


The stimulus process runs multiple test vectors in a single simulation run.
  stimulus : process
    type test_datum_ty is
      record
        r_reset, ... normal fields ...
      end record;
    type test_vectors_ty is array(natural range <>) of test_datum_ty;
    constant test_vectors : test_vectors_ty :=
      ( --   reset ... other signal ...
          (   ’1’, normal fields), -- test case 1
          (   ’0’, normal fields),
         ...
          (   ’1’, normal fields), -- test case 2
          (   ’0’, normal fields),
         ...
      );
  begin
    for i in test_vectors’range loop
      if (test_vectors(i).r_reset = ’1’) then
        ... reset code ...
      end if;
      reset <= ’0’;
      ... normal sequence ...
      wait until rising_edge(clk);
    end loop;
  end process;
After reset is asserted, set signals to ’U’.



4.7 Example: Microwave Oven
This question concerns the VHDL code microwave, which controls a simple microwave oven;
the properties prop1...prop3; and two proposed changes to the VHDL code.
INSTRUCTIONS:
   1. Assume that the code as currently written is correct — any change to the code that causes a
      change to the behaviour of the signals heat or count is a bug.
   2. For each of the two proposed code changes, answer whether the code change will cause a
      bug.
   3. If the code change will cause a bug, provide a test case that will exercise the bug and identify
      all of the given properties (prop1, prop2, and prop3) that will detect the bug with the test
      case you provide.


  4. If none of the three properties can detect the bug, provide a property of your own that will
     detect the bug with the testcase you provide.


  Question: For each of the three properties prop1...prop3, answer whether the
    property is best checked as part of a testbench or an assertion. For each property, justify
    why a testbench or an assertion is the best method to validate that property.

prop1 If start is pushed and the door is closed, then heat remains on for exactly the time specified
    by the timer when start was pushed, assuming reset remains false and the door remains
    closed.

       Answer:
           Testbench: All relevant signals are primary inputs or outputs, so we can
          check the property without seeing internal signals. Testbenches are only able
          to set and observe primary inputs and outputs.

prop2 If the door is open, then heat is off.

       Answer:
          Testbench: same as previous property.

prop3 If start is not pushed, reset is false, and count is greater than zero, then count is decre-
    mented.

       Answer:
          Assertion: To see count, need access to internal signals.
entity microwave is
  port (
    timer     -- time input from user
    : in unsigned(7 downto 0);
    reset,    -- resets microwave
    clk,      -- clock signal input
    is_open, -- detects when door is open
    start     -- start button input from user
    : in std_logic;
    heat : out std_logic -- 1=on, 0=off
  );
end microwave;

architecture main of microwave is
  signal count : unsigned(7 downto 0); -- internal time count
  signal x_heat : std_logic;
begin
4.7. EXAMPLE: MICROWAVE OVEN                                                                                                                                  293


  -- heat process ------------------------------
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        x_heat <= '0';
      elsif (is_open = '0') and (start = '1') and   -- region of
            (timer > 0)                             -- change #2
      then                                          --
        x_heat <= '1';                              --
      elsif (is_open = '0') and (count > 0) then    --
        x_heat <= x_heat;                           --
      else
        x_heat <= '0';
      end if;
    end if;
  end process;
  -- count process ------------------------------
  process (clk)
  begin
    if rising_edge(clk) then
      if (reset = '1') then
        count <= to_unsigned(0, 8);
      elsif (start = '1') then     -- region of
        count <= timer;            -- change #1
      elsif (count > 0) then       --
        count <= count - 1;        --
      end if;
    end if;
  end process;
  heat <= x_heat;
end main;
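The testbench/assertion split above can be illustrated with a small Python model of the two processes. This is a sketch, not the VHDL itself: one call to `step` models one rising clock edge, and the names (`step`, the `state` dict) are hypothetical. The prop3 check is written as an assertion precisely because it needs the internal signal `count`:

```python
def step(state, timer, reset, is_open, start):
    """One rising clock edge of the microwave model.
    state holds 'count' (models the unsigned counter) and
    'heat' (models the std_logic heat bit as 0/1)."""
    count, heat = state["count"], state["heat"]
    # heat process (reads the pre-edge value of count)
    if reset:
        new_heat = 0
    elif (not is_open) and start and timer > 0:
        new_heat = 1
    elif (not is_open) and count > 0:
        new_heat = heat
    else:
        new_heat = 0
    # count process
    if reset:
        new_count = 0
    elif start:
        new_count = timer
    elif count > 0:
        new_count = count - 1
    else:
        new_count = count
    # prop3 as an assertion: needs the internal signal count
    if (not start) and (not reset) and count > 0:
        assert new_count == count - 1, "prop3 violated"
    return {"count": new_count, "heat": new_heat}

# close door, set timer to 3, push start, then release
s = {"count": 0, "heat": 0}
s = step(s, timer=3, reset=False, is_open=False, start=True)
for _ in range(4):
    s = step(s, timer=3, reset=False, is_open=False, start=False)
assert s["count"] == 0 and s["heat"] == 0
```

Because prop1 and prop2 mention only `timer`, `start`, `is_open`, `reset`, and `heat`, the same run could be checked from outside the model, which is the testbench argument made above.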


Properties   ........................................................................... .

prop1 If start is pushed and the door is closed, then heat remains on for exactly the time specified
    by the timer when start was pushed, assuming reset remains false and the door remains
    closed.
prop2 If the door is open, then heat is off.
prop3 If start is not pushed, reset is false, and count is greater than zero, then count is decre-
    mented.


Change #1     ........................................................................... .
294                                           CHAPTER 4. FUNCTIONAL VERIFICATION


 From:    elsif (start = '1') then
            count <= timer;
          elsif (count > 0) then
            count <= count - 1;

 To:      elsif (count > 0) then
            count <= count - 1;
          elsif (start = '1') then
            count <= timer;



  Answer:


       The change introduces a bug that is caught by prop1 and prop3.

       Test Cases
        testcase1 Maintain reset=0. Close door, set timer to some value (v1 ) and
             then push start. Leave the door closed. While the microwave is on, set
             timer to a value (v2 ) that is greater than v1 and then push start.
             In the old code, the new value on the timer will be read in; in the new
             code, the new value on the timer will be ignored. The reason to make v2
             greater than v1 is to prevent count from being exactly equal to v2 when
             start is pushed the second time, in which case the bug would not be
             exercised. Note that even the old code violates prop1 in this scenario:
             the second push of start extends heat beyond the time loaded at the
             first push.
        testcase2 reset = 0, microwave off, door closed, count = 0. Set timer to a
             non-zero value. Press and hold start for a number of cycles. In the
             original code, the value of timer would be reloaded into count on each
             rising edge of the clock. With the change, the value of count continues to
             decrement and the timer is not reloaded into count. Note: in this case,
             only prop1 will detect the bug. Prop3 will not detect the bug because its
             antecedent, or precondition, is false while start is held.
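Testcase2 can be checked mechanically. The sketch below (Python, with hypothetical helper names; it models only the count process, not the full VHDL) runs the original and the changed priority orders side by side and shows them diverging when start is held:

```python
def count_old(count, timer, reset, start):
    # original priority: reload (start) tested before decrement
    if reset:
        return 0
    if start:
        return timer
    if count > 0:
        return count - 1
    return count

def count_new(count, timer, reset, start):
    # change #1: decrement tested before reload
    if reset:
        return 0
    if count > 0:
        return count - 1
    if start:
        return timer
    return count

# testcase2: timer = 5, press and hold start for three cycles
old = new = 0
for cycle in range(3):
    old = count_old(old, timer=5, reset=False, start=True)
    new = count_new(new, timer=5, reset=False, start=True)
assert old == 5   # reloaded on every cycle
assert new == 3   # loaded once, then decremented despite start
assert old != new # the change is observable, i.e. a bug
```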


Change #2     ........................................................................... .

 From:    elsif (is_open = '0') and (start = '1') and (timer > 0)
          then x_heat <= '1';
          elsif (is_open = '0') and (count > 0)
          then x_heat <= x_heat;

 To:      elsif     (is_open = '0')
               and ((start = '1') or (count > 0))
          then x_heat <= '1';
          else x_heat <= '0';
4.7. EXAMPLE: MICROWAVE OVEN                                                                  295


  Answer:


     The change introduces a bug that would be caught by prop1, but not by prop2
     or prop3.
     The following scenario or test case will catch the bug with prop1. Maintain
     reset=0. Microwave is off, door is closed, timer is set to 0. Push start. With
     old code, microwave will remain off. With new code, microwave will turn on
     and remain on as long as start is pushed.
     The change to the code also introduces a second bug that is not caught by
     prop1. This bug demonstrates a weakness in prop1 that should be remedied.
     Testcase: reset = 0, microwave off, door closed. Set timer to a non-zero
     value. Press (and release) start. Before the timer expires, open the door,
     then close the door before count = 0. In the original code, the microwave
     will remain off, but with the change, the microwave will start again. Note:
     none of the three given properties detects this bug.
     The weakness in prop1 is that it assumes that the door remains closed. So,
     any testcase where the door is opened will pass prop1. In verification, this
     is known as the “false implies anything” problem; such a testcase is said to
     pass the property “vacuously”.
     To catch this bug, we must either change prop1 or add another property. In
     fact, we probably should do both.
     First we strengthen prop1 to deal with situations where the door is opened
     while the microwave is on. The property gets a bit complicated: “If start is
     pushed and the door is closed, then heat remains on until the earlier of either
     opening of the door or the expiration of the time specified by the timer when
     start was pushed, assuming reset remains false.”
     Second, we add a property to ensure that the microwave does not turn back
     on when the door is re-closed with time remaining on the counter: “If the
     microwave is off, it remains off until start is pushed.” This fourth property is
     written to be as general as possible. We want to write properties that catch as
     many bugs as possible, rather than write properties for specific testcases or
     bugs.


Coverage    ............................................................................ .


  Question: If the msb of src1 is '1' and the lsb of src2 is '0' or sum(3) is '1', then the result
    is wrong. What is the minimum coverage needed to detect the bug? What is the minimum
    coverage needed to guarantee that the bug will be detected?
296                                                CHAPTER 4. FUNCTIONAL VERIFICATION


4.8 Functional Verification Problems
P4.1 Carry Save Adder
   1. Functionality Briefly describe the functionality of a carry-save adder.
   2. Testbench Write a testbench for a 16-bit combinational carry save adder.
   3. Testbench Maintenance Modify your testbench so that it is easy to change the width of the
      adder and the latency of the computation.
      NOTES:
       (a) You do not need to support pipelined adders.
       (b) VHDL “generics” might be useful.
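For part 1, the defining property of a carry-save adder is that it reduces three operands to a sum word and a shifted carry word whose ordinary sum equals the sum of the three operands (bitwise, sum = a xor b xor c and carry = majority(a, b, c) shifted left one position). A Python reference model of that property, which a testbench could be checked against (a sketch with hypothetical names, not part of the course code):

```python
def carry_save_add(a, b, c, width=16):
    """Reduce three width-bit operands to (sum, carry)."""
    mask = (1 << width) - 1
    s = (a ^ b ^ c) & mask                              # per-bit sum
    carry = ((a & b) | (a & c) | (b & c)) << 1          # per-bit majority, shifted
    carry &= (1 << (width + 1)) - 1
    return s, carry

# the defining property: s + carry == a + b + c (modulo 2^(width+1))
for a, b, c in [(0, 0, 0), (1, 2, 3), (0xFFFF, 0xFFFF, 0xFFFF)]:
    s, carry = carry_save_add(a, b, c)
    assert (s + carry) % (1 << 17) == (a + b + c) % (1 << 17)
```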


P4.2 Traffic Light Controller

P4.2.1 Functionality

Briefly describe the functionality of a traffic-light controller that has sensors to detect the presence
of cars.


P4.2.2 Boundary Conditions

Make a list of boundary conditions to check for your traffic light controller.


P4.2.3 Assertions

Make a list of assertions to check for your traffic light controller.


P4.3 State Machines and Verification

P4.3.1 Three Different State Machines




[Figure 4.10: A very simple machine (states s0–s3)]

[Figure 4.11: A very big machine (states s0–s9)]

[Figure 4.12: A concurrent machine (states s0–s2 and q0–q4)]

[Figure 4.13: Legend — edges are labelled input/output; * = don't care]
Answer each of the following questions for the three state machines in figures 4.10–4.12.


Number of Test Scenarios How many “test scenarios” (sequences of test vectors) would you
need to fully validate the behaviour of the state machine?


Length of Test Scenario            What is the maximum length (number of test vectors) in a test scenario
for the state machine?


Number of Flip Flops Assuming that neither the inputs nor the outputs are registered, what is
the minimum number of flip-flops needed to implement the state machine?


P4.3.2 State Machines in General

If a circuit has i 1-bit input signals, f 1-bit signals that are outputs of flip-flops, and c 1-bit
signals that are outputs of combinational circuitry, what is the maximum number of states that
the circuit can have?


P4.4 Test Plan Creation

You’re on the functional verification team for a chip that will control a simple portable CD
player. Your task is to create a plan for the functional verification for the signals in the entity
cd_digital.
You’ve been told that the player behaves “just like all of the other CD players out there”. If your
test plan requires knowledge about any potential non-standard features or behaviour, you’ll need
to document your assumptions.
[Figure: CD player front panel — display shows track, min, and sec; buttons: prev, stop, play, next, pwr]



entity cd_digital is
  port (
    ----------------------------------------------------
    -- buttons
    prev,
    stop,
    play,
    nxt,          -- "next" is a VHDL reserved word
    pwr           : in std_logic;
    ----------------------------------------------------
    -- detect if player door is open
    door_open     : in std_logic;  -- "open" is a VHDL reserved word
    ----------------------------------------------------
    -- output display information
    track : out std_logic_vector(3 downto 0);
    min   : out unsigned(6 downto 0);
    sec   : out unsigned(5 downto 0)
  );
end cd_digital;


P4.4.1 Early Tests

Describe five tests that you would run as soon as the VHDL code is simulatable. For each test,
describe your specification, stimulus, and check. Summarize why your collection of tests should
be the first tests that are run.


P4.4.2 Corner Cases

Describe five corner-cases or boundary conditions, and explain the role of corner cases and
boundary conditions in functional verification.

NOTES:
  1. You may reference your answer for problem P4.4.1 in this question.
  2. If you do not know what a “corner case” or “boundary condition” is, you may earn partial
     credit by: checking this box   and explaining five things that you would do in functional
     verification.


P4.5 Sketches of Problems
  1. Given a circuit, VHDL code, or circuit size info; calculate simulation run time to achieve
     n% coverage.

  2. Given a fragment of VHDL code, list things to do to make it more robust — e.g. illegal data
     and states go to initial state.

  3. Smith Problem 13.29
Chapter 5

Timing Analysis

5.1 Delays and Definitions
In this section we will look at the different timing parameters of circuits. Our focus will be on
those parameters that limit the maximum clock speed at which a circuit will work correctly.


5.1.1 Background Definitions

  Definition fanin: The fanin of a gate or signal x is the set of all gates or signals y where an
    input of x is connected to an output of y.


  Definition fanout: The fanout of a gate or signal x is the set of all gates or signals y where
    an output of x is connected to an input of y.




[Figure 5.1: Immediate Fanin of x — gates y0–y4 drive x.   Figure 5.2: Immediate Fanout of x — x drives gates y0–y4]





  Definition immediate fanin/fanout: The phrases immediate fanout and immediate fanin
    mean that there is a direct connection between the gates.




[Figure 5.3: Transitive Fanin.   Figure 5.4: Transitive Fanout]



  Definition transitive fanin/fanout: The phrases transitive fanout and transitive fanin
    mean that there is either a direct or indirect connection between the gates.


        Note: “Immediate” vs “Transitive” fanin and fanout       Be careful to dis-
        tinguish between immediate fan(in/out) and transitive fan(in/out). If “fanin”
        or “fanout” is not qualified with “immediate” or “transitive”, be sure to
        determine whether “immediate” or “transitive” is meant. In E&CE 327,
        “fan(in/out)” will mean “immediate fan(in/out)”.
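The immediate/transitive distinction can be made precise with a small graph model. In this Python sketch (hypothetical names), `drives` maps each gate to the gates it directly drives:

```python
def immediate_fanout(drives, x):
    # gates directly connected to an output of x
    return set(drives.get(x, ()))

def transitive_fanout(drives, x):
    # depth-first search over direct connections
    seen = set()
    stack = [x]
    while stack:
        for y in drives.get(stack.pop(), ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen

# g1 drives g2 and g3; g2 drives g4
drives = {"g1": ["g2", "g3"], "g2": ["g4"]}
assert immediate_fanout(drives, "g1") == {"g2", "g3"}
assert transitive_fanout(drives, "g1") == {"g2", "g3", "g4"}
```

Transitive fanin is the same traversal run over the reversed graph.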


5.1.2 Clock-Related Timing Definitions

5.1.2.1 Clock Skew

[Figure: waveforms clk1–clk4 showing skew between arrival times of the same clock edge; clock distribution network delivering clk1–clk4]

  Definition Clock Skew: The difference in arrival times for the same clock edge at
    different flip-flops.


Clock skew is caused by the difference in interconnect delays to different points on the chip.

Clock tree design is critical in high-performance designs to minimize clock skew. Sophisticated
synthesis tools put lots of effort into clock tree design, and the techniques for clock tree design still
generate PhD theses.


5.1.2.2 Clock Latency

[Figure: clock tree from master clock through intermediate clock to final clock; waveforms showing the latency between them]

   Definition Clock Latency: The difference in arrival times for the same clock edge at
     different levels of interconnect along the clock tree. (Intuitively “different points in
     the clock generation circuitry.”)


           Note: Clock latency         Clock latency does not affect the limit on the minimum
           clock period.


5.1.2.3 Clock Jitter

[Figure: waveforms of an ideal clock and a clock with jitter; the jitter is the deviation of each edge from the ideal period]

   Definition Clock Jitter: Difference between actual clock period and ideal clock period.


Clock jitter is caused by:


• temperature and voltage variations over time
• temperature and voltage variations across different locations on a chip
• manufacturing variations between different parts


5.1.3 Storage-Related Timing Definitions

Storage devices (latches, flip-flops, memory arrays, etc) define setup, hold and clock-to-Q times.


5.1.3.1 Flops and Latches

[Figure: clk, d, and q waveforms for Flop Behaviour (edge sensitive) and Latch Behaviour (level sensitive)]
Storage devices have two modes: load mode and store mode.

Flops are edge sensitive: either rising edge or falling edge. An ideal flop is in load mode only for
the instant just before the clock edge. In reality, flops are in load mode for a small window on
either side of the edge.

Latches are level sensitive: either active high or active low. A latch is in load mode when its enable
signal is at the active level.


Timing Parameters            ....................................................................

[Figure: setup, hold, and clock-to-Q annotations on d, clk, and q waveforms for a flip-flop, an active-high latch, and an active-low latch]
Setup and hold define the window in which input data are required to be constant in order to
guarantee that the storage device will store the data correctly. Setup defines the beginning of the
window; hold defines the end. The setup and hold constraints come into play when the storage
device transitions from load mode to store mode: they ensure that the input data is captured
correctly at that transition. Setup is assumed to happen before the clock edge and hold after the
edge. If the end of the window occurs before the clock edge, then the hold constraint is negative.

Clock-to-Q defines the delay from the clock edge to when the output is guaranteed to be stable.

         Note: Require / Guarantee      Setup and hold times are requirements that the
         storage device imposes upon its environment. Clock-to-Q is a guarantee that
         the storage device provides to its environment. If the environment satisfies the
         setup and hold times, then the storage device guarantees that it will satisfy the
         clock-to-Q time.

In this section, we will use the definitions of setup, hold and clock-to-Q. Section 5.2 will show how
to calculate setup, hold, and clock-to-Q times for flip flops, latches, and other storage devices.


5.1.3.2 Timing Parameters for a Flop

Setup Time      .......................................................................... .



   Definition Setup Time (T_SUD): Latest time before arrival of clock edge (flip flop), or
      deasserting of enable line (latch), that input data is required to be stable in order for
      storage device to work correctly.


If setup time is violated, current input data will not be stored; input data from previous clock cycle
might remain stored.


5.1.3.3 Hold Time

   Definition Hold Time (T_HO): Latest time after arrival of clock edge (flip flop), or
      deasserting of enable line (latch), that input data is required to remain stable in order
      for storage device to work correctly.


If hold time is violated, current input data will not be stored; input data from next clock cycle
might slip through and be stored.


5.1.3.4 Clock-to-Q Time

   Definition Clock-to-Q Time (T_CO): Earliest time after arrival of clock edge (flip flop),
      or asserting of enable line (latch), when output data is guaranteed to be stable.


Review: Timing Parameters         .......................................................... .

Setup : Time before arrival of clock edge (flip flop), or deasserting of enable line (latch), that
     input data is required to start being stable
Hold : Time after arrival of clock edge (flip flop), or deasserting of enable line (latch), that input
    data is required to remain stable
Clock-to-Q : Time after arrival of clock edge (flip flop), or asserting of enable line (latch) when
    output data is guaranteed to start being stable


5.1.4 Propagation Delays

Propagation delay is the time it takes a signal to travel from the source (driving) flop to the
destination flop. The two factors that contribute to propagation delay are the load of the
combinational gates between the flops and the delay along the interconnect (wires) between the gates.


5.1.4.1 Load Delays

Load delay is proportional to load capacitance.

The figure below shows the timing of a simple inverter driving a load.

[Figure: inverter schematic Vi → Vo with load capacitance; a 1 → 0 input charges the output capacitance (output 0 → 1), and a 0 → 1 input discharges it (output 1 → 0)]
Load capacitance is dependent on the fanout (how many other gates a gate drives) and how big
those gates are.

Section 5.4.2 goes into more detail on timing models and equations for load delay.


5.1.4.2 Interconnect Delays

Wires, also known as interconnect, have resistance, and there is a capacitance between a wire and
both the substrate and parallel wires. Both the resistance and capacitance of wires increase delay.
• Wire resistance is dependent upon the material and geometry of the wire.


• Wire capacitance is dependent on wire geometry, geometry of neighboring wires, and materials.
• Shorter wires are faster.
• Fatter wires are faster.
• FPGAs have special routing resources for long wires.
• CMOS processes use higher metal layers for long wires; these layers have wires with much
  larger cross sections than lower levels of metal.
More on this in section 5.4.


5.1.5 Summary of Delay Factors


           Name            Symbol    Definition
           Skew                      Difference in arrival times for different clock
                                     signals
           Jitter                    Difference in clock period over time
           Clock-to-Q      T_CO      Delay from clock signal to Q output of flop
           Setup           T_SUD     Length of time prior to clock/enable that data
                                     must be stable
           Hold            T_HO      Length of time after clock/enable that data must
                                     be stable
           Load                      Delay due to load (fanout/consumers/readers)
           Interconnect              Delay along wire


                               Table 5.1: Summary of delay factors




5.1.6 Timing Constraints

For a circuit to operate correctly, the clock period must be longer than the sum of the delays shown
in table 5.1.

    Definition Margin: The difference between the required value of a timing parameter
      and the actual value. A negative margin means that there is a timing violation. A
      margin of zero means that the timing parameter is just satisfied: changing the timing
      of the signals (which would affect the actual value of the parameter) could violate the
      timing parameter. A positive margin means that the constraint for the timing
      parameter is more than satisfied: the timing of the signals could be changed at least a
      little bit without violating the timing parameter.

          Note:     “Margin” is often called “slack”. Both terms are used commonly.


5.1.6.1 Minimum Clock Period

[Figure: two flip-flops connected by a combinational path — signal a leaves the first flop (clk1) and signal b enters the second flop (clk2). Legend: signal is stable / signal may change / signal may rise / signal may fall. The waveforms divide the clock period into skew, jitter, clock-to-Q, propagation (interconnect + load), setup, and slack.]


         ClockPeriod > Skew + Jitter + T_CO + Interconnect + Load + T_SUD


       Note:       The minimum clock period is independent of hold time.
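The minimum-period constraint can be sketched numerically. In this Python sketch, the delay values are hypothetical (in ns), and `min_clock_period` is simply the sum of the terms on the right-hand side of the inequality:

```python
def min_clock_period(skew, jitter, t_co, interconnect, load, t_sud):
    """Smallest period satisfying
    ClockPeriod > skew + jitter + T_CO + interconnect + load + T_SUD."""
    return skew + jitter + t_co + interconnect + load + t_sud

# hypothetical delays in ns
period = min_clock_period(skew=0.2, jitter=0.1, t_co=0.5,
                          interconnect=1.2, load=0.8, t_sud=0.3)
assert abs(period - 3.1) < 1e-9        # any period above 3.1 ns works
# margin (slack) at a 4 ns clock:
assert abs((4.0 - period) - 0.9) < 1e-9
```

Note that, matching the text above, the hold time does not appear anywhere in this calculation.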


5.1.6.2 Hold Constraint
[Figure: waveforms for clk1, clk2, a, and b over one clock period, illustrating the hold constraint]


                    Skew + Jitter + T_HO ≤ T_CO + Interconnect + Load


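A quick numeric check of the hold constraint (a Python sketch; the delay values are hypothetical, in ns):

```python
def hold_margin(skew, jitter, t_ho, t_co, interconnect, load):
    """Slack in the hold constraint
    skew + jitter + T_HO <= T_CO + interconnect + load.
    A negative result means a hold violation."""
    return (t_co + interconnect + load) - (skew + jitter + t_ho)

# hypothetical delays in ns
assert hold_margin(skew=0.2, jitter=0.1, t_ho=0.3,
                   t_co=0.5, interconnect=0.4, load=0.2) > 0   # satisfied
assert hold_margin(skew=0.6, jitter=0.3, t_ho=0.4,
                   t_co=0.5, interconnect=0.1, load=0.1) < 0   # violation
```

Unlike the minimum-period constraint, this check does not involve the clock period at all: making the clock slower cannot fix a hold violation.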
5.1.6.3 Example Timing Violations

The figures below illustrate correct timing behaviour of a circuit and then two types of violations:
setup violation and hold violation. In the figures, the black rectangles identify the point where the
violation happens.



[Figure 5.5: Good Timing — a flop drives signal b through combinational logic (c) into a second flop (d), all clocked by clk. The waveforms show values α, β, γ propagating with clock-to-Q, propagation delay, and the setup and hold windows all satisfied.]



[Figure 5.6: Setup Violation — value β arrives at the second flop inside its setup window (black rectangle); the stored value is indeterminate (?α?β?).]



[Figure 5.7: Hold Violation — value γ arrives at the second flop inside its hold window (black rectangle); the stored value is indeterminate (?β?γ?).]



5.2 Timing Analysis of Latches and Flip Flops
In this section, we show how to find the clock-to-Q, setup, and hold times for latches, flip-flops,
and other storage elements.


5.2.1 Simple Multiplexer Latch

We begin our study of timing analysis for storage devices with a simple latch built from an inverter
ring and multiplexer. There are many better ways to build latches, primarily by doing the design
at the transistor level. However, the simplicity of this design makes it ideal for illustrating timing
analysis.


5.2.1.1 Structure and Behaviour of Multiplexer Latch

Two modes for storage devices:
• loading data:


  – loads input data into storage circuitry
  – input data passes through to output
• using stored data
  – input signal is disconnected from output
  – storage circuitry drives output

[Figure: multiplexer latch — clk selects between input i (loading / pass-through mode, clk = '1') and the inverter-ring feedback (storage mode, clk = '0'), driving output o]


Unfold Multiplexer to Simple Gates

[Figure: the multiplexer (inputs a, b; select s/sel; output o) shown as a symbol and as its gate-level implementation; substituting the gates into the latch yields the gate-level latch implementation, with inputs d and clk and output o.]

          Note: inverters on clk Both of the inverters on the clk signal are needed.
          Together, they prevent a glitch on the OR gate when clk is deasserted. If
          there were only one inverter, a glitch would occur. For more on this, see
          section 5.2.1.6.

[Figure: the gate-level latch annotated with signal values in load mode. Left: Loading '0' (d='0', clk='1'). Right: Loading '1' (d='1', clk='1').]
5.2.1 Simple Multiplexer Latch                                                                      313


[Figure: the gate-level latch annotated with signal values in storage mode. Left: Storing '0' (clk='0', o='0'). Right: Storing '1' (clk='0', o='1').]


5.2.1.2 Strategy for Timing Analysis of Storage Devices

The key to calculating the setup and hold times of a latch, flip-flop, etc. is to identify:

   1. how the data is stored when not connected to the input (often a pair of inverters in a loop)
   2. the gate(s) that the clock uses to cause the stored data to drive the output (often a transmission
      gate or multiplexer)
   3. the gate(s) that the clock uses to cause the input to drive the output (often a transmission gate
      or multiplexer)

[Figure: the gate-level latch annotated for clk='0' (store mode) and clk='1' (load mode).]
          Note: Storage devices vs. Signals      We can talk about the setup and hold
          time of a signal or of a storage device. For a storage device, the setup and
          hold times are requirements that it imposes upon all environments in which it
          operates. For an individual signal in a circuit, the setup and hold times are
          the amounts of time that the signal is stable before and after a clock edge,
          respectively.


5.2.1.3 Clock-to-Q Time of a Multiplexer Latch


[Figure: gate-level latch with labelled internal signals — load path d → l1 → l2, clock path clk → cn → c2, output path qn → q, and store loop s1 → s2.]

                          Figure 5.8: Latch for Clock-to-Q analysis


[Waveform figure: d, l1, l2, qn, q, s1, s2, clk, cn, and c2, showing the change from ω to α rippling from d through to q; the clock-to-Q interval is marked from the clk edge to the change on q.]

                  Figure 5.9: Waveforms of latch showing Clock-to-Q timing

Assume that the input is stable, and that the clock signal then transitions to move the circuit from
storage mode to load mode.

Calculate the clock-to-Q time by finding the delay of the critical path from where the clock signal
enters the storage circuit to where q exits the storage circuit.

The path is: clk → cn → c2 → l2 → qn → q, which has a delay of 5 (assuming each gate has a
delay of exactly one time unit).
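Under that unit-delay assumption, the calculation is just a sum over the gates on the critical path. A minimal Python sketch (signal names as in Figure 5.8; the delay table is an assumption of this sketch):

```python
# Unit-delay model of the multiplexer latch (signal names from Figure 5.8).
# Each entry gives the delay of the gate that drives that signal.
GATE_DELAY = {
    "l1": 1,  # gate driven by d on the load path
    "cn": 1,  # inverter: clk -> cn
    "c2": 1,  # inverter: cn -> c2
    "l2": 1,  # AND gate: l1 AND c2
    "s2": 1,  # AND gate: s1 AND cn
    "qn": 1,  # OR gate: l2 OR s2
    "q":  1,  # inverter: qn -> q
    "s1": 1,  # gate driven by q in the store loop
}

def path_delay(path):
    """Sum the delays of the gates along a path.
    The first signal is the starting point and contributes no delay."""
    return sum(GATE_DELAY[sig] for sig in path[1:])

clock_to_q = path_delay(["clk", "cn", "c2", "l2", "qn", "q"])
print(clock_to_q)  # 5 time units, matching the analysis above
```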


5.2.1.4 Setup Timing of a Multiplexer Latch

The storage device transitions from load mode to store mode. The setup time is the time that the
input must be stable before the clock changes.

[Figure: the same gate-level latch (signals d, clk, l1, l2, c2, cn, s1, s2, qn, q).]

                                Figure 5.10: Latch for Setup Analysis
[Waveform figure, labelled "setup + margin": d changes from ω to α well before the clk edge; α ripples through l1, l2, qn, q, and s1, reaching the store loop before cn asserts.]

                        Figure 5.11: Setup with margin: goal is to store α

Step-by-step animation of latch transitioning from load to store mode.
[Animation frames (circuit snapshots):
  – Circuit is stable in load mode.
  – t=0: Clk transitions from load to store.
  – t=1: Clk transitions from load to store.
  – t=2: s1 propagates to s2, because cn turns on AND gate.
  – t=3: l2 is set to 0, because c2 turns off AND gate.
  – t=4: α from store path propagates to q.
  – t=5: α from store path completes cycle.]

The value on s1 at t=1 will propagate from the store loop to the output and back through the store
loop. At t=1, s1 must have the value that we want to store. Or, equivalently, the value to store must
have saturated the store loop by t=1. It takes 5 time units for a value on the input d to propagate to
s1 (d → l1 → l2 → qn → q → s1).

The setup time is the difference between the delay from d to s1 and the delay from clk to cn:
5 − 1 = 4, so the setup time for this latch is 4 time units.
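The same subtraction, written out as a sketch (unit gate delays assumed, as above; signal names from Figure 5.10):

```python
# Setup-time calculation for the multiplexer latch, assuming one time
# unit of delay per gate.
def path_delay(path):
    """Each hop in the path goes through one unit-delay gate."""
    return len(path) - 1

# The value to store must have saturated the store loop (reached s1)
# by the time cn asserts, one gate delay after the clock edge.
d_to_s1   = path_delay(["d", "l1", "l2", "qn", "q", "s1"])  # 5
clk_to_cn = path_delay(["clk", "cn"])                       # 1

setup_time = d_to_s1 - clk_to_cn
print(setup_time)  # 4 time units
```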



[Waveform figure, labelled "setup with negative margin": d changes from ω to α too close to the clk edge; qn, q, s1, and s2 never settle, oscillating between α and ω (shown as α/ω) indefinitely.]

                                  Figure 5.12: Setup Violation


Step-by-step animation of latch transitioning from load to store mode with setup violation where
α arrives 1 time-unit before the rising edge of the clock.
[Animation frames (circuit snapshots):
  – Circuit is stable in load mode with ω.
  – t=-1: D transitions from ω to α.
  – t=0: α propagates through inverter; Clk transitions from load to store.
  – t=1: α propagates through AND; Clk propagates through inverter.
  – Trouble: inconsistent values on load path and store path. Old value (ω) still in store path
    when store path is enabled.
  – t=2: old ω propagates through AND.
  – t=3: l2 is set to 0, because c2 turns off AND gate.
  – t=4: ω/α from store path propagates to q.
  – t=5: ω/α from store path completes cycle.
  – t=5: Illustrate instability with ω=0, α=1.]
[Waveform figure, labelled "setup with negative margin", time scale -3 to 6: the same oscillating α/ω behaviour on qn, q, s1, and s2.]


We now repeat the analysis of the setup violation, but illustrate the minimum violation (the input
transitions from ω to α 3 time-units before the clock edge).
[Animation frames (circuit snapshots):
  – Circuit is stable in load mode with ω.
  – t=-3: D transitions from ω to α.
  – t=-2: α propagates through inverter.
  – t=-1: α propagates through AND.
  – t=0: Clk transitions from load to store.
  – t=1: Clk propagates through inverter.
  – Trouble: inconsistent values on load path and store path. Old value (ω) still in store path
    when store path is enabled.
  – t=2: old ω propagates through AND.
  – t=3: l2 is set to 0, because c2 turns off AND gate.
  – t=4: ω/α from store path propagates to q.
  – t=5: ω/α from store path completes cycle.
  – t=5: Illustrate instability with ω=0, α=1.]
[Waveform figure, labelled "setup with negative margin", time scale -3 to 6: with the minimum violation, qn, q, s1, and s2 again oscillate between α and α/ω.]



[Waveform figure, labelled "setup with negative margin": d changes from ω to α too late; qn, q, s1, and s2 oscillate between α and ω indefinitely.]

                                     Figure 5.13: Setup Violation


[Waveform figure, labelled "setup": d changes from ω to α exactly the setup time before the clk edge; α reaches s1 just as cn asserts, and qn, q, s1, and s2 settle to α.]

                                 Figure 5.14: Minimum Setup Time

When cn is asserted, α must be at s1. Otherwise, ω will affect the storage circuitry when the data
input is disconnected.


5.2.1.5 Hold Time of a Multiplexer Latch


[Figure: the same gate-level latch (signals d, clk, l1, l2, c2, cn, s1, s2, qn, q).]

                              Figure 5.15: Latch for Hold Analysis
[Waveform figure, labelled "hold + margin": α is stored; d changes from α to β well after the clk edge, so β never reaches the store loop and qn, q, s1, and s2 hold α.]

                          Figure 5.16: Hold OK: goal is to store α



[Animation frames (circuit snapshots):
  – Circuit is stable in load mode.
  – t=0: Clk transitions from load to store.
  – t=5: Clk transition propagates to cn.
  – t=6: Clk transition propagates to c2; l1 may change now without affecting storage device.
  – t=7: Clk transition propagates to l2.]

                                Figure 5.17: Animation of hold analysis

It takes 6 time units for a change on the clock signal to propagate to the input of the AND gate that
controls the load path. It takes 1 time unit for a change on d to propagate to the input of this AND
gate. The data input must remain stable for 6 − 1 = 5 time units after the clock transitions from
load to store mode, or else the new data value (e.g., β) will slip into the storage loop and corrupt
the value α that we are trying to store.
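As with setup, the hold time is a difference of two path delays. A sketch using the delays quoted above (the numbers 6 and 1 are taken from this subsection's analysis, not computed from a netlist):

```python
# Hold-time calculation for the multiplexer latch, using the path delays
# stated above: 6 time units for the clock transition to reach the AND
# gate that controls the load path, 1 time unit for d to reach that gate.
clk_to_load_gate = 6   # clock path delay, as given in the text
d_to_load_gate   = 1   # data path delay: d -> l1

# d must stay stable this long after the clock edge; otherwise the new
# value (e.g., beta) slips into the storage loop and corrupts alpha.
hold_time = clk_to_load_gate - d_to_load_gate
print(hold_time)  # 5 time units
```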



[Waveform figure, labelled "hold with negative margin": d changes from α to β too soon after the clk edge; β propagates through l1, l2, qn, q, s1, and s2, replacing the value α that was being stored.]

                           Figure 5.18: Hold violation: β slips through to q


[Waveform figure, labelled "hold": d changes from α to β exactly the hold time after the clk edge; β reaches l1 just as c2 deasserts, so the store loop keeps α.]

                                  Figure 5.19: Minimum Hold Time

Can’t let β affect l1 before c2 deasserts.

Hold time is difference between path from clk to c2 and path from d to l1.


5.2.1.6 Example of a Bad Latch

This latch is very similar to the one from section 5.2.1.5, however this one does not work correctly.
The difference between this latch and the one from section 5.2.1.5 is the location of the inverter
that determines whether l2 or s2 is enabled. When the clock signal is deasserted, c2 turns off the
AND gate l2 before the AND gate s2 turns on. In this interval when both l2 and s2 are turned
off, a glitch is allowed to enter the feedback loop.

The glitch on the feedback loop is independent of the timing of the signals d and clk.


[Schematic of the bad latch (signals d, clk, c2, cn, l1, l2, s1, s2, qn, q) and waveform showing
a glitch entering the feedback loop while both l2 and s2 are turned off, so the stored value α is
repeatedly disturbed.]

5.2.2 Timing Analysis of Transmission-Gate Latch

The latch that we now examine is more realistic than the simple multiplexer-based latch. We
replace the multiplexer with a transmission gate.


5.2.2.1 Structure and Behaviour of a Transmission Gate




[Figure: transmission-gate symbol, implementation, and switch model. With s = '0' (s' = '1') the
gate is open; with s = '1' (s' = '0') it is closed, transmitting either a '1' or a '0' from i to o.]

5.2.2.2 Structure and Behaviour of Transmission-Gate Latch

(Smith 2.5.1)

[Schematic of the transmission-gate latch, shown in its two modes: loading data into the latch and
using stored data from the latch.]


5.2.2.3 Clock-to-Q Delay for Transmission-Gate Latch

[Schematic of the transmission-gate latch annotated with the clock-to-Q path from clk to q.]
5.2.2.4 Setup and Hold Times for Transmission-Gate Latch


[Two annotated schematics, one for the setup time of the latch and one for the hold time; in each,
two competing paths are labelled path1 and path2:

    Setup time for latch = path1 − path2
    Hold time for latch  = path1 − path2]


5.2.3 Falling Edge Flip Flop

(Smith 2.5.2)

We combine two active-high latches to create a falling-edge, master-slave flip flop. The analysis
of the master-slave flip-flop illustrates how to do timing analysis for hierarchical storage devices.
Here, we use the timing information for the active high latch to compute the timing information
of the flip-flop. We do not need to know the primitive structure of the latch in order to derive the
timing information for the flip flop.


5.2.3.1 Structure and Behaviour of Flip-Flop
[Schematic of the master-slave flip-flop: two latches with enables, d to m to q, the slave latch
clocked by clk_b. Waveforms for d, clk, m, clk_b, and q show values passing from d to m and
then from m to q. A second set of waveforms is annotated with the intervals TInv, Tmd, latch
setup, and latch clock-to-Q.]

TInv: delay through an inverter
Tmd: propagation delay from m to d


5.2.3.2 Clock-to-Q of Flip-Flop

[Schematic and waveform for the flip-flop clock-to-Q delay: after the clock edge, the inverter
delay (TInv) is followed by the latch clock-to-Q delay before α appears on q.]

    T_CO(Flop) = TInv + T_CO(Latch)


5.2.3.3 Setup of Flip-Flop

[Schematic and waveform for the flip-flop setup time: d must arrive one latch setup time before
the clock edge, so the setup interval of the flop equals the setup interval of the master latch.]

    T_SUD(Flop) = T_SUD(Latch)

The setup time of the flip flop is the same as the setup time of the master latch. This is because,
once the data is stored in the master latch, it will be held for the slave latch.


5.2.3.4 Hold of Flip-Flop

[Schematic and waveform for the flip-flop hold time: d must hold its value after the clock edge
for the hold time of the master latch.]

    T_HO(Flop) = T_HO(Latch)

The hold time of the flip flop is the same as the hold time of the master latch. This is because, once
the data is stored in the master latch, it will be held for the slave latch.
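The three results of sections 5.2.3.2 to 5.2.3.4 can be collected into one small sketch (the function
name, the dict encoding, and the example delay values are illustrative assumptions, not from the
notes):

```python
# Hierarchical timing of the master-slave flip-flop, built from the
# timing parameters of its master latch and the clock inverter.
def flop_timing(t_inv, latch_co, latch_sud, latch_ho):
    return {
        "T_CO":  t_inv + latch_co,  # clock-to-Q: inverter delay + latch clock-to-Q
        "T_SUD": latch_sud,         # setup time: same as the master latch
        "T_HO":  latch_ho,          # hold time: same as the master latch
    }

# Illustrative delay values (arbitrary units), not from the notes.
print(flop_timing(t_inv=1, latch_co=2, latch_sud=3, latch_ho=1))
```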


5.2.4 Timing Analysis of FPGA Cells

(Smith 5.1.5)

We can apply hierarchical analysis to structures that include both datapath and storage circuitry.
We use an Actel FPGA cell to illustrate. The description of the Actel FPGA cell in the course notes
is incomplete; refer to Smith’s book for additional material.


5.2.4.1 Standard Timing Equations


    T_PD   = delay from D-inputs to storage element
    T_CLKD = delay from clk-input to storage element
    T_OUT  = delay from storage element to output

    T_SUD  = setup time
           = “slowest D path” − “fastest clk path”
           = T_PD,Max − T_CLKD,Min

    T_HO   = hold time
           = “slowest clk path” − “fastest D path”
           = T_CLKD,Max − T_PD,Min

    T_CO   = delay clk to Q
           = “clk path” + “output path”
           = T_CLKD + T_OUT


5.2.4.2 Hierarchical Timing Equations

Add combinational logic to inputs, clock, and outputs of storage element.
[Figure: a storage element with combinational logic added on the data inputs (delay t′_PD), on
the clock (delay t′_CLKD), and on the output (delay t′_OUT); the storage element itself has the
internal parameters t′_SUD, t′_HO, and t′_CO.]

    T_SUD = T′_SUD + T′_PD,Max − T′_CLKD,Min
    T_HO  = T′_HO + T′_CLKD,Max − T′_PD,Min
    T_CO  = T′_CO + T′_CLKD,Max + T′_OUT,Max


5.2.4.3 Actel Act 2 Logic Cell

Timing analysis of Actel Act 2 logic cell (Smith 5.1.5).


Actel ACT
     • Basic logic cells are called Logic Module
     • ACT 1 family: one type of Logic Module (see Figure 5.1, Smith’s pp. 192)
     • ACT 2 and ACT 3 families: use two different types of Logic Module (see Figure 5.4,
       Smith’s pp. 198)
     • C-Module (Combinatorial Module) — combinational logic similar to ACT 1 Logic Mod-
       ule but capable of implementing five-input logic function
     • S-Module (Sequential Module) — C-Module + Sequential Element (SE) that can be con-
       figured as a flip-flop

Actel Timing
     • ACT family: (see Figure 5.5, Smith’s pp. 200)
     • Simple. Why?
       – Only logic inside the chip
       – Not exact delay (no place and route or physical layout yet, hence interconnect delay is
         not accounted for)
       – Non-deterministic Actel architecture
     • All primed parameters inside the S-Module are assumed; calculate tSUD, tHO, and tCO
     • Of the 3 ns of combinational logic delay, 0.4 ns goes into increasing the setup time, tSUD,
       and 2.6 ns goes into increasing the clock-to-output delay, tCO. From outside we can say
       that the combinational logic delay is buried in the flip-flop setup time


[Schematics: a simple Actel-style latch (d, clk, q); an Actel latch with active-low clear (d, clk,
clr, q); and an Actel flop with active-low clear (d, clk, clr, m, q).]


[Schematic: a C-Module (inputs d00, d01, d10, d11, a1, b1, a0, b0) drives the SE-Module,
which contains the sequential element (internal signals se_clk, se_clk_n, m) with clk and clr
inputs and output q.]
                                     Actel sequential module


5.2.4.4 Timing Analysis of Actel Sequential Module

Timing parameters for Actel latch with active-low clear:

    T_SUD   0.4 ns
    T_HO    0.9 ns
    T_CO    0.4 ns

Other given timing parameters:

    C-Module delay (t′_PD)                        3 ns
    t′_CLKD (from clk to se_clk and se_clk_n)     2.6 ns


  Question: What are the setup, hold, and T_CO times for the entire Actel sequential
    module?


  Answer:


        See Smith pp 199. Use Smith’s eqn 5.15, 5.16, and assume t′_CLKD = 2.6 ns.

         T_SUD   0.8 ns
         T_HO    0.5 ns
         T_CO    3.0 ns
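The answer can be reproduced from the hierarchical equations of section 5.2.4.2. This sketch
assumes t′_OUT = 0, which is consistent with the numbers given; the function name and dict
encoding are illustrative:

```python
# Hierarchical timing of the Actel sequential module (equations from
# section 5.2.4.2).  t_out is assumed 0 because no output delay is given.
def module_timing(sud, ho, co, t_pd, t_clkd, t_out=0.0):
    return {
        "T_SUD": sud + t_pd - t_clkd,  # slowest D path - fastest clk path
        "T_HO":  ho + t_clkd - t_pd,   # slowest clk path - fastest D path
        "T_CO":  co + t_clkd + t_out,  # clk path + output path
    }

# Latch parameters 0.4/0.9/0.4 ns; t'_PD = 3 ns, t'_CLKD = 2.6 ns.
t = module_timing(sud=0.4, ho=0.9, co=0.4, t_pd=3.0, t_clkd=2.6)
print({k: round(v, 3) for k, v in t.items()})  # T_SUD 0.8, T_HO 0.5, T_CO 3.0
```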


5.2.5 Exotic Flop

As a contrast to the gate-level implementations of latches that we looked at previously, the figure
below is the schematic for a state-of-the-art high-performance latch circa 2001.
[Schematic of the exotic flop: inputs d and clk, an inverter chain on the clock, two precharge
nodes each with a keeper, and output q.]

The inverter chain creates an evaluation window in time when clock has just risen and the p tran-
sistors are turned on.

When clock is ’0’, the left precharge node charges to ’1’ and the right precharge node discharges
to ’0’.

If d is ’1’ during the evaluation window, the left precharge node discharges to ’0’. The left
precharge node goes through an inverter to the second precharge node, which will charge from
’0’ to ’1’, resulting in a ’0’ on q.

If d is ’0’ during the evaluation window, the left precharge node stays at the precharge value of
’1’. The left precharge node goes through an inverter to the second precharge node, which will
stay at ’0’, resulting in a ’1’ on q.

The two inverter loops are keepers, which provide energy to keep the precharge nodes at their
values after the evaluation window has passed and the clock is still ’1’.



5.3 Critical Paths and False Paths
5.3.1 Introduction to Critical and False Paths

In this section we describe how to find the critical path through the circuit: the path that limits the
maximum clock speed at which the circuit will work correctly. A complicating factor in finding the


critical path is the existence of false paths: paths through the circuit that appear to be the critical
path, but in fact will not limit the clock speed of the circuit. The reason that a path is false is that
the behaviour of the gates prevents a transition (either 0 → 1 or 1 → 0) from travelling along the
path from the source node to the destination node.


   Definition critical path: The slowest path on the chip between flops or flops and pins.
     The critical path limits the maximum clock speed.


   Definition false path: a path along which an edge cannot travel from beginning to end.


To confirm that a path is a true critical path, and not a false path, we must find a pair of input
vectors that exercise the critical path. The two input vectors usually differ only in their value for the
input signal on the critical path.1 The change on this signal (either 0 → 1 or 1 → 0) must propagate
along the candidate critical path from the input to the output.

Usually the two input vectors will produce different output values. However, a critical path might
produce a glitch (0 → 1 → 0 or 1 → 0 → 1) on the output, in which case the path is still the critical
path, but the two input vectors both result in the same value on the output signal. Glitches should
not be ignored, because they may result in setup violations. If the glitching value is inside the
destination flop or latch at the end of the clock period, then the storage element will not store a
stable value.


Outline      .............................................................................. .


The algorithm that we present comes from a DAC 198? paper by McGeer and Brayton. The
algorithm to find the critical path through a circuit is presented in several parts.

   1. Section 5.3.2: Find the longest path ignoring the possibility of false paths.
   2. Section 5.3.3: Almost-correct algorithm to test whether a candidate critical path is a false
      path.
   3. Section 5.3.4: If a candidate path is a false path, then find the next candidate path, and repeat
      the false-path detection algorithm.
   4. Section 5.3.5: Correct, complete, and complex algorithm to find the critical path in a circuit.
   1
    Section 5.3.5 discusses late-side inputs and situations where more than one input needs to change for the critical
path to be exercised.


Notes     ................................................................................ .


           Note:     The analysis of critical paths and false paths assumes that all inputs
           change values at exactly the same time. Timing differences between inputs are
           modelled by the skew parameter in timing analysis.

Throughout our discussion of critical paths, we will use the delay values for gates shown in the
table below.
                                                gate delay
                                                NOT      2
                                                AND      4
                                                OR       4
                                                XOR      6


5.3.1.1 Example of Critical Path in Full Adder

  Question:        Find the critical path through the full-adder circuit shown below.

[Circuit: full adder with inputs ci, a, and b, internal signals i, j, and k, and outputs s and co.]
  Answer:



        Annotate with Max Distance to Destination          . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..

[Annotated circuit: each signal is labelled with its maximum distance to a destination:
ci = 8, a = 14, b = 14; i = 8, j = 4, k = 4; s = 0, co = 0.]
        Find Candidate Critical Path        .............................................. .


       [The same annotated circuit, with the candidate critical path of length 14 highlighted.]


      There are two paths of length 14: a–co and b–co. We arbitrarily choose
      a–co.


      Test if Candidate is Critical Path            . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..

       [Circuit with ci=’1’ and b=’0’: a rising edge on a propagates through i and k to co.]

      Yes, the candidate path is the critical path.

      The assignment of ci=1, a=0, b=0 followed in the next clock cycle by ci=1,
      a=1, b=0 will exercise the critical path. As a shortcut, we write the pair of
      assignments as: ci=1, a=↑, b=0.


  Question:      Do the input values of ci=0, a=↓, b=1 exercise the critical path?


  Answer:

       [Circuit with ci=’0’ and b=’1’: the edge on a reaches co along the shorter path through
       j, not along the candidate critical path.]


      The alternative does not exercise the critical path. Instead, the alternative
      excitation follows a shorter path, so the output stabilizes sooner.

       Lesson: not all transitions on the inputs will exercise the critical path.
       Using timing simulation to find the maximum clock speed of a circuit might
       overestimate the clock speed, because the input values that you simulate
       might not exercise the critical path.


5.3.1.2 Preliminaries for Critical Paths

There are three classes of paths on a chip:


   • entry path: from an input to a flop

      Quartus does not report this by default. When Quartus reports this path, it is reported as the
      period associated with “System fmax”.
      In Xilinx timing reports this is reported as “Maximum Delay”

   • stage path: from one flop to another flop

      In Quartus timing reports, this is reported as the period associated with “Internal fmax”.
      In Xilinx timing reports, this is reported as “Clock to Setup” and “Maximum Frequency”.

   • exit path: from a flop to an output

      Quartus does not report this by default. When Quartus reports this path, it is reported as the
      period associated with “System fmax”.
      In Xilinx timing reports this is reported as “Maximum Delay”


5.3.1.3 Longest Path and Critical Path

The longest path through the circuit might not be the critical path, because the behaviour of the
gates might prevent an edge (0 → 1 or 1 → 0) from travelling along the path.


Example False Path       .................................................................. .



   Question:     Determine whether the longest path in the circuit below is a false path

                                   [Circuit: inputs a and b, output y. Signal a is a side input to
                                   both an AND gate and an OR gate on the path from b to y.]


  Answer:


      For this example, we use a very naive approach simply to illustrate the
      phenomenon of false paths. Sections 5.3.2–5.3.5 present a better algorithm
      to detect false paths and find the real critical path.

       In the circuit above, the longest path is from b to y.

      The four possible scenarios for the inputs are:


                                      (a = 0,        b = 0 → 1)
                                      (a = 0,        b = 1 → 0)
                                      (a = 1,        b = 0 → 1)
                                      (a = 1,        b = 1 → 0)


       [Four annotated copies of the circuit, one per scenario. In each, the edge from b is
       blocked at either the AND gate (when a = 0) or the OR gate (when a = 1).]

      In each of the four scenarios, the edge is blocked at either the AND gate or
      the OR gate. None of the four scenarios result in an edge on the output y, so
      the path from b to y is a false path.


  Question:     How can we determine analytically that this is a false path?


  Answer:


       The value on a will always force either the AND gate to be a ’0’ (when a is
       ’0’) or the OR gate to be a ’1’ (when a is ’1’). For both a=’0’ and
       a=’1’, a change on b will be unable to propagate to y. The algorithm to
       detect false paths is based upon this type of analysis.
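The scenario enumeration above can be automated. The circuit function below, y = a OR (a AND b),
is an assumed model chosen to match the blocking behaviour described in the answer (a forces the
AND gate to '0' or the OR gate to '1'); the path b–y is false if no scenario lets y change:

```python
# Naive false-path check: for every value of the side input a and both
# edge directions on b, see whether an edge on b ever reaches y.
# The circuit function is an assumed model consistent with the analysis:
# a is a side input to both the AND gate and the OR gate.
def y(a: int, b: int) -> int:
    return a | (a & b)  # y = OR(a, AND(a, b))

def edge_propagates(a: int) -> bool:
    # An edge on b reaches y only if y differs for b=0 and b=1.
    return y(a, 0) != y(a, 1)

is_false_path = not any(edge_propagates(a) for a in (0, 1))
print(is_false_path)  # True: the path from b to y is a false path
```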


Preview of Complete Example             ........................................................ .


This example illustrates all of the concepts in analysing critical paths. Here, we explore the circuit
informally. In section 5.3.5, we will revisit this circuit and analyse it according to the complete,
correct, and complex algorithm.


   Question:     Find the critical path through the circuit below.


                       [Circuit: input a, internal signals b, c, d, e, and f, output g.]



   Answer:


      Even though the equation for this circuit reduces to false, the output signal (g)
      is not a constant ’0’. Instead, glitches can occur on g. To explore the
      behaviour of the circuit, we will stimulate the circuit first with a falling edge,
      then a rising edge.

      Stimulate the circuit with a falling edge and see which path the edge follows.
       [Annotated circuit: the falling edge arrives at b, d, and e at times 0, 2, 4, and 6; the side
       input at g has already settled (time 0) before the edge arrives on e.]


      The longest path through the circuit is the middle path.

      At g, the side input (a) has a controlling value before the falling edge arrives
      on the path input (e). Thus, a falling edge is unable to excite the longest path
      through the circuit.

      Stimulate the circuit with a rising edge and see which path the edge follows.
       [Annotated circuit for the rising edge: the side input c at f settles before the edge arrives
       on the path input, and the output g settles at time 10.]

      At f, the side input c has a controlling value before the falling edge arrives on
      the path input (e). Thus, a rising edge is unable to excite the longest path
      through the circuit.


      Of the two scenarios, the falling edge follows a longer path through the circuit
      than the rising edge. The critical path is the lower path through the circuit.

      When we develop our first algorithm to detect false paths (section 5.3.3), we
      will assume that at each gate, the input that is on the critical path will arrive
      after the other inputs. Not all circuits satisfy the assumption. At f, when a is a
      falling edge, the path input (c) arrives before the side input e. This
      assumption is removed in section 5.3.5, where we present the complete
      algorithm by dealing with late-arriving side inputs.


5.3.1.4 Timing Simulation vs Static Timing Analysis

The delay through a component is usually dependent upon the values on signals. This is because
different paths in the circuit have different delays and some input values will prevent some paths
from being exercised. Here are two simple examples:


   • In a ripple-carry adder, if a carry out of the MSB is generated from the least significant bit,
     then it will take longer for the output to stabilize than if no carries are generated at all.

   • In a state machine using a one-hot state encoding, false paths might exist when more than
     one state bit is a ’1’.


Because of these effects, static timing analysis might be overly conservative and predict a delay
that is greater than you will experience in practice. Conversely, a timing simulation may not
demonstrate the actual slowest behaviour of your circuit: if you don’t ever generate a carry from
LSB to MSB, then you’ll never exercise the critical path in your adder. The most accurate delay
analysis requires looking at the complete set of actual data values that will occur in practice.


5.3.2 Longest Path

The following is an algorithm to find the longest path from a set of source signals to a set of
destination signals. We first provide a high-level, intuitive, description, and then present the actual
algorithm.


Outline of Algorithm to Find Longest Path         ............................................ .


The basic idea is to annotate each signal with the maximum delay from it to an output.
• Start at destination signals and traverse through fanin to source signals.
  – Destination signals have a delay of 0
  – At each gate, annotate the inputs by the delay through the gate plus the delay of the output.


     – When a signal fans out to multiple gates, annotate the output of the source (driving) gate with
       the maximum delay of the destination signals.
• The primary input signal with the maximum delay is the start of the longest path. The delay
  annotation of this signal is the delay of the longest path.
• The longest path is found by working from the source signal to the destination signals, picking
  the fanout signal with the maximum delay at each step.


Algorithm to Find Longest Path                  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..


     1. Set current time to 0
     2. Start at destination signals
     3. For each input to a gate that drives a destination signal, annotate the input with the current
        time plus the delay through the gate
     4. For each gate that has times on all of its fanout signals but not a time for itself,
         (a) annotate each input to the gate with the maximum time on the fanout plus the delay
             through the gate
         (b) repeat step 4 until every gate has been annotated

     5. To find the longest path, start at the source node that has the maximum delay. Work forward
        through the fanout. For signals that fanout to multiple signals, choose the fanout signal with
        the maximum delay.
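As a sketch, the algorithm above can be written as follows. The gate-list representation
(dictionaries with an output signal, input signals, and a delay) is an illustrative assumption,
not the book's notation:

```python
def longest_path(gates, sources, destinations):
    """Annotate each signal with its maximum delay to a destination
    (steps 1-4), then walk forward from the worst source (step 5)."""
    fanout = {}                              # signal -> gates it feeds
    for g in gates:
        for s in g["inputs"]:
            fanout.setdefault(s, []).append(g)

    delay = {d: 0 for d in destinations}     # destinations have delay 0
    changed = True
    while changed:                           # relax until no annotation moves
        changed = False
        for g in gates:
            if g["output"] in delay:
                t = delay[g["output"]] + g["delay"]
                for s in g["inputs"]:
                    if delay.get(s, -1) < t: # keep the max over all fanout
                        delay[s] = t
                        changed = True

    start = max(sources, key=lambda s: delay.get(s, 0))
    path = [start]
    while path[-1] not in destinations:      # follow the gate that accounts
        g = max(fanout[path[-1]],            # for this signal's annotation
                key=lambda gt: delay.get(gt["output"], 0) + gt["delay"])
        path.append(g["output"])
    return delay[start], path

# Toy circuit (hypothetical): AND(a,b)->x with delay 2, OR(x,c)->y with
# delay 3. The longest path is a, x, y with delay 5.
gates = [
    {"output": "x", "inputs": ["a", "b"], "delay": 2},
    {"output": "y", "inputs": ["x", "c"], "delay": 3},
]
```

On the toy circuit, `longest_path(gates, ["a", "b", "c"], ["y"])` returns delay 5 and the path a, x, y.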


Longest Path Example            .. . . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. ..



     Question:     Find the longest path through the circuit below.

      [Figure: circuit with primary inputs a, b, c, internal signals d, e, f, g, h, i, and
      outputs l and m (via j and k)]



     Answer:


        Annotate signals with the maximum delay to an output:


      [Figure: the circuit annotated with each signal's maximum delay to an output:
      a=16, b=12, c=10, d=14, e=8, f=12, g=12, h=4, i=4, j=0, k=0]

      Find longest path:
      [Figure: the same annotated circuit with the longest path a, d, f, g, i, k highlighted]


      The path from a to k (a, d, f, g, i, k) has a delay of 16.


5.3.3 Detecting a False Path

In this section, we explore a simple and almost-correct algorithm to determine whether a path is a
false path. The algorithm sometimes gives incorrect results if the candidate path intersects false
paths; for all of the example circuits in this section, it gives the correct result. The purpose of
presenting this almost-correct algorithm is that it is relatively easy to understand and introduces
one of the key concepts used in the complicated, correct, and complete algorithm for finding the
critical path in section 5.3.5.


5.3.3.1 Preliminaries for Detecting a False Path

The controlling value of a gate is the value such that if one of the inputs has this value, the output
can be determined independently of the other inputs.

For an AND gate, the controlling value is ’0’, because when one of the inputs is a ’0’, we know
that the output will be ’0’ regardless of the values of the other inputs.

The controlled output value is the value produced by the controlling input value.

                         Gate     Controlling Value       Controlled Output
                         AND            ’0’                     ’0’
                          OR            ’1’                     ’1’
                         NAND           ’0’                     ’1’
                         NOR            ’1’                     ’0’
                         XOR            none                    none
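The table above can be captured directly in code. A minimal sketch (the dictionary
representation is illustrative):

```python
# Controlling value and controlled output value for each gate type,
# as in the table above. XOR has no controlling value, so it is absent.
CONTROLLING = {"AND": "0", "OR": "1", "NAND": "0", "NOR": "1"}
CONTROLLED_OUTPUT = {"AND": "0", "OR": "1", "NAND": "1", "NOR": "0"}

def output_if_controlled(gate, inputs):
    """If some input carries the controlling value, return the controlled
    output value; otherwise return None (output depends on other inputs)."""
    c = CONTROLLING.get(gate)          # None for XOR
    if c is not None and c in inputs:
        return CONTROLLED_OUTPUT[gate]
    return None
```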


Path Input, Side Input        .................................................................



   Definition path input: For a gate on a path (either a candidate critical path, or a real
     critical path), the path input is the input signal that is on the path.


   Definition side input: For a gate on a path (either a candidate critical path, or a real
     critical path), the side inputs are the input signals that are not on the path.


The key idea behind the almost-correct algorithm is that: for an edge to propagate along a path,
the side inputs to each gate on the path must have non-controlling values. The complete, correct,
and complicated algorithm generalizes this constraint to handle circuits where the side inputs are
on false paths.


Reconvergent Fanout       ..................................................................



   Definition reconvergent fanout: A signal has reconvergent fanout when two or more paths
     from its fanout meet again (reconverge) at a later gate.


Most of the difficulties both with critical paths and with testing circuits for manufacturing faults
(Chapter 7) are caused by reconvergent fanout.
                          [Figure: circuit with inputs a, b, c and internal signals d, e, f;
                          gates g and h drive outputs y and z; the fanout of a reconverges
                          at y and the fanout of d reconverges at z]

There are two sets of reconvergent paths in the circuit above. One set of reconvergent paths goes
from a to y and one set goes from d to z.

If a candidate path has reconvergent fanout, then the rising or falling edge on the input to the path
might cause a side input along the path to have a rising or falling edge, rather than a stable ’0’ or
’1’.

To support reconvergent fanout, we extend the rule for side inputs having non-controlling values
to say that side inputs must have either non-controlling values or have edges that stabilize in non-
controlling values.
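This extended rule can be stated compactly in code. A minimal sketch (gate names and value
encodings are illustrative) that classifies a side input by the value it finally settles at:

```python
# Final (settled) value of a side input, and the controlling value of
# each gate type. A side input blocks propagation of the path edge when
# it settles at the gate's controlling value.
FINAL_VALUE = {"0": "0", "1": "1", "rising": "1", "falling": "0"}
CONTROLLING = {"AND": "0", "OR": "1", "NAND": "0", "NOR": "1"}

def edge_propagates(gate, side_input):
    """True if an edge on the path input can reach the gate's output,
    given one side input's behaviour (XOR and NOT have no controlling
    value, so they never block the edge)."""
    c = CONTROLLING.get(gate)              # None for XOR and NOT
    return c is None or FINAL_VALUE[side_input] != c
```

For example, a falling edge on an AND gate's side input settles at '0', the controlling value, so the path edge is blocked; a rising edge settles at '1' and the edge propagates.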


Rules for Propagating an Edge Along a Path            . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..


These rules assume that side inputs arrive before path inputs. Section 5.3.5 relaxes this constraint.
    [Figure: edge-propagation rules for NOT, AND, OR, and XOR gates. An edge on the path
    input propagates through a gate when the side input holds its non-controlling value
    ('1' for AND, '0' for OR); a NOT gate inverts the edge; for XOR, the edge propagates
    for either side-input value, with its direction depending on that value]




   Question: Why do the rules not have falling edges for AND gates or rising edges for
     OR gates on the side input?



   Answer:

                                   [Figure: an AND gate with inputs a, b and output c, and
                                   waveforms showing that a falling edge on one input
                                   forces the output to fall]


       For an AND gate, a falling edge on a side input will force the output to change
       and prevent the path input from affecting the output, because the final value
       of a falling edge ('0') is the controlling value for an AND gate. Similarly, for
       an OR gate, the final value of a rising edge ('1') is the controlling value for
       the gate.


Analyzing Rules for Propagating Edges                  ............................................... .


The pictures below show all combinations of output edge (rising or falling) and input values
(constant 1, constant 0, rising edge, falling edge) for AND and OR gates. These pictures assume
that the side input arrives before the path input. The pictures that are crossed out illustrate
situations that prevent the path input from affecting the output: in these situations, either the
inputs cause a constant value on the output, or the side input affects the output but the path input
does not. The pictures that are not crossed out correspond to the rules above for pushing edges
through AND and OR gates.

    [Figure: all combinations of a path-input edge with side-input values (constant '0',
    constant '1', rising, falling) for AND and OR gates; the crossed-out AND cases are
    labelled "constant 0 output" and "0 is controlling", and the crossed-out OR cases are
    labelled "constant 1 output" and "1 is controlling"]



Viability Condition of a Path       . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . . .. . . . . .. . . ..



   Definition Viability condition: For a path (p) through a circuit, the viability condition
     (sometimes called the viability constraint) is a Boolean expression in terms of the
     input signals that defines the cases where an edge will propagate along the path.
     Equivalently: the cases where a transition on the primary input to the path will excite
     the path.


Based upon the rules for propagating an edge that we have seen so far, the viability condition for
a path is: every side input has a non-controlling value. As always, section 5.3.5 has the complete
viability condition.


5.3.3.2 Almost-Correct Algorithm to Detect a False Path

The rules above for propagating an edge along a candidate path assume that the values on side
inputs always arrive before the value on the path input. This is always true when the candidate
path is the longest path in the circuit. However, if the longest path is a false path, then when we are
testing subsequent candidate paths, there is the possibility that a side input will be on a false path
and the side input value will arrive later than the value from the path input.

This almost-correct algorithm assumes that values on side inputs always arrive before values on
path inputs. The correct, complex, and complete critical path algorithm in section 5.3.5 extends
the almost correct algorithm to remove this assumption.

To determine if a path through a circuit is a false path:

   1. Annotate each side input along the path with its non-controlling value. These annotations
      are the constraints that must be satisfied for the candidate path to be exercised.
   2. Propagate the constraints backward from the side inputs of the path to the inputs of the circuit
      under consideration.
   3. If there is a contradiction amongst the constraints, then the candidate path is a false path.
   4. If there is no contradiction, then the constraints on the inputs give the conditions under which
      an edge will traverse along the candidate path from input to output.
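Steps 2 and 3 can be sketched in code. Here each side-input constraint is assumed to have
already been propagated back to required values on primary inputs (the representation is
illustrative; a full implementation performs the backward propagation through the gates):

```python
def is_false_path(side_input_constraints):
    """side_input_constraints: one dict per side input, mapping a primary
    input name to the value it must hold. Returns True when the combined
    constraints contradict, i.e. the candidate path is a false path."""
    required = {}
    for constraint in side_input_constraints:
        for signal, value in constraint.items():
            # Contradiction: the same input required to be both 0 and 1.
            if required.setdefault(signal, value) != value:
                return True
    return False
```

In the style of the first example below: if g[b] requires b=1, i[e] requires c=1, and k[h] requires b=0, the contradiction on b marks the candidate path as false.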


5.3.3.3 Examples of Detecting False Paths

False-Path Example 1       ................................................................ .



   Question:      Determine if the longest path in the circuit below is a false path.

      [Figure: the annotated circuit from section 5.3.2, with longest path a, d, f, g, i, k
      and delay 16]




   Answer:


      Compute constraints for side inputs to have non-controlling values:


      [Figure: the circuit with each side input along the candidate path annotated with its
      required non-controlling value ('1' at g, '0' at i, '1' at k); the two annotations
      driven from b are contradictory]

                        side input non-controlling value                constraint
                           g[b]             1                               b
                           i[e]             0                               c
                           k[h]             1                               b′

      Found a contradiction: g[b] requires b = 1 while k[h] requires b = 0, therefore
      the candidate path is a false path.

      Analyze cause of contradiction:
      [Figure: the circuit, highlighting that the side inputs at g and k are both driven
      from b, through paths that differ in polarity]

                                                            These side inputs will always have
                                                            opposite values. Both side inputs
                                                            feed the same type of gate (AND),
                                                            so it will always be the case that one
                                                            of the side inputs has a controlling
                                                            value (0).



False-Path Example 2            ................................................................ .



  Question:     Determine if the longest path through the circuit below is a critical path.
    If the longest path is a critical path, find a pair of input vectors that will exercise the
    path.


                        [Figure: circuit with primary inputs a, b, c, internal signals
                        d, e, f, g, and output h]


   Answer:

                        [Figure: the circuit with side inputs annotated with their required
                        non-controlling values: '1' at e, '0' at g, '1' at h]

                      side input non-controlling value           constraint
                         e[a]             1                          a
                         g[b]             0                          b
                         h[f]             1                        a+b

      The complete constraint is the conjunction of the three side-input constraints,
      which reduces to false. Therefore, the candidate path is a false path.
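A reduction like this can be checked mechanically by enumerating all input assignments. The
sketch below uses one consistent reading of the three constraints, a · b′ · (a′ + b); the
exact complements are an assumption, since the overbars in the printed constraints do not
survive in plain text:

```python
from itertools import product

def viable(a, b):
    # Conjunction of the three side-input constraints (assumed
    # polarities): a AND (NOT b) AND ((NOT a) OR b).
    return a and (not b) and ((not a) or b)

# The conjunction is false for every assignment, so no input vector can
# exercise the candidate path: it is a false path.
```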


False-Path Example 3        ................................................................ .


This example illustrates a candidate path that is a true path.


   Question: Determine if the longest path through the circuit below is a critical path. If
     the longest path is a critical path, find a pair of input vectors that will exercise the
     path.


                        [Figure: circuit with primary inputs a, b, c, internal signals
                        d, e, f, g, and output h]



   Answer:


      Find longest path; label side inputs with non-controlling values:
                        [Figure: the circuit with the longest path c, e, g, h highlighted
                        and side inputs annotated with their required non-controlling
                        values: '0' at e, '0' at g, '1' at h]


      Table of side inputs, non-controlling values, and constraints on primary inputs:

                      side input non-controlling value             constraint
                         e[a]             0                            a′
                         g[b]             0                            b′
                         h[b]             1                          a′ + b′

      The complete constraint is a′ b′ (a′ + b′), which reduces to a′ b′. Thus, for an edge
      to propagate along the path, a must be '0' and b must be '0'.

      The primary input to the path (c) does not appear in the constraint, thus both
      rising and falling edges will propagate along the path. If the primary input to
      the path appears with a positive polarity (e.g., c) in the constraint, then only a
      rising edge will propagate. Conversely, if the primary input appears negated
      (e.g., c′), then only a falling edge will propagate.

                              Critical path     c, e, g, h
                              Delay             14
                              Input vector      a=0, b=0, c=rising edge

      Illustration of rising edge propagating along path:
       [Figure: a rising edge on c propagating along the path c, e, g, h, with a='0',
       b='0', d='1', f='1']


      Illustration of falling edge propagating along path:
       [Figure: a falling edge on c propagating along the same path, with the same
       input values]



False-Path Example 4       ................................................................ .


This example illustrates reconvergent fanout.


  Question: Determine if the longest path through the circuit below is a critical path. If
    the longest path is a critical path, find a pair of input vectors that will exercise the
    path.


                      [Figure: circuit with primary inputs a and b, internal signals
                      c, d, e, f, and output g; the fanout of a reconverges at g]



  Answer:

                      [Figure: the circuit with side inputs annotated with their required
                      non-controlling values: '1' on e's side input b and '1' on g's side
                      input d]


                    side input non-controlling value          constraint
                       e[b]             1                         b
                       g[d]             1                         a

     The complete constraint is ab.

     The constraint includes the input to the path (a), which indicates that not all
     edges will propagate along the path. The polarity of the path input indicates
     the final value of the edge. In this case, the constraint of a means that we
     need a rising edge.

                               Critical path a, c, e, f, g
                               Delay         12
                               Input vector a=rising edge, b=1

     Illustration of rising edge propagating along path:
      [Figure: a rising edge on a propagating along the path a, c, e, f, g, with b='1']


     If we try to propagate a falling edge along the path, the falling edge on the
     side input d forces the output g to fall before the arrival of the falling edge on
     the path input f. Thus, the edge does not propagate along the candidate
     path.


      [Figure: a falling edge on a failing to propagate: the falling edge reaching g
      through the side input d forces g low before the edge arrives through f]




Patterns in False Paths      ............................................................... .


After analyzing these examples, you might have begun to observe some patterns in how false paths
arise. There are several patterns in the types of reconvergent fanout that lead to false paths. For
example, if the candidate path has an OR gate and an AND gate that are both controlled by the
same signal, and there is an even number of inverters between these gates, then the candidate path
is almost certainly a false path. The reason is the same as illustrated in the first example of a false
path: the side input will always have a controlling value for either the OR gate or the AND gate.


5.3.4 Finding the Next Candidate Path

If the longest path is a false path, we need to find the next longest path in the circuit, which will be
our next candidate critical path. If this candidate also fails, we continue with the next longest of
the remaining paths, and so on, until we find a true path.


5.3.4.1 Algorithm to Find Next Candidate Path

To find the next candidate path, we use a path table, which keeps track of the partial paths that
we have explored, their maximum potential delay, and the signals that we can follow to extend a
partial path toward the outputs. We keep the path table sorted by the maximum potential delay of
the paths. We delete a path from the table if we discover that it is a false path.

The key to the path table is how to update the potential delay of the partial paths after we discover
a false path. All partial paths that are prefixes of the false path will need to have their potential
delay values recomputed. The updated delay is found by following the unexplored signals in the
fanout of the end of the partial path.
   1. Initialize path table with primary inputs, their potential delay, and fanout.
   2. Sort path table by potential delay (path with greatest potential delay at bottom of table)
   3. If the partial path with the maximum potential delay has just one unused fanout signal,
      then extend the partial path with this signal.
      Otherwise:
       (a) Create a new entry in the path table for the partial path extended by the unused fanout
           signal with the maximum potential delay.


      (b) Delete this fanout signal from the list of unused fanout signals for the partial path.

   4. Compute the constraint that the side input of the new signal does not have a controlling
      value, and update the constraint table.
  5. If the new constraint does not cause a contradiction,
     then return to step 3.
     Otherwise:
      (a) Mark this partial path as false.
      (b) For each partial path that is a prefix of the false path:
          • reduce the potential delay of the path by the difference between the potential delay
            of the fanout that was followed and the unused fanout with next greatest delay value.
      (c) Return to step 2
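For intuition, the sketch below does the same job as the path table by brute force for small
circuits: enumerate every complete path, sort by delay, and test the candidates in order.
(The fanout and per-gate delay dictionaries are illustrative; the path-table algorithm above
avoids materializing every path.)

```python
def candidate_paths(sources, fanout, gate_delay):
    """Return (delay, path) for every source-to-output path, longest
    first. fanout: signal -> fanout signals (empty list at outputs);
    gate_delay: (signal, next_signal) -> delay of the gate between them."""
    def walk(path, d):
        outs = fanout.get(path[-1], [])
        if not outs:                         # reached a primary output
            yield d, path
        for nxt in outs:
            yield from walk(path + [nxt], d + gate_delay[(path[-1], nxt)])
    paths = [p for s in sources for p in walk([s], 0)]
    return sorted(paths, key=lambda dp: -dp[0])

# Toy circuit (hypothetical): a feeds x; x fans out to outputs y and z.
fanout = {"a": ["x"], "x": ["y", "z"], "y": [], "z": []}
gate_delay = {("a", "x"): 2, ("x", "y"): 3, ("x", "z"): 1}
```

The first entry returned by `candidate_paths(["a"], fanout, gate_delay)` is the longest path; each candidate would then be checked with the false-path test of section 5.3.3 until a true path is found.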


5.3.4.2 Examples of Finding Next Candidate Path

Next-Path Example 1      ..................................................................



  Question: Starting from the initial delay calculation and longest path, find the next
    candidate path and test if it is a false path.

      [Figure: the circuit annotated with each signal's maximum delay to an output:
      a=16, b=12, c=10, d=14, e=8, f=12, g=12, h=4, i=4, j=0, k=0]



  Answer:


     Initial state of path table:
                                    potential   unused
                                    delay       fanout       path
                                    10             e         c
                                    12            h, g       b
                                    16             d         a

      Extend the path with the maximum potential delay until we find a contradiction
      or reach the end of the path. Add an entry in the path table for each
      intermediate path that has multiple signals in its fanout.


      Path table and constraint table after detecting that the longest path is a false
      path:

                            potential   unused
                            delay       fanout     path
                            10              e      c
                            12            h, g     b
                            16             j, i    a, d, f, g
                            false                  a, d, f, g, i, k

                     side input non-controlling value          constraint
                        g[b]             1                         b
                        i[e]             0                         c
                        k[h]             1                         b′

      The longest path is a false path. Recompute potential delay of all paths in
      path table that are prefixes of the false path.

      The one path that is a prefix of the false path is: a,d,f,g . The remaining
      unused fanout of this path is j, which has a potential delay on its input of 2.
      The previous potential delay of g was 8, thus the potential delay of the prefix
      reduces by 8 − 2 = 6, giving the path a potential delay of 16 − 6 = 10.

      Path table after updating with new potential delays:

                            potential   unused
                            delay       fanout     path
                            false                  a, d, f, g, i, k
                            10              e      c
                            10               i     a, d, f, g
                            12             h, g    b

      Extend b through g, because g has greater potential delay than the other
      fanout signal (h).

                            potential   unused
                            delay       fanout     path
                            false                  a, d, f, g, i, k
                            10               e     c
                            10                i    a, d, f, g
                            12             h, g    b
                            12              i, j   b, g

                    side input non-controlling value          constraint
                       g[a]             1                         a

      From g, we will follow i, because it has greater potential delay than j.


                                 potential       unused
                                 delay           fanout         path
                                 false                          a, d, f, g, i, k
                                 10                 e           c
                                 10                  i          a, d, f, g
                                 12               h, g          b
                                 12                i, j         b, g
                                 12                             b, g, i, k

                     side input non-controlling value                       constraint
                        g[a]             1                                      a
                        i[e]             0                                      c
                        k[h]             1                                      b′

      We have reached an output without encountering a contradiction in our
      constraints. The complete constraint is a b′ c.
                             Critical path       b, g, i, k
                             Delay               12
                             Input vector        a=1, b=falling edge, c=1

     Illustrate the propagation of a falling edge:
      [Figure: a falling edge on b propagating along the path b, g, i, k, with a='1'
      and c='1'; the side input h rises, producing a glitch on k]



     At k, the rising edge on the side input (h) arrives before the falling edge on
     the path input (i). For a brief moment in time, both the side input and path
     input are ’1’, which produces a glitch on k.


Next-Path Example 2     ..................................................................


   Question:    Find the critical path in the circuit below

        [Figure: circuit with primary inputs a, b, c, d, e, internal signals f, g, h, i, j, l,
        and outputs k and m]
358                                                                CHAPTER 5. TIMING ANALYSIS


  Answer:


      Find the longest path:
        (Figure: the circuit annotated with maximum potential delays: 22 at d, 20 at c,
        14 at b, 10 at a, 4 at e, decreasing along each path to 0 at the outputs k and m.)

      Initial state of path table:

                                     potential     unused
                                     delay         fanout        path
                                     4                k          e
                                     10              j, l        a
                                     14                i         b
                                     20               g          c
                                     22                f         d

      Extend the path with the maximum potential delay until we find a contradiction
      or reach the end of the path. Add an entry to the path table for each
      intermediate path with multiple fanout signals.
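The path table behaves like a priority queue ordered by potential delay: at each step we pop the partial path with the largest potential delay and extend it. A minimal sketch in Python; the tuple encoding of paths and fanout lists is hypothetical, not the book's notation:

```python
import heapq

# Each entry is (negated potential delay, path, unused fanout signals).
# Negating the delay makes Python's min-heap act as a max-heap.
table = []
for delay, path, fanout in [(4, ("e",), ("k",)),
                            (10, ("a",), ("j", "l")),
                            (14, ("b",), ("i",)),
                            (20, ("c",), ("g",)),
                            (22, ("d",), ("f",))]:
    heapq.heappush(table, (-delay, path, fanout))

def next_candidate(table):
    """Pop the partial path with the greatest potential delay."""
    neg_delay, path, fanout = heapq.heappop(table)
    return -neg_delay, path, fanout

delay, path, fanout = next_candidate(table)
print(delay, path, fanout)   # 22 ('d',) ('f',): extend d through f first
```

This mirrors the table above: the entry for d, with potential delay 22, is the first candidate to extend.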

                             potential unused
                             delay     fanout           path
                             4             k            e
                             10           j, l          a
                             14             i           b
                             20            g            c
                             22          j, k           d, f, g, h, i
                             false                      d, f, g, h, i, j, l

                     side input non-controlling value                     constraint
                        g[c]             1                                    c
                        i[b]             0                                   ¬b
                        j[a]             0                                   ¬a
                        l[a]             1                                    a

      Contradiction between j[a] and l[a]; therefore the path d, f, g, h, i, j, l is
      a false path, and any path that extends it is also false.
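Detecting such a contradiction amounts to checking whether the side-input constraints force some signal to be both 0 and 1. A minimal sketch, using a hypothetical (signal, required value) encoding of the constraint column:

```python
def is_false_path(constraints):
    """Return True when the side-input constraints are contradictory,
    i.e. some signal is required to be both 0 and 1."""
    required = {}
    for signal, value in constraints:
        # setdefault records the first requirement; a later, different
        # requirement on the same signal is a contradiction.
        if required.setdefault(signal, value) != value:
            return True
    return False

# Candidate d, f, g, h, i, j, l: g[c] needs c=1, i[b] needs b=0,
# j[a] needs a=0, and l[a] needs a=1 -- signal a contradicts itself.
print(is_false_path([("c", 1), ("b", 0), ("a", 0), ("a", 1)]))   # True
print(is_false_path([("c", 1), ("b", 0), ("e", 0)]))             # False
```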

      To find next candidate, begin by recomputing delays along the candidate
      path. The second gate in the contradiction is l. The last intermediate path
      before l with unused fanout is i. Cut the candidate path at this signal. The


     remaining initial part of the candidate path is: d, f, g, h, i. The only unused
     fanout of this path is k.
     We now calculate the new maximum potential delay of d, f, g, h, i, taking
     into account the false path that we just discovered. The delay from i along
     the candidate path j, l, m is 10 and the maximum potential delay along the
     remaining unused fanout (k) is 4. The difference is 10 − 4 = 6, and so the
     potential delay of d, f, g, h, i is reduced to 22 − 6 = 16.
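The delay update can be captured in a one-line helper. This is a sketch of the arithmetic only, with hypothetical parameter names:

```python
def reduced_delay(old_delay, false_branch_delay, best_unused_delay):
    """Shrink a prefix's potential delay after one of its branches is
    found to be false.  Both branch delays are measured from the signal
    where the candidate path was cut."""
    return old_delay - (false_branch_delay - best_unused_delay)

# From the example: the false branch j, l, m contributes 10 from i, the
# only remaining fanout k contributes 4, so d, f, g, h, i drops 22 -> 16.
print(reduced_delay(22, 10, 4))   # 16
```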
     After updating the potential delay of d, f, g, h, i, the partial path with the
     maximum potential delay is c. The new critical path candidate will be: c, f, g,
     h, i, j, l, m.
     Update the path table with a delay of 16 for the previous candidate path.
     Extend c along the path with maximum potential delay until we find a
     contradiction or reach the end of the path, adding an entry to the path table
     for each intermediate path with multiple fanout signals.
                          potential unused
                          delay     fanout       path
                          false                  d, f, g, h, i, j, l
                          4            k         e
                          10          j, l       a
                          14            i        b
                          16           k         d, f, g, h, i
                          20           k         c, f, g, h, i
                          false                  c, f, g, h, i, j, l
     We encounter the same contradiction as with the previous candidate, and so
     we have another false path. We could have detected this false path without
     working through the path table, if we had recognized that our current
     candidate path overlaps with the section (j, l) of previous candidate that
     caused the false path.
     As with the previous candidate, we reduce the potential delay of the current
     candidate path up through i by 6, giving us a potential delay of
     20 − 6 = 14 for c, f, g, h, i. The next candidate path is d, f, g, h, i, k
     with a delay of 16.
                          potential unused
                          delay     fanout       path
                          false                  d, f, g, h, i, j, l
                          false                  c, f, g, h, i, j, l
                          4            k         e
                          10          j, l       a
                          14            i        b
                          14           k         c, f, g, h, i
                          16           k         d, f, g, h, i


      We extend the path through k and compute the constraint table.

                     side input non-controlling value                                 constraint
                        g[c]             1                                                c
                        i[b]             0                                               ¬b
                        k[e]             0                                               ¬e

      The complete constraint is ¬b ∗ c ∗ ¬e. There is no constraint on a, and d may
      be either a rising edge or a falling edge.

                      Critical path d, f, g, h, i, k
                      Delay         16
                      Input vector a=0, b=0, c=1, d=rising edge, e=0


Next-Path Example 3         .................................................................


  Question:     Find the critical path in the circuit below.

                       (Figure: circuit with inputs a, b, c, d; internal gates e, f, g, h, i,
                       j, k, l, n, o; and outputs m and p.)


  Answer:

                       (Figure: the circuit annotated with potential delays: 16 at c, 14 at b,
                       12 at a, 8 at d, decreasing to 0 at the outputs m and p.)
      Initial state of path table:


                               potential    unused
                               delay        fanout   path
                               8              n, o   d
                               12             j, k   a
                               14               e    b
                               16               f    c

     Extend c through f:

                           potential    unused
                           delay        fanout    path
                           8              n, o    d
                           12             j, k    a
                           14               e     b
                           16            m, n     c,f,g,h,i
                           false                  c,f,g,h,i,n,p

                    side input non-controlling value         constraint
                       n[d]             1                        d
                       p[o]             1                       ¬d

     The first candidate is a false path. Recompute potential delay of c, f, g, h, i,
     which reduces it from 16 to 12.
                           potential    unused
                           delay        fanout    path
                           false                  c,f,g,h,i,n,p
                           8               n, o   d
                           12              j, k   a
                           12               m     c,f,g,h,i
                           14                e    b

     Extend b through e:

                           potential    unused
                           delay        fanout    path
                           false                  c,f,g,h,i,n,p
                           8               n, o   d
                           12              j, k   a
                           12               m     c,f,g,h,i
                           false                  b,e,k,l

                    side input non-controlling value         constraint
                       k[a]             1                        a
                        l[j]            1                       ¬a


      The second candidate is a false path. There is no unused fanout signal from
      l for the path b, e, k, l, so this partial path is a false path and there is no new
      delay information to compute.

       There are two paths with a potential delay of 12. Choose c, f, g, h, i,
       because the end of the path is closer to an output, so there will be less work
       to do in analyzing the path.

                                potential    unused
                                delay        fanout      path
                                false                    c,f,g,h,i,n,p
                                false                    b,e,k,l
                                8               n, o     d
                                12              j, k     a
                                12                       c,f,g,h,i,m

                   side input     non-controlling value   constraint
                      m[l]                 0              ¬(¬a ∗ (ab)) = true

                          Critical path   c,f,g,h,i,m
                          Delay           12
                          Input vector    a=0, b=1, c=rising edge, d=0


5.3.5 Correct Algorithm to Find Critical Path

In this section, we remove the assumption that values on side inputs always arrive earlier than the
value on the path input. We now deal with late arriving side inputs, or simply “late side inputs”.

The presentation of late side inputs is as follows:
Section 5.3.5.1 gives the rules for how late side inputs can allow path inputs to exercise gates.
Section 5.3.5.2 introduces the idea of monotone speedup, which underlies some of the rules.
Section 5.3.5.3 examines one of the potentially confusing situations in detail.
Section 5.3.5.4 presents the complete, correct, and complex algorithm.
Section 5.3.5.5 works through examples.


5.3.5.1 Rules for Late Side Inputs

For each gate, there are eight situations: the side input is controlling or non-controlling, the path
input is controlling or non-controlling, and the side input arrives early or arrives late.


                  side=non-ctrl      side=non-ctrl      side=CTRL          side=CTRL
                  path=CTRL          path=non-ctrl      path=CTRL          path=non-ctrl

 Early side       path input         path input         side input         neither input
                  causes glitch      propagates         propagates         propagates

 Late side        monotone           monotone           path input         side input
                  speedup            speedup            propagates         causes glitch


                  side=non-ctrl      side=non-ctrl      side=CTRL          side=CTRL
                  path=non-ctrl      path=CTRL          path=CTRL          path=non-ctrl

 Early side       path input         path input         side input         neither input
                  propagates         causes glitch      propagates         propagates

 Late side        monotone           monotone           path input         side input
                  speedup            speedup            propagates         causes glitch


Late side inputs give us three more situations for each of AND and OR gates where the path input
will or might excite the gate. In the two cases labeled monotone speedup, the path input does not
excite the gate with the current timing; but if our timing estimates for the side input are too slow,
or the side input speeds up due to voltage or temperature variations, then the late side input
might become an early side input.

The five situations where the path input excites the gate are:
side is early
       side=non-ctrl, path=non-ctrl The path input is the later of the two inputs to transition to
           a non-controlling value, so it is the one that causes the output to transition.
       side=non-ctrl, path=ctrl The side input transitions to a non-controlling value while the
           path input is a non-controlling value; this causes the output to transition to a
           non-controlled value. The path input then transitions to a controlling value, causing a
           glitch on the output as it transitions to a controlled value.
side is late
       side=non-ctrl, path=non-ctrl If the side input arrives earlier than expected, then we will
            have an early arriving side input with a non-controlling value.
       side=non-ctrl, path=ctrl If the side input arrives earlier than expected, then we will have
            an early arriving side input with a non-controlling value.
       side=ctrl, path=ctrl The path input transitions to a controlling value before the side
            input; so, it is the input that causes the output to transition.


The three situations where the path input does not excite the gate are:
side is early
       side=ctrl, path=ctrl The side input transitions to a controlling value before the path input
            transitions to a controlling value. The edge on the path input does not propagate to the
            output.
       side=ctrl, path=non-ctrl It is always the case that at least one of the inputs is a
            controlling value, so the output of the gate is a constant controlled value.
side is late
       side=ctrl, path=non-ctrl The path input transitions to a non-controlling value while the
            side input is still non-controlling. This causes the output to transition to a
            non-controlled value. The side input then transitions to a controlling value, which
            causes the glitch as the output transitions to a controlled value. The second edge of the
            glitch is caused by the side input, so the side input determines the timing of the gate.

Combining the five situations where the path input excites the gate gives us our complete and
correct rule: a path input excites the gate if the side input is non-controlling, or the side input
arrives late and the path input is controlling.
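The complete rule is a simple boolean predicate. A sketch in Python, where the three flags are hypothetical names for the gate's situation:

```python
def path_input_excites(side_ctrl, path_ctrl, side_late):
    """Complete excitation rule: the path input excites the gate when
    the side input is non-controlling, or when the side input arrives
    late and the path input is controlling."""
    return (not side_ctrl) or (side_late and path_ctrl)

# Enumerate all eight situations; five of them excite the gate,
# matching the tables above.
exciting = [(s, p, l)
            for s in (False, True)
            for p in (False, True)
            for l in (False, True)
            if path_input_excites(s, p, l)]
print(len(exciting))   # 5
```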

Section 5.3.5.2 discusses monotone speedup in more detail, then section 5.3.5.3 demonstrates that
a late-arriving side input that causes a glitch cannot result in a true path. After these two tangents,
we finally present the correct, complete, and complex algorithm for critical path analysis.


5.3.5.2 Monotone Speedup

When we have a late side input with a non-controlling value, the path input does not excite the
gate, yet the rules state that we should consider this to be a true path. The reason that we report
this as a true path, even though the path input does not excite the gate, is the idea of monotone
speedup.


   Definition monotonic: A function f is monotonic if increasing its input causes the
     output to increase or remain the same. Mathematically: x < y =⇒ f(x) ≤ f(y).


   Definition monotonous: A lecture is monotonous if increasing the length of the
     lecture increases the number of people who are asleep.


   Definition monotone speedup: The maximum clockspeed of a circuit should be
     monotonic with respect to the speed of any gate or sub-circuit. That is, if we increase
     the speed of part of the circuit, we should either increase the clockspeed of the
     circuit, or leave it unchanged.


   Definition monotonous speedup: A lecture has monotonous speedup if increasing the
     pace of the lecture increases the number of people who are awake.

In the monotone speedup situations, if we were to report the candidate path as false and the side
input arrives sooner than expected, the path might generate an edge. Thus, a path that we initially
thought was a false path becomes a real path. Speeding up a part of the circuit turned a false path
into a real path, and thereby actually reduced the maximum clock speed of the circuit.
Monotone speedup is desirable, because if we claim that a circuit has a certain minimum delay
and then speed up some of the gates in the circuit (because of resizing gates, process variations,
temperature or voltage fluctuations), we would be quite distraught to discover that we have in fact
increased the minimum delay.
We can see the rationale behind the monotone speedup rules by observing that if we have a late
side input that transitions to a non-controlling value, and the circuitry that drives the late side
input speeds up, the late side input might become an early side input. For each of the two
monotone speedup situations, the corresponding early side input situation has a true path.


5.3.5.3 Analysis of Side-Input-Causes-Glitch Situation

In the following paragraphs we analyze the rule for a late side input where the side input is
controlling and the path input is non-controlling. The excitation rules say that in this situation the
path input cannot excite the gate. We might be tempted to think that we could construct a circuit
where the first edge of the glitch (which is caused by the path input) propagates and the second
edge (which is caused by the late side input) does not propagate. Here we demonstrate why we
cannot create such a circuit. Readers who are willing to accept that the Earth is round without
personally circumnavigating the globe may wish to skip to section 5.3.5.4.
In the picture below, c is the gate that produces a glitching output because of a late-arriving side
input. We know that a, c is part of a false path and will demonstrate that, in the current situation,
b, c must also be part of a false path.

                          (Figure: gate c with inputs a and b.)

For a, c to be a part of a false path, there must be a gate that appears later in the circuit that
prevents the second edge of the glitch from propagating. In the figure below, this later gate is f,
with e being the path input (from c) and d being the side input.

                    (Figure: waveforms at gate f with side input d and path input e, in six
                    cases: very early side ctrl, middling early side ctrl, late side ctrl,
                    and the corresponding three non-ctrl cases.)


For the first edge on e to propagate, the side input (d) must have a non-controlling value at the
time of the first edge. To prevent the second edge of the glitch from propagating from e to f, d
must be a controlling value. That is, d must transition from a non-controlling value to a
controlling value in the middle of the glitch on e. This corresponds to the “middling early side
ctrl” situation in the figure. From the perspective of the first edge of the glitch, this is identical to
the situation with the first gate (c), in that a late-arriving side input transitions to a controlling
value.

In this case of “middling early side ctrl”, the edge on d arrives later than the first edge on e,
which means that d, f is a slower path than b, c, ..., e, f , which means that d, f is part of a
false path. Thus, there is a gate later in the circuit that prevents the second edge of the glitch on f
from propagating. We wrap up the argument that the situation illustrated with a, b, c cannot lead
to a critical path through b, c in two ways: intuitively and mathematically.

Intuitively, for b, c to be part of a critical path, c must be followed by f, which itself must be
followed by another gate with a middling-early side input. All of the other cases that prevent the
second edge of the glitch from propagating will prevent both edges of the glitch from
propagating. This other gate with the middling-early side input produces a glitch and so must
itself be followed by yet another gate with a middling-early side input. This process continues ad
infinitum — we cannot construct a finite circuit that allows the first edge of the glitch on c to
propagate and prevents a second edge of the glitch from propagating.

Mathematically, we construct a simple inductive proof based on the number of later gates in the
candidate path. In the base case, f is the last gate in the path, and so it must be the gate that
propagates the first edge of the glitch and does not generate a glitch. There is no situation in
which this happens, thus the last gate in the path cannot have a middling-early input. In the
inductive case we assume that there are n gates later in the path and none of them have
middling-early side inputs. We can then prove that the gate just prior to the nth gate cannot have a
middling-early side input, because for it to have a middling-early side input, one of the n later
gates would need to have a middling-early side input that would allow the first edge of the glitch
to propagate and prevent the second edge of the glitch from propagating. From the inductive
hypothesis, we know that none of the n gates have a middling-early input, and so we have
completed the proof by contradiction.


5.3.5.4 Complete Algorithm

The possibility of late-arriving side inputs caused us to modify our rules for when a path input
will excite a gate. The complete rule (section 5.3.5.1) is: the side-input is non-controlling or the
side-input arrives late and the path input is controlling. Because we explore candidate critical
paths beginning with the slowest and working through faster and faster paths, a late-arriving side
input must be part of a previously discovered false path.

In the previous sections, when we did not have late-arriving side inputs, we could exercise the
critical path with a change on just one input signal. With late-arriving side inputs, both the
primary input to the critical path and the late-arriving side inputs might need to change.


When using the late-arriving side input portion of our excitation rule, we must ensure that the side
input does in fact arrive later than the path input. If we do not, we fall into the situation where
both inputs are controlling and the side input arrives early; in that situation, the side input
excites the gate.

For the side input to arrive late, the late path to the side input must be viable. Stated more
precisely, the prefix of the previously discovered false path that ends at the side input must be
viable. The entire previously discovered false path is clearly not viable; it is only the prefix up to
the side input that must be viable. The viability condition for the prefix uses the same rule as we
use for normal path analysis: for every gate along the prefix the side-input is non-controlling or
the prefix’s side input arrives late and the prefix’s path input is controlling.

The complete, correct, and complex algorithm is:
• If we find a contradiction on the path, check for side inputs that are on previously discovered
  false paths.
• If a gate and its side input are on a previously discovered false path, then the side input defines
  a prefix of a false path that ends in a late-arriving side input.
• For each late-arriving prefix, compute its viability (the conditions under which an edge will
  propagate along the prefix to the late side input).
• To the row of the late-arriving side input in the constraint table, add as a disjunction the
  constraint that the path input has a controlling value and at least one of the prefixes is viable.
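The final step, adding the late-side condition as a disjunction, can be sketched as follows. The DNF encoding of a constraint-table row is hypothetical, and the pre-existing literal on the row is taken to be ¬a as an illustrative assumption; in Complete Example 1 below, the prefix a, b, d, e is viable unconditionally and the path input c is controlling when a = 1, so the disjunct a is added to the f[e] row and the row becomes trivially satisfiable:

```python
def add_late_side_disjunct(row, ctrl_literals, prefix_viability):
    """Append to a constraint-table row, as a new disjunct, the condition
    that the path input is controlling AND the late-arriving prefix is
    viable.  Rows are kept in a hypothetical DNF encoding: a list of
    conjunctions, each a list of literal strings."""
    row.append(ctrl_literals + prefix_viability)
    return row

# Row for f[e] starts as the single conjunction [¬a] (an assumption for
# illustration); the prefix is viable unconditionally (empty conjunction)
# and c is controlling when a = 1.
row = add_late_side_disjunct([["¬a"]], ["a"], [])
print(row)   # [['¬a'], ['a']]: (¬a) OR (a), i.e. always true
```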


5.3.5.5 Complete Examples

Complete Example 1           ................................................................. .


   Question:     Find the critical path in the circuit below.


                       (Figure: circuit with input a, gates b, c, d, e, f, and output g.)


   Answer:


                       (Figure: the circuit annotated with potential delays: 14 at a,
                       decreasing along b, d, e, f to 0 at the output g; 8 on c.)


                              potential         unused
                              delay             fanout         path
                              14                 g, b, c       a
                              false                            a,b,d,e,f,g

                     side input non-controlling value                   constraint
                        f[c]             1                                 ¬a
                        g[a]             1                                  a
      First false path, pursue next candidate.
                              potential         unused
                              delay             fanout         path
                              false                            a,b,d,e,f,g
                              10                  g, c         a
                              10                               a,c,f,g

                     side input non-controlling value                   constraint
                        f[e]             1                                 ¬a
                        g[a]             1                                  a
      At first, this path appears to be false, but the side input f[e] is on the prefix
      of the false path a,b,d,e,f,g. Thus, f[e] is a late arriving side input.
      The candidate path will be a true path if the side input arrives late and the
      path input is a controlling value. The viability condition for the path a,b,d,e is
      true. The constraint for the path input (c) to have a controlling value for f is a.
      Together, the viability constraint of true and the controlling value constraint of
      a give us a late-side constraint of a.
      Updating the constraint table with the late arriving side input constraint gives
      us:
                     side input non-controlling value                   constraint
                        f[e]             1                             ¬a + a = true
                        g[a]             1                                  a
      The constraint reduces to a. A rising edge will exercise the path.
                                   Critical path       a, c, f, g
                                   Delay               10
                                   Input vector        a=rising edge
      Illustration of rising edge exercising the critical path:
                      (Figure: the rising edge on a propagating along a, c, f, g, with
                      annotated edge times.)


Complete Example 2           ................................................................. .


  Question:    Find the critical path in the circuit below.

                      (Figure: circuit with inputs a, b, c; gates d through h; and outputs
                      i and j.)


  Answer:


     Find longest path:
                     (Figure: the circuit annotated with potential delays: 18 at b, 12 at c,
                     8 at a, decreasing to 0 at the outputs i and j.)

     Explore longest path:

                                     potential         unused
                                     delay             fanout         path
                                     8                     f          a
                                     12                    h          c
                                     18                  f, g         b,d,e
                                     18                  h, i         b,d,e,g
                                     false                            b,d,e,g,h,i,j
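The potential-delay annotations used in tables like this one come from a topological longest-path computation over the circuit DAG. A minimal sketch, using a small hypothetical circuit with hypothetical gate delays (not the exact circuit of this example):

```python
def potential_delays(fanin, delay, inputs):
    """Worst-case arrival time of every signal, assuming all primary
    inputs change at time 0 (topological longest path over the DAG)."""
    at = {s: 0 for s in inputs}

    def arrival(s):
        if s not in at:
            at[s] = delay[s] + max(arrival(f) for f in fanin[s])
        return at[s]

    for g in fanin:
        arrival(g)
    return at

# Hypothetical circuit: b -> d -> e -> g (c joins at g), a -> f, output j
fanin = {'d': ['b'], 'e': ['d'], 'g': ['e', 'c'], 'f': ['a'], 'j': ['g', 'f']}
delay = {'d': 2, 'e': 2, 'g': 2, 'f': 4, 'j': 2}
at = potential_delays(fanin, delay, inputs=['a', 'b', 'c'])
# at['j'] is the potential delay of the whole circuit
```

The search for the critical path then explores candidate paths in decreasing order of these potential delays, as the tables in this example do.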

                      side input non-controlling value                          constraint
                         h[c]             0                                         c
                         i[g]             0                                         b
                          j[f]            0                                        ab

     Contradiction.
        [Figure: the circuit with the contradictory value assignments along b, d, e, g, h, i, j; no edge propagates to j.]

     First false path, find next candidate.


      Changes in potential delays:

                                  Signal / path               old new
                                  g on b, d, e, g             12    8
                                   b, d, e, g                 18   14
                                  g[e] on b, d, e             14   10
                                  e on b, d, e                14   10
                                   b, d, e                    18   14

                               potential       unused
                               delay           fanout          path
                               false                           b,d,e,g,h,i,j
                               8                  f            a
                               12                 h            c
                               14               f, g           b,d,e
                               14                              b,d,e,g,i,j
        [Figure: the circuit re-annotated with the reduced potential delays (b = 14 along d, e) after discarding the false path.]

                     side input non-controlling value                     constraint
                        h[c]             0                                    c
                        i[h]             0                                   cb
                         j[f]            0                                   ab

      Initially, we found a contradiction, but b, d, e, g, h is a prefix of a false path, and
      i[h] is a side input to the candidate path. We have a late side input.

      Note that at the time that we passed through i, we could not yet determine
      that we would need to use i[h] as a late side input. The lesson is that when
      a contradiction is discovered, we must look back along the entire candidate
      path covered so far to see if we have any late side inputs.

      Our late-arriving constraint for i[h] is:
      • late side path ( b, d, e, g, h ) is viable: c.
      • path input (i[g]) has a controlling value of ’1’: b.
      Combining these constraints together gives us bc.

      Adding the constraint of the late side input to the condition table gives us:

                     side input non-controlling value                     constraint
                        h[c]             0                                     c
                        i[h]             0                                b̄c + bc = c
                         j[f]            0                                    ab
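The reduction in the i[h] row is the Boolean identity b̄c + bc = c, which a two-variable truth-table check confirms; a quick sketch:

```python
from itertools import product

# exhaustively verify (not b and c) or (b and c) == c
for b, c in product([False, True], repeat=2):
    assert ((not b and c) or (b and c)) == c
```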


         The constraints reduce to abc.
                                          Critical path              b, d, e, g, i, j
                                          Delay                     14
                                          Input vector              a=0, b=falling edge, c=0

         Illustration of falling edge exercising the critical path:
        [Figure: timing diagram of the falling edge on b propagating along b, d, e, g, i, j, producing the output edge at time 14.]



Complete Example 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
This example illustrates the principle of monotone speedup in the analysis of critical
paths.

        [Figure: circuit with input a fanning out to gates b, c, and f; b → d; gate e combines d and c; the output gate f combines a and e.]


Critical-path analysis says that the critical path is a, c, e, f , with a late side input of e[d] and a
total delay of 10. The required excitation is a rising edge on a. However, with the given delays,
this excitation does not produce an edge on the output.
        [Figure: simulation of a rising edge on a with the given delays; no edge is produced at the output f.]


For a more complete analysis of the behaviour, we also try a falling edge. The falling edge
exercises the path a, f with a delay of 4.
        [Figure: simulation of a falling edge on a; the path a, f produces an output edge at time 4.]


Monotone speedup says that if we reduce the delay of any gate, we must not increase the delay of
the overall circuit. We reduce the delays of b and d from 2 to 0.5 and produce an edge at time 10
via the path a, c, e, f .


        [Figure: with the delays of b and d reduced to 0.5, the rising edge on a produces an output edge at time 10 via a, c, e, f.]


The critical path analysis said that the critical path was a, c, e, f with a delay of 10. With the
original circuit, the slowest path appeared to have a delay of 4. But, by reducing the delays of two
gates, we were able to produce an edge with a delay of 10. Thus, the critical path algorithm did
indeed satisfy the principle of monotone speedup.
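Under the fixed-delay model, the static longest-path delay itself is trivially monotone: reducing any gate delay can only reduce or preserve arrival times. The subtlety in this example is dynamic sensitization (which edges actually propagate), but the static property can be checked directly. A sketch with a hypothetical circuit shaped like the one above:

```python
def worst_arrival(fanin, delay, out):
    # static longest-path arrival time at signal `out`
    at = {}

    def t(s):
        if s not in at:
            at[s] = delay.get(s, 0) + max((t(x) for x in fanin.get(s, [])),
                                          default=0.0)
        return at[s]

    return t(out)

# a fans out to b, c, and f; e combines c and d; f combines a and e
fanin = {'b': ['a'], 'c': ['a'], 'd': ['b'], 'e': ['c', 'd'], 'f': ['a', 'e']}
slow = {'b': 2, 'c': 2, 'd': 2, 'e': 4, 'f': 4}
fast = dict(slow, b=0.5, d=0.5)   # speed up gates b and d

# monotone speedup: the sped-up circuit is never statically slower
assert worst_arrival(fanin, fast, 'f') <= worst_arrival(fanin, slow, 'f')
```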


Complete Example 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
This example illustrates that we sometimes need to allow edges on the inputs to late side paths.


    Question:            Find the critical path in the circuit below.

        [Figure: circuit with primary inputs a and b; the chain a → c → d feeds gate e (side input b); the chain b → f → g → h feeds gates i and j; the output gate k combines j and e.]



    Answer:


         The purpose of this example is to illustrate a situation where we need the
         primary input of a late-side path to toggle. To focus on the behaviour of the
         circuit, we show pictures of different situations and do not include the path
         and constraint tables.

         Longest path in the circuit, showing a contradiction between e[b] and j[h].
        [Figure: the longest path with the contradictory values at e[b] and j[h] marked.]


         Second longest path b, f, g, h, i, j, k , using only early side inputs, showing a
         contradiction between k[e] and i[e].
        [Figure: the path b, f, g, h, i, j, k with the contradictory values at k[e] and i[e] marked.]


         Second longest path using late side input i[e], which has a controlling value
         of 1 (rising edge) on i[h]. However, we neglect to put a rising edge on a.
         The late-side path is not exercised and our candidate path is also not
         exercised.
        [Figure: simulation with a held constant at 1; neither the late-side path nor the candidate path is exercised.]


         We now put a rising edge on a, which causes our late side input (i[e]) to be
         a non-controlling value when our path input (i[h]) arrives.
        [Figure: simulation with a rising edge on a; the late side input i[e] is non-controlling when the path input i[h] arrives, and the output k toggles at time 16.]


        In looking at the behaviour of i, we might be concerned about the precise
        timing of the glitch on e relative to the rising edge on h. The figure below
        shows normal, slow, and fast timing of e. With slow timing, the first edge of
        the glitch on e arrives after the rising edge on h; the timing of the second
        edge of the glitch remains unchanged. The value of i remains constant, which
        could lead us to believe (incorrectly!) that our critical path analysis needs
        to take into account the first edge of the glitch. However, this is in fact an
        illustration of monotone speedup. The fast timing scenario moves the glitch
        earlier, such that the edge on h does in fact determine the timing of the
        circuit, in that h produces the last edge on i. In summary, with the glitch
        on e and the rising edge on h, either h causes the last edge on i or there is
        no edge on i.
        [Figure: waveforms of e (glitch from time 4 to 8) and h (edge at time 6) under normal, slow, and fast timing of e; in each case either h causes the last edge on i or there is no edge on i.]
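The "either h causes the last edge on i or there is no edge on i" behaviour is what an inertial-delay gate model produces: pulses narrower than the gate delay are filtered out. A minimal sketch (hypothetical waveform and delays; a real simulator schedules and cancels events rather than post-filtering edge lists):

```python
def inertial_delay(edges, d):
    """Propagate a waveform through a buffer with inertial delay d.
    edges: time-sorted list of (time, value) transitions.
    Pulses narrower than d are swallowed; surviving edges shift by d."""
    kept = []
    for t, v in edges:
        if kept and t - kept[-1][0] < d:
            kept.pop()      # the previous edge and this one form a
            continue        # too-narrow pulse: cancel both
        kept.append((t, v))
    return [(t + d, v) for t, v in kept]

# glitch from t=4 to t=8 (width 4), as on signal e above
glitch = [(4, 1), (8, 0)]
inertial_delay(glitch, 2)   # fast gate: the glitch propagates
inertial_delay(glitch, 6)   # slow gate: the glitch is filtered out
```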


Complete Example 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
This example demonstrates that a late side path must be viable to be helpful in making a true path.


    Question:            Find the critical path in the circuit below.


        [Figure: circuit with primary inputs a, b, c, d; gate e combines a and b; g combines e and d; i combines g and c; the chain d → f → h feeds gate j (side input c); the output gate k combines i and j.]

  Answer:


      Find that the two longest paths are false paths, because of contradiction
      between g[d] and i[c].
        [Figure: the two longest paths with the contradictory values at g[d] and i[c] marked.]

      Try third longest path d, f, h, j, k using early side inputs. Find contradiction
      between k[i] and j[c].
        [Figure: the path d, f, h, j, k with the contradictory values at k[i] and j[c] marked.]

      Try using late side paths a, e, g, i, k or b, e, g, i, k . Find that neither path is
      viable by itself, because of contradiction between g[d] and i[c]. Also,
      neither path is viable in conjunction with the candidate path, because of
      contradiction between i[c] on late side path and j[c] on candidate path.
      Either one of these contradictions by itself is sufficient to prevent the late side
      path from helping to make the candidate path a true path.
        [Figure: the candidate late side paths a, e, g, i, k and b, e, g, i, k with their contradictory value assignments marked.]



5.3.6 Further Extensions to Critical Path Analysis

McGeer and Brayton’s paper includes three extensions to the critical path algorithm presented here
that we will not cover:
• gates with more than two inputs
• finding all input values that will exercise the critical path
• multiple paths with the same delay to the same gate


5.3.7 Increasing the Accuracy of Critical Path Analysis

When doing critical path calculations, it is often useful to strike a balance between accuracy and
effort. In the examples so far, we assumed that all signals had the same wire and load delays. This
assumption simplifies calculations, but reduces accuracy. Section 5.4 discusses how the analog
world affects timing analysis.



5.4 Elmore Timing Model
There are many different models used to describe the timing of circuits. In the section on critical
paths, we used a timing model that was based on the size of the gate. The timing model ignored
interconnect delays and treated all gates as if they had the same fanout. For example, the delay
through an AND gate was 4, independent of how many gates were in its immediate fanout.
In this section and the next, we discuss two timing models. In this section, we discuss the
detailed analog timing model, which reflects quite accurately the actual voltages on different
nodes. The SPICE simulation program uses very detailed analog models of transistors (dozens of
parameters to describe a single transistor). In the next section, we describe the Elmore delay
model, which is much simpler than the analog model, at some loss of accuracy.


5.4.1 RC-Networks for Timing Analysis
[Figure: P- and N-transistors at four levels of abstraction: transistor level, mask level (poly, diffusion, contacts), cross-section of the fabricated transistor, and switch level.]


Different Levels of Abstraction for Inverter            .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . ...
[Figure: an inverter at the gate level, transistor level, and mask level (metal, poly, p-diff, n-diff, contacts, VDD and GND rails).]
From the electrical characteristics of fabricated transistors, VLSI and device engineers derive
models of how transistors behave based on mask-level descriptions. For our purposes, we will use
the very simple resistor-capacitor model shown below.
Each of the P- and N-transistor models contains a resistor (“pullup” for the P-transistor and
“pulldown” for the N-transistor) and a parasitic capacitor.
When we combine a P-transistor and an N-transistor to create an inverter, we combine the
capacitors into a single parasitic capacitor that is the sum of the two individual capacitors.
[Figure: RC-network models of the P- and N-transistors, each with a pullup or pulldown resistor (Rpu, Rpd) and a parasitic capacitor Cp, and the resulting RC network for an inverter driving a load capacitance CL.]
• Contacts (vias) have resistance (RV )
• Metal areas (wires) have resistance (RW ) and capacitance (CW ).
  – The resistance is dependent upon the geometry of the wire.
  – The capacitance is dependent upon the geometry of the wire and the other wires adjacent to
    it.
• For most circuits, the via resistance is much greater than the wire resistance (RV ≫ RW).
To reduce area, modern wires tend to have tall and narrow cross sections. When wires are packed
close together (e.g. the bits of an array or vector signal), adjacent wires act as capacitors to
each other.
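The geometry dependence can be made concrete with the usual first-order formulas: resistance as sheet resistance times the number of squares (L/W), and capacitance as a parallel-plate (area) term plus a fringe term along the wire edges. A sketch with hypothetical process constants (all names and values here are illustrative, not from these notes):

```python
def wire_resistance(r_sheet, length, width):
    # R = Rs * (L / W): sheet resistance times number of squares
    return r_sheet * length / width

def wire_capacitance(c_area, c_fringe, length, width):
    # parallel-plate (area) term plus a fringe term along both edges
    return c_area * length * width + c_fringe * 2.0 * length

# hypothetical values: 0.1 ohm/square sheet resistance,
# wire 100 units long and 1 unit wide -> 100 squares -> 10 ohms
r = wire_resistance(0.1, 100.0, 1.0)
```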


A Pair of Inverters       ................................................................... .
[Figure: a pair of inverters (a → b → c) at the gate level, transistor level, and mask level.]

A Pair of Inverters (Cont’d)        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..
[Figure: the mask-level layout of the inverter pair and its RC network for timing analysis: each inverter contributes Rpu, Rpd, and Cp; the wire contributes RW and CW; the via contributes RV; the input of the destination inverter is the load CL.]

To analyze the delay from one inverter to the next, we analyze how long it takes the capacitive
load of the second (destination) inverter to charge up from ground to VDD, or to discharge from
VDD to ground. In doing this analysis, the gate side of the driving inverter is irrelevant and can be
removed (trimmed). Similarly, the pullup resistor, pulldown resistor, and parasitic capacitance of
the destination inverter can also be removed.

RC-Network for Timing Analysis (trimmed)


[Figure: the trimmed RC network: Rpu and Rpd drive the ladder Cp, RW, CW, RV, CL.]
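As a preview of the Elmore model described in the next section, the delay of a trimmed RC ladder like this one can be estimated by summing, for each capacitor, the capacitance times the total resistance between the driver and that capacitor. A sketch with hypothetical component values:

```python
def elmore_ladder(stages):
    """Elmore delay of an RC ladder, given (R, C) pairs in order
    from the driving resistor out to the load."""
    delay, r_to_node = 0.0, 0.0
    for r, c in stages:
        r_to_node += r              # resistance from source to this node
        delay += r_to_node * c      # each cap sees all upstream resistance
    return delay

# hypothetical values for (Rpu, Cp), (RW, CW), (RV, CL)
d = elmore_ladder([(10e3, 5e-15), (500.0, 2e-15), (200.0, 10e-15)])
```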



A Circuit with Fanout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
We will look at one more example of inverters and their RC-network before beginning the timing
analysis of these networks.
[Figure: an inverter a driving two inverters with outputs c and d, shown at the gate level, transistor level, and mask level.]


RC-Network for Timing Analysis
[Figure: the full RC network for the fanout circuit: the driver's Rpu/Rpd and Cp, wire segments RW1, RW2, RW3 with capacitances CW1, CW2, CW3, vias RV, and a load CL at each destination inverter.]

RC-Network for Timing Analysis (trimmed)
[Figure: the same network with the driver's gate side and the destination inverters' resistors and parasitic capacitances removed.]

We will use this circuit as our primary example for the analog and Elmore timing models, so we
draw a simplified version of the trimmed RC-network before proceeding.

                                  RC-Network for Timing Analysis (cleaned up)
[Figure: the trimmed network redrawn: Rpu and Rpd drive Cp, then the branch RW1, RV to CW1 and its load CL, and the branch RW2, RV to CW2 and its load CL.]


5.4.2 Derivation of Analog Timing Model

The primary purpose of our timing model is to provide a mechanism to calculate the approximate
delay of a circuit; for example, to be able to say that a gate has a delay of 100 ps. The actual gate
behaviour is a complicated function of the input signal behaviour.

The waveforms below are all possible behaviours of the same circuit. From these various
waveforms, it would be very difficult to claim that the circuit has a specific delay value.

[Figure: input and output voltage waveforms over time for a slow input (left) and a fast input (right)]


Steps Toward Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
We begin with two simplifications as steps toward calculating a single delay value for a circuit.

    1. Look at the circuit’s response to a step-function input.
    2. Measure the delay to go from GND to 65% of VDD and from VDD to 35% of VDD.

These values of 65% VDD and 35% VDD are called “trip points”.


    Definition Trip Points: A high or ’1’ trip point is the voltage level where an upwards
      transition means the signal represents a ’1’.

        A low or ’0’ trip point is the voltage level where a downwards transition means the
        signal represents a ’0’.


In the figure below, the gray line represents the actual voltage on a signal. The black line is the
digital discretization of the analog signal.

[Figure: analog voltage (gray) and its digital discretization (black) for signals a and b]

Node Numbering, Initial Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
To motivate our derivation of the analog timing model, we will use the inverter that fans out to
two other inverters as our example circuit.
• The source (VDD in our case) and each capacitor is a node. We number the nodes, capacitors,
  and resistors. Resistors are numbered according to the capacitor to their right. Multiple
  resistors in series without an intervening capacitor are lumped into a single resistor.
• All nodes except the source start at GND.
• We calculate the voltage at a node when we turn on the P-transistor (connect to VDD).
The process for analyzing a transition from VDD to GND on a node is the dual of the process just
described. The source node is GND, all other nodes start at VDD, we calculate the voltage when
we turn on the N-transistor (connect it to GND).

[Figure: RC-network with numbered nodes and components: VDD is node 0; R1 = Rpu charges node 1 (Cp); R2 = RW1 charges node 2 (CW1); from node 2, R3 = RW2 charges node 3 (CW2) and R4 = RV charges node 4 (CL); also from node 2, R5 = RV charges node 5 (CL); GND at the bottom of the driver]


Define: Path and Downstream . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
We still have a few more preliminaries to get through. To discuss the structure of a network, we
introduce two terms: path and downstream.


  Definition path: The path from the source node to a node i is the set of all resistors
    between the source and i. Example: path(3) = {R1 , R2 , R3 }


  Definition down: The set of capacitors downstream from a node is the set of all
    capacitors where current would flow through the node to charge the capacitor. You
    can think of this as the set of capacitors that are between the node and ground.
    Example: down(2) = {C2 ,C3 ,C4 ,C5 }. Example: down(3) = {C3 ,C4 }
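These two set-valued definitions are easy to check mechanically. Below is a minimal Python sketch, assuming the example network is stored as a parent-pointer map (the node numbering matches the figure above; the dictionary encoding itself is our own illustration, not from the notes):

```python
# Example RC tree: node 0 is the source (VDD); resistor R_n and
# capacitor C_n are both identified by node number n, and parent[n]
# is the node on the source side of resistor R_n.
parent = {1: 0, 2: 1, 3: 2, 4: 3, 5: 2}

def path(i):
    """Resistors between the source and node i, e.g. path(3) = {R1, R2, R3}."""
    p = set()
    while i != 0:
        p.add(i)        # resistor R_i lies on the path to node i
        i = parent[i]
    return p

def down(i):
    """Capacitors downstream from node i: capacitors whose charging
    current flows through node i (node i's own capacitor included)."""
    return {k for k in parent if i in path(k)}

print(sorted(path(3)), sorted(down(2)), sorted(down(3)))
# -> [1, 2, 3] [2, 3, 4, 5] [3, 4], matching the examples above
```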


5.4.2.1 Example Derivation: Equation for Voltage at Node 3

As a concrete example of deriving the analog timing model, we derive the equation for the voltage
at Node 3 in our example circuit. After this concrete example, we do the general derivation.


              V3 (t) = V0 (t) − voltage drop from Node0 to Node3

                          The voltage drop is the sum of the voltage drops across the
                          resistors on the path from Node0 to Node3

                     = V0 (t) −      ∑       Rr × Ir (t)
                                 r∈path(3)

                     = V0 (t) − (R1 I1 (t) + R2 I2 (t) + R3 I3 (t))

                          The current through a resistor is the sum of the currents
                          through all of the downstream capacitors

               Ir (t) =      ∑       Ic
                         c∈down(r)
               I1 (t) = Ic1 + Ic2 + Ic3 + Ic4 + Ic5
               I2 (t) = Ic2 + Ic3 + Ic4 + Ic5
               I3 (t) = Ic3 + Ic4

                          Substitute Ir into the equation for V3

              V3 (t) = V0 (t) − (   R1 (Ic1 + Ic2 + Ic3 + Ic4 + Ic5 )
                                  + R2 (Ic2 + Ic3 + Ic4 + Ic5 )
                                  + R3 (Ic3 + Ic4 ) )

                          Use associativity to group terms by currents

              V3 (t) = V0 (t) − (   Ic1 (R1 )
                                  + Ic2 (R1 + R2 )
                                  + Ic3 (R1 + R2 + R3 )
                                  + Ic4 (R1 + R2 + R3 )
                                  + Ic5 (R1 + R2 ) )



                       Current through a capacitor

               Ic (t) = Cc ∂Vc (t)/∂t

                       Substitute Ic into the equation for V3

              V3 (t) = V0 (t) − (   (R1 ) Cc1 ∂Vc1 (t)/∂t
                                  + (R1 + R2 ) Cc2 ∂Vc2 (t)/∂t
                                  + (R1 + R2 + R3 ) Cc3 ∂Vc3 (t)/∂t
                                  + (R1 + R2 + R3 ) Cc4 ∂Vc4 (t)/∂t
                                  + (R1 + R2 ) Cc5 ∂Vc5 (t)/∂t )

                 In each of the resistance-capacitance terms (e.g., (R1 +
                 R2 )Cc2 ), the resistors are the set of resistors on the path
                 to the capacitor that are also on the path to Node3 .
                 We capture this observation by defining the Elmore resis-
                 tance Ri,k for a pair of nodes i and k to be the sum of the
                 resistors on the path to Nodei that are also on the path to
                 Nodek .

        Ri,k =           ∑            Rr
                 r∈(path(i)∩path(k))

       R3,1   =   R1
       R3,2   =   R1 + R2
       R3,3   =   R1 + R2 + R3
       R3,4   =   R1 + R2 + R3
       R3,5   =   R1 + R2

                 Substitute Ri,k into V3

      V3 (t) = V0 (t) − (   R3,1 Cc1 ∂Vc1 (t)/∂t + R3,2 Cc2 ∂Vc2 (t)/∂t
                          + R3,3 Cc3 ∂Vc3 (t)/∂t + R3,4 Cc4 ∂Vc4 (t)/∂t
                          + R3,5 Cc5 ∂Vc5 (t)/∂t )
We are left with a system of dependent equations, in that V3 is dependent upon all of the voltages
in the circuit. In the general derivation that follows next, we repeat the steps we just did, and then
show how the Elmore delay is an approximation of this system of dependent differential
equations.
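The Elmore resistances R3,k just derived can be reproduced from the path sets alone. A short Python sketch, assuming the same parent-pointer encoding of the example tree (our own representation, not from the notes) and unit resistor values:

```python
# Node n's resistor R_n connects n to parent[n]; node 0 is the source (VDD).
parent = {1: 0, 2: 1, 3: 2, 4: 3, 5: 2}

def path(i):
    """Resistors on the path from the source to node i."""
    p = set()
    while i != 0:
        p.add(i)
        i = parent[i]
    return p

def elmore_resistance(i, k):
    """R_{i,k}: total resistance shared by the paths to node i and node k
    (with every resistor equal to one unit, this is just a set size)."""
    return len(path(i) & path(k))

# R_{3,k} for k = 1..5 in units of R -- matches the list above:
# R1, R1+R2, R1+R2+R3, R1+R2+R3, R1+R2
print([elmore_resistance(3, k) for k in range(1, 6)])   # -> [1, 2, 3, 3, 2]
```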


5.4.2.2 General Derivation

We derive the equation for the voltage at Nodei as a function of the voltage at Node0 .



       Vi (t) = V0 (t) − voltage drop from Node0 to Nodei

                  The voltage drop is the sum of the voltage drops across the
                  resistors on the path from Node0 to Nodei

             = V0 (t) −      ∑       Rr × Ir (t)
                         r∈path(i)

                  The current through a resistor is the sum of the currents
                  through all of the downstream capacitors

       Ir (t) =      ∑       Ic
                 c∈down(r)

                  Substitute Ir into the equation for Vi

       Vi (t) = V0 (t) −      ∑       ( Rr ×      ∑       Ic )
                          r∈path(i)           c∈down(r)

                  Use associativity to push Rr into the summation over c

       Vi (t) = V0 (t) −      ∑            ∑       Rr × Ic
                          r∈path(i)    c∈down(r)

                  Current through a capacitor

       Ic (t) = Cc ∂Vc (t)/∂t

                  Substitute Ic into the equation for Vi

       Vi (t) = V0 (t) −      ∑            ∑       Rr × Cc ∂Vc (t)/∂t
                          r∈path(i)    c∈down(r)

                  A little bit of handwaving to prepare for Elmore resistance

       Vi (t) = V0 (t) −      ∑       (         ∑           Rr ) × Ck ∂Vk (t)/∂t
                          k∈Nodes       r∈path(i)∩path(k)

                  Define Elmore resistance Ri,k

        Ri,k =           ∑            Rr
                 r∈(path(i)∩path(k))

                  Substitute Ri,k into Vi

       Vi (t) = V0 (t) −      ∑       Ri,k × Ck ∂Vk (t)/∂t
                          k∈Nodes




The final equation above is an exact description of the behaviour of the RC-network model of a
circuit. More accurate models would result in more complicated equations, but even this equation
is more complicated than we want for calculating a simple number for the delay through a circuit.
The equation is actually a system of dependent equations, in that each voltage Vi depends
upon all of the node voltages in the circuit. SPICE and other analog simulators use numerical
methods to calculate the behaviour of these systems. Elmore’s contribution was to find a simple
approximation of the behaviour of such systems.


5.4.3 Elmore Timing Model
•    Assume that V0 (t) is a step function from 0 to 1 at time 0.
•    Derive upper and lower bounds for Vi (t).
•    Find RC time constants for upper and lower bounds.
•    Elmore delay is guaranteed to be between upper and lower bounds.



[Figure: step response of the RC-network model, the Elmore model, and the upper and lower bounds, plotted against time with markers at TD − TRi, TP − TRi, TD, and TP]

Equations for Curves               ..................................................................

       Upper bound:    1 + (t − TDi )/TP                           for 0 ≤ t ≤ TDi − TRi
                       1 − (TRi /TP ) e^((TDi − TRi − t)/TRi )     for t ≥ TDi − TRi

       Elmore model:   1 − e^(−t/TDi )                             for t ≥ 0

       Lower bound:    0                                           for 0 ≤ t ≤ TDi − TRi
                       1 − TDi /(t + TRi )                         for TDi − TRi ≤ t ≤ TP − TRi
                       1 − (TDi /TP ) e^((TP − TRi − t)/TP )       for t ≥ TP − TRi

Fact: 0 ≤ TRi ≤ TDi ≤ TP


Definitions of Time Constants       ..........................................................

               TRi = (     ∑      (Rk,i )² Ck ) / Ri,i    Mathematical artifact, no intuitive meaning
                        k∈Nodes

               TDi =       ∑      Rk,i Ck                 Elmore delay
                        k∈Nodes

                TP =       ∑      Rk,k Ck                 RC-time constant for lumped network
                        k∈Nodes
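With unit resistances and capacitances on the running example network, all three time constants can be computed directly from these definitions, and the fact 0 ≤ TRi ≤ TDi ≤ TP checked numerically. A Python sketch under our own parent-pointer encoding of the tree (the unit values are illustrative assumptions, not values from the notes):

```python
parent = {1: 0, 2: 1, 3: 2, 4: 3, 5: 2}   # node 0 is the source (VDD)
R = {n: 1.0 for n in parent}              # unit resistances (assumption)
C = {n: 1.0 for n in parent}              # unit capacitances (assumption)

def path(i):
    p = set()
    while i != 0:
        p.add(i)
        i = parent[i]
    return p

def r(i, k):
    """Elmore resistance R_{i,k}: resistance shared by both paths."""
    return sum(R[x] for x in path(i) & path(k))

i = 3   # compute the time constants for node 3
TR = sum(r(k, i) ** 2 * C[k] for k in parent) / r(i, i)
TD = sum(r(k, i) * C[k] for k in parent)        # Elmore delay
TP = sum(r(k, k) * C[k] for k in parent)        # lumped RC time constant
print(TR, TD, TP)   # -> 9.0 11.0 13.0, so 0 <= TRi <= TDi <= TP holds
```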



Picking the Trip Point     ................................................................ .

                    Vi (t) = VDD (1 − e^(−t/TDi ))
                             Pick trip point of Vi (t) = 0.65 VDD, then solve for t
                0.65 VDD = VDD (1 − e^(−t/TDi ))
                      0.35 = e^(−t/TDi )
                             Take ln of both sides
                   ln 0.35 = −t/TDi
                             ln 0.35 = −1.05 ≈ −1.0
                      −1.0 = −t/TDi
                          t = TDi

By picking a trip point of 0.65VDD, the time for Vi to reach the trip point is the Elmore delay.


5.4.4 Examples of Using Elmore Delay

5.4.4.1 Interconnect with Single Fanout



[Figure: inverter G1 driving inverter G2 through a chain of antifuses and wire segments (layout view), and the corresponding RC tree: driver Rpu/Rpd with Cp, then Ra1, Rw1, Ra2, Rw2, Ra3, Rw3, Ra4, with capacitors C1, C2, C3 along the wire and CG2 at the input of G2]
 G*      gate
 C*      capacitance on wire
 Ra*     resistance through antifuse
 Rw*     resistance through wire




  Question:      Calculate delay from gate 1 to gate 2


  Answer:


       Gate 2 represents node 4 on the RC tree.



                             4
                   τD4 =    ∑   ERk,4 Ck
                           k=1
                        = ER1,4 C1 + ER2,4 C2 + ER3,4 C3 + ER4,4 C4
                        = (Ra1 + Rw1 + Ra2 + Rw2 + Ra3 + Rw3 + Ra4 )CG2
                          + (Ra1 + Rw1 + Ra2 + Rw2 + Ra3 + Rw3 )C3
                          + (Ra1 + Rw1 + Ra2 + Rw2 )C2
                          + (Ra1 + Rw1 )C1

                   approximate Ra ≫ Rw

                        = (Ra1 )C1 + (Ra1 + Ra2 )C2 + (Ra1 + Ra2 + Ra3 )C3
                          + (Ra1 + Ra2 + Ra3 + Ra4 )CG2

                   approximate Rai = Raj (all antifuse resistances equal Ra)

                        = 4(Ra)CG2 + 3(Ra)C3 + 2(Ra)C2 + (Ra)C1
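As a sanity check on the simplified result, the last line can be evaluated numerically. A Python sketch with illustrative assumed values (Ra = 1, every capacitor = 1, Rw = 0, i.e. the same Ra ≫ Rw and equal-Ra approximations as above; these numbers are not from the notes):

```python
Ra = 1.0                          # common antifuse resistance (assumed)
C1 = C2 = C3 = CG2 = 1.0          # capacitances (assumed)

# tau_D4 = 1*Ra*C1 + 2*Ra*C2 + 3*Ra*C3 + 4*Ra*CG2
caps = [C1, C2, C3, CG2]
tau_D4 = sum((k + 1) * Ra * c for k, c in enumerate(caps))
print(tau_D4)    # -> 10.0
```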



  Question: If you double the number of antifuses and wires needed to connect two
    gates, what will be the approximate effect on the wire delay between the gates?


  Answer:


                                  n
                     τDi =       ∑   ERk,i Ck
                                k=1

                  Assume all resistances and capacitances are the same
                  values (R and C), and assume that all intermediate
                  nodes are along the path between the two gates of
                  interest.

                  ERk,i = k × R

                                  n
                     τDi = (     ∑   k ) R C
                                k=1


      Using the mathematical theorem:



                                        n
                                       ∑  i   =  (n + 1)n / 2
                                       i=1
                                              ≈  n²



       We simplify the delay equation:

                                                  n
                                       τDi = (   ∑  k ) R C
                                                k=1
                                            ≈  n² R C

       We see that the delay is proportional to the square of the number of antifuses
       along the path.
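The quadratic growth is easy to see numerically. A small Python sketch with assumed unit R and C (illustrative values only):

```python
R, C = 1.0, 1.0    # unit values (assumed for illustration)

def tau(n):
    """Elmore delay through a chain of n equal antifuse segments:
    (1 + 2 + ... + n) * R * C = n(n+1)/2 * RC."""
    return sum(range(1, n + 1)) * R * C

# Doubling the number of antifuses roughly quadruples the delay:
print(tau(10), tau(20))    # -> 55.0 210.0 (ratio ~3.8, approaching 4)
```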


5.4.4.2 Interconnect with Multiple Gates in Fanout


[Figure: inverter G1 fanning out to inverters G2 and G3, shown as a schematic and as an FPGA layout]

  Question: Assuming that wire resistance is much less than antifuse resistance and
    that all antifuses have equal resistance, calculate the delay from the source inverter
    (G1) to G2


  Answer:


       1. There are a total of 7 nodes in the circuit (n = 7).
       2. Label interconnect with resistance and capacitance identifiers.
   [Figure: layout labeled with resistances R1–R6 and capacitances C1–C7]


           3. Draw RC tree

   [Figure: RC tree for the fanout circuit: driver (Rpu, Rpd, Cp), then R1 to n1, R2 to n2 and n3, R3 to n4, R4 to n5 (toward G2); branch R5 to n6, R6 to n7 (toward G3); capacitors C1–C7 at nodes n1–n7]


           4. G2 is node 5 in the circuit (i = 5).
           5. Elmore delay equations
                                        7
                              τD5 =    ∑   ERk,5 Ck
                                      k=1
                                   = ER1,5 C1 + ER2,5 C2 + ER3,5 C3 + ER4,5 C4
                                     + ER5,5 C5 + ER6,5 C6 + ER7,5 C7

           6. Elmore resistances
                ER1,5 = R1                  = R
                ER2,5 = R1 + R2             = 2R
                ER3,5 = R1 + R2             = 2R
                ER4,5 = R1 + R2 + R3        = 3R
                ER5,5 = R1 + R2 + R3 + R4   = 4R
                ER6,5 = R1 + R2             = 2R
                ER7,5 = R1 + R2             = 2R
           7. Plug resistances into delay equations
                              τD5 = (R)C1 + (2R)C2 + (2R)C3 + (3R)C4 + (4R)C5
                                    + (2R)C6 + (2R)C7


Delay from G1 to G3         ................................................................. .


  Question: Assuming that wire resistance is much less than antifuse resistance and
    that all antifuses have equal resistance, calculate the delay from the source inverter
    (G1) to G3


  Answer:


       1. G3 is node 7 in the circuit (i = 7).
       2. Elmore delay equations
                                     n
                            τDi =   ∑   ERk,i Ck
                                   k=1
                                     7
                            τD7 =   ∑   ERk,7 Ck
                                   k=1
                                 = ER1,7 C1 + ER2,7 C2 + ER3,7 C3 + ER4,7 C4
                                   + ER5,7 C5 + ER6,7 C6 + ER7,7 C7

       3. Elmore resistances
             ER1,7 = R1                  = R
             ER2,7 = R1 + R2             = 2R
             ER3,7 = R1 + R2             = 2R
             ER4,7 = R1 + R2             = 2R
             ER5,7 = R1 + R2             = 2R
             ER6,7 = R1 + R2 + R5        = 3R
             ER7,7 = R1 + R2 + R5 + R6   = 4R

       4. Plug resistances into delay equations

                            τD7 = (R)C1 + (2R)C2 + (2R)C3 + (2R)C4 + (2R)C5
                                  + (3R)C6 + (4R)C7


Delay to G2 vs G3   .....................................................................


  Question: Assuming all wire segments at the same level have roughly the same
    capacitance, which is greater: the delay to G2 or the delay to G3?


  Answer:


       1. Equations for delay to G2 (τD5 ) and G3 (τD7 )

            τD5 = (R)C1 + (2R)C2 + (2R)C3 + (3R)C4 + (4R)C5 + (2R)C6 + (2R)C7

            τD7 = (R)C1 + (2R)C2 + (2R)C3 + (2R)C4 + (2R)C5 + (3R)C6 + (4R)C7


       2. Difference in delays

                            τD5 − τD7 = RC4 + 2RC5 − RC6 − 2RC7


       3. Compare capacitances

                                          C4 ≈ C6

                                          C5 ≈ C7


       4. Conclusion: delays are approximately equal.
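All three answers in this section can be reproduced with one generic Elmore routine. A Python sketch, assuming our own parent-pointer encoding of the seven-node tree (node n3 hangs off n2 through a negligible wire resistance, consistent with ER3,5 = R1 + R2 above), equal antifuse resistances, and exactly equal capacitances:

```python
# tree[n] = (parent, resistance of the resistor charging node n);
# node 0 is the source.  Resistances are in units of the antifuse
# resistance R; the short wire between n2 and n3 is approximated as 0.
tree = {1: (0, 1.0), 2: (1, 1.0), 3: (2, 0.0),
        4: (3, 1.0), 5: (4, 1.0), 6: (2, 1.0), 7: (6, 1.0)}
C = {n: 1.0 for n in tree}      # equal capacitances (assumption)

def path(i):
    p = set()
    while i != 0:
        p.add(i)
        i = tree[i][0]
    return p

def elmore(i):
    """Elmore delay to node i: sum over k of R_{k,i} * C_k."""
    return sum(sum(tree[x][1] for x in path(i) & path(k)) * C[k]
               for k in tree)

print(elmore(5), elmore(7))     # -> 16.0 16.0: the two delays match
```

With exactly equal capacitances the two delays come out identical, which is a stronger version of the "approximately equal" conclusion above.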



5.5 Practical Usage of Timing Analysis
Speed Grading
    • Fabs sort chips according to their speed (sorting is known as speed grading or speed
       binning)
    • Faster chips are more expensive
     • In FPGAs, sorting is usually based on propagation delay through an FPGA cell. As wires
        become a larger portion of delay, some analysis of wire delays is also being done.
    • Propagation delay is the average of the rising and falling propagation delays.
    • Typical speed grades for FPGAs:
       Std standard speed grade
       1 15% faster than Std
       2 25% faster than Std


         3 35% faster than Std

Worst-Case Timing
    • Maximum Delay in CMOS. When?
      – Minimum voltage
      – Maximum temperature
      – Slow-slow conditions (process variation/corner which result in slow p-channel and
        slow n-channel). We could also have fast-fast, slow-fast, and fast-slow process corners
      • Increasing temperature increases delay
        – ⇑ Temp =⇒ ⇑ atomic vibration
        – ⇑ atomic vibration =⇒ ⇑ collisions with current-carrying electrons
        – ⇑ collisions with current-carrying electrons =⇒ ⇑ resistivity
        – ⇑ resistivity =⇒ ⇑ delay
      • Increasing supply voltage decreases delay
        – ⇑ supply voltage =⇒ ⇑ current
        – ⇑ current =⇒ ⇓ load capacitor charge time
        – ⇓ load capacitor charge time =⇒ ⇓ total delay
      • Derating factor is a number used to adjust timing number to account for voltage and temp
        conditions
      • ASIC manufacturers define classes of parts, based on a variety of operating environments:
                             VDD       TA (ambient temp) TC (case temp)
         Commercial 5V ± 5%                  0 to +70C
         Industrial      5V ± 10%          –40 to +85C
         Military        5V ± 10%                               –55 to +125C
      • What is important is the transistor temperature inside the chip, TJ (junction temperature)


5.5.1 Speed Binning

Speed binning is the process of testing each manufactured part to determine the maximum clock
speed at which it will run reliably.

Manufacturers sell chips off of the same manufacturing line at different prices based on how fast
they will run.

A “speed bin” is the clock speed that chips will be labeled with when sold.

Overclocking: running a chip at a clock speed faster than what it is rated for (and hoping that your
software crashes more frequently than your over-stressed hardware will).


5.5.1.1 FPGAs, Interconnect, and Synthesis

On FPGAs, 40–60% of the clock cycle is consumed by interconnect.
When synthesizing, increasing effort (number of iterations) of place and route can significantly
reduce the clock period on large designs.


5.5.2 Worst Case Timing

5.5.2.1 Fanout delay

In Smith’s book, Table 5.2 (Fanout delay) combines two separate parameters:

   • capacitive load delay
   • interconnect delay

into a single parameter (fanout). This is common, and fine.
But, when reading a table such as this, you need to know whether the fanout delay combines both
capacitive load delay and interconnect delay, or accounts for just the capacitive load.


5.5.2.2 Derating Factors

Delays are dependent upon supply voltage and temperature.

                                   ⇑ Temp       =⇒ ⇑ Delay
                               ⇑ Supply voltage =⇒ ⇓ Delay


Temperature     ..........................................................................

   • ⇑ Temp =⇒ ⇑ Delay

         – ⇑ Temp =⇒ ⇑ Resistivity of wires
         – As temp goes up, atoms vibrate more, and so have greater probability of colliding with
           electrons flowing with current.

   • ⇑ Supply voltage =⇒ ⇓ Delay

         – ⇑ Supply voltage =⇒ ⇑ current (V = IR)
         – ⇑ current =⇒ ⇓ time to charge load capacitors to threshold voltage


Derating Factor Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
A “derating factor” is a number used to adjust timing numbers to account for different temperature
and voltage conditions.

Excerpt from table 5.3 in Smith’s book (Actel Act 3 derating factors):


                                                  Derating factor           Temp         Vdd
                                                       1.17                 125C         4.5V
                                                       1.00                  70C         5.0V
                                                       0.63                 -55C         5.5V
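Applying a derating factor is a single multiplication of the nominal (reference-condition) delay. A minimal sketch, assuming a hypothetical 10 ns nominal delay at the 70C / 5.0V reference row (factor 1.00):

```python
nominal_ns = 10.0                    # hypothetical delay at 70C, 5.0V
worst_case_ns = nominal_ns * 1.17    # 125C, 4.5V: hotter, lower VDD -> slower
best_case_ns  = nominal_ns * 0.63    # -55C, 5.5V: colder, higher VDD -> faster
print(worst_case_ns, best_case_ns)   # about 11.7 ns and 6.3 ns
```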


5.6 Timing Analysis Problems
P5.1 Terminology

For each of the terms: clock skew, clock period, setup time, hold time, and clock-to-q, answer
which time periods (one or more of t1 – t9 or NONE) are examples of the term.

NOTES:
    1. The timing diagram shows the limits of the allowed times (either minimum or maximum).
    2. All timing parameters are non-negative.
    3. The signal “a” is the input to a rising-edge flop and “b” is the output. The clock is “clk1”.

[Figure: timing diagram for clk1, clk2, a, and b with labeled time intervals t1–t11; legend: solid region = signal is stable, hatched region = signal may change]

 clock skew
 clock period
 setup time
 hold time
 clock-to-q


P5.2 Hold Time Violations

P5.2.1 Cause

What is the cause of a hold time violation?


P5.2.2 Behaviour

What is the bad behaviour that results if a hold time violation occurs?


P5.2.3 Rectification

If a circuit has a hold time violation, how would you correct the problem with minimal effort?


P5.3 Latch Analysis

Does the circuit below behave like a latch? If not, explain why not. If so, calculate the clock-to-Q,
setup, and hold times; and answer whether it is active-high or active-low.

                 Gate Delays
                 AND     4
                 OR      2
                 NOT     1

   [Figure: candidate latch circuit with data input d, enable input en, and output q]


P5.4 Critical Path and False Path

Find the critical path through the following circuit:

[Figure: multi-level combinational circuit with signals a through m]


P5.5 Critical Path

   [Figure: combinational circuit with primary inputs a, b, c; internal signals d–i; outputs j, k, l, m]

                            gate delay
                            NOT      2
                            AND      4
                            OR       4
                            XOR      6
Assume all delay and timing factors other than combinational logic delay are negligible.


P5.5.1 Longest Path

List the signals in the longest path through this circuit.


P5.5.2 Delay

What is the combinational delay along the longest path?


P5.5.3 Missing Factors

What factors that affect the maximum clock speed does your analysis for parts 1 and 2 not take
into account?


P5.5.4 Critical Path or False Path?

Is the longest path that you found a real critical path, or a false path? If it is a false path, find the
real critical path. If it is a critical path, find a set of assignments to the primary inputs that
exercises the critical path.


P5.6 YACP: Yet Another Critical Path

Find the critical path in the circuit below.
[Figure: circuit with inputs a, b, c and internal signals d, e, f, g, h]


P5.7 Timing Models

In your next job, you have been told to use a “fanout” timing model, which states that the delay
through a gate increases linearly with the number of gates in the immediate fanout. You dimly
recall that a long time ago you learned about a timing model named Elmo, Elmwood, Elmore,
El-Morre, or something like that.

For the circuit shown below as a schematic and as a layout, answer whether the fanout timing
model closely matches the delay values predicted by the Elmore delay model.

[Figure: gate G1 driving gates G2, G3, G4, and G5, shown both as a schematic and as a layout.
The graphical symbol column of the table is omitted.]

     Description            Capacitance   Resistance
     Interconnect level 2   Cx            0
     Interconnect level 1   Cy            0
     Gate                   Cg            0
     Antifuse               0             R

Assumptions:
• The capacitance of a node on a wire is independent of where the node is located on the wire.


P5.8 Short Answer

P5.8.1 Wires in FPGAs

In an FPGA today, what percentage of the clock period is typically consumed by wire delay?


P5.8.2 Age and Time

If you were to compare a typical digital circuit from 5 years ago with a typical digital circuit
today, would you find that the percentage of the total clock period consumed by capacitive load
has increased, stayed the same, or decreased?


P5.8.3 Temperature and Delay

As temperature increases, does the delay through a typical combinational circuit increase, stay
the same, or decrease?


P5.9 Worst Case Conditions and Derating Factor

Assume that we have a ’Std’ speed grade Actel A1415 (an ACT 3 part) Logic Module that drives
4 other Logic Modules:


P5.9.1 Worst-Case Commercial

Estimate the delay under worst-case commercial conditions (assume that the junction temperature
is the same as the ambient temperature).


P5.9.2 Worst-Case Industrial

Find the derating factor for worst-case industrial conditions and calculate the delay (assume that
the junction temperature is the same as the ambient temperature).


P5.9.3 Worst-Case Industrial, Non-Ambient Junction Temperature

Estimate the delay under the worst-case industrial conditions (assuming that the junction
temperature is 105C).
Chapter 6

Power Analysis and Power-Aware Design

6.1 Overview
6.1.1 Importance of Power and Energy
• Laptops, PDAs, cell phones, etc — obvious!
• For microprocessors in personal computers, every watt above 40W adds $1 to manufacturing
  cost
• Approx 25% of operating expense of server farm goes to energy bills
• (Dis)Comfort of Unix labs in E2
• Sandia Labs had to build a special sub-station when they took delivery of Teraflops massively
  parallel supercomputer (over 9000 Pentium Pros)
• High-speed microprocessors today can run so hot that they will damage themselves — Athlon
  reliability problems, Pentium 4 processor thermal throttling
• In 2000, information technology consumed 8% of the total power in the US.
• Future power viruses: cell-phone viruses that cause the phone to run in full-power mode and
  drain the battery very quickly; PC viruses that cause the CPU to melt down


6.1.2 Industrial Names and Products

All of the articles and papers below are linked to from the Documentation page on the E&CE 327
web site.

Overview white paper by Intel:

PC Energy-Efficiency Trends and Technologies An 8-page overview of energy and power trends,
written in 2002. Available from the web at an intolerably long URL.





AMD’s Athlon PowerNow!
   Reduce power consumption in laptops when running on battery by allowing software to
   reduce clock speed and supply voltage when performance is less important than battery life.

Intel Speedstep
      Reduce power consumption in laptops when running on battery by reducing clock speed to
      70-80% of normal.

Intel X-Scale
      An ARM5-compatible microprocessor for low-power systems:
      http://developer.intel.com/design/intelxscale/

Synopsys PowerMill
    A simulator that estimates power consumption of the circuit as it is simulated:
      http://www.synopsys.com/products/etg/powermill ds.html

DEC / Compaq / HP Itsy A tiny but powerful PDA-style computer running Linux and
   X-windows. Itsy was created in 1998 by DEC's Western Research Laboratory to be an
   experimental platform in low-power, energy-efficient computing. Itsy led to the iPAQ
   PocketPC.
      www.hpl.hp.com/techreports/Compaq-DEC/WRL-2000-6.html
      www.hpl.hp.com/research/papers/2003/handheld.html

Satellites Satellites run on solar power and batteries. They travel great distances doing very
     little, then have a brief period of very intense activity as they pass by an astronomical object of
     interest. Satellites need efficient means to gather and store energy while they are flying
     through space. Satellites need powerful, but energy efficient, computing and
     communication devices to gather, process, and transmit data. Designing computing devices
     for satellites is an active area of research and business.


6.1.3 Power vs Energy

Most people talk about “power” reduction, but sometimes they mean “power” and sometimes
“energy.”
• Power minimization is usually about heat removal
• Energy minimization is usually about battery life or energy costs
  Type     Units    Equivalent Types   Equations
  Energy   Joules   Work               = Volts × Coulombs
                                       = ½ × C × Volts²

  Power    Watts    Energy / Time      = Volts × I
                                       = Joules / sec


6.1.4 Batteries, Power and Energy

6.1.4.1 Do Batteries Store Energy or Power?


                                Energy = Volts × Coulombs

                                 Power = Energy / Time

Batteries are rated in amp-hours at a given voltage.


                           battery = Amps × Seconds × Volts
                                   = (Coulombs / Seconds) × Seconds × Volts
                                   = Coulombs × Volts
                                   = Energy

Batteries store energy.
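The unit conversion above can be sketched as a tiny Python check (function name is mine, not from the notes):

```python
def battery_energy_joules(amp_hours, volts):
    """Amp-hours x 3600 sec/hour = amp-seconds = coulombs;
    coulombs x volts = joules."""
    coulombs = amp_hours * 3600.0
    return coulombs * volts

# The 10 V, 2.5 AH battery used later in this section stores 90 000 J:
print(battery_energy_joules(2.5, 10.0))  # -> 90000.0
```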


6.1.4.2 Battery Life and Efficiency

To extend battery life, we want to increase the amount of work done and/or decrease the energy
consumed.

Work and energy have the same units; therefore, to extend battery life, we truly want to improve
efficiency.

The “power efficiency” of microprocessors is normally measured in MIPS/Watt. Is this a real
measure of efficiency?

                  MIPS / Watts = (millions of instructions / Seconds) × (Seconds / Energy)
                               = millions of instructions / Energy

Both instructions executed and energy are measures of work, so MIPS/Watt is a measure of
efficiency.

(This assumes that all instructions perform the same amount of work!)


6.1.4.3 Battery Life and Power

  Question: Running a VHDL simulation requires executing an average of 1 million
    instructions per simulation step. My computer runs at 700MHz, has a CPI of 1.0, and
    burns 70W of power. My battery is rated at 10V and 2.5AH. Assuming all of my
    computer’s clock cycles go towards running VHDL simulations, how many
    simulation steps can I run on one battery charge?


  Answer:


      Outline of approach:
       1.   Unify the units
       2.   Calculate amount of energy stored in battery
       3.   Calculate energy consumed by each simulation step
       4.   Calculate number of simulation steps that can be run

      Unify the units:

       Amp (current)                                    Coulomb / sec
       Volt (potential difference, energy per charge)   Joule / Coulomb
       Watt (power)                                     Joule / sec

      Energy stored in battery:


                     Ebatt : check the equation by checking the units

                     Ebatt = AmpHours × Vbatt
                           = Amp × hour × (sec / hour) × Volt
                           = (Coulomb / sec) × hour × (sec / hour) × (Joule / Coulomb)
                           = Joule

                             units match, do the math:

                     Ebatt = 2.5 AH × 3600 sec/hour × 10 V
                           = 90 000 Joules



      Energy per simulation step:




           Estep : check the units

           Estep = Watts × . . .
                 = (Joule / sec) × (sec / cyc) × (cyc / instr) × (instr / step)
                 = Joule / step

                   units check, do the math:

           Estep = 70 Watts × (1 / (700 × 10^6 cyc/sec)) × 1.0 cyc/instr × 10^6 instr/step
                 = 0.1 Joule/step


     Number of steps:

                              NumSteps = Ebatt / Estep
                                       = 90 000 / 0.1
                                       = 900 000 steps
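The whole calculation can be packaged as one Python sketch (function and parameter names are mine, not the notes'):

```python
# Worked example: 70 W machine, 700 MHz clock, CPI = 1.0,
# 1e6 instructions per simulation step, 10 V 2.5 AH battery.

def simulation_steps_per_charge(amp_hours, volts, watts, clock_hz,
                                cpi, instr_per_step):
    e_batt = amp_hours * 3600.0 * volts              # joules in the battery
    sec_per_step = cpi * instr_per_step / clock_hz   # seconds per sim step
    e_step = watts * sec_per_step                    # joules per sim step
    return e_batt / e_step

print(simulation_steps_per_charge(2.5, 10.0, 70.0, 700e6, 1.0, 1e6))
```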



  Question: If I use the SpeedStep feature of my computer, my computer runs at
    600MHz with 60W of power. With SpeedStep activated, how much longer can I keep
    the computer running on one battery?


  Answer:


     Approach:
       1. Calculate uptime with Speedstep turned off (high power)
       2. Calculate uptime with Speedstep turned on (low power)
       3. Calculate difference in uptimes

     High-power uptime:



                                TH = Ebatt / PH
                                   = 90 000 Watt-secs / 70 Watts
                                   ≈ 1285 secs
                                   ≈ 21 minutes


      Low-power uptime:

                                TL = Ebatt / PL
                                   = 90 000 Watt-secs / 60 Watts
                                   = 1500 secs
                                   = 25 minutes


      Difference in uptimes:


                                   Tdiff = TL − TH
                                         = 25 − 21
                                         = 4 minutes



      Analysis:

      This question is based on data from a typical laptop. So, why are the
      predicted uptimes so much shorter than those experienced in reality?

      Answer: The power consumption figures are the maximum peak power
      consumption of the laptop: disk spinning, fan blowing, bus active, all
      peripherals active, all modules on the CPU turned on. In reality, laptops
      almost never experience their maximum power consumption.
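The uptime arithmetic above can be checked with a short Python sketch (constant and function names are mine):

```python
E_BATT = 90_000.0  # joules stored in the 10 V, 2.5 AH battery

def uptime_minutes(watts):
    return E_BATT / watts / 60.0

t_high = uptime_minutes(70.0)   # ~21.4 min at 700 MHz / 70 W
t_low = uptime_minutes(60.0)    # 25.0 min at 600 MHz / 60 W
print(round(t_low - t_high, 1))  # -> 3.6
```

The notes' figure of 4 minutes comes from rounding each uptime to whole minutes before subtracting.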


  Question: With SpeedStep activated, how many more simulation steps can I run on
    one battery?


   Answer:


      Clock speed is proportional to power consumption. In both high-power and
      low-power modes, the system runs the same number of clock cycles on the
      energy stored in the battery. So, we can run the same number of simulation
      steps both with and without SpeedStep activated.

      Analysis:

      In reality, with SpeedStep activated, I am able to run more simulation steps.
      Why does the theoretical calculation disagree with reality?

      Answer: In reality, the processor does not use 100% of the clock cycles for
      running the simulator. Many clock cycles are “wasted” while waiting for I/O
      from the disk, user, etc. When reducing the clock speed, a smaller number of
      clock cycles are wasted as idle clock cycles.
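The core of the argument is that the energy per simulation step is identical in both modes, because power scales linearly with clock speed here. A minimal sketch (names are mine):

```python
CYC_PER_STEP = 1e6  # 1.0 CPI x 1e6 instructions per simulation step

def joules_per_step(watts, clock_hz):
    return watts * CYC_PER_STEP / clock_hz

print(joules_per_step(70.0, 700e6))  # high-power mode -> 0.1
print(joules_per_step(60.0, 600e6))  # low-power mode  -> 0.1
```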



6.2 Power Equations

                Power = SwitchPower + ShortPower + LeakagePower
                      = DynamicPower + StaticPower

                where DynamicPower = SwitchPower + ShortPower
                and StaticPower = LeakagePower

Dynamic Power dependent upon clock speed
    Switching Power useful — charges up transistors
    Short Circuit Power not useful — both N and P transistors are on
Static Power independent of clock speed
      Leakage Power not useful — leaks around transistor

Dynamic power is proportional to how often signals change their value (switch).
• Roughly 20% of signals switch during a clock cycle.
• Need to take glitches into account when calculating activity factor. Glitches increase the
  activity factor.
• Equations for dynamic power contain clock speed and activity factor.


6.2.1 Switching Power



[Figure: an inverter whose output switches 0→1 charges CapLoad through the pullup;
an output switching 1→0 discharges CapLoad through the pulldown]


                   energy to (dis)charge capacitor = ½ × CapLoad × VoltSup²

When a capacitor C is charged to a voltage V , the energy stored in the capacitor is ½CV².
The energy required to charge the capacitor from 0 to V is CV². Half of the energy (½CV²) is
dissipated as heat through the pullup resistance, and half is transferred to the capacitor.
When the capacitor discharges from V to 0, the energy stored in the capacitor (½CV²) is
dissipated as heat through the pulldown resistance.

f ′ : frequency at which the inverter goes through a complete charge-discharge cycle (eqn 15.4 in
Smith).


                      average switching power = f ′ × CapLoad × VoltSup2


       ClockSpeed      clock speed
       ActFact         average number of times that signal switches from 0 → 1 or from
                       1 → 0 during a clock cycle



        average switching power = ½ × ActFact × ClockSpeed × CapLoad × VoltSup²
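As a sanity check on the formula, a small Python sketch with illustrative values (not from the notes): 20% activity factor, 100 MHz clock, 100 fF load, and a 2.5 V supply.

```python
def switching_power(act_fact, clock_hz, cap_load_f, volt_sup):
    # PwrSw = 1/2 x ActFact x ClockSpeed x CapLoad x VoltSup^2
    return 0.5 * act_fact * clock_hz * cap_load_f * volt_sup ** 2

print(switching_power(0.2, 100e6, 100e-15, 2.5))  # ~6.25e-06 W per signal
```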


6.2.2 Short-Circuited Power
[Figure: as the inverter input voltage Vi ramps between GND and VoltSup, both the N and P
transistors are on while Vi is between VoltThresh and VoltSup − VoltThresh; during this interval,
TimeShort, a short-circuit current IShort flows from supply to ground]
           PwrShort = ActFact × ClockSpeed × TimeShort × IShort × VoltSup


6.2.3 Leakage Power
[Figure: cross section of an inverter showing the parasitic diode, and the diode's I-V curve
showing the leakage current ILeak]

                                    PwrLk = ILeak × VoltSup

                                    ILeak ∝ e^(−q × VoltThresh / (k × T))


6.2.4 Glossary

 ClockSpeed      def   clock speed
                 aka   f
 ActFact         def   activity factor
                 aka   A
                  =    NumTransitions / (NumSignals × NumClockCycles)
                  =    Per signal: percentage of clock cycles in which the signal changes value.
                  =    Per clock cycle: percentage of signals that change value in the clock
                       cycle. Note: when measuring per circuit, sometimes approximated by
                       looking only at flops, rather than every single signal.
 TimeShort       def   short-circuit time
                 aka   τ
                  =    Time that both N and P transistors are turned on when a signal
                       changes value.
 MaxClockSpeed   def   maximum clock speed that an implementation technology can support
                 aka   fmax
                  ∝    (VoltSup − VoltThresh)² / VoltSup
 VoltSup         def   supply voltage
                 aka   V
 VoltThresh      def   threshold voltage
                 aka   Vth
                  =    voltage at which P transistors turn on
 ILeak           def   leakage current
                 aka   IS (reverse-bias saturation current)
                  ∝    e^(−q × VoltThresh / (k × T))
 IShort          def   short-circuit current
                 aka   Ishort
                  =    Current that goes through the transistor network while both N and P
                       transistors are turned on.
 CapLoad         def   load capacitance
                 aka   CL
 PwrSw           def   switching power (dynamic)
                  =    ½ × ActFact × ClockSpeed × CapLoad × VoltSup²
 PwrShort        def   short-circuit power (dynamic)
                  =    ActFact × ClockSpeed × TimeShort × IShort × VoltSup
 PwrLk           def   leakage power (static)
                  =    ILeak × VoltSup
 Power           def   total power
                  =    PwrSw + PwrShort + PwrLk


 q     def    electron charge
        =     1.60218 × 10^−19 C
 k     def    Boltzmann's constant
        =     1.38066 × 10^−23 J/K
 T     def    temperature in Kelvin
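Because ILeak depends exponentially on VoltThresh, small threshold changes have large effects. A Python sketch (helper name and the T = 300 K operating point are my own assumptions) of the relative change in leakage:

```python
import math

Q = 1.60218e-19  # electron charge, C
K = 1.38066e-23  # Boltzmann's constant, J/K

def leakage_ratio(vth_new, vth_old, temp_k=300.0):
    """Factor by which ILeak grows when VoltThresh moves vth_old -> vth_new,
    using ILeak proportional to exp(-q x VoltThresh / (k x T))."""
    return math.exp(-Q * (vth_new - vth_old) / (K * temp_k))

# Dropping the threshold by 100 mV multiplies leakage by roughly 48x:
print(round(leakage_ratio(0.6, 0.7)))  # -> 48
```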


6.2.5 Note on Power Equations

The power equation:

             Power = DynamicPower + StaticPower
                   = PwrSw + PwrShort + PwrLk
                   =   (½ × ActFact × ClockSpeed × CapLoad × VoltSup²)
                     + (ActFact × ClockSpeed × TimeShort × IShort × VoltSup)
                     + (ILeak × VoltSup)

is for an individual signal.

To calculate dynamic power for n signals with different CapLoad, TimeShort, and IShort:

      DynamicPower =   ( ∑(i=1..n) ½ × ActFact_i × CapLoad_i × ClockSpeed × VoltSup² )
                     + ( ∑(i=1..n) ActFact_i × ClockSpeed × TimeShort_i × IShort_i × VoltSup )


If we know the average CapLoad, TimeShort, and IShort for a collection of n signals, then the
above formula simplifies to:


 DynamicPower =   (n × ActFact_AVG × ½ × CapLoad_AVG × ClockSpeed × VoltSup²)
                + (n × ActFact_AVG × ClockSpeed × TimeShort_AVG × IShort_AVG × VoltSup)


If the capacitances and short-circuit parameters don't have an even distribution, then don't
average them. If high-capacitance signals have high activity factors, then averaging the equations
will result in erroneously low predictions for power.
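This pitfall is easy to demonstrate with made-up numbers: two signals where the heavily loaded one is also the busy one. Only the switching term is shown (names and values are illustrative):

```python
CLOCK = 100e6  # Hz
VSUP = 2.5     # volts

signals = [           # (ActFact, CapLoad in farads)
    (0.40, 200e-15),  # busy, heavily loaded signal
    (0.05, 20e-15),   # quiet, lightly loaded signal
]

# Exact: sum the per-signal switching power.
exact = sum(0.5 * a * CLOCK * c * VSUP ** 2 for a, c in signals)

# Averaged: n x (average ActFact) x (average CapLoad) x ...
n = len(signals)
a_avg = sum(a for a, _ in signals) / n
c_avg = sum(c for _, c in signals) / n
averaged = n * 0.5 * a_avg * CLOCK * c_avg * VSUP ** 2

print(exact > averaged)  # -> True: the averaged estimate is too low
```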


6.3 Overview of Power Reduction Techniques
We can divide power reduction techniques into two classes: analog and digital.
analog
       Parameters to work with:
           capacitance for example, Silicon on Insulator (SOI)
           resistance for example, copper wires
           voltage low-voltage circuits
          Techniques:
             dual-VDD Two different supply voltages: high voltage for performance-critical
                 portions of design, low voltage for remainder of circuit. Alternatively, can vary
                 voltage over time: high voltage when running performance-critical software and
                 low voltage when running software that is less sensitive to performance.
             dual-Vt Two different threshold voltages: transistors with low threshold voltage for
                 performance-critical portions of design (can switch more quickly, but more
                 leakage power), transistors with high threshold voltage for remainder of circuit
                 (switches more slowly, but reduces leakage power).
             exotic circuits Special flops, latches, and combinational circuitry that run at a high
                 frequency while minimizing power
             adiabatic circuits Special circuitry that consumes power on 0 → 1 transitions, but
                 not 1 → 0 transitions. These sacrifice performance for reduced power.
             clock trees Up to 30% of total power can be consumed in clock generation and
                 clock tree

digital
          Parameters to work with:
             capacitance (number of gates)
             activity factor
             clock frequency
          Techniques:
             multiple clocks Put a high speed clock in performance-critical parts of design and a
                 low speed clock for remainder of circuit
             clock gating Turn off clock to portions of a chip when it’s not being used
             data encoding Gray coding vs one-hot vs fully encoded vs ...
             glitch reduction Adjust circuit delays or add redundant circuitry to reduce or
                 eliminate glitches.
             asynchronous circuits Get rid of clocks altogether....

Additional low-power design techniques for RTL from a Qualis engineer:
http://home.europa.com/~celiac/lowpower.html


6.4 Voltage Reduction for Power Reduction
If our goal is to reduce power, the most promising approach is to reduce the supply voltage,
because, from:


            Power =   (½ × ActFact × ClockSpeed × CapLoad × VoltSup²)
                    + (ActFact × ClockSpeed × TimeShort × IShort × VoltSup)
                    + (ILeak × VoltSup)

we observe:


                                           Power ∝ VoltSup2


Reducing Difference Between Supply and Threshold Voltage . . . . . . . . . . . . . . . . . . . . . . . . . . . .
As the supply voltage decreases, it takes longer to charge up the capacitive load, which increases
the load delay of a circuit.

In the chapter on timing analysis, we saw that increasing the supply voltage will decrease the
delay through a circuit. (From V = IR, increasing V causes an increase in I, which causes the
capacitive load to charge more quickly.) However, it is more accurate to take into account both
the value of the supply voltage, and the difference between the supply voltage and the threshold
voltage.


                           MaxClockSpeed ∝ (VoltSup − VoltThresh)² / VoltSup


   Question: If the delay along the critical path of a circuit is 20 ns, the supply voltage
     is 2.8 V, and the threshold voltage is 0.7 V, calculate the critical path delay if the
     supply voltage is dropped to 2.2 V.


   Answer:

                              d   20ns    current delay along critical path
                             d′      ??   new delay along critical path
                             V    2.8V    current supply voltage
                             V′   2.2V    new supply voltage
                             Vt   0.7V    threshold voltage



                  MaxClockSpeed ∝ 1/d

                  MaxClockSpeed ∝ (VoltSup − VoltThresh)² / VoltSup

                             d ∝ V / (V − Vt)²

                         d′ / d = ( V′ / (V′ − Vt)² ) × ( (V − Vt)² / V )

                             d′ = d × ( V′ / (V′ − Vt)² ) × ( (V − Vt)² / V )
                                = 20 ns × ( 2.2 V / (2.2 V − 0.7 V)² ) × ( (2.8 V − 0.7 V)² / 2.8 V )
                                = 31 ns
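The same scaling can be written as a Python helper (function name is mine):

```python
# d is proportional to VoltSup / (VoltSup - VoltThresh)^2

def scaled_delay(d_ns, v_old, v_new, v_th):
    return d_ns * (v_new / (v_new - v_th) ** 2) * ((v_old - v_th) ** 2 / v_old)

# 20 ns path, supply dropped from 2.8 V to 2.2 V, Vth = 0.7 V:
print(round(scaled_delay(20.0, 2.8, 2.2, 0.7)))  # -> 31 (ns)
```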


Reducing Threshold Voltage Increases Leakage Current . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
If we reduce the supply voltage, we want to also reduce the threshold voltage, so that we do not
increase the delay through the circuit. However, as threshold voltage drops, leakage current
increases:

                                 ILeak ∝ e^(−q × VoltThresh / (k × T))


And increasing the leakage current increases the power:

                                              Power ∝ ILeak

So, we need to strike a balance between reducing VoltSup (which has a quadratic effect on
reducing power) and increasing ILeak (which has a linear effect on increasing power).



6.5 Data Encoding for Power Reduction
6.5.1 How Data Encoding Can Reduce Power

Data encoding is a technique that chooses data values so that normal execution will have a low
activity factor.
The most common example is “Gray coding” where exactly one bit changes value each clock
cycle when counting.


Two ways to understand the pattern for Gray-code counting. Both methods are based on noting
when a bit in the Gray code toggles from 0 to 1 or 1 to 0.

   • To convert from binary to Gray, a bit in the Gray code toggles whenever the corresponding
     bit in the binary code goes from 0 to 1. (US Patent 4618849, issued in 1984.)

   • To implement a Gray code counter from scratch, number the bits from 1 to n, with a special
     less-than-least-significant bit q0. The output of the counter will be qn . . . q1.

       1. Create a flop that toggles in each clock cycle: q0 <= not q0
       2. Bit 1 toggles whenever q0 is 1.
       3. For each bit i ∈ 2..n, the counter bit qi toggles whenever qi−1 is 1 and all of the
          bits qi−2 . . . q0 are 0.
       4. This behaviour can be implemented in a ripple-carry style by introducing carry (ci)
          and toggle (qti) signals for each bit:

              q0   <=   not(q0)             reg asn
              c0   <=   not(q0)             comb asn
              ci   <=   ci−1 and not(qi)    comb asn
              qti  <=   qi−1 and ci−2       comb asn

          We create a toggle flip-flop by xoring the output of a D-flop with its toggle signal:

              qi   <=   qi xor qti          reg asn

  Decimal    Gray   Binary
     0       0000    0000
     1       0001    0001
     2       0011    0010
     3       0010    0011
     4       0110    0100
     5       0111    0101
     6       0101    0110
     7       0100    0111
     8       1100    1000
     9       1101    1001
    10       1111    1010
    11       1110    1011
    12       1010    1100
    13       1011    1101
    14       1001    1110
    15       1000    1111
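The binary-to-Gray relationship can be checked with a short Python sketch, using the standard conversion g = b xor (b >> 1), which produces the same table as in the notes:

```python
def gray(n_bits):
    # Gray code for each binary count value b in sequence.
    return [b ^ (b >> 1) for b in range(2 ** n_bits)]

codes = gray(4)
print([format(g, "04b") for g in codes[:4]])  # ['0000', '0001', '0011', '0010']

# Exactly one bit toggles between consecutive codes (including wraparound):
assert all(bin(codes[i] ^ codes[i - 1]).count("1") == 1
           for i in range(len(codes)))
```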


  Question: For an eight-bit counter, how much more power will a binary counter
    consume than a Gray-code counter?


  Answer:


418                       CHAPTER 6. POWER ANALYSIS AND POWER-AWARE DESIGN


      Power consumption is dependent on area and activity factor. The original
      purpose of this problem was to focus on activity factor. The problem was
      created under the mistaken assumption that a Gray code counter and a
      binary code counter will both use the same area (1 FPGA cell per bit),
      so that the power difference comes from the difference in activity
      factors. This mistake is addressed at the end of the solution.

      For Gray coding, exactly one bit toggles in each clock cycle. Thus, the
      activity factor for an n-bit Gray counter will be 1/n.

      For binary coding, the least significant bit toggles in every clock cycle, so it
      has an activity factor of 1. The 2nd least-significant bit toggles in every other
      clock cycle, so it has an activity factor of 1/2. We study the other bits and try
      to find a pattern based on the bit position, i, where i = 0 for the least-significant
      bit and n − 1 for the most significant bit of an n-bit counter. We see that for bit
      i, the activity factor is 1/2^i.

      For an n-bit binary counter, the average activity factor is the sum of the
      activity factors for the signals over the number of signals:

                      BinaryActFact = (1/2^0 + 1/2^1 + 1/2^2 + · · · + 1/2^(n−1)) / n

                                    = (1/n) × ∑_{i=0}^{n−1} (1/2)^i


      The limit of the summation term as n goes to infinity is 2. We can see this as
      an instance of Zeno’s paradox, in that with each step we halve the distance to
      2.

                                  BinaryActFact ≈ (1/n) × 2

                                                ≈ 2/n

      Find the ratio of the binary activity factor to the Gray-code activity factor.

                          BinaryActFact / GrayActFact = (2/n) × (n/1)

                                                      = 2


      In reality, the ripple-carry Gray code counter will always have two
      transitions per clock cycle: one for the q0 toggle flop and one for the
      actual signal in the counter that toggles. Thus the Gray code counter
      will consume more power than the binary counter. The overall power
      reduction comes from the circuit that uses the Gray code.


   Question: For completely random eight-bit data, how much more power will a binary
     circuit consume than a Gray-code circuit?


   Answer:


      If the data is completely random, then the Gray code loses its feature that
      consecutive data values differ in only one bit position. In fact, the activity
      factor for Gray code and binary code will be the same: there will be no power
      saving from using Gray code, and a binary circuit will consume the same power
      as a Gray-code circuit.

      On average, half of the bits will be 1 and half will be 0. For each bit, there are
      four possible transitions: 0→0, 0→1, 1→0, and 1→1. Of these four transitions,
      two cause a change in value and two do not. Half of the transitions result in a
      change in value; therefore, for random data the activity factor will be 0.5,
      independent of data encoding or the number of bits.
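Because binary-to-Gray conversion is a bijection, re-encoding uniformly random data cannot change its transition statistics. The following exhaustive Python check (our own sketch) averages the per-bit activity over all ordered pairs of 8-bit values:

```python
n = 8
values = range(2 ** n)

def avg_activity(encode):
    """Average fraction of bits that change between two independent,
    uniformly random n-bit values, under a given encoding."""
    total = sum(bin(encode(a) ^ encode(b)).count('1')
                for a in values for b in values)
    return total / (n * 2 ** n * 2 ** n)

binary_af = avg_activity(lambda x: x)             # 0.5
gray_af = avg_activity(lambda x: x ^ (x >> 1))    # 0.5 as well
```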


6.5.2 Example Problem: Sixteen Pulser

6.5.2.1 Problem Statement

Your task is to do the power analysis for a circuit that should send out a one-clock-cycle pulse on
the done signal once every 16 clock cycles. (That is, done is ’0’ for 15 clock cycles, then ’1’ for
one cycle, then repeat with 15 cycles of ’0’ followed by a ’1’, etc.)
[Waveform: clk with cycles numbered 1, 2, 3, … 15, 16, 17, … 31, 32, 33;
 done pulses high for one clock cycle at cycles 16 and 32]

                             Required behaviour
You have been asked to consider three different types of counters: a binary counter, a Gray-code
counter, and a one-hot counter. (The table below shows the values from 0 to 15 for the different
encodings.)


   Question: What is the relative amount of power consumption for the different
     options?


6.5.2.2 Additional Information

Your implementation technology is an FPGA where each cell has a programmable combinational
circuit and a flip-flop. The combinational circuit has 4 inputs and 1 output. The capacitive load of
the combinational circuit is twice that of the flip-flop.

[Figure: FPGA cell — a PLA (the combinational circuit) feeding a flip-flop]
    1. You may neglect power associated with clocks.
    2. You may assume that all counters:
           (a) are implemented on the same fabrication process
           (b) run at the same clock speed
           (c) have negligible leakage and short-circuit currents


6.5.2.3 Answer

Outline of Thinking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Two factors distinguish the options: capacitance and activity factor.

Capacitance depends upon the number of signals, and on whether a signal is combinational or a
flop.

Sketch out the circuitry to evaluate capacitance.


Sketch the Circuitry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Name the output “done” and the count digits “d()”.


[Figure: four cells, each a PLA feeding a flop, produce d(0), d(1), d(2), d(3);
 a fifth PLA combines d(3..0) to produce done]

 Block diagram for Gray and Binary Counters

[Figure: a chain of sixteen cells d(0) → d(1) → … → d(15);
 done is taken from d(15)]

               Block diagram for One-Hot
Observation:


     The Gray and Binary counters have the same design, and the Gray counter will have
     the lower activity factor. Therefore, the Gray counter will have lower power than the
     Binary counter.

     However, we don’t know how much lower the power of the Gray counter will be, and
     we don’t know how much power the One-Hot counter will consume.


Capacitance    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..


                           cap   number   subtotal cap
  Gray    d()   PLAs        2      4            8
                Flops       1      4            4
          done  PLAs        2      1            2
                Flops       1      0            0
  1-Hot   d()   PLAs        2      0            0
                Flops       1     16           16
          done  PLAs        2      0            0
                Flops       1      0            0
  Binary  d()   PLAs        2      4            8
                Flops       1      4            4
          done  PLAs        2      1            2
                Flops       1      0            0


Activity Factors   .......................................................................
[Waveforms of clk, d(3..0), and done over 16 clock cycles, annotated with the
 number of transitions per 16 cycles:]

                 Gray coding   One-hot coding   Binary coding
    d(0)             8/16           2/16            16/16
    d(1)             4/16           2/16             8/16
    d(2)             2/16           2/16             4/16
    d(3)             2/16           2/16             2/16
    done             2/16           2/16             2/16


                         act fact
  Gray    d()   PLAs    1/4 (one of the four signals toggles in each clock cycle)
                Flops   1/4 (one of the four signals toggles in each clock cycle)
          done  PLAs    2 transitions / 16 clock cycles
                Flops   —
  1-Hot   d()   PLAs    —
                Flops   2 transitions / 16 clock cycles
          done  PLAs    —
                Flops   —
  Binary  d()   PLAs    (16 + 8 + 4 + 2 transitions) / (4 signals × 16 clock cycles) = 0.47
                Flops   (16 + 8 + 4 + 2 transitions) / (4 signals × 16 clock cycles) = 0.47
          done  PLAs    2 transitions / 16 clock cycles
                Flops   —

          Note: Activity factor for One-Hot counter      Because all signals have the same
          capacitance, and every clock cycle has the same number of transitions, the
          One-Hot activity factor could also have been calculated as two transitions per
          sixteen signals.


Putting it all Together    ................................................................ .
                            subtotal cap act fact power
  Gray     d() PLAs              8         1/4       2
               Flops             4         1/4       1
          done PLAs              2        2/16     4/16
               Flops             0         —         0
               Total                               3.25
 1-Hot    d() PLAs               0         —         0
               Flops            16        2/16       2
          done PLAs              0         —         0
               Flops             0         —         0
               Total                                 2
 Binary   d() PLAs               8        0.47     3.76
               Flops             4        0.47     1.88
          done PLAs              2        2/16     0.25
               Flops             0         —         0
               Total                               5.87
If we choose Binary counting as the baseline, then the relative amounts of power are:
 Gray    55%
 One-Hot 34%
 Binary  100%
If we choose One-Hot counting as the baseline, then the relative amounts of power are:
 Gray    163%
 One-Hot 100%
 Binary  294%
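The totals can be re-derived mechanically. This Python sketch (our own naming) multiplies each row's subtotal capacitance by its activity factor; small differences from rounded percentages in the text come from using the exact fraction 30/64 = 0.46875 rather than 0.47:

```python
# rows of (subtotal capacitance, activity factor), from the tables above
rows = {
    'Gray':   [(8, 1/4), (4, 1/4), (2, 2/16)],
    '1-Hot':  [(16, 2/16)],
    'Binary': [(8, 30/64), (4, 30/64), (2, 2/16)],
}
power = {name: sum(c * a for c, a in terms) for name, terms in rows.items()}
# power: Gray 3.25, 1-Hot 2.0, Binary 5.875
rel = {name: round(100 * p / power['Binary']) for name, p in power.items()}
# relative to Binary: Gray 55%, 1-Hot 34%, Binary 100%
```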


6.6 Clock Gating
The basic idea of clock gating is to reduce power by turning off the clock when a circuit isn’t
needed. This reduces the activity factor.


6.6.1 Introduction to Clock Gating

                                    Examples of Clock Gating
             Condition                                 Circuitry turned off
   O/S in standby mode              Everything except “core” state (PC, registers, caches, etc)
   No floating point instructions    floating point circuitry
   for k clock cycles
   Instruction cache miss           Instruction decode circuitry
   No instruction in pipe stage i   Pipe stage i


Design Tradeoffs            ..................................................................... .


+ Can significantly reduce activity factor (Synopsys PowerCompiler claims that gating can cut
   power to 50–80% of the ungated level)

− Increases design complexity
    • design effort
    • bugs!

− Increases area

− Increases clock skew


Functional Validation and Clock Gating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
It’s a functional bug to turn a clock off when it’s needed for valid data.

It’s functionally ok, but wasteful to turn a clock on when it’s not needed.

(About 5% of the bugs caught on Willamette (Intel Pentium 4 Processor) were related to clock
gating.) Nicolas Mokhoff. EE Times. June 27, 2001.
http://www.edtn.com/story/OEG20010621S0080


6.6.2 Implementing Clock Gating

Clock gating is implemented by adding a component that disables the clock when the circuit isn’t
needed.
[Figure: a module with inputs i_data, i_valid, clk and outputs o_data, o_valid]

                    Without clock gating

[Figure: the same module, now clocked by cool_clk; a Clock Enable State Machine
 takes i_wakeup and clk and produces clk_en, which gates clk to form cool_clk]

                    With clock gating


The total power of a circuit with clock gating is the sum of the power of the main circuit with a
reduced activity factor and the power of the clock gating state machine with its activity factor.
The clock-gating state machine must always be on, so that it will detect the wakeup signal — do
not make the mistake of gating the clock to your clock gating circuit!


6.6.3 Design Process

Design Decisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
• What level of granularity for gated clocks?
  – entire module?
  – individual pipe stages?
  – something in between?
•   When should the clocks turn off?
•   When should the clocks turn on?
•   Protocol for incoming wakeup signal?
•   Protocol for outgoing wakeup signal?


Wakeup Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Designers negotiate incoming and outgoing wakeup protocol with environment.

An example wakeup protocol:
• wakeup_in will arrive 1 clock cycle before valid data
• wakeup_in will stay high until there have been at least 3 clock cycles of invalid data


Design Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
When designing clock gating circuitry, consider the two extreme cases:
• a constant stream of valid data
• circuit is turned off and receives a single parcel of valid data

For a constant stream of valid data, the key is to not incur a large overhead in design complexity,
area, or clock period when clocks will always be toggling.

For a single parcel of valid data, the key is to make sure that the clocks are toggling so that the
data can percolate through the circuit. Also, we want to turn off the clock as soon as possible
after the data leaves.


6.6.4 Effectiveness of Clock Gating

We can measure the effectiveness of clock gating by comparing the percentage of clock cycles
when the clock is not toggling to the percentage of clock cycles that the circuit does not have
valid data (i.e. the clock does not need to toggle).

The most ineffective clock gating scheme is to never turn off the clock (let the clock always
toggle). The most effective clock gating scheme is to turn off the clock whenever the circuit is not
processing valid data.

Parameters to characterize effectiveness of clock gating:

 Eff         =   effectiveness of clock gating
 PctValid    =   percentage of clock cycles with valid data in the circuit — the clock
                 must be toggling
 PctClk      =   percentage of clock cycles that clock toggles
Effectiveness measures the percentage of clock cycles with invalid data in which the clock is
turned off. Equation for effectiveness of clock gating:

                                      Eff = PctClkOff / PctInvalid

                                          = (1 − PctClk) / (1 − PctValid)


   Question:     What is the effectiveness if the clock toggles only when there is valid data?


   Answer:


      PctClk = PctValid and the effectiveness should be 1:

                                      Eff = (1 − PctClk) / (1 − PctValid)
                                          = (1 − PctValid) / (1 − PctValid)
                                          = 1


   Question:     What is the effectiveness of a clock that always toggles?


   Answer:


      If the clock is always toggling, then PctClk = 100% and the effectiveness
      should be 0.

                                      Eff = (1 − PctClk) / (1 − PctValid)
                                          = (1 − 1) / (1 − PctValid)
                                          = 0

   Question:     What does it mean for a clock gating scheme to be 75% effective?

   Answer:
      75% of the time that there is invalid data, the clock is off.

   Question:     What happens if PctClk < PctValid?

   Answer:
      If PctClk < PctValid, then 1 − PctClk > 1 − PctValid, so the effectiveness
      will be greater than 100%.
      In some sense, it makes sense that the answer would be nonsense, because
      a clock gating scheme that is more than 100% effective is too effective: it is
      turning off the clock sometimes when it shouldn't!
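The formula and the edge cases above can be checked with a few lines of Python (the function name is ours):

```python
def effectiveness(pct_clk, pct_valid):
    """Eff = (1 - PctClk) / (1 - PctValid).

    Both arguments are fractions in [0, 1], with pct_valid < 1."""
    return (1 - pct_clk) / (1 - pct_valid)

print(effectiveness(0.6, 0.6))   # 1.0  (clock toggles only for valid data)
print(effectiveness(1.0, 0.6))   # 0.0  (clock always toggles)
print(effectiveness(0.5, 0.6))   # > 1  (nonsense: clock was off during valid data)
```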

We can see the effect of the effectiveness of a clock-gating scheme on the activity factor:
[Plot: new activity factor A′ versus effectiveness Eff. A′ decreases linearly
 from A at Eff = 0 to PctValid × A at Eff = 1]

When the effectiveness is zero, the new activity factor is the same as the original activity factor.
For a 100% effective clock gating scheme, the activity factor is A × PctValid . Between 0% and
100% effectiveness, the activity factor decreases linearly.
The new activity factor with a clock gating scheme is:

                                 A′ = A − (1 − PctValid ) × Eff × A


6.6.5 Example: Reduced Activity Factor with Clock Gating

   Question:     How much power will be saved in the following clock-gating scheme?
• 70% of the time the main circuit has valid data
• clock gating circuit is 90% effective (90% of the time that the circuit has invalid data, the clock
  is off)
• clock gating circuit has 10% of the area of the main circuit
• clock gating circuit has same activity factor as main circuit
• neglect short-circuiting and leakage power


   Answer:


        1. Set up main equations




             PwrMain   = power for main circuit without clock gating
             Pwr′Main  = power for main circuit with clock gating
             PwrClkFsm = power for clock enable state machine

                PwrTot = PwrMain + PwrClkFsm

                   Pwr = PwrSw + PwrLk + PwrShort
                 PwrSw = (1/2) × A × C × V²
                 PwrLk = negligible
              PwrShort = negligible
                   Pwr = (1/2) × A × C × V²

                PwrTot = (1/2) × AMain × CMain × V² + (1/2) × AClkFsm × CClkFsm × V²

                 AMain = A                CMain = C
               AClkFsm = A              CClkFsm = 0.1C
                A′Main = A′            A′ClkFsm = A

       Pwr′Tot / PwrTot = ((1/2) × A′ × C × V² + (1/2) × A × 0.1C × V²)
                          / ((1/2) × A × C × V²)

                        = (A′ + 0.1A) / A


      2. Find new activity factor for main circuit (A′ ):

                               A′ = (1 − Eff(1 − PctValid)) × A
                                  = (1 − 0.9(1 − 0.7)) × A
                                  = 0.73A

      3. Find ratio of new total power to previous total power:




                              Pwr′Tot / PwrTot = (A′ + 0.1A) / A
                                               = (0.73A + 0.1A) / A
                                               = 0.83

        4. Final answer: new power is 83% of original power
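Steps 1 through 4 collapse into a small Python function (names are ours; as in the problem statement, the clock-gating FSM is assumed to have the main circuit's activity factor, and its capacitance fraction is passed in as a parameter):

```python
def gated_power_ratio(pct_valid, eff, fsm_cap_fraction):
    """Return Pwr'Tot / PwrTot for a clock-gated circuit.

    pct_valid:        fraction of cycles with valid data in the main circuit
    eff:              effectiveness of the clock gating
    fsm_cap_fraction: FSM capacitance as a fraction of main-circuit capacitance
    """
    a_ratio = 1 - eff * (1 - pct_valid)   # A' / A, the new activity factor
    return a_ratio + fsm_cap_fraction     # (A' + fsm_cap_fraction * A) / A

print(gated_power_ratio(0.7, 0.9, 0.1))   # ~0.83: new power is 83% of original
```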


6.6.6 Clock Gating with Valid-Bit Protocol

A common technique to determine when a circuit has valid data is to use a valid-bit protocol. In
section 6.6.6.1 we review the valid-bit protocol and then in section 6.6.6.3 we add clock-gating
circuitry to a circuit that uses the valid-bit protocol.


6.6.6.1 Valid-Bit Protocol

Need a mechanism to tell circuit when to pay attention to data inputs — e.g. when is it supposed
to decode and execute an instruction, or write data to a memory array?

[Figure: module with inputs clk, i_valid, i_data and outputs o_valid, o_data]

[Waveform: parcels α, β, γ enter on i_data with i_valid high; a few clock
 cycles later they appear on o_data with o_valid high]

i_valid: high when i_data has valid data — signifies whether the circuit should pay attention
to or ignore the data.

o_valid: high when o_data has valid data — signifies whether the environment should pay
attention to the output of the circuit.

For more on circuit protocols, see section 2.12.


Microscopic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Which clock edges are needed?

[Waveform exercise: module with i_valid, o_valid, and clk; blank traces of clk,
 i_valid, and o_valid to mark which clock edges are needed]


6.6.6.2 How Many Clock Cycles for Module?

Given a module with latency Lat, if the module receives a stream of NumPcls consecutive valid
parcels, how many clock cycles must the clock-enable signal be asserted?
Latency   NumPcls   NumClkEn

[Waveform exercises: blank traces of i_valid, o_valid, and clk_en for several
 latency / parcel-count combinations]

 ti1      time of first i_valid
 to1      time of first o_valid
 tik      time of last i_valid
 tok      time of last o_valid
 tfirst   first clock cycle with clock enabled
 tlast    last clock cycle with clock enabled

Initial equations to describe relationships between different points in time:


                                           to1    =    ti1 + Lat
                                           tok    =    to1 + NumPcls − 1
                                           tfirst =    ti1 + 1
                                           tlast  =    tok + 1


To understand the −1 in the equation for tok , examine the situation when NumPcls = 1. With just
one parcel going through the system to1 = ti1 + Lat , so we have: tok = to1 + 1 − 1.

In the equation for tlast , we need the +1 to clear the last valid bit.

Solve for the length of time that the clock must be enabled. The +1 at the end of this equation is
because if tlast = tfirst , we would have the clock enabled for 1 clock cycle.




                                ClkEnLen =        tlast − tfirst + 1
                                         =        tok + 1 − (ti1 + 1) + 1
                                         =        tok − ti1 + 1
                                         =        to1 + NumPcls − 1 − ti1 + 1
                                         =        to1 + NumPcls − ti1
                                         =        ti1 + Lat + NumPcls − ti1
                                         =        Lat + NumPcls

We are left with the formula that the number of clock cycles that the module’s clock must be
enabled is the latency through the module plus the number of consecutive parcels.
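The derivation can be cross-checked numerically; this Python sketch (names ours) implements the four timing equations directly:

```python
def clk_en_len(lat, num_pcls, ti1=0):
    """Clock cycles the clock-enable must be asserted, computed from the
    timing equations: to1, tok, tfirst, tlast as defined above."""
    to1 = ti1 + lat
    tok = to1 + num_pcls - 1
    t_first = ti1 + 1
    t_last = tok + 1
    return t_last - t_first + 1

# matches Lat + NumPcls for any latency, parcel count, or start time
assert all(clk_en_len(lat, np, ti1) == lat + np
           for lat in range(1, 6) for np in range(1, 6) for ti1 in range(3))
```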


6.6.6.3 Adding Clock-Gating Circuitry

Before Clock Gating          ...................................................................
[Figure: module with inputs data_in, valid_in, clk and outputs data_out, valid_out]

[Waveform: parcels α, β, γ, then δ enter on data_in with valid_in high; they
 emerge on data_out with valid_out high after the module latency. Before the
 first parcel emerges, valid_out is don't care and data_out is uninitialized]

After Clock Gating: Circuitry             ........................................................ .
[Figure: the module's clock input is now cool_clk; a Clock Enable State Machine
 takes wakeup_in and hot_clk and produces clk_en and wakeup_out; clk_en gates
 hot_clk to form cool_clk]

• hot_clk: clock that always toggles


• cool_clk: gated clock — sometimes toggles, sometimes stays low
• wakeup: alerts circuit that valid data will be arriving soon
• clk_en: turns on cool_clk


After Clock Gating: New Signals   . . . .. . . . . .. . . . . .. . . . . .. . . . . . .. . . . . .. . . . . .. . . . . .. . . . . ...
[Waveform showing hot_clk, wakeup_in, valid_in, data_in (parcels α, β, γ, then δ),
 clk_en, cool_clk, valid_out, data_out, and wakeup_out]


6.6.7 Example: Pipelined Circuit with Clock-Gating

Design a “clock enable state machine” for the pipelined component described below.
• capacitance of pipelined component = 200
• latency varies from 5 to 10 clock cycles, even distribution of latencies
• contains a maximum of 6 instructions (parcels of data).
• 60% of incoming parcels are valid
• average length of continuous sequence of valid parcels is 80
• use input and output valid bits for wakeup
• leakage current is negligible
• short-circuit current is negligible
• LUTs have a capacitance of 1, flops have a capacitance of 2
The two factors affecting power are activity factor and capacitance.

  1. Scenario: turned off and get one parcel.
       (a) Need to turn on and stay on until parcel departs
       (b) idea #1 (parcel count):
           • count number of parcels inside module
           • keep clocks toggling if have non-zero parcels.
        (c) idea #2 (cycle count):
            • count the number of clock cycles since the last valid parcel entered the module
            • once we hit 10 clock cycles without any valid parcels entering, we know that all
               parcels have exited.
            • keep clocks toggling while the counter is less than 10

  2. Scenario: constant stream of parcels
       (a) parcel count would require looking at input and output stream and conditionally
           incrementing or decrementing counter
       (b) cycle count would keep resetting counter


Waveforms     .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . .. . . . . . .. . . . . .. . . . . ...
[Waveform: i_valid, o_valid, parcel_count, and parcel_clk_en over 24 numbered
 clock cycles]

[Waveform: i_valid, o_valid, cycle_count, and cycle_clk_en over the same 24
 cycles; cycle_count takes the values
 0 1 2 0 0 0 1 2 3 4 5 0 1 2 3 4 5 6 7 8 9 10]

Outline:

    1. sketch out circuitry for parcel count and cycle count state machine
    2. estimate capacitance of each state machine
    3. estimate activity factor of main circuit, based on behaviour


Parcel Count Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Need to count (0..6) parcels, therefore need 3 bits for counter.

Counter must be able to increment and decrement.

Equations for counter action (increment/decrement/no-change):


                                                i valid o valid   action
                                                   0       0    no change
                                                   0       1    decrement
                                                   1       0    increment
                                                   1       1    no change
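The action table can be expressed directly in code (a hypothetical Python sketch of the counter-control logic, not the course's VHDL):

```python
def counter_action(i_valid, o_valid):
    """Counter control from the table above: a parcel entering without one
    leaving increments; one leaving without one entering decrements;
    otherwise the count is unchanged."""
    if i_valid and not o_valid:
        return "increment"
    if o_valid and not i_valid:
        return "decrement"
    return "no change"
```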


6.7 Power Problems
P6.1 Short Answers

P6.1.1 Power and Temperature

As temperature increases, does the power consumed by a typical combinational circuit increase,
stay the same, or decrease?


P6.1.2 Leakage Power

The new vice president of your company has set up a contest for ideas to reduce leakage power in
the next generation of chips that the company fabricates. The prize for the person who submits
the suggestion that makes the best tradeoff between leakage power and other design goals is to
have a door installed on their cube. What is your door-winning idea, and what tradeoffs will your
idea require in order to achieve the reduction in leakage power?


P6.1.3 Clock Gating

In what situations could adding clock-gating to a circuit increase power consumption?


P6.1.4 Gray Coding

What are the tradeoffs in implementing a program counter for a microprocessor using Gray
coding?


P6.2 VLSI Gurus

The VLSI gurus at your company have come up with a way to decrease the average rise and fall
time (0-to-1 and 1-to-0 transitions) for signals. The current value is 1ns. With their fabrication
tweaks, they can decrease this to 0.85ns.


P6.2.1 Effect on Power

If you implement their suggestions, and make no other changes, what effect will this have on
power? (NOTE: Based on the information given, be as specific as possible.)


P6.2.2 Critique

A group of wannabe performance gurus claim that the above optimization can be used to improve
performance by at least 15%. Briefly outline what their plan probably is, critique the merits of
their plan, and describe any effect their performance optimization will have on power.


P6.3 Advertising Ratios

One day you are strolling the hallways in search of inspiration, when you bump into a person
from the marketing department. The marketing department has been out surfing the web and has
noticed that companies are advertising the MIPs/mm², MIPs/Watt, and Watts/cm³ of their
products. This wide variety of different metrics has confused them.

Explain whether each metric is a reasonable metric for customers to use when choosing a system.

If the metric is reasonable, say whether “bigger is better” (e.g. 500 MIPs/mm² is better than 20
MIPs/mm²) or “smaller is better” (e.g. 20 MIPs/mm² is better than 500 MIPs/mm²), and which
type of product (cell phone, desktop computer, or compute server) the metric is most relevant to.
• MIPs/mm²
• MIPs/Watt
• Watts/cm³


P6.4 Vary Supply Voltage

As the supply voltage is scaled down (reduced in value), the maximum clock speed that the circuit
can run at decreases.

The scaling down of supply voltage is a popular technique for minimizing power. The maximum
clock speed is related to the supply voltage by the following equation:


                        MaxClockSpeed ∝ (VoltSup − VoltThresh)² / VoltSup

where VoltSup is the supply voltage and VoltThresh is the threshold voltage.

With a supply voltage of 3V and a threshold voltage of 0.8V, the maximum clock speed is
measured to be 200MHz. What will the maximum clock speed be with a supply voltage of 1.5V?
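The measured 200MHz point calibrates the proportionality constant, after which the arithmetic is mechanical. A quick check in Python (the function name is illustrative):

```python
def max_clock_speed(v_sup, v_thresh, ref_v_sup, ref_v_thresh, ref_speed_mhz):
    """Scale a measured speed using MaxClockSpeed ∝ (VoltSup − VoltThresh)² / VoltSup."""
    ref = (ref_v_sup - ref_v_thresh) ** 2 / ref_v_sup
    new = (v_sup - v_thresh) ** 2 / v_sup
    return ref_speed_mhz * new / ref

# 200 MHz at 3 V supply, 0.8 V threshold; speed at 1.5 V supply?
speed = max_clock_speed(1.5, 0.8, 3.0, 0.8, 200.0)   # ≈ 40.5 MHz
```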


P6.5 Clock Speed Increase Without Power Increase

The following are given:
• You need to increase the clock speed of a chip by 10%
• You must not increase its dynamic power consumption
• The only design parameter you can change is supply voltage
• Assume that short-circuiting current is negligible


P6.5.1 Supply Voltage

How much do you need to decrease the supply voltage by to achieve this goal?


P6.5.2 Supply Voltage

What problems will you encounter if you continue to decrease the supply voltage?


P6.6 Power Reduction Strategies

In each low power approach described below identify which component(s) of the power equation
is (are) being minimized and/or maximized:


P6.6.1 Supply Voltage

Designers scaled down the supply voltage of their ASIC.


P6.6.2 Transistor Sizing

The transistors were made larger.


P6.6.3 Adding Registers to Inputs

All inputs to functional units are registered.


P6.6.4 Gray Coding

Gray coding of signals is used for address signals.


P6.7 Power Consumption on New Chip

While you are eating lunch at your regular table in the company cafeteria, a vice president sits
down and starts to talk about the difficulties with a new chip.

The chip is a slight modification of an existing design that has been ported to a new fabrication
process. Earlier that day, the first sample chips came back from fabrication. The good news is that
the chips appear to function correctly. The bad news is that they consume about 10% more power
than had been predicted.

The vice president explains that the extra power consumption is a very serious problem, because
power is the most important design metric for this chip.

The vice president asks you if you have any idea of what might cause the chips to consume more
power than predicted.


P6.7.1 Hypothesis

Hypothesize a likely cause for the surprisingly large power consumption, and justify why your
hypothesis is likely to be correct.


P6.7.2 Experiment

Briefly describe how to determine if your hypothesized cause is the real cause of the surprisingly
large power consumption.


P6.7.3 Reality

The vice president wants to get the chips out to market quickly and asks you if you have any ideas
for reducing their power without changing the design or fabrication process. Describe your ideas,
or explain why her request is infeasible.
Chapter 7

Fault Testing and Testability

7.1 Faults and Testing
7.1.1 Overview of Faults and Testing

7.1.1.1 Faults

During manufacturing, faults can occur that make the physical product behave incorrectly.

Definition: A fault is a manufacturing defect that causes a wire, poly, diffusion, or via to either
break or connect to something it shouldn’t.




                [Figure: good wires; shorted wires; an open wire.]



7.1.1.2 Causes of Faults
• Fabrication process (initial construction is bad)
  – chemical mix
  – impurities
  – dust
• Manufacturing process (damage during construction)
  – handling
    ∗ probing
    ∗ cutting
    ∗ mounting

444                                        CHAPTER 7. FAULT TESTING AND TESTABILITY


  – materials
    ∗ corrosion
    ∗ adhesion failure
    ∗ cracking
    ∗ peeling


7.1.1.3 Testing

Definition: Testing is the process of checking that the manufactured wafer/chip/board/system has
the same functionality as the simulations.


7.1.1.4 Burn In

Some chips that come off the manufacturing line will work for a short period of time and then fail.

Definition: Burn-in is the process of subjecting chips to extreme conditions (high and low temps,
high and low voltages, high and low clock speeds) before and during testing.

The purpose is to cause (and catch) failures in chips that would pass a normal test, but fail in early
use by customers.




[Figure: a wire that is soon to break.]


The hope is that the extreme conditions will cause chips to break that would otherwise have
broken in the customer's system soon after arrival.

The trick is to create conditions that are extreme enough that bad chips will break, but not so
extreme as to cause good chips to break.


7.1.1.5 Bin Sorting

Each chip (or wafer) is run at a variety of clock speeds. The chips are grouped and labeled
(binned) by the maximum clock frequency at which they will work reliably.

For example, chips coming off of the same production line might be labelled as 800MHz,
900MHz, and 1000MHz.

Overclocking is taking a chip rated at nMHz and running it at 1.x × nMHz. (Sure your computer
often crashes and loses your assignment, but just think how much more productive you are when it
is working...)


7.1.1.6 Testing Techniques
 Scan Testing or Boundary Scan Testing (BST, JTAG)
      • Load test vector from tester into chip
      • Run chip on test data
      • Unload result data from chip to tester
      • Compare results from chip against those produced by simulation
      • If results are different, then chip was not manufactured correctly
 Built In Self Test (BIST)
      • Build circuitry on chip that generates tests and compares actual and expected results
 IDDQ Testing
    • Measure the quiescent current between VDD and GND.
    • Variations from expected values indicate faults.


Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
The challenges in testing:
• test circuitry consumes chip area
• test circuitry reduces performance
• decrease fault escapee rate of product that ships while having minimal impact on production
  cost and chip performance
• external tester can only look at I/O pins
• ratio of internal signals to I/O pins is increasing
• some faults will only manifest themselves at high-clock frequencies

“The crux of testing is to use yesterday’s technology to find faults in tomorrow’s chips.” Agilent
engineer at ARVLSI 2001.


7.1.1.7 Design for Testability (DFT)

Scan testing and self-testing require adding extra circuitry to chips.

Design for test is the process of adding this circuitry in a disciplined and correct manner.

A hot area of research, that is becoming mainstream practice, is developing synthesis tools to
automatically add the testing circuitry.


7.1.2 Example Problem: Economics of Testing

Given information:
• The ACHIP costs $10 without any testing
• Each board uses one ACHIP (plus lots of other chips that we don’t care about)
• 68% of the manufactured ACHIPS do not have any faults
• For the ACHIP, it costs $1 per chip to catch half of the faults
• Each 50% reduction in fault escapees doubles cost of testing (intuition: doubles number of tests
  that are run)
• If board-level testing detects a bad ACHIP, it costs $200 to replace the ACHIP
• Board-level testing will detect 100% of the faults in an ACHIP


  Question:     What escapee fault rate will minimize cost of the ACHIP?


  Answer:



             TotCost = NoTestCost + TestCost + EscapeeProb × ReplaceCost


        NoTestCost      TestCost EscapeeProb              ReplaceCost           TotCost
               $10           $0      32%              (200 × 0.32 = $64)            $74
               $10           $1      16%              (200 × 0.16 = $32)            $43
               $10           $2       8%              (200 × 0.08 = $16)            $28
               $10           $4       4%              (200 × 0.04 = $8)             $22
               $10           $8       2%              (200 × 0.02 = $4)             $22
               $10          $16       1%              (200 × 0.01 = $2)             $28
               $10          $32     0.5%             (200 × 0.005 = $1)             $43


      The lowest total cost is $22. There are two options with a total cost of $22: $4 of
      testing and $8 of testing. Economically, we can choose either option.


For high-volume, small-area chips, testing can consume more than 50% of the total cost.
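The cost table can be regenerated from the given assumptions (a Python sketch; variable names are illustrative):

```python
def total_cost(test_cost, escapee_prob, no_test_cost=10, replace_cost=200):
    """TotCost = NoTestCost + TestCost + EscapeeProb × ReplaceCost."""
    return no_test_cost + test_cost + escapee_prob * replace_cost

# No testing leaves 32% escapees; $1 of testing catches half of them,
# and each doubling of test cost halves the escapee rate again.
rows = []
test_cost, escapees = 0, 0.32
for _ in range(7):
    rows.append((test_cost, escapees, total_cost(test_cost, escapees)))
    test_cost = 1 if test_cost == 0 else 2 * test_cost
    escapees /= 2

# The minimum total cost of $22 occurs at both $4 and $8 of testing.
```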


7.1.3 Physical Faults

7.1.3.1 Types of Physical Faults

[Figure: a good circuit (wires a to c and b to d) and bad circuits showing each fault type:
open; wired-AND bridging short; wired-OR bridging short; stronger-wins bridging short
(b is stronger); short to VDD; short to GND.]


7.1.3.2 Locations of Faults

Each segment of wire, poly, diffusion, via, etc. is a potential fault location.

Different segments affect different gates in the fanout.

A potential fault location is a segment or segments where a fault at any position affects the same
set of gates in the same way.




[Figure: three different locations for potential faults on signal b; valid groupings of wire
segments are marked OK, invalid groupings are marked BAD.]
448                                         CHAPTER 7. FAULT TESTING AND TESTABILITY


When working with faults, we work with wire segments, not signals. In the circuit below, there
are 8 different wire segments (L1–L8). Each wire segment corresponds to a logically distinct fault
location. All physical faults on a segment affect the same set of signals, so they are grouped
together into a “logical fault”. If a signal has a fanout of 1, then there is one wire segment. A
signal with a fanout of n, where n > 1, has at least n + 1 wire segments — one for the source
signal and one for each gate of fanout. As shown in section 7.1.3.3, the layout of the circuit can
have more than n + 1 segments.
[Figure: circuit with eight wire segments L1–L8; inputs a (L1), b (L2), c (L3), internal
segments L4–L7, and output z on segment L8.]




7.1.3.3 Layout Affects Locations

[Figure: schematic with inputs a–d and gates e–i, and two layouts of signal b: one with four
fault locations (L1–L4), the other with five (L1–L5).]

For the signal b in the schematic above, we can have either four or five different locations for
potential faults, depending upon how the circuit is laid out.


7.1.3.4 Naming Fault Locations

Two ways to name a fault location:
pin-fault model Faults are modelled as occurring on input and output pins of gates.
net-fault model Faults are modelled as occurring on segments of wires.

In E&CE 327, we’ll use the net-fault model, because it is simpler to work with and is closer to
what actually happens in hardware.


7.1.4 Detecting a Fault

To detect a fault, we compare the actual output of the circuit against the expected value.

To find a test vector that will detect a fault:


   1. build Boolean equation (or Karnaugh map) of correct circuit

   2. build Boolean equation (or Karnaugh map) of faulty circuit

   3. compare equations (or Karnaugh maps), regions of difference represent test vectors that
      will detect fault
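These steps can be sketched as an exhaustive truth-table comparison (hypothetical Python; the good and faulty functions are reconstructed from the truth table in the example that follows):

```python
from itertools import product

def find_test_vectors(good, faulty, n_inputs):
    """Steps 1–3: compare the good and faulty circuits over every input
    assignment; any assignment where they differ is a test vector."""
    return [bits for bits in product([0, 1], repeat=n_inputs)
            if good(*bits) != faulty(*bits)]

# Functions matching the truth table of the example in 7.1.4.1:
good = lambda a, b, c: (a and b) or c    # good circuit: e = ab + c
faulty = lambda a, b, c: c               # faulty circuit: e = c
vectors = find_test_vectors(good, faulty, 3)   # [(1, 1, 0)], i.e. 110
```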


7.1.4.1 Which Test Vectors will Detect a Fault?

   Question: For the good circuit and faulty circuit shown below, which test vectors will
     detect the fault?


[Figure: the good circuit and the faulty circuit, each with inputs a, b, c, internal signal d,
and output e.]


   Answer:

        a b c   good   faulty
        0 0 0     0       0
        0 0 1     1       1
        0 1 0     0       0
        0 1 1     1       1
        1 0 0     0       0
        1 0 1     1       1
        1 1 0     1       0      ←−
        1 1 1     1       1

       The only test vector that will detect the fault in the circuit is 110.

Sometimes multiple test vectors will catch the same fault.

Sometimes a single test vector can catch multiple faults.

[Figure: another fault in the same circuit (inputs a, b, c; internal signal d; output e).]

        a b c   good   faulty
        1 1 0     1       0      ←−
The test vector 110 can catch both this fault and the previous one.

With testing, we are primarily concerned with determining whether a circuit works correctly or
not — detecting whether there is a fault. If the circuit has a fault, we usually do not care where


the fault is — diagnosing the fault. To detect the two faults above, the test vector 110 is
sufficient, because if either of the two faults is present, 110 will detect that the circuit does not
work correctly.

          Note: Detect vs. diagnose      Testing detects faults. Testing does not diagnose
          which fault occurred.

If we have a higher-than-expected failure rate for a chip, we might want to investigate the cause of
the failures, and so would need to diagnose the faults. In this case, we might do more exhaustive
analysis to see which test vectors pass and which fail. We might also need to examine the chip
physically with probes to test a few individual wires or transistors. This is done by removing the
top layers of the chip and using very small and very sensitive probes, analogous to how we use a
multimeter to test a circuit on a breadboard.


7.1.5 Mathematical Models of Faults

Goal: develop reliable and predictable technique for detecting faults in circuits.

Observations:
• The possible faults in a circuit are dependent upon the physical layout of the circuit.
• A very wide variety of possible faults
• A single test vector can catch many different faults

Need: a mathematical model for faults that is abstracted from complexities of circuit layout and
plethora of possible faults, yet still detects most or all possible faults.


7.1.5.1 Single Stuck-At Fault Model

Although there are many different bad behaviours that faults can lead to, the simple model of
single-stuck-at-faults has proven very capable of finding real faults in real circuits.

Two simplifying assumptions:

   1. A maximum of one fault per tested circuit (hence “single”)
   2. All faults are either:
       (a) stuck-at 1: short to VDD
       (b) stuck-at 0: short to GND

      hence, “stuck at”


Example of Stuck-At Faults             ............................................................
     L1
 a                  L9
          L5   L8
     L2 L6
 b                  L10      L12
     L3                            i
 c
          L7
                    L11
     L4
 d

12 fault locations × 2 types of faults = 24 possible faults.

If we restrict ourselves to the single stuck-at fault model, then we have 24 faulty circuits to consider.

If multiple faults were allowed, then the circuit above could have up to 12 simultaneous faults. How
many faulty circuits would need to be considered?

Each of the 12 locations has three possible values: good, stuck-at-1, stuck-at-0. Therefore,
3^12 = 5.3 × 10^5 different circuits would need to be considered!

If multiple faults of 4 different types were allowed at the 12 different locations, then there would be
5^12 − 1 = 2.4 × 10^8 different faulty circuits to consider!

There are 2^(2^4) = 6.6 × 10^4 different Boolean functions of four inputs (a Karnaugh map of four
variables is a grid of 2^4 = 16 squares; each square is either 0 or 1, which gives 2^16 different
combinations). So there are only 6.6 × 10^4 possible equations for circuits with four inputs and one
output. This is much less than the number of faulty circuit models that would be generated by the
simultaneous-faults-at-every-location models. So both of the simultaneous-faults-at-every-location
models are too extreme.
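The counting arguments above, checked in Python:

```python
# Counting candidate faulty-circuit models for the 12-location circuit.
locations = 12

single_stuck_at = 2 * locations        # one fault: stuck-at-0 or stuck-at-1 at one location
multi_stuck_at = 3 ** locations        # each location: good, stuck-at-0, or stuck-at-1
multi_four_types = 5 ** locations - 1  # four fault types + good per location, minus all-good
boolean_functions = 2 ** (2 ** 4)      # distinct Boolean functions of four inputs
```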


7.1.6 Generate Test Vector to Find a Mathematical Fault

Faults are detected by stimulating circuits (real, manufactured circuit, not a simulation!) with
test-vectors and checking that the real circuit gives the correct output.

Standard practice in testing is to test circuits for single stuck-at faults. Mathematics and empirical
evidence demonstrate that if a circuit appears to be free of single stuck-at faults, then it is probably
also free of other types of faults. That is, testing a circuit for single stuck-at faults will also detect
many other types of faults and will often detect multiple faults.


7.1.6.1 Algorithm
     1. compute Karnaugh map for correct circuit
     2. compute Karnaugh map for faulty circuit
     3. find region of disagreement


   4. any assignment in region of disagreement is a test vector that will detect fault
   5. any assignment outside of region of disagreement will result in same output on both correct
      and faulty circuit


7.1.6.2 Example of Finding a Test Vector

[Figure: the good circuit and the faulty circuit; a Karnaugh map for each; and the difference
between the good and faulty circuits, whose region gives the test vectors.]


7.1.7 Undetectable Faults

Not all faults are detectable.


   1. If a circuit is irredundant then all single stuck-at faults can be detected.

            A redundant circuit is one where one or more gates can be removed without
            affecting the functional behaviour.

   2. If not trying to find all of the faults in a circuit, then a fault that you aren’t looking for can
      mask a fault that you are looking for.


7.1.7.1 Redundant Circuitry

Some faults are undetectable. Undetectable stuck-at faults are located in redundant parts of a
circuit.


Timing Hazards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Timing hazards are often removed by adding redundant circuitry.

[Figure: waveforms for a static hazard and a dynamic hazard.]

Redundant Circuitry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Figure: an irredundant circuit computing g = ab + b̄c via AND gates e and f, and waveforms
for signals a through g illustrating the timing hazard on g.]
The glitch on g is caused because the AND gate for e turns off before the AND gate for f turns on.


         Question: Add one or more gates to the circuit so that the static hazard is guaranteed
           to be prevented, independent of the delay values through the gates

In this sum-of-products style circuit, each AND gate corresponds to a cube in the Karnaugh map.
[Karnaugh map for g, showing the transition from 111 to 101.]


We can prevent this transition from causing a glitch by adding a cube that covers the two squares
of the transition from 111 to 101. This cube is 1-1, which is the black cube in the Karnaugh map
below and the signal h in the redundant circuit below.
[Karnaugh maps: the original cover, and the cover with the added cube (in black) over a=1, c=1.]



[Figure: the redundant circuit, with an added AND gate producing h on segment L1, and
waveforms for a through h showing that there are no more timing hazards.]


    Question: Has the redundant circuitry introduced any undetectable faults? If so,
      identify an undetectable fault.


                                        L1@0 is undetectable.

 Correct (redundant) circuit: ab + b̄c + ac, which by the consensus theorem equals ab + b̄c.
 Faulty circuit: with L1@0, ac −→ 0, giving ab + b̄c + 0 = ab + b̄c.
 The faulty circuit has the same equation as the correct circuit, so no test vector can detect
 the fault.
A stuck-at fault in redundant circuitry will not affect the steady state behaviour of the circuit, but
could allow timing glitches to occur.
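The steady-state equivalence here is an instance of the consensus theorem: with an explicit complement on b in the second product term, ab + b̄c + ac = ab + b̄c, so the added cube ac is redundant. A quick exhaustive check (Python sketch):

```python
from itertools import product

# Check the consensus theorem over all 8 input assignments:
#   ab + b'c + ac  ==  ab + b'c   (b' is the complement of b)
for a, b, c in product([0, 1], repeat=3):
    with_redundant_cube = (a & b) | ((1 - b) & c) | (a & c)
    without_cube = (a & b) | ((1 - b) & c)
    assert with_redundant_cube == without_cube
```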


7.1.7.2 Curious Circuitry and Fault Detection

The two circuits below have the same steady-state behaviour.

[Figure: two circuits with the same steady-state behaviour; the left circuit computes z from
a, b, and c through XOR gates, with internal segments L1, L2, and L3.]

Because the two circuits have the same behaviour, it might appear that the leftmost two XOR gates
are redundant. However, these gates are not redundant. In the test for redundancy, when we
remove a gate, we delete it; we do not replace it with other circuitry.

Curiously, the stuck-at fault at L1 is undetectable, but faults at either L2 or L3 are detectable.

    fault    eqn            K-map      diff w/ ckt

 L2@0   a ⊕ (b ⊕ c)    [Karnaugh maps omitted]

 L2@1   a ⊕ (b ⊕ c)    [Karnaugh maps omitted]


7.2 Test Generation
7.2.1 A Small Example

Throughout this section we will use the circuit below:

[Figure: circuit computing z = ab + bc; input b is segment L2, which fans out to segment L4
(feeding the AND gate with a) and segment L5 (feeding the AND gate with c).]


At first, we will consider only the following faults: L2@1, L4@1, L5@1.

      fault      eqn       test vectors
  1) L2@1     a + c      101, 001, 100
  2) L4@1     a + bc     101, 100
  3) L5@1     ab + c     101, 001

(The Karnaugh map of each faulty circuit, and its difference with the correct circuit, are omitted.)


Choose Test Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[Karnaugh map showing the regions detected by each test vector.]
If we choose 101, we can detect all three faults. Choosing either 001 or 100 will miss one of
the three faults.


7.2.2 Choosing Test Vectors

The goal of test vector generation is to find the smallest set of test vectors that will detect the
faults of interest.

Test vector generation requires analyzing the faults.

We can simplify the task of fault analysis by reducing the number of faults that we have to
analyze.

Smith has examples of this in Figures 14.13 and 14.14.


7.2.2.1 Fault Domination

      fault    eqn (faulty z)    test vectors
 1) L5@1    ab + c     101, 001
 2) L6@1    1          101, 001, 100, 010, 000
Any test vector that detects L5@1 will also detect L6@1: L5@1 is detected by 101 and 001, each
of which also detects L6@1. L6@1 does not dominate L5@1, because there is at least one test
vector that detects L6@1 but does not detect L5@1 (e.g. each of 100, 010, and 000 detects L6@1
but not L5@1).


  Definition dominates: f1 dominates f2 : any test vector that detects f1 will also detect
    f2 .


When choosing test vectors, we can ignore the dominated fault, but must keep the dominant fault.

L5@1 dominates L6@1.

When choosing test vectors we can ignore L6@1 and just include L5@1.


  Question:     To detect both L5@1 and L6@1, can we ignore one of the faults?


  Answer:
     We can ignore L6@1, because L5@1 dominates L6@1: each test vector
    that detects L5@1 also detects L6@1.


  Question:     What would happen if we ignored the “wrong” fault?


  Answer:
     If we ignore L5@1, but keep L6@1, we can choose any of 5 test vectors that
    detect L6@1. If we chose 100, 010, or 000 as our test vector to detect L6@1,
    then we would not detect L5@1.
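Domination is just a subset test on detect sets. A sketch, reusing the brute-force simulator idea from the previous subsection (repeated here so the fragment is self-contained; segment labels as in this section):

```python
from itertools import product

def z(a, b, c, fault=None):
    # z = ab + bc with an optional single stuck-at fault (segment, value)
    seg, sv = fault if fault else (None, None)
    f = lambda name, val: sv if name == seg else val
    l2 = f("L2", b)
    l6 = f("L6", f("L1", a) & f("L4", l2))
    l7 = f("L7", f("L5", l2) & f("L3", c))
    return f("L8", l6 | l7)

def detects(fault):
    return {v for v in product((0, 1), repeat=3) if z(*v) != z(*v, fault)}

def dominates(f1, f2):
    """f1 dominates f2: every test vector that detects f1 also detects f2."""
    return detects(f1) <= detects(f2)
```

Here dominates(("L5", 1), ("L6", 1)) is True while the reverse is False, so L6@1 may be ignored.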


7.2.2.2 Fault Equivalence

      fault    eqn (faulty z)
 1) L1@1    b
 2) L3@1    b
The two faults above are “equivalent”.

   Definition fault equivalence: f1 is equivalent to f2 : f1 and f2 are detected by exactly
     the same set of test vectors. That is, all of the test vectors that detect f1 will also
     detect f2 , and vice versa.

When choosing test vectors we can ignore one of the faults and just include the other.
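Equivalence is the corresponding equality test on detect sets. A sketch, with the same hypothetical simulator repeated for self-containment:

```python
from itertools import product

def z(a, b, c, fault=None):
    # z = ab + bc with an optional single stuck-at fault (segment, value)
    seg, sv = fault if fault else (None, None)
    f = lambda name, val: sv if name == seg else val
    l2 = f("L2", b)
    l6 = f("L6", f("L1", a) & f("L4", l2))
    l7 = f("L7", f("L5", l2) & f("L3", c))
    return f("L8", l6 | l7)

def detects(fault):
    return {v for v in product((0, 1), repeat=3) if z(*v) != z(*v, fault)}

def equivalent(f1, f2):
    """f1 and f2 are detected by exactly the same set of test vectors."""
    return detects(f1) == detects(f2)
```

For this circuit, equivalent(("L1", 1), ("L3", 1)) is True: both faults reduce z to the equation b, so both are detected only by 010.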


7.2.2.3 Gate Collapsing

A controlling value on an input to a gate forces the output to the controlled value. If a stuck-at
fault on an input causes that input to have a controlling value, then that fault is equivalent to a
stuck-at fault of the controlled value on the output.
For example, a 1 on an input to an OR gate forces the output to be 1. So, a stuck-at-1 fault on
any input to an OR gate is equivalent to a stuck-at-1 fault on the output of the gate, and is
equivalent to a stuck-at-1 fault on any other input to the OR gate.


   Definition Gate collapsing: The technique of looking at the functionality of a gate and
     finding equivalent faults between inputs and outputs.

          Sets of collapsible faults for common gates
 AND:  stuck-at-0 on any input and stuck-at-0 on the output collapse together
 OR:   stuck-at-1 on any input and stuck-at-1 on the output collapse together


  Question:     What is the set of collapsible faults for a NAND gate?


   Answer:
      To determine the collapsible faults, treat the NAND gate as an AND gate
     followed by an inverter, then invert the faults on the output of the gate.
     For a NAND gate, stuck-at-0 on any input collapses with stuck-at-1 on the
     output.



7.2.2.4 Node Collapsing

         Note:      Node collapsing is relevant only for the pin-fault model

When two segments affect the same set of gates (ignoring any gates between the two segments),
then faults on the two segments can be collapsed.

With an inverter or buffer, the segment on the input affects the same gates as the segment on the
output. Therefore, faults on the input and output segments can be collapsed (with the fault value
inverted across an inverter).

      Sets of collapsible faults for nodes
 NOT:     @1 on the input collapses with @0 on the output
          @0 on the input collapses with @1 on the output

With the net-fault model, which is the one we are using in E&CE 327, inverters and buffers are
the only gates where node collapsing is relevant.

With the pin-fault model, where faults are modelled as occurring on the pins of gates, there are
other instances where node collapsing can be used.


7.2.2.5 Fault Collapsing Summary

When calculating the test-vectors to detect a set of faults, apply the fault collapsing techniques of:
• gate collapsing
• node collapsing (if using pin-fault model)
• general fault equivalence (intelligent collapsing)
• fault domination
to reduce the number of faults that you must examine.

Fault collapsing is an optimization. If you skip this step, you will still get the correct answer; it
will just take more work, because in each step you will analyze a greater number of faults than if
you had done fault collapsing.


7.2.3 Fault Coverage

Definition Fault coverage: percentage of detectable faults that are detected by a set of test vectors.


                               FaultCoverage = DetectedFaults / DetectableFaults

Some people’s definition of fault coverage has a denominator of AllPossibleFaults, not just those
that are detectable.

If the denominator is AllPossibleFaults, then, if a circuit has 100% single stuck-at fault coverage
with a suite of test vectors, then each stuck-at fault in the circuit can be detected by one or more
vectors in the suite. This also means that the circuit has no undetectable faults, and hence, no
redundant circuitry.

Even if the denominator is AllPossibleFaults, it is possible that achieving 100% coverage for
single stuck-at faults will allow defective chips to pass if they have faults that are not stuck-at-1
or stuck-at-0.

I think, but haven’t seen a proof, that achieving 100% single stuck-at coverage will detect all
combinations of multiple stuck-at faults. But, if you do not achieve 100% coverage, then a
stuck-at fault that you aren’t testing for can mask (hide) a fault that you are testing for.

NOTE: In Smith’s book, undetectable faults don’t hurt your coverage. This is not universally true.
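For the circuit of this section, fault coverage can be computed by brute force. A sketch with the same hypothetical simulator; the four-vector suite used here is the one derived later in section 7.2.5:

```python
from itertools import product

def z(a, b, c, fault=None):
    # z = ab + bc with an optional single stuck-at fault (segment, value)
    seg, sv = fault if fault else (None, None)
    f = lambda name, val: sv if name == seg else val
    l2 = f("L2", b)
    l6 = f("L6", f("L1", a) & f("L4", l2))
    l7 = f("L7", f("L5", l2) & f("L3", c))
    return f("L8", l6 | l7)

def detects(fault):
    return {v for v in product((0, 1), repeat=3) if z(*v) != z(*v, fault)}

suite = [(0, 1, 0), (1, 1, 0), (0, 1, 1), (1, 0, 1)]      # 010, 110, 011, 101
faults = [("L%d" % i, sv) for i in range(1, 9) for sv in (0, 1)]
detectable = [fl for fl in faults if detects(fl)]          # non-empty detect set
detected = [fl for fl in detectable
            if any(v in detects(fl) for v in suite)]
coverage = len(detected) / len(detectable)
```

In this circuit all 16 faults are detectable, so the two definitions of the denominator agree, and the four-vector suite achieves coverage 1.0.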


7.2.4 Test Vector Generation and Fault Detection

There are two ways to generate vectors and check results: built-in tests and scan testing.

Both require:
• generating test vectors
• overriding the normal datapath to send test vectors, rather than normal inputs, to the flops
• comparing the outputs of the flops to the expected result


7.2.5 Generate Test Vectors for 100% Coverage

In this section we will find the test vectors to achieve 100% coverage of single stuck-at faults for
the circuit of the day.

We will use a simple algorithm; there are much more sophisticated algorithms that are more
efficient.


The problem of test vector generation is often called Automatic Test Pattern Generation (ATPG)
and continues to be an active area of research.
A trendy idea is to use Genetic Algorithms (inspired by how DNA works) to generate test vectors
that catch the maximum number of faults.
The “classic” algorithm is the D algorithm invented by Roth in 1966 (Smith 14.5.1, 14.5.2).

An enhanced version is the Path-Oriented Decision Making (PODEM) algorithm, which supports
reconvergent fanout and was developed by Goel in 1981 (Smith 14.5.3).

[Figure 7.1: Example Circuit with Fault Locations and Karnaugh Map. The circuit
computes z = ab + bc. Segment labels: L1 = a, L2 = b, L3 = c; L4 and L5 are the
fanout branches of b; L6 is the output of the ab AND gate, L7 the output of the
bc AND gate, and L8 = z, the output of the OR gate.]


7.2.5.1 Collapse the Faults

Initial circuit with potential faults: every segment L1–L8 can be stuck at 0 or
stuck at 1, giving 16 potential faults (L1@0,1 through L8@0,1).

Gate collapsing yields three sets of equivalent faults:
• L1@0, L4@0, L6@0 (stuck-at-0 on the inputs and output of the ab AND gate)
• L3@0, L5@0, L7@0 (stuck-at-0 on the inputs and output of the bc AND gate)
• L6@1, L7@1, L8@1 (stuck-at-1 on the inputs and output of the OR gate)


Node Collapsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Node collapsing: none applicable (no inverters or buffers).

Remaining faults: L1@1, L2@0, L2@1, L3@1, L4@1, L5@1, L6@0, L7@0, L8@0, L8@1.




Intelligent Collapsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Sometimes, after the regular forms of fault collapsing have been done, there will still be some sets
of equivalent faults in the circuit. It is usually beneficial to quickly look for patterns or
symmetries in the circuit that will indicate a set of potentially equivalent faults.

Equivalent fault pairs found by inspection:
• L2@0 and L8@0: both result in the equation 0.
• L1@1 and L3@1: both result in the equation b.

Remaining faults: L2@1, L3@1, L4@1, L5@1, L6@0, L7@0, L8@0, L8@1.


7.2.5.2 Check for Fault Domination

      fault    eqn (faulty z)    notes
 1) L2@1    a + c      dominated by L4@1, L5@1
 2) L3@1    b
 3) L4@1    a + bc
 4) L5@1    ab + c
 5) L6@0    bc
 6) L7@0    ab
 7) L8@0    0          dominated by L6@0, L7@0
 8) L8@1    1          dominated by L2@1, L3@1, L4@1, L5@1


Remove dominated faults                 ..............................................................
Dominated faults (removed): L2@1, L8@0, L8@1.
Current faults: L3@1, L4@1, L5@1, L6@0, L7@0.
      fault    eqn (faulty z)
 1) L3@1    b
 2) L4@1    a + bc
 3) L5@1    ab + c
 4) L6@0    bc
 5) L7@0    ab


7.2.5.3 Required Test Vectors

 If we have any faults that are detected by just one test vector, then we
 must include that test vector in our suite.

     Definition required test vector: A test vector tv is required
       if there is a fault for which tv is the only test vector that
       will detect the fault.

 Required vectors:   L3@1: 010,   L6@0: 110,   L7@0: 011.
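The required vectors can be found mechanically: a vector is required when some remaining fault has a singleton detect set. A sketch, using the same hypothetical simulator and the faults left after collapsing and domination:

```python
from itertools import product

def z(a, b, c, fault=None):
    # z = ab + bc with an optional single stuck-at fault (segment, value)
    seg, sv = fault if fault else (None, None)
    f = lambda name, val: sv if name == seg else val
    l2 = f("L2", b)
    l6 = f("L6", f("L1", a) & f("L4", l2))
    l7 = f("L7", f("L5", l2) & f("L3", c))
    return f("L8", l6 | l7)

def detects(fault):
    return {v for v in product((0, 1), repeat=3) if z(*v) != z(*v, fault)}

# Faults remaining after collapsing and domination (section 7.2.5.2):
remaining = [("L3", 1), ("L4", 1), ("L5", 1), ("L6", 0), ("L7", 0)]

# A vector is required if it is the only one that detects some fault.
required = {v for fl in remaining if len(detects(fl)) == 1
            for v in detects(fl)}
```

Here required comes out to {(0, 1, 0), (1, 1, 0), (0, 1, 1)}, i.e. 010, 110, and 011, matching the table above; L4@1 and L5@1 each have two detecting vectors and so force no required vector.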



7.2.5.4 Faults Not Covered by Required Test Vectors

      fault    eqn (faulty z)
 1) L4@1    a + bc
 2) L5@1    ab + c

The intersection of the two difference regions is 101. Choosing 101 detects
both L4@1 and L5@1. Add 101 to the suite of test vectors.

Final set of test vectors: 010, 110, 011, 101.


7.2.5.5 Order to Run Test Vectors

The order in which the test vectors are run is important because it can affect how long a faulty
chip stays in the tester before the chip’s fault is detected.

The first vector to run should be the one that detects the most faults.

Build a table for which faults each test vector will detect.
        fault    110   010   011   101
  1)    L1@0      1
  2)    L1@1            1
  3)    L2@0      1            1
  4)    L2@1                         1
  5)    L3@0                   1
  6)    L3@1            1
  7)    L4@0      1
  8)    L4@1                         1
  9)    L5@0                   1
 10)    L5@1                         1
 11)    L6@0      1
 12)    L6@1            1            1
 13)    L7@0                   1
 14)    L7@1            1            1
 15)    L8@0      1            1
 16)    L8@1            1            1
        Faults detected:  110: 5,  010: 5,  011: 5,  101: 6

101 detects the most faults, so we should run it first.


This reduces the number of faults found by 010 from 5 to 2 (because L6@1, L7@1, and L8@1 will
be found by 101).

This leaves 110 and 011 with 5 faults each; we can run them in either order, then run 010.

We settle on a final order for our test suite of: 101, 011, 110, 010.
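The ordering step is a greedy loop: repeatedly run the vector that detects the most faults not yet detected. A sketch with the same hypothetical simulator; ties here resolve in list order, so this version puts 110 before 011, which the notes allow:

```python
from itertools import product

def z(a, b, c, fault=None):
    # z = ab + bc with an optional single stuck-at fault (segment, value)
    seg, sv = fault if fault else (None, None)
    f = lambda name, val: sv if name == seg else val
    l2 = f("L2", b)
    l6 = f("L6", f("L1", a) & f("L4", l2))
    l7 = f("L7", f("L5", l2) & f("L3", c))
    return f("L8", l6 | l7)

def detects(fault):
    return {v for v in product((0, 1), repeat=3) if z(*v) != z(*v, fault)}

vectors = [(1, 1, 0), (0, 1, 0), (0, 1, 1), (1, 0, 1)]    # 110, 010, 011, 101
faults = [("L%d" % i, sv) for i in range(1, 9) for sv in (0, 1)]
undetected = {fl for fl in faults if detects(fl)}
order = []
while undetected and vectors:
    # Greedy choice: the vector that catches the most still-undetected faults.
    best = max(vectors, key=lambda v: sum(1 for fl in undetected
                                          if v in detects(fl)))
    order.append(best)
    vectors.remove(best)
    undetected = {fl for fl in undetected if best not in detects(fl)}
```

This picks 101 first (6 new faults) and 010 last (only L1@1 and L3@1 remain for it), with 110 and 011 in between, consistent with the analysis above.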


7.2.5.6 Summary of Technique to Find and Order Test Vectors

   1. identify all possible faults

   2. gate collapsing

   3. node collapsing

   4. intelligent collapsing

   5. fault domination

   6. determine required test vectors

   7. choose minimal set of test vectors to detect remaining faults

   8. order test vectors based on number of faults detected (NOTE: when iterating through this
      step, need to take into account faults detected by earlier test vectors)


7.2.5.7 Complete Analysis

In case you don’t trust the fault collapsing analysis, here’s the complete analysis.

       fault    eqn (faulty z)    notes
  1)   L1@0    bc
  2)   L1@1    b
  3)   L2@0    0          dominated by 1, 5
  4)   L2@1    a + c      dominated by 8, 10
  5)   L3@0    ab
  6)   L3@1    b          same as 2
  7)   L4@0    bc         same as 1
  8)   L4@1    a + bc
  9)   L5@0    ab         same as 5
 10)   L5@1    ab + c
 11)   L6@0    bc         same as 1
 12)   L6@1    1          dominated by 8, 10
 13)   L7@0    ab         same as 5
 14)   L7@1    1          same as 12
 15)   L8@0    0          same as 3
 16)   L8@1    1          same as 12


7.2.6 One Fault Hiding Another

[Circuit of Figure 7.1: z = ab + bc, with segments L1 through L8.]


Assume that we are not trying to detect all faults — L1 is viewed as not being at risk for faults,
but L3 is at risk for faults.
[Two copies of the circuit, highlighting the fault sites L1 and L3.]

Problem: If L1 is stuck-at 1, the test vectors that normally detect L3@0 will not detect L3@0.

In the presence of other faults, the set of test vectors to detect a fault will change.

 fault(s)         eqn (faulty z)
 L3@0             ab
 L1@1, L3@0       b
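The masking effect can be checked by simulating multiple simultaneous faults. A sketch, using a variant of the earlier hypothetical simulator that accepts a set of stuck-at faults:

```python
from itertools import product

def z(a, b, c, faults=()):
    """z = ab + bc with zero or more simultaneous stuck-at faults."""
    fd = dict(faults)
    f = lambda name, val: fd.get(name, val)
    l2 = f("L2", b)
    l6 = f("L6", f("L1", a) & f("L4", l2))
    l7 = f("L7", f("L5", l2) & f("L3", c))
    return f("L8", l6 | l7)

# 011 is the only vector that detects L3@0 in an otherwise good circuit:
single = {v for v in product((0, 1), repeat=3)
          if z(*v) != z(*v, faults=[("L3", 0)])}

# With L1@1 also present, the faulty circuit computes z = b, and the
# detecting vector changes: 011 no longer exposes L3@0.
masked = {v for v in product((0, 1), repeat=3)
          if z(*v) != z(*v, faults=[("L1", 1), ("L3", 0)])}
```

Here single is {(0, 1, 1)} but masked is {(0, 1, 0)}: with L1@1 present, only 010 reveals that anything is wrong, which is exactly the change in detect set shown in the table above.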


7.3 Scan Testing in General
Scan testing is based on the techniques described in section 7.2.5. The generation of test vectors
and the checking of the result are done off-chip. In comparison, built-in self test (section 7.5)
does test-vector generation and result checking on chip. Scan testing has the advantage of
flexibility and reduced on-chip hardware, but increases the length of time required to run a test. In
scan testing, we want to individually drive and read every flop in the circuit.

Chips are typically I/O bound even before any pins are dedicated to testing, so scan testing must
be very frugal in its use of pins. Flops are connected together in “scan chains” with one input pin
and one output pin.


7.3.1 Structure and Behaviour of Scan Testing

[Figure: Normal Circuit. data_in(3..0) from another circuit pass through a
column of flops into the circuit under test; its outputs pass through a second
column of flops as zeta_in(3..0) to the next circuit.]

[Figure: Circuit with Scan Chains Added. The input flops are chained into scan
chain 0, controlled by mode0 with ports scan_in0 and scan_out0; the output
flops are chained into scan chain 1, controlled by mode1 with ports scan_in1
and scan_out1.]


7.3.2 Scan Chains

7.3.2.1 Circuitry in Normal and Scan Mode

[Figure: Normal Mode. With mode0 and mode1 deselected, the flops pass
data_in(3..0) into the circuit under test and its results out as zeta_in(3..0),
exactly as in the original circuit; the scan_in and scan_out ports are unused.]

[Figure: Scan Mode. With mode0 and mode1 selected, the flops of each chain are
connected head to tail as shift registers, from scan_in0 to scan_out0 and from
scan_in1 to scan_out1, bypassing the circuit under test.]


7.3.2.2 Scan in Operation

[Figure: circuit under test with scan chains 0 and 1, showing the sequence of
load; test; unload.]

 Load Test Vector        Run Test Vector        Unload Result
 (1 cycle per bit)       Through Circuit        (1 cycle per bit)


Unload and Load at the Same Time                ......................................................

Because each chain is a shift register, unloading the previous result and
loading the next test vector happen in the same cycles:

  Unload Prev Result        Run Cur Test Vector        Unload Cur Result
 Load Cur Test Vector        Through Circuit          Load New Test Vector
   (1 cycle per bit)                                    (1 cycle per bit)

[Waveform: while mode0 and mode1 select scan mode, scan_out0 and scan_out1
shift out the previous results in the same cycles that scan_in0 and scan_in1
shift in the current test vectors; after the vectors run through the circuit,
the current results shift out while the next vectors shift in.]
470                                             CHAPTER 7. FAULT TESTING AND TESTABILITY


7.3.2.3 Scan in Operation with Example Circuit

[Figure: Circuit under test (inputs a, b, c, d; outputs y, z), shown plain and again with scan test circuitry — a mux + flop on each signal, chained between mode0/scan_in0/scan_out0 and mode1/scan_in1/scan_out1]


[Figures: Loading the test vector one bit per cycle while mode0 selects scan mode —
  Start Loading Test Vector (Load δ): δ enters the first flop of the chain;
  Load γ: γ enters as δ shifts one flop down the chain;
  Load β: β enters as γ and δ shift down;
  Load α: α enters; the chain now holds α, β, γ, δ]
[Figure: Run Test Vector — mode switches to normal for one clock cycle; the flops hold α, β, γ, δ and drive the circuit under test]

[Figure: Test Values Propagate — the combinational logic computes functions of α, β, γ, δ; product terms of the test values combine into the sum-of-products outputs y and z]
[Figure: Flop-In Result, Start (Un)loading Test Vector — the outputs y and z are flopped into the second scan chain; mode returns to scan, and δ’ (the first bit of the next test vector) begins loading as the results begin shifting out on scan_out1]

[Figure: Continue (Un)loading Test Vector — γ’ loads behind δ’ while the result bits continue shifting toward scan_out1]
[Figure: Finish (Un)loading Test Vector — β’ loads as the last result bit leaves on scan_out1]

[Figure: Run Next Test Vector — the chain holds α’, β’, γ’, δ’; mode switches to normal and the next test runs]




7.3.3 Summary of Scan Testing
   • Adding scan circuitry

        1. Group the registers around the circuit to be tested into scan chains
        2. Replace each flop with a mux + flop
        3. Wire the flops and muxes together into scan chains
        4. Connect each scan chain to dedicated I/O pins for loading and unloading test
           vectors


   • Running test vectors

        1. Put scan chain in “scan” mode
        2. Load in test vector (one element of vector per clock cycle)
        3. Put scan chain in “normal” mode
        4. Run circuit for one clock cycle — load result of test into flops
        5. Unload results of current test vector while simultaneously loading in next test vector
           (one element of vector per clock cycle)
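The procedure above can be sketched as a small Python behavioral model (not from the notes: the 3-flop chain, the `logic` function, and all names here are invented for illustration):

```python
# Behavioral sketch of one scan chain of n flops (hypothetical example logic).
def scan_cycle(flops, mode, scan_in, logic):
    """One clock edge. mode='scan' shifts the chain; mode='normal' captures
    the outputs of the combinational circuit under test."""
    scan_out = flops[-1]                 # scan_out always shows the last flop
    if mode == "scan":
        flops = [scan_in] + flops[:-1]   # shift the chain by one flop
    else:
        flops = logic(flops)             # capture circuit outputs into the flops
    return flops, scan_out

def run_test(vector, logic):
    """Load n bits, run 1 cycle, unload n bits: 2n + 1 clock cycles total."""
    n = len(vector)
    flops = [0] * n
    for bit in reversed(vector):         # load, 1 cycle per bit
        flops, _ = scan_cycle(flops, "scan", bit, logic)
    flops, _ = scan_cycle(flops, "normal", 0, logic)   # run the test vector
    result = []
    for _ in range(n):                   # unload, 1 cycle per bit
        flops, out = scan_cycle(flops, "scan", 0, logic)
        result.append(out)
    return result

# Example combinational circuit under test (invented for illustration):
logic = lambda f: [f[0] & f[1], f[1] | f[2], f[0] ^ f[2]]
```

For instance, `run_test([1, 1, 0], logic)` spends 3 cycles loading, 1 cycle running, and 3 cycles unloading, and returns the captured flop values last-flop-first.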


7.3.4 Time to Test a Chip

If the length (number of flops) of a scan chain is n, then it takes 2n + 1 clock cycles to run a single
test: n clock cycles to scan in the test vector, 1 clock cycle to execute the test vector, and n cycles
to scan out the results. Once the results are scanned out, they can be compared to the expected
results for a correctly working circuit.

If we run 2 or more tests (and chips generally are subjected to hundreds of thousands of tests),
then we speed things up by scanning in the next test vector while we scan out the previous result.

 ScanLength       =   number of flip flops in a scan chain
 NumVectors       =   number of test vectors in test suite
 TimeScan         =   number of clock cycles to run test suite
                  =   NumVectors × (ScanLength + 1) + ScanLength


7.3.4.1 Example: Time to Test a Chip

An 800 MHz chip has scan chains of length 20,000 bits, 18,000 bits, 21,000 bits, 22,000 bits, and
two of 15,000 bits.

500,000 test vectors are used for each scan chain.

The tests are run at 80% of full speed.


   Question:     Calculate the total test time.


   Answer:


      We can load and unload all of the scan chains at the same time, so time will
      be limited by the longest (22,000 bits).


         For the first test vector, we have to load it in, run the circuit for one clock
         cycle, then unload the result.
         Loading the second test vector is done while unloading the first.

               TimeTot = ClockPeriod
                           × (MaxLengthVec + NumVecs × (MaxLengthVec + 1))
                       = (1/(0.80 × 800 × 10^6)) × (22,000 + 500,000 × (22,000 + 1))
                       ≈ 17 seconds
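The arithmetic in the answer can be checked with a few lines of Python (variable names are mine; the numbers come from the question above):

```python
# Test time = clock period × (cycles to load the first vector
#             + per-vector cycles to run + simultaneously unload/load).
freq_hz     = 0.80 * 800e6        # tests run at 80% of 800 MHz
max_len     = 22_000              # longest scan chain limits the time
num_vectors = 500_000

cycles = max_len + num_vectors * (max_len + 1)
time_s = cycles / freq_hz
print(round(time_s, 1))           # ≈ 17.2 seconds
```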


7.4 Boundary Scan and JTAG
Boundary scan originated as a technique to test wires on printed circuit boards (PCBs).
The goal was to replace “bed-of-nails” style testing with a technique that would work for high-density
PCBs (lots of small wires close together).
It is now used to test both boards and chip internals,
on both boundaries (I/O pins) and internal flops.


Boundary Scan with JTAG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Standardized by IEEE (1149.1) and previously by JTAG:
• 4 required signals (Scan Pins: TDI, TDO, TCK, TMS)
• 1 optional signal (Scan Pin: TRST)
• protocol to connect circuit under test to tester and other circuits
• state machine to drive test circuitry on chip
• Boundary Scan Description Language (BSDL): structural language used to describe which
  features of JTAG a circuit supports
JTAG circuitry is now commonly built into FPGAs and ASICs, or provided as part of a cell library. Rarely
is a JTAG circuit custom-built as part of a larger design. So, you’ll probably be choosing and using
JTAG circuits, not constructing new ones.
Using JTAG circuitry is usually done by giving a description of your printed circuit board (PCB)
and the JTAG components on each chip (in BSDL) to test generation software. The software then
generates a sequence of JTAG commands and data that can be used to test the wires on the circuit
board for opens and shorts.


7.4.1 Boundary Scan History
1985 JETAG: Joint European Test Action Group
1986 JTAG (North American companies joined)
1990 JTAG 2.0 formed basis for IEEE 1149.1 “Test access port and boundary scan architecture”


7.4.2 JTAG Scan Pins

 TDI  −→ test data input:
         input test vector to chip
 TDO  ←− test data output:
         output result of test
 TCK  −→ test clock:
         clock signal that test runs on
 TMS  −→ test mode select:
         controls scan state machine
 TRST −→ test reset (optional):
         resets the scan state machine
[Figure: High-level view — chip containing scan registers, circuit under test, and control block, with normal input/output pins plus TDI, TDO, TCK, and TMS]

[Figure: Detailed view — boundary scan register (BSR) of boundary scan cells (BSCs) surrounding the circuit under test; bypass register (BR), instruction register (IR) of IR cells (IRCs), IDCODE register, instruction decoder, and TAP controller sitting between TDI and TDO, clocked by TCK and controlled by TMS]


7.4.3 Scan Registers and Cells

Basic Building Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
 TDR             Test data register
                 The boundary scan registers on a chip
 DR    Fig 14.2 Data register cell
                 Often used as a Boundary scan cell (BSC)


JTAG Components                        ....................................................................
          Fig 14.8   Top-level diagram
 BSR      Fig 14.5   Boundary scan register
                     A chain of boundary scan cells (BSCs)
 BSC     Fig 14.2   Boundary scan cell
                    Connects external input and scan signal to internal circuit. Acts as
                    wire between external input and internal circuit in normal mode.
 BR      Fig 14.3   Bypass-register cell
                    Allows direct connection from TDI to TDO. Acts as a wire when
                    executing BYPASS instruction.
 IDCODE             Device identification register
                    data register to hold manufacturer’s name and chip identifier. Used
                    in IDCODE instruction.
 IR cell Fig 14.4   Instruction register cell
                    Cells are combined together as a shift register to form an instruction
                    register (IR)
 IR      Fig 14.6   Instruction register
                    Two or more IR cells in a row. Holds data that is shifted in on TDI,
                    sends this data in parallel to instruction decoder.
 IDecode Table 14.4 Instruction decoder
                    Reads instruction stored in instruction register (IR) and sends control
                    signals to bypass register (BR) and boundary scan register (BSR)
         Fig 14.7   TAP Controller
                    State machine that, together with instruction decoder, controls the
                    scan circuitry.


7.4.4 Scan Instructions

This is the set of required instructions; all other instructions are optional.

 EXTEST  Test board-level interconnect. Drive output pins of chip with
         hard-coded test vector. Sample results on inputs.
 SAMPLE  Sample result data
 PRELOAD Load test vector
 BYPASS  Directly connect TDI to TDO. This is used when several chips are
         daisy chained together to skip loading data into some chips.
 IDCODE  Output manufacturer and part number


7.4.5 TAP Controller

The TAP controller is required to have 16 states and obey the state machine shown in Fig 14.7 of
Smith.


7.4.6 Other descriptions of JTAG/IEEE 1149.1

Texas Instruments introductory seminar on IEEE 1149.1
http://www.ti.com/sc/docs/jtag/seminar1.pdf

Texas Instruments intermediate seminar on IEEE 1149.1
http://www.ti.com/sc/docs/jtag/seminar2.pdf

Sun microSPARC-IIep scan-testing documentation
http://www.sun.com/microelectronics/whitepapers/wpr-0018-01/

Intellitech JTAG overview:
http://www.intellitech.com/resources/technology.html

Actel’s JTAG description:
http://www.actel.com/appnotes/97s05d15.pdf

Description of JTAG support on the Motorola ColdFire microprocessor:
http://e-www.motorola.com/collateral/MCF5307TR-JTAG.pdf


7.5 Built In Self Test
With built-in self test, the circuit tests itself. Both test vector generation and checking are done
using linear feedback shift registers (LFSRs).


7.5.1 Block Diagram

[Figure: BIST block diagram — in test mode, the test generator LFSR drives diz(0..3) into the circuit under test in place of data_in(0..3); signature analyzers 0–3 watch d_out(0..3) and produce ok(0)–ok(3), which the result checker combines into all_ok]


7.5.1.1 Components

There is one test generator per group of inputs (or internal flops) that drive the same circuit to be
tested.

There is one signature analyzer per output (or internal flop).

          Note: MISR       An exception to the above rule is a multiple input signature
          register (MISR), which can be used to analyze several outputs of the circuit
          under test.
The test generator and signature analyzer are both built with linear-feedback shift registers.


Test Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
• generates a pseudo-random set of test vectors
• for n output bits, generates all vectors from 1 to 2^n − 1 in a pseudo-random order
• built with a linear-feedback shift register (shift-register portion is the input flops)
The figure below shows an LFSR that generates all possible 3-bit vectors except 000. (An n-bit
LFSR that generates 2^n − 1 different vectors is called a “maximal-length LFSR”.)


Assume that reset initializes the circuit to 111. The sequence that is generated is: 111,
011, 001, 100, 010, 101, 110. This sequence is repeated, so the number after 110 is 111.

[Figure: 3-flop maximal-length LFSR with outputs q2, q1, q0]
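A minimal Python model reproduces the sequence above (the shift direction and feedback taps here are one choice that yields exactly this sequence; the actual figure may wire it differently):

```python
def lfsr3_step(state):
    """One step of a 3-bit maximal-length LFSR, state held as an integer 1..7.
    Feedback is the XOR of the two low-order bits; shift right, insert at the MSB."""
    fb = ((state >> 1) ^ state) & 1
    return (fb << 2) | (state >> 1)

state, seq = 0b111, []          # reset initializes the circuit to 111
for _ in range(7):
    seq.append(state)
    state = lfsr3_step(state)
print([format(v, "03b") for v in seq])
# → ['111', '011', '001', '100', '010', '101', '110'], then repeats from 111
```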


    Question:             Why not just use a counter to generate 1..2^n − 1?


    Answer:

         • An LFSR has less area than an incrementer. Just a few XOR gates for an
           LFSR, compared to a half-adder per bit for an incrementer.
         • There is a strong correlation between consecutive test vectors generated
           by an incrementer, while there is no correlation between consecutive test
           vectors generated by an LFSR. When doing speed binning, if consecutive
           test vectors generate the same output, we cannot distinguish
           between a slow critical path and a correctly working circuit.


Signature Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Checking is done by building one signature analyzer circuit for each signal tested. The circuit
returns true if the signal generates the correct sequence of outputs for the test vectors. Doing this
with complete accuracy would require storing 2^n bits of information for each output for a circuit
with n inputs. This would be as expensive as the original circuit. So, BIST uses mathematics
similar to error correction/detection to approximate whether the outputs are correct. This
technique is called “signature analysis” and originated with Hewlett-Packard in the 1970s.

The checking is done with an LFSR, similar to the BIST generation circuit. The checking circuit
is designed to output a 1 at the end of the sequence of 2^n − 1 test results if the sequence of results
matches the correct circuit. We could do this with an LFSR of 2^n − 1 flops, but as said before, this
would be at least as expensive as duplicating the original circuit.

The checking LFSR is designed similarly to a hashing function or parity checking circuit. If it
returns 0, then we know that there is a fault in the circuit. If it returns a 1, then there is probably
not a fault in the circuit, but we can’t say for sure.


There is a tradeoff between the accuracy of the analyzer and its area. The more accurate it is, the
more flip-flops are required.
Summary: the signature analyzer:
• checks that the output it is examining has the correct results for the complete set of tests that
  are run
• only has a meaningful result at the end of the entire test sequence.
• built with a linear-feedback shift register
• similar to a hash function or a lossy compression function
• if there are no faults, the signature analyzer will definitely say “ok” (no false negatives)
• if there is a fault, the signature analyzer might say “ok” or might say “bad” (false positives are
  possible)
• design tradeoff: more accurate signature analyzers require more hardware


Result Checker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
• signature analyzers output “ok”/”bad” on every clock cycle, but the result is only meaningful at
  the end of running the complete set of test vectors
• the result checker looks at test vector inputs to detect the end of the test suite and outputs
  “all ok” if all signature analyzers report “ok” at that moment
• implemented as an AND gate


7.5.1.2 Linear Feedback Shift Register (LFSR)

Basically, a shift register (sequence of flip-flops) with the output of the last flip-flop fed back into
some of the earlier flip-flops with XOR gates.
Design parameters:
• number of flip-flops
• external or internal XOR
• feedback taps (coefficients)
• external-input or self-contained
• reset or set


Example LFSRs                   .......................................................................
[Figures: Example LFSRs — “External-XOR, input, reset” and “External-XOR, no input, set”; each is a 3-flop shift register (d0/q0, d1/q1, d2/q2) with the feedback XOR gates outside the shift path]



[Figures: Example LFSRs — “Internal-XOR, input, set” and “Internal-XOR, input, reset”; here the feedback XOR gates sit between the flops, inside the shift path]
In E&CE 327, we use internal-XOR LFSRs, because the circuitry matches the mathematics of
Galois fields.

External-XOR LFSRs work just fine, but they are more difficult to analyze, because their
behaviour can’t be treated with Galois-field arithmetic.


7.5.1.3 Maximal-Length LFSR

    Definition maximal-length linear feedback shift register: An LFSR that outputs a
      pseudo-random sequence of all representable bit-vectors except 0...00.


    Definition pseudo random: The same elements in the same order every time, but the
      relationship between consecutive elements is apparently random.


Maximal-length linear feedback shift registers are used to generate test vectors for built-in self
test.


Maximal-Length LFSR Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
The figures below illustrate the two maximal-length internal-XOR linear feedback shift registers
that can be constructed with 3 flops.

[Figures: The two maximal-length internal-XOR LFSRs with 3 flops — both are set to all 1s initially; they differ in which flop inputs the feedback from q2 is XORed into]


   Question:                Why do maximal-length LFSRs not generate the test vector 0...00?


   Answer:
      If all flops had 0 as output, then the LFSR would get stuck at 0 and would
     generate only 0...00 in the future.


Maximal-length LFSRs:
• set to all 1s initially
• self contained (no external i input)


[Timing diagram for a 3-flop maximal-length LFSR: reset, clk, d0/q0, d1/q1, and q2 over
8 clock cycles; the LFSR value steps through 7, 6, 4, 1, 2, 5, 3, then repeats with 7, 6, …]


7.5.2 Test Generator

The test generator component is a maximal-length LFSR with multiplexors on the inputs to each
flip-flop. In test mode, the data input on each flip flop is connected to the output of the previous
flip flop. In normal mode, the input of each flip flop is connected to the environment.
[Figure: Test generator — 3-flop maximal-length LFSR with a mux, selected by mode, on each flop input; in normal mode the flops capture the external inputs i_d(0)–i_d(2), in test mode they shift; outputs q0, q1, q2]


7.5.3 Signature Analyzer

There are three things that change between different signature analyzers:
• number of flops (⇑ flops =⇒ ⇑ area, ⇑ accuracy)
• choice of feedback taps: a good choice can improve accuracy (more isn’t necessarily better)
• bubbles on input to AND gate for “ok”: determined by expected result from simulating test
  sequence through circuit under test and LFSR of analyzer.
[Figure: Example signature analyzer — two flops (d0/q0, d1/q1) with input i and reset, feedback taps into both flops (d0 = i ⊕ q1, d1 = q0 ⊕ q1), and an AND gate with one bubbled input producing ok]

This circuit:
• Two flops; most analyzers use more — the HP boards in the 1970s used 37 flops!
• Feedback taps on both flops. Different signature analyzers have different configurations of feedback taps.
• Also contains the “ok” tester (AND gate). The expected output of the LFSR at the end of the test sequence is 01 — one flop 0, the other 1 — which we know because of the bubble on one input of the AND gate. (To see why this is the expected output of the signature analyzer, we would need to know the correct sequence of outputs of the circuit under test.)

Symbolic simulation (each entry is the XOR of the listed input bits):

   i     i6   i5   i4      i3      i2     i1     i0     -
   d0    i6   i5   i4⊕i6   356     245    1346   02356  -
   q0    0    i6   i5      i4⊕i6   356    245    1346   02356
   d1    0    i6   i5⊕i6   i4⊕i5   346    2356   1245   -
   q1    0    0    i6      i5⊕i6   i4⊕i5  346    2356   1245

   where 356 = i3⊕i5⊕i6, 2356 = i2⊕i3⊕i5⊕i6, etc.
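The symbolic table can be verified by simulating the analyzer over GF(2), representing each signal as the set of input bits XORed into it; this assumes the feedback structure d0 = i ⊕ q1 and d1 = q0 ⊕ q1, which is the structure that reproduces the table entries:

```python
# Symbolic GF(2) simulation: each signal is the frozenset of input bits
# XORed into it; XOR over GF(2) is symmetric difference (^) on the sets.
q0, q1 = frozenset(), frozenset()        # flops reset to 0
for k in (6, 5, 4, 3, 2, 1, 0):          # inputs i6 .. i0, one per clock cycle
    i = frozenset({k})
    d0 = i ^ q1                          # assumed tap: d0 = i  XOR q1
    d1 = q0 ^ q1                         # assumed tap: d1 = q0 XOR q1
    q0, q1 = d0, d1                      # clock edge
print(sorted(q0), sorted(q1))            # → [0, 2, 3, 5, 6] [1, 2, 4, 5]
```

The final values — q0 = i0⊕i2⊕i3⊕i5⊕i6 (“02356”) and q1 = i1⊕i2⊕i4⊕i5 (“1245”) — match the last column of the table.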




7.5.4 Result Checker

The purpose of the result checker is to sample the “ok” signal at the end of the test sequence. To
do this, we need to recognize the end of the test sequence. The simplest way to do this is to notice
that the first test vector is all 1s and that the test vector sequence will repeat as long as the circuit
is in test mode.
We want to sample the “ok” signal one clock cycle after the sequence is over. This is the same as
the first clock cycle of the second test sequence. In this clock cycle, the output of the test
generator will be all 1s and reset will be 0. We need to look at reset, because otherwise we
could not distinguish the first sequence (when reset is 1) from the subsequent sequences.
[Figure: Result checker — an AND gate combining q0, q1, q2, ok, and inverted reset to produce all_ok]
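As a sketch, the sampling condition is a single AND gate (signal names follow the figure; the bubble on reset becomes a `not`):

```python
def result_checker(reset, q, ok):
    """all_ok: sample 'ok' on the first cycle of the second test sequence —
       the generator outputs all 1s (q == 0b111) and reset has been released."""
    return (not reset) and q == 0b111 and ok

# The AND gate only asserts all_ok at the sampling cycle:
assert result_checker(reset=False, q=0b111, ok=True)
assert not result_checker(reset=True,  q=0b111, ok=True)   # first sequence
assert not result_checker(reset=False, q=0b011, ok=True)   # mid-sequence
```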


7.5.5 Arithmetic over Binary Fields
•   Galois Fields!
•   Two operations: “+” and “×”
•   Two values: 0 and 1
•   Bit vectors and shift-registers are written as polynomials in terms of x.
                + represents XOR                        × represents concatenating shift registers

                  expression   result                     expression    result
                    0 + 0        0                          x^4 × 1       x^4
                    0 + 1        1                          x^2 × x^3     x^5
                    1 + 0        1
                    1 + 1        0
                    x + x        0

Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Calculate (x3 + x2 + 1) × (x2 + x)


                              x2 × (x3 + x2 + 1) = x5 + x4      + x2
                              x × (x3 + x2 + 1) =       x4 + x3      + x
                                                   x5      + x3 + x2 + x
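The example above can be reproduced with carry-less multiplication. This is a Python sketch (the notes themselves work in hardware and VHDL), treating a polynomial over GF(2) as a bit mask where bit k is the coefficient of xk:

```python
def gf2_mul(a: int, b: int) -> int:
    """Multiply polynomials over GF(2): shift a once for each power of x in b,
    and add the partial products with XOR (since + is XOR)."""
    result = 0
    while b:
        if b & 1:
            result ^= a     # add this partial product
        a <<= 1             # multiply a by x
        b >>= 1
    return result

# (x^3 + x^2 + 1) * (x^2 + x) = x^5 + x^3 + x^2 + x
print(bin(gf2_mul(0b1101, 0b110)))  # 0b101110
```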


7.5.6 Shift Registers and Characteristic Polynomials

Each linear feedback shift register has a corresponding characteristic polynomial.

The exponents in the polynomial correspond to the delay: x0 is the input to the shift register, x1 is
the output of the first flip-flop, x2 is the output of the second, etc. A coefficient is 1 if the
feedback line feeds into that flip-flop. Usually (for internal flops, or for an input flop with an
external input), the feedback is combined via an XOR gate. For an input flop without an external
input signal, the feedback is connected directly, with a wire: the non-existent external input is
equivalent to a 0, and 0 XOR a simplifies to a, which is just a wire.

From polynomials to hardware:
• The maximum exponent denotes the number of flops
• The other exponents denote the flops that tap off of the feedback line from the last flop
• From the characteristic polynomial, we cannot determine whether the shift register has an
  external input. Stated another way, two shift registers that are identical except that one has an
  external input and the other does not will have the same characteristic polynomial.
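These rules can be checked with a quick simulation (a Python sketch, not the notes' VHDL; the helper name is illustrative). For p(x) = x3 + x + 1 there are three flops with feedback taps at x0 and x1, and with no external input the register cycles through all seven non-zero states:

```python
def lfsr_step(state, taps, i=0):
    """One clock of an LFSR. state[k] is the output of flop k; feedback comes
    from the last flop and is XORed in at each exponent listed in `taps`.
    With no external input, i is 0 and the x^0 tap reduces to a plain wire."""
    fb = state[-1]
    nxt = []
    for k in range(len(state)):
        d = i if k == 0 else state[k - 1]   # plain shift path
        if k in taps:                        # feedback tap
            d ^= fb
        nxt.append(d)
    return nxt

# p(x) = x^3 + x + 1: 3 flops, taps at x^0 and x^1 (maximal length).
state, seen = [1, 1, 1], []
for _ in range(7):
    state = lfsr_step(state, taps={0, 1})
    seen.append(tuple(state))
# seven distinct states, returning to the initial 1, 1, 1
```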
488                                                           CHAPTER 7. FAULT TESTING AND TESTABILITY

(Figures: six linear feedback shift registers, each with reset, an external input i, and
flops q0, q1, q2 — the last also has q3 — with taps labelled x0, x1, x2, x3 (and x4).
Their characteristic polynomials are, respectively:
p(x) = x3,  p(x) = x3 + x,  p(x) = x3 + 1,  p(x) = x3 + x + 1,
p(x) = x3 + x2 + x + 1,  and  p(x) = x4 + x3 + x + 1.)


7.5.6.1 Circuit Multiplication

Redoing the multiplication example (x2 + x) × (x3 + x2 + 1) as circuits:


                       x2 + x

                  x3 + x2 + 1

     (x2 + x) × (x3 + x2 + 1)


 =          x × (x3 + x2 + 1)

      + x2 × (x3 + x2 + 1)

 =           x5 + x3 + x2 + x
The flop for the most-significant bit is represented by a coefficient of 1 for the maximum exponent
in the polynomial. Hence, the MSB of the first partial product cancels the x4 of the second partial
product, resulting in a coefficient of 0 for x4 in the answer.


7.5.7 Bit Streams and Characteristic Polynomials

A bit stream, or bit sequence, can be represented as a polynomial.

The oldest (first) bit in a sequence of n bits is represented by xn−1 and the youngest (last) bit is x0 .

The bit sequence 1010011 can be represented as x6 + x4 + x + 1:

                    1         0    1   0     0     1     1
                 = 1x 6 + 0x5 + 1x4 + 0x3 + 0x2 + 1x1 + 1x0

                 = x6 + x4 + x + 1
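The mapping from a bit stream to a polynomial can be written directly (a Python sketch; the helper name is illustrative):

```python
def stream_to_poly(bits):
    """Oldest (first) bit is x^(n-1), youngest (last) is x^0; returns a
    bit mask with bit k = coefficient of x^k."""
    poly = 0
    for b in bits:
        poly = (poly << 1) | b
    return poly

print(bin(stream_to_poly([1, 0, 1, 0, 0, 1, 1])))  # 0b1010011 = x^6 + x^4 + x + 1
```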


7.5.8 Division

With rules for multiplication and addition, we can define division.

A fundamental theorem of division defines q and r to be the quotient and remainder, respectively,
of m ÷ p iff:


                                      m(x) = q(x) × p(x) + r(x)


In Galois fields, we do division just as with long division in elementary school.

Given:

                                      m(x) = x6 + x4 + x3
                                      p(x) = x4 + x

Calculate the quotient, q(x) and remainder r(x) for m(x) ÷ p(x):


                         x2       + 1
                x4 + x   x6 + 0x5 + 1x4 + 1x3 + 0x2 + 0x1 + 0x0

                         x6             + 1x3
                                    1x4
                                    1x4             + x
                                                       x


 Quotient       q(x) = x2 + 1
 Remainder      r(x) = x
Check result:

                            m(x) =   q(x)    ×      p(x) + r(x)
                                 = (x2 + 1) × (x4 + x) +    x
                                 = x6 + x3 + x4 + x      + x
                                 = x6 + x4 + x3
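The long division above can be cross-checked in Python (illustrative helper name; polynomials as bit masks with bit k the coefficient of xk):

```python
def gf2_divmod(m: int, p: int):
    """Long division over GF(2): subtract (XOR) shifted copies of p from m
    until the remainder's degree is below that of p."""
    q, deg_p = 0, p.bit_length() - 1
    while m and m.bit_length() - 1 >= deg_p:
        shift = m.bit_length() - 1 - deg_p
        q |= 1 << shift        # record this quotient term
        m ^= p << shift        # cancel the leading term of m
    return q, m

# m(x) = x^6 + x^4 + x^3,  p(x) = x^4 + x
q, r = gf2_divmod(0b1011000, 0b10010)
print(bin(q), bin(r))  # 0b101 0b10 -> quotient x^2 + 1, remainder x
```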


7.5.9 Signature Analysis: Math and Circuits

The input to the signature analyzer is a “message”, m(x), which is a sequence of n bits
represented as a polynomial.

After n shifts through an LFSR with l flops:
• The sequence of output bits forms a quotient, q(x), of length n − l
• The flops in the analyzer form a remainder, r(x), of length l


                                    m(x) = q(x) × p(x) + r(x)


The remainder is the signature.

The mathematics for an LFSR without an input i:
• same polynomial as if the circuit had an input
• input sequence is all 0s

An input stream with an error can be represented as m(x) + e(x)
• e(x) is the error polynomial
• bits in the message that are flipped have a coefficient of 1 in e(x)


                                 m(x) + e(x) = q′ (x) × p(x) + r′ (x)


The error e(x) will be detected if it results in a different signature (remainder).

m(x) and m(x) + e(x) will have the same remainder iff


                                         e(x) mod p(x) = 0


That is, e(x) must be a multiple of p(x).

The larger p(x) is, the smaller the chances that e(x) will be a multiple of p(x).
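The claim that the analyzer's flops hold the remainder can be checked by simulation. The sketch below (Python rather than the notes' VHDL; function names are illustrative) shifts a message through an internal-XOR LFSR for p(x) = x3 + x + 1 and compares the flop contents against m(x) mod p(x):

```python
def signature(bits, taps, n):
    """Shift a bit stream (oldest bit first) through an n-flop LFSR whose
    characteristic polynomial has feedback taps at the exponents in `taps`.
    Feedback comes from the last flop; the final state is the signature."""
    state = [0] * n
    for bit in bits:
        fb = state[-1]
        nxt = []
        for k in range(n):
            d = bit if k == 0 else state[k - 1]   # plain shift path
            if k in taps:                          # feedback XORed in at tap
                d ^= fb
            nxt.append(d)
        state = nxt
    return state   # state[k] is the coefficient of x^k in the remainder

def gf2_mod(m, p):
    """m(x) mod p(x) over GF(2); polynomials as bit masks (bit k = x^k)."""
    deg_p = p.bit_length() - 1
    while m and m.bit_length() - 1 >= deg_p:
        m ^= p << (m.bit_length() - 1 - deg_p)
    return m

# Message 1011010 (oldest bit first) = x^6 + x^4 + x^3 + x;  p(x) = x^3 + x + 1.
sig = signature([1, 0, 1, 1, 0, 1, 0], taps={0, 1}, n=3)
rem = gf2_mod(0b1011010, 0b1011)
# sig == [0, 1, 0] and rem == 0b010: both give the remainder x
```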


7.5.10 Summary
Adding test circuitry

        1. Pick number of flops for generator
        2. Build generator (maximal-length linear feedback shift register)
        3. Pick number of flops for signature analysis
        4. Pick coefficients (feedback taps) for analyzer
        5. Based on the generator, circuit under test, and signature analyzer, determine the
           expected output of the analyzer
        6. Based on expected output of analyzer, build result checker

Running test vectors

        1. Put circuit in test mode
        2. Set reset = 1
        3. Run one clock cycle, set reset = 0
        4. Run one clock cycle for each test vector
        5. At end of test sequence, all ok signals should be 1
        6. To run n test vectors requires n + 1 clock cycles.


BIST for a Simple Circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Outline of steps to see if a fault will be detected by BIST:

     1. Output sequence from test generator
     2. Output sequence from correct circuit
     3. Remainder for signature analyzer with correct output sequence
     4. Output sequence from faulty circuit
     5. Remainder for signature analyzer with faulty output sequence
     6. Compare correct and faulty remainder, if different then fault detected


Components                ......................................................................... .

(Figures: the circuit under test, with inputs a, b, c, internal lines labelled L1 through L8,
and output z; a 3-flop test generator t0, t1, t2; and a 3-flop signature analyzer r0, r1, r2
fed by z.)


(Blank worked-example tables: the test generation sequence (t0, t1, t2); the output sequences
of the correct and faulty circuits (a, b, c, z); and the signature analyzer sequences for
each (z, r0, r1, r2).)
  Question:   Determine if L2@1 will be detected

                                                 Equation for correct circuit: ab + bc
                                                 Equation for faulty circuit: a + c
                                                 Output sequences for correct and faulty circuits




Test generation sequence (t0, t1, t2), with initial values all 1s; after the seven vectors,
the final values repeat the initial values. The technique is to shift, then compute the
result of the XORs. The generator outputs drive the circuit inputs: a = t0, b = t1, c = t2.

Output sequences from the correct and faulty circuits:

        t0  t1  t2
         a   b   c     correct z   faulty z
         1   1   1         1          1
         1   1   0         1          1
         0   1   1         1          1
         1   0   0         0          1
         0   1   0         0          0
         0   0   1         0          1
         1   0   1         0          1
Signature analyzer sequences (initial values r0 = r1 = r2 = 0):

     correct circuit:   z = 1, 1, 1, 0, 0, 0, 0
     faulty circuit:    z = 1, 1, 1, 1, 0, 1, 1

Shifting each output sequence through the analyzer (shift, then compute the result of the
XORs) leaves a remainder in r0, r1, r2. The two remainders differ, so the fault L2@1 is
detected.
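The extracted figures do not pin down the analyzer's feedback taps, so as an illustration assume an analyzer with characteristic polynomial p(x) = x3 + x2 + 1 (taps at x0 and x2). Shifting the correct and faulty output sequences through it, in a Python sketch with illustrative names, shows the remainders differ:

```python
def analyze(z_bits, taps, n=3):
    """Shift an output stream through an n-flop signature analyzer
    (internal-XOR LFSR); returns the final flop contents — the remainder."""
    r = [0] * n
    for z in z_bits:
        fb = r[-1]
        r = [(z if k == 0 else r[k - 1]) ^ (fb if k in taps else 0)
             for k in range(n)]
    return r

taps = {0, 2}                      # assumed p(x) = x^3 + x^2 + 1
correct = analyze([1, 1, 1, 0, 0, 0, 0], taps)
faulty = analyze([1, 1, 1, 1, 0, 1, 1], taps)
print(correct, faulty, correct != faulty)  # differing remainders -> detected
```

Note that the error polynomial here is e(x) = x3 + x + 1, which is not a multiple of the assumed p(x), so detection is expected.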


7.6 Scan vs Self Test
Scan

       ⇑ less hardware
       ⇓ slower
       ⇑ well-defined coverage
       ⇑ test vectors are easy to modify

Self Test

       ⇓ more hardware
       ⇑ faster
       ⇓ ill-defined coverage
       ⇓ test vectors are hard to modify
7.7. PROBLEMS ON FAULTS, TESTING
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference
VHDL Reference

VHDL Reference

  • 1.
    E&CE 327: DigitalSystems Engineering Course Notes (with Solutions) Mark Aagaard 2011t1–Winter University of Waterloo Dept of Electrical and Computer Engineering
  • 3.
    Contents I Course Notes 1 1 VHDL 3 1.1 Introduction to VHDL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1.1 Levels of Abstraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1.2 VHDL Origins and History . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.1.3 Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.1.4 Synthesis of a Simulation-Based Language . . . . . . . . . . . . . . . . . 7 1.1.5 Solution to Synthesis Sanity . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.1.6 Standard Logic 1164 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2 Comparison of VHDL to Other Hardware Description Languages . . . . . . . . . 9 1.2.1 VHDL Disadvantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.2.2 VHDL Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.2.3 VHDL and Other Languages . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.2.3.1 VHDL vs Verilog . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.2.3.2 VHDL vs System Verilog . . . . . . . . . . . . . . . . . . . . . 10 1.2.3.3 VHDL vs SystemC . . . . . . . . . . . . . . . . . . . . . . . . 10 1.2.3.4 Summary of VHDL Evaluation . . . . . . . . . . . . . . . . . . 11 1.3 Overview of Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.3.1 Syntactic Categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.3.2 Library Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.3.3 Entities and Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.3.4 Concurrent Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 1.3.5 Component Declaration and Instantiations . . . . . . . . . . . . . . . . . . 16 1.3.6 Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 1.3.7 Sequential Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
17 1.3.8 A Few More Miscellaneous VHDL Features . . . . . . . . . . . . . . . . 18 1.4 Concurrent vs Sequential Statements . . . . . . . . . . . . . . . . . . . . . . . . . 18 1.4.1 Concurrent Assignment vs Process . . . . . . . . . . . . . . . . . . . . . . 18 1.4.2 Conditional Assignment vs If Statements . . . . . . . . . . . . . . . . . . 18 1.4.3 Selected Assignment vs Case Statement . . . . . . . . . . . . . . . . . . . 19 1.4.4 Coding Style . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.5 Overview of Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 1.5.1 Combinational Process vs Clocked Process . . . . . . . . . . . . . . . . . 22 1.5.2 Latch Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 i
  • 4.
    ii CONTENTS 1.5.3 Combinational vs Flopped Signals . . . . . . . . . . . . . . . . . . . . . . 25 1.6 Details of Process Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 1.6.1 Simple Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 1.6.2 Temporal Granularities of Simulation . . . . . . . . . . . . . . . . . . . . 26 1.6.3 Intuition Behind Delta-Cycle Simulation . . . . . . . . . . . . . . . . . . 27 1.6.4 Definitions and Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 27 1.6.4.1 Process Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 1.6.4.2 Simulation Algorithm . . . . . . . . . . . . . . . . . . . . . . . 28 1.6.4.3 Delta-Cycle Definitions . . . . . . . . . . . . . . . . . . . . . . 30 1.6.5 Example 1: Process Execution (Bamboozle) . . . . . . . . . . . . . . . . . 31 1.6.6 Example 2: Process Execution (Flummox) . . . . . . . . . . . . . . . . . 40 1.6.7 Example: Need for Provisional Assignments . . . . . . . . . . . . . . . . 42 1.6.8 Delta-Cycle Simulations of Flip-Flops . . . . . . . . . . . . . . . . . . . . 44 1.7 Register-Transfer-Level Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 50 1.7.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 1.7.2 Technique for Register-Transfer Level Simulation . . . . . . . . . . . . . . 52 1.7.3 Examples of RTL Simulation . . . . . . . . . . . . . . . . . . . . . . . . . 53 1.7.3.1 RTL Simulation Example 1 . . . . . . . . . . . . . . . . . . . . 53 1.8 VHDL and Hardware Building Blocks . . . . . . . . . . . . . . . . . . . . . . . . 58 1.8.1 Basic Building Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 1.8.2 Deprecated Building Blocks for RTL . . . . . . . . . . . . . . . . . . . . 59 1.8.2.1 An Aside on Flip-Flops and Latches . . . . . . . . . . . . . . . 59 1.8.2.2 Deprecated Hardware . . . . . . . . . . . . . . . . . . . . . . . 59 1.8.3 Hardware and Code for Flops . . . . . 
. . . . . . . . . . . . . . . . . . . 60 1.8.3.1 Flops with Waits and Ifs . . . . . . . . . . . . . . . . . . . . . . 60 1.8.3.2 Flops with Synchronous Reset . . . . . . . . . . . . . . . . . . 60 1.8.3.3 Flops with Chip-Enable . . . . . . . . . . . . . . . . . . . . . . 61 1.8.3.4 Flop with Chip-Enable and Mux on Input . . . . . . . . . . . . . 61 1.8.3.5 Flops with Chip-Enable, Muxes, and Reset . . . . . . . . . . . . 62 1.8.4 An Example Sequential Circuit . . . . . . . . . . . . . . . . . . . . . . . 62 1.9 Arrays and Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 1.10 Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 1.10.1 Arithmetic Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 1.10.2 Shift and Rotate Operations . . . . . . . . . . . . . . . . . . . . . . . . . 68 1.10.3 Overloading of Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . 68 1.10.4 Different Widths and Arithmetic . . . . . . . . . . . . . . . . . . . . . . . 69 1.10.5 Overloading of Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . 69 1.10.6 Different Widths and Comparisons . . . . . . . . . . . . . . . . . . . . . . 69 1.10.7 Type Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 1.11 Synthesizable vs Non-Synthesizable Code . . . . . . . . . . . . . . . . . . . . . . 71 1.11.1 Unsynthesizable Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 1.11.1.1 Initial Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 1.11.1.2 Wait For . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 1.11.1.3 Different Wait Conditions . . . . . . . . . . . . . . . . . . . . . 72 1.11.1.4 Multiple “if rising edge” in Process . . . . . . . . . . . . . . . . 73
  • 5.
    CONTENTS iii 1.11.1.5 “if rising edge” and “wait” in Same Process . . . . . . . . . . . 73 1.11.1.6 “if rising edge” with “else” Clause . . . . . . . . . . . . . . . . 74 1.11.1.7 “if rising edge” Inside a “for” Loop . . . . . . . . . . . . . . . . 74 1.11.1.8 “wait” Inside of a “for loop” . . . . . . . . . . . . . . . . . . . 75 1.11.2 Synthesizable, but Bad Coding Practices . . . . . . . . . . . . . . . . . . . 76 1.11.2.1 Asynchronous Reset . . . . . . . . . . . . . . . . . . . . . . . . 76 1.11.2.2 Combinational “if-then” Without “else” . . . . . . . . . . . . . 77 1.11.2.3 Bad Form of Nested Ifs . . . . . . . . . . . . . . . . . . . . . . 77 1.11.2.4 Deeply Nested Ifs . . . . . . . . . . . . . . . . . . . . . . . . . 77 1.11.3 Synthesizable, but Unpredictable Hardware . . . . . . . . . . . . . . . . . 78 1.12 Synthesizable VHDL Coding Guidelines . . . . . . . . . . . . . . . . . . . . . . . 78 1.12.1 Signal Declarations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 1.12.2 Flip-Flops and Latches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 1.12.3 Inputs and Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 1.12.4 Multiplexors and Tri-State Signals . . . . . . . . . . . . . . . . . . . . . . 79 1.12.5 Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 1.12.6 State Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 1.12.7 Reset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 1.13 VHDL Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 P1.1 IEEE 1164 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 P1.2 VHDL Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 P1.3 Flops, Latches, and Combinational Circuitry . . . . . . . . . . . . . . . . 85 P1.4 Counting Clock Cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
86 P1.5 Arithmetic Overflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 P1.6 Delta-Cycle Simulation: Pong . . . . . . . . . . . . . . . . . . . . . . . . 89 P1.7 Delta-Cycle Simulation: Baku . . . . . . . . . . . . . . . . . . . . . . . . 89 P1.8 Clock-Cycle Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 P1.9 VHDL — VHDL Behavioural Comparison: Teradactyl . . . . . . . . . . . 92 P1.10 VHDL — VHDL Behavioural Comparison: Ichtyostega . . . . . . . . . . 93 P1.11 Waveform — VHDL Behavioural Comparison . . . . . . . . . . . . . . . 95 P1.12 Hardware — VHDL Comparison . . . . . . . . . . . . . . . . . . . . . . 97 P1.13 8-Bit Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 P1.13.1 Asynchronous Reset . . . . . . . . . . . . . . . . . . . . . . . . 98 P1.13.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 P1.13.3 Testbench for Register . . . . . . . . . . . . . . . . . . . . . . . 98 P1.14 Synthesizable VHDL and Hardware . . . . . . . . . . . . . . . . . . . . . 99 P1.15 Datapath Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 P1.15.1 Correct Implementation? . . . . . . . . . . . . . . . . . . . . . 101 P1.15.2 Smallest Area . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 P1.15.3 Shortest Clock Period . . . . . . . . . . . . . . . . . . . . . . . 104
  • 6.
    iv CONTENTS 2 RTL Design with VHDL 105 2.1 Prelude to Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 2.1.1 A Note on EDA for FPGAs and ASICs . . . . . . . . . . . . . . . . . . . 105 2.2 FPGA Background and Coding Guidelines . . . . . . . . . . . . . . . . . . . . . . 106 2.2.1 Generic FPGA Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 2.2.1.1 Generic FPGA Cell . . . . . . . . . . . . . . . . . . . . . . . . 106 2.2.2 Area Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 2.2.2.1 Interconnect for Generic FPGA . . . . . . . . . . . . . . . . . . 112 2.2.2.2 Blocks of Cells for Generic FPGA . . . . . . . . . . . . . . . . 112 2.2.2.3 Clocks for Generic FPGAs . . . . . . . . . . . . . . . . . . . . 114 2.2.2.4 Special Circuitry in FPGAs . . . . . . . . . . . . . . . . . . . . 114 2.2.3 Generic-FPGA Coding Guidelines . . . . . . . . . . . . . . . . . . . . . . 115 2.3 Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 2.3.1 Generic Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 2.3.2 Implementation Flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 2.3.3 Design Flow: Datapath vs Control vs Storage . . . . . . . . . . . . . . . . 118 2.3.3.1 Classes of Hardware . . . . . . . . . . . . . . . . . . . . . . . . 118 2.3.3.2 Datapath-Centric Design Flow . . . . . . . . . . . . . . . . . . 119 2.3.3.3 Control-Centric Design Flow . . . . . . . . . . . . . . . . . . . 120 2.3.3.4 Storage-Centric Design Flow . . . . . . . . . . . . . . . . . . . 120 2.4 Algorithms and High-Level Models . . . . . . . . . . . . . . . . . . . . . . . . . 120 2.4.1 Flow Charts and State Machines . . . . . . . . . . . . . . . . . . . . . . . 121 2.4.2 Data-Dependency Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 121 2.4.3 High-Level Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
122 2.5 Finite State Machines in VHDL . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 2.5.1 Introduction to State-Machine Design . . . . . . . . . . . . . . . . . . . . 123 2.5.1.1 Mealy vs Moore State Machines . . . . . . . . . . . . . . . . . 123 2.5.1.2 Introduction to State Machines and VHDL . . . . . . . . . . . . 123 2.5.1.3 Explicit vs Implicit State Machines . . . . . . . . . . . . . . . . 124 2.5.2 Implementing a Simple Moore Machine . . . . . . . . . . . . . . . . . . . 125 2.5.2.1 Implicit Moore State Machine . . . . . . . . . . . . . . . . . . . 126 2.5.2.2 Explicit Moore with Flopped Output . . . . . . . . . . . . . . . 127 2.5.2.3 Explicit Moore with Combinational Outputs . . . . . . . . . . . 128 2.5.2.4 Explicit-Current+Next Moore with Concurrent Assignment . . . 129 2.5.2.5 Explicit-Current+Next Moore with Combinational Process . . . 130 2.5.3 Implementing a Simple Mealy Machine . . . . . . . . . . . . . . . . . . . 131 2.5.3.1 Implicit Mealy State Machine . . . . . . . . . . . . . . . . . . . 132 2.5.3.2 Explicit Mealy State Machine . . . . . . . . . . . . . . . . . . . 133 2.5.3.3 Explicit-Current+Next Mealy . . . . . . . . . . . . . . . . . . . 134 2.5.4 Reset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 2.5.5 State Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 2.5.5.1 Constants vs Enumerated Type . . . . . . . . . . . . . . . . . . 137 2.5.5.2 Encoding Schemes . . . . . . . . . . . . . . . . . . . . . . . . . 138 2.6 Dataflow Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 2.6.1 Dataflow Diagrams Overview . . . . . . . . . . . . . . . . . . . . . . . . 139
  • 7.
    CONTENTS v 2.6.2 Dataflow Diagrams, Hardware, and Behaviour . . . . . . . . . . . . . . . 142 2.6.3 Dataflow Diagram Execution . . . . . . . . . . . . . . . . . . . . . . . . . 143 2.6.4 Performance Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 2.6.5 Area Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 2.6.6 Design Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 2.6.7 Area / Performance Tradeoffs . . . . . . . . . . . . . . . . . . . . . . . . 145 2.7 Design Example: Massey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 2.7.1 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 2.7.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 2.7.3 Initial Dataflow Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 2.7.4 Dataflow Diagram Scheduling . . . . . . . . . . . . . . . . . . . . . . . . 150 2.7.5 Optimize Inputs and Outputs . . . . . . . . . . . . . . . . . . . . . . . . . 152 2.7.6 Input/Output Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 2.7.7 Register Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 2.7.8 Datapath Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 2.7.9 Datapath for DP+Ctrl Model . . . . . . . . . . . . . . . . . . . . . . . . . 158 2.7.10 Peephole Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 2.8 Design Example: Vanier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 2.8.1 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 2.8.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 2.8.3 Initial Dataflow Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 2.8.4 Reschedule to Meet Requirements . . . . . . . . . . . . . . . . . . . . . . 164 2.8.5 Optimize Resources . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . 165 2.8.6 Assign Names to Registered Values . . . . . . . . . . . . . . . . . . . . . 167 2.8.7 Input/Output Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . 168 2.8.8 Tangent: Combinational Outputs . . . . . . . . . . . . . . . . . . . . . . . 170 2.8.9 Register Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 2.8.10 Datapath Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 2.8.11 Hardware Block Diagram and State Machine . . . . . . . . . . . . . . . . 173 2.8.11.1 Control for Registers . . . . . . . . . . . . . . . . . . . . . . . 173 2.8.11.2 Control for Datapath Components . . . . . . . . . . . . . . . . . 174 2.8.11.3 Control for State . . . . . . . . . . . . . . . . . . . . . . . . . . 175 2.8.11.4 Complete State Machine Table . . . . . . . . . . . . . . . . . . 175 2.8.12 VHDL Code with Explicit State Machine . . . . . . . . . . . . . . . . . . 176 2.8.13 Peephole Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 2.8.14 Notes and Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . 182 2.9 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 2.9.1 Introduction to Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . 183 2.9.2 Partially Pipelined . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 2.9.3 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187 2.10 Design Example: Pipelined Massey . . . . . . . . . . . . . . . . . . . . . . . . . 188 2.11 Memory Arrays and RTL Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 192 2.11.1 Memory Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192 2.11.2 Memory Arrays in VHDL . . . . . . . . . . . . . . . . . . . . . . . . . . 193 2.11.2.1 Using a Two-Dimensional Array for Memory . . . . . . . . . . 193
              2.11.2.2 Memory Arrays in Hardware . . . . . . . . . . . . . 194
              2.11.2.3 VHDL Code for Single-Port Memory Array . . . . . . . 195
              2.11.2.4 Using Library Components for Memory . . . . . . . . 196
              2.11.2.5 Build Memory from Slices . . . . . . . . . . . . . . 197
              2.11.2.6 Dual-Ported Memory . . . . . . . . . . . . . . . . . 199
      2.11.3 Data Dependencies . . . . . . . . . . . . . . . . . . . . . . 199
      2.11.4 Memory Arrays and Dataflow Diagrams . . . . . . . . . . . . . 201
      2.11.5 Example: Memory Array and Dataflow Diagram . . . . . . . . . . 204
  2.12 Input / Output Protocols . . . . . . . . . . . . . . . . . . . . . . 206
  2.13 Example: Moving Average . . . . . . . . . . . . . . . . . . . . . . 207
      2.13.1 Requirements and Environmental Assumptions . . . . . . . . . . 207
      2.13.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 207
      2.13.3 Pseudocode and Dataflow Diagrams . . . . . . . . . . . . . . . 210
      2.13.4 Control Tables and State Machine . . . . . . . . . . . . . . . 216
      2.13.5 VHDL Code . . . . . . . . . . . . . . . . . . . . . . . . . . 219
  2.14 Design Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 221
      P2.1 Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
              P2.1.1 Data Structures . . . . . . . . . . . . . . . . . . . 221
              P2.1.2 Own Code vs Libraries . . . . . . . . . . . . . . . . 221
      P2.2 Design Guidelines . . . . . . . . . . . . . . . . . . . . . . . 221
      P2.3 Dataflow Diagram Optimization . . . . . . . . . . . . . . . . . 222
              P2.3.1 Resource Usage . . . . . . . . . . . . . . . . . . . . 222
              P2.3.2 Optimization . . . . . . . . . . . . . . . . . . . . . 223
      P2.4 Dataflow Diagram Design . . . . . . . . . . . . . . . . . . . . 223
              P2.4.1 Maximum Performance . . . . . . . . . . . . . . . . . 223
              P2.4.2 Minimum Area . . . . . . . . . . . . . . . . . . . . . 224
      P2.5 Michener: Design and Optimization . . . . . . . . . . . . . . . 224
      P2.6 Dataflow Diagrams with Memory Arrays . . . . . . . . . . . . . . 224
              P2.6.1 Algorithm 1 . . . . . . . . . . . . . . . . . . . . . 225
              P2.6.2 Algorithm 2 . . . . . . . . . . . . . . . . . . . . . 225
      P2.7 2-bit Adder . . . . . . . . . . . . . . . . . . . . . . . . . . 225
              P2.7.1 Generic Gates . . . . . . . . . . . . . . . . . . . . 225
              P2.7.2 FPGA . . . . . . . . . . . . . . . . . . . . . . . . . 226
      P2.8 Sketches of Problems . . . . . . . . . . . . . . . . . . . . . . 226
3 Performance Analysis and Optimization                                      227
  3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
  3.2 Defining Performance . . . . . . . . . . . . . . . . . . . . . . . . 227
  3.3 Comparing Performance . . . . . . . . . . . . . . . . . . . . . . . . 228
      3.3.1 General Equations . . . . . . . . . . . . . . . . . . . . . . . 228
      3.3.2 Example: Performance of Printers . . . . . . . . . . . . . . . 229
  3.4 Clock Speed, CPI, Program Length, and Performance . . . . . . . . . . 233
      3.4.1 Mathematics . . . . . . . . . . . . . . . . . . . . . . . . . . 233
      3.4.2 Example: CISC vs RISC and CPI . . . . . . . . . . . . . . . . . 233
      3.4.3 Effect of Instruction Set on Performance . . . . . . . . . . . 235
      3.4.4 Effect of Time to Market on Relative Performance . . . . . . . 237
      3.4.5 Summary of Equations . . . . . . . . . . . . . . . . . . . . . 238
  3.5 Performance Analysis and Dataflow Diagrams . . . . . . . . . . . . . 239
      3.5.1 Dataflow Diagrams, CPI, and Clock Speed . . . . . . . . . . . . 239
      3.5.2 Examples of Dataflow Diagrams for Two Instructions . . . . . . 240
              3.5.2.1 Scheduling of Operations for Different Clock Periods 241
              3.5.2.2 Performance Computation for Different Clock Periods . 241
              3.5.2.3 Example: Two Instructions Taking Similar Time . . . . 242
              3.5.2.4 Example: Same Total Time, Different Order for A . . . 243
      3.5.3 Example: From Algorithm to Optimized Dataflow . . . . . . . . . 244
  3.6 General Optimizations . . . . . . . . . . . . . . . . . . . . . . . . 252
      3.6.1 Strength Reduction . . . . . . . . . . . . . . . . . . . . . . 252
              3.6.1.1 Arithmetic Strength Reduction . . . . . . . . . . . . 252
              3.6.1.2 Boolean Strength Reduction . . . . . . . . . . . . . 252
      3.6.2 Replication and Sharing . . . . . . . . . . . . . . . . . . . . 253
              3.6.2.1 Mux-Pushing . . . . . . . . . . . . . . . . . . . . . 253
              3.6.2.2 Common Subexpression Elimination . . . . . . . . . . 253
              3.6.2.3 Computation Replication . . . . . . . . . . . . . . . 253
      3.6.3 Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . 254
  3.7 Retiming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
  3.8 Performance Analysis and Optimization Problems . . . . . . . . . . . 256
      P3.1 Farmer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
      P3.2 Network and Router . . . . . . . . . . . . . . . . . . . . . . . 257
              P3.2.1 Maximum Throughput . . . . . . . . . . . . . . . . . . 257
              P3.2.2 Packet Size and Performance . . . . . . . . . . . . . 257
      P3.3 Performance Short Answer . . . . . . . . . . . . . . . . . . . . 257
      P3.4 Microprocessors . . . . . . . . . . . . . . . . . . . . . . . . 257
              P3.4.1 Average CPI . . . . . . . . . . . . . . . . . . . . . 258
              P3.4.2 Why not you too? . . . . . . . . . . . . . . . . . . . 258
              P3.4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . 258
      P3.5 Dataflow Diagram Optimization . . . . . . . . . . . . . . . . . 258
      P3.6 Performance Optimization with Memory Arrays . . . . . . . . . . 259
      P3.7 Multiply Instruction . . . . . . . . . . . . . . . . . . . . . . 260
              P3.7.1 Highest Performance . . . . . . . . . . . . . . . . . 260
              P3.7.2 Performance Metrics . . . . . . . . . . . . . . . . . 261
4 Functional Verification                                                    263
  4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
      4.1.1 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
  4.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
      4.2.1 Terminology: Validation / Verification / Testing . . . . . . . 264
      4.2.2 The Difficulty of Designing Correct Chips . . . . . . . . . . . 265
              4.2.2.1 Notes from Kenn Heinrich (UW E&CE grad) . . . . . . . 265
              4.2.2.2 Notes from Aart de Geus (Chairman and CEO of Synopsys) 265
  4.3 Test Cases and Coverage . . . . . . . . . . . . . . . . . . . . . . . 266
      4.3.1 Test Terminology . . . . . . . . . . . . . . . . . . . . . . . 266
      4.3.2 Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
      4.3.3 Floating Point Divider Example . . . . . . . . . . . . . . . . 268
  4.4 Testbenches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
      4.4.1 Overview of Test Benches . . . . . . . . . . . . . . . . . . . 271
      4.4.2 Reference Model Style Testbench . . . . . . . . . . . . . . . . 272
      4.4.3 Relational Style Testbench . . . . . . . . . . . . . . . . . . 272
      4.4.4 Coding Structure of a Testbench . . . . . . . . . . . . . . . . 273
      4.4.5 Datapath vs Control . . . . . . . . . . . . . . . . . . . . . . 273
      4.4.6 Verification Tips . . . . . . . . . . . . . . . . . . . . . . . 274
  4.5 Functional Verification for Datapath Circuits . . . . . . . . . . . . 274
      4.5.1 A Spec-Less Testbench . . . . . . . . . . . . . . . . . . . . . 275
      4.5.2 Use an Array for Test Vectors . . . . . . . . . . . . . . . . . 276
      4.5.3 Build Spec into Stimulus . . . . . . . . . . . . . . . . . . . 277
      4.5.4 Have Separate Specification Entity . . . . . . . . . . . . . . 278
      4.5.5 Generate Test Vectors Automatically . . . . . . . . . . . . . . 280
      4.5.6 Relational Specification . . . . . . . . . . . . . . . . . . . 280
  4.6 Functional Verification of Control Circuits . . . . . . . . . . . . . 281
      4.6.1 Overview of Queues in Hardware . . . . . . . . . . . . . . . . 281
      4.6.2 VHDL Coding . . . . . . . . . . . . . . . . . . . . . . . . . . 283
              4.6.2.1 Package . . . . . . . . . . . . . . . . . . . . . . . 283
              4.6.2.2 Other VHDL Coding . . . . . . . . . . . . . . . . . . 283
      4.6.3 Code Structure for Verification . . . . . . . . . . . . . . . . 283
      4.6.4 Instrumentation Code . . . . . . . . . . . . . . . . . . . . . 284
      4.6.5 Coverage Monitors . . . . . . . . . . . . . . . . . . . . . . . 284
      4.6.6 Assertions . . . . . . . . . . . . . . . . . . . . . . . . . . 287
      4.6.7 VHDL Coding Tips . . . . . . . . . . . . . . . . . . . . . . . 288
      4.6.8 Queue Specification . . . . . . . . . . . . . . . . . . . . . . 289
      4.6.9 Queue Testbench . . . . . . . . . . . . . . . . . . . . . . . . 290
  4.7 Example: Microwave Oven . . . . . . . . . . . . . . . . . . . . . . . 291
  4.8 Functional Verification Problems . . . . . . . . . . . . . . . . . . 296
      P4.1 Carry Save Adder . . . . . . . . . . . . . . . . . . . . . . . . 296
      P4.2 Traffic Light Controller . . . . . . . . . . . . . . . . . . . . 296
              P4.2.1 Functionality . . . . . . . . . . . . . . . . . . . . 296
              P4.2.2 Boundary Conditions . . . . . . . . . . . . . . . . . 296
              P4.2.3 Assertions . . . . . . . . . . . . . . . . . . . . . . 296
      P4.3 State Machines and Verification . . . . . . . . . . . . . . . . 297
              P4.3.1 Three Different State Machines . . . . . . . . . . . . 297
              P4.3.2 State Machines in General . . . . . . . . . . . . . . 298
      P4.4 Test Plan Creation . . . . . . . . . . . . . . . . . . . . . . . 298
              P4.4.1 Early Tests . . . . . . . . . . . . . . . . . . . . . 299
              P4.4.2 Corner Cases . . . . . . . . . . . . . . . . . . . . . 299
      P4.5 Sketches of Problems . . . . . . . . . . . . . . . . . . . . . . 299
5 Timing Analysis                                                            301
  5.1 Delays and Definitions . . . . . . . . . . . . . . . . . . . . . . . 301
      5.1.1 Background Definitions . . . . . . . . . . . . . . . . . . . . 301
      5.1.2 Clock-Related Timing Definitions . . . . . . . . . . . . . . . 302
              5.1.2.1 Clock Skew . . . . . . . . . . . . . . . . . . . . . 302
              5.1.2.2 Clock Latency . . . . . . . . . . . . . . . . . . . . 303
              5.1.2.3 Clock Jitter . . . . . . . . . . . . . . . . . . . . 303
      5.1.3 Storage-Related Timing Definitions . . . . . . . . . . . . . . 304
              5.1.3.1 Flops and Latches . . . . . . . . . . . . . . . . . . 304
              5.1.3.2 Timing Parameters for a Flop . . . . . . . . . . . . 305
              5.1.3.3 Hold Time . . . . . . . . . . . . . . . . . . . . . . 305
              5.1.3.4 Clock-to-Q Time . . . . . . . . . . . . . . . . . . . 305
      5.1.4 Propagation Delays . . . . . . . . . . . . . . . . . . . . . . 306
              5.1.4.1 Load Delays . . . . . . . . . . . . . . . . . . . . . 306
              5.1.4.2 Interconnect Delays . . . . . . . . . . . . . . . . . 306
      5.1.5 Summary of Delay Factors . . . . . . . . . . . . . . . . . . . 307
      5.1.6 Timing Constraints . . . . . . . . . . . . . . . . . . . . . . 307
              5.1.6.1 Minimum Clock Period . . . . . . . . . . . . . . . . 308
              5.1.6.2 Hold Constraint . . . . . . . . . . . . . . . . . . . 309
              5.1.6.3 Example Timing Violations . . . . . . . . . . . . . . 309
  5.2 Timing Analysis of Latches and Flip Flops . . . . . . . . . . . . . . 311
      5.2.1 Simple Multiplexer Latch . . . . . . . . . . . . . . . . . . . 311
              5.2.1.1 Structure and Behaviour of Multiplexer Latch . . . . 311
              5.2.1.2 Strategy for Timing Analysis of Storage Devices . . . 313
              5.2.1.3 Clock-to-Q Time of a Multiplexer Latch . . . . . . . 314
              5.2.1.4 Setup Timing of a Multiplexer Latch . . . . . . . . . 315
              5.2.1.5 Hold Time of a Multiplexer Latch . . . . . . . . . . 323
              5.2.1.6 Example of a Bad Latch . . . . . . . . . . . . . . . 326
      5.2.2 Timing Analysis of Transmission-Gate Latch . . . . . . . . . . 326
              5.2.2.1 Structure and Behaviour of a Transmission Gate . . . 327
              5.2.2.2 Structure and Behaviour of Transmission-Gate Latch . 327
              5.2.2.3 Clock-to-Q Delay for Transmission-Gate Latch . . . . 328
              5.2.2.4 Setup and Hold Times for Transmission-Gate Latch . . 328
      5.2.3 Falling Edge Flip Flop . . . . . . . . . . . . . . . . . . . . 328
              5.2.3.1 Structure and Behaviour of Flip-Flop . . . . . . . . 329
              5.2.3.2 Clock-to-Q of Flip-Flop . . . . . . . . . . . . . . . 330
              5.2.3.3 Setup of Flip-Flop . . . . . . . . . . . . . . . . . 331
              5.2.3.4 Hold of Flip-Flop . . . . . . . . . . . . . . . . . . 332
      5.2.4 Timing Analysis of FPGA Cells . . . . . . . . . . . . . . . . . 332
              5.2.4.1 Standard Timing Equations . . . . . . . . . . . . . . 333
              5.2.4.2 Hierarchical Timing Equations . . . . . . . . . . . . 333
              5.2.4.3 Actel Act 2 Logic Cell . . . . . . . . . . . . . . . 333
              5.2.4.4 Timing Analysis of Actel Sequential Module . . . . . 335
      5.2.5 Exotic Flop . . . . . . . . . . . . . . . . . . . . . . . . . . 336
  5.3 Critical Paths and False Paths . . . . . . . . . . . . . . . . . . . 336
      5.3.1 Introduction to Critical and False Paths . . . . . . . . . . . 336
              5.3.1.1 Example of Critical Path in Full Adder . . . . . . . 338
              5.3.1.2 Preliminaries for Critical Paths . . . . . . . . . . 340
              5.3.1.3 Longest Path and Critical Path . . . . . . . . . . . 340
              5.3.1.4 Timing Simulation vs Static Timing Analysis . . . . . 343
      5.3.2 Longest Path . . . . . . . . . . . . . . . . . . . . . . . . . 343
      5.3.3 Detecting a False Path . . . . . . . . . . . . . . . . . . . . 345
              5.3.3.1 Preliminaries for Detecting a False Path . . . . . . 345
              5.3.3.2 Almost-Correct Algorithm to Detect a False Path . . . 349
              5.3.3.3 Examples of Detecting False Paths . . . . . . . . . . 349
      5.3.4 Finding the Next Candidate Path . . . . . . . . . . . . . . . . 354
              5.3.4.1 Algorithm to Find Next Candidate Path . . . . . . . . 354
              5.3.4.2 Examples of Finding Next Candidate Path . . . . . . . 355
      5.3.5 Correct Algorithm to Find Critical Path . . . . . . . . . . . . 362
              5.3.5.1 Rules for Late Side Inputs . . . . . . . . . . . . . 362
              5.3.5.2 Monotone Speedup . . . . . . . . . . . . . . . . . . 364
              5.3.5.3 Analysis of Side-Input-Causes-Glitch Situation . . . 365
              5.3.5.4 Complete Algorithm . . . . . . . . . . . . . . . . . 366
              5.3.5.5 Complete Examples . . . . . . . . . . . . . . . . . . 367
      5.3.6 Further Extensions to Critical Path Analysis . . . . . . . . . 374
      5.3.7 Increasing the Accuracy of Critical Path Analysis . . . . . . . 375
  5.4 Elmore Timing Model . . . . . . . . . . . . . . . . . . . . . . . . . 375
      5.4.1 RC-Networks for Timing Analysis . . . . . . . . . . . . . . . . 375
      5.4.2 Derivation of Analog Timing Model . . . . . . . . . . . . . . . 380
              5.4.2.1 Example Derivation: Equation for Voltage at Node 3 . 382
              5.4.2.2 General Derivation . . . . . . . . . . . . . . . . . 383
      5.4.3 Elmore Timing Model . . . . . . . . . . . . . . . . . . . . . . 385
      5.4.4 Examples of Using Elmore Delay . . . . . . . . . . . . . . . . 387
              5.4.4.1 Interconnect with Single Fanout . . . . . . . . . . . 387
              5.4.4.2 Interconnect with Multiple Gates in Fanout . . . . . 389
  5.5 Practical Usage of Timing Analysis . . . . . . . . . . . . . . . . . 392
      5.5.1 Speed Binning . . . . . . . . . . . . . . . . . . . . . . . . . 393
              5.5.1.1 FPGAs, Interconnect, and Synthesis . . . . . . . . . 394
      5.5.2 Worst Case Timing . . . . . . . . . . . . . . . . . . . . . . . 394
              5.5.2.1 Fanout Delay . . . . . . . . . . . . . . . . . . . . 394
              5.5.2.2 Derating Factors . . . . . . . . . . . . . . . . . . 394
  5.6 Timing Analysis Problems . . . . . . . . . . . . . . . . . . . . . . 396
      P5.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . 396
      P5.2 Hold Time Violations . . . . . . . . . . . . . . . . . . . . . . 396
              P5.2.1 Cause . . . . . . . . . . . . . . . . . . . . . . . . 396
              P5.2.2 Behaviour . . . . . . . . . . . . . . . . . . . . . . 397
              P5.2.3 Rectification . . . . . . . . . . . . . . . . . . . . 397
      P5.3 Latch Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 397
      P5.4 Critical Path and False Path . . . . . . . . . . . . . . . . . . 398
      P5.5 Critical Path . . . . . . . . . . . . . . . . . . . . . . . . . 399
              P5.5.1 Longest Path . . . . . . . . . . . . . . . . . . . . . 399
              P5.5.2 Delay . . . . . . . . . . . . . . . . . . . . . . . . 399
              P5.5.3 Missing Factors . . . . . . . . . . . . . . . . . . . 399
              P5.5.4 Critical Path or False Path? . . . . . . . . . . . . . 399
      P5.6 YACP: Yet Another Critical Path . . . . . . . . . . . . . . . . 400
      P5.7 Timing Models . . . . . . . . . . . . . . . . . . . . . . . . . 401
      P5.8 Short Answer . . . . . . . . . . . . . . . . . . . . . . . . . . 402
              P5.8.1 Wires in FPGAs . . . . . . . . . . . . . . . . . . . . 402
              P5.8.2 Age and Time . . . . . . . . . . . . . . . . . . . . . 402
              P5.8.3 Temperature and Delay . . . . . . . . . . . . . . . . 402
      P5.9 Worst Case Conditions and Derating Factor . . . . . . . . . . . 402
              P5.9.1 Worst-Case Commercial . . . . . . . . . . . . . . . . 402
              P5.9.2 Worst-Case Industrial . . . . . . . . . . . . . . . . 402
              P5.9.3 Worst-Case Industrial, Non-Ambient Junction Temperature 402
6 Power Analysis and Power-Aware Design                                      403
  6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
      6.1.1 Importance of Power and Energy . . . . . . . . . . . . . . . . 403
      6.1.2 Industrial Names and Products . . . . . . . . . . . . . . . . . 403
      6.1.3 Power vs Energy . . . . . . . . . . . . . . . . . . . . . . . . 404
      6.1.4 Batteries, Power and Energy . . . . . . . . . . . . . . . . . . 405
              6.1.4.1 Do Batteries Store Energy or Power? . . . . . . . . . 405
              6.1.4.2 Battery Life and Efficiency . . . . . . . . . . . . . 405
              6.1.4.3 Battery Life and Power . . . . . . . . . . . . . . . 406
  6.2 Power Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
      6.2.1 Switching Power . . . . . . . . . . . . . . . . . . . . . . . . 410
      6.2.2 Short-Circuited Power . . . . . . . . . . . . . . . . . . . . . 411
      6.2.3 Leakage Power . . . . . . . . . . . . . . . . . . . . . . . . . 411
      6.2.4 Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
      6.2.5 Note on Power Equations . . . . . . . . . . . . . . . . . . . . 413
  6.3 Overview of Power Reduction Techniques . . . . . . . . . . . . . . . 414
  6.4 Voltage Reduction for Power Reduction . . . . . . . . . . . . . . . . 415
  6.5 Data Encoding for Power Reduction . . . . . . . . . . . . . . . . . . 416
      6.5.1 How Data Encoding Can Reduce Power . . . . . . . . . . . . . . 416
      6.5.2 Example Problem: Sixteen Pulser . . . . . . . . . . . . . . . . 419
              6.5.2.1 Problem Statement . . . . . . . . . . . . . . . . . . 419
              6.5.2.2 Additional Information . . . . . . . . . . . . . . . 420
              6.5.2.3 Answer . . . . . . . . . . . . . . . . . . . . . . . 420
  6.6 Clock Gating . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
      6.6.1 Introduction to Clock Gating . . . . . . . . . . . . . . . . . 424
      6.6.2 Implementing Clock Gating . . . . . . . . . . . . . . . . . . . 425
      6.6.3 Design Process . . . . . . . . . . . . . . . . . . . . . . . . 426
      6.6.4 Effectiveness of Clock Gating . . . . . . . . . . . . . . . . . 427
      6.6.5 Example: Reduced Activity Factor with Clock Gating . . . . . . 429
      6.6.6 Clock Gating with Valid-Bit Protocol . . . . . . . . . . . . . 431
              6.6.6.1 Valid-Bit Protocol . . . . . . . . . . . . . . . . . 431
              6.6.6.2 How Many Clock Cycles for Module? . . . . . . . . . . 433
              6.6.6.3 Adding Clock-Gating Circuitry . . . . . . . . . . . . 434
      6.6.7 Example: Pipelined Circuit with Clock-Gating . . . . . . . . . 437
  6.7 Power Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
      P6.1 Short Answers . . . . . . . . . . . . . . . . . . . . . . . . . 439
              P6.1.1 Power and Temperature . . . . . . . . . . . . . . . . 439
              P6.1.2 Leakage Power . . . . . . . . . . . . . . . . . . . . 439
              P6.1.3 Clock Gating . . . . . . . . . . . . . . . . . . . . . 439
              P6.1.4 Gray Coding . . . . . . . . . . . . . . . . . . . . . 439
      P6.2 VLSI Gurus . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
              P6.2.1 Effect on Power . . . . . . . . . . . . . . . . . . . 439
              P6.2.2 Critique . . . . . . . . . . . . . . . . . . . . . . . 440
      P6.3 Advertising Ratios . . . . . . . . . . . . . . . . . . . . . . . 440
      P6.4 Vary Supply Voltage . . . . . . . . . . . . . . . . . . . . . . 440
      P6.5 Clock Speed Increase Without Power Increase . . . . . . . . . . 441
              P6.5.1 Supply Voltage . . . . . . . . . . . . . . . . . . . . 441
              P6.5.2 Supply Voltage . . . . . . . . . . . . . . . . . . . . 441
      P6.6 Power Reduction Strategies . . . . . . . . . . . . . . . . . . . 441
              P6.6.1 Supply Voltage . . . . . . . . . . . . . . . . . . . . 441
              P6.6.2 Transistor Sizing . . . . . . . . . . . . . . . . . . 441
              P6.6.3 Adding Registers to Inputs . . . . . . . . . . . . . . 441
              P6.6.4 Gray Coding . . . . . . . . . . . . . . . . . . . . . 441
      P6.7 Power Consumption on New Chip . . . . . . . . . . . . . . . . . 442
              P6.7.1 Hypothesis . . . . . . . . . . . . . . . . . . . . . . 442
              P6.7.2 Experiment . . . . . . . . . . . . . . . . . . . . . . 442
              P6.7.3 Reality . . . . . . . . . . . . . . . . . . . . . . . 442
7 Fault Testing and Testability                                              443
  7.1 Faults and Testing . . . . . . . . . . . . . . . . . . . . . . . . . 443
      7.1.1 Overview of Faults and Testing . . . . . . . . . . . . . . . . 443
              7.1.1.1 Faults . . . . . . . . . . . . . . . . . . . . . . . 443
              7.1.1.2 Causes of Faults . . . . . . . . . . . . . . . . . . 443
              7.1.1.3 Testing . . . . . . . . . . . . . . . . . . . . . . . 444
              7.1.1.4 Burn In . . . . . . . . . . . . . . . . . . . . . . . 444
              7.1.1.5 Bin Sorting . . . . . . . . . . . . . . . . . . . . . 444
              7.1.1.6 Testing Techniques . . . . . . . . . . . . . . . . . 445
              7.1.1.7 Design for Testability (DFT) . . . . . . . . . . . . 445
      7.1.2 Example Problem: Economics of Testing . . . . . . . . . . . . . 446
      7.1.3 Physical Faults . . . . . . . . . . . . . . . . . . . . . . . . 447
              7.1.3.1 Types of Physical Faults . . . . . . . . . . . . . . 447
              7.1.3.2 Locations of Faults . . . . . . . . . . . . . . . . . 447
              7.1.3.3 Layout Affects Locations . . . . . . . . . . . . . . 448
              7.1.3.4 Naming Fault Locations . . . . . . . . . . . . . . . 448
      7.1.4 Detecting a Fault . . . . . . . . . . . . . . . . . . . . . . . 448
              7.1.4.1 Which Test Vectors will Detect a Fault? . . . . . . . 449
      7.1.5 Mathematical Models of Faults . . . . . . . . . . . . . . . . . 450
              7.1.5.1 Single Stuck-At Fault Model . . . . . . . . . . . . . 450
      7.1.6 Generate Test Vector to Find a Mathematical Fault . . . . . . . 451
              7.1.6.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . 451
              7.1.6.2 Example of Finding a Test Vector . . . . . . . . . . 452
      7.1.7 Undetectable Faults . . . . . . . . . . . . . . . . . . . . . . 452
              7.1.7.1 Redundant Circuitry . . . . . . . . . . . . . . . . . 452
              7.1.7.2 Curious Circuitry and Fault Detection . . . . . . . . 454
  7.2 Test Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
      7.2.1 A Small Example . . . . . . . . . . . . . . . . . . . . . . . . 455
      7.2.2 Choosing Test Vectors . . . . . . . . . . . . . . . . . . . . . 455
              7.2.2.1 Fault Domination . . . . . . . . . . . . . . . . . . 456
              7.2.2.2 Fault Equivalence . . . . . . . . . . . . . . . . . . 457
              7.2.2.3 Gate Collapsing . . . . . . . . . . . . . . . . . . . 457
              7.2.2.4 Node Collapsing . . . . . . . . . . . . . . . . . . . 458
              7.2.2.5 Fault Collapsing Summary . . . . . . . . . . . . . . 458
      7.2.3 Fault Coverage . . . . . . . . . . . . . . . . . . . . . . . . 459
      7.2.4 Test Vector Generation and Fault Detection . . . . . . . . . . 459
      7.2.5 Generate Test Vectors for 100% Coverage . . . . . . . . . . . . 459
              7.2.5.1 Collapse the Faults . . . . . . . . . . . . . . . . . 460
              7.2.5.2 Check for Fault Domination . . . . . . . . . . . . . 462
              7.2.5.3 Required Test Vectors . . . . . . . . . . . . . . . . 463
              7.2.5.4 Faults Not Covered by Required Test Vectors . . . . . 463
              7.2.5.5 Order to Run Test Vectors . . . . . . . . . . . . . . 464
              7.2.5.6 Summary of Technique to Find and Order Test Vectors . 465
              7.2.5.7 Complete Analysis . . . . . . . . . . . . . . . . . . 466
      7.2.6 One Fault Hiding Another . . . . . . . . . . . . . . . . . . . 467
  7.3 Scan Testing in General . . . . . . . . . . . . . . . . . . . . . . . 467
      7.3.1 Structure and Behaviour of Scan Testing . . . . . . . . . . . . 468
      7.3.2 Scan Chains . . . . . . . . . . . . . . . . . . . . . . . . . . 468
              7.3.2.1 Circuitry in Normal and Scan Mode . . . . . . . . . . 468
              7.3.2.2 Scan in Operation . . . . . . . . . . . . . . . . . . 469
              7.3.2.3 Scan in Operation with Example Circuit . . . . . . . 470
      7.3.3 Summary of Scan Testing . . . . . . . . . . . . . . . . . . . . 475
      7.3.4 Time to Test a Chip . . . . . . . . . . . . . . . . . . . . . . 476
              7.3.4.1 Example: Time to Test a Chip . . . . . . . . . . . . 476
  7.4 Boundary Scan and JTAG . . . . . . . . . . . . . . . . . . . . . . . 477
      7.4.1 Boundary Scan History . . . . . . . . . . . . . . . . . . . . . 477
      7.4.2 JTAG Scan Pins . . . . . . . . . . . . . . . . . . . . . . . . 478
      7.4.3 Scan Registers and Cells . . . . . . . . . . . . . . . . . . . 478
      7.4.4 Scan Instructions . . . . . . . . . . . . . . . . . . . . . . . 479
      7.4.5 TAP Controller . . . . . . . . . . . . . . . . . . . . . . . . 479
      7.4.6 Other Descriptions of JTAG/IEEE 1149.1 . . . . . . . . . . . . 480
  7.5 Built In Self Test . . . . . . . . . . . . . . . . . . . . . . . . . 481
      7.5.1 Block Diagram . . . . . . . . . . . . . . . . . . . . . . . . . 481
              7.5.1.1 Components . . . . . . . . . . . . . . . . . . . . . 481
              7.5.1.2 Linear Feedback Shift Register (LFSR) . . . . . . . . 483
              7.5.1.3 Maximal-Length LFSR . . . . . . . . . . . . . . . . . 484
      7.5.2 Test Generator . . . . . . . . . . . . . . . . . . . . . . . . 485
      7.5.3 Signature Analyzer . . . . . . . . . . . . . . . . . . . . . . 486
      7.5.4 Result Checker . . . . . . . . . . . . . . . . . . . . . . . . 486
      7.5.5 Arithmetic over Binary Fields . . . . . . . . . . . . . . . . . 487
      7.5.6 Shift Registers and Characteristic Polynomials . . . . . . . . 487
              7.5.6.1 Circuit Multiplication . . . . . . . . . . . . . . . 489
      7.5.7 Bit Streams and Characteristic Polynomials . . . . . . . . . . 489
      7.5.8 Division . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
      7.5.9 Signature Analysis: Math and Circuits . . . . . . . . . . . . . 490
      7.5.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
  7.6 Scan vs Self Test . . . . . . . . . . . . . . . . . . . . . . . . . . 496
  7.7 Problems on Faults, Testing, and Testability . . . . . . . . . . . . 497
      P7.1 Based on Smith q14.9: Testing Cost . . . . . . . . . . . . . . . 497
      P7.2 Testing Cost and Total Cost . . . . . . . . . . . . . . . . . . 497
      P7.3 Minimum Number of Faults . . . . . . . . . . . . . . . . . . . . 498
      P7.4 Smith q14.10: Fault Collapsing . . . . . . . . . . . . . . . . . 498
      P7.5 Mathematical Models and Reality . . . . . . . . . . . . . . . . 498
      P7.6 Undetectable Faults . . . . . . . . . . . . . . . . . . . . . . 498
      P7.7 Test Vector Generation . . . . . . . . . . . . . . . . . . . . . 498
              P7.7.1 Choice of Test Vectors . . . . . . . . . . . . . . . . 499
              P7.7.2 Number of Test Vectors . . . . . . . . . . . . . . . . 499
      P7.8 Time to do a Scan Test . . . . . . . . . . . . . . . . . . . . . 499
      P7.9 BIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
              P7.9.1 Characteristic Polynomials . . . . . . . . . . . . . . 499
            P7.9.2 Test Generation . . . . . . . . . . . . . . . . . . . . . . . . 500
            P7.9.3 Signature Analyzer . . . . . . . . . . . . . . . . . . . . . . 500
            P7.9.4 Probability of Catching a Fault . . . . . . . . . . . . . . . . 500
            P7.9.5 Probability of Catching a Fault . . . . . . . . . . . . . . . . 500
            P7.9.6 Detecting a Specific Fault . . . . . . . . . . . . . . . . . . 500
            P7.9.7 Time to Run Test . . . . . . . . . . . . . . . . . . . . . . . 500
      P7.10 Power and BIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500
      P7.11 Timing Hazards and Testability . . . . . . . . . . . . . . . . . . . . 501
      P7.12 Testing Short Answer . . . . . . . . . . . . . . . . . . . . . . . . . 501
            P7.12.1 Are there any physical faults that are detectable by scan
                    testing but not by built-in self testing? . . . . . . . . . . 501
            P7.12.2 Are there any physical faults that are detectable by built-in
                    self testing but not by scan testing? . . . . . . . . . . . . 501
      P7.13 Fault Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
            P7.13.1 Design test generator . . . . . . . . . . . . . . . . . . . . 502
            P7.13.2 Design signature analyzer . . . . . . . . . . . . . . . . . . 502
            P7.13.3 Determine if a fault is detectable . . . . . . . . . . . . . . 502
            P7.13.4 Testing time . . . . . . . . . . . . . . . . . . . . . . . . . 502

8 Review                                                                         503
  8.1 Overview of the Term . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503
  8.2 VHDL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
      8.2.1 VHDL Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
      8.2.2 VHDL Example Problems . . . . . . . . . . . . . . . . . . . . . . . . 504
  8.3 RTL Design Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
      8.3.1 Design Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
      8.3.2 Design Example Problems . . . . . . . . . . . . . . . . . . . . . . . 505
  8.4 Functional Verification . . . . . . . . . . . . . . . . . . . . . . . . . . 506
      8.4.1 Verification Topics . . . . . . . . . . . . . . . . . . . . . . . . . 506
      8.4.2 Verification Example Problems . . . . . . . . . . . . . . . . . . . . 506
  8.5 Performance Analysis and Optimization . . . . . . . . . . . . . . . . . . . 507
      8.5.1 Performance Topics . . . . . . . . . . . . . . . . . . . . . . . . . . 507
      8.5.2 Performance Example Problems . . . . . . . . . . . . . . . . . . . . . 507
  8.6 Timing Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
      8.6.1 Timing Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
      8.6.2 Timing Example Problems . . . . . . . . . . . . . . . . . . . . . . . 508
  8.7 Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
      8.7.1 Power Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
      8.7.2 Power Example Problems . . . . . . . . . . . . . . . . . . . . . . . . 509
  8.8 Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
      8.8.1 Testing Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
      8.8.2 Testing Example Problems . . . . . . . . . . . . . . . . . . . . . . . 510
  8.9 Formulas to be Given on Final Exam . . . . . . . . . . . . . . . . . . . . . 511
II Solutions to Assignment Problems                                                1

1 VHDL Problems                                                                    3
  P1.1 IEEE 1164 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  3
  P1.2 VHDL Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  4
  P1.3 Flops, Latches, and Combinational Circuitry . . . . . . . . . . . . . . . .  7
  P1.4 Counting Clock Cycles . . . . . . . . . . . . . . . . . . . . . . . . . . .  9
  P1.5 Arithmetic Overflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
  P1.6 Delta-Cycle Simulation: Pong . . . . . . . . . . . . . . . . . . . . . . . 13
  P1.7 Delta-Cycle Simulation: Baku . . . . . . . . . . . . . . . . . . . . . . . 14
  P1.8 Clock-Cycle Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 17
  P1.9 VHDL — VHDL Behavioural Comparison: Teradactyl . . . . . . . . . . . . . . 20
  P1.10 VHDL — VHDL Behavioural Comparison: Ichtyostega . . . . . . . . . . . . . 21
  P1.11 Waveform — VHDL Behavioural Comparison . . . . . . . . . . . . . . . . . . 23
  P1.12 Hardware — VHDL Comparison . . . . . . . . . . . . . . . . . . . . . . . . 25
  P1.13 8-Bit Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
      P1.13.1 Asynchronous Reset . . . . . . . . . . . . . . . . . . . . . . . . . 27
      P1.13.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
      P1.13.3 Testbench for Register . . . . . . . . . . . . . . . . . . . . . . . 28
  P1.14 Synthesizable VHDL and Hardware . . . . . . . . . . . . . . . . . . . . . 30
  P1.15 Datapath Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
      P1.15.1 Correct Implementation? . . . . . . . . . . . . . . . . . . . . . . 32
      P1.15.2 Smallest Area . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
      P1.15.3 Shortest Clock Period . . . . . . . . . . . . . . . . . . . . . . . 37

2 Design Problems                                                                 39
  P2.1 Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
      P2.1.1 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
      P2.1.2 Own Code vs Libraries . . . . . . . . . . . . . . . . . . . . . . . . 39
  P2.2 Design Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
  P2.3 Dataflow Diagram Optimization . . . . . . . . . . . . . . . . . . . . . . . 42
      P2.3.1 Resource Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
      P2.3.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
  P2.4 Dataflow Diagram Design . . . . . . . . . . . . . . . . . . . . . . . . . . 44
      P2.4.1 Maximum Performance . . . . . . . . . . . . . . . . . . . . . . . . . 44
      P2.4.2 Minimum area . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
  P2.5 Michener: Design and Optimization . . . . . . . . . . . . . . . . . . . . . 47
  P2.6 Dataflow Diagrams with Memory Arrays . . . . . . . . . . . . . . . . . . . 48
      P2.6.1 Algorithm 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
      P2.6.2 Algorithm 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
  P2.7 2-bit adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
      P2.7.1 Generic Gates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
      P2.7.2 FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
  P2.8 Sketches of Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3 Functional Verification Problems                                                55
  P3.1 Carry Save Adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
  P3.2 Traffic Light Controller . . . . . . . . . . . . . . . . . . . . . . . . . 55
      P3.2.1 Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
      P3.2.2 Boundary Conditions . . . . . . . . . . . . . . . . . . . . . . . . . 56
      P3.2.3 Assertions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
  P3.3 State Machines and Verification . . . . . . . . . . . . . . . . . . . . . . 57
      P3.3.1 Three Different State Machines . . . . . . . . . . . . . . . . . . . 57
            P3.3.1.1 Number of Test Scenarios . . . . . . . . . . . . . . . . . . 57
            P3.3.1.2 Length of Test Scenario . . . . . . . . . . . . . . . . . . . 58
            P3.3.1.3 Number of Flip Flops . . . . . . . . . . . . . . . . . . . . 58
      P3.3.2 State Machines in General . . . . . . . . . . . . . . . . . . . . . . 59
  P3.4 Test Plan Creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
      P3.4.1 Early Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
      P3.4.2 Corner Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
  P3.5 Sketches of Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4 Performance Analysis and Optimization Problems                                  63
  P4.1 Farmer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
  P4.2 Network and Router . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
      P4.2.1 Maximum Throughput . . . . . . . . . . . . . . . . . . . . . . . . . 65
      P4.2.2 Packet Size and Performance . . . . . . . . . . . . . . . . . . . . . 66
  P4.3 Performance Short Answer . . . . . . . . . . . . . . . . . . . . . . . . . 66
  P4.4 Microprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
      P4.4.1 Average CPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
      P4.4.2 Why not you too? . . . . . . . . . . . . . . . . . . . . . . . . . . 68
      P4.4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
  P4.5 Dataflow Diagram Optimization . . . . . . . . . . . . . . . . . . . . . . . 70
  P4.6 Performance Optimization with Memory Arrays . . . . . . . . . . . . . . . . 70
  P4.7 Multiply Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
      P4.7.1 Highest Performance . . . . . . . . . . . . . . . . . . . . . . . . . 76
      P4.7.2 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 78
5 Timing Analysis Problems                                                        79
  P5.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
  P5.2 Hold Time Violations . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
      P5.2.1 Cause . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
      P5.2.2 Behaviour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
      P5.2.3 Rectification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
  P5.3 Latch Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
  P5.4 Critical Path and False Path . . . . . . . . . . . . . . . . . . . . . . . 83
  P5.5 Critical Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
      P5.5.1 Longest Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
      P5.5.2 Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
      P5.5.3 Missing Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
      P5.5.4 Critical Path or False Path? . . . . . . . . . . . . . . . . . . . . 85
  P5.6 YACP: Yet Another Critical Path . . . . . . . . . . . . . . . . . . . . . . 86
  P5.7 Timing Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
  P5.8 Short Answer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
      P5.8.1 Wires in FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
      P5.8.2 Age and Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
      P5.8.3 Temperature and Delay . . . . . . . . . . . . . . . . . . . . . . . . 89
  P5.9 Worst Case Conditions and Derating Factor . . . . . . . . . . . . . . . . . 90
      P5.9.1 Worst-Case Commercial . . . . . . . . . . . . . . . . . . . . . . . . 90
      P5.9.2 Worst-Case Industrial . . . . . . . . . . . . . . . . . . . . . . . . 90
      P5.9.3 Worst-Case Industrial, Non-Ambient Junction Temperature . . . . . . . 90
6 Power Problems                                                                  91
  P6.1 Short Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
      P6.1.1 Power and Temperature . . . . . . . . . . . . . . . . . . . . . . . . 91
      P6.1.2 Leakage Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
      P6.1.3 Clock Gating . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
      P6.1.4 Gray Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
  P6.2 VLSI Gurus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
      P6.2.1 Effect on Power . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
      P6.2.2 Critique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
  P6.3 Advertising Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
  P6.4 Vary Supply Voltage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
  P6.5 Clock Speed Increase Without Power Increase . . . . . . . . . . . . . . . . 95
      P6.5.1 Supply Voltage . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
      P6.5.2 Supply Voltage . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
  P6.6 Power Reduction Strategies . . . . . . . . . . . . . . . . . . . . . . . . 96
      P6.6.1 Supply Voltage . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
      P6.6.2 Transistor Sizing . . . . . . . . . . . . . . . . . . . . . . . . . . 97
      P6.6.3 Adding Registers to Inputs . . . . . . . . . . . . . . . . . . . . . 97
      P6.6.4 Gray Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
  P6.7 Power Consumption on New Chip . . . . . . . . . . . . . . . . . . . . . . . 98
      P6.7.1 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
      P6.7.2 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
      P6.7.3 Reality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7 Problems on Faults, Testing, and Testability                                   101
  P7.1 Based on Smith q14.9: Testing Cost . . . . . . . . . . . . . . . . . . . . 101
  P7.2 Testing Cost and Total Cost . . . . . . . . . . . . . . . . . . . . . . . 103
  P7.3 Minimum Number of Faults . . . . . . . . . . . . . . . . . . . . . . . . . 104
  P7.4 Smith q14.10: Fault Collapsing . . . . . . . . . . . . . . . . . . . . . . 105
  P7.5 Mathematical Models and Reality . . . . . . . . . . . . . . . . . . . . . 105
  P7.6 Undetectable Faults . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
  P7.7 Test Vector Generation . . . . . . . . . . . . . . . . . . . . . . . . . . 106
      P7.7.1 Choice of Test Vectors . . . . . . . . . . . . . . . . . . . . . . . 106
      P7.7.2 Number of Test Vectors . . . . . . . . . . . . . . . . . . . . . . . 106
  P7.8 Time to do a Scan Test . . . . . . . . . . . . . . . . . . . . . . . . . . 106
  P7.9 BIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
      P7.9.1 Characteristic Polynomials . . . . . . . . . . . . . . . . . . . . . 107
      P7.9.2 Test Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
      P7.9.3 Signature Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . 110
      P7.9.4 Probability of Catching a Fault . . . . . . . . . . . . . . . . . . . 113
      P7.9.5 Probability of Catching a Fault . . . . . . . . . . . . . . . . . . . 114
      P7.9.6 Detecting a Specific Fault . . . . . . . . . . . . . . . . . . . . . 114
      P7.9.7 Time to Run Test . . . . . . . . . . . . . . . . . . . . . . . . . . 115
  P7.10 Power and BIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
  P7.11 Timing Hazards and Testability . . . . . . . . . . . . . . . . . . . . . 116
  P7.12 Testing Short Answer . . . . . . . . . . . . . . . . . . . . . . . . . . 118
      P7.12.1 Are there any physical faults that are detectable by scan testing
              but not by built-in self testing? . . . . . . . . . . . . . . . . . 118
      P7.12.2 Are there any physical faults that are detectable by built-in self
              testing but not by scan testing? . . . . . . . . . . . . . . . . . 118
  P7.13 Fault Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
      P7.13.1 Design test generator . . . . . . . . . . . . . . . . . . . . . . . 119
      P7.13.2 Design signature analyzer . . . . . . . . . . . . . . . . . . . . . 119
      P7.13.3 Determine if a fault is detectable . . . . . . . . . . . . . . . . . 120
      P7.13.4 Testing time . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Chapter 1

VHDL: The Language

1.1 Introduction to VHDL

1.1.1 Levels of Abstraction

There are many different levels of abstraction for working with hardware:

• Quantum: Schrödinger's equations describe the movement of electrons and holes through material.

• Energy band: 2-dimensional diagrams that capture the essential features of Schrödinger's equations. Energy-band diagrams are commonly used in nano-scale engineering.

• Transistor: Signal values and time are continuous (analog). Each transistor is modeled by a resistor-capacitor network. Overall behaviour is defined by differential equations in terms of the resistors and capacitors. Spice is a typical simulation tool.

• Switch: Time is continuous, but voltage may be either continuous or discrete. Linear equations are used, rather than differential equations. A rising edge may be modeled as a linear rise over some range of time, or the time between a definite low value and a definite high value may be modeled as having an undefined or rising value.

• Gate: Transistors are grouped together into gates (e.g. AND, OR, NOT). Voltages are discrete values, such as pure Boolean (0 or 1) or IEEE Standard Logic 1164, which has representations for different types of unknown or undefined values. Time may be continuous or may be discrete. If discrete, a common unit is the delay through a single inverter (e.g. a NOT gate has a delay of 1 and an AND gate has a delay of 2).

• Register transfer level: The essential characteristic of the register-transfer level is that the behaviour of hardware is modeled as assignments to registers and combinational signals. Equations are written where a register signal is a function of other signals (e.g. c = a and b;). The assignments may be either combinational or registered. Combinational assignments happen instantaneously and registered assignments take exactly one clock cycle. There are variations on the pure register-transfer level. For example, time may be measured in clock phases rather than clock cycles, so as to allow assignments on either the rising or falling edge of a clock. Another variation is to have multiple clocks that run at different speeds — a clock on a bus might run at half the speed of the primary clock for the chip.

• Transaction level: The basic unit of computation is a transaction, such as executing an instruction on a microprocessor, transferring data across a bus, or accessing memory. Time is usually measured as an estimate (e.g. a memory write requires 15 clock cycles, or a bus transfer requires 250 ns). The building blocks of the transaction level are processors, controllers, memory arrays, busses, and intellectual property (IP) blocks (e.g. UARTs). The behaviour of the building blocks is described with software-like models, often written in behavioural VHDL, SystemC, or SystemVerilog. The transaction level has many similarities to a software model of a distributed system.

• Electronic-system level: Looks at an entire electronic system, with both hardware and software.

In this course, we will focus on the register-transfer level. In the second half of the course, we will look at how analog phenomena, such as timing and power, affect the register-transfer level. In these chapters we will occasionally dip down into the transistor, switch, and gate levels.

1.1.2 VHDL Origins and History

VHDL = VHSIC Hardware Description Language
VHSIC = Very High Speed Integrated Circuit

The VHSIC Hardware Description Language (VHDL) is a formal notation intended for use in all phases of the creation of electronic systems.
Because it is both machine readable and human readable, it supports the development, verification, synthesis and testing of hardware designs, the communication of hardware design data, and the maintenance, modification, and procurement of hardware.

— Language Reference Manual (IEEE Design Automation Standards Committee, 1993a)

• development
• hardware designs
• verification
• communication
• synthesis
• maintenance
• testing
• modification
• procurement

VHDL is a lot more than synthesis of digital hardware.

VHDL History .......................................................................

• Developed by the United States Department of Defense as part of the very high speed integrated circuit (VHSIC) program in the early 1980s.
• The Department of Defense intended VHDL to be used for the documentation, simulation and verification of electronic systems.
• Goals:
  – improve the design process over schematic entry
  – standardize design descriptions amongst multiple vendors
  – portable and extensible
• Inspired by the Ada programming language
  – large: 97 keywords, 94 syntactic rules
  – verbose (designed by committee)
  – static type checking, overloading
  – complicated syntax: parentheses are used for both expression grouping and array indexing
    Example:
      a <= b * (3 + c);  -- integer
      a <= (3 + c);      -- 1-element array of integers
• Standardized by IEEE in 1987 (IEEE 1076-1987); revised in 1993 and 2000.
• In 1993 the IEEE standard VHDL package for model interoperability, STD_LOGIC_1164 (IEEE Standard 1164-1993), was developed.
  – std_logic_1164 defines 9 different values for signals
• In 1997 the IEEE standard packages for arithmetic over std_logic and bit signals were defined (IEEE Standard 1076.3-1997).
  – numeric_std defines arithmetic over std_logic vectors and integers.
    Note: This is the package that you should use for arithmetic. Don't use std_logic_arith — it has less uniform support for mixed integer/signal arithmetic and has a greater tendency for differences between tools.
  – numeric_bit defines arithmetic over bit vectors and integers. We won't use bit signals in this course, so you don't need to worry about this package.
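As a small sketch of the numeric_std style (the entity and signal names here are illustrative, not taken from the course examples), arithmetic on std_logic_vector signals is done by converting to unsigned or signed and back:

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;  -- the recommended arithmetic package

entity add8 is
  port (
    a, b : in  std_logic_vector(7 downto 0);
    sum  : out std_logic_vector(7 downto 0)
  );
end add8;

architecture main of add8 is
begin
  -- convert to unsigned, add, then convert back to std_logic_vector
  sum <= std_logic_vector( unsigned(a) + unsigned(b) );
end main;
```

The explicit conversions are verbose, but they make the intended interpretation (unsigned vs signed) visible in the code, which is one reason numeric_std behaves more uniformly across tools than std_logic_arith.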
1.1.3 Semantics

The original goal of VHDL was to simulate circuits. The semantics of the language define circuit behaviour.

[Figure: the code "c <= a AND b;" is simulated, producing waveforms on inputs a, b and output c]

But now, VHDL is used in simulation and synthesis. Synthesis is concerned with the structure of the circuit.

Synthesis converts one type of description (behavioural) into another, lower-level, description (usually a netlist).

[Figure: the code "c <= a AND b;" is synthesized into an AND gate with inputs a, b and output c]

Synthesis is a computer-aided design (CAD) technique that transforms a designer's concise, high-level description of a circuit into a structural description of a circuit.

CAD Tools ............................................................................

CAD tools allow designers to automate lower-level design processes in implementing the desired functionality of a system.

NOTE: EDA = Electronic Design Automation. In digital hardware design, EDA = CAD.

Synthesis vs Simulation ................................................................

For synthesis, we want the code we write to define the structure of the hardware that is generated.

[Figure: the code "c <= a AND b;" is synthesized into an AND gate with inputs a, b and output c]
The VHDL semantics define the behaviour of the hardware that is generated, not the structure of the hardware. The scenario below complies with the semantics of VHDL, because the two synthesized circuits produce the same behaviour. If the two synthesized circuits had different behaviour, then the scenario would not comply with the VHDL Standard.

[Figure: the code "c <= a AND b;" is synthesized by two different tools into circuits with different structure; simulating the two circuits produces the same behaviour]

1.1.4 Synthesis of a Simulation-Based Language

• Not all of VHDL is synthesizable
  – c <= a AND b; (synthesizable)
  – c <= a AND b AFTER 2 ns; (NOT synthesizable)
    ∗ how do you build a circuit with exactly 2 ns of delay through an AND gate?
    ∗ more examples of non-synthesizable code are in section 1.11
• Different synthesis tools support different subsets of VHDL
• Some tools generate erroneous hardware for some code
  – the behaviour of the hardware differs from the VHDL semantics
• Some tools generate unpredictable hardware (hardware that has the correct behaviour, but an undesirable or weird structure)
• There is an IEEE standard (1076.6) for a synthesizable subset of VHDL, but tool vendors don't yet conform to it. (Most vendors still don't have full support for the 1993 extensions to VHDL!) For more info, see http://www.vhdl.org/siwg/.

1.1.5 Solution to Synthesis Sanity

• Pick a high-quality synthesis tool and study its documentation thoroughly
• Learn the idioms of the tool
• Different VHDL code with the same behaviour can result in very different circuits
• Be careful if you have to port VHDL code from one tool to another
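Although AFTER is not synthesizable, it is perfectly legitimate in simulation-only code. As an illustrative sketch (the entity name and clock period are invented for this example), a testbench clock generator is a common place where AFTER appears:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity tb is
end tb;

architecture sim of tb is
  signal clk : std_logic := '0';
begin
  -- AFTER is fine here: this process exists only for simulation
  -- and is never synthesized into hardware.
  clk <= NOT clk AFTER 5 ns;  -- 10 ns clock period
end sim;
```

The same division of labour shows up throughout the course: synthesizable code for the design itself, and the full simulation language for the testbench around it.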
• KISS: Keep It Simple, Stupid
  – VHDL examples will illustrate reliable coding techniques for the synthesis tools from Synopsys, Mentor Graphics, Altera, Xilinx, and most other companies as well.
  – Follow the coding guidelines and examples from lecture
  – As you write VHDL, think about the hardware you expect to get.
    Note: If you can't predict the hardware, then the hardware probably won't be very good (small, fast, correct, etc.)

1.1.6 Standard Logic 1164

At the core of VHDL is a package named STANDARD that defines a type named bit with values of '0' and '1'. For simulation, it is helpful to have additional values, such as "undefined" and "high impedance". Many companies created their own (incompatible) definitions of signal types for simulation. To regain compatibility amongst packages from different companies, the IEEE defined std_logic_1164 to be the standard type for signal values in VHDL simulation.

  'U'  uninitialized
  'X'  strong unknown
  '0'  strong 0
  '1'  strong 1
  'Z'  high impedance
  'W'  weak unknown
  'L'  weak 0
  'H'  weak 1
  '-'  don't care

The most common values are: 'U', 'X', '0', '1'. If you see 'X' in a simulation, it usually means that there is a mistake in your code.

Every VHDL file that you write should begin with:

  library ieee;
  use ieee.std_logic_1164.all;

Note: std_logic vs boolean
The std_logic values '1' and '0' are not the same as the boolean values true and false. For example, you must write if a = '1' then .... The code if a then ... will not typecheck if a is of type std_logic.

From a VLSI perspective, a weak value will come from a smaller gate. One aspect of VHDL that we don't touch on in ece327 is "resolution", which describes how to determine the value of a signal if the signal is driven by more than one process. (In ece327, we restrict ourselves to having each signal be driven by (be the target of) exactly one process.)
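In ece327 every signal has exactly one driver, but as an illustrative sketch of what "resolution" means (not a coding style to use in this course; the entity name is invented), two concurrent statements driving the same std_logic signal are legal, and the resolution function combines their values:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity two_drivers is
  port ( y : out std_logic );
end two_drivers;

architecture main of two_drivers is
begin
  -- two concurrent statements drive the same signal:
  y <= '1';  -- strong 1
  y <= '0';  -- strong 0
  -- the resolution function combines the drivers:
  -- conflicting strong values resolve to 'X' (strong unknown)
end main;
```

If y were of the unresolved type bit, this architecture would be illegal; it is only the resolved type std_logic that permits multiple drivers.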
The std_logic_1164 library provides a resolution function to deal with situations where different processes drive the same signal with different values. In this situation, a strong value (e.g. '1') will overpower a weak value (e.g. 'L'). If two processes drive the signal with different strong values (e.g. '1' and '0') the signal resolves
to a strong unknown ('X'). If a signal is driven with two different weak values (e.g. 'H' and 'L'), the signal resolves to a weak unknown ('W').

1.2 Comparison of VHDL to Other Hardware Description Languages

1.2.1 VHDL Disadvantages

• Some VHDL programs cannot be synthesized
• Different tools support different subsets of VHDL
• Different tools generate different circuits for the same code
• VHDL is verbose
  – Many characters to say something simple
• VHDL is complicated and confusing
  – Many different ways of saying the same thing
  – Constructs that have a similar purpose have very different syntax (case vs. select)
  – Constructs that have similar syntax have very different semantics (variables vs. signals)
• The hardware that is synthesized is not always obvious (when is a signal a flip-flop vs a latch vs combinational?)
  – The infamous latch-inference problem (see section 1.5.2 for more information)

1.2.2 VHDL Advantages

• VHDL supports unsynthesizable constructs that are useful in writing high-level models, testbenches, and other non-hardware or non-synthesizable artifacts that we need in hardware design. VHDL can be used throughout a large portion of the design process in different capacities, from specification to implementation to verification.
• VHDL has static typechecking — many errors can be caught before synthesis and/or simulation. (In this respect, it is more similar to Java than to C.)
• VHDL has a rich collection of datatypes
• VHDL is a full-featured language with a good module system (libraries and packages)
• VHDL has a well-defined standard
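The latch-inference problem listed among the disadvantages above can be shown in a small sketch (the entity and signal names are invented for this illustration):

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity latchy is
  port (
    en, d : in  std_logic;
    q     : out std_logic
  );
end latchy;

architecture main of latchy is
begin
  process (en, d)
  begin
    if en = '1' then
      q <= d;
    end if;       -- no else branch: q must hold its old value when
  end process;    -- en = '0', so synthesis infers a level-sensitive latch
end main;
```

The code looks combinational, but because q is not assigned on every path through the process, the synthesized hardware must remember q's old value, and a latch appears where the designer may not have wanted one.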
1.2.3 VHDL and Other Languages

1.2.3.1 VHDL vs Verilog

• Verilog is a "simpler" language: smaller language, simple circuits are easier to write
• VHDL has more features than Verilog
  – richer set of data types and strong type checking
  – VHDL offers more flexibility and expressivity for constructing large systems
• The VHDL Standard is more standard than the Verilog Standard
  – VHDL and Verilog have simulation-based semantics
  – Simulation vendors generally conform to the VHDL standard
  – Some Verilog constructs give different behaviours in simulation and synthesis
• VHDL is used more than Verilog in Europe and Japan
• Verilog is used more than VHDL in North America
• VHDL is used more in FPGAs than in ASICs
• South-East Asia, India, South America: ?????

1.2.3.2 VHDL vs System Verilog

• System Verilog is a superset of Verilog. It extends Verilog to make it a full object-oriented hardware modelling language.
• Syntax is based on Verilog and C++.
• As of 2007, System Verilog is used almost exclusively for testbenches and simulation. Very few people are trying to use it to do hardware design.
• System Verilog grew out of Superlog, a proposed language that was based on Verilog and C. The basic core came from Verilog; C-like extensions were included to make the language more expressive and powerful. It was developed originally by the company Co-Design Automation and then standardized by Accellera, an organization aimed at standardizing EDA languages. Co-Design was purchased by Synopsys, and now Synopsys is the leading proponent of System Verilog.

1.2.3.3 VHDL vs SystemC

• SystemC looks like C — familiar syntax
• C is often used in algorithmic descriptions of circuits, so why not try to use it for synthesizable code as well?
• If you think VHDL is hard to synthesize, try C....
• SystemC simulation is slower than advertised
1.2.3.4 Summary of VHDL Evaluation

• VHDL is far from perfect and has lots of annoying characteristics
• VHDL is a better language for education than Verilog, because the static typechecking enforces good software engineering practices
• The richness of VHDL will be useful in creating concise high-level models and powerful testbenches

1.3 Overview of Syntax

This section is just a brief overview of the syntax of VHDL, focusing on the constructs that are most commonly used. For more information, read a book on VHDL and use online resources. (Look for "VHDL" under the "Documentation" tab in the E&CE 327 web pages.)

1.3.1 Syntactic Categories

There are five major categories of syntactic constructs. (There are many, many minor categories and subcategories of constructs.)

• Library units (section 1.3.2)
  – Top-level constructs (packages, entities, architectures)
• Concurrent statements (section 1.3.4)
  – Statements executed at the same time (in parallel)
• Sequential statements (section 1.3.7)
  – Statements executed in series (one after the other)
• Expressions
  – Arithmetic (section 1.10), Boolean, vectors, etc.
• Declarations
  – Components, signals, variables, types, functions, ....

1.3.2 Library Units

Library units are the top-level syntactic constructs in VHDL. They are used to define and include libraries, declare and implement interfaces, define packages of declarations, and otherwise bind together VHDL code.

• Package body
  – define the contents of a library
• Packages
  – determine which parts of the library are externally visible
• Use clause: use a library in an entity/architecture or in another package
  – technically, use clauses are part of entities and packages, but they precede the entity/package keyword, so we list them as top-level constructs
• Entity (section 1.3.3): defines the interface to a circuit
• Architecture (section 1.3.3): defines the internal signals and gates of a circuit

1.3.3 Entities and Architectures

Each hardware module is described with an entity/architecture pair (Figure 1.1).

Figure 1.1: Entity and architecture

• Entity: interface
  – names, modes (in / out), and types of the externally visible signals of the circuit
• Architecture: internals
  – structure and behaviour of the module

    library ieee;
    use ieee.std_logic_1164.all;

    entity and_or is
      port (
        a, b, c : in  std_logic;
        z       : out std_logic
      );
    end and_or;

Figure 1.2: Example of an entity
The syntax of VHDL is defined using a variation on Backus-Naur form (BNF).

    [ { use_clause } ]
    entity ENTITYID is
      [ port ( { SIGNALID : (in | out) TYPEID [ := expr ] ; } ); ]
      [ { declaration } ]
    [ begin
      { concurrent_statement } ]
    end [ entity ] ENTITYID ;

Figure 1.3: Simplified grammar of an entity

    architecture main of and_or is
      signal x : std_logic;
    begin
      x <= a AND b;
      z <= x OR (a AND c);
    end main;

Figure 1.4: Example of an architecture

    [ { use_clause } ]
    architecture ARCHID of ENTITYID is
      [ { declaration } ]
    begin
      [ { concurrent_statement } ]
    end [ architecture ] ARCHID ;

Figure 1.5: Simplified grammar of an architecture
1.3.4 Concurrent Statements

• Architectures contain concurrent statements
• Concurrent statements execute in parallel (Figure 1.6)
  – Concurrent statements make VHDL fundamentally different from most software languages
  – Hardware (gates) naturally executes in parallel; VHDL mimics the behaviour of real hardware
  – At each infinitesimally small moment of time, each gate:
    1. samples its inputs
    2. computes the value of its output
    3. drives the output

    architecture main of bowser is      architecture main of bowser is
    begin                               begin
      x1 <= a AND b;                      z  <= NOT x2;
      x2 <= NOT x1;                       x2 <= NOT x1;
      z  <= NOT x2;                       x1 <= a AND b;
    end main;                           end main;

Figure 1.6: The order of concurrent statements doesn't matter
conditional assignment
    . . . <= . . . when . . . else . . . ;
  • normal assignment (. . . <= . . .)
  • if-then-else style (uses when)

    c <= a+b    when sel=’1’ else
         a+c    when sel=’0’ else
         "0000";

selected assignment
    with . . . select
      . . . <= . . . when . . . | . . . ,
              . . . when . . . | . . . ,
              ...
              . . . when . . . | . . . ;
  • case/switch style assignment

    with color select
      d <= "00" when red,
           "01" when . . . ;

component instantiation
    . . . : . . . port map ( . . . => . . . , . . . );
  • use an existing circuit (section 1.3.5)

    add1 : adder port map( a => f, b => g, s => h, co => i );

for-generate
    . . . : for . . . in . . . generate
      ...
    end generate;
  • replicate some hardware

    bgen : for i in 1 to 7 generate
      b(i) <= a(7-i);
    end generate;

if-generate
    . . . : if . . . generate
      ...
    end generate;
  • conditionally create some hardware

    okgen : if optgoal /= fast generate
      result <= ((a and b) or (d and not e)) or g;
    end generate;
    fastgen : if optgoal = fast generate
      result <= ’1’;
    end generate;

process
    process . . .
    begin
      ...
    end process;
  • the body of a process is executed sequentially (sections 1.3.6, 1.6)

Figure 1.7: The most commonly used concurrent statements
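A small self-contained architecture combining two of the statements from Figure 1.7, a selected assignment and a conditional assignment, might look like the sketch below (the entity mux4 and its port names are hypothetical, not from the notes):

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity mux4 is
  port (
    sel        : in  std_logic_vector(1 downto 0);
    a, b, c, d : in  std_logic;
    en         : in  std_logic;
    z          : out std_logic
  );
end mux4;

architecture main of mux4 is
  signal m : std_logic;
begin
  -- selected assignment: case/switch style choice among the four inputs
  with sel select
    m <= a when "00",
         b when "01",
         c when "10",
         d when others;
  -- conditional assignment: if-then-else style gating on en
  z <= m when en = '1' else '0';
end main;
```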
1.3.5 Component Declarations and Instantiations

There are two different syntaxes for component declaration and instantiation. The VHDL-93 syntax is much more concise than the VHDL-87 syntax, but not all tools support it. For E&CE 327, some of the tools that we use do not support the VHDL-93 syntax, so we are stuck with the VHDL-87 syntax.

1.3.6 Processes

• Processes are used to describe complex and potentially unsynthesizable behaviour
• A process is a concurrent statement (section 1.3.4)
• The body of a process contains sequential statements (section 1.3.7)
• Processes are the most complex and difficult-to-understand part of VHDL (sections 1.5 and 1.6)

    process (a, b, c)               process
    begin                           begin
      y <= a AND b;                   y <= a AND b;
      if (a = ’1’) then               z <= ’0’;
        z1 <= b AND c;                wait until rising_edge(clk);
        z2 <= NOT c;                  if (a = ’1’) then
      else                              z <= ’1’;
        z1 <= b OR c;                   y <= ’0’;
        z2 <= c;                        wait until rising_edge(clk);
      end if;                         else
    end process;                        y <= a OR b;
                                      end if;
                                    end process;

Figure 1.8: Examples of processes

• Processes must have either a sensitivity list or at least one wait statement on each execution path through the process.
• Processes cannot have both a sensitivity list and a wait statement.

Sensitivity List

The sensitivity list contains the signals that are read in the process. A process is executed when a signal in its sensitivity list changes value.
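The difference between the two instantiation styles mentioned in section 1.3.5 can be sketched as follows (the adder entity and the signal names f, g, h, i are assumed to exist; this is an illustration, not code from the course):

```vhdl
-- VHDL-87 style: declare a component, then instantiate the component
architecture v87 of top is
  component adder
    port ( a, b : in std_logic; s, co : out std_logic );
  end component;
begin
  add1 : adder port map ( a => f, b => g, s => h, co => i );
end v87;

-- VHDL-93 style: instantiate the entity directly; no component declaration
architecture v93 of top is
begin
  add1 : entity work.adder port map ( a => f, b => g, s => h, co => i );
end v93;
```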
An important coding guideline, to ensure consistent synthesis and simulation results, is to include in the sensitivity list all signals that are read in the process. If you forget some signals, you will end up either with unpredictable hardware and simulation results (different results from different programs) or with undesirable hardware (latches where you expected purely combinational hardware). For more on this topic, see sections 1.5.2 and 1.6.

There is one exception to this rule: for a process that implements a flip-flop with an if rising_edge statement, it is acceptable to include only the clock signal in the sensitivity list. Other signals may be included, but are not needed.

    [ PROCLAB : ] process ( sensitivity_list )
      [ { declaration } ]
    begin
      { sequential_statement }
    end process [ PROCLAB ] ;

Figure 1.9: Simplified grammar of a process

1.3.7 Sequential Statements

Sequential statements are used inside processes and functions.

wait                 wait until . . . ;
signal assignment    . . . <= . . . ;
if-then-else         if . . . then . . . elsif . . . end if;
case                 case . . . is
                       when . . . | . . . => . . . ;
                       when . . . => . . . ;
                     end case;
loop                 loop . . . end loop;
while loop           while . . . loop . . . end loop;
for loop             for . . . in . . . loop . . . end loop;
next                 next . . . ;

Figure 1.10: The most commonly used sequential statements
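Returning to the sensitivity-list guideline: a minimal sketch of the pitfall, with hypothetical signal names. In the first process, b is read but missing from the sensitivity list, so a change on b alone does not re-execute the process, and simulation can disagree with the synthesized combinational hardware.

```vhdl
-- BAD: b is read in the process but missing from the sensitivity list
process (a)
begin
  z <= a AND b;
end process;

-- GOOD: every signal that is read appears in the sensitivity list
process (a, b)
begin
  z <= a AND b;
end process;
```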
1.3.8 A Few More Miscellaneous VHDL Features

Some constructs that are useful and will be described in later chapters and sections:

report : print a message on stderr while simulating
assert : make assertions about the behaviour of signals; very useful with report statements
generics : parameters to an entity that are defined at elaboration time
attributes : predefined functions for different datatypes, for example the high and low indices of a vector

1.4 Concurrent vs Sequential Statements

All concurrent assignments can be translated into sequential statements, but not all sequential statements can be translated into concurrent statements.

1.4.1 Concurrent Assignment vs Process

The two code fragments below have identical behaviour:

    architecture main of tiny is    architecture main of tiny is
    begin                           begin
      b <= a;                         process (a) begin
    end main;                           b <= a;
                                      end process;
                                    end main;

1.4.2 Conditional Assignment vs If Statement

The two code fragments below have identical behaviour:

    Concurrent statement:           Sequential statements:
    t <= <val1> when <cond>         if <cond> then
         else <val2>;                 t <= <val1>;
                                    else
                                      t <= <val2>;
                                    end if;
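To illustrate the report, assert, and attribute constructs listed in section 1.3.8, here is a hedged sketch (the entity demo and the signals clk, req, gnt are hypothetical):

```vhdl
architecture sketch of demo is
  signal v : std_logic_vector(7 downto 0);
  -- attributes: v'high evaluates to 7 and v'low evaluates to 0
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- assert with report: print a message if req and gnt
      -- are ever asserted in the same clock cycle
      assert not (req = '1' and gnt = '1')
        report "req and gnt both asserted"
        severity error;
    end if;
  end process;
end sketch;
```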
1.4.3 Selected Assignment vs Case Statement

The two code fragments below have identical behaviour:

    Concurrent statement:             Sequential statements:
    with <expr> select                case <expr> is
      t <= <val1> when <choices1>,      when <choices1> => t <= <val1>;
           <val2> when <choices2>,      when <choices2> => t <= <val2>;
           <val3> when <choices3>;      when <choices3> => t <= <val3>;
                                      end case;

1.4.4 Coding Style

Some code is easy to write with sequential statements but difficult with concurrent statements:

Sequential statements:

    case <expr> is
      when <choice1> =>
        if <cond> then
          o <= <expr1>;
        else
          o <= <expr2>;
        end if;
      when <choice2> =>
        ...
    end case;

Concurrent statements. Overall structure:

    with <expr> select
      t <= ... when <choice1>,
           ... when <choice2>;

Failed attempt:

    with <expr> select
      t <= -- want to write:
           --   <val1> when <cond>
           --   else <val2>
           -- but conditional assignment
           -- is illegal here
           when c1,
           ... when c2;

Concurrent statement with correct behaviour, but messy:

    t <= <expr1> when (expr = <choice1> AND <cond>) else
         <expr2> when (expr = <choice1> AND NOT <cond>) else
         . . . ;
1.5 Overview of Processes

Processes are the most difficult VHDL construct to understand. This section gives an overview of processes. Section 1.6 gives the details of the semantics of processes.

• Within a process, statements are executed almost sequentially
• Among processes, execution is done in parallel
• Remember: a process is a concurrent statement!

    entity ENTITYID is
      interface declarations
    end ENTITYID;

    architecture ARCHID of ENTITYID is
    begin
      concurrent statements           <==
      process begin
        sequential statements         <==
      end process;
      concurrent statements           <==
    end ARCHID;

Figure 1.11: Sequential statements in a process

Key concepts in VHDL semantics for processes:

• VHDL mimics hardware
• Hardware (gates) executes in parallel
• Processes execute in parallel with each other
• All possible orders of executing processes must produce the same simulation results (waveforms)
• If a signal is not assigned a value, then it holds its previous value

All orders of executing concurrent statements must produce the same waveforms. It doesn't matter whether you are running on a single-threaded operating system, on a multithreaded operating system, on a massively parallel supercomputer, or on a special hardware emulator with one FPGA chip per VHDL process: all simulations must be the same.

These concepts are the motivation for the semantics of executing processes in VHDL (section 1.6) and lead to the phenomenon of latch inference (section 1.5.2).
Figure 1.12 shows different execution sequences for two processes, procA (statements A1, A2, A3) and procB (statements B1, B2): a single-threaded simulator may run procA before procB or procB before procA, while a multithreaded simulator may run procA and procB in parallel.

Figure 1.12: Different process execution sequences

Figure 1.13: All execution orders must have the same behaviour

Sections 1.5.1–1.5.3 discuss the hardware generated by processes. Sections 1.6–1.6.7 discuss the behaviour and execution of processes.
1.5.1 Combinational Processes vs Clocked Processes

Each well-written synthesizable process is either combinational or clocked. Some synthesizable processes that do not conform to our coding guidelines are both combinational and clocked. For example, in a flip-flop with an asynchronous reset, the output is a combinational function of the reset signal and a clocked function of the data input signal. We will deal only with processes that follow our coding conventions, and so we will continue to say that each process is either combinational xor clocked.

Combinational process:

• Executing the process takes part of one clock cycle
• Target signals are outputs of combinational circuitry
• A combinational process must have a sensitivity list
• A combinational process must not have any wait statements
• A combinational process must not have any rising_edge or falling_edge conditions
• The hardware for a combinational process is just combinational circuitry

Clocked process:

• Executing the process takes one (or more) clock cycles
• Target signals are outputs of flops
• The process contains one or more wait or if rising_edge statements
• The hardware contains combinational circuitry and flip-flops

Note: clocked processes are sometimes called "sequential processes", but this is easily confused with "sequential statements", so in E&CE 327 we'll refer to synthesizable processes as either "combinational" or "clocked".

Example Processes

Combinational process:

    process (a, b, c)
    begin
      p1 <= a;
      if (b = c) then
        p2 <= b;
      else
        p2 <= a;
      end if;
    end process;
Clocked processes:

    process
    begin
      wait until rising_edge(clk);
      b <= a;
    end process;

    process (clk)
    begin
      if rising_edge(clk) then
        b <= a;
      end if;
    end process;

1.5.2 Latch Inference

The semantics of VHDL require that if a signal is assigned a value on some passes through a process and not on other passes, then on a pass through the process when the signal is not assigned a value, it must maintain its value from the previous pass.

    process (a, b, c)
    begin
      if (a = ’1’) then
        z1 <= b;
        z2 <= b;
      else
        z1 <= c;
      end if;
    end process;

Figure 1.14: Example of latch inference (z2 is not assigned on the else branch, so a latch is inferred for z2)

When a signal's value must be stored, VHDL infers a latch or a flip-flop in the hardware to store the value. If you want a latch or a flip-flop for the signal, then latch inference is good. If you want combinational circuitry, then latch inference is bad.

Loop, Latch, Flop
(Schematics for a latch, a flip-flop, and a combinational loop, each with data input b, control input a, and output z, are shown in the slides.)

Question: Write VHDL code for each of the above circuits.

Answer:

combinational loop:

    if a = ’1’ then
      z <= b;
    else
      z <= z;
    end if;

latch:

    if a = ’1’ then
      z <= b;
    end if;

flop:

    if rising_edge(a) then
      z <= b;
    end if;

Causes of Latch Inference

Usually, "latch inference" refers to the unintentional creation of latches. The most common cause of unintended latch inference is missing assignments to signals in if-then-else and case statements. Latch inference happens during elaboration. When using the Synopsys tools, look for "Inferred memory devices" in the output or log files.
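A common way to avoid unintended latch inference, sketched here with the signal names from Figure 1.14, is to give the signal a default assignment at the top of the combinational process so that every execution path assigns it a value:

```vhdl
process (a, b, c)
begin
  z2 <= c;          -- default assignment: no path leaves z2 unassigned
  if (a = '1') then
    z1 <= b;
    z2 <= b;        -- overrides the default on this path
  else
    z1 <= c;
  end if;
end process;
```

With the default assignment, z2 is combinational rather than latched.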
1.5.3 Combinational vs Flopped Signals

Signals assigned to in combinational processes are combinational. Signals assigned to in clocked processes are outputs of flip-flops.

1.6 Details of Process Execution

In this section we go through the detailed semantics of how processes execute. These semantics form the foundation for the simulation and synthesis of VHDL. The semantics define the simulation behaviour, and the duty of synthesis is to produce hardware that has the same behaviour as the simulation of the original VHDL code.

1.6.1 Simple Simulation

Before diving into the details of processes, we briefly review gate-level simulation with a simple example, which we will then explore in excruciating detail through the semantics of VHDL. With knowledge of just basic gate-level behaviour, we simulate the circuit (c = a AND b; d = NOT c; e = b AND d) with waveforms for a and b, and calculate the behaviour for c, d, and e. (The schematic and waveforms, with events at 0 ns, 10 ns, 12 ns, and 15 ns, are shown in the slides.)

Different Programs, Same Behaviour

There are many different VHDL programs that will synthesize to this circuit. Three examples are:
    process (a, b)          process (a, b, c, d)    process (a, b)
    begin                   begin                   begin
      c <= a and b;           c <= a and b;           c <= a and b;
    end process;              d <= not c;           end process;
                              e <= b and d;
    process (b, c, d)       end process;            process (c)
    begin                                           begin
      d <= not c;                                     d <= not c;
      e <= b and d;                                 end process;
    end process;
                                                    process (b, d)
                                                    begin
                                                      e <= b and d;
                                                    end process;

The goal of the VHDL semantics is that all of these programs will have the same behaviour. The two main challenges in making this happen are: a value change on a signal must propagate instantaneously, and all gates must operate in parallel. We will return to these points in section 1.6.3.

1.6.2 Temporal Granularities of Simulation

There are several different granularities of time at which to analyze VHDL behaviour. In this course, we will discuss three major granularities: clock cycles, timing simulation, and "delta cycles".

clock-cycle simulation
• smallest unit of time is a clock cycle
• combinational logic has zero delay
• flip-flops have a delay of one clock cycle
• used for simulation early in the design cycle
• fastest simulation run times

timing simulation
• smallest unit of time is a nano-, pico-, or femtosecond
• combinational logic and wires have delays as computed by timing analysis tools
• flip-flops have setup, hold, and clock-to-Q timing parameters
• used for simulation when fine-tuning the design and confirming that timing constraints are satisfied
• slow simulation times for large circuits

delta-cycle simulation
• units of time are artifacts of VHDL semantics and simulation software
• simulation cycles, delta cycles, and simulation steps are infinitesimally small amounts of time
• VHDL semantics are defined in terms of these concepts

In assignments and exams, you will need to be able to simulate VHDL code at each of the three different levels of temporal granularity. In the laboratories and project, you will use simulation
programs for both clock-cycle simulation and timing simulation. We don't have access to a program that will produce delta-cycle waveforms, but if anyone is looking for a challenging co-op job or fourth-year design project....

For the remainder of section 1.6, we'll look at only the delta-cycle view of the world.

1.6.3 Intuition Behind Delta-Cycle Simulation

Zero-delay simulation might appear to be simpler than simulation with delays through gates (timing simulation), but in reality, zero-delay simulation algorithms are more complicated than algorithms for timing simulation. The reason is that in zero-delay simulation, a sequence of dependent events must appear to happen instantaneously (in zero time). In particular, the effect of an event must propagate instantaneously through the combinational circuitry.

Two fundamental rules for zero-delay simulation:

1. events appear to propagate through combinational circuitry instantaneously
2. all of the gates appear to operate in parallel

To make it appear that events propagate instantaneously, VHDL introduces an artificial unit of time, the delta cycle, to represent an infinitesimally small amount of time. In each delta cycle, every gate in the circuit will sample its inputs, compute its result, and drive its output signal with the result.

Because software executes serially, a simulator cannot run multiple gates in parallel. Instead, the simulator must simulate the gates one at a time but make the waveforms appear as if all of the gates were simulated in parallel. In each delta cycle, the simulator will simulate any gate whose input changed in the previous delta cycle. To preserve the illusion that the gates ran in parallel, the effect of simulating a gate remains invisible until the end of the delta cycle.

1.6.4 Definitions and Algorithm

1.6.4.1 Process Modes

An architecture contains a set of processes.
Each process is in one of the following modes: active, suspended, or postponed.

Note: this use of the word "postponed" differs from that in the VHDL Standard; we won't be using postponed processes as defined in the Standard. "Postponed" in our terminology is a synonym for some operating systems' usage of "ready" to describe a process that is ready to execute.
• Suspended
  – Nothing to currently execute
  – A process stays suspended until the event that it is waiting for occurs: either a change in a signal on its sensitivity list or the condition in a wait statement
• Postponed
  – Wants to execute, but not currently active
  – A process stays postponed until the simulator chooses it from the pool of postponed processes
• Active
  – Currently executing
  – A process stays active until it hits a wait statement or sensitivity list, at which point it suspends

Figure 1.15: Process modes (state diagram: active goes to suspended on "suspend", suspended goes to postponed on "resume", and postponed goes to active on "activate")

1.6.4.2 Simulation Algorithm

The algorithm presented here is a simplification of the actual algorithm in Section 12.6 of the VHDL Standard. The most significant simplification is that this algorithm does not support delayed assignments. To support delayed assignments, each signal's provisional value would be generalized to an event wheel, which is a list containing the times and values of multiple provisional assignments in the future.

A somewhat ironic note: only six of the two hundred pages in the VHDL Standard are devoted to the semantics of executing processes.

The Algorithm

Simulation starts at step 1 with all processes postponed and all signals at a default value (e.g., ’U’ for std_logic).
1. While there are postponed processes:
   (a) Pick one or more postponed processes to execute (they become active).
   (b) As a process executes, assignments to signals are provisional: new values do not become visible until step 3.
   (c) A process executes until it hits its sensitivity list or a wait statement, at which point it suspends. At a wait statement, the process will suspend even if the condition is true during the current simulation cycle.
   (d) Processes that become suspended stay suspended until there are no more postponed or active processes.
2. Each process looks at the signals that changed value (provisional value differs from visible value) and at the simulation time. If a signal in a process's sensitivity list changed value, or if the wait condition on which a process is suspended became true, then the process resumes (becomes postponed).
3. Each signal that changed value is updated with its provisional value (the provisional value becomes visible).
4. If there are no postponed processes, then increment simulation time to the next scheduled event.

Note: in n-threaded execution, at most n processes are active at a time.
1.6.4.3 Delta-Cycle Definitions

Definition (simulation step): executing one sequential assignment or process mode change.

Definition (simulation cycle): the operations that occur in one iteration of the simulation algorithm.

Definition (delta cycle): a simulation cycle that does not advance simulation time. Equivalently: a simulation cycle with zero-delay assignments where an assignment causes a process to resume.

Definition (simulation round): a sequence of simulation cycles that all have the same simulation time. Equivalently: a contiguous sequence of zero or more delta cycles followed by a simulation cycle that increments time (i.e., a simulation cycle that is not a delta cycle).

Note: official and unofficial terminology. "Simulation cycle" and "delta cycle" are official definitions in the VHDL Standard. "Simulation step" and "simulation round" are not standard definitions; they are used in E&CE 327 because we need words to associate with the concepts that they describe.
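A minimal sketch of how delta cycles arise (the signal names are hypothetical): with zero-delay concurrent assignments, a change on a ripples through x and then y in two successive delta cycles, all at the same simulation time.

```vhdl
-- two zero-delay concurrent assignments forming a dependency chain
x <= a;   -- a changes: x gets its provisional value in delta cycle 1
y <= x;   -- x's update resumes this assignment: y updates in delta cycle 2
-- simulation time does not advance between the two delta cycles,
-- so on a waveform at the nanosecond scale both changes appear simultaneous
```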
1.6.5 Example 1: Process Execution (Bamboozle)

This example (Bamboozle) and the next example (Flummox, section 1.6.6) are very similar. The VHDL code for the two circuits is slightly different, but the hardware that is generated is the same. The stimulus for signals a and b also differs.

    entity bamboozle is
    begin
    end bamboozle;

    architecture main of bamboozle is
      signal a, b, c, d, e : std_logic;
    begin
      procA : process (a, b)
      begin
        c <= a AND b;
      end process;

      procB : process (b, c, d)
      begin
        d <= NOT c;
        e <= b AND d;
      end process;

      procC : process
      begin
        a <= ’0’;
        b <= ’1’;
        wait for 10 ns;
        a <= ’1’;
        wait for 2 ns;
        b <= ’0’;
        wait for 3 ns;
        a <= ’0’;
        wait for 20 ns;
      end process;
    end main;

Figure 1.16: Example Bamboozle circuit for process execution
The slides step through the simulation of Bamboozle, tracking for each simulation cycle the mode of each process (P = postponed, A = active, S = suspended) and the visible and provisional values of signals a through e. The steps on this page are: initial conditions; Step 1(a): activate procA; Step 1(c): suspend procA; Step 1(a): activate procC; Step 1(b): provisional assignment to a; Step 1(b): provisional assignment to b. (Trace tables shown in slides, not in notes.)
Steps on this page: Step 1(a): activate procB; Step 1(b): provisional assignment to d; Step 1(b): provisional assignment to e; Step 1(c): suspend procB; all processes suspended; Step 3: update signal values. (Trace tables shown in slides, not in notes.)
Steps on this page: Step 3: update signal values; Step 4: simulation time remains at 0 ns, so this is a delta cycle. (Trace tables shown in slides, not in notes.)
Steps on this page: Step 1(a): activate procA; Step 1(b): provisional assignment to c; Step 1(c): suspend procA; Step 1(a): activate procB; Step 1(b): provisional assignments to d and e; Step 1(c): suspend procB; all processes suspended; Step 3: update signal values; Step 4: simulation time remains at 0 ns (delta cycle); compact simulation cycle; begin next simulation cycle; Step 1(a): activate procB; Step 1(b): provisional assignments to d and e; Step 1(c): suspend procB; all processes suspended. (Trace tables shown in slides, not in notes.)
Steps on this page: all processes suspended; Step 3: update signal values; compact simulation cycle; begin next simulation cycle; Step 1(a): activate procB; Step 1(b): provisional assignments to d and e; Step 1(c): suspend procB. (Trace tables shown in slides, not in notes.)
Steps on this page: Step 1(c): suspend procB; Step 3: update signal values; compact simulation cycle. (Trace tables shown in slides, not in notes.)
Steps on this page: begin next simulation cycle; Step 1: no postponed processes; compact simulation cycle. (Trace tables shown in slides, not in notes.)
  • 61.
    1.6.5 Example 1:Process Execution (Bamboozle) 39 Begin next simulation cycle (Shown in slides, not in notes) Step 1(a): Activate procC (Shown in slides, not in notes) Step 1(b): Provisional assignment to a (Shown in slides, not in notes) Step 1(c): Suspend procC (Shown in slides, not in notes) Step 2: Check sensitivity list; resume processes (Shown in slides, not in notes) Step 3: Update signal values (Shown in slides, not in notes) P procA: process (a, b) begin 1 c <= a AND b; a 0c 1d 1 end process; e 1 procB: process (b, c, d) begin b S 0ns 10ns d <= NOT c; e <= b AND d; sim round B E B end process; sim cycle B E B E B E B E B procC: process begin delta cycle B E B E B E B a <= ’0’; procA P P P b <= ’1’; procB P P P P wait for 10 ns; procC P P A S a <= ’1’; a U U 0 1 S wait for 2 ns; b <= ’0’; b U U 1 wait for 3 ns; c U U U 0 a <= ’0’; wait for 20 ns; d U U U U 1 end process; e U U U U U 1 Step 3: Update signal values Compact simulation cycle (Shown in slides, not in notes)
  • 62.
1.6.6 Example 2: Process Execution (Flummox)

This example is a variation of the Bamboozle example from section 1.6.5.

    entity flummox is
    begin
    end flummox;

    architecture main of flummox is
      signal a, b, c, d, e : std_logic;
    begin
      proc1 : process (a, b, c) begin
        c <= a AND b;
        d <= NOT c;
      end process;

      proc2 : process (b, d) begin
        e <= b AND d;
      end process;

      proc3 : process begin
        a <= '1';
        b <= '0';
        wait for 3 ns;
        b <= '1';
        wait for 99 ns;
      end process;
    end main;

Figure 1.17: Example flummox circuit for process execution

(Delta-cycle trace: simulation rounds at 0 ns, 3 ns, and 102 ns, with up to three delta cycles per round, showing proc1, proc2, proc3 and the signals a, b, c, d, e.)

To get a more natural view of the behaviour of the signals, we draw just the waveforms and use a timescale of nanoseconds plus delta cycles:

(Waveform diagram: a, b, c, d, e at nanosecond-plus-delta-cycle resolution.)

Finally, we draw the behaviour of the signals using the standard time scale of nanoseconds. Notice that the delta cycles within a simulation round all collapse to the left, so the signals change value exactly at the nanosecond boundaries. Also, the glitch on e disappears.

Answer:

(Waveform diagram: a, b, c, d, e on a nanosecond time scale from 0 ns to 102 ns.)

Note and Questions ..................................................................

Note: If a signal is updated with the same value it had in the previous simulation cycle, then it does not change, and therefore does not trigger processes to resume.

Question: What are the different granularities of time that occur when doing delta-cycle simulation?
Answer: simulation step, delta cycle, simulation cycle, simulation round.

Question: What is the order of granularity, from finest to coarsest, amongst the different granularities related to delta-cycle simulation?

Answer: Same order as listed just above.

Note: Delta cycles have a finer granularity than simulation cycles, because delta cycles do not advance time, while simulation cycles that are not delta cycles do advance time.

1.6.7 Example: Need for Provisional Assignments

This is an example of processes where updating signals during a simulation cycle leads to different results for different process execution orderings.

    architecture main of swindle is
    begin
      p_c: process (a, b) begin
        c <= a AND b;
      end process;

      p_d: process (a, c) begin
        d <= a XOR c;
      end process;
    end main;

(Schematic: a and b feed an AND gate that drives c; a and c feed an XOR gate that drives d.)

Figure 1.18: Circuit to illustrate need for provisional assignments

1. Start with all signals at '0'.
2. Simultaneously change to a = '1' and b = '1'.
If assignments are not visible within the same simulation cycle (correct: i.e. provisional assignments are used):

(Waveform diagrams for the two scheduling orders, starting from all signals at '0' with a and b changing to '1'.)

If p_c is scheduled before p_d, then d will have a '1' pulse. If p_d is scheduled before p_c, then d will have a '1' pulse.

If assignments are visible within the same simulation cycle (incorrect):

(Waveform diagrams for the two scheduling orders.)

If p_c is scheduled before p_d, then d will stay constant at '0'. If p_d is scheduled before p_c, then d will have a '1' pulse.

With provisional assignments, both orders of scheduling processes result in the same behaviour on all signals. Without provisional assignments, different scheduling orders result in different behaviour.
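The four scenarios above can be reproduced executably. The sketch below is Python rather than VHDL, and the function names (step_provisional, step_immediate) are invented for illustration; it evaluates p_c and p_d in both scheduling orders for the simulation cycle in which a and b change to '1'.

```python
def step_provisional(sig, order):
    """One simulation cycle with provisional assignments: every process
    reads the old signal values; all updates are applied at the end."""
    new = dict(sig)
    for proc in order:
        if proc == "p_c":
            new["c"] = sig["a"] & sig["b"]   # c <= a AND b
        else:
            new["d"] = sig["a"] ^ sig["c"]   # d <= a XOR c (reads OLD c)
    return new

def step_immediate(sig, order):
    """Incorrect scheme: each assignment is visible immediately
    within the same simulation cycle."""
    sig = dict(sig)
    for proc in order:
        if proc == "p_c":
            sig["c"] = sig["a"] & sig["b"]
        else:
            sig["d"] = sig["a"] ^ sig["c"]   # may read NEW c, depending on order
    return sig

start = {"a": 1, "b": 1, "c": 0, "d": 0}   # just after a and b change to '1'

# With provisional assignments, both orders agree: d pulses to '1'.
p1 = step_provisional(start, ["p_c", "p_d"])
p2 = step_provisional(start, ["p_d", "p_c"])
assert p1 == p2 and p1["d"] == 1

# With immediate visibility, the two orders disagree.
i1 = step_immediate(start, ["p_c", "p_d"])   # d = 1 XOR new c (=1) = 0: no pulse
i2 = step_immediate(start, ["p_d", "p_c"])   # d = 1 XOR old c (=0) = 1: pulse
assert i1["d"] == 0 and i2["d"] == 1
```

The asserts mirror the waveforms: only the provisional-assignment discipline makes the result independent of the scheduling order.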
1.6.8 Delta-Cycle Simulations of Flip-Flops

This example illustrates the delta-cycle simulation of a flip-flop. Notice how the delta-cycle simulation captures the expected behaviour of the flip-flop: the signal q changes at the same time (10 ns) as the rising edge on the clock.

    p_clk : process begin
      clk <= '0';
      wait for 10 ns;
      clk <= '1';
      wait for 10 ns;
    end process;

    p_a : process begin
      a <= '0';
      wait for 15 ns;
      a <= '1';
      wait for 20 ns;
    end process;

    flop : process ( clk ) begin
      if rising_edge( clk ) then
        q <= a;
      end if;
    end process;

(Delta-cycle trace at 0 ns, 0 ns+1δ, 10 ns, 15 ns, 20 ns, 30 ns, and 35 ns for p_a, p_clk, flop and the signals a, clk, q.)

Redraw with Normal Time Scale .......................................................

To clarify the behaviour, we redraw the same simulation using a normal time scale.

(Waveform diagram: a, clk, q from 0 ns to 35 ns.)
Back-to-Back Flops ..................................................................

In the previous simulation, the input to the flip-flop (a) changed several nanoseconds before the rising edge on the clock. In zero-delay simulation, the output of a flip-flop changes exactly on the rising edge of the clock. This means that the input to the next flip-flop will change at exactly the same time as a rising edge. This example illustrates how delta-cycle simulation handles the situation correctly.

    p_clk : process begin
      clk <= '0';
      wait for 10 ns;
      clk <= '1';
      wait for 10 ns;
    end process;

    p_a : process begin
      a <= '0';
      wait for 15 ns;
      a <= '1';
      wait for 20 ns;
    end process;

    flops : process ( clk ) begin
      if rising_edge( clk ) then
        q1 <= a;
        q2 <= q1;
      end if;
    end process;

(Delta-cycle trace at 10 ns, 15 ns, 20 ns, 30 ns, and 35 ns for p_a, p_clk, flops and the signals a, clk, q1, q2.)

Redraw with Normal Time Scale .......................................................

To clarify the behaviour, we redraw the same simulation using a normal time scale.

(Waveform diagram: a, clk, q1, q2 from 0 ns to 35 ns.)
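The key point, that a clocked process reads the values its inputs had just before the edge, can be sketched executably. This is Python rather than VHDL, and the helper name rising_edge_update is invented for illustration:

```python
def rising_edge_update(state, a):
    """One rising clock edge: both flops in the back-to-back pair update
    from PRE-edge values, so q2 gets the old q1, not the new one."""
    old_q1 = state["q1"]
    state = dict(state)
    state["q1"] = a        # q1 <= a
    state["q2"] = old_q1   # q2 <= q1 (previous value of q1)
    return state

state = {"q1": "U", "q2": "U"}
state = rising_edge_update(state, 0)   # edge at 10 ns: a = '0'
assert state == {"q1": 0, "q2": "U"}
state = rising_edge_update(state, 1)   # edge at 30 ns: a = '1'
assert state == {"q1": 1, "q2": 0}
```

If the flops instead read the already-updated q1, both outputs would change together and the second flop would be useless as a pipeline stage.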
External Inputs and Flops ...........................................................

In our work so far with delta-cycle simulation, we have worked through the mechanics of simulation. This example applies knowledge of delta-cycle simulation at a conceptual level. We could answer the question by thinking about the semantics of delta-cycle simulation or by mechanically doing the simulation.

Question: Do the signals b1 and b2 have the same behaviour from 20–30 ns?

    architecture mathilde of sauve is
      signal clk, a1, a2, b1, b2 : std_logic;
    begin
      process begin
        clk <= '1';
        wait for 10 ns;
        clk <= '0';
        wait for 10 ns;
      end process;

      process begin
        wait for 20 ns;
        a1 <= '1';
      end process;

      process begin
        wait until rising_edge(clk);
        a2 <= '1';
      end process;

      process begin
        wait until rising_edge( clk );
        b1 <= a1;
        b2 <= a2;
      end process;
    end architecture;

Answer: The signals b1 and b2 will have the same behaviour if a1 and a2 have the same behaviour. The difference in the code between a1 and a2 is that a1 is waiting for 20 ns and a2 is waiting until a rising edge of the clock. There is a rising edge of the clock at 20 ns, so we might be tempted to conclude (incorrectly) that both a1 and a2 transition from 'U' to '1' at exactly 20 ns and therefore have exactly the same behaviour.
The difference between the behaviour of a1 and a2 is that in the first simulation cycle for 20 ns, the process for a1 becomes postponed, while the process for a2 becomes postponed only after the rising edge of the clock.

The signal a1 is waiting for 20 ns, so in the first simulation cycle for 20 ns, the process for a1 becomes postponed. In the second simulation cycle for 20 ns, the clock toggles from 0 to 1 and a1 toggles from 'U' to 1. The rising edge on the clock causes the processes for a2, b1, and b2 to become postponed. In the third simulation cycle for 20 ns:

• a2 toggles from 'U' to 1.
• b1 sees the value of 1 for a1, because a1 became 1 in the first simulation cycle.
• b2 sees the old value of 'U' for a2, because the process for a2 did not run in the second simulation cycle.

(Delta-cycle trace at 0 ns, 10 ns, 20 ns, 20 ns+1δ, 20 ns+2δ, and 30 ns for proc_clk, proc_a1, proc_a2, proc_b and the signals clk, a1, a2, b1, b2.)
Testbenches and Clock Phases ........................................................

    env : process begin
      a   <= '1';
      clk <= '0';
      wait for 10 ns;
      a   <= '0';
      clk <= '1';
      wait for 10 ns;
    end process;

    flop : process ( clk ) begin
      if rising_edge( clk ) then
        q1 <= a;
      end if;
    end process;

(Delta-cycle trace at 0 ns, 0 ns+1δ, 10 ns, and 20 ns for env and flop and the signals a, clk, q1.)

Redraw with Normal Time Scale .......................................................

(Waveform diagram: a, clk, q1 from 0 ns to 20 ns.)
Note: Testbench signals
For consistent results across different simulators, simulation scripts vs test benches, and timing simulation vs zero-delay simulation, do not change signals in your testbench or script at the same time as the clock changes.

(Three waveform panels from 0 ns to 60 ns, each showing a, clk, and q1:
1. a is the output of a clocked or combinational process.
2. a is the output of a timed process (testbench or environment) and changes at the same time as the clock: POOR DESIGN.
3. a is the output of a timed process (testbench or environment) and changes away from the clock edge: GOOD DESIGN.)
1.7 Register-Transfer-Level Simulation

Delta-cycle simulation is very tedious for both humans and computers. For many circuits, the complexity of delta-cycle simulation is not needed, and register-transfer-level simulation, which is much simpler, can be used instead.

The major complexities of delta-cycle simulation come from running a process multiple times within a single simulation round and keeping track of the modes of the processes. Register-transfer-level simulation avoids both of these complexities. By evaluating each signal only once per simulation round, an entire simulation round can be reduced to a single column in a timing diagram. The disadvantage of register-transfer-level simulation is that it does not work for all VHDL programs; in particular, it does not support combinational loops.

(Side-by-side comparison of the Flummox example: the delta-cycle simulation, with simulation rounds at 0 ns, 3 ns, and 102 ns and up to three delta cycles per round, versus the RTL simulation, in which each simulation round is a single column. Signals a, b, c, d, e.)

1.7.1 Overview

In delta-cycle simulations, we often simulated the same process multiple times within the same simulation round. In looking at the circuit, though, we can mentally calculate the output value by evaluating each gate only once per simulation round. For both humans and computers (or the humans waiting for results from computers), it is desirable to avoid the wasted work of simulating a gate when the output will remain at 'U' or will change again later in the same simulation round.

In register-transfer-level simulation, we evaluate each gate only once per simulation round. Register-transfer-level simulation is simpler and faster than delta-cycle simulation, because it avoids delta cycles and provisional assignments.
In delta-cycle simulation, we evaluate a gate multiple times in a single simulation round if the process that drives the gate is active in multiple simulation cycles, which happens when the process is triggered in multiple simulation cycles. To avoid this, we must evaluate a signal only after all of the signals that it depends on have stable values; that is, the signals will not change value later in the simulation round.

A combinational loop is a circuit that contains a cyclic path through the circuit that includes only combinational gates. Combinational loops can cause signals to oscillate, which, in delta-cycle simulation with zero-delay assignments, corresponds to an infinite sequence of delta cycles. We immediately see that when doing zero-delay simulation of a combinational loop such as a <= not(a);, the change on a will trigger the process to re-run and re-evaluate a an infinite number of times. Hence, register-transfer-level simulation does not support combinational loops.

To make register-transfer simulation work, we preprocess the VHDL program and transform it so that each process is dependent upon only those processes that appear before it. This dependency ordering is called topological ordering. If a circuit has combinational loops, we cannot sort the processes into a topological order.

The register-transfer level is a coarser level of temporal abstraction than the delta-cycle level. In delta-cycle simulation, many delta cycles can elapse without an increment in real time (e.g. nanoseconds). In register-transfer-level simulation, all of the events that take place in the same moment of real time take place at the same moment in the simulation. In other words, all of the events that take place at the same time are drawn in the same column of the waveform diagram.

Register-transfer-level simulation can be done for any legal VHDL code, either synthesizable or unsynthesizable, so long as the code does not contain combinational loops. For any piece of VHDL code without combinational loops, the register-transfer-level simulation and the delta-cycle simulation will have the same value for each signal at the end of each simulation round.

By sorting the processes in topological order, when we execute a process, all of the signals that the process depends on will have already been evaluated, and so we know that we are reading the final, stable values that each signal will have for that moment in time. This is good, because for most processes, we want to read the most recent values of signals.
The exceptions are timed processes that are dependent upon other timed processes running at the same moment in time, and clocked processes that are dependent upon other clocked processes.

    process begin
      a <= '0';
      wait for 10 ns;
      a <= '1';
      ...
    end process;

    process begin
      b <= '0';
      wait for 10 ns;
      b <= a;
      ...
    end process;

Question: In this code, what value should b have at 10 ns?

Answer: Both processes will execute in the same simulation cycle at 10 ns. The statement b <= a will see the value of a from the previous simulation cycle, which is before a <= '1'; is evaluated. The signal b will be '0' at 10 ns.

As the above example illustrates, if a clocked process reads the values of signals from processes that resume at the same time, it must read the previous value of those signals. Similarly, if a clocked process reads the values of signals from processes that are sensitive to the same clock, those processes will all resume in the same simulation cycle, the cycle immediately after the rising edge of the clock (assuming that the processes use if rising_edge or wait until rising_edge statements). Because the processes run in the same simulation cycle, they all read
the previous values of the signals that they depend on. If this were not the case, then the VHDL code for a pair of back-to-back flip-flops would not operate correctly, because the output of the first flip-flop would appear immediately at the output of the second flip-flop.

Simulation rounds begin with incrementing time, which triggers timed processes. Therefore, the first processes in the topological order are the timed processes. Timed processes may be run in any order, and they read the previous values of signals that they depend on. This gives the same effect as in delta-cycle simulation, where the timed processes would run in the same simulation cycle and read the values that signals had before the simulation cycle began.

We then sort the clocked and combinational processes based on their dependencies, so that each process appears (is run) after all of the processes on which it depends. Although a clocked process may read many signals, we say that a clocked process is dependent upon only its clock signal. It is the change in the clock signal that causes the process to resume. So, as long as the process is run after the clock signal is stable, we can be sure that it will not need to be run again at this time step.

Clocked processes may be run in any order. They read the current value of their clock signal and the previous value of the other signals that they depend on. As with timed processes, this gives the same effect as in delta-cycle simulation, where the clock edge would trigger the clocked processes to run in the same simulation cycle and the processes would read the values that signals had before the simulation cycle began.

1.7.2 Technique for Register-Transfer Level Simulation

1. Pre-processing
   (a) Separate processes into combinational and non-combinational (clocked and timed)
   (b) Decompose each combinational process into separate processes with one target signal per process
   (c) Sort processes into topological order based on dependencies

2. For each clock cycle or unit of time:
   (a) Run non-combinational processes in any order. Non-combinational assignments read values from the earlier clock cycle / time step, except that clocked processes read the current value of the clock signal.
   (b) Run combinational processes in topological order. Combinational assignments read values from the current clock cycle / time step.

Combinational Process Decomposition ................................................
Original code:

    proc(a,b,c)
      if a = '1' then
        d <= b;
        e <= c;
      else
        d <= not b;
        e <= b and c;
      end if;
    end process;

After decomposition:

    proc(a,b,c)
      if a = '1' then
        d <= b;
      else
        d <= not b;
      end if;
    end process;

    proc(a,b,c)
      if a = '1' then
        e <= c;
      else
        e <= b and c;
      end if;
    end process;

1.7.3 Examples of RTL Simulation

1.7.3.1 RTL Simulation Example 1

We revisit an earlier example from delta-cycle simulation, but change the code slightly and do register-transfer-level simulation.

1. Original code:

    proc1: process (a, b, c) begin
      d <= NOT c;
      c <= a AND b;
    end process;

    proc2: process (b, d) begin
      e <= b AND d;
    end process;

    proc3: process begin
      a <= '1';
      b <= '0';
      wait for 3 ns;
      b <= '1';
      wait for 99 ns;
    end process;

2. Decompose combinational processes into single-target processes:
Decomposed:

    proc1d: process (c) begin
      d <= NOT c;
    end process;

    proc1c: process (a, b) begin
      c <= a AND b;
    end process;

    proc2: process (b, d) begin
      e <= b AND d;
    end process;

Sorted:

    proc1c: process (a, b) begin
      c <= a AND b;
    end process;

    proc1d: process (c) begin
      d <= NOT c;
    end process;

    proc2: process (b, d) begin
      e <= b AND d;
    end process;

3. To sort the combinational processes into topological order, move proc1d after proc1c, because d depends on c.

4. Run the timed process (proc3) until it suspends at wait for 3 ns;.
   • The signal a gets '1' from 0 to 3 ns.
   • The signal b gets '0' from 0 to 3 ns.

5. Run proc1c.
   • The signal c gets a AND b ('1' AND '0' = '0') from 0 to 3 ns.

6. Run proc1d.
   • The signal d gets NOT c (NOT '0' = '1') from 0 to 3 ns.

7. Run proc2.
   • The signal e gets b AND d ('0' AND '1' = '0') from 0 to 3 ns.

8. Run the timed process until it suspends at wait for 99 ns;, which takes us from 3 ns to 102 ns.

9. Run the combinational processes in topological order to calculate the values of c, d, and e from 3 ns to 102 ns.

Waveforms ...........................................................................

    signal   0-3 ns   3-102 ns
    a        '1'      '1'
    b        '0'      '1'
    c        '0'      '1'
    d        '1'      '0'
    e        '0'      '0'

(All signals are 'U' before 0 ns.)
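The preprocessing and evaluation steps above can be checked mechanically. The following sketch is Python rather than VHDL; the representation of processes as (read-set, function) pairs and the helper names topo_order and rtl_round are invented for illustration. A failed topological sort indicates a combinational loop.

```python
def topo_order(procs):
    """procs: {target: (set of read signals, eval function)}.
    Returns an evaluation order; raises ValueError on a combinational loop."""
    order, placed, remaining = [], set(), dict(procs)
    while remaining:
        # A process is ready when every signal it reads is either already
        # placed or not driven by a combinational process at all.
        ready = [t for t, (reads, _) in remaining.items()
                 if all(r in placed or r not in procs for r in reads)]
        if not ready:
            raise ValueError("combinational loop among: " + ", ".join(sorted(remaining)))
        for t in sorted(ready):
            order.append(t)
            placed.add(t)
            del remaining[t]
    return order

# proc1c: c <= a AND b;   proc1d: d <= NOT c;   proc2: e <= b AND d
procs = {
    "c": ({"a", "b"}, lambda s: s["a"] & s["b"]),
    "d": ({"c"},      lambda s: 1 - s["c"]),
    "e": ({"b", "d"}, lambda s: s["b"] & s["d"]),
}
order = topo_order(procs)
assert order == ["c", "d", "e"]        # proc1d moved after proc1c

def rtl_round(sig):
    """One RTL simulation round: evaluate each combinational process once,
    in topological order, reading current-round values."""
    sig = dict(sig)
    for t in order:
        sig[t] = procs[t][1](sig)
    return sig

# 0-3 ns: proc3 drives a='1', b='0';   3-102 ns: proc3 drives b='1'
assert rtl_round({"a": 1, "b": 0}) == {"a": 1, "b": 0, "c": 0, "d": 1, "e": 0}
assert rtl_round({"a": 1, "b": 1}) == {"a": 1, "b": 1, "c": 1, "d": 0, "e": 0}
```

The two asserts reproduce the hand-computed signal values for both segments of the simulation.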
Question: Draw the RTL waveforms that correspond to the delta-cycle waveform below.

(Delta-cycle waveform of the Flummox example: simulation rounds at 0 ns, 3 ns, and 102 ns, with up to three delta cycles per round; traces for proc1, proc2, proc3 and the signals a, b, c, d, e.)

Answer:

(RTL waveform from 0 ns to 102 ns: a is '1' from 0 ns; b is '0' from 0 ns and '1' from 3 ns; c is '0' from 0 ns and '1' from 3 ns; d is '1' from 0 ns and '0' from 3 ns; e is '0' from 0 ns onward. All signals are 'U' before 0 ns.)
Example: Communicating State Machines ...............................................

    huey: process begin
      clk <= '0';
      wait for 10 ns;
      clk <= '1';
      wait for 10 ns;
    end process;

    dewey: process begin
      a <= to_unsigned(0,4);
      wait until re(clk);
      while (a < 4) loop
        a <= a + 1;
        wait until re(clk);
      end loop;
    end process;

    louie: process begin
      d <= '1';
      wait until re(clk);
      if (a >= 2) then
        d <= '0';
        wait until re(clk);
      end if;
    end process;

(In the slides, each wait statement is annotated with the times at which it resumes. The waveform from 0 ns to 120 ns shows clk toggling every 10 ns, a counting 0, 1, 2, 3, 4 and then wrapping to 0 and 1, and d following the sequence 1, 1, 1, 0, 1, 0, 1 on successive rising edges.)
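The hand simulation above can be mimicked with a small executable model. This sketch is Python (not VHDL): each clocked process is a generator, yield plays the role of wait until re(clk), processes read pre-edge signal values, and assignments are buffered until all processes have run, following the RTL rules of section 1.7.2.

```python
def dewey(sig, drive):
    while True:                      # a VHDL process body repeats forever
        drive("a", 0)
        yield                        # a <= to_unsigned(0,4); wait until re(clk)
        while sig["a"] < 4:          # reads the pre-edge value of a
            drive("a", sig["a"] + 1)
            yield                    # a <= a + 1; wait until re(clk)

def louie(sig, drive):
    while True:
        drive("d", 1)
        yield                        # d <= '1'; wait until re(clk)
        if sig["a"] >= 2:
            drive("d", 0)
            yield                    # d <= '0'; wait until re(clk)

sig = {"a": None, "d": None}
pending = {}
def drive(name, val):
    pending[name] = val              # buffered (provisional) assignment

procs = [dewey(sig, drive), louie(sig, drive)]
for p in procs:                      # time 0: run each process to its first wait
    next(p)
sig.update(pending); pending.clear()

trace = []
for edge in range(6):                # rising edges at 10, 30, 50, 70, 90, 110 ns
    for p in procs:                  # all processes read pre-edge values
        next(p)
    sig.update(pending); pending.clear()
    trace.append((sig["a"], sig["d"]))

# Matches the waveform: a = 1,2,3,4,0,1 and d = 1,1,0,1,0,1 after each edge.
assert trace == [(1, 1), (2, 1), (3, 0), (4, 1), (0, 0), (1, 1)]
```

Running louie after dewey within each edge is safe precisely because assignments are buffered: louie always sees the pre-edge value of a.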
A Related Simulation ................................................................

Small changes to the code can cause significant changes to the behaviour.

    riri: process begin
      clk <= '1';
      wait for 10 ns;
      clk <= '0';
      wait for 10 ns;
    end process;

    loulou: process begin
      wait until re(clk);
      d <= '1';
      if (a < 2) then
        d <= '0';
        wait until re(clk);
      end if;
    end process;

    fifi: process begin
      a <= to_unsigned(0,4);
      wait until re(clk);
      while (a < 4) loop
        a <= a + 1;
        wait until re(clk);
      end loop;
    end process;

(Blank waveform grid for clk, a, and d from 0 ns to 120 ns, left as an exercise.)
1.8 VHDL and Hardware Building Blocks

This section outlines the building blocks for register-transfer-level design and how to write VHDL code for the building blocks.

1.8.1 Basic Building Blocks

(Schematics: a 2:1 mux (also: n-to-1 muxes); a D flip-flop with chip enable (CE); a memory with write enable (WE), address (A), data in (DI), and data out (DO); and a dual-ported variant with ports A0/DO0 and A1/DO1.)

    Hardware                             VHDL
    --------                             ----
    AND, OR, NAND, NOR, XOR, XNOR        and, or, nand, nor, xor, xnor
    multiplexer                          if-then-else, case statement,
                                         selected assignment, conditional assignment
    adder, subtracter, negater           +, -, - (unary)
    shifter, rotater                     sll, srl, sla, sra, rol, ror
    flip-flop                            wait until, if-then-else, rising_edge
    memory array, register file, queue   2-d array or library component

Figure 1.19: RTL Building Blocks
1.8.2 Deprecated Building Blocks for RTL

Some of the common gates you have encountered in previous courses should be avoided when synthesizing register-transfer-level hardware, particularly if FPGAs are the implementation technology.

1.8.2.1 An Aside on Flip-Flops and Latches

flip-flop  Edge sensitive: the output only changes on the rising (or falling) edge of the clock.
latch      Level sensitive: the output changes whenever the clock is high (or low).

A common implementation of a flip-flop is a pair of latches (a master/slave flop). Latches are sometimes called "transparent latches", because they are transparent (input directly connected to output) when the clock is high. The clock to a latch is sometimes called the "enable" line. There is more information in the course notes on timing analysis for storage devices (Section 5.2).

1.8.2.2 Deprecated Hardware

Latches
• Use flops, not latches
• Latch-based designs are susceptible to timing problems
• The transparent phase of a latch can let a signal "leak" through a latch, causing the signal to affect the output one clock cycle too early
• It is possible for a latch-based circuit to simulate correctly, but not work in real hardware, because the timing delays on the real hardware don't match those predicted in synthesis

T, JK, SR, etc. flip-flops
• Limit yourself to D-type flip-flops
• Some FPGA and ASIC cell libraries include only D-type flip-flops. Others, such as Altera's APEX FPGAs, can be configured as D, T, JK, or SR flip-flops.

Tri-State Buffers
• Use multiplexers, not tri-state buffers
• Tri-state designs are susceptible to stability and signal-integrity problems
• Getting tri-state designs to simulate correctly is difficult; some library components don't support tri-state signals
• Tri-state designs rely on the code never letting two signals drive the bus at the same time
• It can be difficult to check that bus arbitration will always work correctly
• Manufacturing and environmental variability can make real hardware not work correctly even if it simulates correctly
• Typical industrial practice is to avoid the use of tri-state signals on a chip, but allow tri-state signals at the board level

Note: Unfortunately and surprisingly, PalmChip has been awarded a US patent for using uni-directional busses (i.e. multiplexers) for system-on-chip designs. The patent was filed in 2000, so all fourth-year design projects since 2000 that use muxes on FPGAs will need to pay royalties to PalmChip.

1.8.3 Hardware and Code for Flops

1.8.3.1 Flops with Waits and Ifs

The two code fragments below synthesize to identical hardware (flops).

If:

    process (clk) begin
      if rising_edge(clk) then
        q <= d;
      end if;
    end process;

Wait:

    process begin
      wait until rising_edge(clk);
      q <= d;
    end process;

1.8.3.2 Flops with Synchronous Reset

The two code fragments below synthesize to identical hardware (flops with synchronous reset). Notice that the synchronous reset is really nothing more than an AND gate on the input.

If:

    process (clk) begin
      if rising_edge(clk) then
        if (reset = '1') then
          q <= '0';
        else
          q <= d;
        end if;
      end if;
    end process;

Wait:

    process begin
      wait until rising_edge(clk);
      if (reset = '1') then
        q <= '0';
      else
        q <= d;
      end if;
    end process;
1.8.3.3 Flops with Chip-Enable

The two code fragments below synthesize to identical hardware (flops with chip-enable lines).

If:

    process (clk) begin
      if rising_edge(clk) then
        if (ce = '1') then
          q <= d;
        end if;
      end if;
    end process;

Wait:

    process begin
      wait until rising_edge(clk);
      if (ce = '1') then
        q <= d;
      end if;
    end process;

1.8.3.4 Flop with Chip-Enable and Mux on Input

The two code fragments below synthesize to identical hardware (flops with chip-enable lines and muxes on inputs).

If:

    process (clk) begin
      if rising_edge(clk) then
        if (ce = '1') then
          if (sel = '1') then
            q <= d1;
          else
            q <= d0;
          end if;
        end if;
      end if;
    end process;

Wait:

    process begin
      wait until rising_edge(clk);
      if (ce = '1') then
        if (sel = '1') then
          q <= d1;
        else
          q <= d0;
        end if;
      end if;
    end process;
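The nested ifs above reduce to a simple next-state function: the flop holds its value unless ce is asserted, in which case the input mux selects d1 or d0. A Python sketch of that behaviour (the function name next_q is invented for illustration):

```python
def next_q(q, ce, sel, d0, d1):
    """Value of q after a rising clock edge for the chip-enable + mux flop."""
    if ce == 1:
        return d1 if sel == 1 else d0   # mux on the input
    return q                            # chip-enable deasserted: hold

assert next_q(q=0, ce=1, sel=1, d0=0, d1=1) == 1   # sel picks d1
assert next_q(q=0, ce=1, sel=0, d0=1, d1=0) == 1   # sel picks d0
assert next_q(q=1, ce=0, sel=1, d0=0, d1=0) == 1   # hold when ce = '0'
```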
1.8.3.5 Flops with Chip-Enable, Muxes, and Reset

The two code fragments below synthesize to identical hardware (flops with chip-enable lines, muxes on inputs, and synchronous reset). Notice that the synchronous reset is really nothing more than a mux, or an AND gate, on the input.

Note: The specific combination and order of tests is important to guarantee that the circuit synthesizes to a flop with a chip enable, as opposed to a level-sensitive latch testing the chip enable and/or reset followed by a flop.

Note: The chip-enable pin on the flop is connected to both ce and reset. If the chip-enable pin were not connected to reset, then the flop would ignore reset unless chip-enable was asserted.

If:

    process (clk) begin
      if rising_edge(clk) then
        if (ce = '1' or reset = '1') then
          if (reset = '1') then
            q <= '0';
          elsif (sel = '1') then
            q <= d1;
          else
            q <= d0;
          end if;
        end if;
      end if;
    end process;

Wait:

    process begin
      wait until rising_edge(clk);
      if (ce = '1' or reset = '1') then
        if (reset = '1') then
          q <= '0';
        elsif (sel = '1') then
          q <= d1;
        else
          q <= d0;
        end if;
      end if;
    end process;

1.8.4 An Example Sequential Circuit

There are many ways to write VHDL code that synthesizes to the schematic in figure 1.20. The major choices are:

1. Categories of signals
   (a) All signals are outputs of flip-flops or inputs (no combinational signals)
   (b) Signals include both flopped and combinational

2. Number of flopped signals per process
   (a) All flopped signals in a single process
   (b) Some processes with multiple flopped signals
   (c) Each flopped signal in its own process

3. Style of flop code
   (a) Flops use if statements
   (b) Flops use wait statements

Some examples of these different options are shown in figures 1.21–1.24.

(Schematic of and_not_reg: a mux controlled by sel chooses between a and NOT a, reset forces '0', and the result is registered as the internal signal a; a second flop registers NOT a to drive the output c. Both flops are clocked by clk.)

    entity and_not_reg is
      port (
        reset, clk, sel : in  std_logic;
        c               : out std_logic
      );
    end;

Figure 1.20: Schematic and entity for and_not_reg

One Process, Flops, Wait ............................................................

    architecture one_proc of and_not_reg is
      signal a : std_logic;
    begin
      process begin
        wait until rising_edge(clk);
        if (reset = '1') then
          a <= '0';
        elsif (sel = '1') then
          a <= NOT a;
        else
          a <= a;
        end if;
        c <= NOT a;
      end process;
    end one_proc;

Figure 1.21: Implementation of Figure 1.20: all signals are flops, all flops in one process, flops use waits
Two Processes, Flops, Wait ..........................................................

    architecture two_proc_wait of and_not_reg is
      signal a : std_logic;
    begin
      process begin
        wait until rising_edge(clk);
        if (reset = '1') then
          a <= '0';
        elsif (sel = '1') then
          a <= NOT a;
        else
          a <= a;
        end if;
      end process;

      process begin
        wait until rising_edge(clk);
        c <= NOT a;
      end process;
    end two_proc_wait;

Figure 1.22: Implementation of Figure 1.20: all signals are flops, one flop per process, flops use waits
Two Processes with If-Then-Else .....................................................

    architecture two_proc_if of and_not_reg is
      signal a : std_logic;
    begin
      process (clk) begin
        if rising_edge(clk) then
          if (reset = '1') then
            a <= '0';
          elsif (sel = '1') then
            a <= NOT a;
          else
            a <= a;
          end if;
        end if;
      end process;

      process (clk) begin
        if rising_edge(clk) then
          c <= NOT a;
        end if;
      end process;
    end two_proc_if;

Figure 1.23: Implementation of Figure 1.20: all signals are flops, one flop per process, flops use if-then-else
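The architectures in figures 1.21–1.23 all implement the same next-state behaviour for the two flops in figure 1.20. A Python sketch of that behaviour (the function name and_not_reg_step is invented), stepping the circuit through a reset and two toggles:

```python
def and_not_reg_step(a, reset, sel):
    """One rising clock edge of and_not_reg: returns (next_a, next_c).
    Both flops read the pre-edge value of a."""
    next_c = 1 - a                 # c <= NOT a
    if reset == 1:
        next_a = 0                 # synchronous reset
    elif sel == 1:
        next_a = 1 - a             # a <= NOT a
    else:
        next_a = a                 # hold
    return next_a, next_c

a = 0
a, c = and_not_reg_step(a, reset=1, sel=0)   # reset: a = 0, c = NOT old a = 1
assert (a, c) == (0, 1)
a, c = and_not_reg_step(a, reset=0, sel=1)   # toggle: a = 1, c = NOT 0 = 1
assert (a, c) == (1, 1)
a, c = and_not_reg_step(a, reset=0, sel=1)   # toggle again: a = 0, c = NOT 1 = 0
assert (a, c) == (0, 0)
```

Because c is computed from the pre-edge a, its value lags a by one clock cycle, exactly as in the two-flop schematic.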
Concurrent Statements ...............................................................

    architecture comb of and_not_reg is
      signal a, b, d : std_logic;
    begin
      process (clk) begin
        if rising_edge(clk) then
          if (reset = '1') then
            a <= '0';
          else
            a <= d;
          end if;
        end if;
      end process;

      process (clk) begin
        if rising_edge(clk) then
          c <= NOT a;
        end if;
      end process;

      d <= b when (sel = '1') else a;
      b <= NOT a;
    end comb;

Figure 1.24: Implementation of Figure 1.20: flopped and combinational signals, one flop per process, flops use if-then-else

1.9 Arrays and Vectors

VHDL supports multidimensional arrays over elements of any type. The most common array is an array of std_logic signals, which has a predefined type: std_logic_vector. Throughout the rest of this section, we will discuss only std_logic_vector, but the rules apply to arrays of any type.

VHDL supports reading from and assigning to slices (aka "discrete subranges") of vectors. The rules for working with slices of vectors are listed below and illustrated in figure 1.25.

1. The ranges on both sides of the assignment must be the same size.
2. The direction (downto or to) of each slice must match the direction of the signal declaration.
3. The direction of the target and expression may be different.
  • 89.
Declarations
  ----------------------------------------------------
  a, b       : in  std_logic_vector(15 downto 0);
  c, d, e    : out std_logic_vector(15 downto 0);
  ----------------------------------------------------
  ax, bx     : in  std_logic_vector(0 to 15);
  cx, dx, ex : out std_logic_vector(0 to 15);
  ----------------------------------------------------
  m, n       : in  unsigned(15 downto 0);
  p, q, r    : out unsigned(15 downto 0);
  ----------------------------------------------------
  w, x       : in  signed(15 downto 0);
  y, z       : out signed(15 downto 0)
  ----------------------------------------------------

Legal code
  c(3 downto 0) <= a(15 downto 12);
  cx(0 to 3)    <= a(15 downto 12);
  (e(3), e(4))  <= bx(12 to 13);
  (e(5), e(6))  <= b(13 downto 12);

Illegal code
  d(0 to 3)     <= a(15 to 12);          -- slice dirs must be same as decl
  e(3) & e(2)   <= b(12 to 13);          -- syntax error on &
  p(3 downto 0) <= (m + n)(3 downto 0);  -- syntax error on )(
  z(3 downto 0) <= m(15 downto 12);      -- types on lhs and rhs must match

Figure 1.25: Illustration of Rules for Slices of Vectors

1.10 Arithmetic

VHDL includes all of the common arithmetic and logical operators. Use the VHDL arithmetic operators and let the synthesis tool choose the best implementation for you. It is almost impossible for a hand-coded implementation to beat vendor-supplied arithmetic libraries.

To use the operators, you must choose which arithmetic package you wish to use (Section 1.10.1). The arithmetic operators are overloaded, and you can usually use any mixture of constants and signals of different types that you need (Section 1.10.3). However, you might need to convert a signal from one type (e.g. std_logic_vector) to another type (e.g. integer) (Section 1.10.7).
1.10.1 Arithmetic Packages

Rushton Ch-7 covers arithmetic packages. Rushton Appendix A.5 has the code listing for the numeric_std package.

To do arithmetic with signals, use the numeric_std package. This package defines the types signed and unsigned, which are vectors of std_logic on which you can do signed or unsigned arithmetic. numeric_std supersedes earlier arithmetic packages, such as std_logic_arith. Use only one arithmetic package, otherwise the different definitions will clash and you can get strange error messages.

1.10.2 Shift and Rotate Operations

Shift and rotate operations are described with three-character acronyms: shift/rotate, left/right, arithmetic/logical. The shift right arithmetic (sra) operation preserves the sign of the operand by copying the most significant bit into lower bit positions. The shift left arithmetic (sla) does the analogous operation, except that the least significant bit is copied.

1.10.3 Overloading of Arithmetic

The arithmetic operators +, -, and * are overloaded on signed vectors, unsigned vectors, and integers. Tables 1.1–1.4 show the different combinations of target and source types and widths that can be used.

Table 1.1: Overloading of Arithmetic Operations (+, -)

  target     src1/2     src2/1     result
  unsigned   unsigned   integer    OK
  —          unsigned   signed     fails in analysis

In these tables “—” means “don’t care”. Also, src1/2 and src2/1 mean first or second operand, and respectively second or first operand. The first line of the table means that either the first operand is unsigned and the second is an integer, or the second operand is unsigned and the first is an integer. Or, more concisely: one of the operands is unsigned and the other is an integer.
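The overloading rules in Table 1.1 can be sketched as follows. This is an illustrative fragment, not from the course notes; the signal names are hypothetical and assumed to be declared in an enclosing architecture.

```vhdl
-- Hypothetical declarations:
--   signal m, n, p : unsigned(7 downto 0);
--   signal w       : signed(7 downto 0);
p <= m + n;      -- OK: both operands unsigned
p <= m + 1;      -- OK: one operand unsigned, the other integer
p <= 1 + m;      -- OK: operand order does not matter
-- p <= m + w;   -- fails in analysis: unsigned mixed with signed
```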
1.10.4 Different Widths and Arithmetic

Table 1.2: Different Vector Widths and Arithmetic Operations (+, -)

  target   src1/2   src2/1   result
  narrow   wide     —        fails in elaboration
  wide     narrow   int      fails in elaboration
  wide     wide     —        OK
  narrow   narrow   narrow   OK
  narrow   narrow   int      OK

Example vectors:
  wide     unsigned(7 downto 0)
  narrow   unsigned(4 downto 0)

1.10.5 Overloading of Comparisons

Table 1.3: Overloading of Comparison Operations (=, /=, >=, >, <)

  src1/2     src2/1     result
  unsigned   integer    OK
  signed     integer    OK
  unsigned   signed     fails in analysis

1.10.6 Different Widths and Comparisons

Table 1.4: Different Vector Widths and Comparison Operations (=, /=, >=, >, <)

  src1/2   src2/1   result
  wide     —        OK
  narrow   —        OK
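When Table 1.2 says that a wide target with a narrow source fails, the usual repair is numeric_std's resize function. A minimal sketch, using the example vector widths from Table 1.2 (the signal names are hypothetical):

```vhdl
-- Hypothetical declarations:
--   signal wide   : unsigned(7 downto 0);
--   signal narrow : unsigned(4 downto 0);
-- wide <= wide + narrow;           -- fails in elaboration: widths differ
wide <= wide + resize(narrow, 8);   -- OK: narrow is zero-extended to 8 bits
```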
1.10.7 Type Conversion

The functions unsigned, signed, to_integer, to_unsigned, and to_signed are used to convert between integers, std_logic vectors, signed vectors, and unsigned vectors. If you convert between two types of the same width, then no additional hardware will be generated. The listing below summarizes the types of these functions.

  unsigned( val : std_logic_vector ) return unsigned;
  signed( val : std_logic_vector ) return signed;
  to_integer( val : signed ) return integer;
  to_integer( val : unsigned ) return integer;
  to_unsigned( val : integer; width : natural ) return unsigned;
  to_signed( val : integer; width : natural ) return signed;

The most common need to convert between two types arises when using a signal as an index into an array. To use a signal as an index into an array, you must convert the signal into an integer using the function to_integer (Figure 1.26).

  signal i : unsigned(3 downto 0);
  signal a : std_logic_vector(15 downto 0);
  ...
  ... a(i) ...                  -- BAD: won't typecheck
  ... a( to_integer(i) ) ...    -- OK

Avoid (or at least take care when) converting a signal into an integer and then performing arithmetic on the signal. The default size for integers is 32 bits, so sometimes when a signal is converted into an integer, the resulting signals will be 32 bits wide.

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;
  ...
  signal bit_sig : std_logic;
  signal uns_sig : unsigned(7 downto 0);
  signal vec_sig : std_logic_vector(255 downto 0);
  ...
  bit_sig <= vec_sig( to_integer(uns_sig) );
  ...

Figure 1.26: Using an unsigned signal as an index into an array
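A sketch of the integer-to-vector direction, which the listing above declares but does not demonstrate. The signal names are hypothetical; to_unsigned and to_signed take the desired width as their second parameter.

```vhdl
-- Hypothetical declarations:
--   signal count : unsigned(3 downto 0);
--   signal offs  : signed(7 downto 0);
count <= to_unsigned(9, 4);    -- integer 9 as a 4-bit unsigned: "1001"
offs  <= to_signed(-3, 8);     -- integer -3 as an 8-bit signed: "11111101"
```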
To convert a std_logic_vector signal into an integer, you must first say whether the signal should be interpreted as signed or unsigned. As illustrated in Figure 1.27, this is done by:

1. Convert the std_logic_vector signal to signed or unsigned, using the function signed or unsigned.

2. Convert the signed or unsigned signal into an integer, using to_integer.

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;
  ...
  signal bit_sig : std_logic;
  signal std_sig : std_logic_vector(7 downto 0);
  signal vec_sig : std_logic_vector(255 downto 0);
  ...
  bit_sig <= vec_sig( to_integer( unsigned( std_sig ) ) );
  ...

Figure 1.27: Using a std_logic_vector as an index into an array

1.11 Synthesizable vs Non-Synthesizable Code

Synthesis is done by matching VHDL code against templates or patterns. It’s important to use idioms that your synthesis tool recognizes. If you aren’t careful, you could write code that has the same behaviour as one of the idioms, but which results in inefficient or incorrect hardware. Section 1.8 described common idioms and the resulting hardware.

Most synthesis tools agree on a large set of idioms, and will reliably generate hardware for these idioms. This section is based on the idioms that Synopsys, Xilinx, Altera, and Mentor Graphics are able to synthesize. One exception is that Altera’s Quartus does not support implicit state machines (as of v5.0).

Section 1.11.1 gives rules for unsynthesizable VHDL code. Section 1.11.2 gives rules for code that is synthesizable, but violates the ece327 guidelines for good practices. The ece327 coding guidelines are designed to produce circuits suitable for FPGAs. Bad code for FPGAs produces circuits with the following features:

  • latches
  • asynchronous resets
  • combinational loops
  • multiple drivers for a signal
  • tri-state buffers

We limit our definition of bad practice to code that produces undesirable hardware. Coding styles that lead to inefficient hardware might be useful in the early stages of the design process, when the focus is on functionality and not optimality. As such, inefficient code is not considered bad practice. Poor coding styles that do not affect the hardware, for example, including extraneous signals in a sensitivity list, should certainly be avoided, but fall into the general realm of programming guidelines and will not be discussed.

1.11.1 Unsynthesizable Code

1.11.1.1 Initial Values

Initial values on signals (UNSYNTHESIZABLE)

  signal bad_signal : std_logic := '0';

Reason: In most implementation technologies, when a circuit powers up, the values on signals are completely random. Some FPGAs are an exception to this. For some FPGAs, when a chip is powered up, all flip-flops will be '0'. For other FPGAs, the initial values can be programmed.

1.11.1.2 Wait For

Wait for a length of time (UNSYNTHESIZABLE)

  wait for 10 ns;

Reason: Delays through circuits are dependent upon both the circuit and its operating environment, particularly supply voltage and temperature.

1.11.1.3 Different Wait Conditions

wait statements with different conditions in a process (UNSYNTHESIZABLE)

  -- different clock signals           -- different clock edges
  process                              process
  begin                                begin
    wait until rising_edge(clk1);        wait until rising_edge(clk);
    x <= a;                              x <= a;
    wait until rising_edge(clk2);        wait until falling_edge(clk);
    x <= a;                              x <= a;
  end process;                         end process;

Reason: Processes with multiple wait statements are turned into finite state machines. The wait statements denote transitions between states. The target signals in the process are outputs of flip-flops. Using different wait conditions would require the flip-flops to use different clock signals at different times. Multiple clock signals for a single flip-flop would be difficult to synthesize, inefficient to build, and fragile to operate.

1.11.1.4 Multiple “if rising_edge” in Process

Multiple if rising_edge statements in a process (UNSYNTHESIZABLE)

  process (clk) begin
    if rising_edge(clk) then
      q0 <= d0;
    end if;
    if rising_edge(clk) then
      q1 <= d1;
    end if;
  end process;

Reason: The idioms for synthesis tools generally expect just a single if rising_edge statement in each process. The simpler the VHDL code is, the easier it is to synthesize hardware. Programmers of synthesis tools make idiomatic restrictions to make their jobs simpler.

1.11.1.5 “if rising_edge” and “wait” in Same Process

An if rising_edge statement and a wait statement in the same process (UNSYNTHESIZABLE)

  process (clk) begin
    if rising_edge(clk) then
      q0 <= d0;
    end if;
    wait until rising_edge(clk);
    q0 <= d1;
  end process;

Reason: The idioms for synthesis tools generally expect just a single type of flop-generating statement in each process.
1.11.1.6 “if rising_edge” with “else” Clause

The if statement has a rising_edge condition and an else clause (UNSYNTHESIZABLE).

  process (clk) begin
    if rising_edge(clk) then
      q0 <= d0;
    else
      q0 <= d1;
    end if;
  end process;

Reason: Generally, an if-then-else statement synthesizes to a multiplexer. The condition that is tested in the if-then-else becomes the select signal for the multiplexer. In an if rising_edge with else, the select signal would need to detect a rising edge on clk, which isn’t feasible to synthesize.

1.11.1.7 “if rising_edge” Inside a “for” Loop

An if rising_edge statement in a for-loop (UNSYNTHESIZABLE-Synopsys)

  process (clk) begin
    for i in 0 to 7 loop
      if rising_edge(clk) then
        q(i) <= d;
      end if;
    end loop;
  end process;

Reason: just an idiom of the synthesis tool. Some loop statements are synthesizable (Rushton Section 8.7). For-loops in general are described in Ashenden. Examples of for loops in E&CE will appear when describing testbenches for functional verification (Chapter 4).
Synthesizable Alternative ............................................................

A synthesizable alternative to an if rising_edge statement in a for-loop is to put the if rising_edge outside of the for loop.

  process (clk) begin
    if rising_edge(clk) then
      for i in 0 to 7 loop
        q(i) <= d;
      end loop;
    end if;
  end process;

1.11.1.8 “wait” Inside of a “for” Loop

wait statements in a for loop (UNSYNTHESIZABLE)

  process begin
    for i in 0 to 7 loop
      wait until rising_edge(clk);
      x <= to_unsigned(i, 4);
    end loop;
  end process;

Reason: Unknown. while-loops with the same behaviour are synthesizable.

Note: Combinational for-loops
Combinational for-loops are usually synthesizable. They are often used to build a combinational circuit for each element of an array.

Note: Clocked for-loops
Clocked for-loops are not synthesizable, but are very useful in simulation, particularly to generate test vectors for test benches.
Synthesizable Alternative to Wait-Inside-For ..........................................

This while loop is the synthesizable alternative to the wait statement in a for loop above.

  process begin
    -- output values from 0 to 4 on i
    -- sending one value out each clock cycle
    i <= to_unsigned(0, 4);
    wait until rising_edge(clk);
    while (4 > i) loop
      i <= i + 1;
      wait until rising_edge(clk);
    end loop;
  end process;

1.11.2 Synthesizable, but Bad Coding Practices

Note: For some of the results in this section, the results are highly dependent upon the synthesis tool that you use and the target technology library.

1.11.2.1 Asynchronous Reset

In an asynchronous reset, the test for reset occurs outside of the test for the clock edge.

  process (reset, clk) begin
    if (reset = '1') then
      q <= '0';
    elsif rising_edge(clk) then
      q <= d1;
    end if;
  end process;

Asynchronous resets are bad, because if a reset occurs very close to a clock edge, some parts of the circuit might be reset in one clock cycle and some in the subsequent clock cycle. This can lead the circuit to be out of sync as it goes through the reset sequence, potentially causing erroneous internal state and output values.
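A synchronous-reset version of the same flop moves the reset test inside the clock-edge test. A minimal sketch, reusing the signal names from the example above:

```vhdl
process (clk) begin
  if rising_edge(clk) then
    if (reset = '1') then
      q <= '0';     -- reset takes effect only at a clock edge
    else
      q <= d1;
    end if;
  end if;
end process;
```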
1.11.2.2 Combinational “if-then” Without “else”

  process (a, b) begin
    if (a = '1') then
      c <= b;
    end if;
  end process;

Reason: This code synthesizes c to be a latch, and latches are undesirable.

1.11.2.3 Bad Form of Nested Ifs

if rising_edge statement inside another if (BAD HARDWARE)

In Synopsys, with some target libraries, this design results in a level-sensitive latch whose input is a flop.

  process (ce, clk) begin
    if (ce = '1') then
      if rising_edge(clk) then
        q <= d1;
      end if;
    end if;
  end process;

1.11.2.4 Deeply Nested Ifs

Deeply chained if-then-else statements can lead to long chains of dependent gates, rather than checking different cases in parallel.

Slow (maybe):

  if cond1 then
    stmts1
  elsif cond2 then
    stmts2
  elsif cond3 then
    stmts3
  elsif cond4 then
    stmts4
  end if;

Fast (hopefully): If only one of the conditions can be true at a time, then try using a case statement or some other technique that allows the conditions to be evaluated in parallel.
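For example, if the four conditions amount to testing a single selector signal (a hypothetical two-bit sel, not part of the original example), the chain of elsifs can become a case statement whose branches are checked in parallel:

```vhdl
-- Hypothetical: sel is a 2-bit std_logic_vector encoding the four conditions.
case sel is
  when "00"   => stmts1
  when "01"   => stmts2
  when "10"   => stmts3
  when others => stmts4
end case;
```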
1.11.3 Synthesizable, but Unpredictable Hardware

Some coding styles are synthesizable and might produce desirable hardware with a particular synthesis tool, but either be unsynthesizable or produce undesirable hardware with another tool:

  • variables
  • level-sensitive wait statements
  • missing signals in sensitivity lists

If you are using a single synthesis tool for an extended period of time, and want to get the full power of the tool, then it can be advantageous to write your code in a way that works for your tool, but might produce undesirable results with other tools.

1.12 Synthesizable VHDL Coding Guidelines

This section gives guidelines for building robust, portable, and synthesizable VHDL code. Portability is both for different simulation and synthesis tools and for different implementation technologies. Remember, there is a world of difference between getting a design to work in simulation and getting it to work on a real FPGA. And there is also a huge difference between getting a design to work in an FPGA for a few minutes of testing and getting thousands of products to work for months at a time in thousands of different environments around the world. The coding guidelines here are designed to help you get both your E&CE 327 project and all of your subsequent industrial designs to work.

Finally, note that there are exceptions to every rule. You might find yourself in a circumstance where your particular situation (e.g. choice of tool, target technology, etc.) would benefit from bending or breaking a guideline here. Within E&CE 327, of course, there won’t be any such circumstances.

1.12.1 Signal Declarations

• Use signals, do not use variables
  reason The intention of the creators of VHDL was for signals to be wires and variables to be just for simulation. Some synthesis tools allow some uses of variables, but when using variables, it is easy to create a design that works in simulation but not in real hardware.
• Use std_logic signals, do not use bit or Boolean
  reason std_logic is the most commonly used signal type across synthesis tools, simulation tools, and cell libraries.

• Use in or out, do not use inout
  reason inout signals are tri-state.
  note If you have an output signal that you also want to read from, you might be tempted to declare the mode of the signal to be inout. A better solution is to create a new, internal signal that you both read from and write to. Then, your output signal can just read from the internal signal.

• Declare the primary inputs and outputs of chips as either std_logic or std_logic_vector. Do not use signed or unsigned for primary inputs or outputs.
  reason Both the Altera tool Quartus and the Xilinx tool ngd2vhdl convert signed and unsigned vectors in entities into std_logic_vectors. If you want your same testbench to work for both functional simulation and timing simulation, you must not use signed or unsigned signals in the top-level entity of your chip.
  note Signed and unsigned signals are fine inside testbenches, for non-top-level entities, and inside architectures. It is only the top-level entity that should not use signed or unsigned signals.

1.12.2 Flip-Flops and Latches

• Use flops, not latches (see Section 1.8.2).

• Use D-flops, not T, JK, etc. (see Section 1.8.2).

• For every signal in your design, know whether it should be a flip-flop or combinational. Before simulating your design, examine the log file (e.g. LOG/dc_shell.log) to see if the flip-flops in your circuit match your expectations, and to check that you don’t have any latches in your design.

• Do not assign a signal to itself (e.g. a <= a; is bad). If the signal is a flop, use a chip enable to cause the signal to hold its value. If the signal is combinational, then assigning a signal to itself will cause combinational loops, which are bad.

1.12.3 Inputs and Outputs

• Put flip-flops on the primary inputs and outputs of a chip
  reason Creates more robust implementations. Signal delays between chips are unpredictable. Signal integrity can be a problem (remember transmission lines from E&CE 324?).
  Putting flip-flops on the inputs and outputs of a chip provides clean boundaries between circuits.
  note This only applies to primary inputs and outputs of a chip (the signals in the top-level entity). Within a chip, you should adopt a standard of putting flip-flops on either the inputs or the outputs of modules. Within a chip, you do not need to put flip-flops on both inputs and outputs.

1.12.4 Multiplexors and Tri-State Signals

• Use multiplexors, not tri-state buffers (see Section 1.8.2).
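As a sketch (the signal names are hypothetical, not from the notes), a two-way multiplexor replaces a pair of tri-state drivers on a shared wire:

```vhdl
-- Tri-state style, to avoid on FPGAs:
--   z <= a when (en_a = '1') else 'Z';
--   z <= b when (en_b = '1') else 'Z';
-- Multiplexor style:
z <= a when (sel = '1') else b;
```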
1.12.5 Processes

• For a combinational process, the sensitivity list should contain all of the signals that are read in the process.
  reason Gives consistent results across different tools. Many synthesis tools will implicitly include all signals that a process reads in its sensitivity list. This differs from the VHDL Standard. A tool that adheres to the standard will introduce latches if not all signals that are read from are included in the sensitivity list.
  exception In a clocked process using an if rising_edge, it is acceptable to have only the clock in the sensitivity list.

• For a combinational process, every signal that is assigned to must be assigned to in every branch of if-then and case statements.
  reason If a signal is not assigned a value in a path through a combinational process, then that signal will be a latch.
  note For a clocked process, if a signal is not assigned a value in a clock cycle, then the flip-flop for that signal will have a chip-enable pin. Chip-enable pins are fine; they are available on flip-flops in essentially every cell library.

• Each signal should be assigned to in only one process.
  reason Multiple processes driving the same signal is the same as having multiple gates driving the same wire. This can cause contention, short circuits, and other bad things.
  exception Multiple drivers are acceptable for tri-state busses or if your implementation technology has wired-ANDs or wired-ORs. FPGAs don’t have wired-ANDs or wired-ORs.

• Separate unrelated signals into different processes
  reason Grouping assignments to unrelated signals into a single process can complicate the control circuitry for that process. Each branch in a case statement or if-then-else adds a multiplexor or chip-enable circuitry.
  reason Synthesis tools generally optimize each process individually; the larger a process is, the longer it will take the synthesis program to optimize the process.
  Also, larger processes tend to be more complicated and can cause synthesis programs to miss helpful optimizations that they would notice in smaller processes.

1.12.6 State Machines

• In a state machine, illegal and unreachable states should transition to the reset state
  reason Creates more robust implementations. In the field, your circuit will be subjected to illegal inputs, voltage spikes, temperature fluctuations, clock speed variations, etc. At some point in time, something weird will happen that will cause it to jump into an illegal state. Having the system reset and reboot is much better than having it generate incorrect outputs that aren’t detected.

• If your state machine has fewer than 16 states, use a one-hot encoding.
  reason For n states, a one-hot encoding uses n flip-flops, while a binary encoding uses log2 n flip-flops. One-hot signals are simpler to decode, because only one bit must be checked to determine if the circuit is in a particular state. For small values of n, a one-hot encoding results in a smaller and faster circuit. For large values of n, the number of signals required for a one-hot design is too great a penalty to compensate for the simplicity of the decoding circuitry.
  note Using an enumerated type for states allows the synthesis tool to choose state encodings that it thinks will work well to balance area and clock speed. Quartus uses a “modified one-hot” encoding, where the bit that denotes the reset state is inverted. That is, when the reset bit is ’0’, the system is in the reset state, and when the reset bit is a ’1’, the system is not in the reset state. The other bits have the normal polarity. The result is that when the system is in the reset state, all bits are ’0’, and when the system is in a non-reset state, two bits are ’1’.
  note Using your own encoding allows you to leverage knowledge about your design that the synthesis tool might not be able to deduce.

1.12.7 Reset

• Include a reset signal in all clocked circuits.
  reason For most implementation technologies, when you power up the circuit, you do not know what state it will start in. You need a reset signal to get the circuit into a known state.
  reason If something goes wrong while the circuit is running, you need a way to get it into a known state.

• For implicit state machines (Section 2.5.1.3), check for reset after every wait statement.
  reason Missing a reset check after a wait statement means that your circuit might not notice a reset signal, or different signals could reset in different clock cycles, causing your circuit to get out of synch.

• Connect reset to the important control signals in the design, such as the state signal. Do not reset every flip-flop.
  reason Using reset adds area and delay to a circuit. The fewer signals that need reset, the faster and smaller your design will be.
  note Connect the reset signal to critical flip-flops, such as the state signal. Datapath signals rarely need to be reset. You do not need to reset every signal.

• Use synchronous, not asynchronous, reset
  reason Creates more robust implementations. Signal propagation delays mean that asynchronous resets cause different parts of the circuit to be reset at different times. This can lead to glitches, which then might cause the circuit to move to an illegal state.
Covering All Cases ...................................................................

When writing case statements or selected assignments that test the value of std_logic signals, you will get an error unless you include a provision for non-’1’/’0’ signals. For example:

  signal t : std_logic;
  ...
  case t is
    when '1' => ...
    when '0' => ...
  end case;

will result in an error message about missing cases. You must provide for t being ’H’, ’U’, etc. The simplest thing to do is to make the last test when others.
1.13 VHDL Problems

P1.1 IEEE 1164

For each of the values in the list below, answer whether or not it is defined in the ieee.std_logic_1164 library. If it is part of the library, write a 2–3 word description of the value.

Values: ’-’, ’#’, ’0’, ’1’, ’A’, ’h’, ’H’, ’L’, ’Q’, ’X’, ’Z’.

P1.2 VHDL Syntax

Answer whether each of the VHDL code fragments q2a through q2f is legal VHDL code.

NOTES:
1) “...” represents a fragment of legal VHDL code.
2) For full marks, if the code is illegal, you must explain why.
3) The code has been written so that, if it is illegal, then it is illegal for both simulation and synthesis.

q2a
  architecture main of anchiceratops is
    signal a, b, c : std_logic;
  begin
    process begin
      wait until rising_edge(c);
      a <= if (b = '1') then ... else ... end if;
    end process;
  end main;

q2b
  architecture main of tulerpeton is
  begin
    lab: for i in 15 downto 0 loop
      ...
    end loop;
  end main;
q2c
  architecture main of metaxygnathus is
    signal a : std_logic;
  begin
    lab: if (a = '1') generate
      ...
    end generate;
  end main;

q2d
  architecture main of temnospondyl is
    component compa
      port (
        a : in  std_logic;
        b : out std_logic
      );
    end component;
    signal p, q : std_logic;
  begin
    coma_1 : compa port map (a => p, b => q);
    ...
  end main;

q2e
  architecture main of pachyderm is
    function inv(a : std_logic) return std_logic is
    begin
      return(NOT a);
    end inv;
    signal p, b : std_logic;
  begin
    p <= inv(b => a);
    ...
  end main;

q2f
  architecture main of apatosaurus is
    type state_ty is (S0, S1, S2);
    signal st : state_ty;
    signal p : std_logic;
  begin
    case st is
      when S0 | S1 => p <= '0';
      when others  => p <= '1';
    end case;
  end main;
P1.3 Flops, Latches, and Combinational Circuitry

For each of the signals p...z in the architecture main of montevido, answer whether the signal is a latch, combinational gate, or flip-flop.

entity montevido is
  port (
    a, b0, b1, c0, c1, d0, d1, e0, e1 : in std_logic;
    l : in std_logic_vector(1 downto 0);
    p, q, r, s, t, u, v, w, x, y, z : out std_logic
  );
end montevido;

architecture main of montevido is
  signal i, j : std_logic;
begin
  i <= c0 XOR c1;
  j <= c0 XOR c1;

  process (a, i, j) begin
    if (a = '1') then
      p <= i AND j;
    else
      p <= NOT i;
    end if;
  end process;

  process (a, b0, b1) begin
    if rising_edge(a) then
      q <= b0 AND b1;
    end if;
  end process;

  process (a, c0, c1, d0, d1, e0, e1)
  begin
    if (a = '1') then
      r <= c0 OR c1;
      s <= d0 AND d1;
    else
      r <= e0 XOR e1;
    end if;
  end process;

  process begin
    wait until rising_edge(a);
    t <= b0 XOR b1;
    u <= NOT t;
    v <= NOT x;
  end process;

  process begin
    case l is
      when "00" =>
        wait until rising_edge(a);
        w <= b0 AND b1;
        x <= '0';
      when "01" =>
        wait until rising_edge(a);
        w <= '-';
        x <= '1';
      when "1-" =>
        wait until rising_edge(a);
        w <= c0 XOR c1;
        x <= '-';
    end case;
  end process;

  y <= c0 XOR c1;
  z <= x XOR w;
end main;
P1.4 Counting Clock Cycles

This question refers to the VHDL code shown below.

NOTES:
1. “...” represents a legal fragment of VHDL code
2. assume all signals are properly declared
3. the VHDL code is intended to be legal, synthesizable code
4. all signals are initially ’U’
entity bigckt is
  port (
    a, b : in  std_logic;
    c    : out std_logic
  );
end bigckt;

architecture main of bigckt is
begin
  process (a, b)
  begin
    if (a = '0') then
      c <= '0';
    else
      if (b = '1') then
        c <= '1';
      else
        c <= '0';
      end if;
    end if;
  end process;
end main;

entity tinyckt is
  port (
    clk : in  std_logic;
    i   : in  std_logic;
    o   : out std_logic
  );
end tinyckt;

architecture main of tinyckt is
  component bigckt ( ... );
  signal ... : std_logic;
begin
  p0 : process begin
    wait until rising_edge(clk);
    p0_a <= i;
    wait until rising_edge(clk);
  end process;

  p1 : process begin
    wait until rising_edge(clk);
    p1_b <= p1_d;
    p1_c <= p1_b;
    p1_d <= s2_k;
  end process;

  p2 : process (p1_c, p3_h, p4_i, clk) begin
    if rising_edge(clk) then
      p2_e <= p3_h;
      p2_f <= p1_c = p4_i;
    end if;
  end process;

  p3 : process (i, s4_m) begin
    p3_g <= i;
    p3_h <= s4_m;
  end process;

  p4 : process (clk, i) begin
    if (clk = '1') then
      p4_i <= i;
    else
      p4_i <= '0';
    end if;
  end process;

  huge : bigckt port map (a => p2_e, b => p1_d, c => h_y);

  s1_j <= s3_l;
  s2_k <= p1_b XOR i;
  s3_l <= p2_f;
  s4_m <= p2_f;
end main;

For each of the pairs of signals below, what is the minimum length of time between when a change occurs on the source signal and when that change affects the destination signal?
  src    dst    Num clock cycles
  i      p0_a
  i      p1_b
  i      p1_c
  i      p2_e
  i      p3_g
  i      p4_i
  s4_m   h_y
  p1_b   p1_d
  p2_f   s1_j
  p2_f   s2_k

P1.5 Arithmetic Overflow

Implement a circuit to detect overflow in 8-bit signed addition. An overflow in addition happens when the carry into the most significant bit is different from the carry out of the most significant bit. When performing addition, for overflow to happen, both operands must have the same sign. Positive overflow occurs when adding two positive operands results in a negative sum. Negative overflow occurs when adding two negative operands results in a positive sum.
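One possible sketch of a solution (not necessarily the intended one; the entity name and ports are hypothetical): the internal carries are not directly visible when using the numeric_std + operator, but overflow can equivalently be detected from the sign bits of the operands and the sum, as the problem statement describes.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity name and port list.
entity ovf8 is
  port ( a, b     : in  signed(7 downto 0);
         sum      : out signed(7 downto 0);
         overflow : out std_logic );
end ovf8;

architecture main of ovf8 is
  signal s : signed(7 downto 0);
begin
  s   <= a + b;
  sum <= s;
  -- Overflow iff the operands have the same sign and the sum's sign differs.
  overflow <= (a(7) XNOR b(7)) AND (a(7) XOR s(7));
end main;
```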
P1.6 Delta-Cycle Simulation: Pong

Perform a delta-cycle simulation of the following VHDL code by drawing a waveform diagram.

INSTRUCTIONS:
1. The simulation is to be done at the granularity of simulation steps.
2. Show all changes to process modes and signal values.
3. Each column of the timing diagram corresponds to a simulation step that changes a signal or process.
4. Clearly show the beginning and end of each simulation cycle, delta cycle, and simulation round by writing in the appropriate row a B at the beginning and an E at the end of the cycle or round.
5. End your simulation just before 20 ns.

architecture main of pong_machine is
  signal ping_i, ping_n, pong_i, pong_n : std_logic;
begin
  reset_proc: process begin
    reset <= '1';
    wait for 10 ns;
    reset <= '0';
    wait for 100 ns;
  end process;

  clk_proc: process begin
    clk <= '0';
    wait for 10 ns;
    clk <= '1';
    wait for 10 ns;
  end process;

  next_proc: process (clk)
  begin
    if rising_edge(clk) then
      ping_n <= ping_i;
      pong_n <= pong_i;
    end if;
  end process;

  comb_proc: process (pong_n, ping_n, reset)
  begin
    if (reset = '1') then
      ping_i <= '1';
      pong_i <= '0';
    else
      ping_i <= pong_n;
      pong_i <= ping_n;
    end if;
  end process;
end main;

P1.7 Delta-Cycle Simulation: Baku

Perform a delta-cycle simulation of the following VHDL code by drawing a waveform diagram.

INSTRUCTIONS:
    90 CHAPTER 1. VHDL 1. The simulation is to be done at the granularity of simulation-steps. 2. Show all changes to process modes and signal values. 3. Each column of the timing diagram corresponds to a simulation step. 4. Clearly show the beginning and end of each simulation cycle, delta cycle, and simulation round by writing in the appropriate row a B at the beginning and an E at the end of the cycle or round. 5. Write “t=5ns” and “t=10ns” at the top of columns where time advances to 5 ns and 10 ns. 6. Begin your simulation at 5 ns (i.e. after the initial simulation cycles that initialize the signals have completed). 7. End your simulation just before 15 ns; entity baku is port ( clk, a, b : in std_logic; f : out std_logic ); end baku; architecture main of baku is signal c, d, e : std_logic; begin proc_clk: process begin proc_1 : process (a, b, c) clk <= ’0’; begin wait for 10 ns; c <= a and b; clk <= ’1’; d <= a xor c; wat for 10 ns; end process; end process; proc_2 : process proc_extern : process begin begin e <= d; a <= ’0’; wait until rising_edge(clk); b <= ’0’; end process; wait for 5 ns; proc_3 : process (c, e) begin a <= ’1’; f <= c xor e; b <= ’1’; end process; wait for 15 ns; end main; end process;
P1.8 Clock-Cycle Simulation

Given the VHDL code for anapurna and the waveform diagram below, answer what the values of the signals y, z, and p will be at the given times.

entity anapurna is
  port (
    clk, reset, sel : in std_logic;
    a, b : in unsigned(15 downto 0);
    p : out unsigned(15 downto 0)
  );
end anapurna;

architecture main of anapurna is
  type state_ty is (mango, guava, durian, papaya);
  signal y, z : unsigned(15 downto 0);
  signal state : state_ty;
begin
  proc_herzog: process begin
    top_loop: loop
      wait until (rising_edge(clk));
      next top_loop when (reset = '1');
      state <= durian;
      wait until (rising_edge(clk));
      state <= papaya;
      while y < z loop
        wait until (rising_edge(clk));
        if sel = '1' then
          wait until (rising_edge(clk));
          next top_loop when (reset = '1');
          state <= mango;
        end if;
        y <= b;
        state <= papaya;
        p <= y + z;
      end loop;
    end loop;
  end process;

  proc_hillary: process (clk)
  begin
    if rising_edge(clk) then
      if (state = durian) then
        z <= a;
      else
        z <= z + 2;
      end if;
    end if;
  end process;
end main;
P1.9 VHDL — VHDL Behavioural Comparison: Teradactyl

For each of the VHDL architectures q3a through q3c, does the signal v have the same behaviour as it does in the main architecture of teradactyl?

NOTES:
1) For full marks, if the code has different behaviour, you must explain why.
2) Ignore any differences in behaviour in the first few clock cycles that are caused by initialization of flip-flops, latches, and registers.
3) All code fragments in this question are legal, synthesizable VHDL code.

entity teradactyl is
  port (
    a : in std_logic;
    v : out std_logic
  );
end teradactyl;

architecture main of teradactyl is
  signal m : std_logic;
begin
  m <= a;
  v <= m;
end main;

architecture q3a of teradactyl is
  signal b, c, d : std_logic;
begin
  b <= a;
  c <= b;
  d <= c;
  v <= d;
end q3a;

architecture q3b of teradactyl is
  signal m : std_logic;
begin
  process (a, m) begin
    v <= m;
    m <= a;
  end process;
end q3b;

architecture q3c of teradactyl is
  signal m : std_logic;
begin
  process (a) begin
    m <= a;
  end process;
  process (m) begin
    v <= m;
  end process;
end q3c;
P1.10 VHDL — VHDL Behavioural Comparison: Ichthyostega

For each of the VHDL architectures q4a through q4c, does the signal v have the same behaviour as it does in the main architecture of ichthyostega?

NOTES:
1) For full marks, if the code has different behaviour, you must explain why.
2) Ignore any differences in behaviour in the first few clock cycles that are caused by initialization of flip-flops, latches, and registers.
3) All code fragments in this question are legal, synthesizable VHDL code.

entity ichthyostega is
  port (
    clk : in std_logic;
    b, c : in signed(3 downto 0);
    v : out signed(3 downto 0)
  );
end ichthyostega;

architecture main of ichthyostega is
  signal bx, cx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx <= b;
    cx <= c;
  end process;

  process begin
    wait until (rising_edge(clk));
    if (cx > 0) then
      v <= bx;
    else
      v <= to_signed(-1, 4);
    end if;
  end process;
end main;

architecture q4a of ichthyostega is
  signal bx, cx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx <= b;
    cx <= c;
  end process;

  process begin
    if (cx > 0) then
      wait until (rising_edge(clk));
      v <= bx;
    else
      wait until (rising_edge(clk));
      v <= to_signed(-1, 4);
    end if;
  end process;
end q4a;
architecture q4b of ichthyostega is
  signal bx, cx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx <= b;
    cx <= c;
    wait until (rising_edge(clk));
    if (cx > 0) then
      v <= bx;
    else
      v <= to_signed(-1, 4);
    end if;
  end process;
end q4b;

architecture q4c of ichthyostega is
  signal bx, cx, dx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx <= b;
    cx <= c;
  end process;

  process begin
    wait until (rising_edge(clk));
    v <= dx;
  end process;

  dx <= bx when (cx > 0)
        else to_signed(-1, 4);
end q4c;
P1.11 Waveform — VHDL Behavioural Comparison

Answer whether each of the VHDL code fragments q3a through q3f has the same behaviour as the timing diagram.

NOTES:
1) "Same behaviour" means that the signals a, b, and c have the same values at the end of each clock cycle in steady-state simulation (ignore any irregularities in the first few clock cycles).
2) For full marks, if the code does not match, you must explain why.
3) Assume that all signals, constants, variables, types, etc. are properly defined and declared.
4) All of the code fragments are legal, synthesizable VHDL code.

[Timing diagram showing clk, a, b, and c]

q3a
architecture q3a of q3 is
begin
  process begin
    a <= '1';
    loop
      a <= NOT a;
      wait until rising_edge(clk);
    end loop;
  end process;
  b <= NOT a;
  c <= NOT b;
end q3a;

q3b
architecture q3b of q3 is
begin
  process begin
    b <= '0';
    wait until rising_edge(clk);
    a <= '1';
    a <= b;
    b <= a;
    wait until rising_edge(clk);
  end process;
  c <= a;
end q3b;
q3c
architecture q3c of q3 is
begin
  process begin
    a <= '0';
    b <= '1';
    wait until rising_edge(clk);
    b <= a;
    a <= b;
    wait until rising_edge(clk);
    c <= NOT b;
  end process;
end q3c;

q3d
architecture q3d of q3 is
begin
  process (b, clk) begin
    a <= NOT b;
  end process;
  process (a, clk) begin
    b <= NOT a;
  end process;
end q3d;

q3e
architecture q3e of q3 is
begin
  process begin
    b <= '0';
    a <= '1';
    wait until rising_edge(clk);
    a <= c;
    b <= a;
    wait until rising_edge(clk);
  end process;
  c <= not b;
end q3e;

q3f
architecture q3f of q3 is
begin
  process begin
    a <= '1';
    b <= '0';
    c <= '1';
    wait until rising_edge(clk);
    a <= c;
    b <= a;
    c <= NOT b;
    wait until rising_edge(clk);
  end process;
end q3f;
P1.12 Hardware — VHDL Comparison

For each of the circuits q2a–q2d, answer whether the signal d has the same behaviour as it does in the main architecture of q2.

entity q2 is
  port (
    a, clk, reset : in std_logic;
    d : out std_logic
  );
end q2;

architecture main of q2 is
  signal b, c : std_logic;
begin
  b <= '0' when (reset = '1') else a;
  process (clk) begin
    if rising_edge(clk) then
      c <= b;
      d <= c;
    end if;
  end process;
end main;

[Figure: four candidate circuits q2a–q2d, each built from a mux selecting between '0' and a under reset, and flip-flops clocked by clk, driving d]
P1.13 8-Bit Register

Implement an 8-bit register that has:
• clock signal clk
• input data vector d
• output data vector q
• synchronous active-high input reset
• synchronous active-high input enable

P1.13.1 Asynchronous Reset
Modify your design so that the reset signal is asynchronous, rather than synchronous.

P1.13.2 Discussion
Describe the tradeoffs in using synchronous versus asynchronous reset in a circuit implemented on an FPGA.

P1.13.3 Testbench for Register
Write a testbench to validate the functionality of the 8-bit register with synchronous reset.
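As a hedged starting point for the synchronous-reset version of P1.13, one possible sketch is below. The entity name reg8 is an invented placeholder; only the port names come from the problem statement.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity reg8 is
  port (
    clk, reset, en : in  std_logic;
    d              : in  std_logic_vector(7 downto 0);
    q              : out std_logic_vector(7 downto 0)
  );
end reg8;

architecture main of reg8 is
begin
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then          -- synchronous, active-high reset
        q <= (others => '0');
      elsif en = '1' then          -- synchronous, active-high enable
        q <= d;
      end if;
    end if;
  end process;
end main;
```

For P1.13.1, the usual change is to add reset to the sensitivity list and test it before the rising_edge(clk) test, so the reset takes effect without waiting for a clock edge.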
P1.14 Synthesizable VHDL and Hardware

For each of the fragments of VHDL q4a...q4f, answer whether the code is synthesizable. If the code is synthesizable, draw the circuit most likely to be generated by synthesizing the datapath of the code. If the code is not synthesizable, explain why.

q4a
process begin
  wait until rising_edge(a);
  e <= d;
  wait until rising_edge(b);
  e <= NOT d;
end process;

q4b
process begin
  while (c /= '1') loop
    if (b = '1') then
      wait until rising_edge(a);
      e <= d;
    else
      e <= NOT d;
    end if;
  end loop;
  e <= b;
end process;

q4c
process (a, d) begin
  e <= d;
end process;
process (a, e) begin
  if rising_edge(a) then
    f <= NOT e;
  end if;
end process;

q4d
process (a) begin
  if rising_edge(a) then
    if b = '1' then
      e <= '0';
    else
      e <= d;
    end if;
  end if;
end process;
    100 CHAPTER 1. VHDL process (a,b,c,d) begin if rising_edge(a) then e <= c; else q4e if (b = ’1’) then e <= d; end if; end if; end process; process (a,b,c) begin if (b = ’1’) then e <= ’0’; else q4f if rising_edge(a) then e <= c; end if; end if; end process;
P1.15 Datapath Design

Each of the three VHDL fragments q4a–q4c is intended to be the datapath for the same circuit. The circuit is intended to perform the following sequence of operations (not all operations are required to use a clock cycle):
• read in the source and destination addresses from i_src1, i_src2, and i_dst
• read the operands op1 and op2 from memory
• compute the sum (sum) of the operands
• write the sum to memory at the destination address (dst)
• write the sum to the output o_result

[Figure: block symbol with inputs clk, i_src1, i_src2, i_dst and output o_result]

P1.15.1 Correct Implementation?

For each of the three fragments of VHDL q4a–q4c, answer whether it is a correct implementation of the datapath. If the datapath is not correct, explain why. If the datapath is correct, answer in which cycle you need load='1'.

NOTES:
1. You may choose the number of clock cycles required to execute the sequence of operations.
2. The cycle in which the addresses are on i_src1, i_src2, and i_dst is cycle #0.
3. The control circuitry that controls the datapath will output a signal load, which will be '1' when the sum is to be written into memory.
4. The code fragment with the signal declarations, connections for inputs and outputs, and the instantiation of memory is to be used for all three code fragments q4a–q4c.
5. The memory has registered inputs and combinational (unregistered) outputs.
6. All of the VHDL is legal, synthesizable code.
    102 CHAPTER 1. VHDL -- This code is to be used for -- all three code fragments q4a--q4c. signal state : std_logic_vector(3 downto 0); signal src1, src2, dst, op1, op2, sum, mem_in_a, mem_out_a, mem_out_b, mem_addr_a, mem_addr_b : unsigned(7 downto 0); ... process (clk) begin if rising_edge(clk) then src1 <= i_src1; src2 <= i_src2; dst <= i_dst; o_result <= sum; end if; end process; mem : ram256x16d port map (clk => clk, i_addr_a => mem_addr_a, i_addr_b => mem_addr_b, i_we_a => mem_we, i_data_a => mem_in_a, o_data_a => mem_out_a, o_data_b => mem_out_b);
q4a
op1 <= mem_out_a when state = "0010" else (others => '0');
op2 <= mem_out_b when state = "0010" else (others => '0');
sum <= op1 + op2 when state = "0100" else (others => '0');
mem_in_a <= sum when state = "1000" else (others => '0');
mem_addr_a <= dst when state = "1000" else src1;
mem_we <= '1' when state = "1000" else '0';
mem_addr_b <= src2;

process (clk) begin
  if rising_edge(clk) then
    if (load = '1') then
      state <= "1000";
    else
      -- rotate state vector one bit to the left
      state <= state(2 downto 0) & state(3);
    end if;
  end if;
end process;

q4b
process (clk) begin
  if rising_edge(clk) then
    op1 <= mem_out_a;
    op2 <= mem_out_b;
  end if;
end process;

sum <= op1 + op2;
mem_in_a <= sum;
mem_we <= load;
mem_addr_a <= dst when load = '1' else src1;
mem_addr_b <= src2;
    104 CHAPTER 1. VHDL q4c process begin wait until rising_edge(clk); op1 <= mem_out_a; op2 <= mem_out_b; sum <= op1 + op2; mem_in_a <= sum; end process; process (load, dst, src1) begin if load = ’1’ then mem_addr_a <= dst; else mem_addr_a <= src1; end if; end process; mem_addr_b <= src2; P1.15.2 Smallest Area Of all of the circuits (q4a–q4c), including both correct and incorrect circuits, predict which will have the smallest area. If you don’t have sufficient information to predict the relative areas, explain what additional infor- mation you would need to predict the area prior to synthesizing the designs. P1.15.3 Shortest Clock Period Of all of the circuits (q4a–q4c), including both correct and incorrect circuits, predict which will have the shortest clock period. If you don’t have sufficient information to predict the relative periods, explain what additional information you would need to predict the period prior to performing any synthesis or timing analysis of the designs.
Chapter 2

RTL Design with VHDL: From Requirements to Optimized Code

2.1 Prelude to Chapter

2.1.1 A Note on EDA for FPGAs and ASICs

The following is from John Cooley's column The Industry Gadfly from 2003/04/30. The title of this article is: "The FPGA EDA Slums".

    For 2001, Dataquest reported that the ASIC market was US$16.6 billion while the FPGA market was US$2.6 billion. What's more interesting is that the 2001 ASIC EDA market was US$2.2 billion while the FPGA EDA market was US$91.1 million. Nope, that's not a mistake. It's ASIC EDA and billion versus FPGA EDA and million.

    Do the math and you'll see that for every dollar spent on an ASIC project, roughly 12 cents of it goes to an EDA vendor. For every dollar spent on a FPGA project, roughly 3.4 cents goes to an EDA vendor. Not good.

    It's the old free milk and a cow story according to Gary Smith, the Senior EDA Analyst at Dataquest. "Altera and Xilinx have fouled their own nest. Their free tools spoil the FPGA EDA market," says Gary. "EDA vendors know that there's no money to be made in FPGA tools."
2.2 FPGA Background and Coding Guidelines

2.2.1 Generic FPGA Hardware

2.2.1.1 Generic FPGA Cell

"Cell" = "Logic Element" (LE) in Altera = "Configurable Logic Block" (CLB) in Xilinx

[Figure: generic FPGA cell with inputs comb_data_in, ctrl_in, and carry_in; a combinational lookup table (comb) driving comb_data_out; a flip-flop (D, CE, S, R) driving flop_data_out from flop_data_in; and carry_out chaining to the next cell]

2.2.2 Area Estimation

To estimate the number of FPGA cells that will be required to implement a circuit, recall that an FPGA lookup table can implement any function with up to four inputs and one output. We will describe two methods to estimate the area (number of FPGA cells) required to implement a gate-level circuit:
1. A rough estimate based simply upon the number of flip-flops and primary inputs that are in the fanin of each flip-flop.
2. A more accurate estimate, based upon greedily including as many gates as possible into each FPGA cell.

Allocating gates to FPGA cells is a form of technology mapping: moving from the implementation technology of generic gates to the implementation technology of FPGA cells. As with almost all other design tasks, allocating gates to cells is an NP-complete problem: the only way to ensure that we get the smallest design possible is to try all possible designs. To deal with NP-complete problems, design tools use heuristics or search techniques to explore efficiently a subset of the options and hopefully produce a design that is close to the absolute smallest. Because
different synthesis tools use different heuristics and search algorithms, different tools will give different results.

The circuitry for any flip-flop signal with up to four source flip-flops can be implemented on a single FPGA cell. If a flip-flop signal is dependent upon five source flip-flops, then two FPGA cells are required.

    Source flops/inputs    Minimum cells
            1                    1
            2                    1
            3                    1
            4                    1
            5                    2
            6                    2
            7                    2
            8                    3
            9                    3
           10                    3
           11                    4

For a single target signal, this technique gives a lower bound on the number of cells needed: the first cell covers up to four inputs and each additional cell covers three more, so n > 1 inputs need at least ceiling((n-1)/3) cells. For example, some functions of seven inputs require more than two cells. As a particular example, a four-to-one multiplexer has six inputs and requires three cells. When dealing with multiple target signals, this technique might be an overestimate, because a single cell can drive several other cells (common subexpression elimination).

PLA and Flop for Different Functions

[Figure: generic FPGA cell in which the lookup table and the flip-flop implement different functions]
PLA and Flop for Same Function

[Figure: generic FPGA cell in which the lookup table and the flip-flop implement the same function]
Estimate Area for Circuit

To have a more accurate estimate of the area of a circuit, we begin with each flip-flop and output, then traverse backward through the fanin, gathering as much combinational circuitry as possible into the FPGA cell. Usually, this means that we continue as long as we have four or fewer inputs to the cell. However, when traversing through some circuits, we will temporarily have five or more signals as input; then, further back in the fanin, the circuit will collapse back to fewer than five signals. Once we can no longer include more circuitry into an FPGA cell, we start with a fresh FPGA cell and continue to traverse backward through the fanin.

Many signals have more than one target, so many FPGA cells will be connected to multiple destinations. When choosing whether to include a gate in an FPGA cell, consider whether the gate drives multiple targets. There are two options: include the gate in an FPGA cell that drives both targets, or duplicate the gate and incorporate it into two FPGA cells. The choice of which option will lead to the smaller circuit is dependent on the details of the design.

Question: Map the combinational circuits below onto generic FPGA cells.

[Figure: example gate-level circuits with inputs a–i, flip-flops, and outputs x, y, z, to be mapped onto generic FPGA cells]
[Figure: additional gate-level circuits with inputs a–i, flip-flops, and outputs w, x, y, z, to be mapped onto generic FPGA cells]
2.2.2.1 Interconnect for Generic FPGA

Note: In these slides, the space between tightly grouped wires sometimes disappears, making a group of wires appear to be a single large wire.

There are two types of wires that connect a cell to the rest of the chip:
• General-purpose interconnect (configurable, slow)
• Carry chains and cascade chains (vertically adjacent cells, fast)

2.2.2.2 Blocks of Cells for Generic FPGA

Cells are organized into blocks. There is a great deal of interconnect (wires) between cells within a single block. In large FPGAs, the blocks are organized into larger blocks. These large blocks might themselves be organized into even larger blocks. Think of an FPGA as a bunch of nested for-generate statements that replicate a single component (cell) hundreds of thousands of times.
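The nested for-generate picture can be made concrete with a sketch along the following lines. The entity fabric, the component my_cell, and the array type are all invented for this illustration; a real FPGA model would nest further for-generates around this one to build blocks of blocks.

```vhdl
architecture sketch of fabric is
  type grid_ty is array (0 to 16, 0 to 15) of std_logic;
  signal q : grid_ty;
begin
  -- replicate one identical component in a 2-D array,
  -- the way an FPGA replicates its cell
  gen_rows : for r in 0 to 15 generate
    gen_cols : for c in 0 to 15 generate
      u : entity work.my_cell
        port map (d_in => q(r, c), d_out => q(r + 1, c));
    end generate gen_cols;
  end generate gen_rows;
end sketch;
```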
    2.2.2 Area Estimation 113 Cells not used for computation can be used as “wires” to shorten length of path between cells.
2.2.2.3 Clocks for Generic FPGAs

Characteristics of clock signals:
• High fanout (drive many gates)
• Long wires (destination gates scattered all over the chip)

Characteristics of FPGAs:
• Very few gates that are large (strong) enough to support a high fanout.
• Very few wires that traverse the entire chip and can be connected to every flip-flop.

2.2.2.4 Special Circuitry in FPGAs

Memory

For more than five years, FPGAs have had special circuits for RAM and ROM. In Altera FPGAs, these circuits are called ESBs (Embedded System Blocks). These special circuits are possible because many FPGAs are fabricated on the same processes as SRAM chips. So, the FPGAs simply contain small chunks of SRAM.

Microprocessors

A new feature to appear in FPGAs in 2001 and 2002 is hardwired microprocessors on the same chip as the programmable hardware.

                            Hard                            Soft
    Altera                  ARM 922T with 200 MIPS          Nios with ?? MIPS
    Xilinx: Virtex-II Pro   PowerPC 405 with 420 D-MIPS     MicroBlaze with 100 D-MIPS

The Virtex-II Pro has 4 PowerPCs and enough programmable hardware to implement the first-generation Intel Pentium microprocessor.

Arithmetic Circuitry

A new feature to appear in FPGAs in 2001 and 2002 is hardwired circuits for multipliers and adders.

    Altera: Mercury          16 x 16 at 130 MHz
    Xilinx: Virtex-II Pro    18 x 18 at ??? MHz

Using these resources can significantly improve both the area and performance of a design.
Input / Output

Recently, high-end FPGAs have started to include special circuits to increase the bandwidth of communication with the outside world.

    Product:  Altera True-LVDS (1 Gbps)
              Xilinx Rocket I/O (3 Gbps)

2.2.3 Generic-FPGA Coding Guidelines

• Flip-flops are almost free in FPGAs.
  reason: In FPGAs, the area consumed by a design is usually determined by the amount of combinational circuitry, not by the number of flip-flops.

• Aim for using 80–90% of the cells on a chip.
  reason: If you use more than 90% of the cells on a chip, then the place-and-route program might not be able to route the wires to connect the cells.
  reason: If you use less than 80% of the cells, then probably: there are optimizations that will increase performance and still allow the design to fit on the chip; or you spent too much human effort on optimizing for low area; or you could use a smaller (cheaper!) chip.
  exception: In E&CE 327 (unlike in real life), the mark is based on the actual number of cells used.

• Use just one clock signal.
  reason: If all flip-flops use the same clock, then the clock does not impose any constraints on where the place-and-route tool puts flip-flops and gates. If different flip-flops used different clocks, then flip-flops that are near each other would probably be required to use the same clock.

• Use only one edge of the clock signal.
  reason: There are two ways to use both rising and falling edges of a clock signal: have rising-edge and falling-edge flip-flops, or have two different clock signals that are inverses of each other. Most FPGAs have only rising-edge flip-flops. Thus, using both edges of a clock signal is equivalent to having two different clock signals, which is deprecated by the preceding guideline.
2.3 Design Flow

2.3.1 Generic Design Flow

Most people agree on the general terminology and process for a digital hardware design flow. However, each book and course has its own particular way of presenting the ideas. Here we will lay out the consistent set of definitions that we will use in E&CE 327. This might be different from what you have seen in other courses or on a work term. Focus on the ideas and you will be fine both now and in the future.

The design flow presented here focuses on the artifacts that we work with, rather than the operations that are performed on the artifacts. This is because the same operations can be performed at different points in the design flow, while the artifacts each have a unique purpose.

[Figure 2.1: Generic Design Flow — Requirements → Algorithm → High-Level Model → (dp/ctrl specific) DP+Ctrl Code → Opt. RTL Code → Implementation → Hardware, with a Modify/Analyze loop at each stage]
Table 2.1: Artifacts in the Design Flow

Requirements: Description of what the customer wants.

Algorithm: Functional description of the computation. Probably not synthesizable. Could be a flowchart, software, diagram, mathematical equation, etc.

High-Level Model: HDL code that is not necessarily synthesizable, but divides the algorithm into signals and clock cycles. Possibly mixes datapath and control. In VHDL, could be a single process that captures the behaviour of the algorithm. Usually synthesizable; the resulting hardware is usually big and slow compared to optimized RTL code.

Dataflow Diagram: A picture that depicts the datapath computation over time, clock cycle by clock cycle (Section 2.6).

Hardware Block Diagram: A picture that depicts the structure of the datapath: the components and the connections between the components (e.g., netlist or schematic).

State Machine: A picture that depicts the behaviour of the control circuitry over time (Section 2.5).

DP+Ctrl RTL Code: Synthesizable HDL code that separates the datapath and control into separate processes and assignments.

Optimized RTL Code: HDL code that has been written to meet design goals (high performance, low power, small area, etc.).

Implementation Code: A collection of files that include all of the information needed to build the circuit: an HDL program targeted for a particular implementation technology (e.g. a specific FPGA chip), constraint files, script files, etc.

Note: Recommendation — Spend the time up front to plan a good design on paper. Use dataflow diagrams and state machines to predict performance and area. The E&CE 327 project might appear to be sufficiently small and simple that you can go straight to RTL code. However, you will probably produce a more optimal design with less effort if you explore high-level optimizations with dataflow diagrams and state machines.

2.3.2 Implementation Flows

Synopsys Design Compiler and FPGA Compiler are general-purpose synthesis programs.
They have very few, if any, technology-specific algorithms. Instead, they rely on libraries to describe technology-specific parameters of the primitive building blocks (e.g. the delay and area of individual gates, PLAs, CLBs, flops, memory arrays).
Mentor Graphics' product Leonardo Spectrum, Cadence's product BuildGates, and Synplicity's product Synplify are similar. In comparison, Avant! (now owned by Synopsys) and Cadence sell separate tools that do place-and-route and other low-level (physical design) tasks.

These general-purpose synthesis tools do not (generally) do the final stages of the design, such as place-and-route and timing analysis, which are very specific to a given implementation technology. The implementation-technology-specific tools generally also produce a VHDL file that accurately models the chip. We will refer to this file as the "implementation VHDL code".

With Synopsys and the Altera tool Quartus, we compile the VHDL code into an EDIF file for the netlist and a TCL file for the commands to Quartus. Quartus then generates a sof (SRAM Object File), which can be downloaded to an Altera SRAM-based FPGA. The extension of the implementation VHDL file is often .vho, for "VHDL output".

With the Synopsys and Xilinx tools, we compile VHDL code into a Xilinx-specific design file (xnf — Xilinx netlist file). We then use the Xilinx tools to generate a bit file, which can be downloaded to a Xilinx FPGA. The name of the implementation VHDL file is often suffixed with routed.vhd.

Terminology: "Behavioural" and "Structural"

Note: behavioural and structural models. The phrases "behavioural model" and "structural model" are commonly used for what we'll call "high-level models" and "synthesizable models". In most cases, what people call structural code contains both structural and behavioural code. The technically correct definition of a structural model is an HDL program that contains only component instantiations and generate statements. Thus, even a program with c <= a AND b; is, strictly speaking, behavioural.
2.3.3 Design Flow: Datapath vs Control vs Storage

2.3.3.1 Classes of Hardware

Each circuit tends to be dominated by either its datapath, control (state machine), or storage (memory).

• Datapath
  – Purpose: compute output data based on input data
  – Each "parcel" of input produces one "parcel" of output
  – Examples: arithmetic, decoders
• Storage
  – Purpose: hold data for future use
  – Data is not modified while stored
  – Examples: register files, FIFO queues

• Control
  – Purpose: modify internal state based on inputs; compute outputs from state and inputs
  – Mostly individual signals; few data (vectors)
  – Examples: bus arbiters, memory controllers

All three classes of circuits (datapath, control, and storage) follow the same generic design flow (Figure 2.1) and use dataflow diagrams, hardware block diagrams, and state machines. The differences in the design flows appear in the relative amount of effort spent on each type of description and the order in which the different descriptions are used. The differences are most pronounced in the transition from the high-level model to the model that separates the datapath and control circuitry.

2.3.3.2 Datapath-Centric Design Flow

[Figure 2.2: Datapath-Centric Design Flow — High-Level Model → Dataflow Diagram → Block Diagram and State Machine → DP+Ctrl RTL Code, with Modify/Analyze loops]
2.3.3.3 Control-Centric Design Flow

[Figure 2.3: Control-Centric Design Flow — High-Level Model → State Machine → Dataflow Diagram → Block Diagram → DP+Ctrl RTL Code, with Modify/Analyze loops]

2.3.3.4 Storage-Centric Design Flow

In E&CE 327, we won't be discussing storage-centric design. Storage-centric design differs from datapath- and control-centric design in that storage-centric design focusses on building many replicated copies of small cells.

Storage-centric designs include a wide range of circuits, from simple memory arrays to complicated circuits such as register files, translation lookaside buffers, and caches. The complicated circuits can contain large and very intricate state machines, which would benefit from some of the techniques for control-centric circuits.

2.4 Algorithms and High-Level Models

For designs with significant control flow, algorithms can be described in software languages, flowcharts, abstract state machines, algorithmic state machines, etc.

For designs with trivial control flow (e.g. every parcel of input data undergoes the same computation), data-dependency graphs (Section 2.4.2) are a good way to describe the algorithm.

For designs with a small amount of control flow (e.g. a microprocessor, where a single decision is made based upon the opcode), a set of data-dependency graphs is often a good choice.
Software executes in series; hardware executes in parallel.

When creating an algorithmic description of your hardware design, think about how you can represent parallelism in the algorithmic notation that you are using, and how you can exploit parallelism to improve the performance of your design.

2.4.1 Flow Charts and State Machines

Flow charts and various flavours of state machines are covered well in many courses. Generally, everything that you've learned about these forms of description is also applicable in hardware design. In addition, you can exploit parallelism in state machine design to create communicating finite state machines. A single complex state machine can be factored into multiple simple state machines that operate in parallel and communicate with each other.

2.4.2 Data-Dependency Graphs

In software, the expression:
    (((((a + b) + c) + d) + e) + f)
takes the same amount of time to execute as:
    (a + b) + (c + d) + (e + f)
But remember: hardware runs in parallel. In algorithmic descriptions, parentheses can guide parallel vs serial execution.

Data-dependency graphs capture the algorithms of datapath-centric designs. Datapath-centric designs have few, if any, control decisions: every parcel of input data undergoes the same computation.

    Serial: (((((a+b)+c)+d)+e)+f)
      5 adders on longest path (slower)
      5 adders used (equal area)

    Parallel: (a+b)+(c+d)+(e+f)
      3 adders on longest path (faster)
      5 adders used (equal area)

[Figure: data-dependency graphs for the serial chain and the parallel tree of adders over inputs a–f]
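In VHDL the two parenthesizations above can be written directly as concurrent assignments; this is a sketch only, assuming a–f, sum_serial, and sum_parallel are unsigned signals of the same width declared elsewhere.

```vhdl
-- Both assignments compute a+b+c+d+e+f; only the parenthesization differs.
-- A synthesis tool will typically build a chain of five adders for the
-- first form and a tree of depth three for the second.
sum_serial   <= ((((a + b) + c) + d) + e) + f;
sum_parallel <= (a + b) + ((c + d) + (e + f));
```

Both forms use five adders, so the area is roughly equal; the tree form shortens the critical path from five adder delays to three.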
2.4.3 High-Level Models

There are many different types of high-level models, depending upon the purpose of the model and the characteristics of the design that the model describes. Some models may capture power consumption, others performance, others data functionality. High-level models are used to estimate the most important design metrics very early in the design cycle. If power consumption is more important than performance, then you might write high-level models that can predict the power consumption of different design choices, but which have no information about the number of clock cycles that a computation takes, or which predict the latency inaccurately. Conversely, if performance is important, you might write clock-cycle-accurate high-level models that do not contain any information about power consumption.

Conventionally, performance has been the primary design metric. Hence, high-level models that predict performance are more prevalent and more well understood than other types of high-level models. There are many research and entrepreneurial opportunities for people who can develop tools and/or languages for high-level models for estimating power, area, maximum clock speed, etc. In E&CE 327 we will limit ourselves to the well-understood area of high-level models for performance prediction.
2.5 Finite State Machines in VHDL

2.5.1 Introduction to State-Machine Design

2.5.1.1 Mealy vs Moore State Machines

Moore Machines
• Outputs are dependent upon only the state
• No combinational paths from inputs to outputs
[State diagram: states are labelled state/output; s0/0 goes to s1/1 on a and to s2/0 on !a; s1/1 and s2/0 go to s3/0; s3/0 returns to s0/0.]

Mealy Machines
• Outputs are dependent upon both the state and the inputs
• Combinational paths from inputs to outputs
[State diagram: edges are labelled input/output; s0 goes to s1 on a/1 and to s2 on !a/0; s1 and s2 go to s3 with output 0; s3 returns to s0.]

2.5.1.2 Introduction to State Machines and VHDL

A state machine is generally written as a single clocked process, or as a pair of processes, where one is clocked and one is combinational.
Design Decisions
• Moore vs Mealy (Sections 2.5.2 and 2.5.3)
• Implicit vs explicit (Section 2.5.1.3)
• State values in explicit state machines: enumerated type vs constants (Section 2.5.5.1)
• State values for constants: encoding scheme (binary, gray, one-hot, ...) (Section 2.5.5)

VHDL Constructs for State Machines

The following VHDL control constructs are useful to steer the transition from state to state:
• if ... then ... else
• case
• loop
• for ... loop
• while ... loop
• next
• exit

2.5.1.3 Explicit vs Implicit State Machines

There are two broad styles of writing state machines in VHDL: explicit and implicit. "Explicit" and "implicit" refer to whether there is an explicit state signal in the VHDL code. Explicit state machines have a state signal in the VHDL code. Implicit state machines do not contain a state signal; instead, they use VHDL processes with multiple wait statements to control the execution. In the explicit style of writing state machines, each process has at most one wait statement.

For the explicit style of writing state machines, there are two sub-categories: "current state" and "current+next state". In the explicit-current style, the state signal represents the current state of the machine and the signal is assigned its next value in a clocked process. In the explicit-current+next style, there is a signal for the current state and another signal for the next state. The next-state signal is assigned its value in a combinational process or concurrent statement and is dependent upon the current state and the inputs. The current-state signal is assigned its value in a clocked process and is just a flopped copy of the next-state signal.
For the implicit style of writing state machines, the synthesis program adds an implicit register to hold the state signal and combinational circuitry to update the state signal. In Synopsys synthesis tools, the state signal defined by the synthesizer is named multiple_wait_state_reg. In Mentor Graphics tools, the state signal is named STATE_VAR. We can think of the VHDL code for implicit state machines as having zero state signals, explicit-current state machines as having one state signal (state), and explicit-current+next state machines as having two state signals (state and state_nxt).
As with all topics in E&CE 327, there are tradeoffs between these different styles of writing state machines. Most books teach only the explicit-current+next style. This style is the closest to the hardware, which means that such descriptions are more amenable to optimization through human intervention, rather than relying on a synthesis tool for optimization. The advantage of the implicit style is that it is concise and readable for control flows consisting of nested loops and branches (e.g. the type of control flow that appears in software). For control flows that have less structure, it can be difficult to write an implicit state machine. Very few books or synthesis manuals describe multiple-wait-statement processes, but they are relatively well supported among synthesis tools. Because implicit state machines are written with loops, if-then-elses, cases, etc., it is difficult to write some state machines with complicated control flows in an implicit style. The following example illustrates the point.

[State diagram: s0/0 goes to s1/1 on a and to s2/0 on !a; s1/1 and s2/0 go to s3/0; s3/0 returns to s0/0.]

Note: The terminology of "explicit" and "implicit" is somewhat standard, in that some descriptions of processes with multiple wait statements describe the processes as having "implicit state machines". There is no standard terminology to distinguish between the two explicit styles: explicit-current+next and explicit-current.

2.5.2 Implementing a Simple Moore Machine

The Moore machine shown above has the following entity:

    entity simple is
      port (
        a, clk : in  std_logic;
        z      : out std_logic
      );
    end simple;
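Before looking at the VHDL implementations, a cycle-level reference model can pin down the intended behaviour. The sketch below (Python, not part of the course code; written directly from the state diagram above) treats the output as a function of the state only, as a Moore machine requires, and samples the input a at each clock edge.

```python
# Moore output for each state: s0/0, s1/1, s2/0, s3/0
OUT = {'s0': 0, 's1': 1, 's2': 0, 's3': 0}

def step(state, a):
    """Next-state function from the state diagram."""
    if state == 's0':
        return 's1' if a else 's2'
    if state in ('s1', 's2'):
        return 's3'
    return 's0'          # s3 -> s0

def run(inputs, state='s0'):
    """Return the Moore output observed in each clock cycle."""
    outs = []
    for a in inputs:
        outs.append(OUT[state])   # Moore: output depends on state only
        state = step(state, a)
    return outs

print(run([1, 0, 0, 0, 0]))  # [0, 1, 0, 0, 0]
```

With a asserted in the first cycle, the machine visits s0, s1, s3, s0, s2, so z is high for exactly one cycle.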
2.5.2.1 Implicit Moore State Machine

Flops: 3   Gates: 2   Delay: 1 gate

    architecture moore_implicit_v1a of simple is
    begin
      process
      begin
        z <= '0';
        wait until rising_edge(clk);
        if (a = '1') then
          z <= '1';
        else
          z <= '0';
        end if;
        wait until rising_edge(clk);
        z <= '0';
        wait until rising_edge(clk);
      end process;
    end moore_implicit_v1a;
2.5.2.2 Explicit Moore with Flopped Output

Flops: 3   Gates: 10   Delay: 3 gates

    architecture moore_explicit_v1 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          case state is
            when s0 =>
              if (a = '1') then
                state <= s1;
                z     <= '1';
              else
                state <= s2;
                z     <= '0';
              end if;
            when s1 | s2 =>
              state <= s3;
              z     <= '0';
            when s3 =>
              state <= s0;
              z     <= '0';
          end case;
        end if;
      end process;
    end moore_explicit_v1;
2.5.2.3 Explicit Moore with Combinational Outputs

Flops: 2   Gates: 7   Delay: 4 gates

    architecture moore_explicit_v2 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          case state is
            when s0 =>
              if (a = '1') then
                state <= s1;
              else
                state <= s2;
              end if;
            when s1 | s2 =>
              state <= s3;
            when s3 =>
              state <= s0;
          end case;
        end if;
      end process;

      z <= '1' when (state = s1) else '0';
    end moore_explicit_v2;
2.5.2.4 Explicit-Current+Next Moore with Concurrent Assignment

Flops: 2   Gates: 7   Delay: 4

    architecture moore_explicit_v3 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state, state_nxt : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          state <= state_nxt;
        end if;
      end process;

      state_nxt <= s1 when (state = s0) and (a = '1')
              else s2 when (state = s0) and (a = '0')
              else s3 when (state = s1) or  (state = s2)
              else s0;

      z <= '1' when (state = s1) else '0';
    end moore_explicit_v3;

The hardware synthesized from this architecture is the same as that synthesized from moore_explicit_v2, which is written in the explicit-current style.
2.5.2.5 Explicit-Current+Next Moore with Combinational Process

For this architecture, we change the concurrent assignment to state_nxt into a combinational process that uses a case statement. The hardware synthesized from this architecture is the same as that synthesized from moore_explicit_v2 and moore_explicit_v3.

Flops: 2   Gates: 7   Delay: 4

    architecture moore_explicit_v4 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state, state_nxt : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          state <= state_nxt;
        end if;
      end process;

      process (state, a) begin
        case state is
          when s0 =>
            if (a = '1') then
              state_nxt <= s1;
            else
              state_nxt <= s2;
            end if;
          when s1 | s2 =>
            state_nxt <= s3;
          when s3 =>
            state_nxt <= s0;
        end case;
      end process;

      z <= '1' when (state = s1) else '0';
    end moore_explicit_v4;
2.5.3 Implementing a Simple Mealy Machine

Mealy machines have a combinational path from inputs to outputs, which often violates good coding guidelines for hardware. Thus, Moore machines are much more common. You should know how to write a Mealy machine if needed, but most of the state machines that you design will be Moore machines.

This is the same entity as for the simple Moore state machine. The behaviour of the Mealy machine is the same as that of the Moore machine, except for the timing relationship between the output (z) and the input (a).

[State diagram: edges are labelled input/output; s0 goes to s1 on a/1 and to s2 on !a/0; s1 and s2 go to s3 with output 0; s3 returns to s0.]

    entity simple is
      port (
        a, clk : in  std_logic;
        z      : out std_logic
      );
    end simple;
2.5.3.1 Implicit Mealy State Machine

Note: An implicit Mealy state machine is nonsensical. In an implicit state machine, we do not have a state signal. But, as the example below illustrates, to create a Mealy state machine we must have a state signal.

Because the output is dependent upon the input in the current clock cycle, the output cannot be a flop. For the output to be combinational and dependent upon both the current state and the current input, we must create a state signal that we can read in the assignment to the output. Creating a state signal obviates the advantages of using an implicit style of state machine.

Flops: 4   Gates: 8   Delay: 2 gates

    architecture implicit_mealy of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state : state_ty;
    begin
      process
      begin
        state <= s0;
        wait until rising_edge(clk);
        if (a = '1') then
          state <= s1;
        else
          state <= s2;
        end if;
        wait until rising_edge(clk);
        state <= s3;
        wait until rising_edge(clk);
      end process;

      z <= '1' when (state = s0) and a = '1' else '0';
    end implicit_mealy;
2.5.3.2 Explicit Mealy State Machine

Flops: 2   Gates: 7   Delay: 3

    architecture mealy_explicit of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          case state is
            when s0 =>
              if (a = '1') then
                state <= s1;
              else
                state <= s2;
              end if;
            when s1 | s2 =>
              state <= s3;
            when others =>
              state <= s0;
          end case;
        end if;
      end process;

      z <= '1' when (state = s0) and a = '1' else '0';
    end mealy_explicit;
2.5.3.3 Explicit-Current+Next Mealy

Flops: 2   Gates: 4   Delay: 3

    architecture mealy_explicit_v2 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state, state_nxt : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          state <= state_nxt;
        end if;
      end process;

      state_nxt <= s1 when (state = s0) and a = '1'
              else s2 when (state = s0) and a = '0'
              else s3 when (state = s1) or  (state = s2)
              else s0;

      z <= '1' when (state = s0) and a = '1' else '0';
    end mealy_explicit_v2;

For the Mealy machine, the explicit-current+next style is smaller than the explicit-current style. In contrast, for the Moore machine, the two styles produce exactly the same hardware.
2.5.4 Reset

All circuits should have a reset signal that puts the circuit back into a good initial state. However, not all flip-flops within the circuit need to be reset. In a circuit that has a datapath and a state machine, the state machine will probably need to be reset, but the datapath may not need to be reset. There are standard ways to add a reset signal to both explicit and implicit state machines. It is important that reset is tested on every clock cycle; otherwise a reset might not be noticed, or your circuit will be slow to react to reset and could generate illegal outputs after reset is asserted.

Reset with Implicit State Machine

With an implicit state machine, we need to insert a loop in the process and test for reset after each wait statement. Here is the implicit Moore machine from section 2.5.2.1 with the reset code added (marked by comments).

    architecture moore_implicit of simple is
    begin
      process
      begin
        init : loop                          -- outermost loop
          z <= '0';
          wait until rising_edge(clk);
          next init when (reset = '1');      -- test for reset
          if (a = '1') then
            z <= '1';
          else
            z <= '0';
          end if;
          wait until rising_edge(clk);
          next init when (reset = '1');      -- test for reset
          z <= '0';
          wait until rising_edge(clk);
          next init when (reset = '1');      -- test for reset
        end loop init;
      end process;
    end moore_implicit;
Reset with Explicit State Machine

Reset is often easier to include in an explicit state machine, because we need only put a test for reset = '1' in the clocked process for the state. The pattern for an explicit-current style of machine is:

    process (clk) begin
      if rising_edge(clk) then
        if reset = '1' then
          state <= s0;
        else
          if ... then
            state <= ...;
          elsif ... then
            ...  -- more tests and assignments to state
          end if;
        end if;
      end if;
    end process;

Applying this pattern to the explicit Moore machine from section 2.5.2.3 produces:

    architecture moore_explicit_v2 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          if (reset = '1') then
            state <= s0;
          else
            case state is
              when s0 =>
                if (a = '1') then
                  state <= s1;
                else
                  state <= s2;
                end if;
              when s1 | s2 =>
                state <= s3;
              when s3 =>
                state <= s0;
            end case;
          end if;
        end if;
      end process;

      z <= '1' when (state = s1) else '0';
    end moore_explicit_v2;
The pattern for an explicit-current+next style is:

    process (clk) begin
      if rising_edge(clk) then
        if reset = '1' then
          state_cur <= s0;   -- the reset state
        else
          state_cur <= state_nxt;
        end if;
      end if;
    end process;

2.5.5 State Encoding

When working with explicit state machines, we must address the issue of state encoding: what bit-vector value do we associate with each state? With implicit state machines, we do not need to worry about state encoding: the synthesis program determines the number of states and the encoding for each state.

2.5.5.1 Constants vs Enumerated Type

Using an enumerated type, the synthesis tool chooses the encoding:

    type state_ty is (s0, s1, s2, s3);
    signal state : state_ty;

Using constants, we choose the encoding:

    subtype state_ty is std_logic_vector(1 downto 0);
    constant s0 : state_ty := "11";
    constant s1 : state_ty := "10";
    constant s2 : state_ty := "00";
    constant s3 : state_ty := "01";
    signal state : state_ty;

Providing Encodings for Enumerated Types

Many synthesizers allow the user to provide hints on how to encode the states, or allow the user to provide the desired encoding explicitly. These hints are given either through VHDL attributes or through special comments in the code.
Simulation

When doing functional simulation with enumerated types, simulators often display waveforms with "pretty-printed" values rather than bits (e.g. s0 and s1 rather than 11 and 10). However, when simulating a design that has been mapped to gates, the enumerated type disappears and you are left with just bits. If you don't know the encoding that the synthesis tool chose, it can be very difficult to debug the design.

A related pitfall is using a when others catch-all in a case statement: this opens you up to potential bugs if the enumerated type you are testing grows to include more values, which then end up unintentionally executing your when others branch, rather than having a special branch of their own in the case statement.

Unused Values

If the number of values that you need in your datatype is not a power of two, then some representable values will be unused. For example:

    subtype state_ty is std_logic_vector(2 downto 0);
    constant s0 : state_ty := "011";
    constant s1 : state_ty := "000";
    constant s2 : state_ty := "001";
    constant s3 : state_ty := "010";
    constant s4 : state_ty := "101";
    signal state : state_ty;

This type only needs five unique values, but can represent eight different values. What should we do with the three representable values that we don't need? The safest thing to do is to code your design so that if an illegal value is encountered, the machine resets or enters an error state.

2.5.5.2 Encoding Schemes

• Binary: Conventional binary counter.
• One-hot: Exactly one bit is asserted at any time.
• Modified one-hot: Altera's Quartus synthesizer generates an almost-one-hot encoding where the bit representing the reset state is inverted. This means that the reset state is all '0's and all other states have two '1's: one for the reset state and one for the current state.
• Gray: Transition between adjacent values requires exactly one bit flip.
• Custom: Choose the encoding to simplify the combinational logic for a specific task.
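The binary, Gray, and one-hot schemes listed above can be generated mechanically. A small sketch (Python, not from the notes; the helper names are mine):

```python
def binary(n, width):
    """Conventional binary encodings for n states."""
    return [format(i, '0{}b'.format(width)) for i in range(n)]

def gray(n, width):
    """Gray code: i XOR (i >> 1); adjacent codes differ in one bit."""
    return [format(i ^ (i >> 1), '0{}b'.format(width)) for i in range(n)]

def one_hot(n):
    """One-hot: n bits, exactly one asserted per state."""
    return [format(1 << i, '0{}b'.format(n)) for i in range(n)]

codes = gray(4, 2)                   # ['00', '01', '11', '10']
# adjacent Gray codes differ in exactly one bit:
for prev, cur in zip(codes, codes[1:]):
    assert sum(p != c for p, c in zip(prev, cur)) == 1

print(binary(4, 2))  # ['00', '01', '10', '11']
print(one_hot(4))    # ['0001', '0010', '0100', '1000']
```

Note the flip-flop cost: binary and Gray need ceil(log2(n)) flops for n states, while one-hot needs n.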
Tradeoffs in Encoding Schemes

• Gray is good for low-power applications where consecutive data values typically differ by 1 (e.g. no random jumps).
• One-hot usually has less combinational logic and runs faster than binary for machines with up to a dozen or so states. With more than a dozen states, the extra flip-flops required by one-hot encoding become too expensive.
• Custom is great if you have lots of time and are incredibly intelligent, or have deep insight into the guts of your design.

Note: Don't-care values. When we don't care what the value of a signal is, we assign the signal '-', which is "don't care" in VHDL. This should allow the synthesis tool to use whatever value is most helpful in simplifying the Boolean equations for the signal (e.g. Karnaugh maps). In the past, some groups in E&CE 327 have used '-' quite successfully to decrease the area of their design. However, a few groups found that using '-' increased the size of their design when they were expecting it to decrease the size. So, if you are tweaking your design to squeeze out the last few unneeded FPGA cells, pay close attention to whether using '-' hurts or helps.

2.6 Dataflow Diagrams

2.6.1 Dataflow Diagrams Overview

• Dataflow diagrams are data-dependency graphs where the computation is divided into clock cycles.
• Purpose:
  – Provide a disciplined approach for designing datapath-centric circuits
  – Guide the design from algorithm, through high-level models, and finally to register-transfer-level code for the datapath and control circuitry
  – Estimate area and performance
  – Make tradeoffs between different design options
• Background:
  – Based on techniques from high-level synthesis tools
  – There is some similarity between high-level synthesis and software compilation
  – Each dataflow diagram corresponds to a basic block in software-compiler terminology
[Figure: the data-dependency graph for z = a + b + c + d + e + f (a serial chain of five adders through intermediates x1..x4), and the corresponding dataflow diagram for z = a + b + c + d + e + f, with the additions divided into clock cycles.]
[Figure: the dataflow diagram for z = a + b + c + d + e + f with one addition per clock cycle. Horizontal lines mark clock cycle boundaries.]

The use of memory arrays in dataflow diagrams is described in section 2.11.4.
2.6.2 Dataflow Diagrams, Hardware, and Behaviour

[Figures: each of the four patterns — primary input, register input, register signal, and combinational-component output — is shown three ways: as dataflow-diagram notation, as the hardware it corresponds to, and as a waveform of its clock-by-clock behaviour.]
2.6.3 Dataflow Diagram Execution

[Figures: execution waveforms for the one-add-per-cycle dataflow diagram of z = a + b + c + d + e + f, first with registers on both inputs and outputs (a, x1..x5, z appear in successive clock cycles), then without output registers (z appears one cycle earlier).]
2.6.4 Performance Estimation

Performance Equations

    Performance ∝ 1 / TimeExec
    TimeExec = Latency × ClockPeriod

Definition Latency: The number of clock cycles from inputs to outputs. A combinational circuit has a latency of zero. A single register has a latency of one. A chain of n registers has a latency of n.

There is much more information on performance in chapter 3, which is devoted to performance.

Performance of Dataflow Diagrams

• Latency: count the horizontal lines in the diagram
• Minimum clock period (maximum clock speed) is limited by the longest path in a clock cycle

2.6.5 Area Estimation

• The maximum number of blocks of a component in any clock cycle is the total number of that component that are needed
• The maximum number of signals that cross a cycle boundary is the total number of registers that are needed
• The maximum number of unconnected signal tails in a clock cycle is the total number of inputs that are needed
• The maximum number of unconnected signal heads in a clock cycle is the total number of outputs that are needed

The information above is only for estimating the number of components that are needed. In fact, these estimates give lower bounds. There might be constraints on your design that will force you to use more components (e.g., you might need to read all of your inputs at the same time). Implementation-technology factors, such as the relative size of registers, multiplexers, and datapath components, might force you to make tradeoffs that increase the number of datapath components to decrease the overall area of the circuit.

Of particular relevance to FPGAs:
• With some FPGA chips, a 2:1 multiplexer has the same area as an adder.
• With some FPGA chips, a 2:1 multiplexer can be combined with an adder into one FPGA cell per bit.
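The performance equations above amount to a single multiplication; a tiny sketch (Python, with made-up numbers) applies them:

```python
def exec_time(latency, clock_period):
    """TimeExec = Latency * ClockPeriod; performance is proportional to 1/TimeExec."""
    return latency * clock_period

# hypothetical design: latency of 6 clock cycles at a 10 ns clock period
print(exec_time(6, 10))        # 60 (ns)

# a combinational circuit has a latency of zero, hence zero execution time
assert exec_time(0, 10) == 0
```

The two knobs trade off against each other: packing more work into each cycle lowers the latency but lengthens the clock period, which is the subject of the area/performance tradeoffs below.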
• In FPGAs, registers are usually "free", in that the area consumed by a circuit is limited by the amount of combinational logic, not the number of flip-flops.

In comparison, with ASICs and custom VLSI, 2:1 multiplexers are much smaller than adders, and registers are quite expensive in area.

2.6.6 Design Analysis

For the one-add-per-clock-cycle dataflow diagram of z = a + b + c + d + e + f:

    num inputs        6
    num outputs       1
    num registers     6
    num adders        1
    min clock period  delay through flop and one adder
    latency           6 clock cycles

2.6.7 Area / Performance Tradeoffs

[Figure: two schedules for z = a + b + c + d + e + f: one add per clock cycle (six cycles, intermediates x1..x5) and two adds per clock cycle (four cycles).]

Note: In the "two-add" design, half of the last clock cycle is wasted.

Two Adds per Clock Cycle
[Figure: execution waveform for the two-adds-per-clock-cycle schedule: inputs a..f are consumed over the first cycles, intermediates x1..x5 are computed two additions at a time, and z is produced in the fourth cycle.]
Design Comparison

[Figure: the one-add-per-clock-cycle and two-adds-per-clock-cycle dataflow diagrams side by side.]

                   One add per clock cycle   Two adds per clock cycle
    inputs         6                         6
    outputs        1                         1
    registers      6                         6
    adders         1                         2
    clock period   flop + 1 add              flop + 2 adds
    latency        6                         4

Question: Under what circumstances would each design option be fastest?

Answer: time = latency × clock period; compare the execution times of both options:

    T1 = 6 × (Tf + Ta)
    T2 = 4 × (Tf + 2 × Ta)

One-add will be faster when T1 < T2:

    6 × (Tf + Ta) < 4 × (Tf + 2 × Ta)
    6Tf + 6Ta     < 4Tf + 8Ta
    2Tf           < 2Ta
    Tf            < Ta

Sanity check: If an add is slower than a flop, then we want to minimize the number of adds. One-add has fewer adds, so one-add will be faster when an add is slower than a flop.
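The crossover derived above can be checked numerically. The sketch below (Python; the delay values Tf and Ta are arbitrary assumed numbers, not measured data) plugs sample delays into T1 and T2:

```python
# T1: one add per cycle, 6 cycles; T2: two adds per cycle, 4 cycles
def t_one_add(Tf, Ta):
    return 6 * (Tf + Ta)

def t_two_add(Tf, Ta):
    return 4 * (Tf + 2 * Ta)

# one-add wins exactly when the flop is faster than the adder (Tf < Ta):
assert t_one_add(Tf=1, Ta=2) < t_two_add(Tf=1, Ta=2)    # 18 < 20
assert t_one_add(Tf=2, Ta=1) > t_two_add(Tf=2, Ta=1)    # 18 > 16
assert t_one_add(Tf=1, Ta=1) == t_two_add(Tf=1, Ta=1)   # 12 == 12
```

The three cases bracket the crossover: the designs tie exactly when the flop and adder delays are equal.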
2.7 Design Example: Massey

We'll go through the following artifacts:
1. requirements
2. algorithm
3. dataflow diagram
4. high-level models
5. hardware block diagram
6. RTL code for datapath
7. state machine
8. RTL code for control

Design Process

1. Scheduling (allocate operations to clock cycles)
2. I/O allocation
3. First high-level model
4. Register allocation
5. Datapath allocation
6. Connect datapath components, insert muxes where needed
7. Design implicit state machine
8. Optimize
9. Design explicit-current state machine
10. Optimize

2.7.1 Requirements

Functional requirements:
• Compute the sum of six 8-bit numbers: output = a + b + c + d + e + f
• Use registers on both inputs and outputs

Performance requirements:
• Maximum clock period: unlimited
• Maximum latency: four

Cost requirements:
• Maximum of two adders
• Small miscellaneous hardware (e.g. muxes) is unlimited
• Maximum of three inputs and one output
• Design effort is unlimited

Note: In reality, multiplexers are not free. In FPGAs, a 2:1 mux is more expensive than a full-adder: a 2:1 mux has three inputs while an adder has only two inputs (the carry-in and carry-out signals usually use the special "vertical" connections on the FPGA cell). In FPGAs, sharing an adder between two signals can be more expensive than having two adders. In a "generic-gate" technology, a multiplexer contains three two-input gates, while a full-adder contains fourteen two-input gates.

2.7.2 Algorithm

We'll use parentheses to group operations so as to maximize our opportunities to perform the work in parallel:

    z = (a + b) + (c + d) + (e + f)

[Figure: the resulting data-dependency graph, with three adders operating in parallel on a..f and two more adders combining the partial sums.]

2.7.3 Initial Dataflow Diagram

[Figure: the initial dataflow diagram for z = (a + b) + (c + d) + (e + f), reading all six inputs.]

This dataflow diagram violates the requirement to use at most three inputs.
2.7.4 Dataflow Diagram Scheduling

We can potentially optimize the inputs, outputs, area, and performance of a dataflow diagram by rescheduling the operations, that is, allocating the operations to different clock cycles.

Parallel algorithms have higher performance and greater scheduling flexibility than serial algorithms. Serial algorithms tend to have less area than parallel algorithms.

[Figure: data-dependency graphs for the serial expression (((((a+b)+c)+d)+e)+f) and the parallel expression (a+b)+(c+d)+(e+f).]
Scheduling to Optimize Area

[Figure: the original parallel dataflow diagram and the parallel dataflow diagram after rescheduling.]

                   Original parallel   Parallel after scheduling
    inputs         6                   4
    outputs        1                   1
    registers      6                   4
    adders         3                   2
    clock period   flop + 1 add        flop + 1 add
    latency        3                   3

Scheduling to Optimize Inputs

Rescheduling the dataflow diagram from the parallel algorithm reduced the area from three adders to two. However, it still violates the restriction of a maximum of three inputs. We can reschedule the operations to keep the same area, but reduce the number of inputs. The tradeoff is that reducing the number of inputs causes an increase in the latency from four to five. A latency of five violates the design requirement of a maximum latency of four clock cycles.

[Figure: the rescheduled parallel dataflow diagram with at most two inputs per clock cycle and a latency of five.]

In comparing the dataflow diagram above with the design requirements, we notice that the requirements allow a clock cycle that includes two additions and three inputs.
It appears that the parallel algorithm will not lead us to a design that satisfies the requirements. We revisit the algorithm and try a serial algorithm:

    z = ((((a + b) + c) + d) + e) + f

[Figure: the corresponding dataflow diagram, with two additions and up to three inputs per clock cycle: a, b, c are read first (x1, x2), then d, e (x3, x4), then f (z).]

2.7.5 Optimize Inputs and Outputs

When we rescheduled the parallel algorithm, we rescheduled the input values. This requires renegotiating the schedule of input values with our environment. Sometimes the environment of our circuit will be willing to reschedule the inputs, but in other situations the environment will impose a non-negotiable schedule upon us.

If you are currently storing all inputs and can change the environment's behaviour to delay sending some inputs, then you can reduce the number of inputs and registers. We will illustrate this on both the one-add and the two-add designs.

[Figure: the one-add dataflow diagram before I/O optimization (all six inputs in the first cycle) and after I/O optimization (inputs a and b first, then one input per cycle).]

              One-add before I/O opt   One-add after I/O opt
    inputs    6                        2
    regs      6                        2
[Figure: the two-add dataflow diagram before I/O optimization (all six inputs in the first cycle) and after I/O optimization (inputs a, b, c in the first cycle, then d, e, then f).]

              Two-add before I/O opt   Two-add after I/O opt
    inputs    6                        3
    regs      6                        3

Design Comparison Between One and Two Add

[Figure: the one-add and two-add dataflow diagrams after I/O optimization.]

                   One-add after I/O opt   Two-add after I/O opt
    inputs         2                       3
    outputs        1                       1
    registers      2                       3
    adders         1                       2
    clock period   flop + 1 add            flop + 2 adds
    latency        6                       4
Hardware Recipe for Two-Add

We return now to the two-add design. Based on its dataflow diagram, we can determine the hardware resources required for the datapath.

    Table 2.2: Hardware Recipe for Two-Add
    inputs                               3
    adders                               2
    registers                            3
    outputs                              1
    registered inputs                    yes
    registered outputs                   yes
    clock cycles from inputs to outputs  4

2.7.6 Input/Output Allocation

Our first step after settling on a hardware recipe is I/O allocation, because that determines the interface between our circuit and the outside world. From the hardware recipe, we know that we need only three inputs and one output. However, we have six different input values. We need to allocate these input values to input signals before we can write a high-level model that performs the computation of our design.

Based on the input and output information in the hardware recipe, we can define our entity:

    entity massey is
      port (
        clk        : in  std_logic;
        i1, i2, i3 : in  unsigned(7 downto 0);
        o1         : out unsigned(7 downto 0)
      );
    end massey;
    2.7.6 Input/Output Allocation 155 i1 i2 i3 i1 i2 a b i3 c + x1 + i2 d i3 e x2 + + x3 + i2 f x4 + + z o1 o1 Figure 2.4: Dataflow diagram and hardware block diagram with I/O port allocation Based upon the dataflow diagram after I/O architecture hlm_v1 of massey is allocation, we can write our first high-level ...internal signal decls... model (hlm v1). process begin wait until rising_edge(clk); In the high-level model the entire circuit will a <= i1; be implemented in a single process. For b <= i2; larger circuits it may be beneficial to have c <= i3; separate processes for different groups of wait until rising_edge(clk); signals. x2 <= (a + b) + c; d <= i2; In the high-level model, the code between e <= i3; wait statements describes the work that is wait until rising_edge(clk); done in a clock cycle. x4 <= (x2 + d) + e; f <= i2; The hlm v1 architecture uses an implicit wait until rising_edge(clk); state machine. z <= (x4 + f); end process; Because the process is clocked, all of the o1 <= z; signals that are assigned to in the process are end hlm_v1; registers. Combinational signals would need to be done using concurrent assignments or combinational processes.
2.7.7 Register Allocation

The next step after I/O allocation could be either register allocation or datapath allocation. The benefit of doing register allocation first is that it is possible to write VHDL code after register allocation is done but before datapath allocation is done, while the inverse (datapath allocation done but register allocation not done) does not make sense if written in a hardware description language. In this example, we will do register allocation before datapath allocation, and show the resulting VHDL code.

    I/O Allocation        i1: a          i2: b, d, f    i3: c, e    o1: z
    Register Allocation   r1: a, x2, x4  r2: b, d, f    r3: c, e

    architecture hlm_v2 of massey is
      ...internal signal decls...
    begin
      process begin
        wait until rising_edge(clk);
        r1 <= i1;
        r2 <= i2;
        r3 <= i3;
        wait until rising_edge(clk);
        r1 <= (r1 + r2) + r3;
        r2 <= i2;
        r3 <= i3;
        wait until rising_edge(clk);
        r1 <= (r1 + r2) + r3;
        r2 <= i2;
        wait until rising_edge(clk);
        r3 <= (r1 + r2);
      end process;
      o1 <= r3;
    end hlm_v2;

[Figure 2.5: Block diagram after I/O and register allocation.]
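As a cross-check of the schedule, the hlm_v2 process can be transliterated into a plain Python function (a sketch of mine, not part of the course's models): each tuple assignment below corresponds to one clock cycle, so r1, r2, and r3 update simultaneously, as registers do.

```python
def massey(a, b, c, d, e, f):
    """Cycle-by-cycle model of hlm_v2; inputs arrive as i1=a, i2=b,d,f, i3=c,e."""
    # cycle 1: register the first three inputs
    r1, r2, r3 = a, b, c
    # cycle 2: r1 <= (r1 + r2) + r3; register the next two inputs
    r1, r2, r3 = (r1 + r2) + r3, d, e
    # cycle 3: same datapath operation; register the last input (r3 holds)
    r1, r2, r3 = (r1 + r2) + r3, f, r3
    # cycle 4: r3 <= r1 + r2; o1 is driven from r3
    r3 = r1 + r2
    return r3

print(massey(1, 2, 3, 4, 5, 6))   # 21
```

Four cycles from the first inputs to the output, matching the latency requirement and the hardware recipe.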
2.7.8 Datapath Allocation

In datapath allocation, we allocate each of the data operations in the dataflow diagram to one of the datapath components in the hardware block diagram.

[Figure 2.6: Block diagram after I/O, register, and datapath allocation]

I/O Allocation:       i1 -> a           i2 -> b, d, f    i3 -> c, e    o1 -> z
Register Allocation:  r1 -> a, x2, x4   r2 -> b, d, f    r3 -> c, e
Datapath Allocation:  a1 -> x1, x3, z   a2 -> x2, x4

architecture hlm_dp of massey is
  ...internal signal decls...
begin
  process begin
    wait until rising_edge(clk);
    r1 <= i1;
    r2 <= i2;
    r3 <= i3;
    wait until rising_edge(clk);
    r1 <= a2;
    r2 <= i2;
    r3 <= i3;
    wait until rising_edge(clk);
    r1 <= a2;
    r2 <= i2;
    wait until rising_edge(clk);
    r3 <= a1;
  end process;
  a1 <= r1 + r2;
  a2 <= a1 + r3;
  o1 <= r3;
end hlm_dp;
2.7.9 Datapath for DP+Ctrl Model

We will now evolve from an implicit state machine to an explicit state machine. The first step is to label the states in the dataflow diagram and then construct tables to find the values for chip-enable and mux-select signals.

[Dataflow diagram with states S0–S3 labelled on the clock-cycle boundaries, wrapping from S3 back to S0]

Datapath for DP+Ctrl Model .........................................................

        r1            r2            r3         |  a1                  a2
S0   ce=1, d=i1    ce=1, d=i2    ce=1, d=i3    |  src1=-, src2=-      src1=-, src2=-
S1   ce=1, d=a2    ce=1, d=i2    ce=1, d=i3    |  src1=r1, src2=r2    src1=a1, src2=r3
S2   ce=1, d=a2    ce=1, d=i2    ce=-, d=-     |  src1=r1, src2=r2    src1=a1, src2=r3
S3   ce=-, d=-     ce=-, d=-     ce=1, d=a1    |  src1=r1, src2=r2    src1=-, src2=-

Choose Don't-Care Values .............................................................

        r1            r2            r3         |  a1                  a2
S0   ce=1, d=i1    ce=1, d=i2    ce=1, d=i3    |  src1=r1, src2=r2    src1=a1, src2=r3
S1   ce=1, d=a2    ce=1, d=i2    ce=1, d=i3    |  src1=r1, src2=r2    src1=a1, src2=r3
S2   ce=1, d=a2    ce=1, d=i2    ce=1, d=i3    |  src1=r1, src2=r2    src1=a1, src2=r3
S3   ce=1, d=a2    ce=1, d=i2    ce=1, d=a1    |  src1=r1, src2=r2    src1=a1, src2=r3
Simplify .............................................................................

        r1      r2 = i2    r3
S0    d=i1                d=i3          a1: src1=r1, src2=r2
S1    d=a2                d=i3          a2: src1=a1, src2=r3
S2    d=a2                d=i3
S3    d=a2                d=a1

VHDL Code .........................................................................

architecture explicit_v1 of massey is
  subtype state_ty is std_logic_vector(3 downto 0);
  constant s0 : state_ty := "0001";
  constant s1 : state_ty := "0010";
  constant s2 : state_ty := "0100";
  constant s3 : state_ty := "1000";
  signal state : state_ty;
begin
  ----------------------
  -- r_1
  process (clk) begin
    if rising_edge(clk) then
      if state = S0 then
        r_1 <= i_1;
      else
        r_1 <= a_2;
      end if;
    end if;
  end process;

  ----------------------
  -- r_2
  process (clk) begin
    if rising_edge(clk) then
      r_2 <= i_2;
    end if;
  end process;

  ----------------------
  -- r_3
  process (clk) begin
    if rising_edge(clk) then
      if state = S3 then
        r_3 <= a_1;
      else
        r_3 <= i_3;
      end if;
    end if;
  end process;

  ----------------------
  -- combinational datapath
  a_1 <= r_1 + r_2;
  a_2 <= a_1 + r_3;
  o_1 <= r_3;

  ----------------------
  -- state machine
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= S0;
      else
        case state is
          when S0     => state <= S1;
          when S1     => state <= S2;
          when S2     => state <= S3;
          when others => state <= S0;
        end case;
      end if;
    end if;
  end process;

end explicit_v1;

2.7.10 Peephole Optimizations

Peephole optimizations are localized optimizations to code, in that they affect only a few lines of code. In hardware design, peephole optimizations are usually done to decrease the clock period, although some optimizations might also decrease area. There are many different types of optimizations, and many optimizations that designers do by hand are things that you might expect a synthesis tool to do automatically.

In a comparison such as state = S0, when we use a one-hot state encoding, we need to compare only one of the bits of the state. The comparison can be simplified to state(0) = '1'. Without this optimization, many synthesis tools will produce hardware that tests all of the bits of the state signal. This increases the area, because more bits are required as inputs to the comparison, and increases the clock period, because the wider comparison leads to a tree-like structure of combinational logic, or an increased number of FPGA cells.
In this example, we will take advantage of our state encoding to optimize the code for r_1, r_3, and the state machine.

-- r_1
process (clk) begin
  if rising_edge(clk) then
    if state = S0 then
      r_1 <= i_1;
    else
      r_1 <= a_2;
    end if;
  end if;
end process;

-- r_1 (optimized)
process (clk) begin
  if rising_edge(clk) then
    if state(0) = '1' then
      r_1 <= i_1;
    else
      r_1 <= a_2;
    end if;
  end if;
end process;

The code for r_2 remains unchanged.

-- r_3
process (clk) begin
  if rising_edge(clk) then
    if state = S3 then
      r_3 <= a_1;
    else
      r_3 <= i_3;
    end if;
  end if;
end process;

-- r_3 (optimized)
process (clk) begin
  if rising_edge(clk) then
    if state(3) = '1' then
      r_3 <= a_1;
    else
      r_3 <= i_3;
    end if;
  end if;
end process;

-- state machine
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      state <= S0;
    else
      case state is
        when S0     => state <= S1;
        when S1     => state <= S2;
        when S2     => state <= S3;
        when others => state <= S0;
      end case;
    end if;
  end if;
end process;

-- state machine (optimized)
-- NOTE: "st" = "state"
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      st <= S0;
    else
      for i in 0 to 3 loop
        st( (i+1) mod 4 ) <= st(i);
      end loop;
    end if;
  end if;
end process;
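The optimized state machine treats the one-hot state vector as a shift register that rotates by one bit per cycle. That rotation is easy to check outside of VHDL; the following Python model (an illustration of the loop st((i+1) mod 4) <= st(i), not course material) confirms the state returns to S0 after four cycles.

```python
def onehot_next(state):
    """Rotate a 4-bit one-hot state by one position: S0->S1->S2->S3->S0.

    state is a list of 4 bits indexed like st(0)..st(3); this models the
    VHDL loop  st((i+1) mod 4) <= st(i).
    """
    nxt = [0] * 4
    for i in range(4):
        nxt[(i + 1) % 4] = state[i]
    return nxt

s = [1, 0, 0, 0]          # S0 in one-hot encoding
for _ in range(4):
    s = onehot_next(s)
print(s)  # back to S0 after four cycles -> [1, 0, 0, 0]
```

Because every next-state bit is just a copy of a neighbouring flip-flop's output, the optimized machine needs no decoding logic at all, which is exactly why the one-hot encoding pays off here.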
The hardware block diagram that corresponds to the tables and VHDL code is:

[Hardware block diagram: one-hot state register State(0)..State(3) with reset; inputs i1, i2, i3; registers r1, r2, r3; adders a1 and a2; output o1]

2.8 Design Example: Vanier

We'll go through the following artifacts:
1. requirements
2. algorithm
3. dataflow diagram
4. high-level models
5. hardware block diagram
6. RTL code for datapath
7. state machine
8. RTL code for control

Design Process .......................................................................

1. Scheduling (allocate operations to clock cycles)
2. I/O allocation
3. First high-level model
4. Register allocation
5. Datapath allocation
6. Connect datapath components, insert muxes where needed
7. Design implicit state machine
8. Optimize
9. Design explicit-current state machine
10. Optimize

2.8.1 Requirements

• Functional requirements: compute the following formula:
      output = (a × d) + c + (d × b) + b
• Performance requirements:
  – Max clock period: flop plus (2 adds or 1 multiply)
  – Max latency: 4
• Cost requirements:
  – Maximum of two adders
  – Maximum of two multipliers
  – Unlimited registers
  – Maximum of three inputs and one output
  – Maximum of 5000 student-minutes of design effort
• Registered inputs and outputs

2.8.2 Algorithm

Create a data-dependency graph for the algorithm.

[Data-dependency graph for the formula, drawn with the inputs ordered a, d, b, c]

NOTE: if we draw the data-dependency graph in alphabetical order, it is ugly. The lesson is to think about the layout, and possibly re-do the layout to make it simple and easy to understand, before proceeding.
2.8.3 Initial Dataflow Diagram

Schedule operations into clock cycles. Use an "as soon as possible" schedule, obeying the performance requirement of a maximum clock period of one multiply or two additions. In this initial diagram, we ignore the resource requirements. This allows us to establish a lower bound on the latency, which gives us the maximum performance that we can hope to achieve.

[Initial dataflow diagram: all four inputs a, d, b, c read in the first clock cycle]

2.8.4 Reschedule to Meet Requirements

We have four inputs, but the requirements allow a maximum of three. We need to move one input into the second clock cycle. We want to choose an input that can be delayed by one clock cycle without violating a requirement and with minimal degradation of performance (clock period and latency).

If delaying an input by a clock cycle causes a requirement to be violated, we can often reschedule the operations to remove the violation. So, we sometimes create an intermediate dataflow diagram that violates a requirement, then reschedule the operations to bring the dataflow diagram back into compliance.

The critical path is from d and b, through a multiplier, the middle adder, the final adder, and then out through z. Because the inputs d and b are on the critical path, it would be preferable to choose another input (either a or c) to move into the second clock cycle. If we move c, we will move the first addition into the second clock cycle, which will force us to use three adders, violating our resource requirement of a maximum of two adders.

By process of elimination, we have settled on a as the input to be delayed. This causes one of the multiply operations to be moved into the second clock cycle, which is good because it reduces our resources from two multipliers to just one.

[Rescheduled dataflow diagram: a moved into the second clock cycle]
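The rescheduling argument above is really just checking a candidate schedule against the resource limits. That check can be scripted; the toy Python checker below is my own illustration (the per-cycle dictionary format and the `violations` helper are invented for this sketch, not from the notes), using the Vanier limits of three inputs, two adders, and two multipliers.

```python
# Requirements: at most 3 input ports, 2 adders, 2 multipliers per cycle.
LIMITS = {"inputs": 3, "adds": 2, "muls": 2}

def violations(schedule):
    """Return (cycle, resource) pairs where a schedule exceeds a limit.

    schedule is a list of per-cycle usage dicts,
    e.g. {"inputs": 4, "adds": 1, "muls": 2}.
    """
    return [(cyc, res)
            for cyc, use in enumerate(schedule)
            for res, n in use.items()
            if n > LIMITS[res]]

# Initial ASAP schedule: all four inputs arrive in cycle 0.
asap = [{"inputs": 4, "muls": 2, "adds": 0},
        {"adds": 2},
        {"adds": 1}]
print(violations(asap))  # -> [(0, 'inputs')]: too many input ports

# After delaying input a by one cycle, each cycle is within the limits.
rescheduled = [{"inputs": 3, "muls": 1, "adds": 0},
               {"inputs": 1, "muls": 1, "adds": 1},
               {"adds": 2}]
print(violations(rescheduled))  # -> [] (no violations)
```

This kind of mechanical check is useful precisely because, as the notes observe, rescheduling often passes through intermediate diagrams that violate a requirement.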
Moving a into the second clock cycle has caused a clock-period violation, because our clock period is now a register, a multiply, and an add. This forces us to add an additional clock cycle, which gives us a latency of four.

[Dataflow diagram after adding a clock cycle to repair the clock-period violation]

2.8.5 Optimize Resources

We can exploit the additional clock cycle to reschedule our operations to reduce the number of inputs from three to two. The disadvantage is that we have increased the number of registers from four to five.

[Dataflow diagram with inputs d and b in the first clock cycle and a and c in the second]

Two side comments:

• Moving the second addition from the third clock cycle to the second will not improve the performance or the area. The number of adders will remain at two, the number of registers will remain at five, and the clock period will remain at the maximum of a multiply or two additions.

• In hindsight, if we had chosen originally to move c, rather than a, into the second clock cycle, we would likely have produced this same dataflow diagram. After moving c, we would see the resource violation of three adders in the second clock cycle. This violation would cause us to add a third clock cycle, giving us an opportunity to move a into the second clock cycle.

The lesson is that there are usually several different ways to approach a design problem, and it is infeasible to predict which approach will result in the best design. At best, we have many heuristics, or "rules of thumb", that give us guidelines for techniques that usually work well.

Having finalized our input/output scheduling, we can write our entity. Note: we will add a reset signal later, when we design the state machine to control the datapath.
entity vanier is
  port (
    clk      : in  std_logic;
    i_1, i_2 : in  std_logic_vector(15 downto 0);
    o_1      : out std_logic_vector(15 downto 0)
  );
end vanier;
2.8.6 Assign Names to Registered Values

We must assign a name to each registered value. Optionally, we may also assign names to combinational values. Registers require names, because in VHDL each register (except implicit state registers) is associated with a named signal. Combinational signals do not require names, because VHDL allows anonymous (unnamed) combinational signals. For example, in the expression (a+b)+c we do not need to provide a name for the sum of a and b.

[Dataflow diagram with the registered values named x1..x8]

If a single value spans multiple clock cycles, it only needs to be named once. In our example, x1, x2, and x4 each cross two clock-cycle boundaries.
2.8.7 Input/Output Allocation

Now that we have names for all of our registered signals, we can allocate input and output ports to signals. After the input and output ports have been allocated to signals, we can write our first model. We use an implicit state machine and define only the registered values. In each state, we define the values of the registered values that are computed in that state.

[Dataflow diagram with I/O allocation: i1 carries d then a; i2 carries b then c; o1 carries z]

architecture hlm_v1 of vanier is
  signal x_1, x_2, x_3, x_4, x_5, x_6,
         x_7, x_8 : unsigned(15 downto 0);
begin
  process begin
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    x_1 <= unsigned(i_1);
    x_2 <= unsigned(i_2);
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    x_3 <= unsigned(i_1);
    x_4 <= x_1(7 downto 0) * x_2(7 downto 0);
    x_5 <= unsigned(i_2);
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    x_6 <= x_3(7 downto 0) * x_1(7 downto 0);
    x_7 <= x_2 + x_5;
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    x_8 <= x_6 + (x_4 + x_7);
  end process;
  o_1 <= std_logic_vector(x_8);
end hlm_v1;
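It is worth convincing ourselves that this schedule really computes the required formula. The Python model below is an illustration, not part of the notes; integer arithmetic stands in for the unsigned vectors, and the argument order reflects the I/O allocation (i_1 carries d then a, i_2 carries b then c).

```python
def vanier_hlm_v1(d, b, a, c):
    """Cycle-by-cycle model of vanier's hlm_v1 schedule."""
    x1, x2 = d, b                  # cycle 0: register the first two inputs
    x3, x4, x5 = a, x1 * x2, c     # cycle 1: x4 = d*b, read a and c
    x6, x7 = x3 * x1, x2 + x5      # cycle 2: x6 = a*d, x7 = b + c
    x8 = x6 + (x4 + x7)            # cycle 3: final sum
    return x8

# Matches the requirement output = (a*d) + c + (d*b) + b:
a, b, c, d = 3, 5, 7, 2
print(vanier_hlm_v1(d, b, a, c) == (a * d) + c + (d * b) + b)  # -> True
```

Tracing the intermediate names x1..x8 against the dataflow diagram in this way is a cheap check that the I/O allocation did not scramble the operand order.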
[Register-lifetime chart: clock cycles 0–4 across the top; the named values x1..x8 mapped onto registers r1..r5]

The model hlm_v1 is synthesizable. If we are happy with the clock speed and area, we can stop now! The remaining steps of the design process seek to optimize the design by reducing the area and clock period. For area, we will reduce the number of registers, datapath components, and multiplexers. Reducing the clock period will occur as we reduce the number of multiplexers and potentially perform peephole (localized) optimizations, such as Boolean simplification.
2.8.8 Tangent: Combinational Outputs

To demonstrate a high-level model where the output is combinational, we modify hlm_v1 so that the output is combinational, rather than a register (see hlm_v1c). To make the output (x_8) combinational, we move the assignment to x_8 out of the main clocked process and into a concurrent statement.

architecture hlm_v1c of vanier is
  signal x_1, x_2, x_3, x_4, x_5, x_6,
         x_7 : unsigned(15 downto 0);
begin
  process begin
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    x_1 <= unsigned(i_1);
    x_2 <= unsigned(i_2);
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    x_3 <= unsigned(i_1);
    x_4 <= x_1(7 downto 0) * x_2(7 downto 0);
    x_5 <= unsigned(i_2);
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    x_6 <= x_3(7 downto 0) * x_1(7 downto 0);
    x_7 <= x_2 + x_5;
  end process;
  o_1 <= std_logic_vector(x_6 + (x_4 + x_7));
end hlm_v1c;
2.8.9 Register Allocation

Our previous model (hlm_v1) uses eight registers (x_1..x_8). However, our analysis of the dataflow diagrams says that we can implement the diagram with just five registers. Also, the code for hlm_v1 contains two occurrences of the multiplication symbol (*) and three occurrences of the addition symbol (+). Our analysis of the dataflow diagram showed that we need only one multiplier and two adders. In hlm_v1 we are relying on the synthesis tool to recognize that even though the code contains two multiplies and three adds, the hardware needs only one multiply and two adds.

Register allocation is the task of assigning each of our registered values to a register signal. Datapath allocation is the task of assigning each datapath operation to a datapath component. Only high-level synthesis tools (and software compilers) do register allocation. So, as hardware designers, we are stuck with the task of doing register allocation ourselves if we want to further optimize our design. Some register-transfer-level synthesis tools do datapath allocation. If your synthesis tool does datapath allocation, it is important to learn the idioms and limitations of the tool so that you can write your code in a style that allows the tool to do a good job of allocation and optimization. In most cases where area or clock speed are important design metrics, design engineers do datapath allocation by hand or with ad-hoc software and spreadsheets.

We will now step through the tasks of register allocation and datapath allocation. In our eight-register model, each register holds a unique value; we do not reuse registers. To reduce the number of registers from eight to five, we will need to reuse registers, so that a register potentially holds different values in different clock cycles.

When doing register allocation, we assign a register to each signal that crosses a clock-cycle boundary.
When creating the hardware block diagram, we will need to add multiplexers to the inputs of modules that are connected to multiple registers. To reduce the number of multiplexers, we try to allocate the same registers to the same inputs of the same type of module. For example, x7 is an input to an adder, so we allocate r5 to x7, because r5 was also an input to an adder in another clock cycle. Also, in the third clock cycle we allocate r2 to x6, because in the second clock cycle the inputs to an adder were r2 and r5. In the last clock cycle, we allocate r5 to x8, because previously r5 was used as the output of r2 + r5.

We update our model to reflect register allocation by replacing the signals for registered values (x_1..x_8) with the registers r_1..r_5.
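The core of register allocation is lifetime reuse: a register can be rewritten once the value it holds has had its last read. This can be sketched as a greedy algorithm. The Python toy below is my own illustration; the notes do the allocation by inspection, and this greedy pass does not attempt the mux-minimizing choices discussed above, so it will not necessarily reproduce the exact r1..r5 assignment. It does show why five registers suffice for the x1..x8 lifetimes.

```python
def allocate_registers(lifetimes):
    """Greedy lifetime-based register allocation.

    lifetimes: value name -> (cycle the value is written, last cycle it is
    read).  A register may be rewritten in the same cycle its old value is
    last read, because the write happens at the clock edge.
    Returns (value -> register index, number of registers used).
    """
    alloc = {}
    free_at = []              # (cycle the register becomes reusable, register)
    next_reg = 0
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1]):
        reg = None
        for k, (freed, r) in enumerate(free_at):
            if freed <= start:        # old value is dead: reuse the register
                reg = r
                del free_at[k]
                break
        if reg is None:               # no dead register: allocate a new one
            reg = next_reg
            next_reg += 1
        alloc[name] = reg
        free_at.append((end, reg))
    return alloc, next_reg

# Lifetimes of x1..x8 read off the Vanier dataflow diagram.
lifetimes = {"x1": (0, 2), "x2": (0, 2), "x3": (1, 2), "x4": (1, 3),
             "x5": (1, 2), "x6": (2, 3), "x7": (2, 3), "x8": (3, 4)}
alloc, nregs = allocate_registers(lifetimes)
print(nregs)  # -> 5, matching the five registers in the notes
```

The minimum register count equals the largest number of values simultaneously live across any clock-cycle boundary; here x1..x5 are all live across the boundary after the second cycle, so five is also a lower bound.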
architecture hlm_v2 of vanier is
  signal r_1, r_2, r_3, r_4, r_5 : unsigned(15 downto 0);
begin
  process begin
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    r_1 <= unsigned(i_1);
    r_2 <= unsigned(i_2);
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    r_3 <= unsigned(i_1);
    r_4 <= r_1(7 downto 0) * r_2(7 downto 0);
    r_5 <= unsigned(i_2);
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    r_2 <= r_3(7 downto 0) * r_1(7 downto 0);
    r_5 <= r_2 + r_5;
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    r_5 <= r_2 + (r_4 + r_5);
  end process;
  o_1 <= std_logic_vector(r_5);
end hlm_v2;

Both of our models so far (hlm_v1 and hlm_v2) have used implicit state machines. The optimization from hlm_v1 to hlm_v2 was done to reduce the number of registers by performing register allocation. Most of the remaining optimizations require an explicit state machine. We will construct an explicit state machine using a methodical procedure that gradually adds more information to the dataflow diagram. The first step in this procedure is datapath allocation, which is similar to register allocation, except that we allocate datapath components to datapath operations, rather than registers to names.

To control the datapath, we need to provide the following signals for registers and datapath components:

  registers:            chip-enable and mux-select signals
  datapath components:  instruction (e.g. add, sub, etc. for ALUs) and mux-select signals

After we determine the chip-enable, mux-select, and instruction signals, and then calculate what value each signal needs in each clock cycle, we can build the explicit state machine to control the datapath. After we build the state machine, we will add a reset to the design.
2.8.10 Datapath Allocation

In datapath allocation, we allocate an adder (either a1 or a2) to each addition operation and a multiplier (either m1 or m2) to each multiplication operation. As with register allocation, we attempt to reduce the number of multiplexers that will be required by connecting the same datapath component to the same register in multiple clock cycles.

[Dataflow diagram after datapath allocation: m1 computes x4 and x6; a1 computes x7 and x8; a2 computes x4 + x7]

2.8.11 Hardware Block Diagram and State Machine

To build an explicit state machine, we first determine what states we need. In this circuit, we need four states, one for each clock cycle in the dataflow diagram. If our algorithmic description had included control flow, such as loops and branches, then it would be more difficult to determine the states that are needed. We will use four states, S0..S3, where S0 corresponds to the first clock cycle (during which the input is read) and S3 corresponds to the last clock cycle.

2.8.11.1 Control for Registers

To determine the chip-enable and mux-select signals for the registers, we build a table where each state corresponds to a row and each register corresponds to a column. For each register and each state, we note whether the register loads in a new value (ce) and what signal is the source of the loaded data (d).

        r1          r2          r3          r4          r5
      ce   d      ce   d      ce   d      ce   d      ce   d
S0    1    i1     1    i2     –    –      –    –      –    –
S1    0    –      0    –      1    i1     1    m1     1    i2
S2    –    –      1    m1     –    –      0    –      1    a1
S3    –    –      –    –      –    –      –    –      1    a1
Eliminate unnecessary chip enables and muxes.

• A chip enable is needed if a register must hold a single value for multiple clock cycles (ce=0).
• A multiplexer is needed if a register loads in values from different sources in different clock cycles.

The register simplifications are as follows:

r1  Chip-enable, because S1 has ce=0. No multiplexer, because i1 is the only input.
r2  Chip-enable, because S1 has ce=0. Multiplexer to choose between i2 and m1.
r3  No chip enable, no multiplexer. The register r3 simplifies to be just r3=i1 without a multiplexer or chip-enable, because there is only one state where we care about its behaviour (S1); all of the other states are don't-cares for both chip enable and mux.
r4  Chip-enable, because S2 has ce=0. No multiplexer, because m1 is the only input.
r5  No chip-enable, because we do not have any states with ce=0. Multiplexer to choose between i2 and a1.

The simplified register table is shown below. For registers that do not have multiplexers, we show their input on the top row. For registers that need neither a chip enable nor a mux (e.g. r3), we write the assignment in the first row and leave the other rows blank.

      r1=i1   r2          r3=i1   r4=m1   r5
      ce      ce   d              ce      d
S0    1       1    i2             –       –
S1    0       0    –              1       i2
S2    –       1    m1             0       a1
S3    –       –    –              –       a1

The chip-enable and mux-select signals that are needed for the registers are: r1_ce, r2_ce, r2_sel, r4_ce, and r5_sel.

2.8.11.2 Control for Datapath Components

Analogous to the table for registers, we build a table for the datapath components. Each of our components has two inputs (src1 and src2). Each component performs a single operation (either addition or multiplication), so we do not need to define operation or instruction signals for the datapath components.

         a1                a2                m1
      src1   src2       src1   src2       src1   src2
S0    –      –          –      –          –      –
S1    –      –          –      –          r1     r2
S2    r2     r5         –      –          r3     r1
S3    r2     a2         r4     r5         –      –
Based on the table above, the adder a1 will need a multiplexer for src2. The multiplier m1 will need two multiplexers: one for each input. Note that the operands to addition and multiplication are commutative, so we can choose which signal goes to src1 and which to src2 so as to minimize the need for multiplexers. We notice that for m1, we can reduce the number of multiplexers from two to one by swapping the operands in the second clock cycle. This makes r1 the only source of operands for the src1 input. This optimization is reflected in the table below.

         a1                a2                m1
      src1   src2       src1   src2       src1   src2
S0    –      –          –      –          –      –
S1    –      –          –      –          r1     r2
S2    r2     r5         –      –          r1     r3
S3    r2     a2         r4     r5         –      –

The mux-select signals for the datapath components are: a1_src2_sel and m1_src2_sel.

2.8.11.3 Control for State

We need to control the transition from one state to the next. For this example, the transitions are very simple: each state transitions to its successor, S0 → S1 → S2 → S3 → S0 ...

2.8.11.4 Complete State Machine Table

The state machine table is shown below. Note that the state signal is a register; the table shows the next value of the signal.

      r1_ce  r2_ce  r2_sel  r4_ce  r5_sel  a1_src2_sel  m1_src2_sel  state
S0    1      1      i2      –      –       –            –            S1
S1    0      0      –       1      i2      –            r2           S2
S2    –      1      m1      0      a1      r5           r3           S3
S3    –      –      –       –      a1      a2           –            S0

We now choose instantiations for the don't-care values so as to simplify the circuitry. Different state encodings will lead to different simplifications. For fully-encoded states, Karnaugh maps are helpful in doing simplifications. For a one-hot state encoding, it is usually better to create situations where conditions are based upon a single state. The reason for this heuristic with one-hot encodings will become clear when we get to explicit_v2.
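The operand-swapping step for m1 above amounts to counting the distinct sources that feed each port of a shared component: a port needs a multiplexer exactly when it sees more than one source. A small Python illustration (the function and its tuple format are my own, not from the notes):

```python
def mux_inputs(schedule):
    """Count distinct sources feeding each port of a shared component.

    schedule: list of (src1, src2) operand pairs, one per state that uses
    the component.  A port needs a mux when its count is greater than 1.
    """
    src1 = {s1 for s1, _ in schedule}
    src2 = {s2 for _, s2 in schedule}
    return len(src1), len(src2)

# m1 before swapping: S1 computes r1*r2, S2 computes r3*r1.
print(mux_inputs([("r1", "r2"), ("r3", "r1")]))  # -> (2, 2): both ports muxed
# After swapping the commutative operands in S2: only src2 needs a mux.
print(mux_inputs([("r1", "r2"), ("r1", "r3")]))  # -> (1, 2)
```

Because multiplication is commutative, the swap changes nothing about the computed values; it only reduces the steering logic in front of the multiplier.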
r1_ce    We first choose 0 as the don't-care instantiation, because that leaves just one state where we need to load. Additionally, it is conceptually cleaner to do an assignment in just the one clock cycle where we care about the value, rather than not do an assignment in the one clock cycle where we must hold the value. (At the end of the don't-care allocation, we'll revisit this decision and change our mind.)

r2_ce    We choose 1 for S3, so that we have just one state where we do not do a load. If we had chosen 0 for r2_ce in S3, we would have two states where we do a load and two where we do not. If we were using fully-encoded states, this even separation might have left us with a very nice Karnaugh map; or it might have left us with a Karnaugh map that has a checkerboard pattern, which would not simplify. This helps illustrate why state encoding is a difficult problem.

r2_sel   We choose m1 arbitrarily. The choice of i2 would have also resulted in three assignments from one signal and one assignment from the other signal.

r4_ce    We choose 0, as we did for r1_ce.

r5_sel   We choose a1, so that we have three assignments from the same signal and just one assignment from the other signal.

a1_src2  We choose a2 arbitrarily.

m1_src2  We choose r3 arbitrarily.

r1_ce (again)  We examine r1_ce and r2_ce and see that if we choose 1 for the don't-care instantiation of r1_ce, we will have the same choices for both chip enables. This will simplify our state machine. Also, r4_ce is the negation of r2_ce, so we can use just an inverter to control r4_ce.

      r1_ce  r2_ce  r2_sel  r4_ce  r5_sel  a1_src2_sel  m1_src2_sel  state
S0    1      1      i2      0      a1      a2           r3           S1
S1    0      0      m1      1      i2      a2           r2           S2
S2    1      1      m1      0      a1      r5           r3           S3
S3    1      1      m1      0      a1      a2           r3           S0

2.8.12 VHDL Code with Explicit State Machine

VHDL code can be written directly from the tables and the dataflow diagram that shows register allocation, input allocation, and datapath allocation.
As a simplification, rather than writing explicit signals for the chip-enable and mux-select signals, we use selected and conditional signal assignments that test the state in the condition. We chose a one-hot encoding of the state, which usually results in small and fast hardware for state machines with sixteen or fewer states.
architecture explicit_v1 of vanier is
  signal r_1, r_2, r_3, r_4, r_5 : std_logic_vector(15 downto 0);
  subtype state_ty is std_logic_vector(3 downto 0);
  constant s0 : state_ty := "0001";
  constant s1 : state_ty := "0010";
  constant s2 : state_ty := "0100";
  constant s3 : state_ty := "1000";
  signal state : state_ty;
begin
  ----------------------
  -- r_1
  process (clk) begin
    if rising_edge(clk) then
      if state /= S1 then
        r_1 <= i_1;
      end if;
    end if;
  end process;

  ----------------------
  -- r_2
  process (clk) begin
    if rising_edge(clk) then
      if state /= S1 then
        if state = S0 then
          r_2 <= i_2;
        else
          r_2 <= m_1;
        end if;
      end if;
    end if;
  end process;

  ----------------------
  -- r_3
  process (clk) begin
    if rising_edge(clk) then
      r_3 <= i_1;
    end if;
  end process;

  ----------------------
  -- r_4
  process (clk) begin
    if rising_edge(clk) then
      if state = S1 then
        r_4 <= m_1;
      end if;
    end if;
  end process;

  ----------------------
  -- r_5
  process (clk) begin
    if rising_edge(clk) then
      if state = S1 then
        r_5 <= i_2;
      else
        r_5 <= a_1;
      end if;
    end if;
  end process;

  ----------------------
  -- combinational datapath
  with state select
    a1_src2 <= r_5 when S2,
               a_2 when others;
  with state select
    m1_src2 <= r_2 when S1,
               r_3 when others;
  a_1 <= r_2 + a1_src2;
  a_2 <= r_4 + r_5;
  m_1 <= r_1 * m1_src2;
  o_1 <= r_5;

  ----------------------
  -- state machine
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= S0;
      else
        case state is
          when S0     => state <= S1;
          when S1     => state <= S2;
          when S2     => state <= S3;
          when others => state <= S0;
        end case;
      end if;
    end if;
  end process;

end explicit_v1;

The hardware block diagram that corresponds to the tables and VHDL code is:
[Hardware block diagram for explicit_v1: one-hot state bits S0..S3 sequencing the annotated dataflow; registers r1..r5; multiplier m1; adders a1 and a2; inputs i1, i2; output o1]

2.8.13 Peephole Optimizations

We will illustrate several peephole optimizations that take advantage of our state encoding.

-- r_1
process (clk) begin
  if rising_edge(clk) then
    if state /= S1 then
      r_1 <= i_1;
    end if;
  end if;
end process;

-- r_1 (optimized)
process (clk) begin
  if rising_edge(clk) then
    if state(1) = '0' then
      r_1 <= i_1;
    end if;
  end if;
end process;

Analogous optimizations can be used when comparing against multiple states:
-- r_2
process (clk) begin
  if rising_edge(clk) then
    if state /= S1 then
      if state = S0 then
        r_2 <= i_2;
      else
        r_2 <= m_1;
      end if;
    end if;
  end if;
end process;

-- r_2 (optimized)
process (clk) begin
  if rising_edge(clk) then
    if state(1) = '0' then
      if state(0) = '1' then
        r_2 <= i_2;
      else
        r_2 <= m_1;
      end if;
    end if;
  end if;
end process;

Next-state assignment for a one-hot state machine can be done with a simple shift register:

-- state machine
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      state <= S0;
    else
      case state is
        when S0     => state <= S1;
        when S1     => state <= S2;
        when S2     => state <= S3;
        when others => state <= S0;
      end case;
    end if;
  end if;
end process;

-- state machine (optimized)
-- NOTE: "st" = "state"
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      st <= S0;
    else
      for i in 0 to 3 loop
        st( (i+1) mod 4 ) <= st(i);
      end loop;
    end if;
  end if;
end process;
The resulting optimized code is shown below.

architecture explicit_v2 of vanier is
  signal r_1, r_2, r_3, r_4, r_5 : std_logic_vector(15 downto 0);
  subtype state_ty is std_logic_vector(3 downto 0);
  constant s0 : state_ty := "0001";
  constant s1 : state_ty := "0010";
  constant s2 : state_ty := "0100";
  constant s3 : state_ty := "1000";
  signal state : state_ty;
begin
  ----------------------
  -- r_1
  process (clk) begin
    if rising_edge(clk) then
      if state(1) = '0' then
        r_1 <= i_1;
      end if;
    end if;
  end process;

  ----------------------
  -- r_2
  process (clk) begin
    if rising_edge(clk) then
      if state(1) = '0' then
        if state(0) = '1' then
          r_2 <= i_2;
        else
          r_2 <= m_1;
        end if;
      end if;
    end if;
  end process;

  ----------------------
  -- r_3
  process (clk) begin
    if rising_edge(clk) then
      r_3 <= i_1;
    end if;
  end process;

  ----------------------
  -- r_4
  process (clk) begin
    if rising_edge(clk) then
      if state(1) = '1' then
        r_4 <= m_1;
      end if;
    end if;
  end process;

  ----------------------
  -- r_5
  process (clk) begin
    if rising_edge(clk) then
      if state(1) = '1' then
        r_5 <= i_2;
      else
        r_5 <= a_1;
      end if;
    end if;
  end process;

  ----------------------
  -- combinational datapath
  a1_src2 <= r_5 when state(2) = '1' else a_2;
  m1_src2 <= r_2 when state(1) = '1' else r_3;
  a_1 <= r_2 + a1_src2;
  a_2 <= r_4 + r_5;
  m_1 <= r_1 * m1_src2;
  o_1 <= r_5;

  ----------------------
  -- state machine
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= S0;
      else
        for i in 0 to 3 loop
          state( (i+1) mod 4 ) <= state(i);
        end loop;
      end if;
    end if;
  end process;

end explicit_v2;
2.8.14 Notes and Observations

Our functional requirements were written as:

    output = (a × d) + (d × b) + b + c

Alternatively, we could have achieved exactly the same functionality with the functional requirements written as (the two statements are mathematically equivalent):

    output = (a × d) + b + (d × b) + c

The naive data-dependency graph for the alternative formulation is much messier than the data-dependency graph for the original formulation:

[Side-by-side data-dependency graphs for the original and alternative formulations; the alternative graph has crossing edges]

An observation: it can be helpful to explore several equivalent formulations of the mathematical equations while constructing the data-dependency graph. A mathematical formulation that places occurrences of the same identifier close to each other often results in a simpler data-dependency graph. The simpler the data-dependency graph, the easier it will be to identify helpful optimizations and efficient schedules.
2.9 Pipelining

Pipelining is one of the most common and most effective performance optimizations in hardware. Pipelining is used in systems ranging from simple signal-processing filters to high-performance microprocessors. Pipelining increases performance by overlapping the execution of multiple instructions or parcels of data, analogous to the way that multiple cars flow through an automobile assembly line.

Pipelines are difficult to design and verify, because subtle bugs can arise from the interactions between instructions flowing through the pipeline. There are intended interactions, which must happen correctly, and there might be unintended interactions, which constitute bugs. Computer architects categorize the interactions between instructions into three classes: structural hazards, control hazards, and data hazards. Our examples will all be pure datapath pipelines without any data or control dependencies between parcels of data. This eliminates most of the complexities of implementing pipelines correctly.

2.9.1 Introduction to Pipelining

Review of unpipelined dataflow diagram ..............................................

As a quick review of an unpipelined (also called "sequential") dataflow diagram, we revisit the one-add example from section 2.6.3.

[Figure: unpipelined dataflow diagram for the one-add example, with a waveform showing parcels α and β occupying r1, r2, and add1 over clock cycles 0 through 11]

The key feature to notice, in comparison to a pipelined dataflow diagram, is that the second parcel (β) begins execution only after the first parcel (α) has finished executing.

Pipelined dataflow diagram ............................................................

In a pipeline, each stage is a separate circuit, in that we cannot reuse the same component in multiple stages. When drawing a pipelined dataflow diagram, we effectively have multiple dataflow
diagrams: one for each stage. As a notational shorthand to avoid drawing multiple dataflow diagrams, we introduce a new bit of notation: a double line denotes a boundary between stages. We perform scheduling, resource allocation, and all of the other design steps individually for each stage.

Our first example of a pipelined dataflow diagram is a fully pipelined version of the previous example. In a fully pipelined dataflow diagram, each clock cycle becomes a separate stage. Notationally, we simply replace the single-line clock cycle boundaries with double-line stage boundaries.

[Figure: fully pipelined dataflow diagram with five stages (add1 through add5), and a waveform showing parcels α and β advancing through r1, r3, r5, r7, and r9 in consecutive clock cycles]

Sequential (Unpipelined) Hardware ....................................................

The hardware for the unpipelined dataflow diagram contains two registers, one adder, a multiplexer, and a state machine to control the multiplexer. When the data is produced by the adder at the end of each clock cycle, it is fed back to the multiplexer as a value for the next clock cycle.

[Figure: unpipelined hardware with inputs i1 and i2, reset, registers r1 and r2, adder add1, a multiplexer, a state machine with outputs State(0) through State(4), and output o1]

Pipelined Hardware and VHDL Code ..................................................
The hardware for the pipelined dataflow diagram contains two registers and one adder for each stage. The registers and adders do the same thing in each clock cycle, so there is no need for chip-enables, multiplexers, or a state machine.

[Figure: pipelined hardware with inputs i1 through i6, registers r1 through r10, and adders add1 through add5, one adder per stage]

-- stage 1
process begin
  wait until rising_edge(clk);
  r1 <= i1;
  r2 <= i2;
end process;

-- stage 2
process begin
  wait until rising_edge(clk);
  r3 <= r1 + r2;
  r4 <= i3;
end process;

-- stage 3
process begin
  wait until rising_edge(clk);
  r5 <= r3 + r4;
  r6 <= i4;
end process;

-- stage 4
process begin
  wait until rising_edge(clk);
  r7 <= r5 + r6;
  r8 <= i5;
end process;

-- stage 5
process begin
  wait until rising_edge(clk);
  r9 <= r7 + r8;
  r10 <= i6;
end process;

-- output
o1 <= r9 + r10;

The VHDL code above is designed to be easy to read by matching the structure of the hardware. An alternative style is to be more concise by grouping all of the registered assignments into a single clocked process, as shown below. The two styles are equivalent with respect to simulation and synthesis.

-- group all registered assignments into a single process
process begin
  wait until rising_edge(clk);
  r1 <= i1;
  r2 <= i2;
  r3 <= r1 + r2;
  r4 <= i3;
  r5 <= r3 + r4;
  r6 <= i4;
  r7 <= r5 + r6;
  r8 <= i5;
  r9 <= r7 + r8;
  r10 <= i6;
end process;
o1 <= r9 + r10;
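The clocked-process semantics, where every right-hand side reads the register values from just before the clock edge, can be mimicked in a small Python sketch. This is not part of the course notes, just an illustrative model using the signal names from the VHDL above:

```python
def tick(r, i):
    """One rising clock edge of the five-stage adder pipeline.

    All right-hand sides read the pre-edge register values in `r`,
    mirroring how every assignment in a single VHDL clocked process
    uses the signal values from just before the edge.
    """
    return {
        "r1": i["i1"],           "r2": i["i2"],    # stage 1
        "r3": r["r1"] + r["r2"], "r4": i["i3"],    # stage 2
        "r5": r["r3"] + r["r4"], "r6": i["i4"],    # stage 3
        "r7": r["r5"] + r["r6"], "r8": i["i5"],    # stage 4
        "r9": r["r7"] + r["r8"], "r10": i["i6"],   # stage 5
    }

def o1(r):
    return r["r9"] + r["r10"]    # combinational output

# One parcel: i1 and i2 arrive first, then i3..i6 on the following cycles.
regs = {k: 0 for k in ["r1","r2","r3","r4","r5","r6","r7","r8","r9","r10"]}
zero = {k: 0 for k in ["i1","i2","i3","i4","i5","i6"]}
for ins in [dict(zero, i1=1, i2=2), dict(zero, i3=3),
            dict(zero, i4=4), dict(zero, i5=5), dict(zero, i6=6)]:
    regs = tick(regs, ins)
print(o1(regs))   # 1+2+3+4+5+6 = 21
```

Building the next-state dictionary from the old one in a single expression is what keeps the model faithful to the hardware: there is no ordering among the register updates within a clock edge.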
2.9.2 Partially Pipelined

The previous section illustrated a fully pipelined circuit, which means that the circuit can accept a new parcel every clock cycle. Sometimes we want to sacrifice performance (throughput) in order to reduce area. We can do this by having a throughput that is less than one parcel per clock cycle and reusing some hardware. A pipeline that has a throughput of less than one is said to be partially pipelined.

If a pipeline is essentially two pipelines running in parallel, then it is said to be superscalar and will usually have a throughput that is more than one parcel per clock cycle. A superscalar pipeline that has n pipelines in parallel is said to be n-way superscalar and has a maximum throughput of n parcels per clock cycle.

[Figure: partially pipelined dataflow diagram with three stages, where add1 is reused within stage 1 and add2 within stage 2; the waveform shows parcels α and β each spending two clock cycles per stage]

Hardware for Partially Pipelined .....................................................
[Figure: partially pipelined hardware with inputs i1 and i2, reset, a state machine with outputs State(0) and State(1), registers r1 through r6, adders add1 through add3, and output o1]

2.9.3 Terminology

Definition Depth: The depth of a pipeline is the number of stages on the longest path through the pipeline.

Definition Latency: The latency of a pipeline is measured the same way as for an unpipelined circuit: the number of clock cycles from inputs to outputs.

Definition Throughput: The throughput of a pipeline is the number of parcels consumed or produced per clock cycle.

Definition Upstream/downstream: Because parcels flow through the pipeline analogously to water in a stream, the terms upstream and downstream are used to refer to earlier and later stages in the pipeline, respectively. For example, stage 1 is upstream from stage 2.

Definition Bubble: When a pipe stage is empty (contains invalid data), it is said to contain a "bubble".
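These definitions give simple cycle-count formulas for a stream of parcels. The sketch below (hypothetical helper names, not from the notes) assumes a throughput of one and no stalls:

```python
def cycles_unpipelined(parcels, latency):
    # each parcel must finish (latency cycles) before the next may begin
    return parcels * latency

def cycles_pipelined(parcels, depth):
    # the first parcel emerges after `depth` cycles; with throughput one,
    # each later parcel emerges one cycle after its predecessor
    return depth + (parcels - 1)

# the five-cycle example from section 2.9.1, with two parcels (alpha, beta)
print(cycles_unpipelined(2, 5))  # 10
print(cycles_pipelined(2, 5))    # 6
```

This is why pipelining pays off on long streams: as the number of parcels grows, the pipelined cycle count approaches one cycle per parcel, regardless of depth.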
Question: How do we know whether the output of the pipeline is a bubble or is valid data?

Answer: Add one register per stage to hold a valid bit. If valid = '0', then the pipe stage contains a bubble.

2.10 Design Example: Pipelined Massey

In this section, we revisit the Massey example from section 2.7, but now with a pipelined implementation. To allow us to implement a pipelined design, we need to relax our resource requirements. Originally, we were allowed two adders and three inputs. For the pipeline, we will allow ourselves six inputs and five adders. There are six input values and five additions in the dataflow diagram, so these requirements will enable us to build a fully pipelined implementation. If we were forced to reuse a component (e.g., a maximum of two adders), then we would need to build a partially pipelined circuit. To stay within the normal design rules for pipelines, we will register our inputs but not our outputs. In summary, the requirements are:

Requirements .........................................................................

Functional requirements:
  • Compute the sum of six 8-bit numbers: output = a + b + c + d + e + f
  • Registered inputs, combinational outputs

Performance requirements:
  • Maximum clock period: unlimited
  • Maximum latency: four

Cost requirements:
  • Maximum of five adders
  • Small miscellaneous hardware (e.g. muxes) is unlimited
  • Maximum of six inputs and one output
  • Design effort is unlimited
Initial Dataflow Diagrams ............................................................

Our goal is to first maximize performance and then minimize area within the bounds of the requirements. To maximize performance, we want a throughput of one and a minimum clock period. Revisiting the dataflow diagrams from the unpipelined Massey, we find the two diagrams below as promising candidates for the pipelined Massey.

[Figure: the two candidate dataflow diagrams: the original dataflow, and the final unpipelined dataflow from section 2.7]

For the unpipelined design, we rejected the original dataflow diagram because it violated the resource requirement of a maximum of three inputs. If we fully pipeline the design, both dataflow diagrams will use six inputs and five adders. The first diagram uses ten registers, while the second uses eight (remember, there is no reuse of components in a fully pipelined design). However, the first dataflow diagram has a shorter clock period, and so will lead to higher performance. Because our primary goal is to maximize performance, we will pursue the first dataflow diagram.

Dataflow Diagram Exploration ........................................................

As a variation of the first dataflow diagram, we reschedule all of the inputs to be read in the first clock cycle.
[Figure: variation on the original dataflow diagram, with all six inputs a through f read in the first clock cycle]

The variation has the disadvantage of using one additional register. However, it has the potential advantage of a simpler interface to the upstream environment, because all of the inputs are provided at the same time. Conversely, this rescheduling would be a disadvantage if the upstream environment was optimized to take advantage of the fact that e and f are produced one clock cycle later than the other values. We do not know anything about the upstream environment, and so will reject this variation, because it increases the number of registers that we need.

As we said before, to maximize performance, we will fully pipeline the design, so every clock cycle boundary becomes a stage boundary. At this time, we also add a valid bit to keep track of whether a stage has a bubble or a valid parcel.

[Figure: pipelined dataflow diagram with double-line stage boundaries, including i_valid flowing through the stages to o_valid]

VHDL Code ..........................................................................

For this simple example, there are no further optimizations, and we can write the VHDL code directly from the dataflow diagram.
-- stage 1
process begin
  wait until rising_edge(clk);
  r1 <= i1;
  r2 <= i2;
  r3 <= i3;
  r4 <= i4;
  v1 <= i_valid;
end process;
a1 <= r1 + r2;
a2 <= r3 + r4;

-- stage 2
process begin
  wait until rising_edge(clk);
  r5 <= a1;
  r6 <= a2;
  r7 <= i5;
  r8 <= i6;
  v2 <= v1;
end process;
a3 <= r5 + r6;
a4 <= r7 + r8;

-- stage 3
process begin
  wait until rising_edge(clk);
  r9 <= a3;
  r10 <= a4;
  v3 <= v2;
end process;
a5 <= r9 + r10;

-- outputs
z <= a5;
o_valid <= v3;
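As a sanity check on the dataflow, the circuit can be simulated abstractly in Python. This is an illustrative sketch, not course material; each clocked process becomes a register update that reads only pre-edge values:

```python
REGS = ["r1","r2","r3","r4","r5","r6","r7","r8","r9","r10","v1","v2","v3"]

def clock(r, i):
    """One rising edge of the pipelined Massey circuit."""
    nxt = dict(r)
    # stage 1: register a,b,c,d (i1..i4) and the valid bit
    nxt["r1"], nxt["r2"], nxt["r3"], nxt["r4"] = i["i1"], i["i2"], i["i3"], i["i4"]
    nxt["v1"] = i["i_valid"]
    # stage 2: register the stage-1 sums (a1, a2) and the late inputs e,f
    nxt["r5"] = r["r1"] + r["r2"]         # a1
    nxt["r6"] = r["r3"] + r["r4"]         # a2
    nxt["r7"], nxt["r8"] = i["i5"], i["i6"]
    nxt["v2"] = r["v1"]
    # stage 3: register the stage-2 sums (a3, a4)
    nxt["r9"]  = r["r5"] + r["r6"]        # a3
    nxt["r10"] = r["r7"] + r["r8"]        # a4
    nxt["v3"] = r["v2"]
    return nxt

def outputs(r):
    # combinational outputs: z = a5 = r9 + r10, o_valid = v3
    return {"z": r["r9"] + r["r10"], "o_valid": r["v3"]}

# one parcel: a..d = 1,2,3,4 in the first cycle; e,f = 5,6 one cycle later
regs = {k: 0 for k in REGS}
zero = {k: 0 for k in ["i1","i2","i3","i4","i5","i6","i_valid"]}
regs = clock(regs, dict(zero, i1=1, i2=2, i3=3, i4=4, i_valid=1))
regs = clock(regs, dict(zero, i5=5, i6=6))
regs = clock(regs, zero)
print(outputs(regs))   # z = 21 with o_valid = 1 after three clock edges
```

The valid bit rides along with the parcel through v1, v2, and v3, so o_valid rises exactly when the parcel's sum reaches the combinational output.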
2.11 Memory Arrays and RTL Design

2.11.1 Memory Operations

Read of Memory with Registered Inputs ...............................................

[Figure: behaviour waveform and hardware for a memory read with registered inputs: the address a is registered, then the data M(αa) appears on the data-out signal do]

Write to Memory with Registered Inputs ...............................................

[Figure: behaviour waveform and hardware for a memory write with registered inputs: the write-enable we, address a, and data-in di are registered, then M(αa) is updated to αd]

Dual-Port Memory with Registered Inputs .............................................

[Figure: behaviour waveform and hardware for a dual-port memory with registered inputs: port 0 writes via a0 and di0 while port 1 reads via a1 and do1]
Sequence of Memory Operations .......................................................

[Figure: waveform for a sequence of memory operations on a dual-port memory, with writes on port 0 and reads on port 1 over several clock cycles]

2.11.2 Memory Arrays in VHDL

2.11.2.1 Using a Two-Dimensional Array for Memory

A memory array can be written in VHDL as a two-dimensional array:

subtype data is std_logic_vector(7 downto 0);
type data_vector is array( natural range <> ) of data;
signal mem : data_vector(31 downto 0);

These two-dimensional arrays can be useful in high-level models and in specifications. However, it is possible to write code using a two-dimensional array that cannot be synthesized. Also, some synthesis tools (including Synopsys Design Compiler and FPGA Compiler) will synthesize two-dimensional arrays very inefficiently.

The example below illustrates: lack of interface protocol, combinational write, multiple write ports, and multiple read ports.
architecture main of mem_not_hw is
  subtype data is std_logic_vector(7 downto 0);
  type data_vector is array( natural range <> ) of data;
  signal mem : data_vector(31 downto 0);
begin
  y <= mem( a );      -- comb read
  mem( a ) <= b;      -- comb write
  process (clk) begin
    if rising_edge(clk) then
      mem( c ) <= w;  -- write port #1
    end if;
  end process;
  process (clk) begin
    if rising_edge(clk) then
      mem( d ) <= v;  -- write port #2
    end if;
  end process;
  u <= mem( e );      -- read port #2
end main;

2.11.2.2 Memory Arrays in Hardware

Most simple memory arrays are single- or dual-ported, support just one write operation at a time, and have an interface protocol using a clock and a write-enable.

[Figure: interfaces for a single-ported memory (WE, A, DI, DO) and a dual-ported memory (WE, A0, DI0, DO0, A1, DO1)]
2.11.2.3 VHDL Code for Single-Port Memory Array

package mem_pkg is
  subtype data is std_logic_vector(7 downto 0);
  type data_vector is array( natural range <> ) of data;
end;

entity mem is
  port (
    clk : in  std_logic;
    we  : in  std_logic;            -- write enable
    a   : in  unsigned(4 downto 0); -- address
    di  : in  data;                 -- data in
    do  : out data                  -- data out
  );
end mem;

architecture main of mem is
  signal mem : data_vector(31 downto 0);
begin
  do <= mem( to_integer( a ) );
  process (clk) begin
    if rising_edge(clk) then
      if we = '1' then
        mem( to_integer( a ) ) <= di;
      end if;
    end if;
  end process;
end main;

The VHDL code above is accurate in its behaviour and interface, but might be synthesized as distributed memory (a large number of flip-flops in FPGA cells), which will be very large and very slow in comparison to a block of memory. Synopsys synthesis tools implement each bit in a two-dimensional array as a flip-flop. Each FPGA and ASIC vendor supplies libraries of memory arrays that are smaller and faster than a two-dimensional array of flip-flops. These libraries exploit specialized hardware on the chips to implement the memory.

Note: To synthesize a reasonable implementation of a memory array with Synopsys, you must instantiate a vendor-supplied memory component. Some other synthesis tools, such as Xilinx XST, can infer memory arrays from two-dimensional arrays and synthesize efficient implementations.
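The behaviour of this memory, combinational read and synchronous write, can be captured in a few lines of Python. This is a sketch for building intuition, not course code:

```python
class SinglePortMem:
    """Combinational read, synchronous write: models the VHDL entity above."""
    def __init__(self, n=32):
        self.mem = [0] * n

    def do(self, a):
        # do <= mem(to_integer(a)): the read is combinational on the address
        return self.mem[a]

    def clock(self, we, a, di):
        # rising_edge(clk): if we = '1' then mem(a) <= di
        if we:
            self.mem[a] = di

m = SinglePortMem()
m.clock(we=1, a=5, di=0xAB)   # write 0xAB to address 5
m.clock(we=0, a=5, di=0xCD)   # write-enable low: memory is unchanged
print(hex(m.do(5)))
```

The second `clock` call illustrates the role of the write-enable: with we deasserted, the clock edge has no effect on the stored data.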
Recommended Design Process with Memory .............................................

1. high-level model with two-dimensional array
2. two-dimensional array packaged inside memory entity/architecture
3. vendor-supplied component

2.11.2.4 Using Library Components for Memory

Altera ...............................................................................

Altera uses "MegaFunctions" to implement RAM in VHDL. A MegaFunction is a black-box description of hardware on the FPGA. There are tools in Quartus to generate VHDL code for RAM components of different sizes. In E&CE 327 we will provide you with the VHDL code for the RAM components that you will need in Lab-3 and the Project.

The APEX20KE chips that we are using have dedicated SRAM blocks called Embedded System Blocks (ESB). Each ESB can store 2048 bits and can be configured in any of the following sizes:

    Number of Elements    Word Size (bits)
          2048                   1
          1024                   2
           512                   4
           256                   8
           128                  16

Xilinx ...............................................................................

Use component instantiation to get these components:

    ram16x1s    16 × 1 single-ported memory
    ram16x1d    16 × 1 dual-ported memory

Other sizes are also available; consult the datasheet for your chip.
2.11.2.5 Build Memory from Slices

If the vendor's libraries of memory components do not include one that is the correct size for your needs, you can construct your own component from smaller ones.

[Figure 2.7: An N×2W memory from N×W components. Two N×W memories share WriteEn, Addr, and Clk; one stores DataIn[W-1..0] and drives DataOut[W-1..0], the other stores DataIn[2W-1..W] and drives DataOut[2W-1..W]]

[Figure 2.8: A 2N×W memory from N×W components. Two N×W memories share DataIn and the low address bits Addr[logN-1..0]; the high address bit Addr[logN] selects which memory is written and which memory's output drives DataOut through a multiplexer]
A 16×4 Memory from 16×1 Components .................................................

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity ram16x4s is
  port (
    clk, we  : in  std_logic;
    data_in  : in  std_logic_vector(3 downto 0);
    addr     : in  unsigned(3 downto 0);
    data_out : out std_logic_vector(3 downto 0)
  );
end ram16x4s;

architecture main of ram16x4s is
  component ram16x1s
    port ( d              : in  std_logic;  -- data in
           a3, a2, a1, a0 : in  std_logic;  -- address
           we             : in  std_logic;  -- write enable
           wclk           : in  std_logic;  -- write clock
           o              : out std_logic   -- data out
         );
  end component;
begin
  mem_gen: for i in 0 to 3 generate
    ram : ram16x1s
      port map ( we => we, wclk => clk,
                 a3 => addr(3), a2 => addr(2),
                 a1 => addr(1), a0 => addr(0),
                 ----------------------------------------------
                 -- d and o are dependent on i
                 d  => data_in(i),
                 o  => data_out(i)
                 ----------------------------------------------
               );
  end generate;
end main;
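The generate loop's wiring can be checked with a small Python model: four 1-bit slices in parallel should behave as one 16×4 memory. The class names here are illustrative, not from any vendor library:

```python
class Ram16x1:
    """A 16x1 memory slice with synchronous write and combinational read."""
    def __init__(self):
        self.bits = [0] * 16
    def clock(self, we, addr, d):
        if we:
            self.bits[addr] = d
    def o(self, addr):
        return self.bits[addr]

class Ram16x4:
    """Four 1-bit slices in parallel: slice i stores bit i of each word,
    mirroring the generate loop (d => data_in(i), o => data_out(i))."""
    def __init__(self):
        self.slices = [Ram16x1() for _ in range(4)]
    def clock(self, we, addr, data_in):
        for i, s in enumerate(self.slices):
            s.clock(we, addr, (data_in >> i) & 1)
    def data_out(self, addr):
        return sum(s.o(addr) << i for i, s in enumerate(self.slices))

ram = Ram16x4()
ram.clock(we=1, addr=7, data_in=0b1010)
print(ram.data_out(7))   # 10
```

All four slices share we, addr, and the clock; only the data bits are split, exactly as in the VHDL port map.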
2.11.2.6 Dual-Ported Memory

Dual-ported memory is similar to single-ported memory, except that it allows two simultaneous reads, or a simultaneous read and write. When doing a simultaneous read and write to the same address, the read will usually not see the data currently being written.

Question: Why do dual-ported memories usually not support writes on both ports?

Answer: What should your memory do if you write different values to the same address in the same clock cycle?

2.11.3 Data Dependencies

Definition of Three Types of Dependencies ............................................

There are three types of data dependencies. The names come from pipeline terminology in computer architecture.

    M[i] := ...          M[i] := ...          ...  := M[i]
    ...  := M[i]         M[i] := ...          M[i] := ...

    Read after Write     Write after Write    Write after Read
    (True dependency)    (Load dependency)    (Anti dependency)

Instructions in a program can be reordered, so long as the data dependencies are preserved.
Purpose of Dependencies ..............................................................

    W0:  R3 := ...              WAW ordering prevents W0 from happening after W1
    W1:  R3 := ...              (producer)
    R1:  ... := ... R3 ...      RAW ordering prevents R1 from happening before W1
                                (consumer)
    W2:  R3 := ...              WAR ordering prevents W2 from happening before R1

Each of the three types of memory dependencies (RAW, WAW, and WAR) serves a specific purpose in ensuring that producer-consumer relationships are preserved.

Ordering of Memory Operations .......................................................

Initial memory values: M[0] = 0, M[1] = 10, M[2] = 20, M[3] = 30.

    Initial Program    Valid Modification    Valid (or Bad?) Modification
    M[2] := 21         M[2] := 21            M[2] := 21
    M[3] := 31         B := M[0]             B := M[0]
    A := M[2]          A := M[2]             A := M[2]
    B := M[0]          M[3] := 31            M[3] := 31
    M[3] := 32         M[3] := 32            C := M[3]
    M[0] := 01         M[0] := 01            M[3] := 32
    C := M[3]          C := M[3]             M[0] := 01

Answer: Bad modification: M[3] := 32 must happen before C := M[3].
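The three orderings can be checked by simply executing them. In this throwaway Python check (not from the notes), each operation is a tuple: ("w", addr, value) for M[addr] := value, and ("r", var, addr) for var := M[addr]:

```python
def run(program):
    m = {0: 0, 1: 10, 2: 20, 3: 30}   # initial memory values from the example
    env = {}
    for op in program:
        if op[0] == "w":               # ("w", addr, value): M[addr] := value
            m[op[1]] = op[2]
        else:                          # ("r", var, addr):   var := M[addr]
            env[op[1]] = m[op[2]]
    return env

initial = [("w",2,21), ("w",3,31), ("r","A",2), ("r","B",0),
           ("w",3,32), ("w",0,1),  ("r","C",3)]
valid   = [("w",2,21), ("r","B",0), ("r","A",2), ("w",3,31),
           ("w",3,32), ("w",0,1),  ("r","C",3)]
bad     = [("w",2,21), ("r","B",0), ("r","A",2), ("w",3,31),
           ("r","C",3), ("w",3,32), ("w",0,1)]

print(run(initial) == run(valid))   # True: RAW, WAW, and WAR are all preserved
print(run(bad)["C"])                # 31, not 32: C := M[3] ran before M[3] := 32
```

The bad modification breaks the RAW dependency between M[3] := 32 and C := M[3], and the difference shows up directly in the final value of C.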
2.11.4 Memory Arrays and Dataflow Diagrams

Legend for Dataflow Diagrams .........................................................

[Figure: legend showing the symbols for an input port, an output port, a state signal, an array read (name(rd)), and an array write (name(wr))]

Basic Memory Operations ..............................................................

[Figure: dataflow fragments for a memory read (data := mem[addr], using mem(rd), with an anti-dependency arrow producing mem) and a memory write (mem[addr] := data, using mem(wr))]

Dataflow diagrams show the dependencies between operations. The basic memory operations are similar to datapath operations, in that each arrow represents a data dependency. There are a few aspects of the basic memory operations that are potentially surprising:

  • The anti-dependency arrow producing mem on a read.
  • Reads and writes are dependent upon the entire previous value of the memory array.
  • The write operation appears to produce an entire memory array, rather than just updating an individual element of an existing array.

Normally, we think of a memory array as stationary. To do a read, an address is given to the array and the corresponding data is produced. In dataflow diagrams, it may be somewhat surprising to see the read and write operations consuming and producing memory arrays.

Our goal is to support memory operations in dataflow diagrams. We want to model memory operations similarly to datapath operations. When we do a read, the data that is produced is dependent upon the contents of the memory array and the address. For write operations, the apparent dependency on, and production of, an entire memory array is because we do not know which address in the array will be read from or written to. The anti-dependency for memory reads is related to Write-after-Read dependencies, as discussed in Section 2.11.3. There are optimizations that can be performed when we know the address (Section 2.11.4).
Dataflow Diagrams and Data Dependencies ............................................

[Figure: Read after Write. Dataflow diagrams for "mem[wr_addr] := data_in; data_out := mem[rd_addr]": the general case, where mem(rd) depends on the result of mem(wr), and the optimization when rd_addr = wr_addr]

[Figure: Write after Write. Dataflow diagram for "mem[wr1_addr] := data1; mem[wr2_addr] := data2", where the second mem(wr) depends on the result of the first]
[Figure: scheduling option when wr1_addr = wr2_addr for the Write-after-Write example "mem[wr1_addr] := data1; mem[wr2_addr] := data2"]

[Figure: Write after Read. Dataflow diagrams for "rd_data := mem[rd_addr]; mem[wr_addr] := wr_data": the general case, where mem(wr) is ordered after mem(rd) by the anti-dependency, and the optimization when rd_addr = wr_addr]
2.11.5 Example: Memory Array and Dataflow Diagram

[Figure 2.9: Memory array example code and initial dataflow diagram. The code fragment is:
    1  M[2] := 21
    2  M[3] := 31
    3  A := M[2]
    4  B := M[0]
    5  M[3] := 32
    6  M[0] := 01
    7  C := M[3]
]

The dependency and anti-dependency arrows in the dataflow diagram in Figure 2.9 are based solely upon whether an operation is a read or a write. The arrows do not take into account the address that is read from or written to.

In Figure 2.10, we have used knowledge about which addresses we are accessing to remove unneeded dependencies. These are the real dependencies and match those shown in the code fragment for Figure 2.9.

In Figure 2.11, we have placed an ordering on the read operations and an ordering on the write operations. The ordering is derived by obeying data dependencies and then rearranging the operations to perform as many operations in parallel as possible.
[Figure 2.10: Memory array with minimal dependencies]

[Figure 2.11: Memory array with orderings]

[Figure 2.12: Final version of Figure 2.9, with the operations scheduled into clock cycles]

Put as many parallel operations into the same clock cycle as allowed by the resources. Preserve dependencies by putting dependent operations in separate clock cycles.
2.12 Input / Output Protocols

An important aspect of hardware design is choosing an input/output protocol that is easy to implement and suits both your circuit and your environment. Here are a few simple and common protocols.

[Figure 2.13: Four-phase handshaking protocol, with signals rdy, data, and ack]

Used when the timing of communication between producer and consumer is unpredictable. The disadvantage is that it is cumbersome to implement and slow to execute.

[Figure 2.14: Valid-bit protocol, with signals clk, valid, and data]

A low-overhead (both in area and performance) protocol. The consumer must always be able to accept incoming data. Often used in pipelined circuits. More complicated versions of the protocol can handle pipeline stalls.

[Figure 2.15: Start/Done protocol, with signals clk, start, data_in, done, and data_out]

A low-overhead (both in area and performance) protocol. Useful when a circuit works on one piece of data at a time and the time to compute the result is unpredictable.
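The four-phase handshake of Figure 2.13 can be sketched as a tiny simulation (hypothetical code, one signal transition per step), which makes the "slow to execute" point concrete: every transfer costs four signal transitions:

```python
def four_phase(n_transfers):
    """Step the rdy/ack handshake; return the sequence of signal transitions."""
    rdy = ack = False
    events, transfers = [], 0
    while transfers < n_transfers:
        if not rdy and not ack:
            rdy = True;  events.append("rdy+")   # 1. producer offers data
        elif rdy and not ack:
            ack = True;  events.append("ack+")   # 2. consumer captures data
        elif rdy and ack:
            rdy = False; events.append("rdy-")   # 3. producer withdraws rdy
        else:
            ack = False; events.append("ack-")   # 4. consumer is ready again
            transfers += 1
    return events

print(four_phase(1))        # ['rdy+', 'ack+', 'rdy-', 'ack-']
print(len(four_phase(3)))   # 12: four transitions per transfer
```

Compare this with the valid-bit protocol, where a transfer costs nothing beyond the data itself: the producer simply asserts valid for one cycle.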
2.13 Example: Moving Average

In this section we will design a circuit that performs a moving average as it receives a stream of data. When each new data item is received, the output is the average of the four most recently received data.

    Time    0  1  2  3  4  5  6  7  8  9  10
    i_data  2  3  5  6  6  0  2  2  5  3  1
    o_avg            4  5  4  3

2.13.1 Requirements and Environmental Assumptions

1. Input data is sent sporadically, with at least 2 clock cycles of bubbles (invalid data) between valid data.
2. When the input data is valid, the signal i_valid is asserted for exactly one clock cycle.
3. Input data will be 8-bit signed numbers.
4. When output data is ready, o_valid shall be asserted.
5. The output data (o_avg) shall be the average of the four most recently received input data. Output numbers shall be truncated to integer values.

2.13.2 Algorithm

We begin by exploring the mathematical behaviour of the system. To simplify the analysis at this abstract level, we ignore bubbles and time. We focus only on the valid data. If we have an input stream of data x_i (i.e., x_i is the value of the ith valid data on i_data), the equation for the output is:

    avg_i = (x_{i-3} + x_{i-2} + x_{i-1} + x_i) / 4

To simplify our analysis of the equation, we decompose the computation into computing the sum of the four most recent data and dividing the sum by four:

    sum_i = x_{i-3} + x_{i-2} + x_{i-1} + x_i
    avg_i = sum_i / 4

We look at the equation of sum over several iterations to try to identify patterns that we can use to optimize our design:
    sum_5 = x_2 + x_3 + x_4 + x_5
    sum_6 = x_3 + x_4 + x_5 + x_6
    sum_7 = x_4 + x_5 + x_6 + x_7

We see that part of the calculations that are done for index i are the same as those for i + 1:

    sum_5 = x_2 + (x_3 + x_4 + x_5)
    sum_6 = (x_3 + x_4 + x_5) + x_6
          = sum_5 - x_2 + x_6

We check a few more samples and conclude that we can generalize the above for index i as:

    sum_i = sum_{i-1} - x_{i-4} + x_i
    avg_i = sum_i / 4

The equation for sum_i is dependent on x_i and x_{i-4}; therefore we need the current input value and we need to store the four most recent input data. These four most recent data form a sliding window: each time we receive valid data, we remove the oldest data value (x_{i-4}) and insert the new data (x_i).

Summary of system behaviour deduced from exploring requirements and algorithm:

1. Define a signal new for the value of i_data each time that i_valid is '1'.
2. Define a memory array M to store a sliding window of the four most recent values of i_data.
3. Define a signal old for the oldest data value from the sliding window.
4. Update sum_i with sum_{i-1} - old_i + new_i.

Sliding Window .......................................................................

There are two principal ways to implement a sliding window:

shift-register: Each time new data is loaded, every register is loaded with the data from its neighbouring register, and the register at one end is loaded with the new data: R[0] = new and R[i] = R[i-1].

circular buffer: Once a data value is loaded into the buffer, the data remains in the same location until it is overwritten. When a new value is loaded, the new value overwrites the oldest value in the buffer. None of the other elements in the buffer change. A state machine keeps track of the position (address) of the oldest piece of data. The state machine increments to point to the next register, which now holds the oldest piece of data.
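The recurrence sum_i = sum_{i-1} - x_{i-4} + x_i can be checked against the direct four-term sum on the example stream from the table in section 2.13. This is a quick Python sanity check, not course code; Python's `//` matches the required truncation only for the nonnegative data used here:

```python
def sums_direct(x):
    # sum_i = x_{i-3} + x_{i-2} + x_{i-1} + x_i
    return [x[i-3] + x[i-2] + x[i-1] + x[i] for i in range(3, len(x))]

def sums_incremental(x):
    # sum_i = sum_{i-1} - x_{i-4} + x_i
    s = [sum(x[0:4])]
    for i in range(4, len(x)):
        s.append(s[-1] - x[i-4] + x[i])
    return s

x = [2, 3, 5, 6, 6, 0, 2, 2, 5, 3, 1]           # the i_data stream
assert sums_direct(x) == sums_incremental(x)     # the two formulations agree
print([s // 4 for s in sums_direct(x)][:4])      # [4, 5, 4, 3], matching o_avg
```

The incremental form replaces three additions per output with one subtraction and one addition, which is the whole motivation for the sliding-window storage.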
[Figure: contents of the sliding window over time for the shift-register implementation (old, M[3..0], new) and the circular-buffer implementation (old, M[0..3], new), as the data values α, β, γ, δ, ε, ζ, η, ι, κ, λ arrive]

The circular-buffer design is usually preferable, because only one element changes value per clock cycle. This allows the buffer to be implemented with a memory array rather than a set of registers. Also, by having only one element change value, power consumption is reduced (fewer capacitors charging and discharging).

We have only four items to store, so we will use registers, rather than a memory array. For fewer than sixteen items, registers are generally cheaper. For sixteen items, the choice between registers and a memory array is highly dependent on the design goals (e.g. speed vs area) and implementation technology.

Now that we have designed the storage module, we see that rather than a write-enable and address signal, the actual signals we need are four chip-enable signals. This suggests that we should use a one-hot encoding for the index of the oldest element in the circular buffer.

Because we have a one-hot encoding for the index, we do not use normal multiplexers to select which register to read from. Normal multiplexers take a binary-encoded select signal. Instead, we will use a 4:1 decoded mux, which is just four AND gates followed by a 4-input OR gate. Because the data is 8 bits wide, each of the AND gates and the OR gate is 8 bits wide.
[Figure: register array M[0] through M[3] with chip-enables ce[0..3] driven by idx[0..3], and an 8-bit-wide decoded multiplexer producing the output q]

2.13.3 Pseudocode and Dataflow Diagrams

There are three different notations that we use to describe the behaviour of hardware systems abstractly: mathematical equations (for datapath-centric designs), state machines (for control-dominated designs), and pseudocode (for algorithms or designs with memory).

Our pseudocode is similar to three-address assembly code: each line of code has a target variable, an operation, and one or two operand variables (e.g., C = A + B). The name "three address" comes from the fact that there are three addresses, or variables, in each line. We use the three-address style of pseudocode because each line of pseudocode then corresponds to a single datapath operation in the dataflow diagram. This gives us greater flexibility to optimize the pseudocode by rescheduling operations. From the three-address pseudocode, we will construct dataflow diagrams.

As an aside, in contrast to three-address languages, some assembly languages for extremely small processors are limited to two addresses. The target must be the same as one of the operands (e.g., A = A + B).
First Pseudocode ......................................................................

For the first pseudocode, we do not restrict ourselves to three addresses. In the second version of the code, we decompose the first line into two separate lines that obey the three-address restriction.

Pseudo pseudocode:

    new    = i_data
    old    = M[idx]
    sum    = sum - old + new
    M[idx] = new
    idx    = idx rol 1
    o_avg  = sum/4

Real 3-address pseudocode:

    new    = i_data
    old    = M[idx]
    tmp    = sum - old
    sum    = tmp + new
    M[idx] = new
    idx    = idx rol 1
    o_avg  = sum/4

Data-Dependency Graph .................................................................

To begin to understand what the hardware might be, we draw a data-dependency graph for the pseudocode.

[Figure: data-dependency graph. sum, M, idx, and i_data enter at the top; a read of M produces old, the subtraction computes tmp = sum - old, the addition computes sum = tmp + new; new is written back to M, idx rotates by 1, and o_avg = sum/4 is a wired shift.]

Optimizing the Data-Dependency Graph ..................................................

In our design work so far, we have ignored bubbles and time. As we evolve from the pseudocode to a data-dependency graph and then to a dataflow graph, we will include the effect of the bubbles in our analysis.
In the data-dependency graph we observe that we have two arithmetic operations: subtraction and addition. The requirements guarantee that there are at least two clock cycles of bubbles between each parcel of valid data, so we have the ability to reuse hardware. In contrast, we would not be able to reuse hardware if either we had to accept new data in each clock cycle or we needed a fully pipelined circuit. If we had to accept new data in each clock cycle, and were not pipelined, then the work would need to be completed in a single clock cycle. If the design were fully pipelined, then each parcel of data would stay in each stage for exactly one clock cycle: there would be no opportunity for a parcel to visit a stage twice, and hence no opportunity for reuse.

For our design, where we are attempting to reuse hardware, we hypothesize that a single adder/subtracter is cheaper than a separate adder and subtracter. We would like to combine the two lines:

    tmp = sum - old
    sum = tmp + new

Looking at the data-dependency graph, we see that old is coming from memory and new is coming from either a register or combinational logic. We cannot allocate new and old to the same hardware, because new and old are not the same type of hardware: old is stored in an array of registers and new is a single register. So, we will need a multiplexer for the second operand, to choose between reading from old and new.

A multiplexer might also be required for the first operand, to choose between sum and tmp. But both of these are regular signals, so we might be able to allocate both sum and tmp to the same register or datapath output, and hence avoid a multiplexer for the first operand. We will decide how to deal with the first operand when we do register and datapath allocation.

We remove the need for a multiplexer for the second operand by reading new from memory.
To accomplish this, we re-write the pseudocode so that we first write i_data to memory, and then read new from memory. The three versions of the pseudocode below show the transformations. The data-dependency graph is for the third version of the pseudocode.
Remove intermediate signal old:

    new    = i_data
    tmp    = sum - M[idx]
    sum    = tmp + new
    M[idx] = new
    idx    = idx rol 1
    o_avg  = sum/4

Optimize code by reading new from memory:

    tmp    = sum - M[idx]
    M[idx] = i_data
    new    = M[idx]
    sum    = tmp + new
    idx    = idx rol 1
    o_avg  = sum/4

Remove intermediate signal new:

    tmp    = sum - M[idx]
    M[idx] = i_data
    sum    = tmp + M[idx]
    idx    = idx rol 1
    o_avg  = sum/4

[Figure: data-dependency graph after removing new. The first read of M feeds the subtracter; the write of i_data precedes the second read, which feeds the adder; idx rotates by 1 and o_avg = sum/4 is a wired shift.]

Dataflow Diagram ......................................................................

To construct a dataflow diagram, we divide the data-dependency graph into clock cycles. Because we are using registers rather than a memory array, we can schedule the first read and first write operations in the same clock cycle, even though they use the same address. In contrast, with memory arrays it generally is risky to rely on the value of the output data in the clock cycle in which we are doing a write (Section 2.11.1). We need a second clock cycle for the second read from memory.

We now explore two options: with and without a third clock cycle; both are shown below. The difference between the two options is whether the signals idx and sum refer to the outputs of registers or of the combinational datapath units (sum being the output of the adder/subtracter and idx being the output of a rotation). With a latency of three clock cycles, idx is a registered signal. With a latency of two clock cycles, idx and sum are combinational.

It is a bit misleading to describe the rotate-left unit for idx as combinational, because it is simply a wire connecting one flip-flop to another. However, conceptually and for correct behaviour, it is helpful to think of the rotation unit as a block of combinational circuitry. This allows us to distinguish between the output of the idx register and the input to the register (which is the output of the rotation unit).
Without this distinction, we might read the wrong value of idx and be out-of-sync by one clock cycle.
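Returning to the pseudocode transformations above, their equivalence is easy to check in software. The sketch below (helper names are ours) runs the original pseudocode and the final version, which reads new back from memory, over the same sample stream and confirms they produce identical averages.

```python
# Compare the original pseudocode with the final version that reads "new"
# back from memory after writing it. Both should produce the same averages.

def run_original(samples, n=4):
    M, idx, sum_, out = [0] * n, 0, 0, []
    for i_data in samples:
        new = i_data
        old = M[idx]
        tmp = sum_ - old
        sum_ = tmp + new
        M[idx] = new
        idx = (idx + 1) % n         # idx rol 1
        out.append(sum_ / n)        # o_avg = sum/4
    return out

def run_final(samples, n=4):
    M, idx, sum_, out = [0] * n, 0, 0, []
    for i_data in samples:
        tmp = sum_ - M[idx]         # read old ...
        M[idx] = i_data             # ... write i_data over it ...
        sum_ = tmp + M[idx]         # ... then read "new" back from memory
        idx = (idx + 1) % n
        out.append(sum_ / n)
    return out

stream = [4, 8, 12, 16, 20, 24]
assert run_original(stream) == run_final(stream)
```

Eliminating old and new changes only which storage element each value is read from, not the arithmetic, which is why the transformation is safe.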
[Figure: two candidate dataflow diagrams, one with a latency of three clock cycles (states S0, S1, S2) and one with a latency of two (states S0, S1). In both, M is written and first read in S0 and read again in S1.]

From a performance point of view, a latency of two is somewhat preferable. By keeping our latency low, there may be another module that will benefit from having an additional clock cycle in which to do its work. The counter-argument is that we have two clock cycles of bubbles, which means that we can tolerate a latency of up to three without a need to pipeline. We'll be efficient engineers and try to achieve a latency of two.

The two dataflow diagrams appear to be very similar, but in the dataflow diagram with a latency of two, a multiplexer will be needed for the address signal of the circular buffer. In S0, the address input to the circular buffer is the output of the rotator. In S1, the address is the output of a register. To eliminate the need for a multiplexer on the address input to the circular buffer, we move the rotation from S0 to S1, so that the address is always a registered signal.
[Figure: dataflow diagram with a latency of two clock cycles and a registered address. S0 performs the write and first read of M; S1 performs the second read and the rotation of idx; the wired shift produces o_avg.]

Register and Datapath Allocation ......................................................

Register allocation is simple: on the first clock-cycle boundary, idx and sum are each allocated to registers with the same names. For the second boundary, we similarly allocate idx to the register idx. This leaves us with the register sum for the output of the adder/subtracter.

Datapath Allocation ...................................................................

Datapath allocation is even simpler than register allocation: we have one adder/subtracter (as1) and one rotate-left (rol).
[Figure: dataflow diagram annotated with the register and datapath allocation: the registers idx and sum, the adder/subtracter as1, and the rotate-left rol.]

2.13.4 Control Tables and State Machine

From the dataflow diagram, we construct a control table. For the memory (M) we need write-enable, address, and data-input columns. For the registers (idx, sum) we need chip-enable and data-input columns. For the datapath components we need data inputs, plus a control signal to determine whether as1 does addition or subtraction. We name the signal as1.sub, where a value of true means do a subtraction and false means do an addition. Note that the subtraction must happen in S0, while the read of M still returns the old value; the addition happens in S1, after the write of i_data has taken effect.

We proceed in two steps, first ignoring bubbles, then extending our design to handle bubbles.

Register control table:

              M                  idx          sum
         we   addr   d        ce    d       ce    d
    S0    1   idx    i_data    0    -        1    as1
    S1    0   idx    -         1    rol      1    as1

Datapath control table:

              as1                rol
         sub  src1  src2     src1  src2
    S0    1   sum   M         -     -
    S1    0   sum   M        idx    1

Optimized control table:

           M     idx    as1
           we    ce     sub
    S0      1     0      1
    S1      0     1      0

Static assignments in control table:

    M.addr   = idx
    M.d      = i_data
    idx.d    = rol
    sum.d    = as1
    as1.src1 = sum
    as1.src2 = M
Control Table and Bubbles .............................................................

If the circuit always had valid parcels arriving in every other clock cycle, then we could proceed directly from our dataflow diagram and optimized control table to VHDL code. However, the indeterminate number of bubbles complicates the design of our state machine.

We add an idle mode to our state machine. The circuit is in idle mode when there is not a valid parcel in the circuit. By "idle", we mean that all write-enable signals are turned off, all chip-enable signals are turned off, and the state machine does not change state. The state machine must resume in state S0 when i_valid becomes true.

In the optimized control table, sum does not need a chip enable, but with the addition of idle mode, we will need to use a chip enable with sum. The multiplexers for the datapath components are unaffected by the addition of idle mode. When the circuit is in idle mode, the registers do not load new data, and so the behaviour of the datapath components is unconstrained. The final control table is below.

Almost final control table:

           M     idx    sum    as1
           we    ce     ce     sub
    S0      1     0      1      1
    S1      0     1      1      0
    idle    0     0      0      -

Static assignments:

    M.addr   = idx
    M.d      = i_data
    idx.d    = rol
    sum.d    = as1
    as1.src1 = sum
    as1.src2 = M

Final control table:

           M     idx    sum    as1
           we    ce     ce     sub
    S0      1     0      1      1
    S1      0     1      1      0
    idle    0     0      0      0

State Machine .........................................................................

The state machine starts in idle, transitions to S0 when i_valid is true, then goes to S1 in the next clock cycle, and then goes back to idle. We will use a modified one-hot encoding and use the valid-bit signals to hold the state. From the dataflow diagram we see that the latency through the circuit is two clock cycles.
We need two valid-bit registers and will have three valid-bit signals: i_valid (input, no register needed), valid1 (register), and o_valid (register). For the state encoding, we will use i_valid and valid1.
State encoding:

            i_valid   valid1
    S0         1         0
    S1         0         1
    idle       0         0

Updating the control table to show the state encoding gives us:

Final control table with state encoding:

               state          M     idx    sum    as1
          i_valid  valid1     we    ce     ce     sub
    S0       1        0        1     0      1      1
    S1       0        1        0     1      1      0
    idle     0        0        0     0      0      0

Using the state encoding and the final control table, we write equations for the write-enable signals, chip-enable signals, and the adder/subtracter control signal:

    M.we    = i_valid
    idx.ce  = valid1
    sum.ce  = i_valid OR valid1
    as1.sub = i_valid
2.13.5 VHDL Code

    -- valid bits
    process begin
      wait until rising_edge(clk);
      valid1  <= i_valid;
      o_valid <= valid1;
    end process;

    -- idx
    process begin
      wait until rising_edge(clk);
      if reset = '1' then
        idx <= "0001";
      else
        if valid1 = '1' then
          idx <= idx rol 1;
        end if;
      end if;
    end process;

    -- sliding window
    process begin
      wait until rising_edge(clk);
      for i in 3 downto 0 loop
        if (i_valid = '1') and (idx(i) = '1') then
          M(i) <= i_data;
        end if;
      end loop;
    end process;

    mem_out <= M(0) when idx(0) = '1' else
               M(1) when idx(1) = '1' else
               M(2) when idx(2) = '1' else
               M(3);

    -- add/sub: subtract the old sample in S0 (i_valid), while mem_out still
    -- holds the value about to be overwritten; add the new sample in S1.
    add_sub <= sum - mem_out when i_valid = '1'
          else sum + mem_out;

    -- sum
    process begin
      wait until rising_edge(clk);
      if i_valid = '1' or valid1 = '1' then
        sum <= add_sub;
      end if;
    end process;

Hardware ..............................................................................
[Figure: final hardware. i_data and i_valid enter at the top; the register array M with chip-enables and the idx register with its wired rotation drive the add/sub unit, which feeds the sum register; the valid1 and o_valid flip-flops hold the state, and the wired shift produces o_avg.]
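The processes above can be mirrored by a small cycle-accurate model in software. This is an illustrative sketch (the function name and stimulus format are ours): signals are updated only at the "clock edge", so a read of M during the write cycle sees the old value, exactly as in the hardware, and the control equations M.we = i_valid, idx.ce = valid1, and sum.ce = i_valid OR valid1 are applied directly.

```python
# Cycle-accurate model of the sliding-window averager with its valid-bit
# state machine. The subtraction uses the old sample (still in memory during
# the write cycle); the addition uses the newly written sample.

def simulate(stimulus, n=4):
    """stimulus: per-cycle list of (i_valid, i_data). Returns the o_avg
    values captured in the cycles where o_valid is 1."""
    M, idx, total = [0] * n, 0, 0
    valid1 = o_valid = 0
    out = []
    for i_valid, i_data in stimulus:
        if o_valid:
            out.append(total / n)       # o_avg = sum / 4
        # combinational logic uses the current (pre-edge) register values
        mem_out = M[idx]
        add_sub = total - mem_out if i_valid else total + mem_out
        # clock edge: registered updates
        if i_valid:
            M[idx] = i_data             # M.we = i_valid
        if i_valid or valid1:
            total = add_sub             # sum.ce = i_valid OR valid1
        if valid1:
            idx = (idx + 1) % n         # idx.ce = valid1
        o_valid, valid1 = valid1, i_valid
    return out

# one parcel every third cycle (two bubbles between parcels)
stim = []
for v in (4, 8, 12, 16):
    stim += [(1, v), (0, 0), (0, 0)]
assert simulate(stim) == [1.0, 3.0, 6.0, 10.0]
```

The assertion checks the running average of the last four samples (with the window initialized to zeros), and the bubbles confirm that idle cycles leave every register unchanged.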
2.14 Design Problems

P2.1 Synthesis

This question is about using VHDL to implement memory structures on FPGAs.

P2.1.1 Data Structures

If you have to write your own code (i.e. you do not have a library of memory components or a special component-generation tool such as LogiBlox or CoreGen), what data structures in VHDL would you use when creating a register file?

P2.1.2 Own Code vs Libraries

When using VHDL for an FPGA, under what circumstances is it better to write your own VHDL code for memory, rather than instantiate memory components from a library?

P2.2 Design Guidelines

While you are grocery shopping you encounter your co-op supervisor from last year. She's now forming a startup company in Waterloo that will build digital circuits. She's writing up the design guidelines that all of their projects will follow. She asks for your advice on some potential guidelines. What is your response to each question? What is your justification for your answer? What are the tradeoffs between the two options?

0. Sample: Should all projects use silicon chips, or should all use biological chips, or should each project choose its own technique?

Answer: All projects should use silicon-based chips, because biological chips don't exist yet. The tradeoff is that if biological chips existed, they would probably consume less power than silicon chips.

1. Should all projects use an asynchronous reset signal, or should all use a synchronous reset signal, or should each project choose its own technique?

2. Should all projects use latches, or should all projects use flip-flops, or should each project choose its own technique?
3. Should all chips have registers on the inputs and outputs, or should chips have the inputs and outputs directly connected to combinational circuitry, or should each project choose its own technique? By "register" we mean either flip-flops or latches, based upon your answer to the previous question. If your answer is different for inputs and outputs, explain why.

4. Should all circuit modules on all chips have flip-flops on the inputs and outputs, or should modules have the inputs and outputs directly connected to combinational circuitry, or should each project choose its own technique? By "register" we mean either flip-flops or latches, based upon your answer to the previous question. If your answer is different for inputs and outputs, explain why.

5. Should all projects use tri-state buffers, or should all projects use multiplexors, or should each project choose its own technique?

P2.3 Dataflow Diagram Optimization

Use the dataflow diagram below to answer problems P2.3.1 and P2.3.2.

[Figure: dataflow diagram with inputs a, b, c, d, e and several f and g components scheduled across clock cycles.]

P2.3.1 Resource Usage

List the number of items for each resource used in the dataflow diagram.
P2.3.2 Optimization

Draw an optimized dataflow diagram that improves the performance and produces the same output values. Or, if the performance cannot be improved, describe the limiting factor on the performance.

NOTES:
• You may change the times when signals are read from the environment.
• You may not increase the resource usage (input ports, registers, output ports, f components, g components).
• You may not increase the clock period.

P2.4 Dataflow Diagram Design

Your manager has given you the task of implementing the following pseudocode in an FPGA:

    if is_odd(a + d)
      p = (a + d)*2 + ((b + c) - 1)/4;
    else
      p = (b + c)*2 + d;

NOTES:
1) You must use registers on all input and output ports.
2) p, a, b, c, and d are to be implemented as 8-bit signed signals.
3) A 2-input 8-bit ALU that supports both addition and subtraction takes 1 clock cycle.
4) A 2-input 8-bit multiplier or divider takes 4 clock cycles.
5) A small amount of additional circuitry (e.g. a NOT gate, an AND gate, or a MUX) can be squeezed into the same clock cycle(s) as an ALU operation, multiply, or divide.
6) You can require that the environment provides the inputs in any order and that it holds the input signals at the same value for multiple clock cycles.

P2.4.1 Maximum Performance

What is the minimum number of clock cycles needed to implement the pseudocode with a circuit that has two input ports? What is the minimum number of ALUs, multipliers, and dividers needed to achieve the minimum number of clock cycles that you just calculated?
P2.4.2 Minimum Area

What is the minimum number of datapath storage registers (8, 6, 4, and 1 bit) and clock cycles needed to implement the pseudocode if the circuit can have at most one ALU, one multiplier, and one divider?

P2.5 Michener: Design and Optimization

Design a circuit named michener that performs the following operation:

    z = (a + d) + ((b - c) - 1)

NOTES:
1. Optimize your design for area.
2. You may schedule the inputs to arrive at any time.
3. You may do algebraic transformations of the specification.

P2.6 Dataflow Diagrams with Memory Arrays

    Component                             Delay
    Register                              5 ns
    Adder                                 25 ns
    Subtracter                            30 ns
    ALU with +, −, >, =, −, AND, XOR      40 ns
    Memory read                           60 ns
    Memory write                          60 ns
    Multiplication                        65 ns
    2:1 Multiplexor                       5 ns

NOTES:
1. The inputs of the algorithms are a and b.
2. The outputs of the algorithms are p and q.
3. You must register both your inputs and outputs.
4. You may choose to read your input data values at any time and produce your outputs at any time. For your inputs, you may read each value only once (i.e. the environment will not send multiple copies of the same value).
5. Execution time is measured from when you read your first input until the latter of producing your last output or the completion of writing a result to memory.
6. M is an internal memory array, which must be implemented as dual-ported memory with one read/write port and one read port.
7. M supports synchronous write and asynchronous read.
8. Assume all memory-address and other arithmetic calculations are within the range of representable numbers (i.e. no overflows occur).
9. If you need a circuit not on the list above, assume that its delay is 30 ns.
10. You may sacrifice area efficiency to achieve high performance, but marks will be deducted for extra hardware that does not contribute to performance.

P2.6.1 Algorithm 1

    q = M[b];
    M[a] = b;
    p = M[b+1] * a;

Assuming a ≤ b, draw a dataflow diagram that is optimized for the fastest overall execution time.

P2.6.2 Algorithm 2

    q = M[b];
    M[a] = q;
    p = (M[b-1] * b) + M[b];

Assuming a > b, draw a dataflow diagram that is optimized for the fastest overall execution time.

P2.7 2-bit Adder

This question compares an FPGA and a generic-gates implementation of a 2-bit full adder.

P2.7.1 Generic Gates

Show the implementation of a 2-bit adder using NAND, NOR, and NOT gates.
P2.7.2 FPGA

Show the implementation of a 2-bit adder using generic FPGA cells; show the equations for the lookup tables.

[Figure: two generic FPGA cells, each a comb lookup table feeding a flip-flop with CE, R, and S. Inputs a[0], b[0], and c_in produce sum[0] and carry_1; inputs a[1], b[1], and carry_1 produce sum[1] and c_out.]

P2.8 Sketches of Problems

1. Calculate resource usage for a dataflow diagram (input ports, output ports, registers, datapath components).
2. Calculate performance data for a dataflow diagram (clock period and number of cycles to execute (CPI)).
3. Given a dataflow diagram, calculate the clock period that will result in the optimum performance.
4. Given an algorithm, design a dataflow diagram.
5. Given a dataflow diagram, design the datapath and finite state machine.
6. Optimize a dataflow diagram to improve performance or reduce resource usage.
7. Given an FSM diagram, pick the VHDL code that "best" implements the diagram (correct behaviour, simple, fast hardware) or critique hardware.
Chapter 3

Performance Analysis and Optimization

3.1 Introduction

Hennessy and Patterson's Computer Architecture: A Quantitative Approach (textbook for E&CE 429) has good information on performance. We will use some of the same definitions and formulas as Hennessy and Patterson, but we will move away from generic definitions of performance for computer systems and focus on performance for digital circuits.

3.2 Defining Performance

    Performance = Work / Time

You can double your performance by doing twice the work in the same amount of time, OR doing the same amount of work in half the time.

Benchmarking ..........................................................................

Measuring time is easy, but how do we accurately measure work? The game of benchmarketing is finding a definition of work that makes your system appear to get the most work done in the least amount of time.
    Measure of Work       Measure of Performance
    clock cycle           MHz
    instruction           MIPS
    synthetic program     Whetstone, Dhrystone, D-MIPS (Dhrystone MIPS)
    real program          SPEC
    travel 1/4 mile       drag race

The SPEC benchmarks are among the most respected and accurate predictors of real-world performance.

Definition SPEC: Standard Performance Evaluation Corporation. MISSION: "To establish, maintain, and endorse a standardized set of relevant benchmarks and metrics for performance evaluation of modern computer systems" (http://www.spec.org).

The SPEC organization has different benchmarks for integer software, floating-point software, web-serving software, etc.

3.3 Comparing Performance

3.3.1 General Equations

Equation for "Big is n% greater than Small":

    n% = (Big − Small) / Small

For the above equation, it can be difficult to remember whether the denominator is the larger number or the smaller number. To see why Small is the only sensible choice, consider the situation where a is 100% greater than b. This means that the difference between a and b is 100% of something. Our only variables are a and b. It would be nonsensical for the difference to be a, because that would mean a − b = a. However, if a − b = b, then for a to be 100% greater than b simply means that a = 2b.

Using the "n% greater" formula, the phrase "The performance of A is n% greater than the performance of B" is:

    n% = (Performance_A − Performance_B) / Performance_B
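The "n% greater" arithmetic above is easy to check in software. The helper below is an illustrative sketch (the function name is ours, not from the notes).

```python
# "Big is n% greater than Small": the denominator is always the smaller
# quantity. Sanity check: 200 is 100% greater than 100, i.e. double.

def pct_greater(big, small):
    return 100.0 * (big - small) / small

assert pct_greater(200, 100) == 100.0
assert pct_greater(2, 1) == 100.0   # a = 2b means a is 100% greater than b
```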
Performance is inversely proportional to time:

    Performance = 1 / Time

Substituting the above equation into the equation for "the performance of A is n% greater than the performance of B" gives:

    n% = (Time_B − Time_A) / Time_A

In general, the equation for a fast system to be "n%" faster than a slow system is:

    n% = (T_Slow − T_Fast) / T_Fast

Another useful formula is the average time to do one of k different tasks, where task i happens %_i of the time and takes time T_i each time it is done:

    T_Avg = sum over i = 1..k of (%_i)(T_i)

We can measure the performance of practically anything (cars, computers, vacuum cleaners, printers, ...).

3.3.2 Example: Performance of Printers

                 Black and White    Colour
    printer 1         9 ppm          6 ppm
    printer 2        12 ppm          4 ppm

Question: Which printer is faster at B&W, and how much faster is it?

Answer:
BW Performance ........................................................................

    n% faster = (T_Slow − T_Fast) / T_Fast

    BW_1 = 1 / (9 ppm)  = 0.1111 min/page
    BW_2 = 1 / (12 ppm) = 0.0833 min/page

    BW_Faster = (BW_1 − BW_2) / BW_2
              = (0.1111 − 0.0833) / 0.0833
              = 33% faster

Performance for Different Tasks .......................................................

Question: If the average workload is 90% B&W and 10% colour, which printer is faster and how much faster is it?
Answer:

    T_Avg1 = %BW × BW_1 + %C × C_1
           = (0.90 × 0.1111) + (0.10 × 0.1667)
           = 0.1167 min/page

    T_Avg2 = %BW × BW_2 + %C × C_2
           = (0.90 × 0.0833) + (0.10 × 0.2500)
           = 0.1000 min/page

    Avg_Faster = (T_Avg1 − T_Avg2) / T_Avg2
               = (0.1167 − 0.1000) / 0.1000
               = 16.7% faster

Optimizing Performance ................................................................

Question: If we want to optimize printer 1 to match the performance of printer 2, should we optimize B&W or colour printing?

Answer: Colour printing is slower, so it appears that we can save more time by optimizing colour printing. However, consider the extreme case of making colour printing instantaneous for printer 1:
[Figure: bar chart of average time per page (min/page, 0.000 to 0.150) for P1 and P2; even with instantaneous colour printing, P1's average barely changes.]

Even if we made colour printing instantaneous for printer 1 and kept printer 2 the same, printer 1 would not be measurably faster.

Amdahl's law: "Make the common case fast." Optimizations need to take into account both run time and frequency of occurrence. We should optimize black-and-white printing.

Question: If you have to fire all of the engineers because your stock price plummeted, how can you get printer 1 to be faster than printer 2? Note: This question was actually humorous during the high-tech bubble of 2000...

Answer: Hire more marketing people! Notice that colour printing on printer 1 is faster than on printer 2. So, marketing suggests that people are increasing the percentage of printing that is done in colour.

Question: Revised question: what percentage of printing must be done in colour for printer 1 to beat printer 2?
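The weighted-average comparison and the Amdahl's-law claim above can both be reproduced in a few lines. This is an illustrative sketch; the helper name `t_avg` is ours.

```python
# Reproduce the printer comparison: times are minutes/page (1/ppm), and the
# average is weighted by the 90% BW / 10% colour workload.

def t_avg(bw, colour, pct_bw=0.90):
    return pct_bw * bw + (1 - pct_bw) * colour

bw1, c1 = 1 / 9, 1 / 6          # printer 1: 9 ppm BW, 6 ppm colour
bw2, c2 = 1 / 12, 1 / 4         # printer 2: 12 ppm BW, 4 ppm colour

faster = (t_avg(bw1, c1) - t_avg(bw2, c2)) / t_avg(bw2, c2)
assert abs(faster - 0.167) < 1e-2        # printer 2 is about 16.7% faster

# extreme case: colour printing on printer 1 made instantaneous; the two
# averages only just meet, so optimizing colour cannot beat printer 2
assert abs(t_avg(bw1, 0.0) - t_avg(bw2, c2)) < 1e-9
```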
Answer:

    T_Avg1 ≤ T_Avg2
    %BW × BW_1 + %C × C_1 ≤ %BW × BW_2 + %C × C_2

Substituting %BW = 1 − %C:

    (1 − %C) × BW_1 + %C × C_1 ≤ (1 − %C) × BW_2 + %C × C_2
    BW_1 + %C × (C_1 − BW_1) ≤ BW_2 + %C × (C_2 − BW_2)

    %C ≥ (BW_1 − BW_2) / (BW_1 − BW_2 + C_2 − C_1)
    %C ≥ (0.1111 − 0.0833) / (0.1111 − 0.0833 + 0.2500 − 0.1667)
    %C ≥ 0.25

3.4 Clock Speed, CPI, Program Length, and Performance

3.4.1 Mathematics

    CPI          cycles per instruction
    NumInsts     number of instructions
    ClockSpeed   clock speed
    ClockPeriod  clock period

    Time = NumInsts × CPI × ClockPeriod
    Time = (NumInsts × CPI) / ClockSpeed

3.4.2 Example: CISC vs RISC and CPI

                        Clock Speed    SPECint
    AMD Athlon            1.1 GHz        409
    Fujitsu SPARC64       675 MHz        443
The AMD Athlon is a CISC microprocessor (it uses the IA-32 instruction set). The Fujitsu SPARC64 is a RISC microprocessor (it uses Sun's SPARC instruction set). Assume that it requires 20% more instructions to write a program in the SPARC instruction set than the same program requires in IA-32.

Question: Which of the two processors has higher performance?

Answer: SPECint, SPECfp, and SPEC are measures of performance. Therefore, the higher the SPEC number, the higher the performance. The Fujitsu SPARC64 has higher performance.

Question: What is the ratio between the CPIs of the two microprocessors?

Answer: We will use a as the subscript for the Athlon and s as the subscript for the SPARC.

    Time = (NumInsts × CPI) / ClockSpeed
    CPI  = (Time × ClockSpeed) / NumInsts
    CPI  = ClockSpeed / (Perf × NumInsts)

    CPI_a / CPI_s = (ClockSpeed_a / ClockSpeed_s) × (Perf_s × NumInsts_s) / (Perf_a × NumInsts_a)

    ClockSpeed_a = 1.1 GHz
    ClockSpeed_s = 0.675 GHz
    Perf_a = 409
    Perf_s = 443
    NumInsts_s = 1.2 × NumInsts_a

    CPI_a / CPI_s = (1.1 / 0.675) × (443 × 1.2 × NumInsts_a) / (409 × NumInsts_a)
                  = 2.1
                  = 110% more
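The ratio above can be checked numerically; the sketch below (variable names are ours) shows that the unknown instruction count cancels out.

```python
# CPI ratio between the Athlon and the SPARC64, from
# CPI = ClockSpeed / (Perf * NumInsts), with NumInsts_s = 1.2 * NumInsts_a.

clk_a, perf_a = 1.1e9, 409          # Athlon: 1.1 GHz, SPECint 409
clk_s, perf_s = 675e6, 443          # SPARC64: 675 MHz, SPECint 443

# NumInsts_a cancels out of the ratio, so set it to 1
ratio = (clk_a / clk_s) * (perf_s * 1.2) / (perf_a * 1.0)
assert abs(ratio - 2.1) < 0.05      # Athlon CPI is roughly 2.1x the SPARC64 CPI
```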
Executing the average Athlon instruction requires 110% more clock cycles than executing the average SPARC instruction. Stated more awkwardly: executing the average Athlon instruction requires 210% of the clock cycles required to execute the average SPARC instruction.

Question: Can you determine the absolute (actual) CPI of either microprocessor?

Answer: To determine the absolute CPI, we would need to know the actual number of instructions executed by at least one of the processors.

3.4.3 Effect of Instruction Set on Performance

Your group designs a microprocessor and you are considering adding a fused multiply-accumulate to the instruction set. (A fused multiply-accumulate is a single instruction that does both a multiply and an addition. It is often used in digital signal processing.) Your studies have shown that, on average, half of the multiply operations are followed by an add instruction that could be done with a fused multiply-add. Additionally, you know:

                 CPI           %
    ADD      0.8 × CPI_avg    15%
    MUL      1.2 × CPI_avg     5%
    Other    1.0 × CPI_avg    80%

You have three options:

    option 1: no change.
    option 2: add the MAC instruction, increase the clock period by 20%; MAC has the same CPI as MUL.
    option 3: add the MAC instruction, keep the clock period the same; the CPI of a MAC is 50% greater than that of a multiply.

Question: Which option will result in the highest overall performance?
Answer:

    Time = (NumInsts × CPI) / ClockSpeed
    Perf = ClockSpeed / (NumInsts × CPI)

We need to find NumInsts, CPI, and ClockSpeed for each of the three options. Option 1 is the baseline, so we will define values for variables in options 2 and 3 in terms of the option-1 variables. Options 2 and 3 will have the same number of instructions. Half of the multiply instructions are followed by an add that can be fused.

In questions that involve changing both CPI and NumInsts, it is often easiest to work with the product of CPI and NumInsts, which represents the total number of clock cycles needed to execute the program. Additionally, set the problem up with an imaginary program of 100 instructions on the baseline system.

    NumMAC_2 = 0.5 × NumMUL_1 = 0.5 × 5 = 2.5
    NumMUL_2 = 0.5 × NumMUL_1 = 0.5 × 5 = 2.5
    NumADD_2 = NumADD_1 − 0.5 × NumMUL_1 = 15 − 0.5 × 5 = 12.5

Find the total number of clock cycles for each option:

    Cycles_1 = NumMUL_1 × CPI_MUL + NumADD_1 × CPI_ADD + NumOth_1 × CPI_Oth
             = (5 × 1.2) + (15 × 0.8) + (80 × 1.0)
             = 98

    Cycles_2 = (NumMAC_2 × CPI_MAC) + (NumMUL_2 × CPI_MUL)
             + (NumADD_2 × CPI_ADD) + (NumOth_2 × CPI_Oth)
             = (2.5 × 1.2) + (2.5 × 1.2) + (12.5 × 0.8) + (80 × 1.0)
             = 96

    Cycles_3 = (NumMAC_3 × CPI_MAC) + (NumMUL_3 × CPI_MUL)
             + (NumADD_3 × CPI_ADD) + (NumOth_3 × CPI_Oth)
             = (2.5 × (1.5 × 1.2)) + (2.5 × 1.2) + (12.5 × 0.8) + (80 × 1.0)
             = 97.5
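The cycle counts above can be reproduced with a short sketch (helper and dictionary names are ours), taking CPI_avg as one cycle for the baseline mix.

```python
# Cycle counts for the three MAC options on an imaginary 100-instruction
# baseline program: 15 ADDs, 5 MULs, 80 others.

CPI = {"ADD": 0.8, "MUL": 1.2, "MAC": 1.2, "Other": 1.0}

def cycles(counts, mac_cpi=CPI["MAC"]):
    return (CPI["ADD"] * counts["ADD"] + CPI["MUL"] * counts["MUL"]
            + mac_cpi * counts["MAC"] + CPI["Other"] * counts["Other"])

opt1  = {"ADD": 15,   "MUL": 5,   "MAC": 0,   "Other": 80}  # no change
fused = {"ADD": 12.5, "MUL": 2.5, "MAC": 2.5, "Other": 80}  # half the MULs fused

assert cycles(opt1) == 98                         # option 1
assert cycles(fused) == 96                        # option 2 (before its 20% slower clock)
assert cycles(fused, mac_cpi=1.5 * 1.2) == 97.5   # option 3 (MAC CPI 50% above MUL)
```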
Calculate performance for each option using the formula:

    Performance = 1 / (Cycles × ClockPeriod)

    Performance_1 = 1 / (98 × 1)   = 1/98
    Performance_2 = 1 / (96 × 1.2) = 1/115
    Performance_3 = 1 / (97.5 × 1) = 1/97.5

The third option is the fastest.

3.4.4 Effect of Time to Market on Relative Performance

Assume that the performance of the average product in your market segment doubles every 18 months. You are considering an optimization that will improve the performance of your product by 7%.

Question: If you add the optimization, how much can you allow your schedule to slip before the delay hurts your relative performance, compared to not doing the optimization and launching the product according to your current schedule?

Answer:

    P(t) = performance at time t = P_0 × 2^(t/18)

From the problem statement, P(t) = 1.07 × P_0. Equate the two expressions for P(t), then solve for t:

    1.07 × P_0 = P_0 × 2^(t/18)
    2^(t/18) = 1.07
    t/18 = log2 1.07
    t = 18 × log2 1.07

Using log_b x = (log x) / (log b):

    t = 18 × (log 1.07) / (log 2)
      = 1.76 months
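The schedule-slip calculation above amounts to one logarithm; the sketch below (function name is ours) generalizes it to any performance gain and doubling period.

```python
import math

# Allowed schedule slip for a given performance gain when the market doubles
# performance every 18 months: solve 2**(t/18) = 1 + gain for t.

def slip_allowed(gain, doubling_months=18):
    return doubling_months * math.log2(1 + gain)

assert abs(slip_allowed(0.07) - 1.76) < 0.01   # a 7% gain buys ~1.76 months
```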
3.4.5 Summary of Equations

Time to perform a task:

    Time = (NumInsts × CPI) / ClockSpeed

Average time to do one of k different tasks:

    T_Avg = sum over i = 1..k of (%_i)(T_i)

Performance:

    Performance = Work / Time

Speedup:

    Speedup = T_Slow / T_Fast

T_Fast is n% faster than T_Slow:

    n% faster = (T_Slow − T_Fast) / T_Fast

Performance at time t if performance increases by a factor of k every n units of time:

    Perf(t) = Perf(0) × k^(t/n)
3.5 Performance Analysis and Dataflow Diagrams

3.5.1 Dataflow Diagrams, CPI, and Clock Speed

One of the challenges in designing a circuit is to choose the clock speed. Increasing the clock speed of a circuit might not improve its performance. In this section we will work through several example dataflow diagrams to pick a clock speed for the circuit and schedule operations into clock cycles.

When partitioning dataflow diagrams into clock cycles, we need to choose a clock period. Choosing a clock period affects many aspects of the design, not just the overall performance. Different design goals might put conflicting pressure on the clock period: some goals will tend toward short clock periods and some goals will tend toward long clock periods. For performance, not only is clock period a poor indicator of the relative performance of two different systems, but even for the same system decreasing the clock period might not increase the performance.

    Goal                              Action                  Effect
    Minimize area                     decrease clock period   fewer operations per clock cycle, so fewer
                                                              datapath components and more opportunities
                                                              to reuse hardware
    Increase scheduling flexibility   increase clock period   more flexibility in grouping operations
                                                              into clock cycles
    Decrease percentage of clock      increase clock period   decreases the number of flops that data
    cycle spent in flops (overhead)                           traverses through (time in flops is
                                                              overhead, not useful work)
    Decrease time to execute          ????                    depends on the dataflow diagram
    an instruction

Our general plan to find the clock period for maximum performance is:

1. Pick the clock period to be the delay through the slowest component plus the delay through a flop.

2. For each instruction, for each operation, schedule the operation in the earliest clock cycle possible without violating clock-period timing constraints.

3. Calculate the average time to execute an instruction by combining:

       Time = (NumInsts × CPI) / ClockSpeed

   and:

       CPI_avg = sum over i = 1..k of %_i × CPI_i

   to derive:

       Time = (NumInsts × sum over i = 1..k of %_i × CPI_i) / ClockSpeed
  • 262.
    240 CHAPTER 3. PERFORMANCE ANALYSIS AND OPTIMIZATION 4. If the maximum latency through dataflow diagram is greater than 1, then increase clock period by minimum amount needed to decrease latency by one clock period and return to Step 2. 5. If the maximum latency through dataflow diagram is 1, then clock period for highest perfor- mance is clock period resulting in fastest Time. 6. If possible, adjust the schedule of operations to reduce the maximum number of occurrences of a component per instruction per clock cycle without increasing latency for any instruction. 3.5.2 Examples of Dataflow Diagrams for Two Instructions Circuit supports two instructions, A and B (e.g. multiply and divide). At any point in time, the circuit is doing either A or B — it does not need to support doing A and B simultaneously. The diagrams below show the flow for each instruction and the delay through the components (f,g,h,i) that the instructions use. The delay through a register is 5ns. Each operation (A and B) occurs 50% of the time. Our goal is to find a clock period and dataflow diagram for the circuit that will give us the highest overall performance. Instruction A Instruction B f (30ns) i (40ns) g (50 ns) g (50 ns) h (20 ns) g (50 ns)
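Steps 1 and 2 of the plan can be sketched in code for the simple case of a single chain of dependent operations. This is an illustrative sketch (our own, not from the notes); it greedily packs each operation into the earliest cycle whose remaining combinational budget (clock period minus flop delay) can hold it:

```python
def cpi_for_chain(op_delays, clock_period, flop_delay=5):
    """Number of clock cycles (CPI) to execute a chain of dependent operations,
    scheduling each operation in the earliest clock cycle that still has room."""
    budget = clock_period - flop_delay       # usable combinational time per cycle
    if any(d > budget for d in op_delays):
        raise ValueError("clock period too short for slowest component")
    cycles, used = 1, 0
    for d in op_delays:
        if used + d <= budget:
            used += d                        # chain this op into the current cycle
        else:
            cycles += 1                      # op must wait for the next cycle
            used = d
    return cycles

# Instruction A = f, g, h, g and Instruction B = i, g, with a 5 ns register delay
A, B = [30, 50, 20, 50], [40, 50]
assert [cpi_for_chain(A, T) for T in (55, 75, 85, 155)] == [4, 3, 2, 1]
assert [cpi_for_chain(B, T) for T in (55, 95)] == [2, 1]
```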
3.5.2.1 Scheduling of Operations for Different Clock Periods

[Figures: schedules of Instructions A and B for clock periods of 55 ns, 75 ns, 85 ns, 95 ns, and 155 ns. With a 55 ns clock, each operation gets its own cycle (A: f / g / h / g; B: i / g). Longer clock periods let operations chain within a cycle: at 85 ns, A fits in two cycles (f,g / h,g); at 95 ns, B fits in one cycle (i,g); at 155 ns, each instruction fits in a single cycle.]

3.5.2.2 Performance Computation for Different Clock Periods

Question: Which clock speed will result in the highest overall performance?
Answer:

    Clock Period   CPIA   CPIB   Tavg
    55 ns          4      2      55 × (0.5×4 + 0.5×2) = 165
    75 ns          3      2      75 × (0.5×3 + 0.5×2) = 187.5
    85 ns          2      2      85 × (0.5×2 + 0.5×2) = 170
    95 ns          2      1      95 × (0.5×2 + 0.5×1) = 142.5   <--
    155 ns         1      1      155 × (0.5×1 + 0.5×1) = 155

3.5.2.3 Example: Two Instructions Taking Similar Time

Question: For the flow below, which clock speed will result in the highest overall performance?

    [Dataflow diagrams: Instruction A is the chain f (30 ns), g (50 ns), h (20 ns), g (50 ns);
     Instruction B is the chain i (40 ns), g (50 ns), i (40 ns).]

Answer:

[Figures: schedules of Instructions A and B for clock periods of 55 ns, 75 ns, 85 ns, and 95 ns.]
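The average-time calculation is easy to check numerically. A minimal sketch (our own code; the 95 ns entry works out to exactly 142.5 ns):

```python
# CPI values taken from the schedules for each candidate clock period;
# instructions A and B each occur 50% of the time.
cpis = {55: (4, 2), 75: (3, 2), 85: (2, 2), 95: (2, 1), 155: (1, 1)}

def t_avg(period, cpi_a, cpi_b, frac_a=0.5):
    return period * (frac_a * cpi_a + (1 - frac_a) * cpi_b)

times = {T: t_avg(T, *c) for T, c in cpis.items()}
best = min(times, key=times.get)    # clock period with the smallest average time

assert times[55] == 165 and times[75] == 187.5 and times[85] == 170
assert times[95] == 142.5 and times[155] == 155
assert best == 95
```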
[Figures: schedules of Instructions A and B for clock periods of 105 ns, 135 ns, and 155 ns. We should skip 105 ns, because it has the same latency as 95 ns.]

    Clock Period   CPIA   CPIB   Tavg
    55 ns          4      3      193
    75 ns          3      3      225
    85 ns          2      3      213
    95 ns          2      2      190
    105 ns         2      2      NO GAIN
    135 ns         2      1      203
    155 ns         1      1      155

A clock period of 155 ns results in the highest performance. For a clock period of 105 ns, we did not calculate the performance, because we could see that it would be worse than the performance with a clock period of 95 ns: the dataflow diagram with the 105 ns clock period has the same latency as the diagram with the 95 ns clock period, and if the diagram with the longer clock period has the same latency as the diagram with the shorter clock period, then the diagram with the longer clock period will have lower performance.

3.5.2.4 Example: Same Total Time, Different Order for A

Question: For the flow below, which clock speed will result in the highest overall performance?
    [Dataflow diagrams: Instruction A is the chain f (30 ns), h (20 ns), g (50 ns), g (50 ns);
     Instruction B is the chain i (40 ns), g (50 ns), i (40 ns).]

Answer:

    Clock Period   CPIA   CPIB   Tavg
    55 ns          3      3      165 ns
    95 ns          3      2      238 ns
    105 ns         2      2      210 ns
    135 ns         2      1      203 ns
    155 ns         1      1      155 ns

A clock period of 155 ns results in the lowest average execution time, and hence the highest performance. This is the same answer as the previous problem, but the total times for the higher clock frequencies differ significantly between the two problems.

3.5.3 Example: From Algorithm to Optimized Dataflow

This question involves doing some of the design work for a circuit that implements InstP and InstQ using the components described below.

    Component Delays           Instruction   Algorithm                         Frequency of Occurrence
    2-input Mult   40 ns       InstP         a × b × ((a × b) + (b × d) + e)   75%
    2-input Add    25 ns       InstQ         (i + j + k + l) × m               25%
    Register        5 ns

NOTES

  • There is a resource limitation of a maximum of 3 input ports. (There are no other resource limitations.)
  • You must put registers on your inputs; you do not need to register your outputs.
  • The environment will directly connect your outputs (its inputs) to registers.
  • Each input value (a, b, c, d, e, i, j, k, l, m) can be input only once; if you need to use a value in multiple clock cycles, you must store it in a register.

Question: What clock period will result in the best overall performance?

Answer:
Algorithm Answers (InstP) ................................................

[Figures: the parse tree of a × b × ((a × b) + (b × d) + e); the same tree after common subexpression elimination of a*b; the resulting data-dependency graph; and an alternative data-dependency graph. A schedule with clock = 45 ns has lat = 4 (T = 45 × 4 = 180 ns).]

Both options have a critical path of 2 mults + 2 adds. The first option allows three operations to be done with just three inputs (a, b, d); the second option requires all four inputs to do three operations.
[Figures: InstP dataflow diagrams for clock = 55 ns (lat = 3, T = 165 ns) and clock = 70 ns (lat = 2, T = 140 ns); a dataflow diagram using the alternative data-dependency graph, which adds a third clock cycle without any gain in clock speed; and an illegal diagram that requires 4 inputs in one clock cycle.]

From the diagrams, it is clear that it is better to put a×b in the first clock cycle and e in the second, because a×b can be done in parallel with b×d.

The fastest option for InstP is a 70 ns clock, which gives a total execution time of 140 ns.
Algorithm Answers (InstQ) .................................................

[Figures: the InstQ data-dependency graph with maximum parallelism, and an alternative data-dependency graph. The alternative graph is able to do two operations with three inputs, while the first graph requires four inputs to do two operations. We are limited to three inputs, so we choose the alternative data-dependency graph for the dataflow diagrams.]

[Figures: InstQ dataflow diagrams for clock = 50 ns (lat = 4, T = 200 ns) and clock = 55 ns (lat = 3, T = 165 ns).]
[Figures: InstQ dataflow diagrams for clock = 70 ns (lat = 2, T = 140 ns); a diagram with a longer clock period whose latency did not decrease (irrelevant); and a diagram for clock = 120 ns (lat = 1, T = 120 ns).]

A 120 ns clock with a latency of 1 would give T = 120 ns, but it requires reading all five inputs (i, j, k, l, m) in a single clock cycle, which violates the three-input-port limit. The fastest legal option for InstQ is a 70 ns clock, which gives a total execution time of 140 ns.

Both InstP and InstQ need a 70 ns clock period to maximize their performance. So, use a 70 ns clock, which gives a latency of 2 clock cycles for both instructions.

    Fastest execution time   140 ns
    Clock period             70 ns
Question: Find a minimal set of resources that will achieve the performance you calculated.

Answer:

[Figures: final dataflow graphs for InstP (clock = 70 ns, lat = 2, T = 140 ns) and InstQ.]

We need to do only one of InstP and InstQ at any time, so simply take the max of each resource.

                  InstP   InstQ   System
    Inputs        3       3       3
    Outputs       1       1       1
    Registers     3       3       3
    Adders        2       2       2
    Multipliers   2       1       2
Question: Design the datapath and state machine for your design.

Answer:

[Figures: datapaths and state machines (states S0 and S1) for InstP and InstQ; both run with clock = 70 ns, lat = 2, T = 140 ns. The datapath has input ports i1, i2, i3, registers r1, r2, r3, multipliers m1 and m2, adders a1 and a2, and output o1.]

Control Tables ............................................................

                 r1         r2         r3         m1          m2          a1          a2
               ce  mux    ce  mux    ce  mux    src1 src2   src1 src2   src1 src2   src1 src2
    InstP S0    1  i1      1  i2      1  i3      r1   r2     r3   a1     –    –      m1   m2
    InstP S1    1  a2      1  i2      1  m1      –    –      r2   r3     r1   r2     –    –
    InstQ S0    1  i1      1  i2      1  i3      –    –      a1   r3     r1   r2     a1   r3
    InstQ S1    1  a2      1  i2      1  i3      –    –      –    –      r1   r2     –    –

Optimized Control Table ...................................................

                r1    r2    r3    m1          m2          a1          a2
                mux   mux   mux   src1 src2   src1 src2   src1 src2   src1 src2
    InstP S0    i1    i2    i3    r1   r2     a1   r3     r1   r2     m1   m2
    InstP S1    a2    i2    m1    r1   r2     r2   r3     r1   r2     m1   m2
    InstQ S0    i1    i2    i3    r1   r2     a1   r3     r1   r2     a1   r3
    InstQ S1    a2    i2    i3    r1   r2     r2   r3     r1   r2     a1   r3
Write VHDL Code ..........................................................

Use the optimized control table as the basis for the VHDL code.

    process (clk) begin
      if rising_edge(clk) then
        if state = S0 then
          r1 <= i1;
        else
          r1 <= a2;
        end if;
      end if;
    end process;

    process (clk) begin
      if rising_edge(clk) then
        r2 <= i2;
      end if;
    end process;

    process (clk) begin
      if rising_edge(clk) then
        if inst = instP and state = S0 then
          r3 <= m1;
        else
          r3 <= i3;
        end if;
      end if;
    end process;

    m1 <= r1 * r2;

    m2_src1 <= r2 when state = S0 else a1;
    m2      <= m2_src1 * r3;

    a1 <= r1 + r2;
    a2 <= a2_src1 + a2_src2;

    process (inst, m1, m2, a1, r3) begin
      if inst = instP then
        a2_src1 <= m1;
        a2_src2 <= m2;
      else
        a2_src1 <= a1;
        a2_src2 <= r3;
      end if;
    end process;
3.6 General Optimizations

3.6.1 Strength Reduction

Strength reduction replaces one operation with another that is simpler.

3.6.1.1 Arithmetic Strength Reduction

    Multiply by a constant power of two    wired shift logical left
    Multiply by a power of two             shift logical left
    Divide by a constant power of two      wired shift logical right
    Divide by a power of two               shift logical right
    Multiply by 3                          wired shift and addition

3.6.1.2 Boolean Strength Reduction

Boolean tests that can be implemented as wires:

  • is odd, is even : least significant bit
  • is neg, is pos : most significant bit
  • NOTE: use is odd(a) rather than a(0)

By choosing your encodings carefully, you can sometimes reduce a vector comparison to a wire. For example, if your state uses a one-hot encoding, then the comparison state = S3 reduces to state(3) = '1'. You might expect a reasonable logic-synthesis tool to do this reduction automatically, but most tools do not.

When using encodings other than one-hot, Karnaugh maps can be useful tools for optimizing vector comparisons. By carefully choosing our state assignments when we use a full binary encoding for 8 states, the comparison (state = S0 or state = S3 or state = S4) can be reduced from looking at 3 bits to looking at just 2 bits. If we have a condition that is true for four states, then we can find an encoding that looks at just 1 bit.
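The arithmetic and Boolean reductions above can be checked with ordinary integer arithmetic. A quick sketch (illustrative only; the 8-bit width for the sign-bit test is our own assumption):

```python
x = 37

# multiply/divide by a power of two: a (wired) shift
assert x * 8 == x << 3
assert x // 8 == x >> 3               # holds for non-negative x

# multiply by 3: one wired shift plus one addition
assert x * 3 == (x << 1) + x

# is_odd: just the least significant bit
assert (x % 2 == 1) == bool(x & 1)

# is_neg for two's complement: just the most significant (sign) bit
WIDTH = 8                             # assumed width for this sketch
for y in (-5, 5):
    sign_bit = (y >> (WIDTH - 1)) & 1
    assert (y < 0) == bool(sign_bit)
```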
3.6.2 Replication and Sharing

3.6.2.1 Mux-Pushing

Pushing multiplexors into the fanin of a signal can reduce area.

    Before                              After
    z <= a + b when (w = '1')           tmp <= b when (w = '1')
         else a + c;                           else c;
                                        z   <= a + tmp;

The first circuit will have two adders, while the second will have one adder. Some synthesis tools will perform this optimization automatically, particularly if all of the signals are combinational.

3.6.2.2 Common Subexpression Elimination

Introduce new signals to capture subexpressions that occur in multiple places in the code.

    Before                              After
    y <= a + b + c when (w = '1')       tmp <= a + c;
         else d;                        y   <= b + tmp when (w = '1')
    z <= a + c + d when (w = '1')              else d;
         else e;                        z   <= d + tmp when (w = '1')
                                               else e;

Note: Clocked subexpressions. Care must be taken when doing common subexpression elimination in a clocked process. Putting the "temporary" signal in the clocked process will add a clock cycle to the latency of the computation, because the tmp signal will be a flip-flop. The tmp signal must be combinational to preserve the behaviour of the circuit.

3.6.2.3 Computation Replication

  • To improve performance
      – If the same result is needed at two very distant locations and wire delays are significant, it might improve performance (increase clock speed) to replicate the hardware.
  • To reduce area
      – If the same result is needed at two different times that are widely separated, it might be cheaper to reuse the hardware component to repeat the computation than to store the result in a register.

Note: Muxes are not free. Each time a component is reused, multiplexors are added to its inputs and/or outputs. Too much sharing of a component can cost more area in additional multiplexors than would be spent in replicating the component.
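The mux-pushing rewrite preserves behaviour because the addition distributes over the select. A small Python model of the two circuits (our own sketch, not VHDL):

```python
def before(a, b, c, w):
    # two adders; the mux selects between the two sums
    return (a + b) if w else (a + c)

def after(a, b, c, w):
    # mux pushed into the adder's fanin: only one adder
    tmp = b if w else c
    return a + tmp

for w in (True, False):
    for a, b, c in [(1, 2, 3), (7, -4, 9)]:
        assert before(a, b, c, w) == after(a, b, c, w)
```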
3.6.3 Arithmetic

VHDL is left-associative: the expression a + b + c + d is interpreted as (((a + b) + c) + d). You can use parentheses, e.g. (a + b) + (c + d), to suggest parallelism.

Perform arithmetic on the minimum number of bits needed. If you only need the lower 12 bits of a result, but your input signals are 16 bits wide, trim your inputs to 12 bits. This results in a smaller and faster design than computing all 16 bits of the result and trimming the result to 12 bits.

3.7 Retiming

[Figure: circuit and waveform for states S0 through S3 of the signals state, a, b, c, sel, x, y, and z. The state decode has delay α and the adder has delay γ; y and z settle after α + γ, putting the state decode on the critical path.]

    process begin
      wait until rising_edge(clk);
      if state = S1 then
        z <= a + c;
      else
        z <= b + c;
      end if;
    end process;

Retimed Circuit and Waveform ........................................................
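The advice to trim inputs before doing arithmetic relies on the fact that addition modulo 2^12 depends only on the lower 12 bits of its operands. A quick check (our own sketch):

```python
MASK = (1 << 12) - 1    # keep only the lower 12 bits

for a, b in [(0xBEEF, 0xCAFE), (0x0FFF, 0x0001), (0xFFFF, 0xFFFF)]:
    # add on the full 16 bits, then trim the result ...
    full_then_trim = (a + b) & MASK
    # ... versus trim the 16-bit inputs to 12 bits, then add on 12 bits
    trim_then_add = ((a & MASK) + (b & MASK)) & MASK
    assert full_then_trim == trim_then_add
```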
[Figure: retimed circuit and waveform for states S0 through S3. The state decode now happens one cycle early and is registered into sel, so the decode delay α is removed from the path that computes z.]

The sel signal can be computed either in a clocked process one state ahead of when it is used, or combinationally from the current state:

    process begin
      wait until rising_edge(clk);
      if state = S0 then
        sel <= '1';
      else
        sel <= '0';
      end if;
    end process;

    process (state) begin
      if state = S1 then
        sel <= '1';
      else
        sel <= '0';
      end if;
    end process;

In either case, z is computed as:

    process begin
      wait until rising_edge(clk);
      if sel = '1' then
        ... -- code for z
      end if;
    end process;
3.8 Performance Analysis and Optimization Problems

P3.1 Farmer

A farmer is trying to decide which of his two trucks to use to transport his apples from his orchard to the market. Facts:

                   capacity of truck   speed when loaded   speed when unloaded
                                       with apples         (no apples)
    big truck      12 tonnes           15 kph              38 kph
    small truck     6 tonnes           30 kph              70 kph

    distance to market   120 km
    amount of apples     85 tonnes

NOTES:

1. All of the loads of apples must be carried using the same truck.
2. Elapsed time is counted from beginning to deliver the first load to returning to the orchard after the last load.
3. Ignore time spent loading and unloading apples, coffee breaks, refueling, etc.
4. For each trip, a truck travels at either its fully loaded or its empty speed.

Question: Which truck will take the least amount of time, and what percentage faster will that truck be?

Question: In planning ahead for next year, is there anything the farmer could do to decrease his delivery time with little or no additional expense? If so, what is it; if not, explain.
P3.2 Network and Router

In this question there is a network that runs a protocol called BigLan. You are designing a router called the DataChopper that routes packets over the network running BigLan (i.e. they're BigLan packets).

The BigLan network protocol runs at a data rate of 160 Mbps (megabits per second). Each BigLan packet contains 100 bytes of routing information and 1000 bytes of data.

You are working on the DataChopper router, which has the following performance numbers:

    75 MHz   clock speed
    4        clock cycles to process a byte of either data or header
    500      additional clock cycles to process the routing information for a packet

P3.2.1 Maximum Throughput

Which has the higher maximum throughput (as measured in data bits per second; only the payload bits count as useful work), the network or your router, and how much faster is it?

P3.2.2 Packet Size and Performance

Explain the effect of an increase in packet length on the performance of the DataChopper (as measured in the maximum number of bits per second that it can process), assuming the header remains constant at 100 bytes.

P3.3 Performance Short Answer

If performance doubles every two years, by what percentage does performance go up every month? This question is similar to compound growth from your economics class.

P3.4 Microprocessors

The Yme microprocessor is very small and inexpensive. One performance sacrifice the designers have made is to not include a multiply instruction; multiplies must be written in software using loops of shifts and adds. The Yme currently ships at a clock frequency of 200 MHz and has an average CPI of 4.

A competitor sells the Y!v1 microprocessor, which supports exactly the same instructions as the Yme. The Y!v1 runs at 150 MHz, and the average program is 10% faster on the Yme than it is on the Y!v1.
P3.4.1 Average CPI

Question: What is the average CPI for the Y!v1? If you don't have enough information to answer this question, explain what additional information you need and how you would use it.

A new version of the Y!, the Y!u2, has just been announced. The Y!u2 includes a multiply instruction and runs at 180 MHz. The Y!u2 publicity brochures claim that using their multiply instruction, rather than shift/add loops, can eliminate 10% of the instructions in the average program. The brochures also claim that the average performance of the Y!u2 is 30% better than that of the Y!v1.

P3.4.2 Why not you too?

Question: Assuming the advertising claims are true, what is the average CPI for the Y!u2? If you don't have enough information to answer this question, explain what additional information you need and how you would use it.

P3.4.3 Analysis

Question: Which of the following do you think is most likely, and why?

1. The Y!u2 is basically the same as the Y!v1 except for the multiply.
2. The Y!u2 designers made performance sacrifices in their design in order to include a multiply instruction.
3. The Y!u2 designers performed other significant optimizations in addition to creating a multiply instruction.

P3.5 Dataflow Diagram Optimization

Draw an optimized dataflow diagram that improves the performance and produces the same output values. Or, if the performance cannot be improved, describe the limiting factor on the performance.

NOTES:

  • you may change the times when signals are read from the environment
  • you may not increase the resource usage (input ports, registers, output ports, f components, g components)
  • you may not increase the clock period
[Figures: the dataflow diagram to optimize (Before Optimization), using inputs a through e and f and g components, and a blank diagram for the answer (After Optimization).]

P3.6 Performance Optimization with Memory Arrays

This question deals with the implementation and optimization of the algorithm shown below, using the library of circuit components beside it.

    Algorithm                          Component                        Delay
    q = M[b];                          Register                         5 ns
    if (a > b) then                    Adder                            25 ns
      M[a] = b;                        Subtracter                       30 ns
      p = (M[b-1] * b) + M[b];         ALU with +, -, >, =, AND, XOR    40 ns
    else                               Memory read                      60 ns
      M[a] = b;                        Memory write                     60 ns
      p = M[b+1] * a;                  Multiplication                   65 ns
    end;                               2:1 Multiplexor                  5 ns

NOTES:

1. 25% of the time, a > b.
2. The inputs of the algorithm are a and b.
3. The outputs of the algorithm are p and q.
4. You must register both your inputs and your outputs.
5. You may choose to read your input data values at any time and produce your outputs at any time. For your inputs, you may read each value only once (i.e. the environment will not send multiple copies of the same value).
6. Execution time is measured from when you read your first input until the later of producing your last output or completing a write of a result to memory.
7. M is an internal memory array, which must be implemented as a dual-ported memory with one read/write port and one write port.
8. Assume all memory-address and other arithmetic calculations are within the range of representable numbers (i.e. no overflows occur).
9. If you need a circuit not on the list above, assume that its delay is 30 ns.
10. Your dataflow diagram must include circuitry for computing a > b and using the result to choose the value for p.

Draw a dataflow diagram for each operation that is optimized for the fastest overall execution time. NOTE: You may sacrifice area efficiency to achieve high performance, but marks will be deducted for extra hardware that does not contribute to performance.

P3.7 Multiply Instruction

You are part of the design team for a microprocessor implemented on an FPGA. You currently implement your multiply instruction completely on the FPGA. You are considering using a specialized multiply chip to do the multiplication. Your task is to evaluate the performance and optimality tradeoffs between keeping the multiply circuitry on the FPGA or using the external multiplier chip.

If you use the multiplier chip, it will reduce the CPI of the multiply instruction, but will not change the CPI of any other instruction. Using the multiplier chip will also force the FPGA to run at a slower clock speed.

                                      FPGA option   FPGA + MULT option
    average CPI                       5             ???
    % of instrs that are multiplies   10%           10%
    CPI of multiply                   20            6
    Clock speed                       200 MHz       160 MHz

P3.7.1 Highest Performance

Which option, FPGA or FPGA+MULT, gives the higher performance (as measured in MIPs), and what percentage faster is the higher-performance option?
P3.7.2 Performance Metrics

Explain whether MIPs is a good choice for the performance metric when making this decision.
Chapter 4

Functional Verification

4.1 Introduction

4.1.1 Purpose

The purpose of this chapter is to illustrate techniques to quickly and reliably detect bugs in datapath and control circuits.

Section 4.5 discusses verification of datapath circuits and introduces the notions of testbench, specification, and implementation. In Section 4.6 we discuss techniques that are useful for debugging control circuits.

The verification guild website:

    http://www.janick.bergeron.com/guild/default.htm

is a good source of information on functional verification.

4.2 Overview

The purpose of functional verification is to detect and correct errors that cause a system to produce erroneous results. The terminology for validation, verification, and testing differs somewhat from discipline to discipline. In this section we outline some of the terminology differences and describe the terminology used in E&CE 327. We then describe some of the reasons that chips tend to work incorrectly.
4.2.1 Terminology: Validation / Verification / Testing

functional validation: Comparing the behaviour of a design against the customer's expectations. In validation, the "specification" is the customer; there is no written specification that can be used to evaluate the correctness of the design (implementation).

functional verification: Comparing the behaviour of a design (e.g. RTL code) against a specification (e.g. a high-level model) or a collection of properties.
  • usually treats combinational circuitry as having zero delay
  • usually done by simulating the circuit with test vectors
  • the big challenges are simulation speed and test generation

formal verification: Checking that a design has the correct behaviour for every possible input and internal state.
  • uses mathematics to reason about the circuit, rather than checking individual vectors of 1s and 0s
  • capacity problems: only usable on detailed models of small circuits or abstract models of large circuits
  • mostly a research topic, but some practical applications have been demonstrated
  • tools include model checking and theorem proving
  • formal verification is not a guarantee that the circuit will work correctly

performance validation: Checking that the implementation has (at least) the desired performance.

power validation: Checking that the implementation has (at most) the desired power.

equivalence verification (checking): Checking that the design generated by a synthesis tool has the same behaviour as the RTL code.

timing verification: Checking that all of the paths in a circuit meet the timing constraints.

Hardware vs Software Terminology ....................................................

Note: in software, "testing" refers to running programs with specific inputs and checking if the program does the right thing. In hardware, "testing" usually means "manufacturing testing", which is checking the circuits that come off of the manufacturing line.
4.2.2 The Difficulty of Designing Correct Chips

4.2.2.1 Notes from Kenn Heinrich (UW E&CE grad)

"Everyone should get a lecture on why their first industrial design won't work in the field." Here are a few reasons why getting a single system to work correctly for a few minutes in a university lab is much easier than getting thousands of systems to work correctly for months at a time in dozens of countries around the world.

1. You forgot to make your "unreachable" states transition to the initial (reset) state. Clock glitches, power surges, etc. will occasionally cause your system to jump to a state that isn't defined or produce an illegal data value. When this happens, your design should reset itself, rather than crash or generate illegal outputs.

2. You have internal registers that you can't access or test. If you can set a register, you must have some way of reading the register from outside the chip.

3. Another chip controls your chip, and the other chip is buggy. All of your external control lines should be able to be disabled, so that you can isolate the source of problems.

4. Not enough decoupling capacitors on your board. The analog world is cruel and unusual. Voltage spikes, current surges, crosstalk, etc. can all corrupt the integrity of digital signals. Trying to save a few cents on decoupling capacitors can cause headaches and significant financial costs in the future.

5. You only tested your system in the lab, not in the real world. As a product, systems will need to run for months in the field; simulation and simple lab testing won't catch all of the weirdness of the real world.

6. You didn't adequately test the corner cases and boundary conditions. Every corner case is as important as the main case. Even if some weird event happens only once every six months, if you do not handle it correctly, the bug can still make your system unusable and unsellable.

4.2.2.2 Notes from Aart de Geus (Chairman and CEO of Synopsys)

More than 60% of the ASIC designs that are fabricated have at least one error, issue, or problem whose severity forced the design to be reworked. Even experienced designers have difficulty building chips that function correctly on the first pass (Figure 4.1).
  • 288.
    266 CHAPTER 4. FUNCTIONAL VERIFICATION 61% of new chip designs require at least one re-spin At least one error/issue/problem (61%) Functional logic error (43%) Analog tuning issue (20%) Signal integrity issue (17%) Clock scheme error (14%) Reliability issue (12%) Mixed-signal problem (11%) Uses too much power (11%) Timing issue (slow paths) (10%) Timing issue (fast paths) (10%) IR drop issues (7%) Firmware error (4%) Other problem (3%) 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Source: Aart de Geus, Chairman and CEO of Synopsys. Keynote address. Synopsys Users’ Group Meeting, Sep 9 2003, Boston USA. Figure 4.1: Problems found on first-spins of new chip designs 4.3 Test Cases and Coverage 4.3.1 Test Terminology Test case / test vector : A combination of inputs and internal state values. Represents one possible test of the system. Boundary conditions / corner cases : A test case that represents an unusual situation on input and/or internal state signals. Corner cases are likely to contain bugs. Test scenario : A sequence of test vectors that, together, exercise a particular situation (scenario) on a circuit. For example, a scenario for an elevator controller might include a sequence of button pushes and movements between floors. Test suite : A collection of test vectors that are run on a circuit.
4.3.2 Coverage

To be absolutely certain that an implementation is correct, we must check every combination of values. This includes both input values and internal state (flip-flops). If we have ni bits of inputs and ns bits in flip-flops, we have to test 2^(ni+ns) different cases when doing functional verification.

Question: If we have nc combinational signals, why don't we have to test 2^(ni+ns+nc) different cases?

Answer: The value of each combinational signal is determined by the flip-flops and inputs in its fanin. Once the values of the inputs and flip-flops are known, the value of each combinational signal can be calculated. Thus, the combinational signals do not add additional cases that we need to consider.

Definition Coverage: The coverage that a suite of tests achieves on a circuit is the percentage of cases that are simulated by the tests. 100% coverage means that the circuit has been simulated for all combinations of values for input signals and internal signals.

Note: Coverage Terminology. There are many different types of coverage, which measure everything from the percentage of cases that are exercised to the number of output values that are exercised. There are many commercial software programs that measure code and other types of coverage.

    Company         Tool                       Coverage
    Cadence         Affirma Coverage Analyzer
    Cadence DAI     Coverscan                  code, expressions, fsm
    Cadence         Codecover                  code, expressions, fsm
    Fintronic       FinCov                     code
    Summit Design   HDLScore                   code, events, variables
    Synopsys        CoverMeter                 code coverage (dead?)
    TransEDA        Verification Navigator     code and fsm
    Verisity        SureCov                    code, block, values, fsm
    Veritools       Express VCT, VeriCover     code, branch
    Aldec           Riviera                    code, block
4.3.3 Floating Point Divider Example

This example illustrates the difficulty of achieving significant coverage on realistic circuits. Consider doing the functional simulation of a double-precision (64-bit) floating-point divider.

Given Information:

    Data width                                                    64 bits
    Number of gates in circuit                                    10 000
    Number of assembly-language instructions to simulate
      one gate for one test case                                  100
    Number of clock cycles to execute one assembly-language
      instruction on the computer running the simulation          0.5
    Clock speed of the computer running the simulation            1 GHz

Number of Cases ......................................................................

Question: How many cases must be considered?

Answer:

    item   bits   num values
    src1   64     2^64 = 1.8E+19
    src2   64     2^64 = 1.8E+19

    NumTestsTot = NumInputCases × NumStateCases
                = (2^64 × 2^64) × (2^0)
                = 3.4E+38 cases
Simulation Run Time

Question: How long will it take to simulate all of the different possible cases using a single computer?

Answer:

1. Calculate the number of seconds to simulate one test case:

       TestTime1:1 = 10 000 gates × 100 instrs/gate × 0.5 cycles/instr × 1E-9 secs/cycle
                   = 5E-4 secs

2. Number of tests per year:

       SecsPerYear = 60 secs/min × 60 mins/hour × 24 hours/day × 365.25 days/year
                   ≈ 3.15E+7 secs/year

       NumTests:1 = SecsPerYear / TestTime1:1
                  = 3.15E+7 secs / 5E-4 secs
                  ≈ 6.3E+10 cases/year

3. Number of years to test all cases:

       TestTimeTot = NumTestsTot / NumTests:1
                   = 3.4E+38 cases / 6.3E+10 cases/year
                   ≈ 5.4E+27 years

Coverage

Question: If you can run simulations non-stop for one year on ten computers, what coverage will you achieve?

Answer:
1. Number of tests per year using ten computers:

       NumTests:10 = 10 × NumTests:1 = 10 × 6.3E+10 ≈ 6.3E+11 cases

2. Calculate the coverage achieved by running tests on ten computers for one year:

       Covg = NumTestsRun / NumTestsTot
            = NumTests:10 / NumTestsTot
            = 6.3E+11 / 3.4E+38
            ≈ 1.9E-27
            = 1.9E-25 %

The message is that, even with large amounts of computing resources, it is difficult to achieve numerically significant coverage for realistic circuits. An effective functional verification plan requires carefully chosen test cases, so that even the minuscule amount of coverage that is realistically achievable catches most (all?!?!) of the bugs in the design.

Simulation vs the Real World

From "Validating the Intel(R) Pentium(R) 4 Microprocessor" by Bob Bentley, Design Automation Conference 2001. (Link on the E&CE 327 web page.)

• Simulating the Pentium 4 processor on a Pentium 3 class machine ran at about 15 Hz.
• By tapeout, over 200 billion simulation cycles had been run on a network of computers.
• All of these simulations together represent less than two minutes of running a real processor.

4.4 Testbenches

A testbench (also known as a "test rig", "test harness", or "test jig") is a collection of code used to simulate a circuit and check whether it works correctly. Testbenches are not synthesized, so you do not need to restrict yourself to the synthesizable subset of VHDL. Use the full power of VHDL to make your testbenches concise and powerful.
4.4.1 Overview of Test Benches

[Figure: a testbench contains the stimulus, the specification, the check, and an instance of the implementation.]

Implementation  Circuit that you're checking for bugs; also known as the "design under test" or "unit under test"
Stimulus        Generates test vectors
Specification   Describes the desired behaviour of the implementation
Check           Checks whether the implementation obeys the specification

Notes and observations

• Testbenches usually do not have any inputs or outputs.
  – Inputs are generated by the stimulus.
  – Outputs are analyzed by the check, and relevant information is printed using report statements.
• Different circuits will use different stimuli, specifications, and checks.
• The roles of the specification and check are somewhat flexible.
  – Most circuits will have complex specifications and simple checks.
  – However, some circuits will have simple specifications and complex checks.
• If two circuits are supposed to have the same behaviour, then they can use the same stimulus, specification, and check.
• If two circuits are supposed to have the same behaviour, then one can be used as the specification for the other.
• Testbenches are restricted to stimulating only primary inputs and observing only primary outputs. To check the behaviour of internal signals, use assertions.
4.4.2 Reference Model Style Testbench

[Figure: reference-model testbench; the stimulus drives both the specification (reference model) and the implementation, and the check compares their outputs.]

• The specification has the same inputs and outputs as the implementation.
• The specification is a clock-cycle accurate description of the desired behaviour of the implementation.
• The check is an equality test between the outputs of the specification and the implementation.

Examples

• Execution modules: the output is the sum, difference, product, quotient, etc. of the inputs
• DSP filters
• Instruction decoders

Note: "Functional specification" vs "Reference model"  The terms functional specification and reference model are often used interchangeably.

4.4.3 Relational Style Testbench

[Figure: relational testbench; the stimulus drives the implementation, and the check observes both the inputs and the outputs.]

• Relational testbenches, or relational specifications, are used when we do not want to specify the specific output values that the implementation must produce.
• Instead, we want to check that some relationship holds between the output and the input, or that some relationship holds amongst the output values (independent of the values of the input signals).
• The specification is usually just wires to feed the input signals to the check.
• The check is the brains and encodes the desired behaviour of the circuit.
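The two checking styles can be contrasted with a small software sketch (illustrative Python, using the and2 gate from Section 4.5 as a stand-in implementation):

```python
def and2_impl(a: int, b: int) -> int:
    """Stand-in for the implementation (the design under test)."""
    return 1 if (a == 1 and b == 1) else 0

def and2_spec(a: int, b: int) -> int:
    """Reference model: a clock-cycle accurate 'golden' description."""
    return a & b

def reference_model_check(a: int, b: int) -> bool:
    """Reference-model style: the check is an equality test between
    the outputs of the specification and the implementation."""
    return and2_impl(a, b) == and2_spec(a, b)

def relational_check(a: int, b: int) -> bool:
    """Relational style: only require that a relationship holds between
    inputs and output (here: the output must not be 1 if either input is 0)."""
    c = and2_impl(a, b)
    return not (c == 1 and (a == 0 or b == 0))

vectors = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(all(reference_model_check(a, b) for a, b in vectors))  # True
print(all(relational_check(a, b) for a, b in vectors))       # True
```

Note that this particular relational check is one-sided: an implementation that always outputs 0 would also pass it. That is the trade-off of checking a relationship instead of exact output values.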
Examples

• Carry-save adders: the two outputs sum to the sum of the three inputs, but we do not specify the exact value of each individual output.
• Arbiters: every request is eventually granted, but we do not specify the order in which requests are granted.
• One-hot encoding: exactly one bit of the vector is a '1', but we do not specify which bit is a '1'.

Note: "Relational specification" vs "relational testbench"  The terms relational specification and relational testbench are often used interchangeably.

4.4.4 Coding Structure of a Testbench

    architecture main of athabasca_tb is
      component declaration for implementation;
      other declarations
    begin
      implementation instantiation;
      stimulus process;
      specification process (or component instantiation);
      check process;
    end main;

4.4.5 Datapath vs Control

Datapath and control circuits tend to use different styles of testbenches.

Datapath circuits tend to be well suited to reference-model style testbenches:
• Each set of inputs generates one set of outputs.
• Each set of outputs is a function of just one set of inputs.

Control circuits often pose problems for testbenches:
• They have many more internal signals than outputs.
• The behaviour of the outputs provides a view into only a fragment of the current state of the circuit.
• It may take many clock cycles from when a bug is exercised inside the circuit until it generates a deviation from the correct behaviour on the outputs.
• When the deviation on the outputs is observed, it is very difficult to pinpoint the precise cause of the deviation (the root cause of the bug).

Assertions can be used to check the behaviour of internal signals. Control circuits tend to use assertions to check correctness and rely on testbenches only to stimulate inputs.
4.4.6 Verification Tips

Suggested order of simulation for functional verification:

1. Write a high-level model.
2. Simulate the high-level model until it has the correct functionality and latency.
3. Write a synthesizable model.
4. Use zero-delay simulation (uw-sim) to check the behaviour of the synthesizable model against the high-level model.
5. Optimize the synthesizable model.
6. Use zero-delay simulation (uw-sim) to check the behaviour of the optimized model against the high-level model.
7. Use timing simulation (uw-timsim) to check the behaviour of the optimized model against the high-level model.

Section 4.5 describes a series of testbenches that are particularly useful for debugging datapath circuits in the early phases of the design cycle.

4.5 Functional Verification for Datapath Circuits

In this section we will incrementally develop a testbench for a very simple circuit: an AND gate. Although the example circuit is trivial in size, the process scales well to very large circuits. The process allows verification to begin as soon as a circuit is simulatable, even before a complete specification has been written.

Implementation

    library ieee;
    use ieee.std_logic_1164.all;

    entity and2 is
      port (
        a, b : in  std_logic;
        c    : out std_logic
      );
    end and2;

    architecture main of and2 is
    begin
      c <= '1' when (a = '1' AND b = '1') else '0';
    end main;
4.5.1 A Spec-Less Testbench

(NOTE: this code has been reviewed manually but has not been simulated. The concepts are illustrated correctly, but there might be typographical errors in the code.)

First, use a waveform viewer to check that the implementation generates reasonable outputs for a small set of inputs.

    library ieee;
    use ieee.std_logic_1164.all;

    entity and2_tb is
    end and2_tb;

    architecture main_tb of and2_tb is
      component and2
        port (
          a, b : in  std_logic;
          c    : out std_logic
        );
      end component;
      signal ta, tb, tc_impl : std_logic;
      signal ok : boolean;
    begin
      ---------------------------------------------
      impl : and2 port map (a => ta, b => tb, c => tc_impl);
      ---------------------------------------------
      stimulus : process
      begin
        ta <= '0';  tb <= '0';
        wait for 10 ns;
        ta <= '1';  tb <= '1';
        wait for 10 ns;
      end process;
      ---------------------------------------------
    end main_tb;

Use the spec-less testbench until the implementation generates solid Boolean values (no X or U data) and you have checked that a few simple test cases generate correct outputs.
4.5.2 Use an Array for Test Vectors

Writing code to drive inputs and repetitively typing wait for 10 ns; can get tedious, so code up the test vectors in an array.

(NOTE: this code has not been checked for correctness)

    architecture main_tb of and2_tb is
      ...
    begin
      ...
      stimulus : process
        type test_datum_ty is record
          ra, rb : std_logic;
        end record;
        type test_vectors_ty is array(natural range <>) of test_datum_ty;
        constant test_vectors : test_vectors_ty :=
          --   a    b
          ( ( '0', '0'),
            ( '1', '1') );
      begin
        for i in test_vectors'low to test_vectors'high loop
          ta <= test_vectors(i).ra;
          tb <= test_vectors(i).rb;
          wait for 10 ns;
        end loop;
      end process;
    end main_tb;

Use this testbench until checking the correctness of the outputs by hand in the waveform viewer becomes difficult.
4.5.3 Build Spec into Stimulus

(NOTE: this code has not been checked for correctness)

After a few test vectors appear to be working correctly (via a manual check of the waveforms in simulation), begin automatically checking that the outputs are correct:

• Add the expected result to the stimulus.
• Add a check process.

    architecture main_tb of and2_tb is
      ...
    begin
      ------------------------------------------
      impl : and2 port map (a => ta, b => tb, c => tc_impl);
      ------------------------------------------
      stimulus : process
        type test_datum_ty is record
          ra, rb, rc : std_logic;
        end record;
        type test_vectors_ty is array(natural range <>) of test_datum_ty;
        constant test_vectors : test_vectors_ty :=
          -- a, b : inputs
          -- c    : expected output
          --   a    b    c
          ( ( '0', '0', '0'),
            ( '0', '1', '0'),
            ( '1', '1', '1') );
      begin
        for i in test_vectors'low to test_vectors'high loop
          ta      <= test_vectors(i).ra;
          tb      <= test_vectors(i).rb;
          tc_spec <= test_vectors(i).rc;
          wait for 10 ns;
        end loop;
      end process;
      ------------------------------------------
      check : process (tc_impl, tc_spec)
      begin
        ok <= (tc_impl = tc_spec);
      end process;
      ------------------------------------------
    end main_tb;

Use this testbench until it becomes tedious to calculate manually the correct result for each test case.
4.5.4 Have a Separate Specification Entity

Rather than write the specification as part of the stimulus, create a separate specification entity/architecture. The specification component then calculates the expected output values.

(NOTE: if your simulation tool supports configurations, the spec and impl can share the same entity; we'll see this in Section 4.6)
    entity and2_spec is
      ...(same as and2 entity)...
    end and2_spec;

    architecture spec of and2_spec is
    begin
      c <= a AND b;
    end spec;

    architecture main_tb of and2_tb is
      component and2 ...;
      component and2_spec ...;
      signal ta, tb, tc_impl, tc_spec : std_logic;
      signal ok : boolean;
    begin
      ------------------------------------------
      impl : and2      port map (a => ta, b => tb, c => tc_impl);
      spec : and2_spec port map (a => ta, b => tb, c => tc_spec);
      ------------------------------------------
      stimulus : process
        type test_datum_ty is record
          ra, rb : std_logic;
        end record;
        type test_vectors_ty is array(natural range <>) of test_datum_ty;
        constant test_vectors : test_vectors_ty :=
          --   a    b
          ( ( '0', '0'),
            ( '1', '1') );
      begin
        for i in test_vectors'low to test_vectors'high loop
          ta <= test_vectors(i).ra;
          tb <= test_vectors(i).rb;
          wait for 10 ns;
        end loop;
      end process;
      ------------------------------------------
      check : process (tc_impl, tc_spec)
      begin
        ok <= (tc_impl = tc_spec);
      end process;
      ------------------------------------------
    end main_tb;
4.5.5 Generate Test Vectors Automatically

When it becomes tedious to write out each test vector by hand, we can automatically compute them. This example uses a pair of nested for loops to generate all four combinations of input values for two signals.

    architecture main_tb of and2_tb is
      ...
    begin
      ...
      stimulus : process
        subtype std_test_ty is std_ulogic range '0' to '1';
      begin
        for va in std_test_ty loop
          for vb in std_test_ty loop
            ta <= va;
            tb <= vb;
            wait for 10 ns;
          end loop;
        end loop;
      end process;
      ...
    end main_tb;

4.5.6 Relational Specification

    architecture main_tb of and2_tb is
      ...
    begin
      ------------------------------------------
      impl : and2 port map (a => ta, b => tb, c => tc_impl);
      ------------------------------------------
      stimulus : process
        ...
      end process;
      ------------------------------------------
      check : process (tc_impl, ta, tb)
      begin
        ok <= NOT (tc_impl = '1' AND (ta = '0' OR tb = '0'));
      end process;
      ------------------------------------------
    end main_tb;
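The exhaustive enumeration in Section 4.5.5 can be sketched in a few lines of Python (an illustration of the idea, not part of the VHDL flow):

```python
from itertools import product

# Enumerate every combination of test values for n signals, the software
# analogue of the nested VHDL loops above.
def all_vectors(n_signals, values=('0', '1')):
    return list(product(values, repeat=n_signals))

print(all_vectors(2))
# [('0', '0'), ('0', '1'), ('1', '0'), ('1', '1')]
```

The number of vectors grows as len(values)^n_signals, which is exactly the 2^(n_i + n_s) blow-up discussed in Section 4.3.2.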
4.6 Functional Verification of Control Circuits

Control circuits are often more challenging to verify than datapath circuits.

• Control circuits have many internal signals. Testbenches are unable to access key information about the behaviour of a control circuit.
• Many clock cycles can elapse between when a bug causes an internal signal to have an incorrect value and when an output signal shows the effect of the bug.

In this section, we will explore the functional verification of state machines via a First-In First-Out queue. The VHDL code for the queue is on the web at:
http://www.ece.uwaterloo.ca/~ece327/exs/queue

4.6.1 Overview of Queues in Hardware

[Figure 4.2: Structure of queue; a memory with a write port and a read port.]
[Figure 4.3: Write sequence; writing a datum A into an empty queue.]
[Figure 4.4: A second example write sequence.]
[Figure 4.5: Example read sequence.]
[Figure 4.6: Write illustrating index wrap.]
[Figure 4.7: Write illustrating a full queue.]
[Figure 4.8: Queue signals: do_rd, do_wr, data_rd, data_wr, rd_idx, wr_idx, empty.]
[Figure 4.9: Incomplete queue blocks: the memory (WE, A0/DI0 write port, A1/DO1 read port) with the index registers; control circuitry not shown.]
4.6.2 VHDL Coding

4.6.2.1 Package

Things to notice in the queue package:

1. separation of package and body

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    package queue_pkg is
      subtype data is std_logic_vector(3 downto 0);
      function to_data(i : integer) return data;
    end queue_pkg;

    package body queue_pkg is
      function to_data(i : integer) return data is
      begin
        return std_logic_vector(to_unsigned(i, 4));
      end to_data;
    end queue_pkg;

4.6.2.2 Other VHDL Coding

VHDL coding techniques to notice in the queue implementation:

1. type declaration for vectors
2. attributes
   (a) 'low, 'high, 'length
3. functions (reduce overall implementation and maintenance effort)
   (a) reduce redundant code
   (b) hide implementation details
   (c) (just like software engineering...)

4.6.3 Code Structure for Verification

Verification things to notice in the queue implementation:

1. instrumentation code
2. coverage monitors
3. assertions
    architecture ... is
      ...
    begin
      ... normal implementation ...
      process (clk) begin
        if rising_edge(clk) then
          ... instrumentation code ...
          prev_signame <= signame;
        end if;
      end process;
      ... assertions ...
      ... coverage monitors ...
    end;

4.6.4 Instrumentation Code

• Added to the implementation to support verification
• Usually keeps track of previous values of signals
• Does not create hardware (it is optimized away during synthesis)
• Does not feed any output signals
• Must use the synthesizable subset of VHDL

    process (clk) begin
      if rising_edge(clk) then
        prev_rd_idx <= rd_idx;
        prev_wr_idx <= wr_idx;
        prev_do_rd  <= do_rd;
        prev_do_wr  <= do_wr;
      end if;
    end process;

Note: Naming convention for instrumentation  For assertions, signals are named prev_signame and signame, rather than next_signame and signame as is done for state machines. This is because for assertions we use the prev signals as history signals, to keep track of past events. In contrast, for state machines, we name the signals next, because the state machine computes the next values of the signals.

4.6.5 Coverage Monitors

The goal of a coverage monitor is to check whether a certain event is exercised in a simulation run. If a test suite does not trigger a coverage monitor, then we probably want to add a test vector that will trigger the monitor.
For example, for a circuit used in a microwave oven controller, we might want to make sure that we simulate the situation where the door is opened while the power is on.

1. Identify important events, conditions, and transitions.
2. Write instrumentation code to detect each event.
3. Use report to print a message when the event happens.
4. When you run a simulation, the report statements will print when a coverage condition is detected.
5. Pipe the simulation results to a log file.
6. Examine the log file and the coverage monitors to find cases and transitions not tested by the existing test vectors.
7. Add test vectors to exercise the missing cases.
8. Idea: automate the detection of missing cases using a Perl script to find coverage messages in the VHDL code that aren't in the log file.
9. Real world: most commercial synthesis tools come with add-on packages that provide different types of coverage analysis.
10. Research/entrepreneurial idea: based on the missing coverage cases, find new test vectors to exercise those cases.

Coverage Events for Queue

[Diagrams: pairs of previous/current queue snapshots showing the read and write indices moving relative to each other.]
Question: What events should we monitor to estimate the coverage of our functional tests?

Answer:
• wr_idx and rd_idx are far apart
• wr_idx and rd_idx are equal
• wr_idx catches rd_idx
• rd_idx catches wr_idx
• rd_idx wraps
• wr_idx wraps

Coverage Monitor Template

    process (signals read) begin
      if (condition) then
        report "coverage: message";
      elsif (condition) then
        report "coverage: message";
      else
        report "error: case fall through on message"
          severity warning;
      end if;
    end process;

Coverage Monitor Code

Events related to rd_idx equals wr_idx:
    process (prev_rd_idx, prev_wr_idx, rd_idx, wr_idx) begin
      if (rd_idx = wr_idx) then
        if ( prev_rd_idx = prev_wr_idx ) then
          report "coverage: read = write both moved";
        elsif ( rd_idx /= prev_rd_idx ) then
          report "coverage: Read caught write";
        elsif ( wr_idx /= prev_wr_idx ) then
          report "coverage: Write caught read";
        else
          report "error: case fall through on rd/wr catching"
            severity warning;
        end if;
      end if;
    end process;

Events related to rd_idx wrapping:

    process (rd_idx) begin
      if (rd_idx = low_idx) then
        report "coverage: rd mv to low";
      elsif (rd_idx = high_idx) then
        report "coverage: rd mv to high";
      else
        report "coverage: rd mv normal";
      end if;
    end process;

4.6.6 Assertions

Assertions for Queue

1. If rd_idx changes, then it increments or wraps.
2. If rd_idx changes, then do_rd was '1', or reset is '1'.
3. If wr_idx changes, then it increments or wraps.
4. If wr_idx changes, then do_wr was '1', or reset is '1'.
5. And many others...

Assertion Template

    process (signals read) begin
      assert (required condition)
        report "error: message"
        severity warning;
    end process;
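In software terms, an assertion of this shape is just a predicate over the current and previous values of a signal. A Python sketch of assertion 1 above (rd_idx increments or wraps), with hypothetical index bounds; this variant is slightly stricter than the VHDL in these notes, which only checks that the index grew or wrapped:

```python
LOW_IDX, HIGH_IDX = 0, 7   # hypothetical bounds for an 8-entry queue

def rd_idx_ok(prev_rd_idx: int, rd_idx: int) -> bool:
    """Assertion 1: if rd_idx changes, it increments by one or wraps to LOW_IDX."""
    if rd_idx == prev_rd_idx:
        return True                      # no change: nothing to check
    return (rd_idx == prev_rd_idx + 1) or \
           (prev_rd_idx == HIGH_IDX and rd_idx == LOW_IDX)

print(rd_idx_ok(3, 4))   # True  (increment)
print(rd_idx_ok(7, 0))   # True  (wrap)
print(rd_idx_ok(3, 5))   # False (skipped an entry)
```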
Assertions: Read Index

    process (rd_idx) begin
      assert ((rd_idx > prev_rd_idx) or (rd_idx = low_idx))
        report "error: rd inc" severity warning;
      assert ((prev_do_rd = '1') or (reset = '1'))
        report "error: rd imp do_rd" severity warning;
    end process;

Assertions: Write Index

    process (wr_idx) begin
      assert ((wr_idx > prev_wr_idx) or (wr_idx = low_idx))
        report "error: wr inc" severity warning;
      assert ((prev_do_wr = '1') or (reset = '1'))
        report "error: wr imp do_wr" severity warning;
    end process;

4.6.7 VHDL Coding Tips

Vector Type Declaration

    type data_array_ty is array(natural range <>) of data;
    signal data_array : data_array_ty(7 downto 0);

Functions

    function to_idx (i : natural range data_array'low to data_array'high)
      return idx_ty is
    begin
      return to_unsigned(i, idx_ty'length);
    end to_idx;

Conversion to Index

    Without Function                 With Function
    rd_idx <= to_unsigned(5, 3);     rd_idx <= to_idx(5);

The function code is verbose, but it is very maintainable, because neither the function itself nor the uses of the function need to know the width of the index vector.
Attributes

    function inc_idx (idx : idx_ty) return idx_ty is
    begin
      if idx < data_array'high then
        return (idx + 1);
      else
        return (to_idx(data_array'low));
      end if;
    end inc_idx;

Feedback Loops and Functions

Coding guideline: use functions; don't use procedures.

    inc as function                   inc as procedure
    wr_idx <= inc_idx(wr_idx);        inc_idx(wr_idx);

Functions clearly distinguish between reading from a signal and writing to a signal. By examining a use of a procedure, you cannot tell which signals are read from and which are written to; you must examine the declaration or implementation of the procedure to determine the modes of its signals. Modifying a signal within a procedure results in a tri-state signal. This is bad.

File I/O (textio package)

TEXTIO defines the read, write, readline, and writeline procedures. Described in:

• http://www.eng.auburn.edu/department/ee/mgc/vhdl.html#textio

These procedures can be used to read test vectors from a file and write results to a file.

4.6.8 Queue Specification

Most bugs in queues are related to the queue becoming full, becoming empty, and/or the wrap of the indices.

The specification should be "obviously correct". Avoid bugs in the specification by making the specification queue larger than the maximum number of writes that we will do in the test suite. Thus, the specification queue will never become full or wrap. However, the implementation queue will become full and wrap.
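The "obviously correct" specification style can be sketched in software terms (an illustrative Python sketch with a hypothetical 4-entry implementation): the spec queue is an unbounded list that never becomes full or wraps, while the implementation is a ring buffer that does.

```python
class SpecQueue:
    """Obviously-correct reference model: unbounded, never full, never wraps."""
    def __init__(self):
        self.items = []
    def write(self, x):
        self.items.append(x)
    def read(self):
        return self.items.pop(0)

class ImplQueue:
    """Ring-buffer model of the implementation: indices wrap, can become full."""
    def __init__(self, depth=4):
        self.mem = [None] * depth
        self.rd_idx = self.wr_idx = 0
        self.depth = depth
    def write(self, x):
        self.mem[self.wr_idx] = x
        self.wr_idx = (self.wr_idx + 1) % self.depth   # index wraps
    def read(self):
        x = self.mem[self.rd_idx]
        self.rd_idx = (self.rd_idx + 1) % self.depth   # index wraps
        return x

spec, impl = SpecQueue(), ImplQueue()
for v in range(6):                        # more writes than depth: impl wraps
    spec.write(v)
    impl.write(v)
    assert spec.read() == impl.read()     # reference-model style check
print("match")
```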
Write Index Update in Specification

We increment the write index on every write; we never wrap.

    process (clk) begin
      if rising_edge(clk) then
        if (reset = '1') then
          wr_idx <= 0;
        elsif (do_wr = '1') then
          wr_idx <= wr_idx + 1;
        end if;
      end if;
    end process;

Things to Notice

Things to notice in the queue specification:

1. don't care conditions ('-')
2. uninitialized data (hint: what is the value of rd_data when we do more reads than writes?)

Don't Care

    rd_data <= data_array(rd_idx) when (do_rd = '1') else
               (others => '-');

4.6.9 Queue Testbench

Things to notice in the queue testbench:

1. running multiple test sequences
2. uninitialized data 'U'
3. std_match to compare spec and impl data

    '0'             ~  '0', 'L'
    '1'             ~  '1', 'H'
    '-'             ~  everything
    everything else ~  nothing

With ordinary equality, '-' is just another enumeration value ('-' = '1' is false), but we want '-' to mean "don't care" in the specification. The solution is to use std_match, rather than =, to check implementation signals against the specification.
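The matching relation can be sketched in Python (an approximation of the single-bit behaviour of std_match from IEEE numeric_std; the real function also operates on vectors):

```python
def std_match_bit(a: str, b: str) -> bool:
    """Sketch of std_match for one std_logic value: '-' matches anything,
    'L'/'H' are weak '0'/'1', and metalogical values ('U','X','W','Z')
    match nothing, not even themselves."""
    groups = {'0': '0L', 'L': '0L', '1': '1H', 'H': '1H'}
    if a == '-' or b == '-':
        return True
    return a in groups and groups[a] == groups.get(b)

# '-' = '1' is false with ordinary equality, but std_match treats it as a match:
print(std_match_bit('-', '1'))   # True
print(std_match_bit('0', 'L'))   # True
print(std_match_bit('X', 'X'))   # False
```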
Stimulus Process Structure

The stimulus process runs multiple test vectors in a single simulation run.

    stimulus : process
      type test_datum_ty is record
        r_reset,
        ... normal fields ...
      end record;
      type test_vectors_ty is array(natural range <>) of test_datum_ty;
      constant test_vectors : test_vectors_ty :=
        -- reset  ... other signals ...
        ( ( '1',  normal fields),   -- test case 1
          ( '0',  normal fields),
          ...
          ( '1',  normal fields),   -- test case 2
          ( '0',  normal fields),
          ... );
    begin
      for i in test_vectors'range loop
        if (test_vectors(i).r_reset = '1') then
          ... reset code ...
        end if;
        reset <= '0';
        ... normal sequence ...
        wait until rising_edge(clk);
      end loop;
    end process;

After reset is asserted, set the signals to 'U'.

4.7 Example: Microwave Oven

This question concerns the VHDL code microwave, which controls a simple microwave oven; the properties prop1...prop3; and two proposed changes to the VHDL code.

INSTRUCTIONS:

1. Assume that the code as currently written is correct: any change to the code that causes a change to the behaviour of the signals heat or count is a bug.
2. For each of the two proposed code changes, answer whether the code change will cause a bug.
3. If the code change will cause a bug, provide a test case that will exercise the bug and identify all of the given properties (prop1, prop2, and prop3) that will detect the bug with the test case you provide.
4. If none of the three properties can detect the bug, provide a property of your own that will detect the bug with the test case you provide.

Question: For each of the three properties prop1...prop3, answer whether the property is best checked as part of a testbench or as an assertion. For each property, justify why a testbench or an assertion is the best method to validate that property.

prop1 If start is pushed and the door is closed, then heat remains on for exactly the time specified by the timer when start was pushed, assuming reset remains false and the door remains closed.

Answer: Testbench. All relevant signals are primary inputs or outputs, so we can check the property without seeing internal signals. Testbenches are only able to set and observe primary inputs and outputs.

prop2 If the door is open, then heat is off.

Answer: Testbench; same as the previous property.

prop3 If start is not pushed, reset is false, and count is greater than zero, then count is decremented.

Answer: Assertion. To see count, we need access to internal signals.

    entity microwave is
      port (
        timer                    -- time input from user
          : in unsigned(7 downto 0);
        reset,                   -- resets microwave
        clk,                     -- clock signal input
        is_open,                 -- detects when door is open
        start                    -- start button input from user
          : in std_logic;
        heat : out std_logic     -- 1=on, 0=off
      );
    end microwave;

    architecture main of microwave is
      signal count  : unsigned(7 downto 0);  -- internal time count
      signal x_heat : std_logic;
    begin
      -- heat process ------------------------------
      process (clk) begin
        if rising_edge(clk) then
          if reset = '1' then
            x_heat <= '0';
          elsif (is_open = '0') and (start = '1') and  -- region of
                (timer > 0)                            -- change #2
          then                                         --
            x_heat <= '1';                             --
          elsif (is_open = '0') and (count > 0) then   --
            x_heat <= x_heat;                          --
          else
            x_heat <= '0';
          end if;
        end if;
      end process;

      -- count process ------------------------------
      process (clk) begin
        if rising_edge(clk) then
          if (reset = '1') then
            count <= to_unsigned(0, 8);
          elsif (start = '1') then    -- region of
            count <= timer;           -- change #1
          elsif (count > 0) then      --
            count <= count - 1;       --
          end if;
        end if;
      end process;

      heat <= x_heat;
    end main;

Properties

prop1 If start is pushed and the door is closed, then heat remains on for exactly the time specified by the timer when start was pushed, assuming reset remains false and the door remains closed.

prop2 If the door is open, then heat is off.

prop3 If start is not pushed, reset is false, and count is greater than zero, then count is decremented.

Change #1
From:

    elsif (start = '1') then
      count <= timer;
    elsif (count > 0) then
      count <= count - 1;

To:

    elsif (count > 0) then
      count <= count - 1;
    elsif (start = '1') then
      count <= timer;

Answer: The change introduces a bug that is caught by properties 1 and 3.

Test Cases

testcase1 Maintain reset=0. Close the door, set the timer to some value (v1), and then push start. Leave the door closed. While the microwave is on, set the timer to a value (v2) that is greater than v1 and then push start. In the old code, the new value on the timer will be read in. In the new code, the new value on the timer will be ignored. The reason to make v2 greater than v1 is to prevent the counter from being exactly equal to v2 when start is pushed a second time; in that case, the bug would not be exercised. Note: the old code violated prop1.

testcase2 reset = 0, microwave off, door closed, count = 0. Set the timer to a non-zero value. Press and hold start for a number of cycles. In the original code, the value of timer would be reloaded into count on each rising edge of the clock. With the change, the value of count continues to decrement and the timer is not reloaded into count. Note: in this case, only prop1 will detect the bug. Prop3 will not detect the bug because the antecedent (precondition) of the property is false.

Change #2

From:

    elsif (is_open = '0') and (start = '1') and (timer > 0) then
      x_heat <= '1';
    elsif (is_open = '0') and (count > 0) then
      x_heat <= x_heat;

To:

    elsif (is_open = '0')
      and ((start = '1') or (count > 0)) then
      x_heat <= '1';
    else
      x_heat <= '0';
Answer: The change introduces a bug that would be caught by prop1, but not by prop2 or prop3. The following scenario or test case will catch the bug with prop1.

Maintain reset=0. The microwave is off, the door is closed, and the timer is set to 0. Push start. With the old code, the microwave will remain off. With the new code, the microwave will turn on and remain on as long as start is pushed.

The change to the code exercises another bug that is not caught by prop1. This bug demonstrates a weakness in prop1 that should be remedied. Testcase: reset = 0, microwave off, door closed. Set the timer to a non-zero value. Press (and release) start. Before the timer expires, open the door. Close the door before count = 0. In the original code, the microwave will remain off, but with the change, the microwave will start again. Note: the same properties detect the bug as with the original solution.

The weakness in prop1 is that it assumes that the door remains closed, so any testcase where the door is opened will pass prop1. In verification, this is known as the "false implies anything" problem: a testcase can pass a property "vacuously". To catch this bug, we must either change prop1 or add another property. In fact, we probably should do both.

First, we strengthen prop1 to deal with situations where the door is opened while the microwave is on. The property gets a bit complicated: "If start is pushed and the door is closed, then heat remains on until the earlier of either the opening of the door or the expiration of the time specified by the timer when start was pushed, assuming reset remains false."

Second, we add a property to ensure that the microwave does not turn back on when the door is re-closed with time remaining on the counter: "If the microwave is off, it remains off until start is pushed."

This fourth property is written to be as general as possible. We want to write properties that catch as many bugs as possible, rather than write properties for specific testcases or bugs.
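The scenarios above can be checked mechanically against a software model of the two clocked processes (an illustrative Python sketch of the original, correct code; the names follow the VHDL):

```python
def microwave_step(heat, count, timer, reset, is_open, start):
    """One rising clock edge of the original (correct) microwave code.
    Returns the next values of (heat, count)."""
    # heat process
    if reset:
        n_heat = 0
    elif (not is_open) and start and timer > 0:
        n_heat = 1
    elif (not is_open) and count > 0:
        n_heat = heat
    else:
        n_heat = 0
    # count process
    if reset:
        n_count = 0
    elif start:
        n_count = timer
    elif count > 0:
        n_count = count - 1
    else:
        n_count = count
    return n_heat, n_count

# prop2: if the door is open, heat is off
h, c = microwave_step(heat=1, count=3, timer=5,
                      reset=False, is_open=True, start=False)
print(h)   # 0

# prop3: start not pushed, reset false, count > 0  =>  count decrements
h, c = microwave_step(heat=1, count=3, timer=5,
                      reset=False, is_open=False, start=False)
print(c)   # 2
```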
Coverage ............................................................................

Question: If the msb of src1 is '1', and the lsb of src2 is '0' or sum(3) is '1', then the result is wrong. What is the minimum coverage needed to detect the bug? What is the minimum coverage needed to guarantee that the bug will be detected?
CHAPTER 4. FUNCTIONAL VERIFICATION

4.8 Functional Verification Problems

P4.1 Carry Save Adder

1. Functionality: Briefly describe the functionality of a carry-save adder.

2. Testbench: Write a testbench for a 16-bit combinational carry-save adder.

3. Testbench Maintenance: Modify your testbench so that it is easy to change the width of the adder and the latency of the computation.

   NOTES:
   (a) You do not need to support pipelined adders.
   (b) VHDL "generics" might be useful.

P4.2 Traffic Light Controller

P4.2.1 Functionality: Briefly describe the functionality of a traffic-light controller that has sensors to detect the presence of cars.

P4.2.2 Boundary Conditions: Make a list of boundary conditions to check for your traffic-light controller.

P4.2.3 Assertions: Make a list of assertions to check for your traffic-light controller.
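As background for P4.1: a carry-save adder compresses three operands into a sum word and a carry word using one full adder per bit position, with no carry propagation between positions. A minimal Python bit-level model (the function name and default width are our own, offered only as a sketch):

```python
# Bit-level model of an n-bit carry-save adder: compresses three operands
# (a, b, c) into a sum word and a carry word, one full adder per bit.
# No carries ripple between bit positions, so the delay is one full adder
# regardless of width.

def carry_save_add(a, b, c, width=16):
    sum_word, carry_word = 0, 0
    for i in range(width):
        bits = ((a >> i) & 1) + ((b >> i) & 1) + ((c >> i) & 1)
        sum_word |= (bits & 1) << i                  # parity of the three bits
        carry_word |= ((bits >> 1) & 1) << (i + 1)   # majority, shifted left
    return sum_word, carry_word

# The pair (sum, carry) together represents a + b + c:
s, cy = carry_save_add(1000, 2000, 3000)
assert s + cy == 1000 + 2000 + 3000
```

A testbench for P4.1 can use exactly this invariant as its check: for random operands, the DUT's sum and carry outputs must add up to a + b + c.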
P4.3 State Machines and Verification

P4.3.1 Three Different State Machines

[Figure 4.10: A very simple machine — states s0, s1, s2]
[Figure 4.11: A very big machine — states s0 through s9]
[Figure 4.12: A concurrent machine — two interacting machines, states s0–s2 and q0–q4]
[Figure 4.13: Legend — edges are labelled input/output; * = don't care]

Answer each of the following questions for the three state machines in figures 4.10–4.12.

Number of Test Scenarios: How many "test scenarios" (sequences of test vectors) would you need to fully validate the behaviour of the state machine?

Length of Test Scenario: What is the maximum length (number of test vectors) in a test scenario for the state machine?
Number of Flip-Flops: Assuming that neither the inputs nor the outputs are registered, what is the minimum number of flip-flops needed to implement the state machine?

P4.3.2 State Machines in General

If a circuit has i 1-bit signals that are inputs, f 1-bit signals that are outputs of flip-flops, and c 1-bit signals that are outputs of combinational circuitry, what is the maximum number of states that the circuit can have?

P4.4 Test Plan Creation

You're on the functional verification team for a chip that will control a simple portable CD player. Your task is to create a plan for the functional verification of the signals in the entity cd_digital. You've been told that the player behaves "just like all of the other CD players out there". If your test plan requires knowledge about any potential non-standard features or behaviour, you'll need to document your assumptions.

[Figure: front panel — display fields track, min, sec; buttons prev, stop, play, next, pwr]

entity cd_digital is
  port (
    ----------------------------------------------------
    -- buttons
    prev, stop, play, next, pwr : in std_logic;
    ----------------------------------------------------
    -- detect if player door is open
    open : in std_logic;
    ----------------------------------------------------
    -- output display information
    track : out std_logic_vector(3 downto 0);
    min   : out unsigned(6 downto 0);
    sec   : out unsigned(5 downto 0)
  );
end cd_digital;
P4.4.1 Early Tests

Describe five tests that you would run as soon as the VHDL code is simulatable. For each test, describe your specification, stimulus, and check. Summarize why your collection of tests should be the first tests that are run.

P4.4.2 Corner Cases

Describe five corner cases or boundary conditions, and explain the role of corner cases and boundary conditions in functional verification.

NOTES:
1. You may reference your answer for problem P4.4.1 in this question.
2. If you do not know what a "corner case" or "boundary condition" is, you may earn partial credit by checking this box and explaining five things that you would do in functional verification.

P4.5 Sketches of Problems

1. Given a circuit, VHDL code, or circuit size information: calculate the simulation run time needed to achieve n% coverage.
2. Given a fragment of VHDL code, list things to do to make it more robust — e.g., illegal data and states go to the initial state.
3. Smith Problem 13.29
Chapter 5

Timing Analysis

5.1 Delays and Definitions

In this section we will look at the different timing parameters of circuits. Our focus will be on those parameters that limit the maximum clock speed at which a circuit will work correctly.

5.1.1 Background Definitions

Definition fanin: The fanin of a gate or signal x is the set of all gates or signals y where an input of x is connected to an output of y.

Definition fanout: The fanout of a gate or signal x is the set of all gates or signals y where an output of x is connected to an input of y.

[Figure 5.1: Immediate Fanin of x]
[Figure 5.2: Immediate Fanout of x]
Definition immediate fanin/fanout: The phrases immediate fanin and immediate fanout mean that there is a direct connection between the gates.

[Figure 5.3: Transitive Fanin]
[Figure 5.4: Transitive Fanout]

Definition transitive fanin/fanout: The phrases transitive fanin and transitive fanout mean that there is either a direct or an indirect connection between the gates.

Note: "Immediate" vs "Transitive" fanin and fanout. Be careful to distinguish between immediate fan(in/out) and transitive fan(in/out). If "fanin" or "fanout" is not qualified with "immediate" or "transitive", be sure to determine whether "immediate" or "transitive" is meant. In E&CE 327, "fan(in/out)" will mean "immediate fan(in/out)".

5.1.2 Clock-Related Timing Definitions

5.1.2.1 Clock Skew

[Figure: clock tree clk1–clk4 and waveforms showing skew between arrivals of the same clock edge]

Definition Clock Skew: The difference in arrival times for the same clock edge at different flip-flops.
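The fanin definitions above translate directly into a graph traversal. A small Python sketch, where the netlist representation (a dict mapping each gate to the gates driving its inputs) and the gate names are our own illustration:

```python
# Immediate and transitive fanin of a netlist. The netlist is a dict
# mapping each gate to the list of gates driving its inputs; the gate
# names here are made up for illustration.

def immediate_fanin(netlist, x):
    """Gates directly connected to an input of x."""
    return set(netlist.get(x, ()))

def transitive_fanin(netlist, x):
    """All gates with a direct or indirect path to an input of x."""
    seen, stack = set(), list(netlist.get(x, ()))
    while stack:
        y = stack.pop()
        if y not in seen:
            seen.add(y)
            stack.extend(netlist.get(y, ()))
    return seen

netlist = {"x": ["g1", "g2"], "g1": ["g3"], "g2": [], "g3": []}
assert immediate_fanin(netlist, "x") == {"g1", "g2"}
assert transitive_fanin(netlist, "x") == {"g1", "g2", "g3"}
```

Transitive fanout is the same traversal on the reversed graph.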
Clock skew is caused by the difference in interconnect delays to different points on the chip. Clock-tree design is critical in high-performance designs to minimize clock skew. Sophisticated synthesis tools put a great deal of effort into clock-tree design, and the techniques for clock-tree design still generate PhD theses.

5.1.2.2 Clock Latency

[Figure: clock tree and waveforms for the master clock, intermediate clock, and final clock, showing the latency between levels]

Definition Clock Latency: The difference in arrival times for the same clock edge at different levels of interconnect along the clock tree. (Intuitively: at "different points in the clock generation circuitry.")

Note: Clock latency does not affect the limit on the minimum clock period.

5.1.2.3 Clock Jitter

[Figure: waveforms of an ideal clock and a clock with jitter]

Definition Clock Jitter: The difference between the actual clock period and the ideal clock period.

Clock jitter is caused by:
• temperature and voltage variations over time
• temperature and voltage variations across different locations on a chip
• manufacturing variations between different parts

5.1.3 Storage-Related Timing Definitions

Storage devices (latches, flip-flops, memory arrays, etc.) define setup, hold, and clock-to-Q times.

5.1.3.1 Flops and Latches

[Figure: waveforms contrasting flop behaviour (edge sensitive) with latch behaviour (level sensitive)]

Storage devices have two modes: load mode and store mode. Flops are edge sensitive: either rising edge or falling edge. An ideal flop is in load mode only for the instant just before the clock edge. In reality, flops are in load mode for a small window on either side of the edge. Latches are level sensitive: either active high or active low. A latch is in load mode when its enable signal is at the active level.

Timing Parameters ....................................................................

[Figure: setup, hold, and clock-to-Q windows for a flip-flop, an active-high latch, and an active-low latch]

Setup and hold define the window in which input data are required to be constant in order to guarantee that the storage device will store the data correctly. Setup defines the beginning of the window; hold defines the end of the window. Setup and hold timing constraints ensure that, when the storage device transitions from load mode to store mode, the input data is stored correctly in the storage device. Thus, the setup and hold timing constraints come into play when the storage device transitions from load mode to store mode. Setup is assumed to happen before the clock edge and
hold is assumed to happen after the edge. If the end of the time window occurs before the clock edge, then the hold constraint is negative. Clock-to-Q defines the delay from the clock edge to when the output is guaranteed to be stable.

Note: Require / Guarantee. Setup and hold times are requirements that the storage device imposes upon its environment. Clock-to-Q is a guarantee that the storage device provides to its environment. If the environment satisfies the setup and hold times, then the storage device guarantees that it will satisfy the clock-to-Q time.

In this section, we will use the definitions of setup, hold, and clock-to-Q. Section 5.2 will show how to calculate setup, hold, and clock-to-Q times for flip-flops, latches, and other storage devices.

5.1.3.2 Timing Parameters for a Flop

Setup Time ...........................................................................

Definition Setup Time (T_SUD): The latest time before the arrival of the clock edge (flip-flop), or the deasserting of the enable line (latch), that the input data is required to be stable in order for the storage device to work correctly. If the setup time is violated, the current input data will not be stored; the input data from the previous clock cycle might remain stored.

5.1.3.3 Hold Time

Definition Hold Time (T_HO): The latest time after the arrival of the clock edge (flip-flop), or the deasserting of the enable line (latch), that the input data is required to remain stable in order for the storage device to work correctly. If the hold time is violated, the current input data will not be stored; the input data from the next clock cycle might slip through and be stored.

5.1.3.4 Clock-to-Q Time

Definition Clock-to-Q Time (T_CO): The earliest time after the arrival of the clock edge (flip-flop), or the asserting of the enable line (latch), when the output data is guaranteed to be stable.
Review: Timing Parameters ............................................................

Setup: time before the arrival of the clock edge (flip-flop), or the deasserting of the enable line (latch), that the input data is required to start being stable.

Hold: time after the arrival of the clock edge (flip-flop), or the deasserting of the enable line (latch), that the input data is required to remain stable.

Clock-to-Q: time after the arrival of the clock edge (flip-flop), or the asserting of the enable line (latch), when the output data is guaranteed to start being stable.

5.1.4 Propagation Delays

Propagation delay is the time it takes a signal to travel from the source (driving) flop to the destination flop. The two factors that contribute to propagation delay are the load of the combinational gates between the flops and the delay along the interconnect (wires) between the gates.

5.1.4.1 Load Delays

Load delay is proportional to load capacitance.

[Figure: timing of a simple inverter with a load — an input 1→0 transition charges the output capacitance; an input 0→1 transition discharges it]

Load capacitance is dependent on the fanout (how many other gates a gate drives) and how big the other gates are. Section 5.4.2 goes into more detail on timing models and equations for load delay.

5.1.4.2 Interconnect Delays

Wires, also known as interconnect, have resistance, and there is capacitance between a wire and both the substrate and parallel wires. Both the resistance and capacitance of wires increase delay.

• Wire resistance is dependent upon the material and geometry of the wire.
• Wire capacitance is dependent on wire geometry, the geometry of neighboring wires, and materials.
• Shorter wires are faster.
• Fatter wires are faster.
• FPGAs have special routing resources for long wires.
• CMOS processes use higher metal layers for long wires; these layers have wires with much larger cross sections than lower levels of metal.

More on this in section 5.4.

5.1.5 Summary of Delay Factors

Name          Symbol   Definition
Skew                   Difference in arrival times for different clock signals
Jitter                 Difference in clock period over time
Clock-to-Q    T_CO     Delay from clock signal to Q output of flop
Setup         T_SUD    Length of time prior to clock/enable that data must be stable
Hold          T_HO     Length of time after clock/enable that data must be stable
Load                   Delay due to load (fanout/consumers/readers)
Interconnect           Delay along wire

Table 5.1: Summary of delay factors

5.1.6 Timing Constraints

For a circuit to operate correctly, the clock period must be longer than the sum of the delays shown in table 5.1.

Definition Margin: The difference between the required value of a timing parameter and the actual value.

A negative margin means that there is a timing violation. A margin of zero means that the timing parameter is just satisfied: changing the timing of the signals (which would affect the actual value of the parameter) could violate the timing parameter. A positive margin means that the constraint for the timing parameter is more than satisfied: the timing of the signals could be changed at least a little bit without violating the timing parameter.

Note: "Margin" is often called "slack". Both terms are commonly used.
5.1.6.1 Minimum Clock Period

[Figure: waveforms for clk1, clk2 and signals a, b, showing how skew, jitter, clock-to-Q, interconnect + load, setup, and slack add up across the clock period; shaded regions mark where a signal is stable, may change, may rise, or may fall]

ClockPeriod > Skew + Jitter + T_CO + Interconnect + Load + T_SUD

Note: The minimum clock period is independent of the hold time.
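The inequality above can be turned into a small worked calculation. The delay values below are invented purely for illustration:

```python
# Minimum clock period from the delay factors of table 5.1.
# All values in ns; the example numbers are made up.

def min_clock_period(skew, jitter, t_co, interconnect, load, t_sud):
    """Smallest clock period that satisfies the setup constraint."""
    return skew + jitter + t_co + interconnect + load + t_sud

period = min_clock_period(skew=0.2, jitter=0.1, t_co=0.5,
                          interconnect=1.0, load=2.0, t_sud=0.4)
print(round(period, 2))          # 4.2 ns minimum period
print(round(1000.0 / period, 1)) # maximum clock frequency in MHz
```

Note that T_HO does not appear anywhere in the function, matching the remark that the minimum clock period is independent of the hold time.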
5.1.6.2 Hold Constraint

[Figure: waveforms for clk1, clk2 and signals a, b illustrating the hold constraint across one clock period]

Skew + Jitter + T_HO ≤ T_CO + Interconnect + Load

5.1.6.3 Example Timing Violations

The figures below illustrate correct timing behaviour of a circuit and then two types of violations: a setup violation and a hold violation. In the figures, the black rectangles identify the point where the violation happens.
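The hold inequality above can likewise be checked numerically. A sketch with invented delay values:

```python
# Hold constraint: new data launched at the source flop must not reach
# the destination flop before the hold window of the same clock edge has
# passed there. Values in ns; the numbers are made up.

def hold_margin(skew, jitter, t_ho, t_co, interconnect, load):
    """Positive margin means the hold constraint is satisfied."""
    return (t_co + interconnect + load) - (skew + jitter + t_ho)

# Plenty of logic between the flops: safe.
print(round(hold_margin(skew=0.2, jitter=0.1, t_ho=0.3, t_co=0.5,
                        interconnect=1.0, load=2.0), 2))   # 2.9 -> OK
# Direct flop-to-flop wire with large skew: violation.
print(round(hold_margin(skew=0.8, jitter=0.1, t_ho=0.3, t_co=0.5,
                        interconnect=0.1, load=0.0), 2))   # -0.6 -> violation
```

The second case shows why hold violations typically occur on short, fast paths: unlike setup, slowing the clock down does not help, because the clock period does not appear in the inequality.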
[Figure 5.5: Good Timing — waveforms for a, b, c, d and clk, annotated with clock-to-Q, propagation, setup, and hold]

[Figure 5.6: Setup Violation — the value at d is unstable (?α?β?) when the clock edge arrives]
[Figure 5.7: Hold Violation — the value at d (?β?γ?) is corrupted because b changes too soon after the clock edge]

5.2 Timing Analysis of Latches and Flip Flops

In this section, we show how to find the clock-to-Q, setup, and hold times for latches, flip-flops, and other storage elements.

5.2.1 Simple Multiplexer Latch

We begin our study of timing analysis for storage devices with a simple latch built from an inverter ring and a multiplexer. There are many better ways to build latches, primarily by doing the design at the transistor level. However, the simplicity of this design makes it ideal for illustrating timing analysis.

5.2.1.1 Structure and Behaviour of Multiplexer Latch

Two modes for storage devices:
• loading data:
  – loads input data into storage circuitry
  – input data passes through to output
• using stored data:
  – input signal is disconnected from output
  – storage circuitry drives output

[Figure: multiplexer-latch schematic with inputs clk and i and output o; in loading (pass-through) mode the multiplexer selects i, in storage mode it selects the feedback loop]

Unfold Multiplexer to Simple Gates ...................................................

[Figure: multiplexer symbol and gate-level implementation (two AND gates driving an OR gate, with select s); the latch implementation built from these gates]

Note: inverters on clk. Both of the inverters on the clk signal are needed. Together, they prevent a glitch on the OR gate when clk is deasserted. If there were only one inverter, a glitch would occur. For more on this, see section 5.2.1.6.

[Figure: gate-level latch loading '0' and loading '1' with clk='1']
[Figure: gate-level latch storing '0' and storing '1' with clk='0']

5.2.1.2 Strategy for Timing Analysis of Storage Devices

The key to calculating the setup and hold times of a latch, flop, etc. is to identify:

1. how the data is stored when not connected to the input (often a pair of inverters in a loop)
2. the gate(s) that the clock uses to cause the stored data to drive the output (often a transmission gate or multiplexer)
3. the gate(s) that the clock uses to cause the input to drive the output (often a transmission gate or multiplexer)

[Figure: the latch in load mode (clk='1') and store mode (clk='0'), highlighting the load path and the store loop]

Note: Storage devices vs. signals. We can talk about the setup and hold time of a signal or of a storage device. For a storage device, the setup and hold times are requirements that it imposes upon all environments in which it operates. For an individual signal in a circuit, the setup and hold times are the amounts of time that the signal is stable before and after a clock edge.
5.2.1.3 Clock-to-Q Time of a Multiplexer Latch

[Figure 5.8: Latch for clock-to-Q analysis — gate-level latch with data nodes d, l1, l2, qn, q, s1, s2 and clock nodes cn, c2]

[Figure 5.9: Waveforms of latch showing clock-to-Q timing]

Assume that the input is stable, and then the clock signal transitions to cause the circuit to move from storage mode to load mode. Calculate the clock-to-Q time by finding the delay of the critical path from where the clock signal enters the storage circuit to where q exits the storage circuit. The path is: clk → cn → c2 → l2 → qn → q, which has a delay of 5 (assuming each gate has a delay of exactly one time unit).
5.2.1.4 Setup Timing of a Multiplexer Latch

The storage device transitions from load mode to store mode. Setup is the time that the input must be stable before the clock changes.

[Figure 5.10: Latch for setup analysis]

[Figure 5.11: Setup with margin — waveforms; the goal is to store α]

Step-by-step animation of the latch transitioning from load to store mode:

[Animation frames: the circuit is stable in load mode; t=3: l2 is set to 0, because c2 turns off the AND gate]
[Animation frames: t=0: clk transitions from load to store; t=1: the clk transition propagates through the inverter to cn; t=2: s1 propagates to s2, because cn turns on the AND gate; t=4: α from the store path propagates to q; t=5: α from the store path completes the cycle]

The value on s1 at t=1 will propagate from the store loop to the output and back through the store loop. At t=1, s1 must have the value that we want to store. Or, equivalently, the value to store must have saturated the store loop by t=1. It takes 5 time units for a value on the input d to propagate to s1 (d → l1 → l2 → qn → q → s1). The setup time is the difference between the delay from d to s1 and the delay from clk to cn: 5 − 1 = 4, so the setup time for this latch is 4 time units.
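The setup calculation above amounts to counting gates along two paths and subtracting. A tiny Python sketch of that arithmetic, under the text's assumption of one time unit per gate:

```python
# Setup time of the multiplexer latch, computed from gate counts.
# Every gate is assumed to have a delay of exactly 1 time unit, as in
# the text.

def path_delay(path):
    """Delay of a path = number of gates traversed after the start node."""
    return len(path) - 1

d_to_s1 = path_delay(["d", "l1", "l2", "qn", "q", "s1"])   # 5 units
clk_to_cn = path_delay(["clk", "cn"])                      # 1 unit

setup = d_to_s1 - clk_to_cn
print(setup)   # 4 time units
```

The same subtraction pattern ("slowest data path minus fastest clock path") reappears in the general setup equation of section 5.2.4.1.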
[Figure 5.12: Setup Violation — waveforms with negative setup margin; the nodes qn, q, s1, and s2 never settle, alternating between α and ω (shown as α/ω)]
Step-by-step animation of the latch transitioning from load to store mode with a setup violation, where α arrives 1 time unit before the rising edge of the clock:

[Animation frames: the circuit is stable in load mode with ω; t=-1: d transitions from ω to α; t=0: α propagates through the inverter, and clk transitions from load to store; t=1: α propagates through the AND gate, and clk propagates through the inverter; t=2: the old ω propagates through the AND gate; t=3: l2 is set to 0, because c2 turns off the AND gate. Trouble: inconsistent values on the load path and the store path — the old value (ω) is still in the store path when the store path is enabled.]
[Animation frames: t=4: ω/α from the store path propagates to q; t=5: ω/α from the store path completes the cycle; the instability is illustrated with ω=0, α=1]

[Waveforms, t=-3 to t=6: setup with negative margin — qn, q, s1, and s2 alternate between α and ω without settling]
We now repeat the analysis of the setup violation, but illustrate the minimum violation (the input transitions from ω to α 3 time units before the clock edge).

[Animation frames: the circuit is stable in load mode with ω; t=-3: d transitions from ω to α; t=-2: α propagates through the inverter; t=-1: α propagates through the AND gate; t=0: clk transitions from load to store; t=1: clk propagates through the inverter; t=2: the old ω propagates through the AND gate. Trouble: inconsistent values on the load path and the store path — the old value (ω) is still in the store path when the store path is enabled.]
[Animation frames: t=3: l2 is set to 0, because c2 turns off the AND gate; t=4: ω/α from the store path propagates to q; t=5: ω/α from the store path completes the cycle; the instability is illustrated with ω=0, α=1]

[Waveforms, t=-3 to t=6: setup with negative margin]
[Figure 5.13: Setup Violation — waveforms with negative margin; qn, q, s1, and s2 alternate between α and ω]

[Figure 5.14: Minimum Setup Time — waveforms where α just reaches s1 as cn is asserted]

When cn is asserted, α must be at s1. Otherwise, ω will affect the storage circuitry when the data input is disconnected.
5.2.1.5 Hold Time of a Multiplexer Latch

[Figure 5.15: Latch for hold analysis]

[Figure 5.16: Hold OK — waveforms with hold margin; the goal is to store α while d changes from α to β]
[Figure 5.17: Animation of hold analysis — the circuit is stable in load mode; t=0: clk transitions from load to store; t=5: the clk transition propagates to cn; t=6: the clk transition propagates to c2, and l1 may now change without affecting the storage device; t=7: the clk transition propagates to l2]

It takes 6 time units for a change on the clock signal to propagate to the input of the AND gate that controls the load path. It takes 1 time unit for a change on d to propagate to its input of this AND gate. The data input must remain stable for 6 − 1 = 5 time units after the clock transitions from load to store mode, or else the new data value (e.g., β) will slip into the storage loop and corrupt the value α that we are trying to store.
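The hold calculation is the mirror image of the setup calculation: the clock-path delay minus the data-path delay. A sketch, taking the 6-unit clock path total from the text:

```python
# Hold time of the multiplexer latch, computed from path delays.
# Each gate is assumed to have a delay of exactly 1 time unit.

def path_delay(path):
    """Delay of a path = number of gates traversed after the start node."""
    return len(path) - 1

# Per the text, a clock change takes 6 units to reach the input of the
# AND gate that controls the load path (we take this total as given).
clk_to_load_gate = 6
# d reaches its input of the same AND gate through one gate:
d_to_load_gate = path_delay(["d", "l1"])   # 1 unit

hold = clk_to_load_gate - d_to_load_gate
print(hold)   # 5 time units
```

The "slowest clock path minus fastest data path" shape of this subtraction is exactly the general hold equation of section 5.2.4.1.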
[Figure 5.18: Hold violation — β slips through to q; waveforms with negative hold margin]

[Figure 5.19: Minimum Hold Time — waveforms where β arrives just as c2 deasserts]

We can't let β affect l1 before c2 deasserts. The hold time is the difference between the path from clk to c2 and the path from d to l1.
5.2.1.6 Example of a Bad Latch

This latch is very similar to the one from section 5.2.1.5; however, this one does not work correctly. The difference between this latch and the one from section 5.2.1.5 is the location of the inverter that determines whether l2 or s2 is enabled. When the clock signal is deasserted, c2 turns off the AND gate l2 before the AND gate s2 turns on. In the interval when both l2 and s2 are turned off, a glitch is allowed to enter the feedback loop. The glitch on the feedback loop is independent of the timing of the signals d and clk.

[Figure: schematic of the bad latch and waveforms showing the glitch entering the feedback loop]

5.2.2 Timing Analysis of Transmission-Gate Latch

The latch that we now examine is more realistic than the simple multiplexer-based latch. We replace the multiplexer with a transmission gate.
5.2.2.1 Structure and Behaviour of a Transmission Gate

[Figure: transmission-gate symbol and implementation (parallel n- and p-transistors with complementary selects s and s'), shown open, closed, transmitting '1', and transmitting '0'; a transmission gate acts as a switch between i and o]

5.2.2.2 Structure and Behaviour of Transmission-Gate Latch (Smith 2.5.1)

[Figure: transmission-gate latch — with clk=1, data is loaded from d to q; with clk=0, the feedback loop drives q with the stored data]
5.2.2.3 Clock-to-Q Delay for Transmission-Gate Latch

[Figure: transmission-gate latch with the clock-to-Q path highlighted]

5.2.2.4 Setup and Hold Times for Transmission-Gate Latch

[Figure: transmission-gate latch with path1 and path2 marked for setup analysis, and again for hold analysis; in each case, setup time = path1 − path2 and hold time = path1 − path2 for the respective paths]

5.2.3 Falling Edge Flip Flop (Smith 2.5.2)

We combine two active-high latches to create a falling-edge, master-slave flip-flop. The analysis of the master-slave flip-flop illustrates how to do timing analysis for hierarchical storage devices. Here, we use the timing information for the active-high latch to compute the timing information of the flip-flop. We do not need to know the primitive structure of the latch in order to derive the timing information for the flip-flop.
5.2.3.1 Structure and Behaviour of Flip-Flop

[Figure: master-slave flip-flop — an inverter on clk drives the master latch enable (clk_b); the master output m feeds the slave latch, which drives q. Waveforms show d, clk, m, clk_b, and q, annotated with T_Inv (delay through the inverter) and T_md (propagation delay from m to d of the slave), plus the latch setup and clock-to-Q windows]
5.2.3.2 Clock-to-Q of Flip-Flop

[Figure: waveforms showing that the flop's clock-to-Q is the inverter delay plus the slave latch's clock-to-Q]

T_CO^Flop = T_Inv + T_CO^Latch
5.2.3.3 Setup of Flip-Flop

[Figure: waveforms showing that the flop's setup window coincides with the master latch's setup window]

T_SUD^Flop = T_SUD^Latch

The setup time of the flip-flop is the same as the setup time of the master latch. This is because, once the data is stored in the master latch, it will be held for the slave latch.
5.2.3.4 Hold of Flip-Flop

[Figure: waveforms showing that the flop's hold window coincides with the master latch's hold window]

T_HO^Flop = T_HO^Latch

The hold time of the flip-flop is the same as the hold time of the master latch. This is because, once the data is stored in the master latch, it will be held for the slave latch.

5.2.4 Timing Analysis of FPGA Cells (Smith 5.1.5)

We can apply hierarchical analysis to structures that include both datapath and storage circuitry. We use an Actel FPGA cell to illustrate. The description of the Actel FPGA cell in the course notes is incomplete; refer to Smith's book for additional material.
5.2.4.1 Standard Timing Equations

T_PD   = delay from D-inputs to storage element
T_CLKD = delay from clk-input to storage element
T_OUT  = delay from storage element to output

T_SUD = setup time
      = "slowest D path" − "fastest clk path"
      = T_PD,Max − T_CLKD,Min

T_HO  = hold time
      = "slowest clk path" − "fastest D path"
      = T_CLKD,Max − T_PD,Min

T_CO  = delay from clk to Q
      = "clk path" + "output path"
      = T_CLKD + T_OUT

5.2.4.2 Hierarchical Timing Equations

Add combinational logic to the inputs, clock, and outputs of a storage element with parameters T'_SUD, T'_HO, and T'_CO, where T'_PD, T'_CLKD, and T'_OUT are the delays added on the data inputs, the clock, and the output.

[Figure: storage element with combinational logic added on the data inputs (t'_PD), clock (t'_CLKD), and output (t'_OUT)]

T_SUD = T'_SUD + T'_PD,Max − T'_CLKD,Min
T_HO  = T'_HO + T'_CLKD,Max − T'_PD,Min
T_CO  = T'_CO + T'_CLKD,Max + T'_OUT,Max

5.2.4.3 Actel Act 2 Logic Cell

Timing analysis of the Actel Act 2 logic cell (Smith 5.1.5).
Actel ACT

• Basic logic cells are called Logic Modules.
• ACT 1 family: one type of Logic Module (see Figure 5.1, Smith p. 192).
• ACT 2 and ACT 3 families: two different types of Logic Module (see Figure 5.4, Smith p. 198):
  – C-Module (Combinatorial Module) — combinational logic similar to the ACT 1 Logic Module, but capable of implementing a five-input logic function.
  – S-Module (Sequential Module) — a C-Module plus a Sequential Element (SE) that can be configured as a flip-flop.

Actel Timing

• ACT family: see Figure 5.5, Smith p. 200.
• Simple. Why? Only logic inside the chip.
• Not exact delay: with no place and route there is no physical layout, so interconnect delay is not accounted for — the numbers are non-deterministic.

Actel Architecture

• All primed parameters inside the S-Module are assumed — calculate tSUD, tH, and tCO.
• Of the 3 ns combinational logic delay, 0.4 ns went into increasing the setup time, tSUD, and 2.6 ns went into increasing the clock-to-output delay, tCO. From the outside, we can say that the combinational logic delay is buried in the flip-flop setup time.

[Figure: simple Actel-style latch; Actel latch with active-low clear; Actel flop with active-low clear]
[Figure: Actel sequential module — the C-Module (inputs d00, d01, d10, d11, a0, b0, a1, b1) drives m into the SE-Module (clocked by se_clk and se_clk_n, with clk and clr), which produces q]

5.2.4.4 Timing Analysis of Actel Sequential Module

Timing parameters for the Actel latch with active-low clear:

    T_SUD   0.4 ns
    T_HO    0.9 ns
    T_CO    0.4 ns

Other given timing parameters:

    C-Module delay (t'_PD)                      3 ns
    t'_CLKD (from clk to se_clk and se_clk_n)   2.6 ns

Question: What are the setup, hold, and T_CO times for the entire Actel sequential module?

Answer: See Smith p. 199. Use Smith's equations 5.15 and 5.16, and assume t'_CLKD = 2.6 ns.

    T_SUD   0.8 ns
    T_HO    0.5 ns
    T_CO    3.0 ns
5.2.5 Exotic Flop

As a contrast to the gate-level implementations of latches that we looked at previously, the figure below is the schematic for a state-of-the-art high-performance latch circa 2001.

[Figure: self-resetting latch with two precharge nodes, each with a keeper; d enters through an evaluation stack gated by clk and an inverter chain; q is driven from the second precharge node]

The inverter chain creates an evaluation window in time when the clock has just risen and the p-transistors are turned on. When the clock is '0', the left precharge node charges to '1' and the right precharge node discharges to '0'.

If d is '1' during the evaluation window, the left precharge node discharges to '0'. The left precharge node goes through an inverter to the second precharge node, which will charge from '0' to '1', resulting in a '0' on q.

If d is '0' during the evaluation window, the left precharge node stays at the precharge value of '1'. The left precharge node goes through an inverter to the second precharge node, which will stay at '0', resulting in a '1' on q.

The two inverter loops are keepers, which provide energy to keep the precharge nodes at their values after the evaluation window has passed and the clock is still '1'.

5.3 Critical Paths and False Paths

5.3.1 Introduction to Critical and False Paths

In this section we describe how to find the critical path through the circuit: the path that limits the maximum clock speed at which the circuit will work correctly. A complicating factor in finding the
critical path is the existence of false paths: paths through the circuit that appear to be the critical path, but in fact will not limit the clock speed of the circuit. The reason that a path is false is that the behaviour of the gates prevents a transition (either 0 → 1 or 1 → 0) from travelling along the path from the source node to the destination node.

Definition critical path: The slowest path on the chip between flops, or between flops and pins. The critical path limits the maximum clock speed.

Definition false path: A path along which an edge cannot travel from beginning to end.

To confirm that a path is a true critical path, and not a false path, we must find a pair of input vectors that exercise the critical path. The two input vectors usually differ only in their value for the input signal on the critical path.(1) The change on this signal (either 0 → 1 or 1 → 0) must propagate along the candidate critical path from the input to the output. Usually the two input vectors will produce different output values. However, a critical path might produce a glitch (0 → 1 → 0 or 1 → 0 → 1) on the output, in which case the path is still the critical path, but the two input vectors both result in the same value on the output signal. Glitches should not be ignored, because they may result in setup violations. If the glitching value is inside the destination flop or latch at the end of the clock period, then the storage element will not store a stable value.

Outline ..............................................................................

The algorithm that we present comes from McGeer and Brayton in a DAC 198? paper. The algorithm to find the critical path through a circuit is presented in several parts.

1. Section 5.3.2: Find the longest path, ignoring the possibility of false paths.
2. Section 5.3.3: Almost-correct algorithm to test whether a candidate critical path is a false path.
3. Section 5.3.4: If a candidate path is a false path, then find the next candidate path, and repeat the false-path detection algorithm.
4. Section 5.3.5: Correct, complete, and complex algorithm to find the critical path in a circuit.

(1) Section 5.3.5 discusses late side inputs and situations where more than one input needs to change for the critical path to be exercised.
Notes ................................................................................

Note: The analysis of critical paths and false paths assumes that all inputs change values at exactly the same time. Timing differences between inputs are modelled by the skew parameter in timing analysis.

Throughout our discussion of critical paths, we will use the delay values for gates shown in the table below.

    gate   delay
    NOT    2
    AND    4
    OR     4
    XOR    6

5.3.1.1 Example of Critical Path in Full Adder

Question: Find the critical path through the full-adder circuit shown below.

[Figure: full-adder circuit with inputs a, b, ci; internal signals i, j, k; outputs s, co.]

Answer:

Annotate with Max Distance to Destination ............................................

[Figure: full adder annotated with the maximum distance from each signal to an output; a and b are annotated with 14, ci with 8, and the outputs s and co with 0.]

Find Candidate Critical Path ..........................................................
[Figure: full adder annotated with max distance to destination; the two length-14 paths start at a and b.]

There are two paths of length 14: a–co and b–co. We arbitrarily choose a–co.

Test if Candidate is Critical Path ....................................................

[Figure: full adder with ci='1' and b='0'; a rising edge on a propagates through to co.]

Yes, the candidate path is the critical path. The assignment of ci=1, a=0, b=0 followed in the next clock cycle by ci=1, a=1, b=0 will exercise the critical path. As a shortcut, we write the pair of assignments as: ci=1, a=↑, b=0.

Question: Do the input values of ci=0, a=↓, b=1 exercise the critical path?

Answer:

[Figure: full adder with ci='0' and b='1'; a falling edge on a follows a shorter path to co.]

The alternative does not exercise the critical path. Instead, the alternative excitation follows a shorter path, so the output stabilizes sooner.

Lesson: not all transitions on the inputs will exercise the critical path. Using timing simulation to find the maximum clock speed of a circuit might overestimate the clock speed, because the input values that you simulate might not exercise the critical path.
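The max-distance annotation used in the answer above can be sketched in a few lines. The netlist below (i = a xor b, s = i xor ci, j = i and ci, k = a and b, co = j or k) is the standard full adder; the figure's exact structure did not survive extraction, so treat it as an assumed reconstruction. A simple fixed-point iteration stands in for the reverse topological traversal.

```python
# Sketch: annotate each signal with its maximum delay to an output,
# using the gate delays from the table above (NOT 2, AND 4, OR 4, XOR 6).

GATE_DELAY = {"NOT": 2, "AND": 4, "OR": 4, "XOR": 6}

# gate output -> (gate type, input signals); assumed full-adder netlist
circuit = {
    "i":  ("XOR", ["a", "b"]),
    "s":  ("XOR", ["i", "ci"]),
    "j":  ("AND", ["i", "ci"]),
    "k":  ("AND", ["a", "b"]),
    "co": ("OR",  ["j", "k"]),
}
outputs = ["s", "co"]

def max_dist_to_output(circuit, outputs):
    # delay[sig] = max delay from sig to any destination signal
    delay = {sig: 0 for sig in outputs}
    changed = True
    while changed:                      # fixed-point iteration over the netlist
        changed = False
        for out, (gate, ins) in circuit.items():
            if out not in delay:
                continue                # this gate's fanout not annotated yet
            d = delay[out] + GATE_DELAY[gate]
            for sig in ins:
                if delay.get(sig, -1) < d:
                    delay[sig] = d
                    changed = True
    return delay

print(max_dist_to_output(circuit, outputs))
```

The inputs a and b come out annotated with 14, matching the two length-14 candidate paths a–co and b–co in the example.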
5.3.1.2 Preliminaries for Critical Paths

There are three classes of paths on a chip:

• entry path: from an input to a flop
  Quartus does not report this by default. When Quartus reports this path, it is reported as the period associated with "System fmax". In Xilinx timing reports, this is reported as "Maximum Delay".

• stage path: from one flop to another flop
  In Quartus timing reports, this is reported as the period associated with "Internal fmax". In Xilinx timing reports, this is reported as "Clock to Setup" and "Maximum Frequency".

• exit path: from a flop to an output
  Quartus does not report this by default. When Quartus reports this path, it is reported as the period associated with "System fmax". In Xilinx timing reports, this is reported as "Maximum Delay".

5.3.1.3 Longest Path and Critical Path

The longest path through the circuit might not be the critical path, because the behaviour of the gates might prevent an edge (0 → 1 or 1 → 0) from travelling along the path.

Example False Path ....................................................................

Question: Determine whether the longest path in the circuit below is a false path.

[Figure: circuit with inputs a and b and output y.]
Answer: For this example, we use a very naive approach simply to illustrate the phenomenon of false paths. Sections 5.3.2–5.3.5 present a better algorithm to detect false paths and find the real critical path.

In the circuit above, the longest path is from b to y. The four possible scenarios for the inputs are:

    (a = 0, b = 0 → 1)    (a = 0, b = 1 → 0)
    (a = 1, b = 0 → 1)    (a = 1, b = 1 → 0)

[Figure: the four scenarios simulated on the circuit; in each, the edge on b is blocked before reaching y.]

In each of the four scenarios, the edge is blocked at either the AND gate or the OR gate. None of the four scenarios results in an edge on the output y, so the path from b to y is a false path.

Question: How can we determine analytically that this is a false path?

Answer: The value on a will always force either the AND gate to be a '0' (when a is '0') or the OR gate to be a '1' (when a is '1'). For both a='0' and a='1', a change on b will be unable to propagate to y. The algorithm to detect false paths is based upon this type of analysis.
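The analytic argument can also be checked by brute force: if, for every value of a, a change on b never changes y, then no edge can travel from b to y. The exact netlist did not survive extraction, so the function below (y = (a AND b) OR a) is an assumed reconstruction consistent with the analysis: a=0 forces the AND gate to '0', and a=1 forces the OR gate to '1'.

```python
# Sketch: exhaustive check that the b-to-y path is false.
# The circuit function is an assumed reconstruction (see lead-in).

def y(a, b):
    return (a & b) | a

# Would a transition on b ever change y, for either value of a?
edge_reaches_y = any(y(a, 0) != y(a, 1) for a in (0, 1))

print(edge_reaches_y)   # False: no edge on b can reach y
```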
Preview of Complete Example ..........................................................

This example illustrates all of the concepts in analysing critical paths. Here, we explore the circuit informally. In section 5.3.5, we will revisit this circuit and analyse it according to the complete, correct, and complex algorithm.

Question: Find the critical path through the circuit below.

[Figure: circuit with inputs a, b, c; internal signals d, e, f; output g.]

Answer: Even though the equation for this circuit reduces to false, the output signal (g) is not a constant '0'. Instead, glitches can occur on g. To explore the behaviour of the circuit, we will stimulate the circuit first with a falling edge, then a rising edge.

Stimulate the circuit with a falling edge and see which path the edge follows.

[Figure: circuit annotated with arrival times for a falling edge.]

The longest path through the circuit is the middle path. At g, the side input (a) has a controlling value before the falling edge arrives on the path input (e). Thus, a falling edge is unable to excite the longest path through the circuit.

Stimulate the circuit with a rising edge and see which path the edge follows.

[Figure: circuit annotated with arrival times for a rising edge.]

At f, the side input (c) has a controlling value before the falling edge arrives on the path input (e). Thus, a rising edge is unable to excite the longest path through the circuit.
Of the two scenarios, the falling edge follows a longer path through the circuit than the rising edge. The critical path is the lower path through the circuit.

When we develop our first algorithm to detect false paths (section 5.3.3), we will assume that at each gate, the input that is on the critical path will arrive after the other inputs. Not all circuits satisfy this assumption. At f, when a is a falling edge, the path input (c) arrives before the side input (e). This assumption is removed in section 5.3.5, where we present the complete algorithm, which deals with late-arriving side inputs.

5.3.1.4 Timing Simulation vs Static Timing Analysis

The delay through a component is usually dependent upon the values on signals. This is because different paths in the circuit have different delays, and some input values will prevent some paths from being exercised. Here are two simple examples:

• In a ripple-carry adder, if a carry out of the MSB is generated from the least significant bit, then it will take longer for the output to stabilize than if no carries are generated at all.

• In a state machine using a one-hot state encoding, false paths might exist when more than one state bit is a '1'.

Because of these effects, static timing analysis might be overly conservative and predict a delay that is greater than you will experience in practice. Conversely, a timing simulation may not demonstrate the actual slowest behaviour of your circuit: if you don't ever generate a carry from LSB to MSB, then you'll never exercise the critical path in your adder. The most accurate delay analysis requires looking at the complete set of actual data values that will occur in practice.

5.3.2 Longest Path

The following is an algorithm to find the longest path from a set of source signals to a set of destination signals. We first provide a high-level, intuitive description, and then present the actual algorithm.
Outline of Algorithm to Find Longest Path ............................................

The basic idea is to annotate each signal with the maximum delay from it to an output.

• Start at destination signals and traverse through fanin to source signals.
  – Destination signals have a delay of 0.
  – At each gate, annotate the inputs by the delay through the gate plus the delay of the output.
  – When a signal fans out to multiple gates, annotate the output of the source (driving) gate with the maximum delay of the destination signals.

• The primary input signal with the maximum delay is the start of the longest path. The delay annotation of this signal is the delay of the longest path.

• The longest path is found by working from the source signal to the destination signals, picking the fanout signal with the maximum delay at each step.

Algorithm to Find Longest Path ........................................................

1. Set current time to 0
2. Start at destination signals
3. For each input to a gate that drives a destination signal, annotate the input with the current time plus the delay through the gate
4. For each gate that has times on all of its fanout but not a time for itself,
   (a) annotate each input to the gate with the maximum time on the fanout plus the delay through the gate
   (b) go to step 4
5. To find the longest path, start at the source node that has the maximum delay. Work forward through the fanout. For signals that fan out to multiple signals, choose the fanout signal with the maximum delay.

Longest Path Example ..................................................................

Question: Find the longest path through the circuit below.

[Figure: circuit with inputs a, b, c, d, e; internal signals f, g, h, i, j; outputs k, l, m.]

Answer: Annotate signals with the maximum delay to an output:
[Figure: circuit annotated with the maximum delay from each signal to an output; a is annotated with 16, b with 12, and c with 10.]

Find the longest path:

[Figure: the same circuit with the longest path highlighted: a, d, f, g, i, k.]

The longest path, starting at a, has a delay of 16.

5.3.3 Detecting a False Path

In this section, we will explore a simple and almost-correct algorithm to determine if a path is a false path. The simple algorithm in this section sometimes gives incorrect results if the candidate path intersects false paths. For all of the example circuits in this section, the algorithm gives the correct result. The purpose of presenting this almost-correct algorithm is that it is relatively easy to understand and introduces one of the key concepts used in the complicated, correct, and complete algorithm for finding the critical path in section 5.3.5.

5.3.3.1 Preliminaries for Detecting a False Path

The controlling value of a gate is the value such that if one of the inputs has this value, the output can be determined independently of the other inputs. For an AND gate, the controlling value is '0', because when one of the inputs is a '0', we know that the output will be '0' regardless of the values of the other inputs. The controlled output value is the value produced by the controlling input value.

    Gate   Controlling Value   Controlled Output
    AND    '0'                 '0'
    OR     '1'                 '1'
    NAND   '0'                 '1'
    NOR    '1'                 '0'
    XOR    none                none
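The table above can be captured as a lookup, with a helper that reports the forced output when an input has the controlling value. This is a sketch in Python (the notes themselves give no code); the names are illustrative.

```python
# Controlling values and controlled outputs, from the table above.
CONTROLLING = {          # gate -> (controlling input value, controlled output)
    "AND":  (0, 0),
    "OR":   (1, 1),
    "NAND": (0, 1),
    "NOR":  (1, 0),
    # XOR has no controlling value: every input always matters
}

def controlled_output(gate, input_value):
    """Return the forced output if input_value controls the gate, else None."""
    if gate in CONTROLLING and CONTROLLING[gate][0] == input_value:
        return CONTROLLING[gate][1]
    return None

print(controlled_output("AND", 0))   # 0: a '0' input forces an AND gate to '0'
print(controlled_output("OR", 0))    # None: '0' is non-controlling for OR
```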
Path Input, Side Input ................................................................

Definition path input: For a gate on a path (either a candidate critical path or a real critical path), the path input is the input signal that is on the path.

Definition side input: For a gate on a path (either a candidate critical path or a real critical path), the side inputs are the input signals that are not on the path.

The key idea behind the almost-correct algorithm is that, for an edge to propagate along a path, the side inputs to each gate on the path must have non-controlling values. The complete, correct, and complicated algorithm generalizes this constraint to handle circuits where the side inputs are on false paths.

Reconvergent Fanout ..................................................................

Definition reconvergent fanout: Paths from signals in the fanout of a gate reconverge at another gate. Most of the difficulties both with critical paths and with testing circuits for manufacturing faults (Chapter 7) are caused by reconvergent fanout.

[Figure: circuit with inputs a, b, c; internal signals d, e, f, g, h; outputs y and z.]

There are two sets of reconvergent paths in the circuit above. One set of reconvergent paths goes from a to y and one set goes from d to z.

If a candidate path has reconvergent fanout, then the rising or falling edge on the input to the path might cause a side input along the path to have a rising or falling edge, rather than a stable '0' or '1'. To support reconvergent fanout, we extend the rule for side inputs having non-controlling values to say that side inputs must have either non-controlling values or edges that stabilize at non-controlling values.
Rules for Propagating an Edge Along a Path ...........................................

These rules assume that side inputs arrive before path inputs. Section 5.3.5 relaxes this constraint.

[Figure: waveform rules for propagating an edge through NOT, AND, OR, and XOR gates.]

Question: Why do the rules not have falling edges for AND gates or rising edges for OR gates on the side input?

Answer:

[Figure: waveforms for an AND gate and an OR gate with an edge on the side input.]

For an AND gate, a falling edge on the side input will force the output to change and prevent the path input from affecting the output. This is because the final value of a falling edge is the controlling value for an AND gate. Similarly, for an OR gate, the final value of a rising edge is the controlling value for the gate.
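The rule can be illustrated with a small timeline sketch (the integer time steps and signal shapes below are assumed for illustration): for an AND gate, a side input that falls to its controlling value '0' before the path edge arrives pins the output at '0' and masks the edge.

```python
# Sketch: an AND gate's output over discrete time steps, for a path
# input with a rising edge and two different side-input behaviours.

def waveform(path_in, side_in, times):
    # AND gate evaluated at each time step
    return [path_in(t) & side_in(t) for t in times]

times = range(6)
path_rising  = lambda t: 1 if t >= 3 else 0   # path input: rising edge at t=3
side_falling = lambda t: 0 if t >= 1 else 1   # side input: falls at t=1 (controlling)
side_high    = lambda t: 1                    # side input: constant non-controlling '1'

print(waveform(path_rising, side_falling, times))  # edge masked: output stays 0
print(waveform(path_rising, side_high, times))     # edge propagates at t=3
```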
Analyzing Rules for Propagating Edges ................................................

The pictures below show all combinations of output edge (rising or falling) and input values (constant 1, constant 0, rising edge, falling edge) for AND and OR gates. These pictures assume that the side input arrives before the path input. The pictures that are crossed out illustrate situations that prevent the path input from affecting the output. In these situations, the inputs cause either a constant value on the output, or the side input affects the output but the path input does not. The pictures that are not crossed out correspond to the rules above for pushing edges through AND and OR gates.

[Figure: all combinations of side-input values and path-input edges for AND gates (0 is controlling, so a constant 0 side input gives a constant 0 output) and OR gates (1 is controlling, so a constant 1 side input gives a constant 1 output).]

Viability Condition of a Path ........................................................

Definition viability condition: For a path (p) through a circuit, the viability condition (sometimes called the viability constraint) is a Boolean expression in terms of the input signals that defines the cases where an edge will propagate along the path. Equivalently: the cases where a transition on the primary input to the path will excite the path.

Based upon the rules for propagating an edge that we have seen so far, the viability condition for a path is: every side input has a non-controlling value. As always, section 5.3.5 has the complete viability condition.
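The viability condition can be checked mechanically: conjoin the side-input constraints and test satisfiability by enumerating input assignments, which is fine at the scale of these examples. The constraints below mirror False-Path Example 1 in section 5.3.3.3 (g[b] needs b='1', k[h] needs b='0'); the inverter relating h to b is an assumption used for illustration.

```python
# Sketch: viability check by enumeration.  A path is viable only if some
# input assignment gives every side input its non-controlling value.
from itertools import product

def viable(constraints, input_names):
    for values in product((0, 1), repeat=len(input_names)):
        env = dict(zip(input_names, values))
        if all(c(env) for c in constraints):
            return env      # witness assignment: the path is viable
    return None             # contradiction: the path is false

# Side-input constraints in the style of False-Path Example 1:
constraints = [
    lambda env: env["b"] == 1,   # g[b]: side input b must be '1'
    lambda env: env["c"] == 1,   # i[e]: e must be '0', i.e. c must be '1' (e assumed = not c)
    lambda env: env["b"] == 0,   # k[h]: h must be '1', i.e. b must be '0' (h assumed = not b)
]
print(viable(constraints, ["b", "c"]))   # None: contradiction on b, so a false path
```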
5.3.3.2 Almost-Correct Algorithm to Detect a False Path

The rules above for propagating an edge along a candidate path assume that the values on side inputs always arrive before the value on the path input. This is always true when the candidate path is the longest path in the circuit. However, if the longest path is a false path, then when we are testing subsequent candidate paths, there is the possibility that a side input will be on a false path and the side input value will arrive later than the value from the path input. This almost-correct algorithm assumes that values on side inputs always arrive before values on path inputs. The correct, complex, and complete critical-path algorithm in section 5.3.5 extends the almost-correct algorithm to remove this assumption.

To determine if a path through a circuit is a false path:

1. Annotate each side input along the path with its non-controlling value. These annotations are the constraints that must be satisfied for the candidate path to be exercised.
2. Propagate the constraints backward from the side inputs of the path to the inputs of the circuit under consideration.
3. If there is a contradiction amongst the constraints, then the candidate path is a false path.
4. If there is no contradiction, then the constraints on the inputs give the conditions under which an edge will traverse the candidate path from input to output.

5.3.3.3 Examples of Detecting False Paths

False-Path Example 1 ..................................................................

Question: Determine if the longest path in the circuit below is a false path.

[Figure: the circuit from the longest-path example, annotated with maximum delays to an output; the longest path is a, d, f, g, i, k with delay 16.]

Answer: Compute constraints for side inputs to have non-controlling values:
[Figure: circuit with the side inputs along the candidate path a, d, f, g, i, k annotated with their non-controlling values; the annotations produce contradictory values.]

    side input   non-controlling value   constraint
    g[b]         1                       b
    i[e]         0                       c
    k[h]         1                       ¬b

We found a contradiction between g[b] needing b and k[h] needing ¬b; therefore the candidate path is a false path.

Analyze the cause of the contradiction:

[Figure: the circuit, highlighting the two side inputs (b into g, and h into k) that will always have opposite values.]

These side inputs will always have opposite values. Both side inputs feed the same type of gate (AND), so it will always be the case that one of the side inputs will be a controlling value (0).

False-Path Example 2 ..................................................................

Question: Determine if the longest path through the circuit below is a critical path. If the longest path is a critical path, find a pair of input vectors that will exercise the path.

[Figure: circuit with inputs a, b, c; internal signals d, e, f, g; output h.]
Answer:

[Figure: circuit with the side inputs along the longest path annotated with their non-controlling values.]

    side input   non-controlling value   constraint
    e[a]         1                       a
    g[b]         0                       ¬b
    h[f]         1                       ¬a + b

The complete constraint is the conjunction of the individual constraints: a · ¬b · (¬a + b), which reduces to false. Therefore, the candidate path is a false path.

False-Path Example 3 ..................................................................

This example illustrates a candidate path that is a true path.

Question: Determine if the longest path through the circuit below is a critical path. If the longest path is a critical path, find a pair of input vectors that will exercise the path.

[Figure: circuit with inputs a, b, c; internal signals d, e, f, g; output h.]

Answer: Find the longest path; label side inputs with non-controlling values:

[Figure: circuit with the side inputs along the longest path annotated with their non-controlling values.]
Table of side inputs, non-controlling values, and constraints on primary inputs:

    side input   non-controlling value   constraint
    e[a]         0                       ¬a
    g[b]         0                       ¬b
    h[f]         1                       ¬a + ¬b

The complete constraint is ¬a · ¬b · (¬a + ¬b), which reduces to ¬a · ¬b. Thus, for an edge to propagate along the path, a must be '0' and b must be '0'.

The primary input to the path (c) does not appear in the constraint, thus both rising and falling edges will propagate along the path. If the primary input to the path appears with a positive polarity (e.g. c) in the constraint, then only a rising edge will propagate. Conversely, if the primary input appears negated (e.g. ¬c), then only a falling edge will propagate.

    Critical path   c, e, g, h
    Delay           14
    Input vector    a=0, b=0, c=rising edge

Illustration of a rising edge propagating along the path:

[Figure: circuit with a='0', b='0', and a rising edge on c propagating to h.]

Illustration of a falling edge propagating along the path:

[Figure: circuit with a='0', b='0', and a falling edge on c propagating to h.]

False-Path Example 4 ..................................................................

This example illustrates reconvergent fanout.
Question: Determine if the longest path through the circuit below is a critical path. If the longest path is a critical path, find a pair of input vectors that will exercise the path.

[Figure: circuit with inputs a, b; internal signals c, d, e, f; output g.]

Answer:

[Figure: circuit with the side inputs along the longest path annotated with their non-controlling values.]

    side input   non-controlling value   constraint
    e[b]         1                       b
    g[d]         1                       a

The complete constraint is a · b. The constraint includes the input to the path (a), which indicates that not all edges will propagate along the path. The polarity of the path input indicates the final value of the edge. In this case, the constraint of a means that we need a rising edge.

    Critical path   a, c, e, f, g
    Delay           12
    Input vector    a=rising edge, b=1

Illustration of a rising edge propagating along the path:

[Figure: circuit with b='1' and a rising edge on a propagating through c, e, f to g.]

If we try to propagate a falling edge along the path, the falling edge on the side input d forces the output g to fall before the arrival of the falling edge on the path input f. Thus, the edge does not propagate along the candidate path.
[Figure: circuit for Example 4 with b='1'; the falling edge on side input d reaches g before the falling edge on path input f.]

Patterns in False Paths ...............................................................

After analyzing these examples, you might have begun to observe some patterns in how false paths arise. There are several patterns in the types of reconvergent fanout that lead to false paths. For example, if the candidate path has an OR gate and an AND gate that are both controlled by the same signal, and the candidate path has an even number of inverters between these gates, then the candidate path is almost certainly a false path. The reason is the same as illustrated in the first example of a false path: the side input will always have a controlling value for either the OR gate or the AND gate.

5.3.4 Finding the Next Candidate Path

If the longest path is a false path, we need to find the next longest path in the circuit, which will be our next candidate critical path. If this candidate fails, we continue to find the next longest of the remaining paths, ad infinitum.

5.3.4.1 Algorithm to Find Next Candidate Path

To find the next candidate path, we use a path table, which keeps track of the partial paths that we have explored, their maximum potential delay, and the signals that we can follow to extend a partial path toward the outputs. We keep the path table sorted by the maximum potential delay of the paths. We delete a path from the table if we discover that it is a false path.

The key to the path table is how to update the potential delay of the partial paths after we discover a false path. All partial paths that are prefixes of the false path will need to have their potential delay values recomputed. The updated delay is found by following the unexplored signals in the fanout of the end of the partial path.

1. Initialize the path table with the primary inputs, their potential delay, and fanout.
2. Sort the path table by potential delay (path with greatest potential delay at the bottom of the table).
3. If the partial path with the maximum potential delay has just one unused fanout signal, then extend the partial path with this signal. Otherwise:
   (a) Create a new entry in the path table for the partial path extended by the unused fanout signal with the maximum potential delay.
   (b) Delete this fanout signal from the list of unused fanout signals for the partial path.
4. Compute the constraint that the side input of the new signal does not have a controlling value, and update the constraint table.
5. If the new constraint does not cause a contradiction, then return to step 3. Otherwise:
   (a) Mark this partial path as false.
   (b) For each partial path that is a prefix of the false path: reduce the potential delay of the path by the difference between the potential delay of the fanout that was followed and the unused fanout with the next greatest delay value.
   (c) Return to step 2.

5.3.4.2 Examples of Finding Next Candidate Path

Next-Path Example 1 ...................................................................

Question: Starting from the initial delay calculation and longest path, find the next candidate path and test if it is a false path.

[Figure: the circuit from False-Path Example 1, annotated with maximum delays to an output; a is annotated with 16, b with 12, and c with 10.]

Answer: Initial state of the path table:

    potential delay   unused fanout   path
    10                e               c
    12                h, g            b
    16                d               a

Extend the path with the maximum potential delay until we find a contradiction or reach the end of the path. Add an entry in the path table for each intermediate path with multiple signals in the fanout.
Path table and constraint table after detecting that the longest path is a false path:

    potential delay   unused fanout   path
    10                e               c
    12                h, g            b
    16                j, i            a, d, f, g
    false                             a, d, f, g, i, k

    side input   non-controlling value   constraint
    g[b]         1                       b
    i[e]         0                       c
    k[h]         1                       ¬b

The longest path is a false path. Recompute the potential delay of all paths in the path table that are prefixes of the false path. The one path that is a prefix of the false path is a, d, f, g. The remaining unused fanout of this path is j, which has a potential delay on its input of 2. The previous potential delay of g was 8, thus the potential delay of the prefix reduces by 8 − 2 = 6, giving the path a potential delay of 16 − 6 = 10.

Path table after updating with new potential delays:

    potential delay   unused fanout   path
    false                             a, d, f, g, i, k
    10                e               c
    10                i               a, d, f, g
    12                h, g            b

Extend b through g, because g has greater potential delay than the other fanout signal (h).

    potential delay   unused fanout   path
    false                             a, d, f, g, i, k
    10                e               c
    10                i               a, d, f, g
    12                h, g            b
    12                i, j            b, g

    side input   non-controlling value   constraint
    g[a]         1                       a

From g, we will follow i, because it has greater potential delay than j.
    potential delay   unused fanout   path
    false                             a, d, f, g, i, k
    10                e               c
    10                i               a, d, f, g
    12                h, g            b
    12                i, j            b, g
    12                                b, g, i, k

    side input   non-controlling value   constraint
    g[a]         1                       a
    i[e]         0                       c
    k[h]         1                       ¬b

We have reached an output without encountering a contradiction in our constraints. The complete constraint is a · ¬b · c.

    Critical path   b, g, i, k
    Delay           12
    Input vector    a=1, b=falling edge, c=1

Illustrate the propagation of a falling edge:

[Figure: circuit with a='1' and c='1'; a falling edge on b propagates through g and i to k, producing a glitch on k.]

At k, the rising edge on the side input (h) arrives before the falling edge on the path input (i). For a brief moment in time, both the side input and the path input are '1', which produces a glitch on k.

Next-Path Example 2 ...................................................................

Question: Find the critical path in the circuit below.

[Figure: circuit with inputs a, b, c, d, e; internal signals f, g, h, i, j, l; outputs k and m.]
Answer: Find the longest path:

[Figure: circuit annotated with maximum delays to an output; a is annotated with 10, b with 14, c with 20, d with 22, and e with 4.]

Initial state of the path table:

    potential delay   unused fanout   path
    4                 k               e
    10                j, l            a
    14                i               b
    20                g               c
    22                f               d

Extend the path with the maximum potential delay until we find a contradiction or reach the end of the path. Add an entry in the path table for each intermediate path with multiple fanout signals.

    potential delay   unused fanout   path
    4                 k               e
    10                j, l            a
    14                i               b
    20                g               c
    22                j, k            d, f, g, h, i
    false                             d, f, g, h, i, j, l

    side input   non-controlling value   constraint
    g[c]         1                       c
    i[b]         0                       ¬b
    j[a]         0                       ¬a
    l[a]         1                       a

There is a contradiction between j[a] and l[a]; therefore the path d, f, g, h, i, j, l is a false path. And any path that extends this path is also false. To find the next candidate, begin by recomputing delays along the candidate path. The second gate in the contradiction is l. The last intermediate path before l with unused fanout is i. Cut the candidate path at this signal. The
remaining initial part of the candidate path is d, f, g, h, i. The only unused fanout of this path is k. We now calculate the new maximum potential delay of d, f, g, h, i, taking into account the false path that we just discovered. The delay from i along the candidate path j, l, m is 10, and the maximum potential delay along the remaining unused fanout (k) is 4. The difference is 10 − 4 = 6, and so the potential delay of d, f, g, h, i is reduced to 22 − 6 = 16.

After updating the partial delay of d, f, g, h, i, the partial path with the maximum potential delay is c. The new candidate critical path will be c, f, g, h, i, j, l, m. Update the path table with the delay of 16 for the previous candidate path. Extend c along the path with the maximum potential delay until we find a contradiction or reach the end of the path. Add an entry in the path table for each intermediate path with multiple fanout signals.

    potential delay   unused fanout   path
    false                             d, f, g, h, i, j, l
    4                 k               e
    10                j, l            a
    14                i               b
    16                k               d, f, g, h, i
    20                k               c, f, g, h, i
    false                             c, f, g, h, i, j, l

We encounter the same contradiction as with the previous candidate, and so we have another false path. We could have detected this false path without working through the path table, if we had recognized that our current candidate path overlaps with the section (j, l) of the previous candidate that caused the false path. As with the previous candidate, we reduce the potential delay of the current candidate path up through i by 6, giving us a potential delay of 20 − 6 = 14 for c, f, g, h, i. The next candidate path is d, f, g, h, i, k, with a delay of 16.

    potential delay   unused fanout   path
    false                             d, f, g, h, i, j, l
    false                             c, f, g, h, i, j, l
    4                 k               e
    10                j, l            a
    14                i               b
    14                k               c, f, g, h, i
    16                k               d, f, g, h, i
We extend the path through k and compute the constraint table.

    side input   non-controlling value   constraint
    g[c]         1                       c
    i[b]         0                       ¬b
    k[e]         0                       ¬e

The complete constraint is ¬b · c · ¬e. There is no constraint on a, and d may be either a rising edge or a falling edge.

    Critical path   d, f, g, h, i, k
    Delay           16
    Input vector    a=0, b=0, c=1, d=rising edge, e=0

Next-Path Example 3 ...................................................................

Question: Find the critical path in the circuit below.

[Figure: circuit with inputs a, b, c, d; internal signals e, f, g, h, i, j, k, l, n, o; outputs m and p.]

Answer:

[Figure: circuit annotated with maximum delays to an output; a is annotated with 12, b with 14, c with 16, and d with 8.]

Initial state of the path table:
  potential delay | unused fanout | path
  ----------------+---------------+------
  8               | n, o          | d
  12              | j, k          | a
  14              | e             | b
  16              | f             | c

Extend c through f:

  potential delay | unused fanout | path
  ----------------+---------------+---------------------
  8               | n, o          | d
  12              | j, k          | a
  14              | e             | b
  16              | m, n          | c, f, g, h, i
  false           |               | c, f, g, h, i, n, p

  side input | non-controlling value | constraint
  -----------+-----------------------+-----------
  n[d]       | 1                     | d
  p[o]       | 1                     | d

The first candidate is a false path. Recompute the potential delay of c, f, g, h, i, which reduces it from 16 to 12.

  potential delay | unused fanout | path
  ----------------+---------------+---------------------
  false           |               | c, f, g, h, i, n, p
  8               | n, o          | d
  12              | j, k          | a
  12              | m             | c, f, g, h, i
  14              | e             | b

Extend b through e:

  potential delay | unused fanout | path
  ----------------+---------------+---------------------
  false           |               | c, f, g, h, i, n, p
  8               | n, o          | d
  12              | j, k          | a
  12              | m             | c, f, g, h, i
  false           |               | b, e, k, l

  side input | non-controlling value | constraint
  -----------+-----------------------+-----------
  k[a]       | 1                     | a
  l[j]       | 1                     | a
The second candidate is a false path. There is no unused fanout signal from l for the path b, e, k, l, so this partial path is a false path and there is no new delay information to compute. There are two paths with a potential delay of 12. Choose c, f, g, h, i, because the end of the path is closer to an output, so there will be less work to do in analyzing the path.

  potential delay | unused fanout | path
  ----------------+---------------+---------------------
  false           |               | c, f, g, h, i, n, p
  false           |               | b, e, k, l
  8               | n, o          | d
  12              | j, k          | a
  12              |               | c, f, g, h, i, m

  side input | non-controlling value | constraint
  -----------+-----------------------+---------------------
  m[l]       | 0                     | ¬(a ∗ (¬a b)) = true

Critical path: c, f, g, h, i, m
Delay: 12
Input vector: a=0, b=1, c=rising edge, d=0

5.3.5 Correct Algorithm to Find Critical Path

In this section, we remove the assumption that values on side inputs always arrive earlier than the value on the path input. We now deal with late-arriving side inputs, or simply "late side inputs". The presentation of late side inputs is as follows:

  Section 5.3.5.1  rules for how late side inputs can allow path inputs to exercise gates
  Section 5.3.5.2  the idea of monotone speedup, which underlies some of the rules
  Section 5.3.5.3  one of the potentially confusing situations, in detail
  Section 5.3.5.4  the complete, correct, and complex algorithm
  Section 5.3.5.5  examples

5.3.5.1 Rules for Late Side Inputs

For each gate, there are eight situations: the side input is controlling or non-controlling, the path input is controlling or non-controlling, and the side input arrives early or arrives late.
              side=non-ctrl   side=non-ctrl   side=CTRL       side=CTRL
              path=CTRL       path=non-ctrl   path=CTRL       path=non-ctrl
  Early Side  path input      path input      side input      neither input
              causes glitch   propagates      propagates      propagates
  Late Side   monotone        monotone        path input      side input
              speedup         speedup         propagates      causes glitch

Late side inputs give us three more situations for each of AND and OR gates where the path input will/might excite the gate. In the two cases labeled monotone speedup, the path input does not excite the gate with the current timing, but if our timing estimates for the side input are too slow, or the timing of the side input speeds up due to voltage or temperature variations, then the late side input might become an early side input.

The five situations where the path input excites the gate are:

side is early

  side=non-ctrl, path=non-ctrl  The path input is the later of the two inputs to transition to a non-controlling value, so it is the one that causes the output to transition.

  side=non-ctrl, path=ctrl  The side input transitions to a non-controlling value while the path input is a non-controlling value; this causes the output to transition to a non-controlled value. The path input then transitions to a controlling value, causing a glitch on the output as it transitions to a controlled value.

side is late

  side=non-ctrl, path=non-ctrl  If the side input arrives earlier than expected, then we will have an early-arriving side input with a non-controlling value.

  side=non-ctrl, path=ctrl  If the side input arrives earlier than expected, then we will have an early-arriving side input with a non-controlling value.

  side=ctrl, path=ctrl  The path input transitions to a controlling value before the side input; so, it is the input that causes the output to transition.
The three situations where the path input does not excite the gate are:

side is early

  side=ctrl, path=ctrl  The side input transitions to a controlling value before the path input transitions to a controlling value. The edge on the path input does not propagate to the output.

  side=ctrl, path=non-ctrl  It is always the case that at least one of the inputs is a controlling value, so the output of the gate is a constant controlled value.

side is late

  side=ctrl, path=non-ctrl  The path input transitions to a non-controlling value while the side input is still non-controlling. This causes the output to transition to a non-controlled value. The side input then transitions to a controlling value, which causes a glitch as the output transitions to a controlled value. The second edge of the glitch is caused by the side input, so the side input determines the timing of the gate.

Combining the five situations where the path input excites the gate gives us our complete and correct rule: a path input excites the gate if the side input is non-controlling, or the side input arrives late and the path input is controlling.

Section 5.3.5.2 discusses monotone speedup in more detail, then section 5.3.5.3 demonstrates that a late-arriving side input that causes a glitch cannot result in a true path. After these two tangents, we finally present the correct, complete, and complex algorithm for critical path analysis.

5.3.5.2 Monotone Speedup

When we have a late side input with a non-controlling value, the path input does not excite the gate, but the rules state that we should consider this to be a true path. The reason that we report this as a true path, even though the path input does not excite the gate, is the idea of monotone speedup.

Definition monotonic: A function f is monotonic if increasing its input causes the output to increase or remain the same. Mathematically: x < y =⇒ f(x) ≤ f(y).

Definition monotonous: A lecture is monotonous if increasing the length of the lecture increases the number of people who are asleep.

Definition monotone speedup: The maximum clock speed of a circuit should be monotonic with respect to the speed of any gate or sub-circuit. That is, if we increase the speed of part of the circuit, we should either increase the clock speed of the circuit, or leave it unchanged.
Definition monotonous speedup: A lecture has monotonous speedup if increasing the pace of the lecture increases the number of people who are awake.

In the monotone speedup situations, if we were to report the candidate path as false and the side input arrives sooner than expected, the path might generate an edge. Thus, a path that we initially thought was a false path becomes a real path. Speeding up a part of the circuit turned a false path into a real path, and thereby actually reduced the maximum clock speed of the circuit. Monotone speedup is desirable, because if we claim that a circuit has a certain minimum delay and then speed up some of the gates in the circuit (because of resizing gates, process variations, or temperature or voltage fluctuations), we would be quite distraught to discover that we have in fact increased the minimum delay.

We can see the rationale behind the monotone speedup rules by observing that if we have a late side input that transitions to a non-controlling value, and the circuitry that drives the late side input speeds up, the late side input might become an early side input. For each of the two monotone speedup situations, the corresponding early-side-input situation has a true path.

5.3.5.3 Analysis of Side-Input-Causes-Glitch Situation

In the following paragraphs we analyze the rule for a late side input where the side input is controlling and the path input is non-controlling. The excitation rules say that in this situation the path input cannot excite the gate. We might be tempted to think that we could construct a circuit where the first edge of the glitch (which is caused by the path input) propagates and the second edge (which is caused by the late side input) does not propagate. Here we demonstrate why we cannot create such a circuit. Readers who are willing to accept that the Earth is round without personally circumnavigating the globe may wish to skip to section 5.3.5.4.

In the picture below, c is the gate that produces a glitching output because of a late-arriving side input. We know that a, c is part of a false path and will demonstrate that in the current situation, b, c must also be part of a false path.

[figure: gate c with inputs a and b]

For a, c to be part of a false path, there must be a gate that appears later in the circuit that prevents the second edge of the glitch from propagating. In the figure below, this later gate is f, with e being the path input (from c) and d being the side input.

[figure: six waveform scenarios for gate f with inputs d and e: very early, middling early, and late arrival of d, each with d controlling and non-controlling]
For the first edge on e to propagate, the side input (d) must have a non-controlling value at the time of the first edge. To prevent the second edge of the glitch from propagating from e to f, d must be a controlling value. That is, d must transition from a non-controlling value to a controlling value in the middle of the glitch on e. This corresponds to the "middling early side ctrl" situation in the figure. From the perspective of the first edge of the glitch, this is identical to the situation with the first gate (c), in that a late-arriving side input transitions to a controlling value. In this case of "middling early side ctrl", the edge on d arrives later than the first edge on e, which means that d, f is a slower path than b, c, ..., e, f, which means that d, f is part of a false path. Thus, there is a gate later in the circuit that prevents the second edge of the glitch on f from propagating.

We wrap up the argument that the situation illustrated with a, b, c cannot lead to a critical path through b, c in two ways: intuitively and mathematically.

Intuitively, for b, c to be part of a critical path, c must be followed by f, which itself must be followed by another gate with a middling-early side input. All of the other cases that prevent the second edge of the glitch from propagating will prevent both edges of the glitch from propagating. This other gate with the middling-early side input produces a glitch and so must itself be followed by yet another gate with a middling-early side input. This process continues ad infinitum: we cannot construct a finite circuit that allows the first edge of the glitch on c to propagate and prevents the second edge of the glitch from propagating.

Mathematically, we construct a simple inductive proof based on the number of later gates in the candidate path. In the base case, f is the last gate in the path, and so it must be the gate that propagates the first edge of the glitch and does not generate a glitch. There is no situation in which this happens; thus the last gate in the path cannot have a middling-early side input. In the inductive case, we assume that there are n gates later in the path and none of them have middling-early side inputs. We can then prove that the gate just prior to the nth gate cannot have a middling-early side input, because for it to have one, one of the n later gates would need a middling-early side input that would allow the first edge of the glitch to propagate and prevent the second edge of the glitch from propagating. From the inductive hypothesis, we know that none of the n gates have a middling-early side input, and so we have completed the proof by contradiction.

5.3.5.4 Complete Algorithm

The possibility of late-arriving side inputs caused us to modify our rules for when a path input will excite a gate. The complete rule (section 5.3.5.1) is: the side input is non-controlling, or the side input arrives late and the path input is controlling. Because we explore candidate critical paths beginning with the slowest and working through faster and faster paths, a late-arriving side input must be part of a previously discovered false path. In the previous sections, when we did not have late-arriving side inputs, we could exercise the critical path with a change on just one input signal. With late-arriving side inputs, both the primary input to the critical path and the late-arriving side inputs might need to change.
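The complete excitation rule restated above can be sketched as a small predicate (a hypothetical encoding for illustration, not from the notes):

```python
def path_input_excites(side_controlling: bool, side_late: bool,
                       path_controlling: bool) -> bool:
    """Complete excitation rule from section 5.3.5.1: the path input
    excites the gate if the side input is non-controlling, or the side
    input arrives late and the path input is controlling."""
    return (not side_controlling) or (side_late and path_controlling)

# The three "does not excite" situations from the eight-situation table:
assert not path_input_excites(side_controlling=True, side_late=False, path_controlling=True)
assert not path_input_excites(side_controlling=True, side_late=False, path_controlling=False)
assert not path_input_excites(side_controlling=True, side_late=True, path_controlling=False)
# A non-controlling side input always gives a true path (including the
# two monotone speedup situations).
assert path_input_excites(side_controlling=False, side_late=True, path_controlling=False)
```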
When using the late-arriving side input portion of our excitation rule, we must ensure that the side input does in fact arrive later than the path input. If we do not, we could fall into the situation where both inputs are controlling and the side input arrives early; in this situation, the side input excites the gate.

For the side input to arrive late, the late path to the side input must be viable. Stated more precisely, the prefix of the previously discovered false path that ends at the side input must be viable. The entire previously discovered false path is clearly not viable; it is only the prefix up to the side input that must be viable. The viability condition for the prefix uses the same rule as we use for normal path analysis: for every gate along the prefix, the side input is non-controlling, or the prefix's side input arrives late and the prefix's path input is controlling.

The complete, correct, and complex algorithm is:
• If we find a contradiction on the path, check for side inputs that are on previously discovered false paths.
• If a gate and its side input are on a previously discovered false path, then the side input defines a prefix of a false path that is a late-arriving side input.
• For each late-arriving prefix, compute its viability (the conditions under which an edge will propagate along the prefix to the late side input).
• To the row of the late-arriving side input in the constraint table, add as a disjunction the constraint that the path input has a controlling value and at least one of the prefixes is viable.

5.3.5.5 Complete Examples

Complete Example 1 .................................................................

Question: Find the critical path in the circuit below.

[circuit diagram: inputs a, b, c driving gates d, e, f, g]

Answer:

[circuit diagram annotated with arrival times]
  potential delay | unused fanout | path
  ----------------+---------------+------------------
  14              | g, b, c       | a
  false           |               | a, b, d, e, f, g

  side input | non-controlling value | constraint
  -----------+-----------------------+-----------
  f[c]       | 1                     | ¬a
  g[a]       | 1                     | a

First false path; pursue the next candidate.

  potential delay | unused fanout | path
  ----------------+---------------+------------------
  false           |               | a, b, d, e, f, g
  10              | g, c          | a
  10              |               | a, c, f, g

  side input | non-controlling value | constraint
  -----------+-----------------------+-----------
  f[e]       | 1                     | ¬a
  g[a]       | 1                     | a

At first, this path appears to be false, but the side input f[e] is on the prefix of the false path a, b, d, e, f, g. Thus, f[e] is a late-arriving side input. The candidate path will be a true path if the side input arrives late and the path input is a controlling value. The viability condition for the path a, b, d, e is true. The constraint for the path input (c) to have a controlling value for f is a. Together, the viability constraint of true and the controlling-value constraint of a give us a late-side constraint of a. Updating the constraint table with the late-arriving side input constraint gives us:

  side input | non-controlling value | constraint
  -----------+-----------------------+----------------
  f[e]       | 1                     | ¬a + a = true
  g[a]       | 1                     | a

The constraint reduces to a. A rising edge will exercise the path.

Critical path: a, c, f, g
Delay: 10
Input vector: a=rising edge

Illustration of the rising edge exercising the critical path:

[waveform/circuit diagram: a rising edge on a at time 0 propagating through c and f, producing an edge on g at time 10]
Complete Example 2 .................................................................

Question: Find the critical path in the circuit below.

[circuit diagram: inputs a, b, c driving gates d through j]

Answer: Find the longest path:

[circuit diagram annotated with arrival times; the longest path is b, d, e, g, h, i, j with a delay of 18]

Explore the longest path:

  potential delay | unused fanout | path
  ----------------+---------------+---------------------
  8               | f             | a
  12              | h             | c
  18              | f, g          | b, d, e
  18              | h, i          | b, d, e, g
  false           |               | b, d, e, g, h, i, j

  side input | non-controlling value | constraint
  -----------+-----------------------+-----------
  h[c]       | 0                     | c
  i[g]       | 0                     | b
  j[f]       | 0                     | ab

Contradiction.

[circuit diagram with the contradictory required signal values annotated]

First false path; find the next candidate.
Changes in potential delays:

  signal / path     | old | new
  ------------------+-----+-----
  g on b, d, e, g   | 12  | 8
  b, d, e, g        | 18  | 14
  g[e] on b, d, e   | 14  | 10
  e on b, d, e      | 14  | 10
  b, d, e           | 18  | 14

  potential delay | unused fanout | path
  ----------------+---------------+---------------------
  false           |               | b, d, e, g, h, i, j
  8               | f             | a
  12              | h             | c
  14              | f, g          | b, d, e
  14              |               | b, d, e, g, i, j

[circuit diagram annotated with the updated arrival times]

  side input | non-controlling value | constraint
  -----------+-----------------------+-----------
  h[c]       | 0                     | c
  i[h]       | 0                     | ¬b c
  j[f]       | 0                     | ab

Initially, we found a contradiction, but b, d, e, g, h is a prefix of a false path, and i[h] is a side input to the candidate path. We have a late side input. Note that at the time that we passed through i, we could not yet determine that we would need to use i[h] as a late side input. The lesson is that when a contradiction is discovered, we must look back along the entire candidate path covered so far to see if we have any late side inputs.

Our late-arriving constraint for i[h] is:
• late side path (b, d, e, g, h) is viable: c.
• path input (i[g]) has a controlling value of '1': b.

Combining these constraints gives us bc. Adding the constraint of the late side input to the condition table gives us:

  side input | non-controlling value | constraint
  -----------+-----------------------+----------------
  h[c]       | 0                     | c
  i[h]       | 0                     | ¬b c + b c = c
  j[f]       | 0                     | ab
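The reduction of the i[h] row can be checked mechanically with a truth-table sweep; a quick illustrative check that adding the late-side disjunct b·c relaxes the original constraint ¬b·c to just c:

```python
from itertools import product

# For every assignment of b and c, the original constraint OR'd with the
# late-side disjunct equals plain c:  (not b and c) or (b and c) == c.
for b, c in product([False, True], repeat=2):
    assert ((not b and c) or (b and c)) == c

print("(not b and c) or (b and c) == c for all b, c")
```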
The constraints reduce to abc.

Critical path: b, d, e, g, i, j
Delay: 14
Input vector: a=0, b=falling edge, c=0

Illustration of the falling edge exercising the critical path:

[waveform/circuit diagram: a falling edge on b at time 0 propagating through d, e, g, and i, producing an edge on j at time 14]

Complete Example 3 .................................................................

This example illustrates the benefits of the principle of monotone speedup when analyzing critical paths.

[circuit diagram: inputs a, b, c driving gates d, e, f]

Critical-path analysis says that the critical path is a, c, e, f, with a late side input of e[d] and a total delay of 10. The required excitation is a rising edge on a. However, with the given delays, this excitation does not produce an edge on the output.

[waveform diagram: a rising edge on a produces no edge on f]

For a more complete analysis of the behaviour, we also try a falling edge. The falling edge exercises the path a, f with a delay of 4.

[waveform diagram: a falling edge on a produces an edge on f at time 4]

Monotone speedup says that if we reduce the delay of any gate, we must not increase the delay of the overall circuit. We reduce the delays of b and d from 2 to 0.5 and produce an edge at time 10 via the path a, c, e, f.
[waveform diagram: with the delays of b and d reduced to 0.5, a rising edge on a produces an edge on f at time 10]

The critical path analysis said that the critical path was a, c, e, f with a delay of 10. With the original circuit, the slowest path appeared to have a delay of 4. But, by reducing the delays of two gates, we were able to produce an edge with a delay of 10. Thus, the critical path algorithm did indeed satisfy the principle of monotone speedup.

Complete Example 4 .................................................................

This example illustrates that we sometimes need to allow edges on the inputs to late side paths.

Question: Find the critical path in the circuit below.

[circuit diagram: inputs a, b, c driving gates d through j, with output k]

Answer: The purpose of this example is to illustrate a situation where we need the primary input of a late-side path to toggle. To focus on the behaviour of the circuit, we show pictures of the different situations and do not include the path and constraint tables.

The longest path in the circuit, showing a contradiction between e[b] and j[h]:

[circuit diagram with the contradictory required signal values annotated]

The second longest path, b, f, g, h, i, j, k, using only early side inputs, showing a contradiction between k[e] and i[e]:

[circuit diagram with the contradictory required signal values annotated]
The second longest path using the late side input i[e], which has a controlling value of 1 (rising edge) on i[h]. However, we neglect to put a rising edge on a. The late-side path is not exercised, and our candidate path is also not exercised.

[circuit diagram annotated with signal values; no edge reaches k]

We now put a rising edge on a, which causes our late side input (i[e]) to be a non-controlling value when our path input (i[h]) arrives.

[circuit diagram annotated with arrival times; the edge propagates through i and j to k]

In looking at the behaviour of i, we might be concerned about the precise timing of the glitch on e and the rising edge on h. The figure below shows normal, slow, and fast timing of e. With slow timing, the first edge of the glitch on e arrives after the rising edge on h. The timing of the second edge of the glitch remains unchanged. The value of i remains constant, which could lead us to believe (incorrectly!) that our critical path analysis needs to take into account the first edge of the glitch. However, this is in fact an illustration of monotone speedup. The fast timing scenario moves the glitch earlier, such that the edge on h does in fact determine the timing of the circuit, in that h produces the last edge on i. In summary, with the glitch on e and the rising edge on h, either h causes the last edge on i or there is no edge on i.

[waveform diagrams: normal, slow, and fast timing of the glitch on e relative to the rising edge on h, and the resulting behaviour of i]

Complete Example 5 .................................................................

This example demonstrates that a late side path must be viable to be helpful in making a true path.

Question: Find the critical path in the circuit below.
[circuit diagram: inputs a, b, c, d driving gates e through j, with output k]

Answer: We find that the two longest paths are false paths, because of a contradiction between g[d] and i[c].

[circuit diagram with the contradictory required signal values annotated]

Try the third longest path, d, f, h, j, k, using early side inputs. We find a contradiction between k[i] and j[c].

[circuit diagram with the contradictory required signal values annotated]

Try using the late side paths a, e, g, i, k or b, e, g, i, k. We find that neither path is viable by itself, because of the contradiction between g[d] and i[c]. Also, neither path is viable in conjunction with the candidate path, because of the contradiction between i[c] on the late side path and j[c] on the candidate path. Either one of these contradictions by itself is sufficient to prevent the late side path from helping to make the candidate path a true path.

[circuit diagram with the contradictory required signal values annotated]

5.3.6 Further Extensions to Critical Path Analysis

McGeer and Brayton's paper includes extensions to the critical path algorithm presented here that we will not cover:
• gates with more than two inputs
• finding all input values that will exercise the critical path
• multiple paths with the same delay to the same gate

5.3.7 Increasing the Accuracy of Critical Path Analysis

When doing critical path calculations, it is often useful to strike a balance between accuracy and effort. In the examples so far, we assumed that all signals had the same wire and load delays. This assumption simplifies calculations, but reduces accuracy. Section 5.4 discusses how the analog world affects timing analysis.

5.4 Elmore Timing Model

There are many different models used to describe the timing of circuits. In the section on critical paths, we used a timing model that was based on the size of the gate. The timing model ignored interconnect delays and treated all gates as if they had the same fanout. For example, the delay through an AND gate was 4, independent of how many gates were in its immediate fanout. In this section we discuss two timing models. First, we discuss the detailed analog timing model, which reflects quite accurately the actual voltages on different nodes. The SPICE simulation program uses very detailed analog models of transistors (dozens of parameters to describe a single transistor). Then, we describe the Elmore delay model, which achieves greater simplicity than the analog model, but at a loss of accuracy.

5.4.1 RC-Networks for Timing Analysis

[figures: cross-sections, mask-level layouts, and switch-level symbols of fabricated P- and N-transistors, showing source, gate, drain, poly, diffusion, contacts, and substrate]
Different Levels of Abstraction for Inverter .................................................................

[figure: an inverter at the mask level, transistor level, and gate level]

From the electrical characteristics of fabricated transistors, VLSI and device engineers derive models of how transistors behave based on mask-level descriptions. For our purposes, we will use the very simple resistor-capacitor model shown below. Each of the P- and N-transistor models contains a resistor ("pullup" for the P-transistor and "pulldown" for the N-transistor) and a parasitic capacitor. When we combine a P-transistor and an N-transistor to create an inverter, we combine the capacitors into a single parasitic capacitor that is the sum of the two individual capacitors.

[figure: RC-network models of the P- and N-transistors, and the resulting RC model of an inverter with pullup resistance Rpu, pulldown resistance Rpd, parasitic capacitance Cp, and load capacitance CL]

• Contacts (vias) have resistance (RV).
• Metal areas (wires) have resistance (RW) and capacitance (CW).
  – The resistance is dependent upon the geometry of the wire.
  – The capacitance is dependent upon the geometry of the wire and the other wires adjacent to it.
• For most circuits, the via resistance is much greater than the wire resistance (RV ≫ RW).

To reduce area, modern wires tend to have tall and narrow cross sections. When wires are packed close together (e.g., a signal that is an array or vector), the wires act like capacitors.
A Pair of Inverters .................................................................

[figures: a pair of inverters at the gate level, transistor level, and mask level]

A Pair of Inverters (Cont'd) .................................................................

[figures: the mask-level layout of the pair of inverters, and the corresponding RC-network with wire resistance RW, via resistance RV, and wire capacitance CW between the two inverters]

To analyze the delay from one inverter to the next, we analyze how long it takes the capacitive load of the second (destination) inverter to charge up from ground to VDD, or to discharge from VDD to ground. In doing this analysis, the gate side of the driving inverter is irrelevant and can be removed (trimmed). Similarly, the pullup resistor, pulldown resistor, and parasitic capacitance of the destination inverter can also be removed.

RC-Network for Timing Analysis (trimmed)
[figure: the trimmed RC-network, consisting of the driving inverter's pullup Rpu, pulldown Rpd, and parasitic capacitance Cp, the wire resistance RW, wire capacitance CW, via resistance RV, and the destination inverter's load capacitance CL]

A Circuit with Fanout .................................................................

We will look at one more example of inverters and their RC-network before beginning the timing analysis of these networks.

[figures: an inverter driving two other inverters, shown at the gate level, the gate level with physical layout, the transistor level, and the mask level]
[figure: the RC-network for the fanout circuit, with wire resistances RW1, RW2, RW3, via resistances RV, wire capacitances CW1, CW2, CW3, parasitic capacitances Cp, and load capacitances CL]

[figure: the trimmed RC-network for the fanout circuit]

We will use this circuit as our primary example for the analog and Elmore timing models, so we draw a simplified version of the trimmed RC-network before proceeding.
[figure: the cleaned-up, trimmed RC-network for the fanout circuit]

5.4.2 Derivation of Analog Timing Model

The primary purpose of our timing model is to provide a mechanism to calculate the approximate delay of a circuit. For example, to say that a gate has a delay of 100 ps. The actual gate behaviour is a complicated function of the input signal behaviour. The waveforms below are all possible behaviours of the same circuit. From these various waveforms, it would be very difficult to claim that the circuit has a specific delay value.

[waveform diagrams: input and output voltages over time for slow and fast input transitions to the same circuit]

Steps Toward Approximation .................................................................

We begin with two simplifications as steps toward calculating a single delay value for a circuit.

1. Look at the circuit's response to a step-function input.
2. Measure the delay to go from GND to 65% of VDD and from VDD to 35% of VDD. These values of 65% VDD and 35% VDD are "trip points".
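For a single RC stage driven by a step input, the time to reach a trip point follows directly from the standard step response V(t) = VDD·(1 − e^(−t/RC)). A quick numerical sketch (the R and C values here are made-up illustrative numbers, not from the notes):

```python
import math

def time_to_trip_point(R, C, fraction):
    """Time for an RC step response V(t) = VDD*(1 - exp(-t/(R*C)))
    to charge from GND up to `fraction` of VDD."""
    return -R * C * math.log(1.0 - fraction)

R, C = 1e3, 1e-12  # 1 kilo-ohm, 1 pF (illustrative values only)
t65 = time_to_trip_point(R, C, 0.65)
# Reaching the 65% trip point takes about 1.05 RC time constants,
# since -ln(1 - 0.65) = ln(1/0.35) is approximately 1.05.
print(t65 / (R * C))
```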
Definition Trip Points: A high or '1' trip point is the voltage level where an upwards transition means the signal represents a '1'. A low or '0' trip point is the voltage level where a downwards transition means the signal represents a '0'.

In the figure below, the gray line represents the actual voltage on a signal. The black line is the digital discretization of the analog signal.

[waveform diagram: analog voltage on signal a and its digital discretization b]

Node Numbering, Initial Conditions .................................................................

To motivate our derivation of the analog timing model, we will use the inverter that fans out to two other inverters as our example circuit.

• The source (VDD in our case) and each capacitor is a node. We number the nodes, capacitors, and resistors. Resistors are numbered according to the capacitor to their right. Multiple resistors in series without an intervening capacitor are lumped into a single resistor.
• All nodes except the source start at GND.
• We calculate the voltage at a node when we turn on the P-transistor (connect to VDD). The process for analyzing a transition from VDD to GND on a node is the dual of the process just described: the source node is GND, all other nodes start at VDD, and we calculate the voltage when we turn on the N-transistor (connect it to GND).

[figure: the numbered RC-network; Node 0 is VDD; R1 = Rpu leads to Node 1 (Cp); R2 leads to Node 2 (CW1); R3 leads to Node 3 (CW2); R4 leads to Node 4 (CL); R5, branching from Node 2, leads to Node 5 (CL)]

Define: Path and Downstream .................................................................

We still have a few more preliminaries to get through. To discuss the structure of a network, we introduce two terms: path and downstream.
Definition path: The path from the source node to a node i is the set of all resistors between the source and i. Example: path(3) = {R1, R2, R3}.

Definition down: The set of capacitors downstream from a node is the set of all capacitors where current would flow through the node to charge the capacitor. You can think of this as the set of capacitors that are between the node and ground. Example: down(2) = {C2, C3, C4, C5}. Example: down(3) = {C3, C4}.

5.4.2.1 Example Derivation: Equation for Voltage at Node 3

As a concrete example of deriving the analog timing model, we derive the equation for the voltage at Node 3 in our example circuit.

  V3(t) = V0(t) − (voltage drop from Node 0 to Node 3)

The voltage drop is the sum of the voltage drops across the resistors on the path from Node 0 to Node 3:

  V3(t) = V0(t) − Σ_{r ∈ path(3)} Rr × Ir(t)
        = V0(t) − (R1 I1(t) + R2 I2(t) + R3 I3(t))

The current through a resistor is the sum of the currents through all of the downstream capacitors:

  Ir(t) = Σ_{c ∈ down(r)} Ic

  I1(t) = Ic1 + Ic2 + Ic3 + Ic4 + Ic5
  I2(t) = Ic2 + Ic3 + Ic4 + Ic5
  I3(t) = Ic3 + Ic4

Substitute Ir into the equation for V3:

  V3(t) = V0(t) − ( R1 (Ic1 + Ic2 + Ic3 + Ic4 + Ic5)
                  + R2 (Ic2 + Ic3 + Ic4 + Ic5)
                  + R3 (Ic3 + Ic4) )

Use associativity to group terms by currents:

  V3(t) = V0(t) − ( Ic1 (R1)
                  + Ic2 (R1 + R2)
                  + Ic3 (R1 + R2 + R3)
                  + Ic4 (R1 + R2 + R3)
                  + Ic5 (R1 + R2) )
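The sets path() and down() can be computed directly from the tree structure of the RC network. A sketch (the parent table is a hypothetical encoding of the example network from this section, with resistor Ri connecting each node i to its parent):

```python
# Tree structure of the example RC network: parent[i] is the node on the
# source side of resistor Ri, which connects parent[i] to node i.
parent = {1: 0, 2: 1, 3: 2, 4: 3, 5: 2}

def path(i):
    """Indices of the resistors between the source (node 0) and node i."""
    resistors = set()
    while i != 0:
        resistors.add(i)  # resistor Ri connects parent[i] to node i
        i = parent[i]
    return resistors

def down(i):
    """Indices of the capacitors downstream of node i: capacitors whose
    charging current flows through node i (including node i's own)."""
    return {k for k in parent if i in path(k) or k == i}

print(sorted(path(3)))  # [1, 2, 3]    -> {R1, R2, R3}
print(sorted(down(2)))  # [2, 3, 4, 5] -> {C2, C3, C4, C5}
print(sorted(down(3)))  # [3, 4]       -> {C3, C4}
```

The printed sets match the worked examples in the definitions above.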
Current through a capacitor:

    Ic(t) = Cc × ∂Vc(t)/∂t

Substitute Ic into the equation for V3:

    V3(t) = V0(t) − (  (R1) Cc1 ∂Vc1(t)/∂t
                     + (R1 + R2) Cc2 ∂Vc2(t)/∂t
                     + (R1 + R2 + R3) Cc3 ∂Vc3(t)/∂t
                     + (R1 + R2 + R3) Cc4 ∂Vc4(t)/∂t
                     + (R1 + R2) Cc5 ∂Vc5(t)/∂t )

In each of the resistance-capacitance terms (e.g., (R1 + R2)Cc2), the resistors are the set of resistors on the path to the capacitor that are also on the path to Node 3. We capture this observation by defining the Elmore resistance Ri,k for a pair of nodes i and k to be the sum of the resistors on the path to Node i that are also on the path to Node k:

    Ri,k = ∑_{r∈(path(i)∩path(k))} Rr

    R3,1 = R1
    R3,2 = R1 + R2
    R3,3 = R1 + R2 + R3
    R3,4 = R1 + R2 + R3
    R3,5 = R1 + R2

Substitute Ri,k into V3:

    V3(t) = V0(t) − (  R3,1 Cc1 ∂Vc1(t)/∂t + R3,2 Cc2 ∂Vc2(t)/∂t + R3,3 Cc3 ∂Vc3(t)/∂t
                     + R3,4 Cc4 ∂Vc4(t)/∂t + R3,5 Cc5 ∂Vc5(t)/∂t )

We are left with a system of dependent equations, in that V3 is dependent upon all of the voltages in the circuit. In the general derivation that follows, we repeat the steps we just did, and then show how the Elmore delay is an approximation of this system of dependent differential equations.

5.4.2.2 General Derivation

We derive the equation for the voltage at Node i as a function of the voltage at Node 0.
    Vi(t) = V0(t) − voltage drop from Node 0 to Node i

The voltage drop is the sum of the voltage drops across the resistors on the path from Node 0 to Node i:

    Vi(t) = V0(t) − ∑_{r∈path(i)} Rr × Ir(t)

The current through a resistor is the sum of the currents through all of the downstream capacitors:

    Ir(t) = ∑_{c∈down(r)} Ic(t)

Substitute Ir into the equation for Vi:

    Vi(t) = V0(t) − ∑_{r∈path(i)} ( Rr × ∑_{c∈down(r)} Ic(t) )

Use associativity to push Rr into the summation over c:

    Vi(t) = V0(t) − ∑_{r∈path(i)} ∑_{c∈down(r)} Rr × Ic(t)

Current through a capacitor:

    Ic(t) = Cc × ∂Vc(t)/∂t

Substitute Ic into the equation for Vi:

    Vi(t) = V0(t) − ∑_{r∈path(i)} ∑_{c∈down(r)} Rr × Cc × ∂Vc(t)/∂t

A little bit of handwaving to prepare for Elmore resistance (regroup the double summation by the node k that each capacitor sits on):

    Vi(t) = V0(t) − ∑_{k∈Nodes} ( ∑_{r∈path(i)∩path(k)} Rr ) × Ck × ∂Vk(t)/∂t

Define Elmore resistance Ri,k:

    Ri,k = ∑_{r∈(path(i)∩path(k))} Rr

Substitute Ri,k into Vi:

    Vi(t) = V0(t) − ∑_{k∈Nodes} Ri,k × Ck × ∂Vk(t)/∂t
The final equation above is an exact description of the behaviour of the RC-network model of a circuit. More accurate models would result in more complicated equations, but even this equation is more complicated than we want for calculating a simple number for the delay through a circuit. The equation is actually a system of dependent equations, in that each voltage Vi is dependent upon all of the voltages Vc in the circuit. Spice and other analog simulators use numerical methods to calculate the behaviour of these systems. Elmore's contribution was to find a simple approximation of the behaviour of such systems.

5.4.3 Elmore Timing Model

• Assume that V0(t) is a step function from 0 to 1 at time 0.
• Derive upper and lower bounds for Vi(t).
• Find RC time constants for the upper and lower bounds.
• The Elmore delay is guaranteed to be between the upper and lower bounds.

[Figure: upper-bound, Elmore-model, and lower-bound curves plotted against the RC-network response, with the times TDi − TRi, TP − TRi, and TDi marked.]

Equations for Curves

    Upper:   1 + (t − TDi)/TP                        for 0 ≤ t ≤ TDi − TRi
             1 − (TRi/TP) e^((TDi − TRi − t)/TRi)    for t ≥ TDi − TRi

    Elmore:  1 − e^(−t/TDi)

    Lower:   0                                       for 0 ≤ t ≤ TDi − TRi
             1 − TDi/(t + TRi)                       for TDi − TRi ≤ t ≤ TP − TRi
             1 − (TDi/TP) e^((TP − TRi − t)/TP)      for t ≥ TP − TRi

Fact: 0 ≤ TRi ≤ TDi ≤ TP
Definitions of Time Constants

    TRi = ( ∑_{k∈Nodes} R²k,i × Ck ) / Ri,i     mathematical artifact, no intuitive meaning
    TDi = ∑_{k∈Nodes} Rk,i × Ck                 Elmore delay
    TP  = ∑_{k∈Nodes} Rk,k × Ck                 RC-time constant for lumped network

Picking the Trip Point

    Vi(t) = VDD (1 − e^(−t/TDi))

Pick a trip point of Vi(t) = 0.65 VDD, then solve for t:

    0.65 VDD = VDD (1 − e^(−t/TDi))
    0.35 = e^(−t/TDi)

Take ln of both sides:

    ln 0.35 = ln(e^(−t/TDi))
    ln 0.35 = −1.05 ≈ −1.0
    −1.0 = −t/TDi
    t = TDi

By picking a trip point of 0.65 VDD, the time for Vi to reach the trip point is the Elmore delay.
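The trip-point algebra above can be checked numerically; a quick sketch with TDi normalized to 1:

```python
import math

# Sketch: verify that the 0.65*VDD trip point is reached at t ≈ TDi.
# Solving 0.65 = 1 - exp(-t/TDi) for t gives t = -ln(0.35) * TDi.
TDi = 1.0                            # normalized Elmore delay
t = -math.log(0.35) * TDi            # time to reach 0.65*VDD
assert abs(t - 1.05 * TDi) < 0.001   # ln 0.35 ≈ -1.0498, so t ≈ TDi
```

This is why 0.65 VDD is the convenient choice: any other trip point would leave a constant factor in front of TDi.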
5.4.4 Examples of Using Elmore Delay

5.4.4.1 Interconnect with Single Fanout

[Figure: gate G1 drives gate G2 through three wire segments and four antifuses; the RC tree runs from VDD through Rpu, then Ra1, Rw1, Ra2, Rw2, Ra3, Rw3, Ra4, with capacitors Cp, C1, C2, C3, and CG2, and Rpd to GND.]

    G*    gate
    C*    capacitance on wire
    Ra*   resistance through antifuse
    Rw*   resistance through wire

Question: Calculate the delay from gate 1 to gate 2.

Answer: Gate 2 represents node 4 on the RC tree.
    τD4 = ∑_{k=1}^{4} ERk,4 × Ck
        = ER1,4 C1 + ER2,4 C2 + ER3,4 C3 + ER4,4 C4
        = (Ra1 + Rw1 + Ra2 + Rw2 + Ra3 + Rw3 + Ra4) CG2
        + (Ra1 + Rw1 + Ra2 + Rw2 + Ra3 + Rw3) C3
        + (Ra1 + Rw1 + Ra2 + Rw2) C2
        + (Ra1 + Rw1) C1

Approximate Ra ≫ Rw:

    τD4 = (Ra1) C1 + (Ra1 + Ra2) C2 + (Ra1 + Ra2 + Ra3) C3 + (Ra1 + Ra2 + Ra3 + Ra4) CG2

Approximate Rai = Raj:

    τD4 = 4(Ra) CG2 + 3(Ra) C3 + 2(Ra) C2 + (Ra) C1

Question: If you double the number of antifuses and wires needed to connect two gates, what will be the approximate effect on the wire delay between the gates?

Answer:

    τDi = ∑_{k=1}^{n} ERk,i × Ck

Assume all resistances and capacitances are the same values (R and C), and assume that all intermediate nodes are along the path between the two gates of interest. Then ERk,i = k × R, so:

    τDi = ( ∑_{k=1}^{n} k ) RC

Using the mathematical theorem:
    ∑_{i=1}^{n} i = (n + 1)n / 2 ≈ n²/2

we simplify the delay equation:

    τDi = ( ∑_{k=1}^{n} k ) RC ≈ (n²/2) RC

We see that the delay is proportional to the square of the number of antifuses along the path.

5.4.4.2 Interconnect with Multiple Gates in Fanout

[Figure: schematic and layout of gate G1 driving gates G2 and G3 through a branching interconnect.]

Question: Assuming that wire resistance is much less than antifuse resistance and that all antifuses have equal resistance, calculate the delay from the source inverter (G1) to G2.

Answer:

1. There are a total of 7 nodes in the circuit (n = 7).

2. Label the interconnect with resistance and capacitance identifiers.

[Figure: interconnect labeled with resistances R1–R6 and capacitances C1–C7.]
3. Draw the RC tree.

[Figure: RC tree — VDD through Rpu and R1 to n1, R2 to n2, then n3, R3 to n4, R4 to n5 (G2), with a branch from the n2/n3 segment through R5 to n6 and R6 to n7 (G3); capacitors Cp and C1–C7 at the nodes, and Rpd to GND.]

4. G2 is node 5 in the circuit (i = 5).

5. Elmore delay equations:

    τD5 = ∑_{k=1}^{7} ERk,5 × Ck
        = ER1,5 C1 + ER2,5 C2 + ER3,5 C3 + ER4,5 C4 + ER5,5 C5 + ER6,5 C6 + ER7,5 C7

6. Elmore resistances:

    ER1,5 = R1 = R
    ER2,5 = R1 + R2 = 2R
    ER3,5 = R1 + R2 = 2R
    ER4,5 = R1 + R2 + R3 = 3R
    ER5,5 = R1 + R2 + R3 + R4 = 4R
    ER6,5 = R1 + R2 = 2R
    ER7,5 = R1 + R2 = 2R

7. Plug the resistances into the delay equation:

    τD5 = (R)C1 + (2R)C2 + (2R)C3 + (3R)C4 + (4R)C5 + (2R)C6 + (2R)C7
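The hand calculation above can be cross-checked mechanically. This sketch is not from the notes: the node/parent encoding is read off the RC-tree figure (n3 hangs on the n2 segment with negligible wire resistance, encoded as a zero-ohm link), with all antifuse resistances R and all capacitances C normalized to 1.

```python
# Sketch: Elmore delay computed directly from the RC tree.
# Node data is (parent, resistance-to-parent, capacitance-at-node).
R, C = 1.0, 1.0
tree = {
    1: (0, R, C), 2: (1, R, C), 3: (2, 0, C),   # trunk toward G2
    4: (3, R, C), 5: (4, R, C),                  # G2 is node 5
    6: (2, R, C), 7: (6, R, C),                  # branch toward G3 (node 7)
}

def path_res(i):
    """Resistors on the path from the source (node 0) to node i, as {node: R}."""
    rs = {}
    while i != 0:
        up, r, _ = tree[i]
        rs[i] = r
        i = up
    return rs

def elmore_delay(i):
    """tau_Di = sum over nodes k of ER(k,i) * Ck, ER = shared path resistance."""
    pi = path_res(i)
    total = 0.0
    for k, (_, _, ck) in tree.items():
        shared = set(pi) & set(path_res(k))         # path(i) ∩ path(k)
        total += sum(pi[n] for n in shared) * ck    # ER(k,i) * Ck
    return total

assert elmore_delay(5) == 16 * R * C   # R + 2R + 2R + 3R + 4R + 2R + 2R
assert elmore_delay(7) == 16 * R * C   # R + 2R + 2R + 2R + 2R + 3R + 4R
```

With equal capacitances the two delays come out exactly equal, which previews the conclusion of the G2-vs-G3 comparison below.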
Delay from G1 to G3

Question: Assuming that wire resistance is much less than antifuse resistance and that all antifuses have equal resistance, calculate the delay from the source inverter (G1) to G3.

Answer:

1. G3 is node 7 in the circuit (i = 7).

2. Elmore delay equations:

    τDi = ∑_{k=1}^{n} ERk,i × Ck

    τD7 = ∑_{k=1}^{7} ERk,7 × Ck
        = ER1,7 C1 + ER2,7 C2 + ER3,7 C3 + ER4,7 C4 + ER5,7 C5 + ER6,7 C6 + ER7,7 C7

3. Elmore resistances:

    ER1,7 = R1 = R
    ER2,7 = R1 + R2 = 2R
    ER3,7 = R1 + R2 = 2R
    ER4,7 = R1 + R2 = 2R
    ER5,7 = R1 + R2 = 2R
    ER6,7 = R1 + R2 + R5 = 3R
    ER7,7 = R1 + R2 + R5 + R6 = 4R

4. Plug the resistances into the delay equation:

    τD7 = (R)C1 + (2R)C2 + (2R)C3 + (2R)C4 + (2R)C5 + (3R)C6 + (4R)C7
Delay to G2 vs G3

Question: Assuming all wire segments at the same level have roughly the same capacitance, which is greater: the delay to G2 or the delay to G3?

Answer:

1. Equations for delay to G2 (τD5) and G3 (τD7):

    τD5 = (R)C1 + (2R)C2 + (2R)C3 + (3R)C4 + (4R)C5 + (2R)C6 + (2R)C7
    τD7 = (R)C1 + (2R)C2 + (2R)C3 + (2R)C4 + (2R)C5 + (3R)C6 + (4R)C7

2. Difference in delays:

    τD5 − τD7 = RC4 + 2RC5 − RC6 − 2RC7

3. Compare capacitances:

    C4 ≈ C6
    C5 ≈ C7

4. Conclusion: the delays are approximately equal.

5.5 Practical Usage of Timing Analysis

Speed Grading

• Fabs sort chips according to their speed (sorting is known as speed grading or speed binning).
• Faster chips are more expensive.
• In FPGAs, sorting is usually based on propagation delay through an FPGA cell. As wires become a larger portion of delay, some analysis of wire delays is also being done.
• Propagation delay is the average of the rising and falling propagation delays.
• Typical speed grades for FPGAs:

    Std   standard speed grade
    1     15% faster than Std
    2     25% faster than Std
    3     35% faster than Std

Worst-Case Timing

• Maximum delay in CMOS. When?
  – Minimum voltage
  – Maximum temperature
  – Slow-slow conditions (the process variation/corner which results in a slow p-channel and a slow n-channel). We could also have fast-fast, slow-fast, and fast-slow process corners.
• Increasing temperature increases delay:
  – ⇑ Temp =⇒ ⇑ atomic vibration
  – ⇑ atomic vibration =⇒ ⇑ collisions with current-carrying electrons
  – ⇑ collisions with current-carrying electrons =⇒ ⇑ resistivity
  – ⇑ resistivity =⇒ ⇑ delay
• Increasing supply voltage decreases delay:
  – ⇑ supply voltage =⇒ ⇑ current
  – ⇑ current =⇒ ⇓ load-capacitor charge time
  – ⇓ load-capacitor charge time =⇒ ⇓ total delay
• A derating factor is a number used to adjust timing numbers to account for voltage and temperature conditions.
• ASIC manufacturers classify parts based on the variety of environments they operate in:

                 VDD         TA (ambient temp)
    Commercial   5V ± 5%     0 to +70C
    Industrial   5V ± 10%    –40 to +85C
    Military     5V ± 10%    –55 to +125C

• What is important is the transistor temperature inside the chip, TJ (junction temperature).

5.5.1 Speed Binning

Speed binning is the process of testing each manufactured part to determine the maximum clock speed at which it will run reliably. Manufacturers sell chips off of the same manufacturing line at different prices based on how fast they will run. A "speed bin" is the clock speed that chips will be labeled with when sold.

Overclocking: running a chip at a clock speed faster than what it is rated for (and hoping that your software crashes more frequently than your over-stressed hardware will).
5.5.1.1 FPGAs, Interconnect, and Synthesis

On FPGAs, 40–60% of the clock cycle is consumed by interconnect. When synthesizing, increasing the effort (number of iterations) of place and route can significantly reduce the clock period on large designs.

5.5.2 Worst Case Timing

5.5.2.1 Fanout delay

In Smith's book, Table 5.2 (Fanout delay) combines two separate parameters:

• capacitive load delay
• interconnect delay

into a single parameter (fanout). This is common, and fine. But, when reading a table such as this, you need to know whether fanout delay combines both capacitive load delay and interconnect delay, or is just capacitive load.

5.5.2.2 Derating Factors

Delays are dependent upon supply voltage and temperature.

    ⇑ Temp =⇒ ⇑ Delay
    ⇑ Supply voltage =⇒ ⇓ Delay

Temperature

• ⇑ Temp =⇒ ⇑ Delay
  – ⇑ Temp =⇒ ⇑ resistivity of wires
  – As temperature goes up, atoms vibrate more, and so have a greater probability of colliding with the electrons flowing as current.
• ⇑ Supply voltage =⇒ ⇓ Delay
  – ⇑ Supply voltage =⇒ ⇑ current (V = IR)
  – ⇑ current =⇒ ⇓ time to charge load capacitors to the threshold voltage
Derating Factor Definition

A "derating factor" is a number to adjust timing numbers to account for different temperature and voltage conditions.

Excerpt from Table 5.3 in Smith's book (Actel ACT 3 derating factors):

    Derating factor   Temp    Vdd
    1.17              125C    4.5V
    1.00              70C     5.0V
    0.63              -55C    5.5V
5.6 Timing Analysis Problems

P5.1 Terminology

For each of the terms: clock skew, clock period, setup time, hold time, and clock-to-q, answer which time periods (one or more of t1–t9 or NONE) are examples of the term.

NOTES:

1. The timing diagram shows the limits of the allowed times (either minimum or maximum).
2. All timing parameters are non-negative.
3. The signal "a" is the input to a rising-edge flop and "b" is the output. The clock is "clk1".

[Timing diagram: clk1, clk2, a, and b annotated with time periods t1–t11, with a legend distinguishing "signal is stable" from "signal may change", and answer blanks for clock skew, clock period, setup time, and hold time.]

P5.2 Hold Time Violations

P5.2.1 Cause

What is the cause of a hold time violation?
P5.2.2 Behaviour

What is the bad behaviour that results if a hold time violation occurs?

P5.2.3 Rectification

If a circuit has a hold time violation, how would you correct the problem with minimal effort?

P5.3 Latch Analysis

Does the circuit below behave like a latch? If not, explain why not. If so, calculate the clock-to-Q, setup, and hold times; and answer whether it is active-high or active-low.

[Circuit schematic with inputs d and en and output q.]

    Gate Delays
    AND   4
    OR    2
    NOT   1
P5.4 Critical Path and False Path

Find the critical path through the following circuit:

[Circuit schematic with signals a through m.]
P5.5 Critical Path

[Circuit schematic with signals a through m.]

    Gate Delays
    NOT   2
    AND   4
    OR    4
    XOR   6

Assume all delay and timing factors other than combinational logic delay are negligible.

P5.5.1 Longest Path

List the signals in the longest path through this circuit.

P5.5.2 Delay

What is the combinational delay along the longest path?

P5.5.3 Missing Factors

What factors that affect the maximum clock speed does your analysis for parts 1 and 2 not take into account?

P5.5.4 Critical Path or False Path?

Is the longest path that you found a real critical path, or a false path? If it is a false path, find the real critical path. If it is a critical path, find a set of assignments to the primary inputs that exercises the critical path.
P5.6 YACP: Yet Another Critical Path

Find the critical path in the circuit below.

[Circuit schematic with signals a through h.]
P5.7 Timing Models

In your next job, you have been told to use a "fanout" timing model, which states that the delay through a gate increases linearly with the number of gates in the immediate fanout. You dimly recall that a long time ago you learned about a timing model named Elmo, Elmwood, Elmore, El-Morre, or something like that.

For the circuit shown below as a schematic and as a layout, answer whether the fanout timing model closely matches the delay values predicted by the Elmore delay model.

[Schematic and layout: G1 drives G2, G3, G4, and G5.]

    Symbol                 Capacitance   Resistance
    Interconnect level 2   Cx            0
    Interconnect level 1   Cy            0
    Gate                   Cg            0
    Antifuse               0             R

Assumptions:

• The capacitance of a node on a wire is independent of where the node is located on the wire.
P5.8 Short Answer

P5.8.1 Wires in FPGAs

In an FPGA today, what percentage of the clock period is typically consumed by wire delay?

P5.8.2 Age and Time

If you were to compare a typical digital circuit from 5 years ago with a typical digital circuit today, would you find that the percentage of the total clock period consumed by capacitive load has increased, stayed the same, or decreased?

P5.8.3 Temperature and Delay

As temperature increases, does the delay through a typical combinational circuit increase, stay the same, or decrease?

P5.9 Worst Case Conditions and Derating Factor

Assume that we have a 'Std' speed grade Actel A1415 (an ACT 3 part) Logic Module that drives 4 other Logic Modules:

P5.9.1 Worst-Case Commercial

Estimate the delay under worst-case commercial conditions (assume that the junction temperature is the same as the ambient temperature).

P5.9.2 Worst-Case Industrial

Find the derating factor for worst-case industrial conditions and calculate the delay (assume that the junction temperature is the same as the ambient temperature).

P5.9.3 Worst-Case Industrial, Non-Ambient Junction Temperature

Estimate the delay under worst-case industrial conditions (assuming that the junction temperature is 105C).
Chapter 6

Power Analysis and Power-Aware Design

6.1 Overview

6.1.1 Importance of Power and Energy

• Laptops, PDAs, cell phones, etc — obvious!
• For microprocessors in personal computers, every watt above 40W adds $1 to manufacturing cost.
• Approx 25% of the operating expense of a server farm goes to energy bills.
• (Dis)Comfort of Unix labs in E2.
• Sandia Labs had to build a special sub-station when they took delivery of the Teraflops massively parallel supercomputer (over 9000 Pentium Pros).
• High-speed microprocessors today can run so hot that they will damage themselves — Athlon reliability problems, Pentium 4 processor thermal throttling.
• In 2000, information technology consumed 8% of total power in the US.
• Future power viruses: cell phone viruses that cause a cell phone to run in full-power mode and consume the battery very quickly; PC viruses that cause a CPU to melt down.

6.1.2 Industrial Names and Products

All of the articles and papers below are linked to from the Documentation page on the E&CE 327 web site.

Overview white paper by Intel: PC Energy-Efficiency Trends and Technologies. An 8-page overview of energy and power trends, written in 2002. Available from the web at an intolerably long URL.
AMD's Athlon PowerNow! Reduces power consumption in laptops when running on battery by allowing software to reduce clock speed and supply voltage when performance is less important than battery life.

Intel SpeedStep Reduces power consumption in laptops when running on battery by reducing clock speed to 70–80% of normal.

Intel X-Scale An ARM5-compatible microprocessor for low-power systems: http://developer.intel.com/design/intelxscale/

Synopsys PowerMill A simulator that estimates power consumption of the circuit as it is simulated: http://www.synopsys.com/products/etg/powermill ds.html

DEC / Compaq / HP Itsy A tiny but powerful PDA-style computer running Linux and X-windows. Itsy was created in 1998 by DEC's Western Research Laboratory to be an experimental platform in low-power, energy-efficient computing. Itsy led to the iPAQ PocketPC.
www.hpl.hp.com/techreports/Compaq-DEC/WRL-2000-6.html
www.hpl.hp.com/research/papers/2003/handheld.html

Satellites Satellites run on solar power and batteries. They travel great distances doing very little, then have a brief period of very intense activity as they pass by an astronomical object of interest. Satellites need efficient means to gather and store energy while they are flying through space. Satellites need powerful, but energy efficient, computing and communication devices to gather, process, and transmit data. Designing computing devices for satellites is an active area of research and business.

6.1.3 Power vs Energy

Most people talk about "power" reduction, but sometimes they mean "power" and sometimes "energy."

• Power minimization is usually about heat removal.
• Energy minimization is usually about battery life or energy costs.

    Type     Units    Equivalent Types and Equations
    Energy   Joules   Work = Volts × Coulombs = (1/2) × C × Volts²
    Power    Watts    Energy/Time = Volts × I = Joules/sec
6.1.4 Batteries, Power and Energy

6.1.4.1 Do Batteries Store Energy or Power?

    Energy = Volts × Coulombs
    Power = Energy / Time

Batteries are rated in amp-hours at a voltage.

    battery rating = Amps × Hours × Volts
                   = (Coulombs/Second) × Seconds × Volts
                   = Coulombs × Volts
                   = Energy

Batteries store energy.

6.1.4.2 Battery Life and Efficiency

To extend battery life, we want to increase the amount of work done and/or decrease the energy consumed. Work and energy are the same units; therefore, to extend battery life, we truly want to improve efficiency.

The "power efficiency" of microprocessors is normally measured in MIPS/Watt. Is this a real measure of efficiency?

    MIPS/Watt = (millions of instructions / Second) / (Joules / Second)
              = millions of instructions / Joule

Both instructions executed and energy are measures of work, so MIPS/Watt is a measure of efficiency. (This assumes that all instructions perform the same amount of work!)
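The amp-hour-to-energy conversion is easy to sketch numerically. The rating below (2.5 AH at 10 V) is an assumed example value, not a property of any particular battery:

```python
# Sketch: converting a battery's amp-hour rating into stored energy.
amp_hours = 2.5                 # assumed example rating
volts = 10.0                    # assumed example battery voltage
coulombs = amp_hours * 3600     # 1 A = 1 C/s, and 1 hour = 3600 s
energy = coulombs * volts       # Energy = Volts x Coulombs
assert energy == 90_000         # Joules
```

The same conversion is used in the worked battery-life example in the next subsection.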
6.1.4.3 Battery Life and Power

Question: Running a VHDL simulation requires executing an average of 1 million instructions per simulation step. My computer runs at 700MHz, has a CPI of 1.0, and burns 70W of power. My battery is rated at 10V and 2.5AH. Assuming all of my computer's clock cycles go towards running VHDL simulations, how many simulation steps can I run on one battery charge?

Answer: Outline of approach:

1. Unify the units
2. Calculate the amount of energy stored in the battery
3. Calculate the energy consumed by each simulation step
4. Calculate the number of simulation steps that can be run

Unify the units:

    Amp (current)                                 Coulomb/sec
    Volt (potential difference, energy/charge)    Joule/Coulomb
    Watt (power)                                  Joule/sec

Energy stored in battery:

    Ebatt = AmpHours × Vbatt

Check the equation by checking the units:

    Ebatt = Amp × hour × (sec/hour) × Volt
          = (Coulomb/sec) × hour × (sec/hour) × (Joule/Coulomb)
          = Joule

Units match; do the math:

    Ebatt = 2.5AH × 3600 sec/hour × 10V
          = 90 000 Joules

Energy per simulation step:
Check the units:

    Estep = Watts × (sec/cyc) × (cyc/instr) × (instr/step)
          = (Joule/sec) × (sec/cyc) × (cyc/instr) × (instr/step)
          = Joule/step

Units check; do the math:

    Estep = 70 Watts × (1 / (700 × 10⁶ cyc/sec)) × 1.0 cyc/instr × 10⁶ instr/step
          = 0.1 Joule/step

Number of steps:

    NumSteps = Ebatt / Estep
             = 90 000 / 0.1
             = 900 000 steps

Question: If I use the SpeedStep feature of my computer, my computer runs at 600MHz with 60W of power. With SpeedStep activated, how much longer can I keep the computer running on one battery?

Answer: Approach:

1. Calculate uptime with SpeedStep turned off (high power)
2. Calculate uptime with SpeedStep turned on (low power)
3. Calculate the difference in uptimes

High-power uptime:
    TH = Ebatt / PH
       = 90 000 Watt-secs / 70 Watts
       ≈ 1286 secs
       = 21 minutes

Low-power uptime:

    TL = Ebatt / PL
       = 90 000 Watt-secs / 60 Watts
       = 1500 secs
       = 25 minutes

Difference in uptimes:

    Tdiff = TL − TH
          = 25 − 21
          = 4 minutes

Analysis: This question is based on data from a typical laptop. So, why are the predicted uptimes so much shorter than those experienced in reality?

Answer: The power consumption figures are the maximum peak power consumption of the laptop: disk spinning, fan blowing, bus active, all peripherals active, all modules on the CPU turned on. In reality, laptops almost never experience their maximum power consumption.

Question: With SpeedStep activated, how many more simulation steps can I run on one battery?
Answer: Clock speed is proportional to power consumption. In both high-power and low-power modes, the system runs the same number of clock cycles on the energy stored in the battery. So, we run the same number of simulation steps both with and without SpeedStep activated.

Analysis: In reality, with SpeedStep activated, I am able to run more simulation steps. Why does the theoretical calculation disagree with reality?

Answer: In reality, the processor does not use 100% of the clock cycles for running the simulator. Many clock cycles are "wasted" while waiting for I/O from the disk, user, etc. When reducing the clock speed, a smaller number of clock cycles are wasted as idle clock cycles.

6.2 Power Equations

    Power = DynamicPower + StaticPower
          = (SwitchPower + ShortPower) + LeakagePower

Dynamic Power — dependent upon clock speed
    Switching Power       useful — charges up transistors
    Short Circuit Power   not useful — both N and P transistors are on
Static Power — independent of clock speed
    Leakage Power         not useful — leaks around transistor

Dynamic power is proportional to how often signals change their value (switch).

• Roughly 20% of signals switch during a clock cycle.
• Need to take glitches into account when calculating the activity factor. Glitches increase the activity factor.
• Equations for dynamic power contain clock speed and activity factor.
6.2.1 Switching Power

[Figure: an inverter charging its load capacitor CapLoad on a 0→1 output transition and discharging it on a 1→0 transition.]

    energy to (dis)charge capacitor = (1/2) × CapLoad × VoltSup²

When a capacitor C is charged to a voltage V, the energy stored in the capacitor is (1/2)CV². The energy required to charge the capacitor from 0 to V is CV². Half of the energy ((1/2)CV²) is dissipated as heat through the pullup resistance; half of the energy is transferred to the capacitor. When the capacitor discharges from V to 0, the energy stored in the capacitor ((1/2)CV²) is dissipated as heat through the pulldown resistance.

f′: frequency at which the inverter goes through a complete charge-discharge cycle (eqn 15.4 in Smith):

    average switching power = f′ × CapLoad × VoltSup²

ClockSpeed   clock speed
ActFact      average number of times that a signal switches from 0 → 1 or from 1 → 0 during a clock cycle

    average switching power = (1/2) × ActFact × ClockSpeed × CapLoad × VoltSup²
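The switching-power equation can be exercised with a small sketch. All parameter values below are made-up illustrations, not figures from the notes:

```python
# Sketch: average switching power for one signal, using
# P_sw = 1/2 * ActFact * ClockSpeed * CapLoad * VoltSup^2.
act_fact    = 0.2       # signal switches in 20% of clock cycles (assumed)
clock_speed = 100e6     # 100 MHz clock (assumed)
cap_load    = 50e-15    # 50 fF load capacitance (assumed)
volt_sup    = 2.5       # 2.5 V supply (assumed)

p_sw = 0.5 * act_fact * clock_speed * cap_load * volt_sup ** 2
assert abs(p_sw - 3.125e-6) < 1e-12   # about 3.1 uW for this one signal
```

Note the quadratic dependence on VoltSup, which motivates the voltage-reduction discussion later in this chapter.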
6.2.2 Short-Circuited Power

[Figure: as the gate voltage Vi crosses between VoltThresh and VoltSup − VoltThresh, both the P- and N-transistors are on, and a short-circuit current IShort flows for a time TimeShort.]

    PwrShort = ActFact × ClockSpeed × TimeShort × IShort × VoltSup

6.2.3 Leakage Power

[Figure: cross section of an inverter showing the parasitic diode in the N-substrate; the leakage current ILeak flows through the parasitic diode.]

    PwrLk = ILeak × VoltSup

    ILeak ∝ e^(−q × VoltThresh / (k × T))
6.2.4 Glossary

ClockSpeed def clock speed (aka f)

ActFact def activity factor (aka A)
    = NumTransitions / (NumSignals × NumClockCycles)
    = per signal: percentage of clock cycles in which the signal changes value
    = per clock cycle: percentage of signals that change value per clock cycle
    Note: when measuring per circuit, sometimes approximate by looking only at flops, rather than every single signal.

TimeShort def short circuit time (aka τ)
    = time that both N and P transistors are turned on when a signal changes value

MaxClockSpeed def maximum clock speed that an implementation technology can support (aka fmax)
    ∝ (VoltSup − VoltThresh)² / VoltSup

VoltSup def supply voltage (aka V)

VoltThresh def threshold voltage (aka Vth)
    = voltage at which transistors turn on

ILeak def leakage current (aka IS, the reverse-bias saturation current)
    ∝ e^(−q × VoltThresh / (k × T))

IShort def short circuit current (aka Ishort)
    = current that goes through the transistor network while both N and P transistors are turned on

CapLoad def load capacitance (aka CL)

PwrSw def switching power (dynamic)
    = (1/2) × ActFact × ClockSpeed × CapLoad × VoltSup²

PwrShort def short-circuit power (dynamic)
    = ActFact × ClockSpeed × TimeShort × IShort × VoltSup

PwrLk def leakage power (static)
    = ILeak × VoltSup

Power def total power
    = PwrSw + PwrShort + PwrLk
q def electron charge = 1.60218 × 10⁻¹⁹ C

k def Boltzmann's constant = 1.38066 × 10⁻²³ J/K

T def temperature in Kelvin

6.2.5 Note on Power Equations

The power equation:

    Power = DynamicPower + StaticPower
          = PwrSw + PwrShort + PwrLk
          = ((1/2) × ActFact × ClockSpeed × CapLoad × VoltSup²)
          + (ActFact × ClockSpeed × TimeShort × IShort × VoltSup)
          + (ILeak × VoltSup)

is for an individual signal. To calculate dynamic power for n signals with different CapLoad, TimeShort, and IShort:

    DynamicPower = ∑_{i=1}^{n} ((1/2) × ActFact_i × CapLoad_i × ClockSpeed × VoltSup²)
                 + ∑_{i=1}^{n} (ActFact_i × ClockSpeed × TimeShort_i × IShort_i × VoltSup)

If we know the average CapLoad, TimeShort, and IShort for a collection of n signals, then the above formula simplifies to:

    DynamicPower = (n × ActFactAVG × (1/2) × CapLoadAVG × ClockSpeed × VoltSup²)
                 + (n × ActFactAVG × ClockSpeed × TimeShortAVG × IShortAVG × VoltSup)

If capacitances and short-circuit parameters don't have an even distribution, then don't average them. If high-capacitance signals have high activity factors, then averaging the equations will result in erroneously low predictions for power.
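The warning about uneven distributions can be made concrete with a small sketch. The three signals and all their parameter values below are made-up illustrations: the high-activity signal is also the high-capacitance one, so the averaged formula underpredicts the switching power.

```python
# Sketch: exact per-signal switching power vs the averaged approximation.
f, V = 100e6, 2.5                      # assumed clock and supply
act = [0.5, 0.1, 0.1]                  # high-activity signal first...
cap = [100e-15, 10e-15, 10e-15]        # ...also has the highest capacitance

exact = sum(0.5 * a * f * c * V ** 2 for a, c in zip(act, cap))
n = len(act)
avg = n * (sum(act) / n) * 0.5 * (sum(cap) / n) * f * V ** 2

assert avg < exact    # averaging correlated parameters underestimates power
```

With uncorrelated or identical parameters the two formulas agree; the error appears only when activity and capacitance are correlated, which is exactly the case the note warns about.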
6.3 Overview of Power Reduction Techniques

We can divide power reduction techniques into two classes: analog and digital.

analog

  Parameters to work with:
    capacitance   for example, Silicon on Insulator (SOI)
    resistance    for example, copper wires
    voltage       low-voltage circuits

  Techniques:
    dual-VDD   Two different supply voltages: high voltage for performance-critical portions of the design, low voltage for the remainder of the circuit. Alternatively, can vary voltage over time: high voltage when running performance-critical software and low voltage when running software that is less sensitive to performance.
    dual-Vt   Two different threshold voltages: transistors with low threshold voltage for performance-critical portions of the design (can switch more quickly, but more leakage power), transistors with high threshold voltage for the remainder of the circuit (switch more slowly, but reduce leakage power).
    exotic circuits   Special flops, latches, and combinational circuitry that run at a high frequency while minimizing power.
    adiabatic circuits   Special circuitry that consumes power on 0 → 1 transitions, but not 1 → 0 transitions. These sacrifice performance for reduced power.
    clock trees   Up to 30% of total power can be consumed in clock generation and the clock tree.

digital

  Parameters to work with:
    capacitance (number of gates)
    activity factor
    clock frequency

  Techniques:
    multiple clocks   Put a high-speed clock in performance-critical parts of the design and a low-speed clock in the remainder of the circuit.
    clock gating   Turn off the clock to portions of a chip when they are not being used.
    data encoding   Gray coding vs one-hot vs fully encoded vs ...
    glitch reduction   Adjust circuit delays or add redundant circuitry to reduce or eliminate glitches.
    asynchronous circuits   Get rid of clocks altogether....

Additional low-power design techniques for RTL from a Qualis engineer:
http://home.europa.com/~celiac/lowpower.html
6.4 Voltage Reduction for Power Reduction

If our goal is to reduce power, the most promising approach is to reduce the supply voltage, because from:

    Power = ((1/2) × ActFact × ClockSpeed × CapLoad × VoltSup²)
          + (ActFact × ClockSpeed × TimeShort × IShort × VoltSup)
          + (ILeak × VoltSup)

we observe:

    Power ∝ VoltSup²

Reducing Difference Between Supply and Threshold Voltage

As the supply voltage decreases, it takes longer to charge up the capacitive load, which increases the load delay of a circuit. In the chapter on timing analysis, we saw that increasing the supply voltage will decrease the delay through a circuit. (From V = IR, increasing V causes an increase in I, which causes the capacitive load to charge more quickly.) However, it is more accurate to take into account both the value of the supply voltage and the difference between the supply voltage and the threshold voltage:

    MaxClockSpeed ∝ (VoltSup − VoltThresh)² / VoltSup

Question: If the delay along the critical path of a circuit is 20 ns, the supply voltage is 2.8 V, and the threshold voltage is 0.7 V, calculate the critical path delay if the supply voltage is dropped to 2.2 V.

Answer:

    d    20 ns   current delay along critical path
    d′   ??      new delay along critical path
    V    2.8 V   current supply voltage
    V′   2.2 V   new supply voltage
    Vt   0.7 V   threshold voltage
    MaxClockSpeed ∝ 1/d
    MaxClockSpeed ∝ (VoltSup − VoltThresh)² / VoltSup

so:

    d ∝ V / (V − Vt)²

    d′/d = ((V − Vt)² / V) × (V′ / (V′ − Vt)²)

    d′ = d × ((V − Vt)² / V) × (V′ / (V′ − Vt)²)
       = 20 ns × ((2.8V − 0.7V)² / 2.8V) × (2.2V / (2.2V − 0.7V)²)
       = 31 ns

Reducing Threshold Voltage Increases Leakage Current

If we reduce the supply voltage, we want to also reduce the threshold voltage, so that we do not increase the delay through the circuit. However, as the threshold voltage drops, leakage current increases:

    ILeak ∝ e^(−q × VoltThresh / (k × T))

And increasing the leakage current increases the power:

    Power ∝ ILeak

So, we need to strike a balance between reducing VoltSup (which has a quadratic effect on reducing power) and increasing ILeak (which has a linear effect on increasing power).

6.5 Data Encoding for Power Reduction

6.5.1 How Data Encoding Can Reduce Power

Data encoding is a technique that chooses data values so that normal execution will have a low activity factor. The most common example is "Gray coding," where exactly one bit changes value each clock cycle when counting.
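The voltage-scaling example above can be checked numerically; a quick sketch using the same proportionality d ∝ V / (V − Vt)²:

```python
# Sketch: new critical-path delay after dropping the supply voltage,
# using MaxClockSpeed ∝ (VoltSup - VoltThresh)^2 / VoltSup.
d  = 20e-9     # current critical-path delay: 20 ns
V  = 2.8       # current supply voltage
Vn = 2.2       # new supply voltage
Vt = 0.7       # threshold voltage

d_new = d * ((V - Vt) ** 2 / V) * (Vn / (Vn - Vt) ** 2)
assert abs(d_new - 31e-9) < 1e-9   # about 31 ns
```

Note the asymmetry: a 21% drop in supply voltage costs roughly 54% more delay here, because V is approaching Vt.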
There are two ways to understand the pattern for Gray-code counting. Both methods are based on noting when a bit in the Gray code toggles from 0 to 1 or 1 to 0.

• To convert from binary to Gray, a bit in the Gray code toggles whenever the corresponding bit in the binary code goes from 0 to 1. (US Patent 4618849 issued in 1984.)

• To implement a Gray code counter from scratch, number the bits from 1 to n, with a special less-than-least-significant bit q0. The output of the counter will be qn ... q1.

  1. Create a flop that toggles in each clock cycle: q0 <= not q0
  2. Bit 1 toggles whenever q0 is 1.
  3. For each bit i ∈ 2..n, the counter bit qi toggles whenever qi−1 is 1 and all of the bits qi−2 ... q0 are 0.
  4. This behaviour can be implemented in a ripple-carry style by introducing carry (ci) and toggle (qti) signals for each bit:

         q0  <= not(q0)               reg asn
         c0  <= not(q0)               comb asn
         ci  <= ci−1 and not(qi)      comb asn
         qti <= qi−1 and ci−2         comb asn

     We create a toggle flip-flop by xoring the output of a D-flop with its toggle signal:

         qi  <= qi xor qti            reg asn

    Decimal   Gray   Binary
       0      0000    0000
       1      0001    0001
       2      0011    0010
       3      0010    0011
       4      0110    0100
       5      0111    0101
       6      0101    0110
       7      0100    0111
       8      1100    1000
       9      1101    1001
      10      1111    1010
      11      1110    1011
      12      1010    1100
      13      1011    1101
      14      1001    1110
      15      1000    1111

Question: For an eight-bit counter, how much more power will a binary counter consume than a Gray-code counter?

Answer: Power consumption is dependent on area and activity factor. The original purpose of this problem was to focus on activity factor. The problem was created under the mistaken assumption that a Gray code counter and a binary code counter will both use the same area (1 fpga cell per
bit) and so the power difference comes from the difference in activity factors. This mistake is addressed at the end of the solution.

For Gray coding, exactly one bit toggles in each clock cycle. Thus, the activity factor for an n-bit Gray counter will be 1/n.

For binary coding, the least significant bit toggles in every clock cycle, so it has an activity factor of 1. The 2nd least-significant bit toggles in every other clock cycle, so it has an activity factor of 1/2. We study the other bits and try to find a pattern based on the bit position, i, where i = 0 for the least-significant bit and n − 1 for the most-significant bit of an n-bit counter. We see that for bit i, the activity factor is 1/2^i.

For an n-bit binary counter, the average activity factor is the sum of the activity factors for the signals over the number of signals:

    BinaryActFact = (1/2⁰ + 1/2¹ + 1/2² + ··· + 1/2^(n−1)) / n
                  = (1/n) × Σ_{i=0}^{n−1} 2^(−i)

The limit of the summation term as n goes to infinity is 2. We can see this as an instance of Zeno's paradox, in that with each step we halve the distance to 2.

    BinaryActFact ≈ (1/n) × 2
                  ≈ 2/n

Find the ratio of the binary activity factor to the Gray-code activity factor:

    BinaryActFact / GrayActFact = (2/n) × (n/1)
                                = 2

In reality, the ripple-carry Gray code counter will always have two transitions per clock cycle: one for the q0 toggle flop and one for the actual signal in the counter that toggles. Thus the Gray code counter will consume more power than the binary counter. The overall power reduction comes from the circuit that uses the Gray code.
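The 2/n approximation and the factor-of-2 ratio can be checked with exact arithmetic (a minimal sketch; the function names are ours, not from the notes):

```python
from fractions import Fraction

def binary_act_fact(n):
    # Average activity factor of an n-bit binary counter: bit i toggles
    # once every 2^i cycles, so its activity factor is 1/2^i.
    return sum(Fraction(1, 2 ** i) for i in range(n)) / n

def gray_act_fact(n):
    # Exactly one of the n Gray-code bits toggles each cycle.
    return Fraction(1, n)

n = 8
print(float(binary_act_fact(n)))                      # 0.2490234375, close to 2/n = 0.25
print(float(binary_act_fact(n) / gray_act_fact(n)))   # 1.9921875, close to the limit of 2
```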
Question: For completely random eight-bit data, how much more power will a binary circuit consume than a Gray-code circuit?

Answer: If the data is completely random, then the Gray code loses its feature that consecutive data will differ in only one bit position. In fact, the activity factor for Gray code and binary code will be the same, so there will not be any power saving by using Gray code: a binary circuit will consume the same power as a Gray-code circuit.

On average, half of the bits will be 1 and half will be 0. For each bit, there are four possible transitions: 0→0, 0→1, 1→0, and 1→1. Of these four transitions, two cause a change in value and two do not. Half of the transitions result in a change in value, therefore for random data the activity factor will be 0.5, independent of data encoding or the number of bits.

6.5.2 Example Problem: Sixteen Pulser

6.5.2.1 Problem Statement

Your task is to do the power analysis for a circuit that should send out a one-clock-cycle pulse on the done signal once every 16 clock cycles. (That is, done is '0' for 15 clock cycles, then '1' for one cycle, then repeat with 15 cycles of '0' followed by a '1', etc.)

[Waveform, required behaviour: clk running continuously; done pulsing high in cycles 16 and 32]

You have been asked to consider three different types of counters: a binary counter, a Gray-code counter, and a one-hot counter. (The table below shows the values from 0 to 15 for the different encodings.)

Question: What is the relative amount of power consumption for the different options?
6.5.2.2 Additional Information

Your implementation technology is an FPGA where each cell (PLA cell) has a programmable combinational circuit and a flip-flop. The combinational circuit has 4 inputs and 1 output. The capacitive load of the combinational circuit is twice that of the flip-flop.

1. You may neglect power associated with clocks.
2. You may assume that all counters:
   (a) are implemented on the same fabrication process
   (b) run at the same clock speed
   (c) have negligible leakage and short-circuit currents

6.5.2.3 Answer

Outline of Thinking

The factors to consider that distinguish the options are capacitance and activity factor. Capacitance is dependent upon the number of signals, and whether a signal is combinational or a flop. Sketch out the circuitry to evaluate capacitance.

Sketch the Circuitry

Name the output "done" and the count digits "d()".
[Block diagram for Gray and Binary counters: four PLA cells with flops producing d(0)–d(3), plus one PLA producing done]

[Block diagram for One-Hot counter: sixteen cells producing d(0)–d(15), with done taken from the counter]

Observation: The Gray and Binary counters have the same design, and the Gray counter will have the lower activity factor. Therefore, the Gray counter will have lower power than the Binary counter. However, we don't know how much lower the power of the Gray counter will be, and we don't know how much power the One-Hot counter will consume.

Capacitance
                       cap   number   subtotal cap
    Gray    d()  PLAs   2      4          8
                 Flops  1      4          4
            done PLAs   2      1          2
                 Flops  1      0          0
    1-Hot   d()  PLAs   2      0          0
                 Flops  1     16         16
            done PLAs   2      0          0
                 Flops  1      0          0
    Binary  d()  PLAs   2      4          8
                 Flops  1      4          4
            done PLAs   2      1          2
                 Flops  1      0          0

Activity Factors

[Waveforms of clk, d(0)–d(3), and done for each encoding, with per-signal toggle rates:]

    Gray coding:     d(0) 8/16,  d(1) 4/16, d(2) 2/16, d(3) 2/16, done 2/16
    One-hot coding:  d(0) 2/16,  d(1) 2/16, d(2) 2/16, ...,       done 2/16
    Binary coding:   d(0) 16/16, d(1) 8/16, d(2) 4/16, d(3) 2/16, done 2/16
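The per-bit toggle counts behind these activity factors can be reproduced by simulating one full period of each counter (a minimal sketch; the function names are ours, not from the notes):

```python
def binary_to_gray(b):
    # Standard binary-to-Gray conversion.
    return b ^ (b >> 1)

def toggles_per_period(encode, bits=4, period=16):
    # Count, for each bit position, how many times the bit changes value
    # over one full counting period (including the wrap-around).
    counts = [0] * bits
    for i in range(period):
        a, b = encode(i), encode((i + 1) % period)
        for k in range(bits):
            counts[k] += ((a >> k) & 1) != ((b >> k) & 1)
    return counts

print(toggles_per_period(lambda i: i))                   # binary:  [16, 8, 4, 2]
print(toggles_per_period(binary_to_gray))                # Gray:    [8, 4, 2, 2]
print(toggles_per_period(lambda i: 1 << i, bits=16))     # one-hot: [2]*16
```

Dividing each count by the 16-cycle period gives the per-signal activity factors listed above.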
                       act fact
    Gray    d()  PLAs   1/4 of the signals toggle in each clock cycle
                 Flops  1/4 of the signals toggle in each clock cycle
            done PLAs   2 transitions / 16 clock cycles
                 Flops  —
    1-Hot   d()  PLAs   —
                 Flops  2 transitions / 16 clock cycles
            done PLAs   —
                 Flops  —
    Binary  d()  PLAs   (16 + 8 + 4 + 2 transitions) / (4 signals × 16 clock cycles) = 0.47
                 Flops  (16 + 8 + 4 + 2 transitions) / (4 signals × 16 clock cycles) = 0.47
            done PLAs   2 transitions / 16 clock cycles
                 Flops  —

Note: Activity factor for the One-Hot counter. Because all signals have the same capacitance, and all clock cycles have the same number of transitions for the One-Hot counter, we could have calculated the activity factor as two transitions per sixteen signals.
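Combining the capacitance subtotals with these activity factors gives the totals tabulated next. A sketch using exact fractions (so the binary entries avoid the 0.47 rounding; the final ratios may therefore differ from the printed percentages by a point or two):

```python
from fractions import Fraction

F = Fraction

# Each entry: (subtotal capacitance, activity factor); power = cap * act.
# Binary d() activity is exact: (16+8+4+2) / (4*16) = 30/64 = 0.46875.
designs = {
    "gray":   [(8, F(1, 4)), (4, F(1, 4)), (2, F(2, 16))],
    "onehot": [(16, F(2, 16))],
    "binary": [(8, F(30, 64)), (4, F(30, 64)), (2, F(2, 16))],
}

totals = {name: sum(cap * act for cap, act in rows)
          for name, rows in designs.items()}
print({k: float(v) for k, v in totals.items()})
# {'gray': 3.25, 'onehot': 2.0, 'binary': 5.875}
print(float(totals["gray"] / totals["binary"]))    # about 0.55 relative to binary
print(float(totals["binary"] / totals["onehot"]))  # about 2.9 relative to one-hot
```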
Putting it all Together

                       subtotal cap   act fact   power
    Gray    d()  PLAs       8           1/4        2
                 Flops      4           1/4        1
            done PLAs       2           2/16       4/16
                 Flops      0           —          0
            Total                                  3.25
    1-Hot   d()  PLAs       0           —          0
                 Flops     16           2/16       2
            done PLAs       0           —          0
                 Flops      0           —          0
            Total                                  2
    Binary  d()  PLAs       8           0.47       3.76
                 Flops      4           0.47       1.88
            done PLAs       2           2/16       0.25
                 Flops      0           —          0
            Total                                  5.87

If we choose Binary counting as the baseline, then the relative amounts of power are:

    Gray 54%    One-Hot 35%    Binary 100%

If we choose One-Hot counting as the baseline, then the relative amounts of power are:

    Gray 156%    One-Hot 100%    Binary 288%

6.6 Clock Gating

The basic idea of clock gating is to reduce power by turning off the clock when a circuit isn't needed. This reduces the activity factor.

6.6.1 Introduction to Clock Gating

Examples of Clock Gating

    Condition                                            Circuitry turned off
    O/S in standby mode                                  Everything except "core" state (PC, registers, caches, etc)
    No floating point instructions for k clock cycles    Floating point circuitry
    Instruction cache miss                               Instruction decode circuitry
    No instruction in pipe stage i                       Pipe stage i
Design Tradeoffs

+ Can significantly reduce activity factor (Synopsys PowerCompiler claims that clock gating can cut power to 50–80% of the ungated level)
− Increases design complexity
    • design effort
    • bugs!
− Increases area
− Increases clock skew

Functional Validation and Clock Gating

It's a functional bug to turn a clock off when it's needed for valid data. It's functionally ok, but wasteful, to turn a clock on when it's not needed. (About 5% of the bugs caught on Willamette (Intel Pentium 4 Processor) were related to clock gating. Nicolas Mokhoff. EE Times. June 27, 2001. http://www.edtn.com/story/OEG20010621S0080)

6.6.2 Implementing Clock Gating

Clock gating is implemented by adding a component that disables the clock when the circuit isn't needed.

[Schematic, without clock gating: i_data and i_valid registered on clk produce o_data and o_valid]

[Schematic, with clock gating: a Clock Enable State Machine watches i_valid and i_wakeup and asserts clk_en, which gates clk to produce cool_clk for the main circuit]
The total power of a circuit with clock gating is the sum of the power of the main circuit with a reduced activity factor and the power of the clock-gating state machine with its activity factor. The clock-gating state machine must always be on, so that it will detect the wakeup signal; do not make the mistake of gating the clock to your clock-gating circuit!

6.6.3 Design Process

Design Decisions

• What level of granularity for gated clocks?
    – entire module?
    – individual pipe stages?
    – something in between?
• When should the clocks turn off?
• When should the clocks turn on?
• Protocol for incoming wakeup signal?
• Protocol for outgoing wakeup signal?

Wakeup Protocol

Designers negotiate the incoming and outgoing wakeup protocol with the environment. An example wakeup protocol:

• wakeup_in will arrive 1 clock cycle before valid data
• wakeup_in will stay high until there are at least 3 cycles of invalid data

Design Strategy

When designing clock gating circuitry, consider the two extreme cases:

• a constant stream of valid data
• the circuit is turned off and receives a single parcel of valid data

For a constant stream of valid data, the key is to not incur a large overhead in design complexity, area, or clock period when the clocks will always be toggling. For a single parcel of valid data, the key is to make sure that the clocks are toggling so that data can percolate through the circuit. Also, we want to turn off the clock as soon as possible after the data leaves.
6.6.4 Effectiveness of Clock Gating

We can measure the effectiveness of clock gating by comparing the percentage of clock cycles when the clock is not toggling to the percentage of clock cycles that the circuit does not have valid data (i.e. the clock does not need to toggle). The most ineffective clock gating scheme is to never turn off the clock (let the clock always toggle). The most effective clock gating scheme is to turn off the clock whenever the circuit is not processing valid data.

Parameters to characterize the effectiveness of clock gating:

    Eff      = effectiveness of clock gating
    PctValid = percentage of clock cycles with valid data in the circuit (the clock must be toggling)
    PctClk   = percentage of clock cycles that the clock toggles

Effectiveness measures the percentage of clock cycles with invalid data in which the clock is turned off. Equation for the effectiveness of clock gating:

    Eff = PctClkOff / PctInvalid
        = (1 − PctClk) / (1 − PctValid)

Question: What is the effectiveness if the clock toggles only when there is valid data?

Answer: PctClk = PctValid and the effectiveness should be 1:

    Eff = (1 − PctClk) / (1 − PctValid)
        = (1 − PctValid) / (1 − PctValid)
        = 1

Question: What is the effectiveness of a clock that always toggles?
Answer: If the clock is always toggling, then PctClk = 100% and the effectiveness should be 0.

    Eff = (1 − PctClk) / (1 − PctValid)
        = (1 − 1) / (1 − PctValid)
        = 0

Question: What does it mean for a clock gating scheme to be 75% effective?

Answer: 75% of the time that there is invalid data, the clock is off.

Question: What happens if PctClk < PctValid?

Answer: If PctClk < PctValid, then:

    1 − PctClk > 1 − PctValid

so the effectiveness will be greater than 100%. In some sense, it makes sense that the answer would be nonsense, because a clock gating scheme that is more than 100% effective is too effective: it is turning off the clock sometimes when it shouldn't!

We can see the effect of the effectiveness of a clock-gating scheme on the activity factor:

[Plot: new activity factor A′ versus effectiveness Eff, decreasing linearly from A at Eff = 0 to PctValid × A at Eff = 1]

When the effectiveness is zero, the new activity factor is the same as the original activity factor. For a 100% effective clock gating scheme, the activity factor is A × PctValid. Between 0% and 100% effectiveness, the activity factor decreases linearly. The new activity factor with a clock gating scheme is:

    A′ = A − (1 − PctValid) × Eff × A
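A minimal sketch of this formula (the function name and the starting activity factor 0.4 are our own example values, not from the notes):

```python
def gated_act_fact(a, pct_valid, eff):
    """New activity factor under clock gating: A' = A - (1 - PctValid) * Eff * A."""
    return a - (1 - pct_valid) * eff * a

a = 0.4  # example original activity factor
print(gated_act_fact(a, 0.7, 0.0))            # 0% effective: unchanged, 0.4
print(round(gated_act_fact(a, 0.7, 1.0), 10)) # 100% effective: A * PctValid = 0.28
```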
6.6.5 Example: Reduced Activity Factor with Clock Gating

Question: How much power will be saved in the following clock-gating scheme?

• 70% of the time the main circuit has valid data
• the clock gating circuit is 90% effective (90% of the time that the circuit has invalid data, the clock is off)
• the clock gating circuit has 10% of the area of the main circuit
• the clock gating circuit has the same activity factor as the main circuit
• neglect short-circuiting and leakage power

Answer:

1. Set up the main equations
    PwrMain   = power for main circuit without clock gating
    Pwr′Main  = power for main circuit with clock gating
    PwrClkFsm = power for clock enable state machine

    PwrTot = PwrMain + PwrClkFsm

    Pwr      = PwrSw + PwrLk + PwrShort
    PwrSw    = 1/2 × A × C × V²
    PwrLk    = negligible
    PwrShort = negligible
    Pwr      = 1/2 × A × C × V²

    PwrTot = 1/2 × AMain × CMain × V² + 1/2 × AClkFsm × CClkFsm × V²

    AMain   = A      CMain   = C
    AClkFsm = A      CClkFsm = 0.1C

    A′Main   = A′
    A′ClkFsm = A

    Pwr′Tot / PwrTot = (1/2 × A′ × C × V² + 1/2 × A × 0.1C × V²) / (1/2 × A × C × V²)
                     = (A′ + 0.1A) / A

2. Find the new activity factor for the main circuit (A′):

    A′ = (1 − Eff × (1 − PctValid)) × A
       = (1 − 0.9 × (1 − 0.7)) × A
       = 0.73A

3. Find the ratio of the new total power to the previous total power:
    Pwr′Tot / PwrTot = (A′ + 0.1A) / A
                     = (0.73A + 0.1A) / A
                     = 0.83

4. Final answer: the new power is 83% of the original power.

6.6.6 Clock Gating with Valid-Bit Protocol

A common technique to determine when a circuit has valid data is to use a valid-bit protocol. In section 6.6.6.1 we review the valid-bit protocol, and then in section 6.6.6.3 we add clock-gating circuitry to a circuit that uses the valid-bit protocol.

6.6.6.1 Valid-Bit Protocol

We need a mechanism to tell a circuit when to pay attention to its data inputs. For example, when is it supposed to decode and execute an instruction, or write data to a memory array?

[Schematic and waveform: a circuit with inputs clk, i_valid, i_data and outputs o_valid, o_data; parcels α, β, γ on i_data arrive with i_valid high and later appear on o_data with o_valid high]

i_valid: high when i_data has valid data; signifies whether the circuit should pay attention to or ignore the data.

o_valid: high when o_data has valid data; signifies whether the environment should pay attention to the output of the circuit.

For more on circuit protocols, see section 2.12.
Microscopic Analysis

Which clock edges are needed?

[Waveforms: clk, i_valid, and o_valid, marking the clock edges that are needed when i_valid or o_valid is high]
6.6.6.2 How Many Clock Cycles for Module?

Given a module with latency Lat, if the module receives a stream of NumPcls consecutive valid parcels, how many clock cycles must the clock-enable signal be asserted?

[Waveforms: i_valid, o_valid, and clk_en for several combinations of Latency and NumPcls]

    ti1     time of first i_valid
    to1     time of first o_valid
    tik     time of last i_valid
    tok     time of last o_valid
    tfirst  first clock cycle with clock enabled
    tlast   last clock cycle with clock enabled

Initial equations to describe the relationships between the different points in time:

    to1    = ti1 + Lat
    tok    = to1 + NumPcls − 1
    tfirst = ti1 + 1
    tlast  = tok + 1

To understand the −1 in the equation for tok, examine the situation when NumPcls = 1. With just one parcel going through the system, to1 = ti1 + Lat, so we have: tok = to1 + 1 − 1. In the equation for tlast, we need the +1 to clear the last valid bit.

Solve for the length of time that the clock must be enabled. The +1 at the end of this equation is because if tlast = tfirst, we would have the clock enabled for 1 clock cycle.
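These timing relationships can be evaluated directly (a minimal sketch; the function name is ours, not from the notes) and collapse to Lat + NumPcls, matching the algebra worked out next:

```python
def clk_en_len(lat, num_pcls):
    # Direct evaluation of the timing equations from the notes.
    t_i1 = 0                     # first valid input (arbitrary time origin)
    t_o1 = t_i1 + lat            # first valid output
    t_ok = t_o1 + num_pcls - 1   # last valid output
    t_first = t_i1 + 1           # enable starts the cycle after the first input
    t_last = t_ok + 1            # one extra cycle to clear the last valid bit
    return t_last - t_first + 1

# Matches the closed form Lat + NumPcls for a range of cases.
for lat in range(1, 11):
    for n in range(1, 7):
        assert clk_en_len(lat, n) == lat + n
print(clk_en_len(5, 3))  # 8
```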
    ClkEnLen = tlast − tfirst + 1
             = tok + 1 − (ti1 + 1) + 1
             = tok − ti1 + 1
             = to1 + NumPcls − 1 − ti1 + 1
             = to1 + NumPcls − ti1
             = ti1 + Lat + NumPcls − ti1
             = Lat + NumPcls

We are left with the formula that the number of clock cycles that the module's clock must be enabled is the latency through the module plus the number of consecutive parcels.

6.6.6.3 Adding Clock-Gating Circuitry

Before Clock Gating

[Schematic and waveform: a circuit with data_in, valid_in, and clk; parcels α, β, γ, δ appear on data_out with valid_out after the latency; data_out is don't-care or uninitialized when valid_out is low]

After Clock Gating: Circuitry

[Schematic: a Clock Enable State Machine watches valid_in and wakeup_in, asserts clk_en to gate hot_clk into cool_clk, and drives wakeup_out]

• hot_clk: clock that always toggles
• cool_clk: gated clock; sometimes toggles, sometimes stays low
• wakeup: alerts the circuit that valid data will be arriving soon
• clk_en: turns on cool_clk
After Clock Gating: New Signals

[Waveform: hot_clk, wakeup_in, valid_in, data_in (α, β, γ, δ), clk_en, cool_clk, valid_out, data_out (α, β, γ), wakeup_out]
6.6.7 Example: Pipelined Circuit with Clock Gating

Design a "clock enable state machine" for the pipelined component described below.

• capacitance of pipelined component = 200
• latency varies from 5 to 10 clock cycles, with an even distribution of latencies
• contains a maximum of 6 instructions (parcels of data)
• 60% of incoming parcels are valid
• average length of a continuous sequence of valid parcels is 80
• use input and output valid bits for wakeup
• leakage current is negligible
• short-circuit current is negligible
• LUTs have a capacitance of 1, flops have a capacitance of 2

The two factors affecting power are activity factor and capacitance.

1. Scenario: turned off and get one parcel.
   (a) Need to turn on and stay on until the parcel departs
   (b) idea #1 (parcel count):
       • count the number of parcels inside the module
       • keep the clocks toggling if there are non-zero parcels
   (c) idea #2 (cycle count):
       • count the number of clock cycles since the last valid parcel entered the module
       • once we hit 10 clock cycles without any valid parcels entering, we know that all parcels have exited
       • keep the clocks toggling if the counter is less than 10

2. Scenario: constant stream of parcels
   (a) parcel count would require looking at the input and output streams and conditionally incrementing or decrementing the counter
   (b) cycle count would keep resetting the counter

Waveforms

[Waveform over 24 clock cycles: i_valid, o_valid, parcel_count, and parcel_clk_en for the parcel-count scheme]
[Waveform over 24 clock cycles: i_valid, o_valid, cycle_count (0 1 2 0 0 0 1 2 3 4 5 0 1 2 3 4 5 6 7 8 9 10), and cycle_clk_en for the cycle-count scheme]

Outline:
1. sketch out the circuitry for the parcel count and cycle count state machines
2. estimate the capacitance of each state machine
3. estimate the activity factor of the main circuit, based on behaviour

Parcel Count Design

We need to count (0..6) parcels, therefore we need 3 bits for the counter. The counter must be able to increment and decrement. Equations for the counter action (increment/decrement/no-change):

    i_valid   o_valid   action
       0         0      no change
       0         1      decrement
       1         0      increment
       1         1      no change
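The counter action table can be sketched as a small update function (a sketch; the names are ours, not from the notes):

```python
def next_parcel_count(count, i_valid, o_valid):
    # Increment when a parcel enters without one leaving, decrement when
    # one leaves without one entering, otherwise no change (0..6 range,
    # so a 3-bit counter suffices in hardware).
    if i_valid and not o_valid:
        return count + 1
    if o_valid and not i_valid:
        return count - 1
    return count

# The clock stays enabled while any parcels are inside the module.
count = 0
trace = []
for i_v, o_v in [(1, 0), (1, 0), (0, 0), (1, 1), (0, 1), (0, 1)]:
    count = next_parcel_count(count, i_v, o_v)
    trace.append(count)
print(trace)             # [1, 2, 2, 2, 1, 0]
clk_en = [c > 0 for c in trace]
```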
6.7 Power Problems

P6.1 Short Answers

P6.1.1 Power and Temperature

As temperature increases, does the power consumed by a typical combinational circuit increase, stay the same, or decrease?

P6.1.2 Leakage Power

The new vice president of your company has set up a contest for ideas to reduce leakage power in the next generation of chips that the company fabricates. The prize for the person who submits the suggestion that makes the best tradeoff between leakage power and other design goals is to have a door installed on their cube. What is your door-winning idea, and what tradeoffs will your idea require in order to achieve the reduction in leakage power?

P6.1.3 Clock Gating

In what situations could adding clock gating to a circuit increase power consumption?

P6.1.4 Gray Coding

What are the tradeoffs in implementing a program counter for a microprocessor using Gray coding?

P6.2 VLSI Gurus

The VLSI gurus at your company have come up with a way to decrease the average rise and fall time (0-to-1 and 1-to-0 transitions) for signals. The current value is 1 ns. With their fabrication tweaks, they can decrease this to 0.85 ns.

P6.2.1 Effect on Power

If you implement their suggestions, and make no other changes, what effect will this have on power? (NOTE: Based on the information given, be as specific as possible.)
P6.2.2 Critique

A group of wannabe performance gurus claim that the above optimization can be used to improve performance by at least 15%. Briefly outline what their plan probably is, critique the merits of their plan, and describe any effect their performance optimization will have on power.

P6.3 Advertising Ratios

One day you are strolling the hallways in search of inspiration, when you bump into a person from the marketing department. The marketing department has been out surfing the web and has noticed that companies are advertising the MIPs/mm2, MIPs/Watt, and Watts/cm3 of their products. This wide variety of different metrics has confused them.

Explain whether each metric is a reasonable metric for customers to use when choosing a system. If the metric is reasonable, say whether "bigger is better" (e.g. 500 MIPs/mm2 is better than 20 MIPs/mm2) or "smaller is better" (e.g. 20 MIPs/mm2 is better than 500 MIPs/mm2), and which one type of product (cell phone, desktop computer, or compute server) the metric is most relevant to.

• MIPs/mm2
• MIPs/Watt
• Watts/cm3

P6.4 Vary Supply Voltage

As the supply voltage is scaled down (reduced in value), the maximum clock speed that the circuit can run at decreases. The scaling down of supply voltage is a popular technique for minimizing power. The maximum clock speed is related to the supply voltage by the following equation:

    MaxClockSpeed ∝ (VoltSup − VoltThresh)² / VoltSup

where VoltSup is the supply voltage and VoltThresh is the threshold voltage. With a supply voltage of 3 V and a threshold voltage of 0.8 V, the maximum clock speed is measured to be 200 MHz. What will the maximum clock speed be with a supply voltage of 1.5 V?
P6.5 Clock Speed Increase Without Power Increase

The following are given:

• You need to increase the clock speed of a chip by 10%
• You must not increase its dynamic power consumption
• The only design parameter you can change is supply voltage
• Assume that short-circuiting current is negligible

P6.5.1 Supply Voltage

How much do you need to decrease the supply voltage by to achieve this goal?

P6.5.2 Supply Voltage

What problems will you encounter if you continue to decrease the supply voltage?

P6.6 Power Reduction Strategies

For each low-power approach described below, identify which component(s) of the power equation is (are) being minimized and/or maximized:

P6.6.1 Supply Voltage

Designers scaled down the supply voltage of their ASIC.

P6.6.2 Transistor Sizing

The transistors were made larger.

P6.6.3 Adding Registers to Inputs

All inputs to functional units are registered.

P6.6.4 Gray Coding

Gray coding is used for address signals.
P6.7 Power Consumption on New Chip

While you are eating lunch at your regular table in the company cafeteria, a vice president sits down and starts to talk about the difficulties with a new chip. The chip is a slight modification of an existing design that has been ported to a new fabrication process. Earlier that day, the first sample chips came back from fabrication. The good news is that the chips appear to function correctly. The bad news is that they consume about 10% more power than had been predicted. The vice president explains that the extra power consumption is a very serious problem, because power is the most important design metric for this chip. The vice president asks you if you have any idea of what might cause the chips to consume more power than predicted.

P6.7.1 Hypothesis

Hypothesize a likely cause for the surprisingly large power consumption, and justify why your hypothesis is likely to be correct.

P6.7.2 Experiment

Briefly describe how to determine if your hypothesized cause is the real cause of the surprisingly large power consumption.

P6.7.3 Reality

The vice president wants to get the chips out to market quickly and asks you if you have any ideas for reducing their power without changing the design or fabrication process. Describe your ideas, or explain why her request is infeasible.
Chapter 7

Fault Testing and Testability

7.1 Faults and Testing

7.1.1 Overview of Faults and Testing

7.1.1.1 Faults

During manufacturing, faults can occur that make the physical product behave incorrectly.

Definition: A fault is a manufacturing defect that causes a wire, poly, diffusion, or via to either break or connect to something it shouldn't.

[Figure: good wires, shorted wires, an open wire]

7.1.1.2 Causes of Faults

• Fabrication process (initial construction is bad)
    – chemical mix
    – impurities
    – dust
• Manufacturing process (damage during construction)
    – handling
        ∗ probing
        ∗ cutting
        ∗ mounting
    – materials
        ∗ corrosion
        ∗ adhesion failure
        ∗ cracking
        ∗ peeling

7.1.1.3 Testing

Definition: Testing is the process of checking that the manufactured wafer/chip/board/system has the same functionality as the simulations.

7.1.1.4 Burn In

Some chips that come off the manufacturing line will work for a short period of time and then fail.

Definition: Burn-in is the process of subjecting chips to extreme conditions (high and low temperatures, high and low voltages, high and low clock speeds) before and during testing. The purpose is to cause (and catch) failures in chips that would pass a normal test, but fail in early use by customers.

[Figure: a soon-to-break wire]

The hope is that the extreme conditions will cause chips to break that would otherwise have broken in the customer's system soon after arrival. The trick is to create conditions that are extreme enough that bad chips will break, but not so extreme as to cause good chips to break.

7.1.1.5 Bin Sorting

Each chip (or wafer) is run at a variety of clock speeds. The chips are grouped and labeled (binned) by the maximum clock frequency at which they will work reliably. For example, chips coming off of the same production line might be labelled as 800MHz, 900MHz, and 1000MHz.

Overclocking is taking a chip rated at n MHz and running it at 1.x × n MHz. (Sure, your computer often crashes and loses your assignment, but just think how much more productive you are when it is working...)
7.1.1.6 Testing Techniques

Scan Testing or Boundary Scan Testing (BST, JTAG)
• Load a test vector from the tester into the chip
• Run the chip on the test data
• Unload the result data from the chip to the tester
• Compare the results from the chip against those produced by simulation
• If the results are different, then the chip was not manufactured correctly

Built In Self Test (BIST)
• Build circuitry on chip that generates tests and compares actual and expected results

IDDQ Testing
• Measure the quiescent current between VDD and GND.
• Variations from expected values indicate faults.

Challenges

The challenges in testing:
• test circuitry consumes chip area
• test circuitry reduces performance
• decrease the fault escapee rate of the product that ships while having minimal impact on production cost and chip performance
• an external tester can only look at I/O pins
• the ratio of internal signals to I/O pins is increasing
• some faults will only manifest themselves at high clock frequencies

"The crux of testing is to use yesterday's technology to find faults in tomorrow's chips." Agilent engineer at ARVLSI 2001.

7.1.1.7 Design for Testability (DFT)

Scan testing and self-testing require adding extra circuitry to chips. Design for test is the process of adding this circuitry in a disciplined and correct manner. A hot area of research, which is becoming mainstream practice, is developing synthesis tools to automatically add the testing circuitry.
7.1.2 Example Problem: Economics of Testing

Given information:

• The ACHIP costs $10 without any testing
• Each board uses one ACHIP (plus lots of other chips that we don't care about)
• 68% of the manufactured ACHIPs do not have any faults
• For the ACHIP, it costs $1 per chip to catch half of the faults
• Each 50% reduction in fault escapees doubles the cost of testing (intuition: it doubles the number of tests that are run)
• If board-level testing detects a bad ACHIP, it costs $200 to replace the ACHIP
• Board-level testing will detect 100% of the faults in an ACHIP

Question: What escapee fault rate will minimize the cost of the ACHIP?

Answer:

    TotCost = NoTestCost + TestCost + EscapeeProb × ReplaceCost

    NoTestCost   TestCost   EscapeeProb   ReplaceCost           TotCost
    $10          $0         32%           200 × 0.32  = $64     $74
    $10          $1         16%           200 × 0.16  = $32     $43
    $10          $2          8%           200 × 0.08  = $16     $28
    $10          $4          4%           200 × 0.04  = $8      $22
    $10          $8          2%           200 × 0.02  = $4      $22
    $10          $16         1%           200 × 0.01  = $2      $28
    $10          $32         0.5%         200 × 0.005 = $1      $43

The lowest total cost is $22. There are two options with a total cost of $22: $4 of testing and $8 of testing. Economically, we can choose either option.

For high-volume, small-area chips, testing can consume more than 50% of the total cost.
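The cost table can be reproduced programmatically (a minimal sketch; the names are ours, not from the notes):

```python
def total_cost(test_cost, escapee_prob, base=10, replace=200):
    # TotCost = NoTestCost + TestCost + EscapeeProb * ReplaceCost
    return base + test_cost + escapee_prob * replace

# Start at 32% escapees for $0 of testing; each doubling of test cost
# halves the escapee rate.
options = []
test_cost, escapees = 0, 0.32
for _ in range(7):
    options.append((test_cost, escapees, total_cost(test_cost, escapees)))
    test_cost = 1 if test_cost == 0 else test_cost * 2
    escapees /= 2

for tc, esc, tot in options:
    print(f"${tc:>2} testing, {esc:.1%} escapees -> ${tot:.0f}")

best = min(options, key=lambda o: o[2])
print(round(best[2]))  # 22, the minimum (reached at $4 of testing; $8 ties)
```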
7.1.3 Physical Faults

7.1.3.1 Types of Physical Faults

[Figure: a good circuit (wires a–c and b–d) and the bad circuits that faults produce: an open; a wired-AND bridging short; a wired-OR bridging short; a stronger-wins bridging short (b is stronger); a short to VDD; and a short to GND.]

7.1.3.2 Locations of Faults

Each segment of wire, poly, diffusion, via, etc. is a potential fault location. Different segments affect different gates in the fanout. A potential fault location is a segment or segments where a fault at any position affects the same set of gates in the same way.

[Figure: a signal b with its fanout, marking three different locations for potential faults and the OK and BAD fault positions.]
When working with faults, we work with wire segments, not signals. In the circuit below, there are 8 different wire segments (L1–L8). Each wire segment corresponds to a logically distinct fault location. All physical faults on a segment affect the same set of signals, so they are grouped together into a "logical fault".

If a signal has a fanout of 1, then there is one wire segment. A signal with a fanout of n, where n > 1, has at least n + 1 wire segments: one for the source signal and one for each gate of fanout. As shown in section 7.1.3.3, the layout of the circuit can produce more than n + 1 segments.

[Figure: circuit with inputs a, b, c and output z; signal b fans out to two gates, giving the 8 wire segments L1–L8.]

7.1.3.3 Layout Affects Locations

[Figure: two possible layouts of the same schematic (signals a–d, e–i).]

For the signal b in the schematic above, we can have either four or five different locations for potential faults, depending upon how the circuit is laid out.

7.1.3.4 Naming Fault Locations

Two ways to name a fault location:

pin-fault model   Faults are modelled as occurring on the input and output pins of gates.
net-fault model   Faults are modelled as occurring on segments of wires.

In E&CE 327, we'll use the net-fault model, because it is simpler to work with and is closer to what actually happens in hardware.

7.1.4 Detecting a Fault

To detect a fault, we compare the actual output of the circuit against the expected value. To find a test vector that will detect a fault:
1. build a Boolean equation (or Karnaugh map) of the correct circuit
2. build a Boolean equation (or Karnaugh map) of the faulty circuit
3. compare the equations (or Karnaugh maps); regions of difference represent test vectors that will detect the fault

7.1.4.1 Which Test Vectors will Detect a Fault?

Question: For the good circuit and faulty circuit shown below, which test vectors will detect the fault?

[Figure: a good circuit and a faulty circuit, each with inputs a, b, c and internal signals d, e.]

Answer:

    a b c   good  faulty
    0 0 0     0     0
    0 0 1     1     1
    0 1 0     0     0
    0 1 1     1     1
    1 0 0     0     0
    1 0 1     1     1
    1 1 0     1     0    ←−
    1 1 1     1     1

The only test vector that will detect the fault in the circuit is 110.

Sometimes multiple test vectors will catch the same fault. Sometimes a single test vector can catch multiple faults.

[Figure: another fault in the same circuit; for input 110 the good output is 1 and the faulty output is 0.]

The test vector 110 can catch both this fault and the previous one.

With testing, we are primarily concerned with determining whether a circuit works correctly or not — detecting whether there is a fault. If the circuit has a fault, we usually do not care where
the fault is — diagnosing the fault. To detect the two faults above, the test vector 110 is sufficient, because if either of the two faults is present, 110 will detect that the circuit does not work correctly.

Note: Detect vs. diagnose
Testing detects faults. Testing does not diagnose which fault occurred. If we have a higher-than-expected failure rate for a chip, we might want to investigate the cause of the failures, and so would need to diagnose the faults. In this case, we might do more exhaustive analysis to see which test vectors pass and which fail. We might also need to examine the chip physically with probes to test a few individual wires or transistors. This is done by removing the top layers of the chip and using very small and very sensitive probes, analogous to how we use a multimeter to test a circuit on a breadboard.

7.1.5 Mathematical Models of Faults

Goal: develop a reliable and predictable technique for detecting faults in circuits.

Observations:
• The possible faults in a circuit are dependent upon the physical layout of the circuit.
• There is a very wide variety of possible faults.
• A single test vector can catch many different faults.

Need: a mathematical model for faults that is abstracted from the complexities of circuit layout and the plethora of possible faults, yet still detects most or all possible faults.

7.1.5.1 Single Stuck-At Fault Model

Although there are many different bad behaviours that faults can lead to, the simple model of single stuck-at faults has proven very capable of finding real faults in real circuits.

Two simplifying assumptions:
1. A maximum of one fault per tested circuit (hence "single")
2. All faults are either:
   (a) stuck-at 1: short to VDD
   (b) stuck-at 0: short to GND
   (hence "stuck-at")
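The truth-table comparison of section 7.1.4.1 can be automated. This is a sketch in our own code: the good circuit is modelled as z = ab + c and the faulty circuit as z = c, equations read off the truth table above (the schematics themselves are not reproduced here, so treat these functions as assumptions).

```python
from itertools import product

# Good and faulty circuits inferred from the truth table in 7.1.4.1.
def good(a, b, c):
    return (a & b) | c

def faulty(a, b, c):
    return c

def detecting_vectors(good_fn, faulty_fn, num_inputs=3):
    """Region of difference: test vectors on which the circuits disagree."""
    return [v for v in product([0, 1], repeat=num_inputs)
            if good_fn(*v) != faulty_fn(*v)]

print(detecting_vectors(good, faulty))   # only 110 detects this fault
```

The same helper works for any pair of good/faulty functions: the list it returns is exactly the "region of disagreement" that the algorithm in section 7.1.6.1 computes with Karnaugh maps.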
Example of Stuck-At Faults ..........................................................

[Figure: circuit with inputs a, b, c, d and output i, with 12 fault locations L1–L12.]

12 fault locations × 2 types of faults = 24 possible faults. If we restrict ourselves to the single stuck-at fault model, then we have 24 faulty circuits to consider.

If we allowed multiple faults, then the circuit above could have up to 12 different faults. How many faulty circuits would need to be considered? Each of the 12 locations has three possible values: good, stuck-at-1, stuck-at-0. Therefore, 3^12 ≈ 5.3 × 10^5 different circuits would need to be considered! If we allowed multiple faults of 4 different types at 12 different locations, then we would have 5^12 − 1 ≈ 2.4 × 10^8 different faulty circuits to consider!

There are 2^(2^4) ≈ 6.6 × 10^4 different Boolean functions of four inputs (a Karnaugh map of four variables is a grid of 2^4 squares; each square is either 0 or 1, which gives 2^(2^4) different combinations). So there are 6.6 × 10^4 possible equations for circuits with four inputs and one output. This is much less than the number of faulty circuit models that would be generated by the simultaneous-faults-at-every-location models. So both of the simultaneous-faults-at-every-location models are too extreme.

7.1.6 Generate Test Vector to Find a Mathematical Fault

Faults are detected by stimulating circuits (the real, manufactured circuit, not a simulation!) with test vectors and checking that the real circuit gives the correct output. Standard practice in testing is to test circuits for single stuck-at faults. Mathematics and empirical evidence demonstrate that if a circuit appears to be free of single stuck-at faults, then it is probably also free of other types of faults. That is, testing a circuit for single stuck-at faults will also detect many other types of faults and will often detect multiple faults.

7.1.6.1 Algorithm

1. compute Karnaugh map for correct circuit
2. compute Karnaugh map for faulty circuit
3. find region of disagreement
4. any assignment in the region of disagreement is a test vector that will detect the fault
5. any assignment outside of the region of disagreement will result in the same output on both the correct and faulty circuits

7.1.6.2 Example of Finding a Test Vector

[Figure: a good circuit and a faulty circuit (inputs a, b, c; internal signals d, e), their Karnaugh maps, and the difference between the good and faulty circuits.]

7.1.7 Undetectable Faults

Not all faults are detectable.

1. If a circuit is irredundant, then all single stuck-at faults can be detected. A redundant circuit is one where one or more gates can be removed without affecting the functional behaviour.
2. If we are not trying to find all of the faults in a circuit, then a fault that we aren't looking for can mask a fault that we are looking for.

7.1.7.1 Redundant Circuitry

Some faults are undetectable. Undetectable stuck-at faults are located in redundant parts of a circuit.
Timing Hazards ......................................................................

[Figure: static hazard and dynamic hazard waveforms.]

Timing hazards are often removed by adding redundant circuitry.

Redundant Circuitry .................................................................

[Figure: an irredundant circuit (inputs a, b, c; product terms e and f; output g) and an illustration of its timing hazard.]

The glitch on g is caused because the AND gate for e turns off before f turns on.

Question: Add one or more gates to the circuit so that the static hazard is guaranteed to be prevented, independent of the delay values through the gates.

In this sum-of-products style circuit, each AND gate corresponds to a cube in the Karnaugh map. We can prevent the transition from 111 to 101 from causing a glitch by adding a cube that covers the two squares of the transition. This cube is 1-1, which is the black cube in the Karnaugh map below and the signal h in the redundant circuit below.

[Figure: Karnaugh maps with the added cube, and the redundant circuit with the added AND gate h (wire segment L1). There are no more timing hazards.]
Question: Has the redundant circuitry introduced any undetectable faults? If so, identify an undetectable fault.

Answer: L1@0 is undetectable.

    Correct circuit:  ab + bc
    Faulty circuit:   ab + bc + ac
    With L1@0, ac −→ 0, giving ab + bc + 0 = ab + bc,
    the same equation as the correct circuit.

A stuck-at fault in redundant circuitry will not affect the steady-state behaviour of the circuit, but could allow timing glitches to occur.

7.1.7.2 Curious Circuitry and Fault Detection

The two circuits below have the same steady-state behaviour.

[Figure: two circuits with inputs a, b, c and output z; the first has two additional XOR gates, with wire segments labelled L1, L2, L3.]

Because the two circuits have the same behaviour, it might appear that the leftmost two XOR gates are redundant. However, these gates are not redundant. In the test for redundancy, when we remove a gate, we delete it; we do not replace it with other circuitry. Curiously, the stuck-at fault at L1 is undetectable, but faults at either L2 or L3 are detectable.

[Table: for the faults L2@0 and L2@1, the faulty equation a ⊕ (b ⊕ c), its Karnaugh map, and the difference with the correct circuit.]
7.2 Test Generation

7.2.1 A Small Example

Throughout this section we will use the circuit below, which computes z = ab + bc:

[Figure: circuit with inputs a, b, c; b fans out into the two AND gates (segments L4 and L5); the AND outputs feed an OR gate whose output is z. Karnaugh map of ab + bc.]

At first, we will consider only the following faults: L2@1, L4@1, L5@1.

       fault    faulty eqn    test vectors
    1) L2@1     a + c         101, 001, 100
    2) L4@1     a + bc        101, 100
    3) L5@1     ab + c        101, 001

(For each fault, the difference between the Karnaugh maps of the faulty equation and the correct circuit gives the test vectors.)

Choose Test Vector ..................................................................

If we choose 101, we can detect all three faults. Choosing either 001 or 100 will miss one of the three faults.

7.2.2 Choosing Test Vectors

The goal of test vector generation is to find the smallest set of test vectors that will detect the faults of interest. Test vector generation requires analyzing the faults. We can simplify the task of fault analysis by reducing the number of faults that we have to analyze. Smith has examples of this in Figures 14.13 and 14.14.
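The table above can be double-checked by simulating z = ab + bc with each stuck-at-1 fault injected on b's source (L2) or on its fanout branches (L4, L5). The segment names follow the figure; the modelling style is our own sketch, not code from the notes.

```python
from itertools import product

def z(a, b, c, fault=None):
    """z = ab + bc with an optional stuck-at-1 fault injected.

    L2 is b at its source; L4 and L5 are b's fanout branches into
    the AND gates that compute ab and bc respectively."""
    L2 = 1 if fault == "L2@1" else b
    L4 = 1 if fault == "L4@1" else L2    # b input of the AND computing ab
    L5 = 1 if fault == "L5@1" else L2    # b input of the AND computing bc
    return (a & L4) | (L5 & c)

def detecting_vectors(fault):
    """Vectors on which the faulty circuit disagrees with the good one."""
    return {v for v in product([0, 1], repeat=3)
            if z(*v) != z(*v, fault=fault)}

for f in ["L2@1", "L4@1", "L5@1"]:
    print(f, sorted(detecting_vectors(f)))
```

Intersecting the three detecting sets confirms that 101 is the single vector that detects all three faults.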
7.2.2.1 Fault Domination

       fault    faulty eqn    test vectors
    1) L5@1     ab + c        101, 001
    2) L6@1     1             101, 001, 100, 010, 000

Any test vector that detects L5@1 will also detect L6@1: L5@1 is detected by 101 and 001, each of which will detect L6@1. L6@1 does not dominate L5@1, because there is at least one test vector that detects L6@1 but does not detect L5@1 (e.g. each of 100, 010, and 000 detects L6@1 but not L5@1).

Definition dominates: f1 dominates f2: any test vector that detects f1 will also detect f2.

When choosing test vectors, we can ignore the dominated fault, but must keep the dominant fault. L5@1 dominates L6@1, so when choosing test vectors we can ignore L6@1 and just include L5@1.

Question: To detect both L5@1 and L6@1, can we ignore one of the faults?

Answer: We can ignore L6@1, because L5@1 dominates L6@1: each test vector that detects L5@1 also detects L6@1.

Question: What would happen if we ignored the "wrong" fault?

Answer: If we ignored L5@1 but kept L6@1, we could choose any of the 5 test vectors that detect L6@1. If we chose 100, 010, or 000 as our test vector to detect L6@1, then we would not detect L5@1.
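Domination is just a subset test on detecting sets: f1 dominates f2 exactly when detect(f1) ⊆ detect(f2). A sketch in our own code, using the faulty equations from the table above:

```python
from itertools import product

VECTORS = list(product([0, 1], repeat=3))

def good(a, b, c):
    # the running example circuit, z = ab + bc
    return (a & b) | (b & c)

def detects(faulty_eqn):
    """Set of test vectors on which the faulty circuit disagrees with good."""
    return {v for v in VECTORS if faulty_eqn(*v) != good(*v)}

def dominates(f1, f2):
    """f1 dominates f2: every vector that detects f1 also detects f2."""
    return detects(f1) <= detects(f2)

L5_at_1 = lambda a, b, c: (a & b) | c    # faulty equation for L5@1
L6_at_1 = lambda a, b, c: 1              # faulty equation for L6@1

print(dominates(L5_at_1, L6_at_1))   # True: L6@1 can be ignored
print(dominates(L6_at_1, L5_at_1))   # False
```

Because the subset relation is one-directional, ignoring the dominant fault instead of the dominated one can (as the question above shows) leave a fault undetected.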
7.2.2.2 Fault Equivalence

       fault    faulty eqn
    1) L1@1     b
    2) L3@1     b

The two faults above are "equivalent".

Definition fault equivalence: f1 is equivalent to f2: f1 and f2 are detected by exactly the same set of test vectors. That is, all of the test vectors that detect f1 will also detect f2, and vice versa. When choosing test vectors we can ignore one of the faults and just include the other.

7.2.2.3 Gate Collapsing

A controlling value on an input to a gate forces the output to the controlled value. If a stuck-at fault on an input causes the input to have a controlling value, then that fault is equivalent to the output having a stuck-at fault of the controlled value. For example, a 1 on an input to an OR gate will force the output to be 1. So, a stuck-at-1 fault on any input to an OR gate is equivalent to a stuck-at-1 fault on the output of the gate, and is equivalent to a stuck-at-1 fault on any other input to the OR gate.

Definition gate collapsing: the technique of looking at the functionality of a gate and finding equivalent faults between inputs and outputs.

Sets of collapsible faults for common gates:
• AND: stuck-at-0 (@0) on any input ≡ @0 on the output
• OR: stuck-at-1 (@1) on any input ≡ @1 on the output

Question: What is the set of collapsible faults for a NAND gate?
Answer: To determine the collapsible faults, treat the NAND gate as an AND gate followed by an inverter, then invert the faults on the output of the gate:
• NAND: stuck-at-0 (@0) on any input ≡ stuck-at-1 (@1) on the output

7.2.2.4 Node Collapsing

Note: Node collapsing is relevant only for the pin-fault model.

When two segments affect the same set of gates (ignoring any gates between the two segments), faults on the two segments can be collapsed. With an inverter or buffer, the segment on the input affects the same gates as the segment on the output. Therefore, faults on the input and output segments are equivalent.

Sets of collapsible faults for nodes:
• NOT: @1 on the input ≡ @0 on the output
• NOT: @0 on the input ≡ @1 on the output

With the net-fault model, which is the one we are using in E&CE 327, inverters and buffers are the only gates where node collapsing is relevant. With the pin-fault model, where faults are modelled as occurring on the pins of gates, there are other instances where node collapsing can be used.

7.2.2.5 Fault Collapsing Summary

When calculating the test vectors to detect a set of faults, apply the fault collapsing techniques of:
• gate collapsing
• node collapsing (if using the pin-fault model)
• general fault equivalence (intelligent collapsing)
• fault domination
to reduce the number of faults that you must examine.

Fault collapsing is an optimization. If you skip this step, you will still get the correct answer; it will just take more work, because in each step you will analyze a greater number of faults than if you had done fault collapsing.
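The gate-collapsing rules above can be checked by brute force: injecting a stuck-at-1 on either input or on the output of an OR gate embedded in a circuit yields the identical faulty truth table, so the three faults collapse into one. A sketch (the surrounding circuit z = (a + b)·c is our own choice of example):

```python
from itertools import product

def z(a, b, c, fault=None):
    """z = (a or b) and c, with an optional stuck-at-1 fault on the OR gate."""
    in_a = 1 if fault == "in_a@1" else a
    in_b = 1 if fault == "in_b@1" else b
    or_out = in_a | in_b
    if fault == "out@1":
        or_out = 1
    return or_out & c

def behaviour(fault=None):
    """Full truth table of the (possibly faulty) circuit."""
    return tuple(z(*v, fault=fault) for v in product([0, 1], repeat=3))

# All three stuck-at-1 faults on the OR gate produce the same faulty
# circuit (z = c here), so they are equivalent and collapse to one fault.
print(behaviour("in_a@1") == behaviour("in_b@1") == behaviour("out@1"))
```

The same experiment with stuck-at-0 faults would show they do not collapse on an OR gate, since 0 is not a controlling value for OR.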
7.2.3 Fault Coverage

Definition fault coverage: the percentage of detectable faults that are detected by a set of test vectors.

    FaultCoverage = DetectedFaults / DetectableFaults

Some people's definition of fault coverage has a denominator of AllPossibleFaults, not just those that are detectable. If the denominator is AllPossibleFaults and a circuit has 100% single stuck-at fault coverage with a suite of test vectors, then each stuck-at fault in the circuit can be detected by one or more vectors in the suite. This also means that the circuit has no undetectable faults, and hence no redundant circuitry.

Even if the denominator is AllPossibleFaults, it is possible that achieving 100% coverage for single stuck-at faults will allow defective chips to pass if they have faults that are not stuck-at-1 or stuck-at-0. I think, but haven't seen a proof, that achieving 100% single stuck-at coverage will detect all combinations of multiple stuck-at faults. But if you do not achieve 100% coverage, then a stuck-at fault that you aren't testing for can mask (hide) a fault that you are testing for.

NOTE: In Smith's book, undetectable faults don't hurt your coverage. This is not universally true.

7.2.4 Test Vector Generation and Fault Detection

There are two ways to generate vectors and check results: built-in tests and scan testing. Both require:
• generating test vectors
• overriding the normal datapath to send test vectors, rather than normal inputs, as inputs to the flops
• comparing the outputs of the flops to the expected result

7.2.5 Generate Test Vectors for 100% Coverage

In this section we will find the test vectors to achieve 100% coverage of single stuck-at faults for the circuit of the day. We will use a simple algorithm; there are much more sophisticated algorithms that are more efficient.
The problem of test vector generation is often called Automatic Test Pattern Generation (ATPG) and continues to be an active area of research. A trendy idea is to use genetic algorithms (inspired by how DNA works) to generate test vectors that catch the maximum number of faults. The "classic" algorithm is the D algorithm, invented by Roth in 1966 (Smith 14.5.1, 14.5.2). An enhanced version is the Path-Oriented Decision Making algorithm (PODEM), which supports reconvergent fanout and was developed by Goel in 1981 (Smith 14.5.3).

[Figure 7.1: Example circuit with fault locations and Karnaugh map. z = ab + bc; inputs a (L1), b (L2), c (L3); b fans out to L4 and L5; the AND-gate outputs are L6 and L7; the OR-gate output is L8 = z.]

7.2.5.1 Collapse the Faults

Initial circuit with potential faults: each of the segments L1–L8 may be stuck-at-0 or stuck-at-1 (16 potential faults).

Gate collapsing:
• first AND gate: L1@0 ≡ L4@0 ≡ L6@0
• second AND gate: L3@0 ≡ L5@0 ≡ L7@0
• OR gate: L6@1 ≡ L7@1 ≡ L8@1
Node Collapsing .....................................................................

Node collapsing: none applicable (no inverters or buffers).

Remaining faults: L1@1, L2@0, L2@1, L3@1, L4@1, L5@1, L6@0, L7@0, L8@0, L8@1.

Intelligent Collapsing ..............................................................

Sometimes, after the regular forms of fault collapsing have been done, there will still be some sets of equivalent faults in the circuit. It is usually beneficial to quickly look for patterns or symmetries in the circuit that will indicate a set of potentially equivalent faults.

Intelligent collapsing:
• L2@0 ≡ L8@0: both result in the equation 0.
• L1@1 ≡ L3@1: both result in the equation b.

Remaining faults: L2@1, L3@1, L4@1, L5@1, L6@0, L7@0, L8@0, L8@1.
7.2.5.2 Check for Fault Domination

       fault    faulty eqn    notes
    1) L2@1     a + c         dominated by L4@1, L5@1
    2) L3@1     b
    3) L4@1     a + bc
    4) L5@1     ab + c
    5) L6@0     bc
    6) L7@0     ab
    7) L8@0     0             dominated by L6@0, L7@0
    8) L8@1     1             dominated by L2@1, L3@1, L4@1, L5@1
Remove dominated faults .............................................................

Dominated faults removed: L2@1, L8@0, L8@1.

Remaining faults:

       fault    faulty eqn
    1) L3@1     b
    2) L4@1     a + bc
    3) L5@1     ab + c
    4) L6@0     bc
    5) L7@0     ab

7.2.5.3 Required Test Vectors

If we have any faults that are detected by just one test vector, then we must include that test vector in our suite.

Definition required test vector: a test vector tv is required if there is a fault for which tv is the only test vector that will detect the fault.

Required vectors:
    L3@1  010
    L6@0  110
    L7@0  011

7.2.5.4 Faults Not Covered by Required Test Vectors

       fault    faulty eqn
    1) L4@1     a + bc
    2) L5@1     ab + c

The intersection of the two difference regions is 101. Choosing 101 detects both L4@1 and L5@1, so we add 101 to the suite of test vectors.

The final set of test vectors is: 010, 110, 011, 101.
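We can sanity-check that these four vectors really achieve 100% coverage by fault-simulating all 16 single stuck-at faults on the circuit. The segment numbering follows Figure 7.1; the code itself is our own sketch.

```python
from itertools import product

def z(a, b, c, fault=None):
    """z = ab + bc with segments L1..L8 and one optional stuck-at fault.

    fault is (segment, value), e.g. (4, 1) means L4@1."""
    def seg(n, v):
        return fault[1] if fault and fault[0] == n else v
    L1, L2, L3 = seg(1, a), seg(2, b), seg(3, c)
    L4, L5 = seg(4, L2), seg(5, L2)          # fanout branches of b
    L6, L7 = seg(6, L1 & L4), seg(7, L5 & L3)  # AND-gate outputs
    return seg(8, L6 | L7)                     # OR-gate output

SUITE = [(0, 1, 0), (1, 1, 0), (0, 1, 1), (1, 0, 1)]
FAULTS = [(n, v) for n in range(1, 9) for v in (0, 1)]

def detected(fault):
    """True if some vector in the suite distinguishes faulty from good."""
    return any(z(*v) != z(*v, fault=fault) for v in SUITE)

print(all(detected(f) for f in FAULTS))   # True: 100% coverage
```

Every one of the 16 faults in this circuit is detectable, so the four-vector suite gives 100% coverage under either definition of the coverage denominator.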
7.2.5.5 Order to Run Test Vectors

The order in which the test vectors are run is important because it can affect how long a faulty chip stays in the tester before the chip's fault is detected. The first vector to run should be the one that detects the most faults.

Build a table of which faults each test vector will detect:

        fault      110  010  011  101
     1) L1@0        1
     2) L1@1             1
     3) L2@0        1         1
     4) L2@1                       1
     5) L3@0                  1
     6) L3@1             1
     7) L4@0        1
     8) L4@1                       1
     9) L5@0                  1
    10) L5@1                       1
    11) L6@0        1
    12) L6@1             1         1
    13) L7@0                  1
    14) L7@1             1         1
    15) L8@0        1         1
    16) L8@1             1         1
    Faults detected:  5    5    5    6

101 detects the most faults, so we should run it first.
This reduces the faults found by 010 from 5 to 2 (because L6@1, L7@1, and L8@1 will be found by 101). This leaves 110 and 011 with 5 faults each; we can run them in either order, then run 010. We settle on a final order for our test suite of: 101, 011, 110, 010.

7.2.5.6 Summary of Technique to Find and Order Test Vectors

1. identify all possible faults
2. gate collapsing
3. node collapsing
4. intelligent collapsing
5. fault domination
6. determine required test vectors
7. choose a minimal set of test vectors to detect the remaining faults
8. order test vectors based on the number of faults detected (NOTE: when iterating through this step, take into account the faults detected by earlier test vectors)
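Step 8 above is a greedy loop: repeatedly run the vector that detects the most faults not yet covered by earlier vectors. A sketch in our own code (the tie between 110 and 011 is broken arbitrarily, so the middle two vectors may come out in either order):

```python
from itertools import product

def z(a, b, c, fault=None):
    # z = ab + bc with one optional stuck-at fault (segment, value)
    def seg(n, v):
        return fault[1] if fault and fault[0] == n else v
    L1, L2, L3 = seg(1, a), seg(2, b), seg(3, c)
    L4, L5 = seg(4, L2), seg(5, L2)
    L6, L7 = seg(6, L1 & L4), seg(7, L5 & L3)
    return seg(8, L6 | L7)

FAULTS = [(n, v) for n in range(1, 9) for v in (0, 1)]
SUITE = [(1, 1, 0), (0, 1, 0), (0, 1, 1), (1, 0, 1)]

def detects(vec):
    """Set of faults that this single vector detects."""
    return {f for f in FAULTS if z(*vec) != z(*vec, fault=f)}

# Greedy ordering: pick the vector covering the most still-undetected faults.
order, remaining, todo = [], set(FAULTS), list(SUITE)
while todo:
    best = max(todo, key=lambda v: len(detects(v) & remaining))
    todo.remove(best)
    order.append(best)
    remaining -= detects(best)
print(order)
```

As the notes argue, 101 comes first (6 faults) and 010 last (only 2 faults remain for it once 101 has run).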
7.2.5.7 Complete Analysis

In case you don't trust the fault collapsing analysis, here's the complete analysis.

        fault     faulty eqn    notes
     1) L1@0      bc
     2) L1@1      b
     3) L2@0      0             dominated by 1, 5
     4) L2@1      a + c         dominated by 8, 10
     5) L3@0      ab
     6) L3@1      b             same as 2
     7) L4@0      bc            same as 1
     8) L4@1      a + bc
     9) L5@0      ab            same as 5
    10) L5@1      ab + c
    11) L6@0      bc            same as 1
    12) L6@1      1             dominated by 8, 10
    13) L7@0      ab            same as 5
    14) L7@1      1             same as 12
    15) L8@0      0             same as 3
    16) L8@1      1             same as 12
7.2.6 One Fault Hiding Another

[Figure: the example circuit with wire segments L1–L8.]

Assume that we are not trying to detect all faults: L1 is viewed as not being at risk for faults, but L3 is at risk for faults.

[Figure: the circuit with only L1 and L3 labelled, in good and faulty versions.]

Problem: if L1 is stuck-at 1, the test vectors that normally detect L3@0 will not detect L3@0. In the presence of other faults, the set of test vectors to detect a fault will change.

    fault(s)       eqn
    L3@0           ab
    L1@1, L3@0     b

7.3 Scan Testing in General

Scan testing is based on the techniques described in section 7.2.5. The generation of test vectors and the checking of the results are done off-chip. In comparison, built-in self test (section 7.5) does test-vector generation and result checking on chip. Scan testing has the advantages of flexibility and reduced on-chip hardware, but increases the length of time required to run a test.

In scan testing, we want to individually drive and read every flop in the circuit. Even without using any I/O pins for testing purposes, chips are already I/O bound, so scan testing must be very frugal in its use of pins. Flops are connected together in "scan chains" with one input pin and one output pin.
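The masking effect of section 7.2.6 can be reproduced by allowing multiple simultaneous faults in the simulation. In this sketch (our own code, segment numbering as in Figure 7.1), 011 detects L3@0 on its own, but once L1@1 is also present the combined faulty circuit computes b, and 011 no longer fails.

```python
from itertools import product

def z(a, b, c, faults=()):
    """z = ab + bc, allowing several simultaneous stuck-at faults.

    faults is a sequence of (segment, value) pairs, e.g. [(3, 0)] for L3@0."""
    def seg(n, v):
        for fn, fv in faults:
            if fn == n:
                return fv
        return v
    L1, L2, L3 = seg(1, a), seg(2, b), seg(3, c)
    L4, L5 = seg(4, L2), seg(5, L2)
    L6, L7 = seg(6, L1 & L4), seg(7, L5 & L3)
    return seg(8, L6 | L7)

def detecting(faults):
    """Vectors on which the faulty circuit disagrees with the good circuit."""
    return {v for v in product([0, 1], repeat=3)
            if z(*v) != z(*v, faults=faults)}

print(detecting([(3, 0)]))           # L3@0 alone: detected by 011
print(detecting([(1, 1), (3, 0)]))   # with L1@1 present, 011 no longer works
```

This is exactly why less-than-100% coverage is risky: a fault you are not testing for (L1@1) can hide a fault you are testing for (L3@0).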
7.3.1 Structure and Behaviour of Scan Testing

[Figures: "Normal Circuit", the circuit under test connected to other circuits through registers data_in(3..0) and zeta_in(3..0); and "Circuit with Scan Chains Added", the same circuit with two scan chains threaded through the registers, controlled by mode0/scan_in0 and mode1/scan_in1 and observed on scan_out0 and scan_out1.]

7.3.2 Scan Chains

7.3.2.1 Circuitry in Normal and Scan Mode

[Figure: "Normal Mode", in which each mux selects the datapath input, so the flops load data_in and zeta_in and the circuit behaves as the original.]
[Figure: "Scan Mode", in which each mux selects the scan path, so the flops form shift registers from scan_in0/scan_in1 to scan_out0/scan_out1.]

7.3.2.2 Scan in Operation

[Figure: circuit under test with scan chains. Test sequence: load test vector (1 cycle per bit); run test vector through circuit; unload result (1 cycle per bit).]

Unload and Load at Same Time ........................................................

[Waveforms: after the first test, unloading the previous result on scan_out0/scan_out1 is overlapped with loading the next test vector on scan_in0/scan_in1 (1 cycle per bit), then the current test vector is run through the circuit. Sequence: load; run; unload, with unload and load overlapped.]
7.3.2.3 Scan in Operation with Example Circuit

[Figure: a circuit under test (inputs a, b, c, d; outputs y, z), shown alone and with scan test circuitry added: two scan chains controlled by mode0/scan_in0 and mode1/scan_in1, with outputs scan_out0 and scan_out1.]
[Waveforms: with mode0 in scan mode, the test vector is shifted in one bit per clock cycle. First δ is loaded into the head of the chain; then γ is loaded while δ shifts along.]
[Waveforms: loading continues: β is shifted in, then α, so the chain holds the complete test vector α, β, γ, δ.]
[Waveforms: mode0 switches to normal mode for one cycle to run the test vector; the test values α, β, γ, δ propagate through the combinational logic to the circuit outputs.]
[Waveforms: the results are flopped in, then mode0 returns to scan mode: the result bits are shifted out on scan_out0 while the next test vector (δ′, γ′, ...) is simultaneously shifted in on scan_in0.]
[Waveforms: the last result bits are shifted out as β′ and α′ complete the new test vector, which is then run in turn.]

7.3.3 Summary of Scan Testing

• Adding scan circuitry
  1. Registers around the circuit to be tested are grouped into scan chains
  2. Replace each flop with a mux + flop
  3. Flops and muxes are wired together into scan chains
  4. Each scan chain is connected to dedicated I/O pins for loading and unloading test vectors
• Running test vectors
  1. Put scan chain in "scan" mode
  2. Load in test vector (one element of the vector per clock cycle)
  3. Put scan chain in "normal" mode
  4. Run circuit for one clock cycle to load the result of the test into the flops
  5. Unload the results of the current test vector while simultaneously loading in the next test vector (one element of the vector per clock cycle)

7.3.4 Time to Test a Chip

If the length (number of flops) of a scan chain is n, then it takes 2n + 1 clock cycles to run a single test: n clock cycles to scan in the test vector, 1 clock cycle to execute the test vector, and n cycles to scan out the results. Once the results are scanned out, they can be compared to the expected results for a correctly working circuit. If we run 2 or more tests (and chips are generally subjected to hundreds of thousands of tests), then we speed things up by scanning in the next test vector while we scan out the previous result.

    ScanLength = number of flip-flops in a scan chain
    NumVectors = number of test vectors in test suite
    TimeScan   = number of clock cycles to run test suite
               = NumVectors × (ScanLength + 1) + ScanLength

7.3.4.1 Example: Time to Test a Chip

An 800MHz chip has scan chains of length 20,000 bits, 18,000 bits, 21,000 bits, 22,000 bits, and two of 15,000 bits. 500,000 test vectors are used for each scan chain. The tests are run at 80% of full speed.

Question: Calculate the total test time.

Answer: We can load and unload all of the scan chains at the same time, so the time will be limited by the longest chain (22,000 bits).
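The cycle-count formula can be wrapped in a small helper and applied to the question above. This is a sketch with our own function names, not code from the notes.

```python
def scan_test_cycles(num_vectors, scan_length):
    """Clock cycles to run a test suite on one scan chain.

    After the first scan_length-cycle load, each vector costs
    scan_length cycles of overlapped unload-and-load plus 1 cycle
    to run: num_vectors * (scan_length + 1) + scan_length."""
    return num_vectors * (scan_length + 1) + scan_length

def scan_test_time(num_vectors, scan_length, clock_hz, speed_fraction=1.0):
    """Wall-clock seconds, given the tester runs at a fraction of full speed."""
    return scan_test_cycles(num_vectors, scan_length) / (clock_hz * speed_fraction)

# The example chip: longest chain 22,000 bits, 500,000 vectors,
# 800 MHz clock, tested at 80% of full speed.
t = scan_test_time(500_000, 22_000, 800e6, 0.80)
print(round(t))   # 17 (seconds)
```

The dominant term is NumVectors × (ScanLength + 1), which is why long scan chains are often split into several shorter parallel chains.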
For the first test vector, we have to load it in, run the circuit for one clock cycle, then unload the result. Loading the second test vector is done while unloading the first.

    TimeTot = ClockPeriod × (MaxLengthVec + NumVecs × (MaxLengthVec + 1))
            = (1/(0.80 × 800 × 10^6)) × (22,000 + 500,000 × (22,000 + 1))
            = 17 secs

7.4 Boundary Scan and JTAG

Boundary scan originated as a technique to test wires on printed circuit boards (PCBs). The goal was to replace "bed-of-nails" style testing with a technique that would work for high-density PCBs (lots of small wires close together). It is now used to test both boards and chip internals, and is used both on boundaries (I/O pins) and on internal flops.

Boundary Scan with JTAG .............................................................

Standardized by the IEEE (1149.1) and previously by JTAG:
• 4 required signals (scan pins: TDI, TDO, TCK, TMS)
• 1 optional signal (scan pin: TRST)
• a protocol to connect the circuit under test to the tester and other circuits
• a state machine to drive the test circuitry on the chip
• the Boundary Scan Description Language (BSDL): a structural language used to describe which features of JTAG a circuit supports

JTAG circuitry is now commonly built into FPGAs and ASICs, or is part of a cell library. Rarely is a JTAG circuit custom-built as part of a larger part. So, you'll probably be choosing and using JTAG circuits, not constructing new ones. Using JTAG circuitry is usually done by giving a description of your printed circuit board (PCB) and the JTAG components on each chip (in BSDL) to test generation software. The software then generates a sequence of JTAG commands and data that can be used to test the wires on the circuit board for opens and shorts.

7.4.1 Boundary Scan History

1985  JETAG: Joint European Test Action Group
1986  JTAG (North American companies joined)
1990  JTAG 2.0 formed the basis for IEEE 1149.1, "Test Access Port and Boundary-Scan Architecture"
7.4.2 JTAG Scan Pins

    TDI  −→  test data input: input test vector to chip
    TDO  ←−  test data output: output result of test
    TCK  −→  test clock: clock signal that the test runs on
    TMS  −→  test mode select: controls the scan state machine
    TRST −→  test reset (optional): resets the scan state machine

[Figure: high-level and detailed views of a chip with scan circuitry. The boundary scan register (BSR), a ring of boundary scan cells (BSCs), sits between the normal input and output pins and the circuit under test. The scan registers (BSR, bypass register BR, instruction register IR built from IR cells, and IDCODE register) are connected between TDI and TDO, controlled by the instruction decoder and the TAP controller, which is driven by TCK and TMS.]

7.4.3 Scan Registers and Cells

Basic Building Blocks ...............................................................

    TDR             Test data register    The boundary scan registers on a chip
    DR   Fig 14.2   Data register cell    Often used as a boundary scan cell (BSC)

JTAG Components .....................................................................
(Fig 14.8): Top-level diagram.
BSR (Fig 14.5): Boundary scan register. A chain of boundary scan cells (BSCs).
BSC (Fig 14.2): Boundary scan cell. Connects the external input and scan signal to the internal circuit. Acts as a wire between the external input and the internal circuit in normal mode.
BR (Fig 14.3): Bypass-register cell. Allows a direct connection from TDI to TDO. Acts as a wire when executing the BYPASS instruction.
IDCODE: Device identification register. Data register that holds the manufacturer's name and chip identifier. Used in the IDCODE instruction.
IR cell (Fig 14.4): Instruction register cell. Cells are combined together as a shift register to form an instruction register (IR).
IR (Fig 14.6): Instruction register. Two or more IR cells in a row. Holds data that is shifted in on TDI and sends this data in parallel to the instruction decoder.
IDecode (Table 14.4): Instruction decoder. Reads the instruction stored in the instruction register (IR) and sends control signals to the bypass register (BR) and boundary scan register (BSR).
TAP Controller (Fig 14.7): State machine that, together with the instruction decoder, controls the scan circuitry.

7.4.4 Scan Instructions

This is the set of required instructions; other instructions are optional.

EXTEST: Test board-level interconnect. Drive the output pins of the chip with a hard-coded test vector. Sample the results on the inputs.
SAMPLE: Sample result data.
PRELOAD: Load test vector.
BYPASS: Directly connect TDI to TDO. This is used when several chips are daisy-chained together, to skip loading data into some chips.
IDCODE: Output manufacturer and part number.

7.4.5 TAP Controller

The TAP controller is required to have 16 states and obey the state machine shown in Fig 14.7 of Smith.
7.4.6 Other Descriptions of JTAG / IEEE 1149.1

Texas Instruments introductory seminar on IEEE 1149.1: http://www.ti.com/sc/docs/jtag/seminar1.pdf
Texas Instruments intermediate seminar on IEEE 1149.1: http://www.ti.com/sc/docs/jtag/seminar2.pdf
Sun microSPARC-IIep scan-testing documentation: http://www.sun.com/microelectronics/whitepapers/wpr-0018-01/
Intellitech JTAG overview: http://www.intellitech.com/resources/technology.html
Actel's JTAG description: http://www.actel.com/appnotes/97s05d15.pdf
Description of JTAG support on the Motorola ColdFire microprocessor: http://e-www.motorola.com/collateral/MCF5307TR-JTAG.pdf
7.5 Built In Self Test

With built-in self test (BIST), the circuit tests itself. Both test vector generation and checking are done using linear feedback shift registers (LFSRs).

7.5.1 Block Diagram

[Figure: BIST block diagram. In test mode, a test generator LFSR drives the inputs of the circuit under test; each output d_out(0)..d_out(3) feeds a signature analyzer producing ok(0)..ok(3); a result checker combines the ok signals into all_ok.]

7.5.1.1 Components

There is one test generator per group of inputs (or internal flops) that drive the same circuit to be tested. There is one signature analyzer per output (or internal flop).

Note: MISR. An exception to the above rule is a multiple-input signature register (MISR), which can be used to analyze several outputs of the circuit under test.

The test generator and signature analyzer are both built with linear-feedback shift registers.
Test Generator .............................................................

• generates a pseudo-random set of test vectors
• for n output bits, generates all vectors from 1 to 2^n − 1 in a pseudo-random order
• built with a linear-feedback shift register (the shift-register portion is the input flops)

The figure below shows an LFSR that generates all possible 3-bit vectors except 000. (An n-bit LFSR that generates 2^n − 1 different vectors is called a "maximal-length LFSR".) Assume that reset initializes the circuit to 111. The sequence that is generated is: 111, 011, 001, 100, 010, 101, 110. This sequence is repeated, so the value after 110 is 111.

[Figure: 3-bit LFSR with outputs q2, q1, q0.]

Question: Why not just use a counter to generate 1..2^n − 1?
Answer:
• An LFSR has less area than an incrementer: just a few XOR gates for an LFSR, compared to a half-adder per bit for an incrementer.
• There is a strong correlation between consecutive test vectors generated by an incrementer, while there is no correlation between consecutive test vectors generated by an LFSR. When doing speed binning, if consecutive test vectors should generate the same output, we cannot distinguish between a slow critical path and a correctly working circuit.

Signature Analyzer .............................................................

Checking is done by building one signature analyzer circuit for each signal tested. The circuit returns true if the signal generates the correct sequence of outputs for the test vectors. Doing this with complete accuracy would require storing 2^n bits of information for each output for a circuit with n inputs. This would be as expensive as the original circuit. So, BIST uses mathematics similar to error correction/detection to approximate whether the outputs are correct.
This technique is called "signature analysis" and originated with Hewlett-Packard in the 1970s. The checking is done with an LFSR, similar to the BIST generation circuit. The checking circuit is designed to output a 1 at the end of the sequence of 2^n − 1 test results if the sequence of results matches the correct circuit. We could do this exactly with an LFSR of 2^n − 1 flops, but as said before, this would be at least as expensive as duplicating the original circuit. The checking LFSR is designed similarly to a hashing function or parity-checking circuit. If it returns 0, then we know that there is a fault in the circuit. If it returns 1, then there is probably not a fault in the circuit, but we can't say for sure.
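The 3-bit maximal-length sequence listed above (111, 011, 001, 100, 010, 101, 110) can be reproduced in software. This is a behavioural sketch in Python, not the course's VHDL: the function name is mine, and the feedback taps (q1 XOR q0, shifted in at the MSB) are chosen so that the output matches the listed sequence.

```python
def lfsr_sequence(state=0b111, nbits=3):
    """Generate one full period of a 3-bit maximal-length LFSR."""
    seq = []
    for _ in range(2 ** nbits - 1):
        seq.append(state)
        fb = ((state >> 1) ^ state) & 1             # feedback bit: q1 XOR q0
        state = (fb << (nbits - 1)) | (state >> 1)  # shift right, feedback enters at MSB
    return seq

print([format(s, "03b") for s in lfsr_sequence()])
# → ['111', '011', '001', '100', '010', '101', '110']
```

All seven non-zero 3-bit vectors appear exactly once before the sequence repeats.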
There is a tradeoff between the accuracy of the analyzer and its area. The more accurate it is, the more flip-flops are required.

Summary: the signature analyzer:
• checks that the output it is examining has the correct results for the complete set of tests that are run
• only has a meaningful result at the end of the entire test sequence
• is built with a linear-feedback shift register
• is similar to a hash function or a lossy compression function
• if there are no faults, will definitely say "ok" (no false positives)
• if there is a fault, might say "ok" or might say "bad" (false negatives are possible)
• design tradeoff: more accurate signature analyzers require more hardware

Result Checker .............................................................

• signature analyzers output "ok"/"bad" on every clock cycle, but the result is only meaningful at the end of running the complete set of test vectors
• the result checker looks at the test vector inputs to detect the end of the test suite and outputs "all ok" if all signature analyzers report "ok" at that moment
• implemented as an AND gate

7.5.1.2 Linear Feedback Shift Register (LFSR)

Basically, a shift register (a sequence of flip-flops) with the output of the last flip-flop fed back into some of the earlier flip-flops with XOR gates.

Design parameters:
• number of flip-flops
• external or internal XOR
• feedback taps (coefficients)
• external-input or self-contained
• reset or set

Example LFSRs .............................................................

[Figure: two example 3-flop LFSRs: external-XOR with input and reset; external-XOR with no input and set.]
[Figure: two more example 3-flop LFSRs: internal-XOR with input and set; internal-XOR with input and reset.]

In E&CE 327, we use internal-XOR LFSRs, because the circuitry matches the mathematics of Galois fields. External-XOR LFSRs work just fine, but they are more difficult to analyze, because their behaviour can't be treated with Galois fields.

7.5.1.3 Maximal-Length LFSR

Definition (maximal-length linear feedback shift register): An LFSR that outputs a pseudo-random sequence of all representable bit-vectors except 0...00.

Definition (pseudo-random): The same elements in the same order every time, but the relationship between consecutive elements is apparently random.

Maximal-length linear feedback shift registers are used to generate test vectors for built-in self test.

Maximal-Length LFSR Circuits .............................................................

[Figure: the two maximal-length internal-XOR linear feedback shift registers that can be constructed with 3 flops.]
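The maximal-length property can be checked by brute force: step an internal-XOR (Galois) LFSR until it returns to its start state and count the steps. A Python sketch under my own encoding (taps given as a bitmask of the low-order polynomial coefficients; the x^n term is implicit):

```python
def lfsr_period(taps, nbits, start=1):
    """Count steps for an internal-XOR (Galois) LFSR to return to its start state."""
    mask = (1 << nbits) - 1
    state, steps = start, 0
    while True:
        out = (state >> (nbits - 1)) & 1   # bit fed back from the last flop
        state = (state << 1) & mask
        if out:
            state ^= taps                  # XOR the feedback into the tapped flops
        steps += 1
        if state == start:
            return steps

print(lfsr_period(0b011, 3))  # x^3 + x + 1     → 7 (maximal length)
print(lfsr_period(0b101, 3))  # x^3 + x^2 + 1   → 7 (maximal length)
print(lfsr_period(0b001, 3))  # x^3 + 1         → 3 (not maximal)
```

Both 3-flop maximal-length polynomials, x^3 + x + 1 and x^3 + x^2 + 1, give the full period of 2^3 − 1 = 7; a non-primitive polynomial such as x^3 + 1 does not.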
Question: Why do maximal-length LFSRs not generate the test vector 0...00?
Answer: If all flops had 0 as output, then the LFSR would get stuck at 0 and would generate only 0...00 in the future.

Maximal-length LFSRs:
• are set to all 1s initially
• are self-contained (no external i input)

[Figure: timing diagram for a 3-flop maximal-length LFSR; after reset, the value sequence is 7, 6, 4, 1, 2, 5, 3, then repeats with 7, 6, ...]

7.5.2 Test Generator

The test generator component is a maximal-length LFSR with multiplexors on the inputs to each flip-flop. In test mode, the data input of each flip-flop is connected to the output of the previous flip-flop. In normal mode, the input of each flip-flop is connected to the environment.

[Figure: 3-flop test generator; a mode signal selects between the LFSR feedback path and the environment inputs i_d(0), i_d(1), i_d(2).]
7.5.3 Signature Analyzer

Several things change between different signature analyzers:
• number of flops (⇑ flops =⇒ ⇑ area, ⇑ accuracy)
• choice of feedback taps: a good choice can improve accuracy (more isn't necessarily better)
• bubbles on the inputs to the AND gate for "ok": determined by the expected result from simulating the test sequence through the circuit under test and the LFSR of the analyzer

Example analyzer:
• Two flops; most analyzers use more. The HP boards in the 1970s used 37 flops!
• Feedback taps on both flops. Different signature analyzers have different configurations of feedback taps.
• Also contains an "ok" tester (an AND gate). The expected output of the LFSR at the end of the test sequence is q0=1 and q1=0, or 01. (We know this because of the bubble on the AND gate. To see why this is the expected output of the signature analyzer, we would need to know the correct sequence of outputs of the circuit under test.)

[Figure and table: two-flop signature analyzer with input i, and a symbolic trace of d0, q0, d1, q1 over the input bits i6..i0, using the shorthand 356 = i3⊕i5⊕i6, 2356 = i2⊕i3⊕i5⊕i6, etc.]

7.5.4 Result Checker

The purpose of the result checker is to check the "ok" signal at the end of the test sequence. To do this, we need to recognize the end of the test sequence. The simplest way to do this is to notice that the first test vector is all 1s and that the test vector sequence will repeat as long as the circuit is in test mode. We want to sample the "ok" signal one clock cycle after the sequence is over. This is the same as the first clock cycle of the second test sequence. In this clock cycle, the output of the test generator will be all 1s and reset will be 0. We need to look at reset, because otherwise we could not distinguish the first sequence (when reset is 1) from subsequent sequences.

[Figure: result checker; all_ok is the AND of ok, q0, q1, q2, and NOT reset.]
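The sampling condition described above (generator output all 1s and reset = 0) can be sketched as a combinational function. This is a Python sketch of the logic, not the course's circuit; the names are mine:

```python
def result_checker(reset, test_vec, oks, nbits=3):
    """all_ok for one clock cycle: sample the analyzers' ok signals on the
    first cycle of the *second* pass through the test sequence, i.e. when
    the generator outputs all 1s and reset is 0."""
    all_ones = test_vec == (1 << nbits) - 1
    return (not reset) and all_ones and all(oks)

print(result_checker(reset=1, test_vec=0b111, oks=[True, True]))   # first sequence → False
print(result_checker(reset=0, test_vec=0b111, oks=[True, True]))   # end of sequence → True
print(result_checker(reset=0, test_vec=0b111, oks=[True, False]))  # one analyzer bad → False
```

In hardware this is just the AND gate shown in the figure, with an inverter on reset.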
7.5.5 Arithmetic over Binary Fields

• Galois fields!
• Two operations: "+" and "×"
• Two values: 0 and 1
• Bit vectors and shift registers are written as polynomials in terms of x.

"+" represents XOR; "×" represents concatenating shift registers.

expression   result
0 + 0        0
0 + 1        1
1 + 0        1
1 + 1        0
x + x        0
x^4 × 1      x^4
x^2 × x^3    x^5

Example .............................................................

Calculate (x^3 + x^2 + 1) × (x^2 + x):

x^2 × (x^3 + x^2 + 1) = x^5 + x^4 + x^2
x   × (x^3 + x^2 + 1) = x^4 + x^3 + x
sum                   = x^5 + x^3 + x^2 + x

7.5.6 Shift Registers and Characteristic Polynomials

Each linear feedback shift register has a corresponding characteristic polynomial. The exponents in the polynomial correspond to delay: x^0 is the input to the shift register, x^1 is the output of the first flip-flop, x^2 is the output of the second, etc. The coefficient is 1 if the output feeds back into the flip-flop. Usually (internal flops, or input flops with an external input), the feedback is done via an XOR gate. For input flops without an external input signal, the feedback is done directly, with a wire. The non-existent external input is equivalent to a 0, and 0 XOR a simplifies to a, which is a wire.

From polynomials to hardware:
• The maximum exponent denotes the number of flops
• The other exponents denote the flops that tap off of the feedback line from the last flop
• From the characteristic polynomial, we cannot determine whether the shift register has an external input. Stated another way, two shift registers that are identical except that one has an external input and the other does not will have the same characteristic polynomial.
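The multiplication example can be checked in software: over GF(2), addition is XOR, so polynomial multiplication is "carry-less" multiplication on bitmasks. A Python sketch (the function name and encoding are mine):

```python
def gf2_mul(a, b):
    """Multiply two GF(2) polynomials encoded as bitmasks (bit k is the
    coefficient of x^k)."""
    prod = 0
    while b:
        if b & 1:
            prod ^= a      # addition over GF(2) is XOR, so partial products XOR in
        a <<= 1            # multiply a by x
        b >>= 1
    return prod

a = 0b1101   # x^3 + x^2 + 1
b = 0b0110   # x^2 + x
print(bin(gf2_mul(a, b)))  # 0b101110 = x^5 + x^3 + x^2 + x
```

Note that the x^4 terms of the two partial products cancel, exactly as in the worked example.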
[Figure: six example shift registers with their characteristic polynomials:
p(x) = x^3 (plain 3-flop shift register, no feedback taps);
p(x) = x^3 + x;
p(x) = x^3 + 1;
p(x) = x^3 + x + 1;
p(x) = x^3 + x^2 + x + 1;
p(x) = x^4 + x^3 + x + 1 (4 flops).]
7.5.6.1 Circuit Multiplication

Redoing the multiplication example (x^2 + x) × (x^3 + x^2 + 1) as circuits:

[Figure: shift-register circuits for x^2 + x and x^3 + x^2 + 1; their product is (x^2 + x) × (x^3 + x^2 + 1) = x × (x^3 + x^2 + 1) + x^2 × (x^3 + x^2 + 1) = x^5 + x^3 + x^2 + x.]

The flop for the most-significant bit is represented by a coefficient of 1 for the maximum exponent in the polynomial. Hence, the MSB of the first partial product cancels the x^4 of the second partial product, resulting in a coefficient of 0 for x^4 in the answer.

7.5.7 Bit Streams and Characteristic Polynomials

A bit stream, or bit sequence, can be represented as a polynomial. The oldest (first) bit in a sequence of n bits is represented by x^(n−1) and the youngest (last) bit is x^0. The bit sequence 1010011 can be represented as x^6 + x^4 + x + 1:

1 0 1 0 0 1 1 = 1·x^6 + 0·x^5 + 1·x^4 + 0·x^3 + 0·x^2 + 1·x^1 + 1·x^0
              = x^6 + x^4 + x + 1

7.5.8 Division

With rules for multiplication and addition, we can define division. A fundamental theorem of division defines q and r to be the quotient and remainder, respectively, of m ÷ p iff:

m(x) = q(x) × p(x) + r(x)
In Galois fields, we do division just as with long division in elementary school. (Over GF(2), subtraction is the same as addition, i.e. XOR.) Given:

m(x) = x^6 + x^4 + x^3
p(x) = x^4 + x

Calculate the quotient q(x) and remainder r(x) for m(x) ÷ p(x):

x^2 × p(x) = x^6 + x^3;   m(x) + (x^6 + x^3) = x^4
1 × p(x)   = x^4 + x;     x^4 + (x^4 + x)    = x

Quotient:  q(x) = x^2 + 1
Remainder: r(x) = x

Check the result:
m(x) = q(x) × p(x) + r(x)
     = (x^2 + 1) × (x^4 + x) + x
     = x^6 + x^3 + x^4 + x + x
     = x^6 + x^4 + x^3

7.5.9 Signature Analysis: Math and Circuits

The input to the signature analyzer is a "message", m(x), which is a sequence of n bits represented as a polynomial. After n shifts through an LFSR with l flops:
• the sequence of output bits forms a quotient, q(x), of length n − l
• the flops in the analyzer form a remainder, r(x), of length l

m(x) = q(x) × p(x) + r(x)

The remainder is the signature. The mathematics for an LFSR without an input i:
• same polynomial as if the circuit had an input
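The long-division example can be checked the same way: GF(2) long division on bitmask-encoded polynomials, where each subtraction step is an XOR of the shifted divisor. A Python sketch (names and encoding are mine):

```python
def gf2_divmod(m, p):
    """Long division of GF(2) polynomials (bit k of a mask is the
    coefficient of x^k). Returns (quotient, remainder)."""
    q = 0
    deg_p = p.bit_length() - 1
    while m and m.bit_length() - 1 >= deg_p:
        shift = (m.bit_length() - 1) - deg_p
        q |= 1 << shift
        m ^= p << shift        # subtract (XOR) the shifted divisor
    return q, m

m = 0b1011000   # x^6 + x^4 + x^3
p = 0b0010010   # x^4 + x
q, r = gf2_divmod(m, p)
print(bin(q), bin(r))  # 0b101 (= x^2 + 1), 0b10 (= x)
```

As a cross-check, dividing the earlier product x^5 + x^3 + x^2 + x by x^2 + x recovers x^3 + x^2 + 1 with remainder 0, consistent with the multiplication example.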
• input sequence is all 0s

An input stream with an error can be represented as m(x) + e(x):
• e(x) is the error polynomial
• bits in the message that are flipped have a coefficient of 1 in e(x)

m(x) + e(x) = q′(x) × p(x) + r′(x)

The error e(x) will be detected if it results in a different signature (remainder). m(x) and m(x) + e(x) will have the same remainder iff:

e(x) mod p(x) = 0

That is, e(x) must be a multiple of p(x). The larger p(x) is, the smaller the chance that e(x) will be a multiple of p(x).

7.5.10 Summary

Adding test circuitry:
1. Pick the number of flops for the generator
2. Build the generator (a maximal-length linear feedback shift register)
3. Pick the number of flops for signature analysis
4. Pick the coefficients (feedback taps) for the analyzer
5. Based on the generator, circuit under test, and signature analyzer, determine the expected output of the analyzer
6. Based on the expected output of the analyzer, build the result checker

Running test vectors:
1. Put the circuit in test mode
2. Set reset = 1
3. Run one clock cycle, then set reset = 0
4. Run one clock cycle for each test vector
5. At the end of the test sequence, all ok signals should be 1

Note: running n test vectors requires n + 1 clock cycles.
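The error-polynomial condition above, that an error e(x) is missed iff e(x) mod p(x) = 0, can be demonstrated by computing signatures as remainders. A Python sketch; p(x) = x^4 + x is reused from the division example as an assumed characteristic polynomial:

```python
def gf2_mod(m, p):
    """Remainder of GF(2) polynomial division (bit k of a mask ↔ x^k)."""
    deg_p = p.bit_length() - 1
    while m and m.bit_length() - 1 >= deg_p:
        m ^= p << ((m.bit_length() - 1) - deg_p)   # subtract (XOR) shifted divisor
    return m

p = 0b10010          # p(x) = x^4 + x (assumed characteristic polynomial)
m = 0b1011000        # m(x) = x^6 + x^4 + x^3 (the message)

# A single-bit error changes the signature, so it is detected:
e1 = 0b0001000       # e(x) = x^3, not a multiple of p(x)
print(gf2_mod(m, p) != gf2_mod(m ^ e1, p))   # True: detected

# An error that is a multiple of p(x) leaves the signature unchanged:
e2 = 0b100100        # e(x) = x × p(x) = x^5 + x^2
print(gf2_mod(m, p) != gf2_mod(m ^ e2, p))   # False: missed
```

This is exactly why a larger (and well-chosen) p(x) helps: fewer error polynomials are multiples of it.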
BIST for a Simple Circuit .............................................................

Outline of steps to see if a fault will be detected by BIST:
1. Output sequence from the test generator
2. Output sequence from the correct circuit
3. Remainder for the signature analyzer with the correct output sequence
4. Output sequence from the faulty circuit
5. Remainder for the signature analyzer with the faulty output sequence
6. Compare the correct and faulty remainders; if they differ, the fault is detected

Components .............................................................

[Figure: the circuit under test (gates labelled L1..L8, inputs a, b, c, output z), a 3-flop test generator (t0, t1, t2), and a 3-flop signature analyzer (r0, r1, r2).]
[Blank worksheet: tables for the test generator outputs (t0, t1, t2), the circuit inputs (a, b, c), the correct and faulty outputs z, and the signature analyzer states (r0, r1, r2) for the correct and faulty circuits.]

Question: Determine if L2@1 will be detected.
Equation for the correct circuit: ab + bc
Equation for the faulty circuit: a + c

[Table: output sequences for the correct and faulty circuits. The test generation sequence starts at the initial value 111, and its final value is a repeat of the initial value. The technique is to shift the test generation sequence through the flops, then compute the result vectors from the circuits' equations.]
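The first steps of the outline, generating the test sequence and computing the output sequences of the correct (ab + bc) and faulty (a + c) circuits, can be sketched in Python. The vector ordering and the mapping of generator bits to a, b, c are my assumptions; the notes' tables fix the exact order, but any full maximal-length period covers all seven non-zero vectors:

```python
def correct_z(a, b, c):
    return (a & b) | (b & c)   # correct circuit: ab + bc

def faulty_z(a, b, c):
    # L2 stuck-at-1 forces b to 1, so ab + bc simplifies to a + c
    return a | c

# Assumed maximal-length generator sequence over all non-zero 3-bit vectors
vectors = [0b111, 0b110, 0b100, 0b001, 0b010, 0b101, 0b011]

good = [correct_z((v >> 2) & 1, (v >> 1) & 1, v & 1) for v in vectors]
bad  = [faulty_z((v >> 2) & 1, (v >> 1) & 1, v & 1) for v in vectors]
print(good != bad)   # True: the fault changes the output sequence
```

The sequences differ (e.g. for a=1, b=0, c=1 the correct output is 0 but the faulty output is 1), so the fault is visible at the output; whether the signature analyzer catches it then depends on its feedback taps, as the remainder tables show.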
[Table: signature analyzer sequences. The output sequence from the correct circuit is shifted through the analyzer (r0, r1, r2, initial values 0) to produce the correct remainder; the output sequence from the faulty circuit is shifted through the same analyzer to produce the faulty remainder. (The second table's caption should read "output sequence from faulty circuit".)]
7.6 Scan vs Self Test

Scan:
⇑ less hardware
⇓ slower
⇑ well-defined coverage
⇑ test vectors are easy to modify

Self Test:
⇓ more hardware
⇑ faster
⇓ ill-defined coverage
⇓ test vectors are hard to modify
7.7 Problems on Faults, Testing