A comparative analysis of fault injection methods via enhanced on-chip debug infrastructures
J. M. Martins Ferreira [ jmf@fe.up.pt ], FEUP / DEEC, Rua Dr. Roberto Frias, 4200-465 Porto, PORTUGAL
André Fidalgo, Gustavo R. Alves, Manuel Gericota [ anf/gca/mgg @isep.ipp.pt ], ISEP / DEE, Rua Ant. Bernardino Almeida, 431, 4200-072 Porto, PORTUGAL
SBCCI’08: Gramado, Brazil, 1-4 September 2008
These slides are available at http://www.slideshare.net/josemmf

Outline of the presentation
Introduction and motivation
Setup, workbench, workflow
Experimental results
  Basic, extended and OCD-FI
  OCD-FI extensions (EDAC, RTREG)
Comparison and discussion
Conclusion

Scope, focus, setup
Scope: usage of OCD resources for validating fault tolerance / fault injection
Focus: comparative analysis of experimental results for various OCD configurations and debugging scenarios
Setup:
  a) 32-bit Freescale MPC-565, iSystem IC3000 (iTracePro), Winidea 2005
  b) OCD enhancements in VHDL

Motivation
OCD offers controllability and observability features that may be used to inject faults and observe their effect (R/W access to registers and memory)
Usefulness for fault tolerance validation may be limited in bandwidth, coverage and repeatability / representativeness of results
Mitigation is possible by enhancing OCD

Our approach
Configurations: basic (2:8), extended (8:8), OCD-FI (with a fault injection module)
Fault injection scenarios: off-line or real-time, predefined or on-the-fly
OCD-FI is able to cope with error detection / correction and real-time requirements
Comparison of results uses a common set of workload applications and FI campaigns

NEXUS FI for the MPC565
Trace data: program trace data output by the OCD
Campaign data: scripts that describe the FI experiments

NEXUS Debug Features                Class  Usability for FI
Run-Control                         1      External Triggering
Breakpoints                         1      Internal Triggering
Watchpoints                         1      Real Time Triggering
Static Register and Memory Access   1      Static Fault Insertion
Program Trace                       2      Fault Effects Classification
Dynamic Register and Memory Access  3      Real Time Fault Insertion
Data Trace                          3      Improved Fault Effects Classification

OCD infrastructure developed to support this work
NEXUS class 2 compliant, with real-time memory access
Adjustable data bus
OCD configurations:
  Basic (2,8)
  Extended (8,8)
  OCD-FI: comprises a fault injection module

Fault injection: Workload applications
Workload applications:
  Matrix adder (MAdder)
  Vector sorter (VSorter)
  LUT control algorithm (XControl)
Each application was implemented in two versions: normal and fault tolerant
Fault tolerance is obtained by duplicating data in memory and repeating each operation

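The slides do not show how the fault-tolerant versions were coded. The following is only a minimal sketch of the stated idea (duplicated data, repeated operation) applied to a matrix adder; the function name `ft_matrix_add` and the comparison-based detection flag are assumptions made for illustration.

```python
def ft_matrix_add(a, a_copy, b, b_copy):
    """Add two matrices twice, once per copy of the operands.

    Hypothetical illustration: a mismatch between the two results
    is taken as the "error detected" signal.
    """
    first = [[x + y for x, y in zip(row_a, row_b)]
             for row_a, row_b in zip(a, b)]
    second = [[x + y for x, y in zip(row_a, row_b)]
              for row_a, row_b in zip(a_copy, b_copy)]
    error_detected = first != second
    return first, error_detected


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[10, 20], [30, 40]]
    # A bit-flip in one copy of the operands makes the two passes disagree.
    a_faulty = [[1 ^ 0b100, 2], [3, 4]]
    print(ft_matrix_add(a_faulty, a, b, b))
```

A fault that hits only one copy of the data is detected; a fault that corrupts the result after the comparison would not be, which is one way an undetected error can still occur in a scheme like this.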
Fault injection campaigns
Scripts that define 10 FI experiments during system operation
100 campaigns were executed for each scenario using the three workload applications (MAdder, VSorter, XControl)
FI campaigns mostly target memory positions and cause a bit-flip to emulate SEU effects

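As a rough illustration of what a campaign could contain, the sketch below flips a single bit with XOR and generates a list of 10 experiments. The record fields (`address`, `bit`, `trigger_cycle`), the address range and the use of a seeded random generator are all assumptions; the slides do not describe the script format.

```python
import random


def flip_bit(word, bit, width=32):
    """Return `word` with one bit inverted, emulating a single-event upset."""
    if not 0 <= bit < width:
        raise ValueError("bit index outside the word")
    return (word ^ (1 << bit)) & ((1 << width) - 1)


def make_campaign(seed, n_experiments=10, addr_range=(0x1000, 0x2000),
                  max_cycle=10_000):
    """Build a hypothetical campaign: one record per FI experiment."""
    rng = random.Random(seed)  # seeded so a campaign can be repeated
    return [
        {
            "address": rng.randrange(addr_range[0], addr_range[1], 4),
            "bit": rng.randrange(32),
            "trigger_cycle": rng.randrange(max_cycle),
        }
        for _ in range(n_experiments)
    ]


if __name__ == "__main__":
    for experiment in make_campaign(seed=1):
        print(experiment)
```

Flipping the same bit twice restores the original word, which is convenient when a campaign needs to undo an injected fault.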
Predetermination to improve performance of FI campaigns
Predetermination of the contents of the target memory cell at the FI instant may be done through a “gold run” or by ensuring:
  Complete knowledge of the program flow
  Full observability of external inputs
  Precise control of the FI instant and location
Otherwise the target memory cell must be read “immediately” before the FI instant

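The difference between the two cases can be sketched with a toy model of debug-port memory access. `DebugPort` and both injection functions are hypothetical names, not part of any real debugger API; the model only counts transactions.

```python
class DebugPort:
    """Toy stand-in for OCD memory access that counts transactions."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.reads = 0
        self.writes = 0

    def read(self, addr):
        self.reads += 1
        return self.memory[addr]

    def write(self, addr, value):
        self.writes += 1
        self.memory[addr] = value


def inject_predetermined(port, addr, golden_value, bit):
    """Faulty value is computed beforehand (e.g. from a gold run): write only."""
    port.write(addr, golden_value ^ (1 << bit))


def inject_on_the_fly(port, addr, bit):
    """No predetermination: read the cell just before the FI instant, then write."""
    value = port.read(addr)
    port.write(addr, value ^ (1 << bit))


if __name__ == "__main__":
    p1 = DebugPort({0x100: 0x0F})
    p2 = DebugPort({0x100: 0x0F})
    inject_predetermined(p1, 0x100, golden_value=0x0F, bit=0)
    inject_on_the_fly(p2, 0x100, bit=0)
    print(p1.memory, p1.reads, p2.memory, p2.reads)
```

Both paths leave the same faulty value in memory when the predetermined value is correct; the on-the-fly path costs one extra read transaction per injected fault.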
Experimental scenarios
B: Basic; E: Extended; OCD-FI: OCD for Fault Injection
OF: Off-line; RT: Real-time; +: predetermination not required

Configur. &  Bandwidth     Predetermination     Fault injection  Delays (Clk cycles)
Scenario                   of the faulty value  method           Set-Up  Insertion
BOF          MDI=2 MDO=8   YES                  Offline          22      35
BOF+         MDI=2 MDO=8   NO                   Offline          22      44
EOF          MDI=8 MDO=8   YES                  Offline          6       9
EOF+         MDI=8 MDO=8   NO                   Offline          6       18
BRT          MDI=2 MDO=8   YES                  Real Time        22      35
BRT+         MDI=2 MDO=8   NO                   Real Time        22      44
ERT          MDI=8 MDO=8   YES                  Real Time        6       9
ERT+         MDI=8 MDO=8   NO                   Real Time        6       18
OCD-FI       MDI=2 MDO=8   YES                  Real Time        57      2
OCD-FI+      MDI=2 MDO=8   NO                   Real Time        57      4

Experimental results (%): B, E, OCD-FI (results)
Uerr: Undetected errors (incorrect final result that goes undetected)
Derr: Detected errors (error detection signal activated)
Nerr: No errors (application ended correctly)

            MAdder                            VSorter                           XControl
Configur. & non-FT        SW-FT               non-FT        SW-FT               non-FT        SW-FT
Scenario    Uerr  Nerr    Derr  Uerr  Nerr    Uerr  Nerr    Derr  Uerr  Nerr    Uerr  Nerr    Derr  Uerr  Nerr
OFF         19    81      28    13.9  58.1    98    2       97    2     1       Not possible
BRT         19.4  80.6    28.3  13.8  57.9    98.1  1.9     96.8  2     1.2     Not possible
ERT         19.2  80.8    28.1  13.9  58      98    2       96.9  2     1.1     Not possible
OCD-FI      19    81      28    13.9  58.1    98    2       97    2     1       Not possible
BRT+        19.5  80.5    28.4  13.8  57.8    98.2  1.8     96.7  1.9   1.4     29.3  70.7    29.1  1.5   69.4
ERT+        19.3  80.7    28.2  13.8  58      98.1  1.9     96.8  1.9   1.3     29.6  70.4    28.9  1.2   69.9
OCD-FI+     19.1  80.9    28.1  13.9  58      98    2       96.9  1.9   1.2     29.8  70.2    28.8  1.1   70.1

Experimental results (%): Erroneous fault insertions
Further experiments in RT scenarios were carried out to identify erroneous FI, which were classified as Inconclusive (INC)

Configur. &  non-FT                           SW-FT
Scenario     MAdder  VSorter  XControl        MAdder  VSorter  XControl
OFF          0 (single value for all)         0 (single value for all)
BRT          3.1     0.9      Not possible    4       2.2      Not possible
ERT          1.4     0.6      Not possible    2.3     1.1      Not possible
OCD-FI       0.2     0.1      Not possible    0.2     0.2      Not possible
BRT+         3       1.2      2.1             4.8     2.8      3.2
ERT+         2       0.8      1.5             3.7     2.1      2.4
OCD-FI+      0.4     0.2      0.3             1.7     1.2      1.3

Experimental results: Pros and cons of FI methods
Off-line configurations always produce the most reliable results
The CPU may overwrite the target memory cell before the FI is complete (INC)
INC results increase with the delay between fault triggering and fault insertion, and are mitigated by OCD-FI and predetermination

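The four outcome classes used in the results (Nerr, Uerr, Derr, INC) can be expressed as a small classifier. This is a sketch under assumptions: the slides do not state the precedence between classes, so treating an overwritten target cell as INC first, and a raised detection signal as Derr regardless of the final result, are choices made here for illustration.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    NERR = "no error"
    UERR = "undetected error"
    DERR = "detected error"
    INC = "inconclusive"


def classify(result, golden, detected, overwritten=False):
    """Classify one FI experiment (precedence order is an assumption)."""
    if overwritten:          # CPU rewrote the target cell before FI completed
        return Outcome.INC
    if detected:             # error detection signal was activated
        return Outcome.DERR
    if result == golden:     # application ended correctly
        return Outcome.NERR
    return Outcome.UERR      # wrong final result that went unnoticed


def percentages(outcomes):
    """Share of each outcome class, in percent."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return {outcome: 100.0 * n / total for outcome, n in counts.items()}


if __name__ == "__main__":
    runs = [classify(5, 5, False), classify(4, 5, False), classify(4, 5, True)]
    print(percentages(runs))
```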
Experimental results (%): OCD-FI extensions for EDAC
FT versions of the workload applications were not used due to EDAC
Derr: Percentage of errors detected that were corrected by EDAC

          No Predetermination         Predetermination
          Derr  Uerr  Nerr  INC       Derr  Uerr  Nerr  INC
MAdder    39.6  0     58.8  1.6       39.7  0     59.5  0.8
VSorter   98.3  0     0.8   0.9       99    0     0.7   0.3
XControl  29.9  0     69.1  1         30    0     69.5  0.5

Experimental results: Pros and cons of OCD-FI EDAC extensions
EDAC mechanisms effectively eliminate the effects of single bit-flip errors on the target system
The OCD-FI EDAC extension enables FI into protected memory blocks

Experimental results (%): OCD-FI for RTREG
RT register access requires a collision manager that degrades dynamic performance…

         non-FT        SW-FT
         Uerr  Nerr    Derr  Uerr  Nerr
MAdder   89    11      62    22    16
VSorter  60    40      46    14    40

Experimental results: Pros and cons of OCD-FI RTREG extensions
Due to their higher occurrence rate, INC results were explicitly avoided
Not all code lines qualify to trigger a FI experiment (45% of the code lines could be used for triggering accumulator FI)
FI results and software fault tolerance efficiency differ significantly between registers and memory

Performance (FI rate)
Maximum faults / second rates (single bit-flips on the same memory cell, 30 MHz clock frequency):

Conf. & Scenario  Real Time     Halted Access
BOF+              Not possible  400k
EOF+              Not possible  1150k
BRT+              454k          400k
ERT+              1250k         1150k
OCD-FI+           491k          483k

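The slides do not say how these rates were obtained. As an observation only, the real-time figures are consistent with dividing the 30 MHz clock by the sum of the set-up and insertion delays listed in the experimental scenarios table. The sketch below checks that arithmetic; the formula is an inference, not a statement from the authors, and it does not reproduce the halted-access column.

```python
CLOCK_HZ = 30_000_000  # clock frequency stated on the slide

# (set-up, insertion) delays in clock cycles, from the scenarios table
DELAYS = {"BRT+": (22, 44), "ERT+": (6, 18), "OCD-FI+": (57, 4)}

# real-time rates reported on the slide, in thousands of faults per second
REPORTED_K = {"BRT+": 454, "ERT+": 1250, "OCD-FI+": 491}


def max_fi_rate_k(setup, insertion, clock_hz=CLOCK_HZ):
    """Assumed model: one fault every (setup + insertion) clock cycles.

    Returns thousands of faults per second, truncated like the slide values.
    """
    return clock_hz // (setup + insertion) // 1000


if __name__ == "__main__":
    for name, (setup, insertion) in DELAYS.items():
        print(name, max_fi_rate_k(setup, insertion), REPORTED_K[name])
```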
Performance (overhead, dynamic)
Silicon overhead and maximum operating frequency on a Virtex-2 FPGA (x: included, -: not included):

CPU Core  OCD  OCD-FI  EDAC  RTREG  Area [Eq Gates]  Overhead [%]  Max f [MHz]
x         -    -       -     -      53926            75.4          37
x         -    -       x     -      55018            76.9          32
x         BRT  -       -     -      71527            100.0         36
x         BRT  -       x     -      72619            101.5         32
x         ERT  -       -     -      76127            106.4         36
x         -    x       -     -      71842            100.4         36
x         -    +EDAC   x     -      73184            102.3         32
x         -    +RTREG  -     x      76392            106.8         27
x         -    +BOTH   x     x      77484            108.3         25

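The overhead column appears to be each configuration's area relative to the CPU core with the basic OCD (71527 equivalent gates, listed as 100,0%). The short check below reproduces the percentages from the area figures; the configuration labels are shorthand introduced here, and the choice of reference is inferred from the 100,0% row rather than stated on the slide.

```python
REFERENCE_AREA = 71527  # CPU core + basic (BRT) OCD, the 100.0% row

# label (shorthand, not from the slide): (area in eq. gates, reported overhead %)
CONFIGS = {
    "CPU": (53926, 75.4),
    "CPU+EDAC": (55018, 76.9),
    "CPU+BRT": (71527, 100.0),
    "CPU+BRT+EDAC": (72619, 101.5),
    "CPU+ERT": (76127, 106.4),
    "CPU+OCD-FI": (71842, 100.4),
    "CPU+OCD-FI+EDAC": (73184, 102.3),
    "CPU+OCD-FI+RTREG": (76392, 106.8),
    "CPU+OCD-FI+BOTH": (77484, 108.3),
}


def relative_area(area, reference=REFERENCE_AREA):
    """Area as a percentage of the reference configuration, one decimal."""
    return round(100.0 * area / reference, 1)


if __name__ == "__main__":
    for label, (area, reported) in CONFIGS.items():
        print(f"{label:18s} {relative_area(area):6.1f} {reported:6.1f}")
```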
Conclusions
Wide spectrum (FPGA, ASIC, etc.)
FI rate does not justify real-time
Low overhead
Better C&O (controllability and observability) than radiation techniques
Less intrusive than software techniques
Should be used with the final HW and SW
Limitations in coverage, lack of standards
