SDC in Enterprise Class Servers

                                    Ishwar Parulkar
                             Sun Micr...
Silent Data Corruption in Servers

This presentation was given as part of an industry panel at DSN 2008 (Dependable Systems and Networks). The topic was "Is SDC a myth or reality?". This presentation gives the SDC perspective in the enterprise server class domain.


  1. SDC in Enterprise Class Servers
     Ishwar Parulkar
     Sun Microsystems, Inc.
     DSN 2008 Panel: SDC – Myth or Reality?
  2. Outline
     • Sources of SDC
     • Examples of cases of SDC
     • How big a concern is SDC?
       – Application space
       – Server sensitivity to SDC
     • Design/Measurement for SDC mitigation
     • Solution trends
     • Conclusions
  3. Silent Data Corruption (SDC)
     SDC is defined as incorrect data being generated in hardware and communicated to the application layer without being detected for a period of time (it might be detected eventually).
  5. Sources of SDC in Servers
     1. Cosmic radiation induced bit flips in silicon
     2. Design and process marginalities
     3. Very corner-case logic design bugs
     4. Defects occurring in silicon due to ageing
  7. Example – Cosmic Radiation
     • Sun UltraSPARC-II servers had a noticeable crash rate in the field in 2000
       – symptom was system panic, NOT SDC
     • Diagnosed as cosmic radiation induced soft errors in the external cache
       – symptom exhibited by SRAM from one vendor (IBM)
     • Several examples and experiments from the aerospace, NASA, and medical implant electronics industries
  8. Example - Design Marginality
     • “AMD Opterons suffer heat issue” - CNET 4/28/0
     • From AMD web site: http://www.amd.com/usen/0,,3715_13965,00.html?redir=CORPR01
       – “A few processors have been observed to produce inconsistent results in a non-production synthetic test environment with the convergence of the following three simultaneous conditions:
         • The running of FP intensive code sequences,
         • elevated CPU temperatures, and
         • elevated ambient temperatures”
     • In general, temperature gradients in silicon can be up to 30°C per mm on large dice
     Question: Design, Manufacturing test or In-field reliability issue?
  9. Example - Process Marginality
     • Very infrequent, intermittent parity errors noticed in the field (NOT SDC)
     • Symptom seen on a few parts
       – long, unpredictable time to failure
       – parts were from one manufacturing line
     • Diagnosed as a long route with multiple jogs
       – no DFM rule violation
       – combination of
         • location of die on wafer
         • mechanical warping
         • electrical use condition (load)
  10. Example - Logic Design Bug - (1): Famous Pentium FDIV Bug in 1994
      • Discovered by a user running code to enumerate primes
      • Symptom: reduction in precision of division operations
      • Concern in scientific/engineering and financial engineering fields
      • Source: a few missing entries in a look-up table used in floating point divide operations, not detected in verification
      • Intel estimated an MTBSDC of 27000 years; IBM estimated 24 days
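The gap between the two MTBSDC estimates on the FDIV slide is the kind of spread that appears when assumptions differ about how often a user divides and how likely a single divide is to hit the bug. The Python sketch below shows only that arithmetic. The function name and every input value are illustrative placeholders, not the figures Intel or IBM actually used.

```python
def mtbsdc_days(error_prob_per_divide: float, divides_per_day: float) -> float:
    """Mean time between silent errors in days, assuming independent divides."""
    if error_prob_per_divide <= 0 or divides_per_day <= 0:
        raise ValueError("inputs must be positive")
    return 1.0 / (error_prob_per_divide * divides_per_day)


# Hypothetical inputs, chosen only to show how sensitive the result is.
light_user = mtbsdc_days(error_prob_per_divide=1e-9, divides_per_day=1_000)
heavy_user = mtbsdc_days(error_prob_per_divide=1e-8, divides_per_day=1_000_000)

print(f"light user: about {light_user / 365:.0f} years between errors")
print(f"heavy user: about {heavy_user:.0f} days between errors")
```

With these placeholder inputs the two usage profiles differ by four orders of magnitude, which is why an MTBSDC figure is only meaningful together with its workload assumptions.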
  11. Example - Logic Design Bug - (2): A more subtle case
      • Multithreaded processor with multiple strands sharing resources
      • A 1-3 cycle window of vulnerability is created when
        – more than 1 strand is using an execution pipe with specific combinations of operations
      • SDC occurs if all of the following arrive at the trap commit unit within the 1-3 cycle window of vulnerability
        – A checkpoint state
        – A trap
        – A park request
      • Scenario pathologically possible; probability of occurring in code is close to 0
  12. Examples – Silicon Degradation
      • Several phenomena
        – Electromigration
        – Gate Oxide Breakdown
        – Channel Hot Carrier Effect
        – Negative Bias Temperature Instability
      • Addressed by DFM rules, guard-banding in design, and acceleration via burn-in during manufacturing
      • Not a major concern for SDC, because these failures do not stay silent for long
  14. Server Market Segments
      Segments: Back Office, HPC, Mainstream, Web Infrastructure
      • Back Office: CRM, ERP, BIDW, Database
      • Other segments: Finance, Manufacturing, Oil and Gas, Life Sciences, Web 2.0, Government, Storage, Service Providers
  15. Server SDC and Availability: Typical Targets

      Server Type        MTBSDC            Availability (%)
      Data Centric       100-1000 years    99.999
      Web Centric        10-100 years      99.999-99.9999
      Compute Centric    100-1000 years    99.990

      MTBF in years = 10^9 / (FIT * 24 Hours * 365 Days)
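The MTBF formula on the targets slide can be sketched in Python as follows. This is a minimal illustration that assumes the usual definition of FIT as failures per 10^9 device-hours. The function names are hypothetical, not from the presentation.

```python
HOURS_PER_YEAR = 24 * 365  # same constant as the slide: 24 Hours * 365 Days


def fit_to_mtbf_years(fit: float) -> float:
    """MTBF in years for a failure rate expressed in FIT."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return 1e9 / (fit * HOURS_PER_YEAR)


def mtbf_years_to_fit(years: float) -> float:
    """Inverse conversion: the FIT budget implied by an MTBF target in years."""
    if years <= 0:
        raise ValueError("years must be positive")
    return 1e9 / (years * HOURS_PER_YEAR)


# FIT budgets implied by the MTBSDC ranges in the table above.
for target_years in (10, 100, 1000):
    print(f"{target_years:>5} years -> {mtbf_years_to_fit(target_years):8.1f} FIT")
```

Read this way, a 100-year system MTBSDC target leaves a budget of roughly 1100 FIT for all silent uncorrected errors in the whole server.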
  16-22. Classification of Silicon Errors from a User Perspective
      The universe of silicon errors in a server chip, split by whether an error is corrected and whether it is reported:

                   Corrected                           Uncorrected
      Silent       SC: customer does not care          SU: Silent Data Corruption
      Reported     RC: required by service/customer    RU: System Crash
                   to monitor health
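The four quadrants on the classification slides (SC, SU, RC, RU) amount to a two-flag decision. A minimal Python sketch of that mapping follows. The enum and function names are hypothetical, and the descriptions paraphrase the slide labels.

```python
from enum import Enum


class ErrorClass(Enum):
    SC = "silent, corrected: customer does not care"
    SU = "silent, uncorrected: silent data corruption"
    RC = "reported, corrected: used by service/customer to monitor health"
    RU = "reported, uncorrected: system crash"


def classify(corrected: bool, reported: bool) -> ErrorClass:
    """Map the two properties of a silicon error onto the slide's quadrants."""
    if corrected:
        return ErrorClass.RC if reported else ErrorClass.SC
    return ErrorClass.RU if reported else ErrorClass.SU


# Only the silent, uncorrected quadrant counts as SDC.
print(classify(corrected=False, reported=False).value)
```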
  23. A Typical Data Centric Server

      Component                         Approx. Count    Comments
      Processors                        8-64             8-64 way systems
      ASICs                             320              Memory controllers, IO bridges, Crypto, etc.
      Memory DIMMs                      640              Depends on memory capacity
      AC/DC Power Supplies              8-10             Main power supply
      DC/DC Power Supplies              640              High and low voltage supplies
      Clocking                          64               Clock synthesizers and distribution
      Service Processor                 4                Small processors, FPGA
      Miscellaneous Small Components    1000-10000       Resistors, Capacitors, Pins, Connectors
  24-25. Server Sensitivity to Processor SDC
      [Plot: Sensitivity of Server to Processor SU Rate. x-axis: Processor SU (Silent Uncorrected) FIT, 0 to 700. y-axis: Server MTBSDC (Years), 0 to 120. Annotated points: 89 years and 42 years.]
      • A 150 FIT increase in processor implies:
        – 52.8% degradation of MTBSDC
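The 52.8% figure follows directly from the two values annotated on the plot (89 years and 42 years). The Python sketch below checks that arithmetic and adds a simple additive FIT model of a server. The model, the 16-processor count, and the zero FIT for other components are assumptions made for illustration, not the model behind the slide.

```python
HOURS_PER_YEAR = 24 * 365


def degradation(before_years: float, after_years: float) -> float:
    """Fractional loss of MTBSDC when it drops from before_years to after_years."""
    return (before_years - after_years) / before_years


def server_mtbsdc_years(n_processors: int, processor_su_fit: float,
                        other_su_fit: float = 0.0) -> float:
    """Server MTBSDC, assuming component SU FIT rates simply add up."""
    total_fit = n_processors * processor_su_fit + other_su_fit
    if total_fit <= 0:
        raise ValueError("total FIT must be positive")
    return 1e9 / (total_fit * HOURS_PER_YEAR)


# Values read off the slide's plot.
print(f"slide: {degradation(89, 42):.1%} degradation")

# Hypothetical 16-processor server, 100 FIT vs. 250 FIT per processor.
base = server_mtbsdc_years(16, 100)
worse = server_mtbsdc_years(16, 250)
print(f"model: {base:.1f} -> {worse:.1f} years ({degradation(base, worse):.1%})")
```

Because the per-processor rate is multiplied by the processor count, a modest per-chip FIT increase dominates the system-level MTBSDC in a large multiprocessor server.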
  27-37. Design for SDC Mitigation
      [Flow diagram, built up step by step over slides 27-37]
      • Top-down targets: VOC, Field Data, Marketing → System Level MTBSDC, MTBUSI Targets → Chip Level FIT Targets
      • Bottom-up estimates:
        – Raw Soft Error Rate, from SER Estimation from SPICE Simulations and Raw Static SER Measurement at LANL
        – Raw Hard Error Rate, from GOI, NBTI, CHC, EM Reliability Modeling and Accelerated Test of Samples
      • Circuit, Logic, Architecture, SW Detection, Correction, Recovery Solutions
      • Electrical, Logical and Architectural Derating → Actual Chip Level FIT
      • Comparison: Chip Level FIT Targets = Actual Chip Level FIT
        – two "Not Equal" branches, one beside the system-level targets and one beside the detection, correction, and recovery solutions
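The comparison at the center of the mitigation flow (actual chip-level FIT against the chip-level FIT target) can be sketched in Python as below. The multiplicative derating model, the coverage parameter, and all numeric values are assumptions for illustration; the slides do not give the actual method or data.

```python
from math import prod


def derated_fit(raw_fit: float, derating_factors: dict[str, float],
                coverage: float) -> float:
    """Raw FIT scaled by derating factors and by the share of errors left uncovered.

    Each derating factor is the assumed fraction of raw errors that survives
    that stage; coverage is the assumed fraction caught by detection or
    correction mechanisms.
    """
    for name, factor in derating_factors.items():
        if not 0.0 <= factor <= 1.0:
            raise ValueError(f"derating factor {name!r} must be in [0, 1]")
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be in [0, 1]")
    return raw_fit * prod(derating_factors.values()) * (1.0 - coverage)


def meets_target(actual_fit: float, target_fit: float) -> bool:
    """True when the estimated chip FIT is within the chip-level target."""
    return actual_fit <= target_fit


# Hypothetical numbers only.
factors = {"electrical": 0.5, "logical": 0.4, "architectural": 0.25}
actual = derated_fit(raw_fit=10_000, derating_factors=factors, coverage=0.9)
print(f"actual chip FIT: {actual:.1f}, meets 100 FIT target: {meets_target(actual, 100)}")
```

When the check fails, the flow's "Not Equal" branches suggest two levers: revisit the targets, or add detection, correction, and recovery mechanisms, which this sketch models as higher coverage.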
  39. Solution Trends for SDC
      • Unit level redundancy is too costly
      • Logic and flops need to be protected
      • Circuit level solutions can be limiting
      • Logic/architectural solutions more promising
      • Periodic on-line testing for predicting degradation
      • Trillions of random verification cycles
  41. Conclusions
      • SDC is a reality
        – criticality and investment in mitigation highly dependent on application space
      • Solutions to SDC need to be low overhead
        – mainframe level reliability/availability at server price points
      • Need more accurate estimation of SDC
      • SDC due to design bugs and design/process marginalities still hard to estimate
  42. Backup Slides
  43. Using Sun Processor “Ranch” (Testing in Broomfield, CO)
  44. Broomfield Test Setup
      • Altitude and geomagnetic location give ~4.1x acceleration over sea-level
      • 600 US-III Processors
      • 3 months of testing
      • Used modified POST code to write 0's and 1's to memory arrays and observe bit flips
      • Monitored power supply failures as well
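The Broomfield numbers (600 processors, 3 months, ~4.1x acceleration) translate into sea-level-equivalent device-hours, which in turn bound the FIT rates such a test can resolve. A minimal Python sketch of that arithmetic follows. Treating "3 months" as 90 days and the error count used in the example are assumptions; the slides do not report the observed error counts.

```python
def equivalent_device_hours(n_devices: int, test_hours: float,
                            acceleration: float) -> float:
    """Sea-level-equivalent device-hours accumulated by an accelerated test."""
    return n_devices * test_hours * acceleration


def observed_fit(n_errors: int, device_hours: float) -> float:
    """Point estimate of FIT (errors per 10^9 device-hours)."""
    if device_hours <= 0:
        raise ValueError("device_hours must be positive")
    return n_errors * 1e9 / device_hours


TEST_HOURS = 90 * 24  # assumption: "3 months" taken as 90 days
hours = equivalent_device_hours(n_devices=600, test_hours=TEST_HOURS, acceleration=4.1)
print(f"equivalent device-hours: {hours:,.0f}")

# Hypothetical: the per-device FIT that a single observed error would imply.
print(f"FIT implied by one error: {observed_fit(1, hours):.0f}")
```

Under these assumptions the test accumulates a few million equivalent device-hours, so it can only resolve per-device rates on the order of hundreds of FIT, which is one reason to complement large-volume testing with neutron beam irradiation.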
  45. Soft Error Testing of SUN Processors - A Chronology

      Date                Process Node    Device Under Test      Location      Test Type
      8/2000              250nm, 180nm    US III                 Los Alamos    Neutron Irradiation
      11/2000 – 2/2001    250nm, 180nm    US III                 Broomfield    Large Volume (600 CPUs)
      11/2002             150nm, 130nm    US III                 Los Alamos    Neutron Irradiation
      11/2003             130nm, 90nm     US IIIi, IIIi+         Los Alamos    Neutron Irradiation
      8/2004              -               Commodity SRAM         Berkeley      Neutron Irradiation
      4/2005              90nm            US IIIi+               Los Alamos    Neutron Irradiation
      11/2005             90nm            US T1, IIIi+, IV+      Los Alamos    Neutron Irradiation
      12/2005             90nm            US T1                  Los Alamos    Neutron Irradiation
      12/2006             65nm            US T2                  Los Alamos    Neutron Irradiation
      12/2007             65nm            US T2/Nextgen Proc     Los Alamos    Neutron Irradiation
  46. A Typical LANL Test Setup
      • Recently tested UltraSPARC T2 and a next generation processor in 65nm technology
      • Ran multiple systems in parallel
      • Different parts, voltages & test patterns
      • Beam time efficiency
        – 12% beam off
        – 5% of time in setup, debug
        – 83% of beam time gave useful data
      • Cumulative 775 hours of data gathered
  47. Design/Process Marginality: Where do you solve it?
      • Design: Guard-bands (Loss of Performance)
      • Field: In-line Correction (Area/Power Cost)
      • Manufacturing: Wider Test Box (Loss of Yield)
