How FPGAs Work When They Don’t
- and how Feynman can help us understand
Summary
Clock domain crossings, timing violations, single event effects and accelerated aging in hostile environments, power supply fluctuations, etc. As if
the learning curve for HDL programming weren't steep enough already, as soon as we have mastered the archaic trade of writing synthesizable code for
FPGAs, we find physical reality intruding, breaking our assumptions, and removing any remaining illusions we might have about the soothing
comforts of deterministic programming. Physical reality is a nuisance; one we should deal with, but often do not. And understandably so. The
non-ideal behavior of CMOS is difficult to simulate, difficult to grasp, and a hassle to mitigate.
Fortunately, as we shall see in this presentation, the learning effort can be greatly reduced, as long as we apply the right perspective. One such
perspective is Richard Feynman's File Clerk Model (FCM), which is both intuitive and instructive when the goal is to understand "how FPGAs work when they
don't". Taking the FCM as our starting point, we go through the following topics:
● Basic computer organization in FPGAs
● Error mechanisms relevant in FPGA design
● Applying the FCM to explain
○ Clock domain crossing logic
○ SEE due to radiation
○ Timing violations
○ Voltage and frequency scaling
Resumé
Alex Birklykke, alex@space-inventor.com
● 2010: MSc.EE in Applied Signal Processing and Implementation
● 2015: PhD - Modeling and Predicting the behavior of computers operating without
guardbands (case study of FPGAs)
● 2013-2016: FPGA development at Rohde & Schwarz (WLAN layer-1)
● 2016-2017: FPGA development at GomSpace A/S
● 2017- : Newspace entrepreneur with Space Inventor
Research
● Empirical study of FPGA behavior when subject to
voltage and frequency scaling
● Based on 65 nm Spartan 3E
● Objective was to determine the cause of errors, as well as
model and predict errors.
● Research confirmed that
○ FPGAs are highly noise-immune devices
○ Timing violations are the cause of errors in
voltage/frequency scaled devices
○ Precise error behavior is hard to predict
Presentation objective
Provide an intuition about how FPGAs work when they don’t
What could go wrong? Timing Closure
● Timing constraints are not met
● Multi-seed P&R or refactoring
does not always solve the problem,
especially for systems with high
FPGA utilization
● Sometimes it is necessary to
ship systems with timing
violations
● How to assess the criticality of
timing violations?
What could go wrong? Clock domain crossings
● Clock domain crossings are commonly
encountered in FPGA applications
● Metastable behavior must be mitigated
● The error mechanism must be thoroughly understood
in order to mitigate the problem
What could go wrong? Temperature effects and ageing
● Ring oscillator frequency in Virtex-5 FPGA vs:
○ Left) Location and temperature.
○ Right) Localized wearout
● Might lead to unforeseen timing violations
S. Zhang, "Delay Characterization in FPGA-based Reconfigurable Systems,"
Master's thesis, 2013.
What could go wrong? Radiation induced Ageing
● Microsemi SmartFusion2 SoC FPGA (65nm)
● Irradiated with a Cobalt-60 gamma source
● Accelerated ageing observed
● For comparison, 20 krad ≈ 5 years in low Earth orbit
● A 10% timing overhead must be introduced to
ensure timing closure after 5 years
● Bad news: Other studies have found that the Flash
configuration memory cannot be reprogrammed
after a few krad
N. Rezzak, J. J. Wang, C. K. Huang, V. Nguyen and G. Bakker, "Total Ionizing Dose Characterization of 65 nm
Flash-Based FPGA," 2014 IEEE Radiation Effects Data Workshop (REDW), Paris, 2014, pp. 1-5.
What could go wrong? Chasing better performance
Voltage and/or frequency scaling results in timing errors
A. Birklykke, P. Koch, R. Prasad, L. Alminde and Y. Le Moullec, "Empirical verification
of fault models for FPGAs operating in the subcritical voltage region," 2013 23rd
International Workshop on Power and Timing Modeling, Optimization and Simulation
(PATMOS), Karlsruhe, 2013, pp. 16-23.
It’s all about timing
How FPGAs work when they don’t?
Feynman's Lectures on Computation
● Write-up of Feynman's lectures on computation
given at Caltech from 1983 to 1987
● Includes an introductory chapter on computation,
as well as five chapters addressing the limitations
of computers.
● Introduces the so-called “File Clerk Model” to
explain the system-level behavior of sequential
computers.
● Feynman is known as one of the great communicators
of science
The File Clerk Model
● Computers are data transfer machines first, and
only secondarily arithmetic devices
● The file clerk primarily performs data transfer.
Data processing is only secondary
● Feynman: Let’s use the file clerk as a metaphor
for understanding basic computer structure
The File Clerk Model
File clerk “total sales for California” procedure:
    Take out next “sales” card
    If “Location” says California, then
        Take out “total” card
        Add sales number to number on card
        Put “total” card back
    Put “sales” card back
    Repeat
[Figure: file cabinet holding the sales cards (e.g. Salesman: “Smith”, Location: “Tahoe”, Salary: 100, Sales: 1000) and the “total” card (xxx.xx)]
The File Clerk Model
File clerk “total sales for California” procedure:
    Take out next “sales” card
    If “Location” says California, then
        Add sales number to S
    Put “sales” card back
    Repeat until end
    Take out “total” card
    Replace total with S
    Put “total” card back
[Figure: file cabinet holding the sales cards (e.g. Salesman: “Smith”, Location: “Tahoe”, Salary: 100, Sales: 1000) and the “total” card (xxx.xx); local scratch pad holding S : 0]
The local scratch pad reduces data transfer to and from the file cabinet,
thus increasing file clerk performance
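The difference between the two procedures above can be made concrete by counting how often the “total” card is moved. The following is a minimal sketch, not part of the original slides: the card layout, the extra salesmen "Jones" and "Lee", and the function names are hypothetical.

```python
# Hypothetical sales cards; only the "Smith" card appears on the slide.
sales_cards = [
    {"salesman": "Smith", "location": "Tahoe", "sales": 1000},
    {"salesman": "Jones", "location": "California", "sales": 250},
    {"salesman": "Lee", "location": "California", "sales": 400},
]


def total_without_scratch_pad(cards, location):
    """First procedure: update the "total" card for every matching sales card."""
    cabinet = {"total": 0}
    total_card_transfers = 0
    for card in cards:
        if card["location"] == location:
            total = cabinet["total"]     # take out "total" card
            total += card["sales"]       # add sales number to number on card
            cabinet["total"] = total     # put "total" card back
            total_card_transfers += 2    # one take-out plus one put-back
    return cabinet["total"], total_card_transfers


def total_with_scratch_pad(cards, location):
    """Second procedure: accumulate in S, touch the "total" card only once."""
    cabinet = {"total": 0}
    s = 0                                # local scratch pad
    for card in cards:
        if card["location"] == location:
            s += card["sales"]           # add sales number to S
    cabinet["total"] = s                 # take out, replace total with S, put back
    return cabinet["total"], 2


print(total_without_scratch_pad(sales_cards, "California"))
print(total_with_scratch_pad(sales_cards, "California"))
```

Both procedures produce the same total, but the number of “total” card transfers grows with the number of matching cards in the first version and stays constant in the second.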
The File Clerk Model - Stored Program Clerking
Program/data memory (Fibonacci.exe):
    1. R2 <- 1
    2. R3 <- ADD (R1) (R2)
    3. R1 <- (R2)
    4. R2 <- (R3)
    5. R4 <- SUB 1000 (R3)
    6. PC <- 8 IF (CARRY)
    7. PC <- 2
    8. HALT
Generic file clerk with instruction set:
    Fetch instruction from address PC
    PC <- (PC) + 1
    Do instruction
User registers: R1 : 0, R2 : 0, R3 : 0, R4 : 0
Control registers: PC : 0, CARRY : 0
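To see what the stored program does, the fetch / increment / execute loop can be mimicked in a few lines. This is a rough sketch under two assumptions that the slide does not state: execution starts at instruction 1, and CARRY is set when the subtraction 1000 - R3 would borrow (i.e. R3 exceeds 1000). The function name and the `max_steps` guard are hypothetical.

```python
def run_fibonacci(limit=1000, max_steps=10_000):
    """Run the eight-instruction program on a tiny register machine."""
    reg = {"R1": 0, "R2": 0, "R3": 0, "R4": 0}
    carry = False
    pc = 1  # assumption: execution starts at instruction 1
    for _ in range(max_steps):
        instruction = pc              # fetch instruction from address PC
        pc += 1                       # PC <- (PC) + 1
        if instruction == 1:          # R2 <- 1
            reg["R2"] = 1
        elif instruction == 2:        # R3 <- ADD (R1) (R2)
            reg["R3"] = reg["R1"] + reg["R2"]
        elif instruction == 3:        # R1 <- (R2)
            reg["R1"] = reg["R2"]
        elif instruction == 4:        # R2 <- (R3)
            reg["R2"] = reg["R3"]
        elif instruction == 5:        # R4 <- SUB 1000 (R3)
            reg["R4"] = limit - reg["R3"]
            carry = reg["R4"] < 0     # assumption: CARRY flags a borrow
        elif instruction == 6:        # PC <- 8 IF (CARRY)
            if carry:
                pc = 8
        elif instruction == 7:        # PC <- 2
            pc = 2
        elif instruction == 8:        # HALT
            return reg
    raise RuntimeError("program did not halt")


registers = run_fibonacci()
print(registers["R3"])  # first Fibonacci number above 1000
```

Note how most instructions only move data between registers; only instructions 2 and 5 do arithmetic, which matches the point that the clerk is mainly a data transfer function.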
The File Clerk Model with deadlines
● Same model, but where results must be available
at a certain deadline.
● Imagine an angry office manager dictating the
pace
● Claim: The time-dependence allows us to
intuitively explain how computers work when
they don’t
● Trick: Use sympathetic insight/empathy for
our file clerk
Intuitive Explanation of Errors using the File Clerk Model
Cause            | FCM equivalent      | FCM effect                                                        | Real-life effect
Under-voltage    | Starving clerk      | Less effective clerk, more time to do same task. Unmet deadlines  | Timing degradation
Overclocking     | Tight deadlines     | Less room for missteps                                            | Slack reduction
Electrical noise | Office noise        | Processing errors more likely, variable execution time            | Lower signal integrity, probabilistic propagation delay
Device ageing    | Old file clerk      | Loss of wit and dexterity. More time to do same job               | Timing degradation
High temperature | Uncomfortable clerk | Harder to focus. More time to do same job                         | Timing degradation
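The “timing degradation” and “slack reduction” entries in the table above both come down to the same quantity: the slack of a register-to-register path. A minimal sketch follows, assuming a simplified model that ignores setup time and clock skew; all numbers and the `delay_scale` parameter are hypothetical illustrations, not measurements.

```python
def slack_ns(clock_period_ns, path_delay_ns, delay_scale=1.0):
    """Slack of a register-to-register path; negative means a timing violation.

    delay_scale models a slower clerk (under-voltage, ageing, high temperature);
    a shorter clock_period_ns models a tighter deadline (overclocking).
    """
    return clock_period_ns - path_delay_ns * delay_scale


nominal = slack_ns(10.0, 8.0)                     # deadline comfortably met
overclocked = slack_ns(7.5, 8.0)                  # tighter deadline
degraded = slack_ns(10.0, 8.0, delay_scale=1.3)   # same deadline, slower clerk

for name, value in [("nominal", nominal), ("overclocked", overclocked), ("degraded", degraded)]:
    status = "ok" if value >= 0 else "timing violation"
    print(f"{name}: {value:+.1f} ns ({status})")
```

Either route ends in the same place: once the slack goes negative, the clerk no longer finishes before the office manager's deadline.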
Adapting the File Clerk Model for FPGAs
● Timed FCM
● Think of a really simple-minded file clerk
● Vocabulary restricted to “yes”, “no”, and “maybe”
○ Maybe ~ Metastability
● Instructions limited to Boolean expressions: file
clerks become LUTs
● Important differences:
○ Program is unrolled into one long pipeline
○ Registers and file clerks are distributed
Yes, no, maybe?
Adapting the File Clerk Model for FPGAs
● “File clerk production line”
● Information transfer is still the dominating activity
● System-level intuition about the FCM still holds
[Figure: production line of file clerks placed between registers (Reg); labels: Input data, Scratch pad, File clerks, Output data]
Mechanics of Timing Errors
Q: Assuming that we have timing violations, what
happens?
Q: What conditions must be met before a timing violation
results in a logic error?
Q: When do we have to worry?
Sensitization Criteria
Timing violations are a necessary condition for timing
errors, but not a sufficient one. The circuit must also be
exercised.
FCM analogy: An idle “file clerk production line” does not
make errors
[Figure: register-to-register path (Reg to Reg) with input stream ..., X2, X1 and output stream ..., Y2, Y1]
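The sensitization criterion can be illustrated with a toy model of a path that violates timing. The sketch below assumes, purely for illustration, that the logic is one cycle too slow, so the capturing register latches the function value of the previous input; the function names are hypothetical.

```python
def captured_outputs(inputs, f):
    """Values latched by the capturing register on a path that violates timing.

    Assumed toy model: the logic is too slow, so at each clock edge the
    register still sees f() of the previous input instead of the current one.
    """
    captured = []
    previous = inputs[0]  # assume the path was settled before the first cycle
    for x in inputs:
        captured.append(f(previous))
        previous = x
    return captured


def timing_errors(inputs, f):
    """Count cycles where the captured value differs from the correct one."""
    expected = [f(x) for x in inputs]
    return sum(c != e for c, e in zip(captured_outputs(inputs, f), expected))


def lsb(x):
    """Example logic function F: least significant bit of the input."""
    return x & 1


idle = timing_errors([3, 3, 3, 3], lsb)        # input never changes
masked = timing_errors([0, 2, 4, 6], lsb)      # input changes, output does not
exercised = timing_errors([0, 1, 0, 1], lsb)   # output toggles every cycle
print(idle, masked, exercised)  # prints 0 0 3
```

The same timing violation is present in all three runs, but only the input sequence that actually exercises the output produces errors.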
Patience solves all problems
[Figure: register-to-register path (Reg to Reg); the repeated input ..., X, X, X, X, X produces the output ..., Y, Y, #, @, ±]
By repeating the input, the output will eventually settle
to the correct error-free value
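The patience principle can be sketched with a similar toy model. It assumes, hypothetically, that a slow path needs a fixed number of clock periods to settle and that any value captured before then is unpredictable; the function names and the parameters `hold_cycles` and `settle_cycles` are illustrative only.

```python
import random


def hold_input(x, f, hold_cycles, settle_cycles, rng):
    """Capture the output of a slow path while the input x is held constant.

    Assumed toy model: the path needs settle_cycles clock periods to settle;
    before that, the captured bit is unpredictable.
    """
    outputs = []
    for cycle in range(hold_cycles):
        if cycle < settle_cycles:
            outputs.append(rng.choice([0, 1]))  # the "#, @, ±" phase
        else:
            outputs.append(f(x))                # settled: correct value
    return outputs


def parity(x):
    """Example logic function F: parity of the input bits."""
    return bin(x).count("1") % 2


rng = random.Random(1)
outputs = hold_input(0b1011, parity, hold_cycles=6, settle_cycles=3, rng=rng)
print(outputs[3:])  # the last three captures all equal parity(0b1011)
```

The first few captures may be wrong, but as long as the input is held for longer than the path needs to settle, the later captures are correct.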
Two Primary Error Modes
[Figure: register-to-register path (Reg to Reg) through logic F; the input transitions from X1 to X2, the output from F(X1) to F(X2)]
● Dynamic hazard when F(X1) != F(X2) → possible “stuck-at” error
● Static hazard when F(X1) == F(X2) → possible “bit-flip” error
Generation of “Maybe’s”
● Register inputs must be stable during the setup
and hold period (aperture).
● Unstable signals during latching → probability of
metastability
● Given sufficient patience, “maybe’s” will settle to
a fixed yes or no. However, there is no guarantee
that the value is correct (coin flip)
● With some probability, logic hazards can result in
“maybe’s”
Clock Domain Crossing
● Ubiquitous in FPGA designs
● Metastable behavior in receiving clock domain
● Critical for control signals
● Data signals are usually less critical (but it
depends)
● Constant signals usually not critical (e.g.
configuration signals for subsystem)
Clock Domain Crossing
Classical mitigation using synchronizer
● Decreases the probability of “maybe’s”
○ More synchronizer stages, lower probability
● No guarantee for correct signal transfer!
● To ensure signal integrity, the patience principle
must be applied
○ Sig1 must be repeated
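The “more stages, lower probability” point is commonly quantified with the textbook mean-time-between-failures estimate MTBF = exp(t_r / τ) / (T_w · f_clk · f_data), where t_r is the time available for a “maybe” to resolve. The sketch below is not taken from the slides; the device parameters (τ, T_w) are made-up placeholders, and it assumes each extra stage adds roughly one clock period of resolution time.

```python
import math


def synchronizer_mtbf_s(stages, f_clk_hz, f_data_hz, tau_s, t_window_s, t_setup_s=0.0):
    """Textbook MTBF estimate (in seconds) for an N-flop synchronizer.

    Assumption: resolution time is (stages - 1) clock periods, each reduced
    by the setup time of the next flop.
    """
    if stages < 2:
        raise ValueError("a synchronizer needs at least two flip-flops")
    t_resolve = (stages - 1) * (1.0 / f_clk_hz - t_setup_s)
    return math.exp(t_resolve / tau_s) / (t_window_s * f_clk_hz * f_data_hz)


# Hypothetical parameters: 100 MHz receiving clock, 10 MHz data toggle rate,
# tau = 100 ps resolution time constant, 100 ps metastability window.
params = dict(f_clk_hz=100e6, f_data_hz=10e6, tau_s=100e-12, t_window_s=100e-12)
mtbf2 = synchronizer_mtbf_s(2, **params)
mtbf3 = synchronizer_mtbf_s(3, **params)
print(f"2 stages: {mtbf2:.2e} s, 3 stages: {mtbf3:.2e} s")
```

The estimate grows exponentially with each added stage but never becomes infinite, which is why the synchronizer reduces the probability of “maybe’s” without guaranteeing a correct transfer.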
When to worry about timing violations?
Evaluate and accept
● Some data signals
● Debug
● Configuration
● Signals that are low-frequency relative to fclk
Evaluate and avoid
● Mitigate
○ Switch to level signaling
○ Add synchronizers
● Refactor
That’s all folks
