How FPGAs Work When They Don’t
- and how Feynman can help us understand

Summary
Clock domain crossings, timing violations, single event effects, accelerated aging in hostile environments, power supply fluctuations, and so on:
as if the learning curve for HDL programming were not steep enough already. As soon as we have mastered the archaic trade of writing synthesizable
code for FPGAs, we find physical reality intruding, breaking our assumptions, and removing any remaining illusions we might have about the
soothing comforts of deterministic programming. Physical reality is a nuisance, one we should deal with but often do not. And understandably so:
the non-ideal behavior of CMOS is difficult to simulate, difficult to grasp, and a hassle to mitigate.
Fortunately, as we shall see in this presentation, the learning effort can be greatly reduced if we apply the right perspective. One such
perspective is Richard Feynman's File Clerk Model (FCM), which is both intuitive and instructive when the goal is to understand "how FPGAs work
when they don't". Taking the FCM as our starting point, we go through the following topics:
● Basic computer organization in FPGAs
● Error mechanisms relevant in FPGA design
● Applying the FCM to explain
○ Clock domain crossing logic
○ SEE due to radiation
○ Timing violations
○ Voltage and frequency scaling

Résumé
Alex Birklykke, alex@space-inventor.com
● 2010: MSc.EE in Applied Signal Processing and Implementation
● 2015: PhD - Modeling and predicting the behavior of computers operating without guardbands (case study of FPGAs)
● 2013-2016: FPGA development at Rohde & Schwarz (WLAN layer-1)
● 2016-2017: FPGA development at GomSpace A/S
● 2017- : Newspace entrepreneur with Space Inventor

Research
● Empirical study of FPGA behavior when subject to voltage and frequency scaling
● Based on 65 nm Spartan 3E
● Objective was to determine the cause of errors, and to model and predict them
● Research confirmed that
○ FPGAs are very noise-immune devices
○ Timing violations are the cause of errors in voltage/frequency-scaled devices
○ Precise error behavior is hard to predict

Presentation objective
Provide an intuition about how FPGAs work when they don’t

What could go wrong? Timing Closure
● Timing constraints are not met
● Multi-seed P&R or refactoring does not always solve the problem, especially for systems with high FPGA utilization
● Sometimes it is necessary to ship systems with timing violations
● How do we assess the criticality of timing violations?

What could go wrong? Clock domain crossings
● Clock domain crossings are commonly encountered in FPGA applications
● Metastable behavior must be mitigated
● The error mechanism must be thoroughly understood in order to mitigate the problem

What could go wrong? Temperature effects and ageing
● Ring oscillator frequency in Virtex-5 FPGA vs:
○ Left) Location and temperature.
○ Right) Localized wearout
● Might lead to unforeseen timing violations
S. Zhang, "Delay Characterization in FPGA-based Reconfigurable Systems," Master's thesis, 2013.

What could go wrong? Radiation induced Ageing
● Microsemi SmartFusion2 SoC FPGA (65 nm)
● Irradiated with a Cobalt-60 gamma source
● Accelerated ageing observed
● For comparison, 20 krad ≈ 5 years in low Earth orbit
● A 10% timing overhead must be introduced to ensure timing closure after 5 years
● Bad news: other studies have found that the Flash configuration memory cannot be reprogrammed after a few krad
N. Rezzak, J. J. Wang, C. K. Huang, V. Nguyen and G. Bakker, "Total Ionizing Dose Characterization of 65 nm
Flash-Based FPGA," 2014 IEEE Radiation Effects Data Workshop (REDW), Paris, 2014, pp. 1-5.

What could go wrong? Chasing better performance
Voltage and/or frequency scaling results in timing errors
A. Birklykke, P. Koch, R. Prasad, L. Alminde and Y. Le Moullec, "Empirical verification
of fault models for FPGAs operating in the subcritical voltage region," 2013 23rd
International Workshop on Power and Timing Modeling, Optimization and Simulation
(PATMOS), Karlsruhe, 2013, pp. 16-23.

It’s all about timing
How FPGAs work when they don’t?

Feynman's Lectures on Computation
● Write-up of Feynman's lectures on computation given at Caltech from 1983 to 1987
● Includes an introductory chapter on computation, as well as five chapters addressing the limitations of computers
● Introduces the so-called “File Clerk Model” to explain the system-level behavior of sequential computers
● Feynman is known as one of the great communicators of science

The File Clerk Model
● Computers are data transfer machines first, and arithmetic devices only second
● The file clerk is primarily a data transfer function. Data processing is only secondary
● Feynman: Let’s use the file clerk as a metaphor for understanding basic computer structure

File clerk “total sales for California” procedure
  Take out next “sales” card
  If “Location” says California, then
    Take out “total” card
    Add sales number to number on card
    Put “total” card back
  Put “sales” card back
  Repeat
[Figure: file cabinet holding the sales cards (example card: Salesman “Smith”, Location “Tahoe”, Salary 100, Sales 1000) and the total card (xxx.xx)]

File clerk “total sales for California” procedure
  Take out next “sales” card
  If “Location” says California, then
    Add sales number to S
  Put “sales” card back
  Repeat until end
  Take out “total” card
  Replace total with S
  Put “total” card back
[Figure: the same file cabinet with sales cards and total card, plus a local scratch pad holding S (initially 0)]
The local scratch pad limits data transfer, thus increasing file clerk performance.
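
The two procedures above can be compared with a small sketch that counts how many times the clerk touches the file cabinet. The card contents, the function names, and the rule that every "take out" or "put back" costs one transfer are assumptions made for illustration; they are not part of the original slides.

```python
# Hypothetical sales cards; only "location" and "sales" matter here.
CARDS = [
    {"salesman": "Smith", "location": "California", "sales": 1000},
    {"salesman": "Jones", "location": "Nevada", "sales": 500},
    {"salesman": "Lee", "location": "California", "sales": 250},
    {"salesman": "Kim", "location": "California", "sales": 750},
]


def total_without_scratch_pad(cards):
    """First procedure: update the total card for every matching sale."""
    total_card = 0
    transfers = 0
    for card in cards:
        transfers += 1              # take out sales card
        if card["location"] == "California":
            transfers += 1          # take out total card
            total_card += card["sales"]
            transfers += 1          # put total card back
        transfers += 1              # put sales card back
    return total_card, transfers


def total_with_scratch_pad(cards):
    """Second procedure: accumulate in S, touch the total card once."""
    s = 0                           # local scratch pad
    transfers = 0
    for card in cards:
        transfers += 1              # take out sales card
        if card["location"] == "California":
            s += card["sales"]      # no cabinet access needed
        transfers += 1              # put sales card back
    transfers += 2                  # take out and put back total card
    return s, transfers


if __name__ == "__main__":
    print(total_without_scratch_pad(CARDS))
    print(total_with_scratch_pad(CARDS))
```

Both procedures produce the same total, but under these assumptions the scratch-pad version needs fewer cabinet transfers whenever more than one card matches.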

The File Clerk Model - Stored Program Clerking
Generic file clerk with instruction set
Clerk procedure:
  Fetch instruction from address PC
  PC <- (PC) + 1
  Do instruction
Program/data memory (Fibonacci.exe):
  1. R2 <- 1
  2. R3 <- ADD (R1) (R2)
  3. R1 <- (R2)
  4. R2 <- (R3)
  5. R4 <- SUB 1000 (R3)
  6. PC <- 8 IF (CARRY)
  7. PC <- 2
  8. HALT
User registers: R1 : 0, R2 : 0, R3 : 0, R4 : 0
Control registers: PC : 0, CARRY : 0
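
The stored program above can be traced with a minimal interpreter sketch. Two details are assumptions, since the slide does not spell them out: the program counter is taken to start at instruction 1, and CARRY is taken to be set when the subtraction 1000 - R3 goes negative (a borrow). The function name `run_fibonacci` is hypothetical.

```python
def run_fibonacci(limit=1000, max_steps=10_000):
    """Run the eight-instruction Fibonacci program from the slide."""
    regs = {"R1": 0, "R2": 0, "R3": 0, "R4": 0}
    ctrl = {"PC": 1, "CARRY": 0}    # assumption: execution starts at line 1

    for _ in range(max_steps):
        pc = ctrl["PC"]             # fetch instruction from address PC
        ctrl["PC"] = pc + 1         # PC <- (PC) + 1
        # do instruction
        if pc == 1:
            regs["R2"] = 1
        elif pc == 2:
            regs["R3"] = regs["R1"] + regs["R2"]
        elif pc == 3:
            regs["R1"] = regs["R2"]
        elif pc == 4:
            regs["R2"] = regs["R3"]
        elif pc == 5:
            diff = limit - regs["R3"]
            regs["R4"] = diff
            # assumption: CARRY flags a borrow, i.e. R3 exceeded the limit
            ctrl["CARRY"] = 1 if diff < 0 else 0
        elif pc == 6:
            if ctrl["CARRY"]:
                ctrl["PC"] = 8
        elif pc == 7:
            ctrl["PC"] = 2
        elif pc == 8:
            return regs             # HALT
    raise RuntimeError("program did not halt")


if __name__ == "__main__":
    print(run_fibonacci())
```

Under these assumptions the clerk halts as soon as R3 holds the first Fibonacci number above 1000. Note that most instructions only move data between registers, which fits the view of a computer as a data transfer machine first.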

The File Clerk Model with deadlines
● Same model, but results must be available by a certain deadline
● Imagine an angry office manager dictating the pace
● Claim: The time dependence allows us to intuitively explain how computers work when they don’t
● Trick: Use sympathetic insight/empathy for our file clerk

Intuitive Explanation of Errors using the File Clerk Model
Cause            | FCM equivalent      | FCM effect                                                           | Real-life effect
Under-voltage    | Starving clerk      | Less effective clerk; more time to do the same task; unmet deadlines | Timing degradation
Overclocking     | Tight deadlines     | Less room for missteps                                               | Slack reduction
Electrical noise | Office noise        | Processing errors more likely; variable execution time               | Lower signal integrity; probabilistic propagation delay
Device ageing    | Old file clerk      | Loss of vitality and dexterity; more time to do the same job         | Timing degradation
High temperature | Uncomfortable clerk | Harder to focus; more time to do the same job                        | Timing degradation

Adapting the File Clerk Model for FPGAs
● Timed FCM
● Think of a really simple-minded file clerk
● Vocabulary restricted to “yes”, “no”, and “maybe”
○ Maybe ~ Metastability
● Instructions limited to Boolean expressions: the file clerk becomes a LUT
● Important differences:
○ Program is unrolled into one long pipeline
○ Registers and file clerks are distributed
Yes, no, maybe?

Adapting the File Clerk Model for FPGAs
● “File clerk production line”
● Information transfer is still the dominating activity
● System-level intuition about the FCM still holds
[Figure: file clerks placed between two registers (Reg); labels: File clerks, Scratch pad, Input data, Output data]

Mechanics of Timing Errors
Q: Assuming that we have timing violations, what happens?
Q: What conditions must be met before a timing violation results in a logic error?
Q: When do we have to worry?

Sensitization Criteria
Timing violations are a necessary condition for timing errors, but not a sufficient one. The circuit must also be exercised.
FCM analogy: An idle “file clerk production line” does not make errors.
[Figure: two registers (Reg) with input sequence …, X2, X1 and output sequence …, Y2, Y1]

Patience solves all problems
[Figure: two registers (Reg) with repeated input …, X, X, X, X, X and output …, Y, Y, #, @, ±]
By repeating the input, the output will eventually settle to the correct, error-free value.
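
The sensitization criterion and the patience principle can both be illustrated with a deliberately crude sketch. It assumes that a violating path simply needs `settle_cycles` clock periods to settle and that the output register keeps its previous (stale) value until then; glitches and metastability are ignored. The function names and the stale-value behavior are assumptions made for illustration only.

```python
def capture(inputs, f, settle_cycles=2):
    """Return the value latched by the output register on every clock edge.

    Assumes inputs[0] has already been applied long enough to settle.
    """
    held = inputs[0]            # value currently driving the logic
    age = settle_cycles         # clock periods the input has been stable
    captured = f(held)          # last value that had time to settle
    outputs = []
    for x in inputs:
        if x == held:
            age += 1
        else:
            held, age = x, 1    # new input: the logic starts over
        if age >= settle_cycles:
            captured = f(held)  # path has settled, correct value latched
        outputs.append(captured)
    return outputs


def count_errors(inputs, f, settle_cycles=2):
    """Count clock edges where the latched value differs from f(input)."""
    outputs = capture(inputs, f, settle_cycles)
    return sum(1 for x, y in zip(inputs, outputs) if y != f(x))


if __name__ == "__main__":
    identity = lambda x: x
    print(count_errors([3, 3, 3, 3], identity))   # idle line
    print(capture([0, 1, 1, 1], identity))        # repeated input
```

In this model an idle line never errs, a single input change produces a wrong value only while the path is still settling, and a change that leaves F unchanged produces no error at all.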

Two Primary Error Modes
[Figure: logic function F between two registers (Reg); the input transitions from X1 to X2, giving the output sequence F(X2), F(X1)]
● Dynamic hazard when F(X1) != F(X2) → possible “stuck-at” error
● Static hazard when F(X1) == F(X2) → possible “bit-flip” error

Generation of “Maybe’s”
● Register inputs must be stable during the setup and hold period (aperture)
● Unstable signals during latching → probability of metastability
● Given sufficient patience, “maybe’s” will settle to a fixed yes or no. However, there is no guarantee that the value is correct (coin flip)
● With some probability, logic hazards can result in “maybe’s”

Clock Domain Crossing
● Ubiquitous in FPGA designs
● Metastable behavior in receiving clock domain
● Critical for control signals
● Data signals are usually less critical (but it
depends)
● Constant signals usually not critical (e.g.
configuration signals for subsystem)

Clock Domain Crossing
Classical mitigation using a synchronizer
● Decreases the probability of “maybe’s”
○ More stages, lower probability
● No guarantee of correct signal transfer!
● To ensure signal integrity, the patience principle must be applied
○ Sig1 must be repeated
When to worry about timing violations?
Evaluate and accept
● Some data signals
● Debug
● Configuration
● Signals that are low-frequency relative to fclk
Evaluate and avoid
● Mitigate
○ Switch to level signaling
○ Add synchronizers
● Refactor
