
KIT – The Research University in the Helmholtz Association
Institut für Technik der
Informationsverarbeitung (ITIV)
Institute directors
Prof. Dr.-Ing. Dr. h. c. J. Becker (spokesperson)
Prof. Dr.-Ing. E. Sax
Prof. Dr. rer. nat. W. Stork
Project thesis no. ID-2138
by Mr. Dominik Maszczyk
Prof. Dr.-Ing. Dr. h. c. Jürgen Becker
On the Sensitivity of FPGA Configuration
Bits to Radiation-Induced Soft Errors
Start: 09.12.2015
Submission: 09.05.2016
Supervisor: M. Sc. Eng. Mojtaba Ebrahimi
Institut für Technische Informatik
Co-examiner: Prof. Dr. Mehdi Tahoori
Institut für Technische Informatik

I hereby declare that I wrote my thesis on my own and that I have followed the regulations
relating to good scientific practice of the Karlsruhe Institute of Technology (KIT) in its latest
form. I did not use any unacknowledged sources or means and I marked all references I used
literally or by content.
Karlsruhe, 09.05.2016
Dominik Maszczyk

Student's full name: Dominik Maszczyk
Student ID no.: 137135
First-cycle studies
Mode of study: full-time
Field of study: Electronics and Telecommunications
Specialization/profile: -
ENGINEERING DIPLOMA THESIS
Thesis title in Polish: Implementacja systemu do badania wpływu błędów w bitach
konfiguracyjnych na niezawodność systemów zbudowanych w oparciu o układy FPGA
Thesis title in English: Implementation of the system for testing the impact of errors in the
configuration bits on the reliability of systems built on FPGAs.
Confirmation of thesis acceptance
Thesis supervisor
signature
Head of Department/Division
signature
dr inż. Miron Kłosowski
Date of thesis submission to the dean's office:

Institute of Computer Engineering
Bachelor Thesis
On the Sensitivity of FPGA Configuration
Bits to Radiation-induced Soft Errors
Dominik Maszczyk
Supervisors: Dr. M.Sc. Eng. Mojtaba Ebrahimi,
Prof. Dr. Mehdi B. Tahoori,
Prof. Dr.-Ing. Dr. h. c. Jürgen Becker,
Dr. M.Sc. Eng. Miron Kłosowski
Period: 9.12.2015 – 9.05.2016
Karlsruhe, 9.05.2016
Postal address: Institut für Technische Informatik Tel.: +49 (0) 721 608 4 3771
Haid-und-Neu-Straße 7 Fax: +49 (0) 721 608 4 3962
76-131 Karlsruhe Web: www.capp.itec.kit.edu
KIT – University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association www.kit.edu

DECLARATION
Full name: Dominik Maszczyk
Date and place of birth: 01.09.1992, Gliwice
Student ID no.: 137135
Faculty: Faculty of Electronics, Telecommunications and Informatics
Field of study: electronics and telecommunications
Level of studies: first-cycle (engineering)
Mode of study: full-time
I, the undersigned, consent/do not consent* to the use of my diploma thesis
entitled: Implementacja systemu do badania wpływu błędów w bitach
konfiguracyjnych na niezawodność systemów zbudowanych w oparciu o układy FPGA
for scientific or teaching purposes.1
Gdańsk, date .................................. .....................................................
student's signature
Aware of criminal liability for violating the provisions of the Act of 4 February
1994 on Copyright and Related Rights (Journal of Laws of 2006, no. 90, item 631)
and of the disciplinary consequences set out in the Law on Higher Education (Journal of Laws of
2012, item 572, as amended),2 as well as of civil liability, I declare that
the submitted diploma thesis was prepared by me independently.
This diploma thesis has not previously been the basis of any other official procedure
associated with the award of a professional title.
All information contained in the above diploma thesis that was obtained from written and
electronic sources has been documented in the list of references with appropriate citations,
in accordance with Art. 34 of the Act on Copyright and Related Rights.
I confirm that this version of the diploma thesis is consistent with the attached electronic version.
Gdańsk, date .................................. .....................................................
student's signature
I authorize Politechnika Gdańska to place the above diploma thesis in electronic
form in the open digital institutional repository of Politechnika Gdańska and
to subject it to processes of verification and protection against misappropriation of its authorship.
Gdańsk, date ................................. .....................................................
student's signature
*) delete as appropriate
1 Order of the Rector of Politechnika Gdańska no. 34/2009 of 9 November 2009, appendix no. 8 to the
PG archival instructions.
2 Act of 27 July 2005, Law on Higher Education:
Art. 214(4). If a student is suspected of having committed an act consisting in claiming authorship
of a substantial part or other elements of another person's work, the rector shall immediately order
an investigation.
Art. 214(6). If the material gathered in the investigation confirms that the act referred to
in paragraph 4 was committed, the rector shall suspend the procedure for awarding the professional
title until the disciplinary committee issues its ruling and shall report the offence.

Summary
Radiation-induced errors are a significant problem in circuits where high reliability expectations
go hand in hand with nanometer-scale fabrication technology. The progressive scaling of
transistors, together with growing device complexity, makes each new generation of components
more susceptible to induced errors. As a result, it is expected that in the near future all
systems will require techniques to mitigate the effects of soft errors. It is worth first
considering, however, the consequences, and hence the classification, of the errors that occur.
Every error can potentially lead to a system failure. The challenge is to analyze which errors,
or what fraction of them, actually lead to a system failure.
In this thesis, the mechanism by which induced errors occur is first described in detail, followed
by an analysis of the layers in which an error can be masked and how. My measurements concentrate
exclusively on soft errors in FPGA configuration bits. This specific memory distinguishes the
architecture and is at once its greatest asset and its drawback, making FPGAs more susceptible to
induced errors than competing solutions. Moreover, errors in the configuration memory have
potentially long-lasting effects, persisting until the next programming cycle. For the experiments
I used the Virtex-6 FPGA ML605 Evaluation Kit and the Xilinx Integrated Development Environment.
Based on these tools, the knowledge gained, and several scripting languages, I created a fully
functional test system for analyzing the impact of soft errors on the correct operation of the
system under test. In the proposed solution, an error is randomly injected into the configuration
memory, after which the system is checked repeatedly with different input parameters to determine
whether it has failed as a result. Moreover, to obtain the most accurate result possible, the
described measurement was fully automated and repeated 20,000 times, statistically showing the
percentage of bits that are critical for correct device operation. Using the test environment,
I performed measurements for implementations of two commonly used encoding systems, AES and JPEG.
My experimental results indicate that one percent of the configuration bits play a critical role
in device reliability. As could be expected, the difference in the size of the implemented systems
affects the fraction of critical bits: a system occupying more memory has proportionally more
critical bits. Based on the measurement results and data provided by Xilinx, I estimated that the
AES and JPEG implementations experience 102 and 124 FIT*, respectively.
* A unit describing the number of errors per billion hours of operation.
My results illustrate the effective impact of radiation-induced errors on device reliability and
address the cases in which particular attention must be paid to them during the design process.
In addition, I designed a reliable platform for simple estimation of the effective impact of the
described errors on the operation of the designed system and fully automated the multi-day
measurement process.
Keywords: induced soft errors, soft errors, SEM, soft error mitigation, UART,
JTAG, boundary scan, ChipScope, ILA, ICON, masking, reliability, Monitor Interface,
AES, JPEG, VHDL, FIT, configuration bits, critical bits, FPGA, transistor scaling,
SRAM, EEPROM, particle strike, alpha particles, cosmic rays, JEDEC89A,
fault injection.
Field of science and technology, in accordance with OECD requirements:
• Engineering and technology
– Electrical engineering, electronic engineering, information engineering
∗ Electrical engineering and electronics

Abstract
Radiation-induced soft errors are an important reliability concern in circuits fabricated using
nanoscale technology nodes. Progressive transistor scaling, along with increasing device
complexity, makes each new generation of devices more vulnerable to soft errors. This leads to the
point where all systems will require error mitigation techniques. Each soft error can potentially
lead to a system failure. The challenging part of analyzing these errors is determining whether a
particular one will lead to a system failure.
This work focuses only on soft errors in FPGA configuration memory, where soft errors have
potentially long-term effects. Based on the Virtex-6 FPGA ML605 Evaluation Kit and the Xilinx
Integrated Development Environment, a fully functional test system for analyzing the impact of
soft errors on the device was developed. In the proposed approach, soft errors are randomly
injected and the system is checked to see whether they result in failures. Finally, reliable
measurements for encoding cores have been made, demonstrating the possible impact of
radiation-induced soft errors on commonly used FPGA-based devices.
Our experimental results show that about one percent of configuration bits play a critical role
in output computation. The size difference between the implemented systems affects the percentage
of critical bits. Based on the measurement results, the failures in time for the AES and JPEG
implementations were estimated to be 102 FIT and 124 FIT, respectively. Moreover, a weak point in
the testing platform was found and overcome, which led to even more accurate results.
Keywords: radiation-induced soft errors, soft errors, SEM, soft error mitigation, UART, JTAG,
boundary scan, ChipScope, ILA, ICON, masking, reliability, Monitor Interface, AES, JPEG,
VHDL, FIT, configuration bits, critical bits, FPGA, transistor scaling, SRAM, EEPROM, particle
strike, alpha particle, cosmic rays, JEDEC89A, fault injection.

Contents
1 Introduction
1.1 Problem Statement
1.2 Contributions
1.3 Outline
2 Background
2.1 FPGA architecture
2.2 Soft errors
2.2.1 Origin
2.2.2 Technology scaling
2.3 Soft errors in FPGAs
2.3.1 Modelling
2.3.2 Mitigation solutions
3 Experimental Setup
3.1 Overall Flow
3.2 Systems under test
3.2.1 AES encoder
3.2.2 JPEG encoder
3.3 Radiation Sources
3.4 Fault Injection Flow
3.4.1 Core enrichment
3.4.2 Implemented cores
3.4.3 Implemented components
3.4.4 PC scripts implementation
4 Experimental Results
4.1 Fault Injection Results
4.2 Communication and testing facilities errors
4.3 Estimation of radiation-induced failures over the time
5 Conclusions

1 Introduction
Radiation-induced soft errors were first observed during the early space missions. They originate
mainly from cosmic rays and from radioactive impurities introduced during device fabrication.
Soft errors, which are unpredictable bit changes in electronic devices, were later explained as a
consequence of high-energy charged particles striking the device. When a particle hits a
semiconductor circuit, it generates electron-hole pairs that can be collected by the source and
drain of the surrounding transistors. If this effect is strong enough, it changes the state of a
transistor for a very short period of time, leading to a so-called soft error. Although this is a
micro-scale event, under the right conditions it can result in a system-level failure.
Due to the unpredictable nature of soft errors and the practical impossibility of eliminating them
completely, these errors are an important reliability concern for integrated circuits. Although a
great deal of research has been carried out in this field over the last 40 years, there is no
perfect solution to this issue. Moreover, recent studies show that the soft error sensitivity of
systems has grown exponentially over the past few years, mainly due to the ever-increasing device
count per chip (i.e., Moore's law) as well as aggressive downscaling of transistor feature size and
supply voltage. As a result, it is essential to evaluate the impact of soft errors on integrated
circuits in early design phases.
To effectively mitigate soft errors in a system, the application and its working environment need
to be carefully studied. For instance, medical equipment for radiotherapy is expected to suffer
from frequent soft errors. High-energy particles such as cosmic rays are attenuated in the earth's
atmosphere, so devices operating closer to sea level are much less vulnerable to soft errors. Even
so, due to the aforementioned transistor scaling trend, these errors now have to be taken into
account even in sea-level devices.
One architecture that is expected to be especially vulnerable to soft errors is part of a common
design methodology for implementing digital electronic systems: the Field Programmable Gate Array
(FPGA). It provides outstanding features which make it almost irreplaceable in some specific
applications and more cost-efficient in many others. That is why there is a need for a fully
functional development system that takes the probability of soft error occurrence into account.
1.1 Problem Statement
All microelectronic devices are vulnerable to errors induced by particle strikes. These errors,
however, do not always lead to system failures. For instance, changing the value of a signal which
is not used in the computation does not result in a failure. This and many other circumstances
mean that only a small portion of induced errors actually influence the device output.
Estimating the system failures caused by radiation-induced soft errors is a really challenging
task. The goal is therefore to develop a testing platform and make measurements which show the
system failure probability in the presence of a single soft error.
For this research, the FPGA architecture has been chosen, as it tends to be especially vulnerable
to soft errors. An FPGA is a device that is programmable after production. This feature is
achieved by an architecture based on a matrix of reconfigurable logic blocks with reconfigurable
interconnections. Thanks to this feature, it has many advantages during the development process,
such as a short time to market and no non-recurring engineering (NRE) costs.
FPGAs have been used in various application domains, including aerospace, military, industrial,
automotive and medical instruments. In these fields, devices can be exposed to high levels of
radiation while having to meet high reliability expectations. Potentially radioactive environments
and high reliability expectations make soft error mitigation techniques a mandatory
consideration.
In contrast to its rival ASIC technology, an FPGA requires additional memory to store the hardware
description itself, whereas in an ASIC the hardware description is simply implemented in hardware.
This memory makes it possible to program the device after production, at the cost of lower
performance and higher vulnerability to soft errors.
The problem becomes even more important when the consequences of soft errors in the FPGA memory
are analyzed further (this memory is called configuration bits below, to distinguish it from any
other implemented memory). An error in the configuration bits leads to a lasting change in the FPGA
hardware description, which persists until the device is reprogrammed. In most FPGA-based systems,
only a small portion of the configuration bits is actually in use, so decreasing the system size
also decreases the probability that the output is corrupted by a soft error at a random place in
memory. Besides this dependence, there are many other circumstances which can stop a soft error
from propagating to the next logic layers. Predicting all possible after-effects of an error is an
extremely complicated task, as errors can be masked on all levels from the physical layer to the
highest abstraction layer. Analytical models are impractical, as they are computationally very
demanding and do not capture device misbehavior due to fabrication variations.
In this work, an accurate measurement method is proposed to acquire the ratio of bits which are
critical for proper device operation. This value can then be used to estimate device
reliability in the intended environment.
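The ratio of critical bits obtained from random fault injection is a statistical estimate, so it is useful to report it together with a confidence interval. The following is a minimal sketch, not the evaluation code of this thesis: the function name `critical_bit_ratio` and the failure count of 200 are hypothetical, chosen only so that the example matches the order of magnitude (about one percent over 20,000 injections) quoted in the abstract. It uses the Wilson score interval, which is one common choice for a binomial proportion.

```python
import math

def critical_bit_ratio(failures: int, injections: int, z: float = 1.96):
    """Return (estimate, lower, upper) for the failure-per-injection ratio.

    The bounds are a Wilson score interval; z = 1.96 corresponds to
    roughly 95 % confidence.
    """
    if injections <= 0:
        raise ValueError("injections must be positive")
    if not 0 <= failures <= injections:
        raise ValueError("failures must lie between 0 and injections")
    p = failures / injections
    denom = 1 + z ** 2 / injections
    centre = (p + z ** 2 / (2 * injections)) / denom
    half = z * math.sqrt(p * (1 - p) / injections
                         + z ** 2 / (4 * injections ** 2)) / denom
    return p, centre - half, centre + half

# Hypothetical campaign: 200 observed failures in 20,000 injections.
estimate, low, high = critical_bit_ratio(failures=200, injections=20_000)
print(f"critical-bit ratio ~ {estimate:.4f} "
      f"(95 % interval {low:.4f} to {high:.4f})")
```

With a few hundred observed failures the interval is already fairly narrow, which illustrates why a large number of injections is needed when the ratio itself is only around one percent.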
1.2 Contributions
In this thesis, I measure and analyze the impact of radiation-induced errors on FPGA configuration
bits. To this end, I developed a fully functional, accurate platform to measure the
failure-to-error ratio for FPGA-based systems. It is based on a simple hardware setup consisting of
a personal computer and a Xilinx development board. Measurements are made by a fully automated
scripting process, which provides information about every injected error.
Based on my testing platform, I demonstrate the measurement process for two systems. For my tests,
I chose implementations of two popular encoding standards, AES and JPEG*. I decided to use
encoding systems because any error in the data or the algorithm should result in a corrupted
output. Apart from their general similarity, these systems differ mostly in size, so the
measurements also validate the dependence between system size and device vulnerability.
Experimental results show that the ratio of system failures per soft error in the configuration
memory is around one percent. Moreover, I comprehensively analyze the soft error sensitivity
of the implemented AES and JPEG encoders. These two cores are expected to experience failures in
time at the level of 102 and 124 FIT, respectively.
To sum up, this work demonstrates the overall probability that commonly used systems fail due to
radiation-induced soft errors. Moreover, an automated device testing platform has been developed
and extensively described in this thesis, giving readers the opportunity to easily implement their
own testing environment using well-tested ideas and to benefit from five months of development
experience [1][2].
1.3 Outline
The rest of this thesis is organized in four chapters. A short description of each chapter is as follows:
The second chapter describes the FPGA architecture and the soft error phenomenon. The chapter named
"Experimental Setup" gives an extensive description of all steps taken to build an efficient yet
accurate testing platform, including the implementation of the JPEG and AES encoders. Chapter four
presents the acquired results along with their explanation and interpretation. Finally, the last
chapter sets directions for further development of the testing environment, with a proposal for
new measurement methods, followed by a summary of the completed work.
* Encryption in this thesis is considered a subgroup of encoding.

2 Background
This chapter provides a short review of FPGA architecture and soft errors.
2.1 FPGA architecture
A field-programmable gate array (FPGA) consists of an array of programmable logic blocks
surrounded by wires with programmable intersections. This architecture allows a variety of logic
operations to be performed in a block and signals to be routed freely between blocks. Furthermore,
the array itself is surrounded by programmable input/output blocks, which connect the chip with
other devices. The key word in describing the FPGA architecture is 'programmable', meaning simply
that the executed program (hardware description) can be uploaded to the device after the production
process. This feature is achieved either by a system of fuses, which makes the change permanent, or
by transistor-based switches. The majority of FPGA devices use transistor-based switches along with
memory to store the desired hardware description. There are two types of memory used in this case.
The first is Static Random-Access Memory (SRAM); the electrical schematic of a single memory cell
is shown in figure 2.1 a). These cells are used in the design to control a multiplexer
(figure 2.1 b)) or as lookup tables (LUTs) that implement an arbitrary logic function
(figure 2.1 c)) [3].
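To make the role of the SRAM cells in a lookup table concrete, the sketch below models a LUT in software: the stored bits are the truth table, and the inputs select one of them. This is an illustrative model only, with hypothetical names (`lut_output`, `and_lut`); it does not describe the bit layout of any real device. It also shows what a single flipped configuration bit does to the implemented function.

```python
from typing import Sequence

def lut_output(config_bits: Sequence[int], inputs: Sequence[int]) -> int:
    """Evaluate a k-input LUT: the inputs form an address into the
    2**k configuration bits (first input is the most significant bit)."""
    if len(config_bits) != 2 ** len(inputs):
        raise ValueError("a k-input LUT needs exactly 2**k configuration bits")
    address = 0
    for bit in inputs:
        address = (address << 1) | (bit & 1)
    return config_bits[address]

# A 2-input AND gate: only address 0b11 stores a 1.
and_lut = [0, 0, 0, 1]
truth_table = [lut_output(and_lut, (a, b)) for a in (0, 1) for b in (0, 1)]

# Flip the configuration bit at address 0b01, as a soft error would.
upset_lut = list(and_lut)
upset_lut[0b01] ^= 1
upset_table = [lut_output(upset_lut, (a, b)) for a in (0, 1) for b in (0, 1)]

print("original function:", truth_table)
print("after bit flip:   ", upset_table)
```

After the flip the table no longer implements AND; in this model the output simply follows the second input, which is why an upset in a used configuration bit changes the hardware function rather than a single data value.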

Fig. 2.1: a) SRAM cell b) multiplexer c) LUT [3]
As can be seen from the schematic, SRAM is volatile, so the program needs to be uploaded to
the device each time it is powered on. Besides SRAM, which is the most popular type and is also
used in this thesis, there are FPGAs based on EEPROM/Flash memory. This is a non-volatile
type of memory based on a floating gate; this kind of memory cell is used directly as a switch, as
shown in figure 2.2.

Fig. 2.2: EEPROM memory cell [3]
The overall FPGA architecture is shown in figure 2.3. All the connections in this architecture
are defined by the memory mentioned above, which also defines the functions of the logic blocks.
This memory is referred to below as the configuration bits, and it is the topic of this research.

Fig. 2.3: FPGA structure [3]
2.2 Soft errors
An error can be defined simply as a difference between the desired and the performed operation.
In electronic engineering, such misbehavior is usually divided into two categories with respect to
its consequences for device usability.
Fig. 2.4: Types of errors [4]
The first category contains errors which can be observed as corrupted information and are generally
assumed to be temporary. Due to their nondestructive nature, these errors are usually called 'soft'.
The second group includes all errors which directly cause permanent system damage; it is typically
further divided into errors induced during the production process and those which occur later due
to, for example, device aging [5].
In this thesis, I focus only on soft errors and, more precisely, on errors originating from
radiation sources.
2.2.1 Origin
Soft errors can have very different causes. The reasons for their occurrence can be as trivial as
a fast change of the electromagnetic field near the device or crosstalk between signals. However,
such errors can be almost fully eliminated by good design practice. The real reliability concern
for devices is errors induced by radiation. Such an error occurs when a high-energy charged
particle strikes an electronic circuit. As the particle penetrates the semiconductor material, it
creates an electron-hole trail which can influence a transistor's source or drain, changing the
transistor state for a very short time. This short pulse can propagate through the circuit or
simply change a value stored in memory. It is worth mentioning that the strike does not physically
destroy the structure of the circuit, so the device works normally after a reset [6].
Fig. 2.5: Particle strike on transistor [6]
The previously mentioned particles originate mainly from radioactive impurities in the device
package, solder bumps or semiconductor material. So-called alpha particles are charged helium
nuclei (He²⁺) emitted during the decay of heavy elements. Since their discovery, many techniques
for mitigating alpha-induced errors have been developed. However, all of them can be considered as
one of two approaches. The first is to reduce impurities and to replace packaging materials with
ones of the lowest possible radioactivity. The second is to develop chip shielding methods so that
particles do not penetrate into the chip. Neither of these approaches fully eliminates soft errors
induced by alpha particles. Moreover, they significantly increase the cost of device production.
These circumstances led to the development of effective soft error tolerance schemes.
Another important source of radiation-induced soft errors is cosmic rays. The source of this strong
radiation in outer space is still not well explained. It is mentioned separately because, in
contrast to alpha-induced errors, the probability of an error varies with altitude above the earth.
Outside the atmosphere, the mechanism is the same as for alpha particles, but closer to the earth's
surface the radiation becomes more and more attenuated. Cosmic rays hit molecules in the
atmosphere, which results in a shower of secondary particles moving towards the earth. Even though
95% of the secondary particles are neutrons, which by definition do not carry any charge, a hit in
the device can result in the production of charged particles leading to soft errors. As mentioned,
the cosmic ray flux depends mostly on altitude: with each kilometer towards the earth it decreases
by a factor of 2.2. In the results chapter, a simple calculation is shown to estimate the ray flux
with respect to the position on the earth.
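The altitude rule of thumb stated above can be written as a one-line exponential model. The sketch below assumes only that rule (a factor of 2.2 per kilometer) and normalizes the flux to 1 at sea level; the function name `relative_flux` is hypothetical, and this is not the calculation used in the results chapter, which also takes the position on the earth into account. Such a simple rule should be expected to hold only approximately and only near ground level.

```python
def relative_flux(altitude_km: float, factor_per_km: float = 2.2) -> float:
    """Cosmic-ray flux relative to sea level, assuming it changes by
    `factor_per_km` for every kilometer of altitude (rule of thumb only)."""
    if factor_per_km <= 0:
        raise ValueError("factor_per_km must be positive")
    return factor_per_km ** altitude_km

# Relative flux at a few example altitudes (values chosen for illustration).
for altitude in (0.0, 0.5, 1.0, 2.0):
    print(f"{altitude:4.1f} km above sea level: "
          f"{relative_flux(altitude):.2f} x sea-level flux")
```

Because a failure rate in FIT scales with the particle flux, a factor of this kind can be used to rescale a sea-level estimate to another altitude under the same assumption.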
2.2.2 Technology scaling
Following Moore's law, each generation of transistors becomes smaller and smaller. This transistor
scaling lowers the charge needed to change the transistor state. In consequence, less energy is
also needed to change the value in a memory cell. Because of this lower energy barrier, the device
itself becomes more vulnerable to soft errors with each new generation. However, as the size of a
single transistor shrinks, the probability of its being hit by a charged particle decreases
linearly, so that transistor scaling slightly reduces device vulnerability to induced errors.
Fig. 2.6: Impact of technology scaling on soft error rate [6]
This effect only occurs when we assume that each new generation of devices has the same number
of transistors. In reality, the number of transistors per chip grows exponentially. Considering
all these facts, we can therefore assume that device vulnerability to soft errors grows
exponentially along with technology scaling [6].
2.3 Soft errors in FPGAs
Because of the additional memory which stores the hardware description, an FPGA is more vulnerable
to soft errors than competing technologies.
2.3.1 Modelling
If a fault appears in unused memory resources, it clearly does not result in any error.
Furthermore, if the corrupted information is not used before the end of the program, it does not
lead to a system failure. This phenomenon is called "masking" and it is the main challenge in soft
error modeling.
The introductory examples already point out two types of masking which take place in FPGA
configuration memory. As an FPGA is a reprogrammable device, it needs many additional resources
to be able to implement a variety of functions. It is practically impossible to use up all FPGA
resources, so the device is always partially "free". A single event upset can therefore affect
either used or unused bits in the configuration memory. A bit flip in unused memory does not affect
device operation, so the device remains fully functional. However, even if an error appears in used
memory, it can still be suppressed by logical masking. This kind of masking covers all
circumstances in which errors are eliminated by logic functions, for instance an error in the least
significant bit of a number that is then divided by two. Moreover, there is also a masking factor
which arises from the workload used. In the designed testing platform, the system is verified to be
fully operational by running a specially designed workload. For instance, there are about 115
quattuorvigintillion possible inputs to the AES core*, so checking all these combinations would be
far too time-consuming. Instead, the workload is designed to check as wide a variety of
significantly different values as possible**. Because not all combinations are checked, there is a
probability that an error is masked by the workload used.
* Both the data and the cryptographic key are 128 bits long, which gives 2^256 combinations.
** In each workload of this thesis, 10,000 pseudo-random values were checked.
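The two numerical claims in this subsection can be checked with a few lines of code. The sketch below is an illustration only, with hypothetical helper names (`halve`, `is_masked`): it verifies the logical-masking example (a flipped least significant bit disappears after integer division by two) and computes the size of the AES input space together with the fraction of it that a 10,000-value workload covers.

```python
def halve(value: int) -> int:
    """Integer division by two, the masking example used in the text."""
    return value >> 1

def is_masked(value: int, bit: int) -> bool:
    """True if flipping `bit` of `value` leaves the halved result unchanged."""
    return halve(value ^ (1 << bit)) == halve(value)

# Logical masking: an error in the least significant bit is always masked,
# while an error in the next bit never is.
lsb_always_masked = all(is_masked(v, 0) for v in range(256))
bit1_ever_masked = any(is_masked(v, 1) for v in range(256))

# Workload masking: 128-bit data and a 128-bit key give 2**256 inputs.
input_space = 2 ** (128 + 128)
quattuorvigintillions = input_space // 10 ** 75   # 1 quattuorvigintillion = 10**75
workload_size = 10_000
coverage = workload_size / input_space

print("LSB flip always masked:", lsb_always_masked)
print("bit-1 flip ever masked:", bit1_ever_masked)
print("input space in quattuorvigintillions:", quattuorvigintillions)
print(f"fraction of inputs covered by the workload: {coverage:.1e}")
```

The coverage is vanishingly small, so a workload can only sample the input space; the measured failure ratio is therefore a lower bound on the true fraction of critical bits under this kind of testing.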

2.3.2 Mitigation solutions
During the last 40 years of research, many solutions have been proposed. In general, all of them
can be divided into two groups: the first contains all prevention techniques, and the second
contains techniques that reduce the impact of soft errors on the system.
The first category, prevention techniques, turned out to be inefficient and in most cases generates
additional production costs. Besides the previously mentioned shielding, which is not very
effective yet increases costs, there are also other techniques such as layout manipulation.
However, these approaches only slightly reduce device vulnerability.
Error mitigation techniques are the successor in this field. On the assumption that
radiation-induced errors cannot be fully eliminated, many mitigation solutions have been proposed.
The techniques used differ according to the targeted structure.
In an FPGA device, 50-80% of the area is occupied by memory blocks. Due to high density and speed
requirements, these structures are usually designed with the smallest possible transistor size. As
mentioned before, a small transistor size increases chip vulnerability to soft errors. This fact,
along with the amount of area occupied, identifies these memory structures as a crucial place to
apply mitigation techniques.
The most common practice to migrate soft errors is adding Error Correction Codes (ECCs) to each
memory structure. This approach makes possibility to check if the data is correct on each read event
and possibly correct one or more error. Commonly used ECC is based on Hamming code (extended
by one parity bit). This EEC is able to detect up to 2-bit errors and correct one bit. The second
effective technique is usage of Built-In Current (BIC) sensors which can detect a peak induced by
soft errors. BIC sensors, however, have few drawbacks. They need to be implemented directly
into FPGA architecture what slightly reduce available area and increase power consumption. Also,
error detection and correction can be performed separately, so the error would be detected and later
corrected by another set of redundant data but in this approach, the error correction occurs after
considerably long time in compare with other solutions.
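The behavior of such an extended Hamming code can be illustrated with a minimal Python sketch. It assumes a 4-bit data word protected by three Hamming parity bits plus one overall parity bit; the word size and the function names `encode` and `decode` are chosen for illustration only and are not taken from any FPGA memory implementation.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming code word.

    Index 0 holds the overall parity bit, indices 1..7 hold a classic
    Hamming(7,4) word with parity bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # overall parity over positions 1..7
    return code


def decode(code):
    """Return (data, status); data is None when the word is uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                # XOR of the positions of all set bits
    overall_error = sum(code) % 2          # 1 if an odd number of bits flipped
    fixed = list(code)
    if syndrome == 0 and not overall_error:
        status = "ok"
    elif overall_error:
        fixed[syndrome] ^= 1               # syndrome 0 means the overall parity bit
        status = "corrected"
    else:
        return None, "double error detected"
    data = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return data, status
```

Flipping any single bit of an encoded word yields the status "corrected" together with the original data, while flipping any two bits yields "double error detected", which matches the single-correction, double-detection property described above.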
Besides memories, the FPGA architecture contains a large number of logic blocks. In this case it is
nearly impossible to use ECCs, and such a solution would be extremely ineffective. Instead, standard
prevention approaches have been proposed, such as hardening/shielding or the use of larger transistors.
Alongside these classic solutions, new promising structures for latches/flip-flops have also been
proposed. For instance, Dual Interlocked Cells are immune to single-node soft errors when implemented
with any transistor size.

3 Experimental Setup
This section summarizes the experimental setup, including the employed cores, their implementations,
workloads and the assumptions related to soft errors.
3.1 Overall Flow
There are many techniques to measure the impact of radiation-induced soft errors. These techniques
range from abstract device concepts, where simulation can be used, through emulation
(for example FPGA emulation of an ASIC device), to prototyping, where experiments on
the near-final device can be made.
As a good compromise between high cost and long computing time, in this research I use
emulation-based measurements. The unit under test is implemented on an FPGA-based
board (Xilinx ML605) along with basic test utilities. For this approach, the desired system (AES, JPEG
encoder) needs to be extended with a configuration bit management core, a workload, and feedback to
determine whether the hardware description works correctly. It is important to implement any extra
functionality with the lowest possible footprint so as not to affect the results. A system prepared in
this way is uploaded to the board; driven by an external stimulus, random bits in the configuration
memory are flipped while the device feedback is constantly checked for errors.
Because an error in a configuration bit affects all subsequent clock cycles, the testing platform was
simplified so that it does not check whether the output is correct on each clock cycle. The workload is
designed to check, as far as possible, all input variations of the tested core after reset. Furthermore,
the output during each clock cycle of the workload is added to the implemented checksum. After one
workload cycle, the checksum is compared to the value from an error-free run to determine whether the
induced error was masked during the program run.
Fig. 3.1: Concept diagram
3.2 Systems under test
3.2.1 AES encoder
The chosen core is an implementation of an encryption algorithm which fully satisfies the Advanced
Encryption Standard (AES) [7]. For encoding, a 128-bit key is used and all data is exchanged using a
dedicated data type which reflects the format used for computation. [8]
The algorithm used for encoding can be presented by the following pseudo code: [7]
function CIPHER(byte in[4, 4], byte out[4, 4], word w[4, (10 + 1)])
    byte state[4, 4]
    state ← in
    AddRoundKey(state, w[0, 4 − 1])
    for round ← 1, 10 − 1 do
        SubBytes(state)
        ShiftRows(state)
        MixColumns(state)
        AddRoundKey(state, w[round ∗ 4, (round + 1) ∗ 4 − 1])
    end for
    SubBytes(state)
    ShiftRows(state)
    AddRoundKey(state, w[10 ∗ 4, (10 + 1) ∗ 4 − 1])
    out ← state
end function
Where:
SubBytes() performs byte substitution based on a constant substitution table (S-box).

ShiftRows() cyclically shifts the bytes in each row of the state* array to the left by the row index
(starting with 0, so the first row does not change).
Fig. 3.2: Cyclical row shift [7]
MixColumns() transforms each column, which can be described as a matrix multiplication (equation
3.1).
\[ s'(x) = a(x) \ast s(x) \tag{3.1} \]
Where:
s(x) - state column value
s'(x) - new state column value
\[ a(x) = \begin{pmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{pmatrix} \]
The operation is shown in Figure 3.3.
Fig. 3.3: Column by column state array multiplication [7]
*
Intermediate Cipher result that can be pictured as a rectangular array of bytes.

AddRoundKey() adds a Round Key to the State array by bitwise XOR operation.
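The three steps without a lookup table can be sketched in Python as follows. This is only an illustration under stated assumptions: the state is held as a list of four rows of four bytes, the function names are hypothetical and do not correspond to the VHDL components of the core [8], and the multiplications in MixColumns() are carried out in GF(2^8) with the reduction polynomial defined in [7].

```python
# Matrix a(x) from equation 3.1.
A = [
    [2, 3, 1, 1],
    [1, 2, 3, 1],
    [1, 1, 2, 3],
    [3, 1, 1, 2],
]


def xtime(b):
    """Multiply a byte by 02 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def gmul(coefficient, b):
    """Multiply byte b by one of the matrix coefficients 01, 02 or 03."""
    if coefficient == 1:
        return b
    if coefficient == 2:
        return xtime(b)
    if coefficient == 3:
        return xtime(b) ^ b
    raise ValueError("only the coefficients 01, 02 and 03 are supported")


def shift_rows(state):
    """Rotate row r of the state to the left by r positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def mix_columns(state):
    """Multiply every state column by the matrix A (addition is XOR)."""
    out = [[0] * 4 for _ in range(4)]
    for c in range(4):
        for r in range(4):
            acc = 0
            for k in range(4):
                acc ^= gmul(A[r][k], state[k][c])
            out[r][c] = acc
    return out


def add_round_key(state, round_key):
    """XOR the state with a round key of the same 4x4 shape."""
    return [[s ^ k for s, k in zip(s_row, k_row)]
            for s_row, k_row in zip(state, round_key)]
```

For example, `mix_columns` maps the column db 13 53 45 to 8e 4d a1 bc, and a column consisting only of 01 bytes is left unchanged.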
The basic block diagram of the implementation as VHDL components is given in Figure 3.4.
Fig. 3.4: AES encoding [8]
To operate the AES core, specific timing restrictions need to be fulfilled. The timing diagram is
shown in Figure 3.5.
[Timing diagram: signals clk_i, rst_i, plaintext_i (D0, D1, D2), keyblock_i (K0, K1, K2) and ciphertext_o (C0, C1); time axis t = 0, T, 2T, 3T, ..., 31T, 32T]
Fig. 3.5: AES Timing Diagram [8]
It is worth noting at this point that AES encryption takes 30 clock cycles and the encrypted data has
to be read on the 31st clock cycle. Another important piece of information is the maximum clock
frequency, which was estimated as 330 MHz for the Xilinx 5VLX50T device. This suggests that the core
can be driven by the main clock of the testing platform used - 66 MHz.
3.2.2 JPEG encoder
This core provides a convenient JPEG encoding utility. It does not rely on any external IP core;
instead, all necessary functions are written in Verilog by its author. This approach makes this core
extremely portable and easy to implement. Moreover, the core was provided with comprehensive
documentation which explains the compression process step by step, along with the techniques used to
increase performance.
Operation of the JPEG Encoder core
The basic block diagram of JPEG encoding is given in Figure 3.6.
Fig. 3.6: JPEG encoding
At first, the RGB color encoding is converted to the YCbCr color space, so the luminance and
chrominance of the picture are split into separate channels. Because the human eye is less sensitive
to color differences than to brightness, color values can be highly compressed with negligible changes
in human perception. This conversion is based on values from the ITU-R BT.601 standard and can be
expressed by equation 3.2.
Y = 0.299 ∗ R + 0.587 ∗ G + 0.114 ∗ B
Cb = −0.1687 ∗ R − 0.3313 ∗ G + 0.5 ∗ B + 128
Cr = 0.5 ∗ R − 0.4187 ∗ G − 0.0813 ∗ B + 128
(3.2)
Where:
R stands for Red channel, G for Green and B for Blue.
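Equation 3.2 can be checked with a short Python sketch. It uses floating-point arithmetic for readability, whereas a hardware core would typically use fixed-point coefficients; the function name `rgb_to_ycbcr` is illustrative.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to YCbCr using the coefficients of equation 3.2."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.1687 * r - 0.3313 * g + 0.5 * b + 128
    cr = 0.5 * r - 0.4187 * g - 0.0813 * b + 128
    return y, cb, cr


# A gray pixel carries no color information: both chrominance values sit at the
# mid-point 128 and only the luminance follows the brightness.
print(rgb_to_ycbcr(200, 200, 200))
```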
The image is further transformed into the frequency domain by applying a two-dimensional discrete
cosine transform to each channel (eq. 3.3).
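\[
X_{k_1,k_2} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x_{n_1,n_2}
\cos\left[\frac{\pi}{N_1}\left(n_1 + \frac{1}{2}\right) k_1\right]
\cos\left[\frac{\pi}{N_2}\left(n_2 + \frac{1}{2}\right) k_2\right],
\qquad k_1, k_2 = 0, \ldots, N - 1, \quad N_1 = N_2 = 8
\tag{3.3}
\]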
Where N1 and N2 equal 8, as the data block used is an 8x8 array.
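A direct, unoptimized evaluation of equation 3.3 can be sketched in Python as shown below. It follows the equation as written (without normalization factors) and is not meant to reflect how the Verilog core computes the transform; the function name `dct2_8x8` is illustrative.

```python
import math


def dct2_8x8(block):
    """Evaluate equation 3.3 for an 8x8 block given as a list of 8 rows."""
    n = 8
    out = [[0.0] * n for _ in range(n)]
    for k1 in range(n):
        for k2 in range(n):
            acc = 0.0
            for n1 in range(n):
                for n2 in range(n):
                    acc += (block[n1][n2]
                            * math.cos(math.pi / n * (n1 + 0.5) * k1)
                            * math.cos(math.pi / n * (n2 + 0.5) * k2))
            out[k1][k2] = acc
    return out
```

For a block of constant value, only the coefficient with k1 = k2 = 0 is non-zero, which illustrates why smooth image regions compress well after this step.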

In the next step, the obtained values are quantized by dividing each of them by the corresponding
value from the quantization table and rounding to an integer. This is the lossy part of JPEG encoding;
the high-frequency changes in the image are averaged. In the implementation, the quantization table is
filled with ones and a workaround was used to avoid performing division.
Afterward, the quantized values are encoded with a Huffman code. The Huffman table is not produced
based on the JPEG file; instead, generic values are used. This approach sacrifices the best
compression ratio to provide the fastest possible compression. The output of the Huffman coding
components consists of a 32-bit signal containing the Huffman codes and channel amplitudes.
Finally, all the channels are combined in one bitstream and any 0xFF occurrences are followed by
0x00 to fulfill the specification's requirements.[9][10]
3.3 Radiation Sources
In this work, the sensitivity of the FPGA to both alpha- and neutron-induced particle strikes is
assessed. These two types of particles are the main sources of soft errors in a chip fabricated in a
40 nm technology and working at ground level. The neutron-induced soft error rate is highly
dependent on the altitude. The results presented in this work are for the terrestrial environment. The
alpha-induced soft error rate is dependent on the packaging material. For FPGA implementations,
a packaging material with Ultra Low Alpha (ULA) emission is assumed. During SER analysis, we accurately
model both error generation and propagation. The error generation tools simulate the impact of particles
on the device and provide the average number of errors generated by the system. For error generation,
we exploit accurate industrial tools for FPGA implementations. For error propagation, we
used statistical fault injection to determine the probability that a generated error leads to a system
failure. The employed statistical fault injection techniques take all masking factors into account
and provide trustworthy results.
3.4 Fault Injection Flow
The testing environment consists of a personal computer and an FPGA development board (Xilinx
ML605) connected by UART and Xilinx USB interfaces. Both AES and JPEG were developed
in the Xilinx ISE development software.
This section describes the common part of development for the tested systems.
Let me briefly describe the design of the emulation process for one fault injection.

1. UART communication is established
2. The soft error mitigation core receives a command to flip a configuration bit at the desired address
3. The bit is flipped and a confirmation message is sent back over the UART interface
4. The device undergoes a software reset (PC command through the ChipScope interface)
5. The workload applies an input signal variation to the core under test
6. The core receives the clock signal until the output computation is finished
7. The output is added to the checksum and the operation is repeated
8. After all desired variations in the workload have been applied and their outputs added, the clocks
of the core under test, the checksum and the workload are masked and the 'Done' flag is set
9. The checksum is read by the computer via the ChipScope interface
This procedure provides a reliable solution, easy to develop and debug, for probing the effects of
randomly induced errors in FPGA configuration bits.
The design process can basically be split into two groups: computer scripting and FPGA program
development. First, I describe the more complicated FPGA part, followed by the design hierarchy
and the development of the automated script.
3.4.1 Core enrichment
As indicated before, in this development approach the desired core needs to be supplemented with
testing utilities before being synthesized for the FPGA device. To achieve this, the core under test
was implemented as a component in the top module along with:
• Clocking Management Component
• Soft Error Mitigation Controller
• SEM Monitor interface
• ChipScope Integrated Controller
• ChipScope Virtual Input/Output core
• Workload thread
• Check Sum counter
These extra components provide all the features necessary to inject a fault, stimulate the core,
compute the output checksum and provide a ready-to-interpret result to the computer. A diagram of
the ready-to-implement testing unit is shown in Figure 3.7.

Fig. 3.7: Structure of testing unit
3.4.2 Implemented cores
Soft Error Mitigation Controller IP core (SEM IP core) - is a basic Xilinx soft error mitigation
solution. Besides its basic functionality, however, it also provides an easy-to-implement solution
for FPGA configuration memory access with a really small footprint of around 1000 configuration
bits. [11]
This core was implemented with the error correction option disabled. This allowed measurements of
a system which does not use any error mitigation. So the core was implemented without its basic
functionality; only the small part responsible for error injection was used.
Moreover, to provide the right control signaling and communication, the default control component
was implemented along with the Monitor Interface (UART communication).

Monitor Interface - is an optional component of the SEM core design that provides user-friendly
interaction with the implemented core. The Monitor Interface in this system was connected to the
desired UART shim*. For simplicity, the UART implementation is considered an integral part of the
Monitor Interface in this work.
Communication parameters:
Baud rate: 115200
Settings: 8-N-1
Flow Control: None
The default baud rate of the UART connection in the Monitor Interface is set to 9600 to provide the
highest possible compatibility. However, to avoid a bottleneck, it was later changed to 115200.
The implemented Monitor Interface provides useful debugging facilities as well as a user-friendly way
to introduce fault bits. The whole UART communication is based on the following commands:
S (Status) - reads back some information about the device, as follows:
MF {8-digit hex value} Maximum Frame (linear count)
SN {2-digit hex value} Hardware SLR Number
SC {2-digit hex value} Current State
FC {2-digit hex value} Current Flags
FS {2-digit hex value} Feature Set
This is an extremely useful command. Because it does not have any effect on the SEM controller,
it is used to check that the UART connection does not experience any errors. In addition, it
supplies the number of available frames (linear addressing), later used to determine the
addresses used for fault injection.
O (Observation) - puts the core into observation mode. In this mode the core tries to identify
any SEU in the configuration memory. Moreover, this command is used to submit the injection
orders, so injection into many bits can be performed at once.
I (Idle) - turns off the observation feature to allow fault injection commands to be executed.
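On the computer side, a status report of the form listed for the S command can be turned into numbers with a few lines of Python. The sketch below assumes that each field arrives on its own line as a two-letter key followed by a hexadecimal value; the function name `parse_status` and the sample values are hypothetical and are not a recorded response of the board.

```python
STATUS_KEYS = {"MF", "SN", "SC", "FC", "FS"}


def parse_status(report):
    """Extract the hexadecimal status fields from a Monitor Interface report."""
    fields = {}
    for line in report.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in STATUS_KEYS:
            fields[parts[0]] = int(parts[1], 16)
    return fields


# Hypothetical report text, used only to exercise the parser.
sample = "MF 00005B5F\nSN 00\nSC 02\nFC 00\nFS 2F"
status = parse_status(sample)
max_frame = status["MF"]  # upper bound for linear frame addresses
```

A report in which one of the five fields is missing or malformed can then be treated as a sign of a disturbed UART connection.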
There are two separate commands for error injection in the Monitor Interface. This separation is
justified by the two kinds of memory addressing that can be used. First, we can address a physical
memory block, then a word and a bit. But because the memory is spread across many non-contiguous
blocks, this kind of addressing is really inconvenient. Another approach is to use address
translation, so that from the user's point of view the whole memory can be accessed like a continuous
list of bits. Both addressing types are reflected in the following fault injection commands:
*
A small library that transparently intercepts API calls and changes the arguments passed, handles the operation itself,
or redirects the operation elsewhere

1100 0000 0SSL LLLL LLLL LLLL LLLL WWWW WWWB BBBB
(bitwise described command for error injection with linear addressing)
Where:
SS = Hardware SLR number for SSI (2-bit) and set to 00 for non-SSI
LLLLLLLLLLLLLLLLL = linear frame address (17-bit) [0..MF]
WWWWWWW = word address (7-bit) [0..100]
BBBBB = bit address (5-bit) [0..31]
[11]
Linear addressing provides a convenient way to access memory. Instead of directly addressing each
single memory block, it operates on a higher abstraction level where the memory can be considered as
a single block (linear addresses are automatically translated to physical ones). In my system only
this type of addressing was used, as precise physical address information can be neglected in
this approach.
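The 40-bit word for linear addressing can be assembled in Python directly from the bit layout given above. This is a minimal sketch: the function name `encode_linear_injection` is illustrative, and how the resulting ten hexadecimal digits are framed on the UART line is left out.

```python
def encode_linear_injection(frame, word, bit, slr=0):
    """Pack a linear injection address into the 40-bit layout
    1100 0000 0SSL LLLL LLLL LLLL LLLL WWWW WWWB BBBB, as 10 hex digits."""
    if not 0 <= slr <= 3:
        raise ValueError("SLR number must fit in 2 bits")
    if not 0 <= frame < (1 << 17):
        raise ValueError("linear frame address must fit in 17 bits")
    if not 0 <= word <= 100:
        raise ValueError("word address must be in 0..100")
    if not 0 <= bit <= 31:
        raise ValueError("bit address must be in 0..31")
    value = (0b110000000 << 31) | (slr << 29) | (frame << 12) | (word << 5) | bit
    return f"{value:010X}"


print(encode_linear_injection(frame=1, word=2, bit=3))  # C000001043
```

In practice the frame address would additionally be limited to the Maximum Frame value reported by the status command.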
0SST THRR RRRC CCCC CCCC CMMM MMMM WWWW WWWB BBBB
(bitwise described command for error injection with physical addressing)
Where:
SS = Hardware SLR number for SSI (2-bit) and set to 00 for non-SSI
TT = block type (2-bit)
H = half address (1-bit)
RRRRR = row address (5-bit)
CCCCCCCCCC = column address (10-bit)
MMMMMMM = minor address (7-bit)
WWWWWWW = word address (7-bit) [0..100]
BBBBB = bit address (5-bit) [0..31]
[11]
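For comparison, the physical addressing layout can be packed in the same way. The Python sketch below simply concatenates the fields from most to least significant bit in the order listed above; the function name `encode_physical_injection` is illustrative and the field values must be taken from the device documentation.

```python
def encode_physical_injection(block_type, half, row, column, minor, word, bit, slr=0):
    """Pack a physical injection address into the 40-bit layout
    0SST THRR RRRC CCCC CCCC CMMM MMMM WWWW WWWB BBBB, as 10 hex digits."""
    if word > 100:
        raise ValueError("word address must be in 0..100")
    fields = [
        (slr, 2), (block_type, 2), (half, 1), (row, 5),
        (column, 10), (minor, 7), (word, 7), (bit, 5),
    ]
    value = 0  # the leading bit of the 40-bit word stays 0
    for field, width in fields:
        if not 0 <= field < (1 << width):
            raise ValueError(f"value {field} does not fit in {width} bits")
        value = (value << width) | field
    return f"{value:010X}"
```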
In this work, I used the built-in UART-USB adapter for communication between the computer and the
SEM core. Further usage of this communication channel is described in the software implementation
section.
ChipScope Pro integrated controller (ICON IP core) - provides the communication interface
between the computer and the other implemented ChipScope cores. Communication takes place
through the JTAG boundary scan component and the built-in Xilinx JTAG-USB adapter.
In my system, an ICON core with only one control port was used. This port was linked with the Virtual
Input/Output core to read and apply states of signals implemented in the FPGA.
Using only one port results in an extremely small footprint - 90 look-up tables and 108 flip-flops
(ICON datasheet). [12]

ChipScope Pro Virtual Input/Output core (VIO IP core) - links the desired system signals to the
ICON core. It provides functionality for monitoring and driving FPGA signals.
The implemented code uses only 3 synchronous probes:
32 bit Check sum - probe to send the computed checksum value
1 bit Done flag - set high after the computation for the whole workload is done
1 bit Software reset - computer-driven signal to reset the FPGA board
Based on the core datasheet, this core meets the small-footprint restriction, using fewer than 200
look-up tables and fewer than 400 flip-flops.
3.4.3 Implemented components
Clock management - differs considerably between AES and JPEG in order to meet application
restrictions. All implemented IP cores, on the other hand, are supplied with the same 66 MHz clock signal.
The JPEG core was provided without any information about its maximum clock frequency. After
implementation, Xilinx ISE estimated the maximum clock frequency for this system as 30 MHz; as all
the other components could work at a frequency of at least 200 MHz, it was clear that this
restriction concerns only the JPEG core. As time was not crucial at this point (the emulation process
takes a much shorter period of time than communication), the clock was divided by 100 in a clock
divider component.
To sum up, all the components (not including cores other than JPEG) were provided with a 660 kHz
clock signal.
AES core does not provide any feedback. Instead, it was designed so that each encryption task takes
exactly 30 clock cycles. So after this period of time, the valid result can be read at the output*.
This approach leads to a more complicated clocking system. To perform a single output data read and
start testing the next input values, the main clock signal was divided by 34. This lower-frequency
clock (approximately 2 MHz) was connected to the workload process and the checksum component.
Furthermore, to prevent the checksum component from reading start-up transient signals and from
reading a settled output several times at the end, clock masking for the checksum component was
introduced. The checksum clock was restricted to the window from clock cycle 3 to clock cycle 1.000.003.
*
The output stays the same until the input data changes.

Check sum had to be adjusted to each core output, but for both systems the basic steps are
exactly the same: first, split the output into bits, and second, add all the ones to a defined variable.
This simple checksum function was used because it has a small footprint and is convenient for
debugging.
Workload provides pseudo-random combinations of signals with the restriction that the same
combination cannot occur twice.
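Both components can be modeled in a few lines of Python. The sketch below is a software model under stated assumptions, not the HDL used on the board: the pseudo-random source is assumed to be a 16-bit maximal-length linear feedback shift register (a common way to obtain values that do not repeat within one period), and the checksum is truncated to the 32-bit width of the VIO probe. The names `lfsr16` and `popcount_checksum` are illustrative.

```python
from itertools import islice


def lfsr16(seed=0xACE1):
    """Yield the states of a 16-bit Galois LFSR (taps 16, 14, 13, 11).

    The sequence has a period of 65535, so no state repeats within it.
    The generator type and the seed are assumptions made for illustration.
    """
    state = seed
    while True:
        yield state
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= 0xB400


def popcount_checksum(outputs):
    """Add up the number of one-bits over all core outputs (32-bit result)."""
    return sum(bin(value).count("1") for value in outputs) & 0xFFFFFFFF


stimuli = list(islice(lfsr16(), 10000))  # one workload of 10.000 values
golden = popcount_checksum(stimuli)      # stand-in for the error-free reference
```

In the real system the checksum is computed over the outputs of the core under test rather than over the stimuli; the stimuli are reused here only to keep the example short.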
3.4.4 PC scripts implementation
Because a single measurement process employs about 20.000 iterations of checksum validation and
bit flipping, after successfully implementing the testing platform it was important to automate the
process. To achieve this in a convenient and clear way, the Python scripting language
was used. One of the most challenging parts was using callbacks from Xilinx proprietary
software. Because there was no other way to program the FPGA board, the Xilinx iMPACT program
was launched with a Tcl configuration script and its console output was parsed for further use.
Another challenging part was achieving command line access to the implemented VIO core; this is
not an officially supported function in the Xilinx IDE version used (the feature was added with the
Vivado release). The Tcl script written for this is based on an example file found in the Xilinx
installation path.
The final automation algorithm is presented in Figure 3.8.
It shows that after the algorithm starts, the FPGA board is checked for responsiveness to external
stimulus by checking whether there is a UART connection and whether the VIO core responds with a
proper checksum value. If the device does not have to be reprogrammed, a fault is induced in a bit at
a random location. The board is restarted and, after the computation, the checksum is checked. If it
has not changed, the bit is flipped back, the memory address is stored and the process continues for
the next 9 999 cycles. In case of any failure, the error message is stored in the log file and the
board is reprogrammed. If reprogramming fails 3 times, the whole computer is suspended for 60 seconds
so that the board, powered by the computer power supply, loses all stored data (hard reset). This
operation is sometimes necessary, as an introduced fault can affect the communication port so that the
board cannot simply be reprogrammed. If this operation does not make the FPGA board respond, a
communication error is returned.
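The control flow of this algorithm can be summarized in a Python sketch. It is a simplified model, not the script used for the measurements: the `board` object and its methods `is_responsive`, `program`, `inject`, `reset_and_run` and `hard_reset` are hypothetical placeholders for the UART, iMPACT and VIO handling.

```python
import random


def recover(board, golden, attempts=3):
    """Reprogram the board; fall back to a hard reset after repeated failures."""
    for _ in range(attempts):
        board.program()
        if board.is_responsive(golden):
            return
    board.hard_reset()  # e.g. suspend the PC for 60 s so the board loses power
    board.program()
    if not board.is_responsive(golden):
        raise RuntimeError("communication error")


def run_campaign(board, golden, max_frame, iterations=10000, seed=0):
    """Inject one random bit flip per iteration and log its effect."""
    rng = random.Random(seed)
    log = []
    for _ in range(iterations):
        if not board.is_responsive(golden):
            recover(board, golden)
        # Linear address: frame in 0..MF, word in 0..100, bit in 0..31.
        addr = (rng.randrange(max_frame + 1), rng.randrange(101), rng.randrange(32))
        try:
            board.inject(addr)
            checksum = board.reset_and_run()
        except OSError as exc:
            log.append((addr, f"exception: {exc}"))
            recover(board, golden)
            continue
        if checksum == golden:
            board.inject(addr)  # flip the bit back and carry on
            log.append((addr, "masked"))
        else:
            log.append((addr, "failure"))
            recover(board, golden)
    return log
```

The ratio of "failure" entries to the number of iterations corresponds to the error rate reported in the results chapter.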

Fig. 3.8: Scripted algorithm

4 Experimental Results
This chapter contains a comprehensive description of the emulation results and their interpretation.
The first section provides the results along with a basic interpretation. The next section
describes the uncertainty factor due to communication errors. Finally, the full interpretation is
provided in the last section.
4.1 Fault Injection Results
Results of my emulation along with important factors are presented in the following table.
                          Errors including    Errors excluding    Slice        Slice
                          exceptions *        exceptions *        Registers    LUTs
                          errors/iterations   errors/iterations   used/total   used/total
AES encoder
  board wired to PC       0,95 %              0,93 %              3,07 %       12,69 %
  board installed in PC   0,94 %
JPEG encoder
  board wired to PC       1,16 %              1,05 %              10,85 %      16,21 %
  board installed in PC   1,13 %
Raw testing environment   –                   –                   0,43 %       0,61 %
Table 4.1: Juxtaposition of acquired results.
The results contain 4 values for each measurement. The first column contains the number of all
logged errors divided by the number of injection iterations (expressed as a percentage). This value,
however, takes into account all communication errors, which can have a different source than the
simulated soft error. In the next column, the result excluding communication errors is shown
for comparison. The other values give the number of slice registers and slice LUTs which
were used by the particular project*. These numbers provide basic information about system size,
distinguishing between logic blocks and memory. The table gives the results for the two tested
cores along with the values for the raw testing environment for comparison. Moreover, each core was
tested twice. As the first measurement raised uncertainty about connection errors, the measurements
were performed a second time with the tested board installed in the computer, which effectively
eliminated these errors.
My experimental results clearly show that only about one percent of soft errors affect the device
output. Moreover, as expected, the fraction of critical bits increases with the size of the project.
Finally, the testing platform, designed to achieve the lowest possible footprint, uses only about half
a percent of device resources; the possible error related to this number is further described in the
next section. Because each emulation contained 10 thousand injections, it took 2-3 days of
continuous computing to achieve a result. It is worth noting that continuous computing on the board
wired to a computer was impossible due to connection errors, which extended the necessary time to 5-7
days with nearly continuous supervision, as user interaction was needed whenever these errors
occurred. To overcome this problem and resolve possible ambiguities behind it, further improvements
were developed, as described in the next section.
4.2 Communication and testing facilities errors
Due to the technique of fault injection into random configuration bits, there is uncertainty
whether the flipped bit actually affected the tested core or the testing facilities. This was an
important concern during development of this testing environment. To minimize the footprint of all the
other implemented modules, a few compromise solutions were used, finally ending up with a usage of
about 0,5 % for both slice registers and slice LUTs. Assuming that a slice register and a slice LUT use
the same number of configuration bits, we can estimate the probability of hitting the test facility by
the following equation:
*
Values acquired from Xilinx implementation report.

\[ P_{\text{HIT-TEST}} = \frac{SR_{\text{RAW}} \cdot 2 + SL_{\text{RAW}}}{SR_{\text{encoder}} \cdot 2 + SL_{\text{encoder}}} \tag{4.1} \]
Where:
P_{HIT-TEST} is the probability that an injection affecting the used hardware description hits the
testing facility.
The index "RAW" refers to the testing facility implemented without the tested core, whereas expressions
marked "encoder" describe the whole testing system (including the tested encoder).
SR stands for the number of slice registers used, in percent, and SL refers to the number of
slice LUTs used (also in percent).
Multiplication by two is used to balance the equation for percentage usage, as there are exactly twice as
many Slice Registers available as Slice LUTs.
This probability is 7,8% for the AES encoder and 3,9% for the JPEG encoder, which shows that this
method has higher accuracy for bigger systems, with a theoretical limit approaching 0,5%.
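The two percentages can be reproduced from the resource usage in Table 4.1 with a short Python calculation. The function name `p_hit_test` is illustrative; the input values are the slice register and slice LUT percentages from the table.

```python
def p_hit_test(sr_raw, sl_raw, sr_encoder, sl_encoder):
    """Equation 4.1: share of the used resources that belongs to the test facility."""
    return (sr_raw * 2 + sl_raw) / (sr_encoder * 2 + sl_encoder)


RAW = (0.43, 0.61)  # raw testing environment: slice registers %, slice LUTs %
aes = p_hit_test(*RAW, sr_encoder=3.07, sl_encoder=12.69)
jpeg = p_hit_test(*RAW, sr_encoder=10.85, sl_encoder=16.21)
print(f"AES: {aes:.1%}, JPEG: {jpeg:.1%}")
```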
The measurement result can differ depending on the interpretation of the error records. That is why two
results are shown: one which does not contain errors during communication and a second containing
these errors. These errors are further called iMPACT and VIO exceptions, as they occur in iMPACT software
callbacks during hardware description uploading or in the script responsible for communication with the
VIO core.
Because the first measurement log contained a few percent of iMPACT and VIO exceptions, it raised the
question whether these errors are an effect of injecting a fault into the parts responsible for
communication with the computer. As was later noticed, all of the communication errors occurred during
the day, which suggested that they could be induced by accidental movement of the connection cables. To
check this, the FPGA board was placed into the computer case, internally connected and powered by the
stable computer power supply. This led to exceptional results: both AES and JPEG cores were emulated
once again without experiencing any communication errors.
As the results achieved after installing the FPGA board into the PC do not contain communication errors,
they are taken into account for further investigation.
4.3 Estimation of radiation-induced failures over the time
A value representing radiation-induced faults in configuration memory over time for a Virtex-6 device is
provided by Xilinx and is equal to 105 FIT/Mb of memory. Because my measurements were not
restricted to any specific area of the FPGA, all configuration bits have to be taken into consideration.
The basic equation to express the failure-in-time rate in this case is:

\[ \mathrm{Failure\_in\_time} = DVF \cdot C_{RAM} \cdot ER \tag{4.2} \]
Where:
DVF - device vulnerability factor (measured errors to injections ratio)
C_{RAM} - total size of configuration memory used in emulation process [Mb]
ER - soft error rate caused by single event upset affecting memory cells [FIT/Mb] [13]
The value mentioned above is given relative to a reference point, which is sea level in New York, USA. To
estimate this value for local use, Karlsruhe in this case, a relative factor of 1.12* has to be included in
the calculation.
Based on my emulation results, the AES system experiences 102 FIT (102 failures per 1 billion device
hours), while JPEG results in 124 FIT. One can say that an error per 917 years is negligible, but if 1000
devices were produced this would mean roughly one failure per year. Furthermore, in many devices more
than one FPGA chip is used. Let us take as an example an IP PBX system, where FPGA chips are commonly
used to handle the high demand for parallel tasks. Typically each modular board contains 2 FPGA chips and
a system is supplied with about 5 boards. Such a system, built on 10 FPGA chips and sold in a quantity of
10 000 devices, assuming a reliability of 105 FIT for each programmed chip, would result in a
disappointed customer every 4 days!
This example shows, in an exaggerated way, how big the consequences of radiation-induced errors in FPGA
configuration memory can be. Furthermore, the obtained results show a reasonable tendency: the device
vulnerability factor increases as a system uses more FPGA resources. So more complex systems would be
even more vulnerable to the inspected errors.
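The fleet-level figures can be verified with a short Python calculation based on the definition of FIT as failures per one billion device hours. The function names `fleet_fit` and `mtbf_hours` are illustrative, and failures of individual chips are assumed to be independent, so that their rates simply add up.

```python
HOURS_PER_YEAR = 8760


def fleet_fit(fit_per_chip, chips_per_device, devices):
    """Total failure rate of a fleet, assuming independent chips."""
    return fit_per_chip * chips_per_device * devices


def mtbf_hours(fit):
    """Mean time between failures in hours for a given FIT rate."""
    return 1e9 / fit


# 1000 single-chip devices at 124 FIT: roughly one failure per year.
years = mtbf_hours(fleet_fit(124, 1, 1000)) / HOURS_PER_YEAR

# 10 000 PBX systems with 10 chips each at 105 FIT per chip: about 4 days.
days = mtbf_hours(fleet_fit(105, 10, 10000)) / 24

print(f"{years:.2f} years, {days:.1f} days")
```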
*
Relative value for 49N 8.24E on 200m AMSL calculated with respect to JEDEC89A standard. [14]

5 Conclusions
Radiation-induced soft errors are a phenomenon that is hard to analyze. As a particle strike can happen at
any time, anywhere, with a variety of energies, these errors are simply unpredictable. Moreover, as an
error propagates through the device it can be masked on all levels, from physical dependencies up to high
abstraction levels. This variety of cases makes analytical modeling impractical.
Also, the increasing vulnerability to soft errors is a fact which cannot be ignored. With present
knowledge, it is impossible to fully eliminate the occurrence of these errors. Instead of elimination, the
only approach which effectively suppresses soft error effects is to start with the assumption that they
are unavoidable and to use techniques to mitigate their influence. These techniques can suppress soft
error effects to less than 1% [6].
As mentioned before, systems based on FPGAs are especially vulnerable to soft errors, which is why it is
extremely important to include mitigation techniques in the projects being built. The additional memory
used to implement the design in the FPGA needs, in this case, to be treated with special care, because an
error in the configuration bits can not only affect the current output but also lead to permanent*
corruption of device operation. The outstanding FPGA vulnerability to soft errors makes mitigation
techniques obligatory in new systems.
Because analytical modeling is an extremely complex and computationally demanding task, a
measurement-based method was used in this work. The methodology is independent of the system used and
provides a fast yet accurate way to estimate the effects of errors in the configuration memory. I
designed a fully automated testing platform, so the measurement process does not need human interaction.
Based on my testing system, I estimated the number of critical bits for implementations of two popular
coding standards. The measured values let me estimate the number of failures in time for the AES and JPEG
systems to be 102 and 124 FIT respectively.
The presented results clearly show that consideration of radiation-induced soft errors should be an
obligatory part of designing any reliable system. Moreover, as the presented examples demonstrate, even in
devices where high reliability is not a concern, consideration of soft errors should not be neglected if
the device is mass-produced. Regardless of whether the system is intended to work in space or at sea
level, error mitigation techniques should be introduced in any high-end device to provide stable and
reliable operation.
*
Up to next programing cycle.

Bibliography
[1] S. Kiamehr, F. Firouzi, M. Ebrahimi, and M. B. Tahoori, “Aging-aware standard cell library design,” in
Proceedings of the conference on Design, Automation & Test in Europe, p. 261, European Design and
Automation Association, 2014.
[2] M. Ebrahimi, F. Oboril, S. Kiamehr, and M. B. Tahoori, “Aging-aware logic synthesis,” in Proceedings
of the International Conference on Computer-Aided Design, pp. 61–68, IEEE Press, 2013.
[3] I. Kuon, R. Tessier, and J. Rose, “FPGA architecture: Survey and challenges,” Foundations and Trends in
Electronic Design Automation.
[4] M. Nicolaidis, “Soft errors in modern electronic systems,” volume 41.
[5] F. Kaddachi, “System-level reliability evaluation through cache-aware software-based fault injection,”
2015.
[6] M. Ebrahimi, “Cross-layer soft error analysis and mitigation at nanoscale technologies,” 2016.
[7] “FIPS 197, Advanced Encryption Standard (AES),” in Federal Information Processing Standards, 2001.
[8] S. Das, “Fully pipelined AES core.”
[9] “Information technology - Digital compression and coding of continuous-tone still images - Requirements
and guidelines,” 1992.
[10] D. Lundgren, “JPEG encoder IP core.”
[11] “Soft Error Mitigation Controller v4.1,” 2015.
[12] “ChipScope Pro ICON,” 2009.
[13] “Device reliability report,” 2016.
[14] “Continuing experiments of atmospheric neutron effects on deep submicron integrated circuits,” 2016.