The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard for floating-point arithmetic established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found in the diverse floating-point implementations of the time that made them difficult to use reliably and portably. RISC-V also adopts the IEEE 754 standard.
1. IEEE-754 standard format to handle Floating-Point calculations in RISC-V CPUs
General Information and Hardware Implementation
Zeeshan Rafique
Research Associate at MERL
2. Agenda!
● What is a standard and why should we rely on it?
● What is floating point and why do we need it? Precision and accuracy
● General Information about Floating Point
○ Bits encoding for single and double precision
○ Conversion from floating point to IEEE 754 standard
○ Floating point only mimics mathematical arithmetic
○ Features of floating point
○ Range of single and double precision
○ Representation of floating point
● Floating Point Arithmetic examples
● Hardware Implementation of FPU with RISC-V Core
○ RISC-V Floating Extensions
○ Floating Point Register file
○ Rounding Mode
○ Floating Point CSRs (fcsr, frm, fflags)
○ Exception flags; implementation in the decoder and controller
3. What is a standard and why should we rely on it?
● What does Standard mean?
○ Something established by authority, custom, or general consent as a model or for
reference or benchmark.
● Why do we need a Standard?
○ Standards are needed to assure the safety of products, to ensure that products and materials are tailor-made for their purpose, to promote the interoperability of products and services, to facilitate trade by removing trade barriers, and to promote a common understanding of a product.
4. Fixed Notation
● We are accustomed to using a fixed notation where the decimal point is fixed: any digits to the right of the decimal point are the fractional portion, and those to the left are the integer part.
E.g. 10.75
10 is the integer portion and 0.75 is the fractional portion.
5. Floating Point Representation
● The number is encoded in scientific notation.
● The structure of a floating point (real) number is as follows:
5.5 × 10⁹
(5.5 is the mantissa, 10 is the base, and 9 is the exponent)
● Only the mantissa and the exponent are stored. The base is implied (already known); since it is not stored, it saves memory.
6. Why do we need floating point?
● Floating point representation makes numerical computation much easier. You
could write all your programs using integers or fixed-point representations, but this is
tedious and error-prone.
● Many numeric applications need numbers over a huge range.
○ e.g., nanoseconds to centuries
● A programmer may perform arithmetic operations whose results are real numbers (e.g., π).
● In either case, the idea is to represent a real number in a way similar to
scientific notation.
● For example, the following number is given in scientific notation:
○ 6.022 × 10²³ (an approximation to Avogadro's constant)
7. Floating Point Applications
● Digital Signal Processing
● Domain-Specific Accelerators
● Vector Processing Units
● Medical Electronics
● Anywhere that more accurate results are needed.
* TechInsight.com Apple iPhone XS teardown
8. Floating Point Disasters
● Scud missile gets through, 28 die
○ In 1991, during the first Gulf War, the Patriot missile defense system let a Scud get through; it hit a barracks and killed 28 people. The problem was due to a floating-point error when taking the difference of a converted and scaled integer. (https://medium.com/nerd-for-tech/floating-point-rounding-error-in-computers-6485cc26f5e8)
Source: https://slideplayer.com/slide/16330842/ slide 25
● $7B rocket crashes (Ariane 5)
○ When the first ESA Ariane 5 was launched on June 4, 1996, it lasted only 39 seconds before the rocket veered off course and self-destructed. An inertial reference system produced a floating-point exception while trying to convert a 64-bit floating point number to an integer. Ironically, the same code was used in the Ariane 4, where the larger values were never generated. (https://around.com/ariane.html)
● Intel Ships and Denies Bugs
○ In 1994, Intel shipped its first Pentium processors with a floating-point divide bug. The bug was
due to bad look-up tables used to speed up quotient calculations. After months of denials, Intel
adopted a no-questions replacement policy, costing $300M.
(https://www.intel.com/support/processors/pentium/fdiv/)
9. Precision and Accuracy
● Precision: Maximum number p of significant digits that can be represented in a format.
● Accuracy: How accurately the number is defined in a format.
(Figure: the same value shown at 3-bit and 8-bit precision; the 3-bit version has lost accuracy and precision.)
● The lower the precision, the lower the accuracy. [Not true in all cases]
● In which cases can you get perfect accuracy with very low precision?
10. Precision and Accuracy
● Precision: Maximum number p of significant digits that can be represented in a format.
● Accuracy: How accurately the number is defined in a format.
● The lower the precision, the lower the accuracy.
● In some cases you can still have perfect accuracy with very low precision.
● If we have 4 bits of precision, then:
● 5 / 2 = 2.5: here we have 100% accuracy.
● 10 / 3 = 3.33333333: not 100% accuracy.
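The 5/2 versus 10/3 contrast above can be checked directly. A minimal Python sketch (using the host's double precision rather than a 4-bit format, but showing the same effect):

```python
from fractions import Fraction

# 2.5 has a finite binary expansion (10.1), so 5 / 2 is stored exactly.
exact = 5 / 2
assert Fraction(exact) == Fraction(5, 2)

# 10 / 3 has no finite binary expansion, so the stored double is only
# the nearest representable value, not the true ratio 10/3.
inexact = 10 / 3
assert Fraction(inexact) != Fraction(10, 3)
```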
14. Quick conversion
Assume we have the number 6.625; convert it according to the IEEE 754 standard into
● single precision
● double precision
Solution:
For single precision:
Convert the number into binary: (6.625)₁₀ = (?)₂

Conversion: (6.625)₁₀ = (110.101)₂

Integer part (6): divide by 2 and collect the remainders (LSB first):
6 / 2 = 3 remainder 0
3 / 2 = 1 remainder 1
1 / 2 = 0 remainder 1
Read in reverse: 110

Fractional part (0.625): multiply by 2 and collect the integer bits (MSB first):
0.625 × 2 = 1.25 → 1
0.25 × 2 = 0.5 → 0
0.5 × 2 = 1.0 → 1
Read in order: 101
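The divide-by-2 / multiply-by-2 procedure above can be sketched in Python. The helper name `to_binary` is illustrative, not from the slides:

```python
def to_binary(value, max_frac_bits=8):
    """Convert a non-negative decimal number to a binary string using the
    slide's method: repeated division for the integer part (remainders
    read in reverse) and repeated multiplication for the fraction."""
    int_part, frac_part = int(value), value - int(value)

    # Integer part: divide by 2, collect remainders, read in reverse.
    int_bits = ""
    while int_part > 0:
        int_bits = str(int_part % 2) + int_bits
        int_part //= 2
    int_bits = int_bits or "0"

    # Fractional part: multiply by 2, collect the carried integer bits.
    frac_bits = ""
    for _ in range(max_frac_bits):
        frac_part *= 2
        bit = int(frac_part)
        frac_bits += str(bit)
        frac_part -= bit
        if frac_part == 0:
            break
    return int_bits + "." + frac_bits if frac_bits else int_bits

# 6.625 -> integer part 110, fractional part .101
assert to_binary(6.625) == "110.101"
```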
16. ● So, we have (6.625)₁₀ = (110.101)₂
● Setting the format of the number: 110.101 × 2⁰ => 1.10101 × 2²
● When we move the binary point to the left the exponent is incremented (positive), and when we move it to the right the exponent is decremented (negative).
+1.10101 × 2²
● (On the original slide, orange marks the sign, blue the mantissa, and red the exponent.)
● For single precision:
S = 0
E' = E + bias = 2 + 127 = 129
E' = 10000001
M = 10101000000000000000000

Quick conversion - Single Precision (bit 31 down to bit 0):
0 | 10000001 | 10101000000000000000000
Sign bit (1) | Exponent bits (8) | Mantissa / fraction bits (23)
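The resulting encoding can be cross-checked with Python's struct module, which packs a value into the IEEE 754 single-precision format (the field-splitting helper is an illustrative sketch):

```python
import struct

def float32_bits(x):
    """Return the IEEE 754 single-precision encoding of x as bit fields."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{word:032b}"
    return bits[0], bits[1:9], bits[9:]  # sign, exponent(8), fraction(23)

sign, exp, frac = float32_bits(6.625)
# sign = '0', exp = '10000001' (129 = 2 + bias 127),
# frac = '10101' followed by 18 zeros
assert sign == "0"
assert exp == "10000001"
assert frac == "10101" + "0" * 18
```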
17. ● So, we have (6.625)₁₀ = (110.101)₂
● Setting the format of the number: 110.101 × 2⁰ => 1.10101 × 2²
● When we move the binary point to the left the exponent is incremented (positive), and when we move it to the right the exponent is decremented (negative).
+1.10101 × 2²
● (On the original slide, orange marks the sign, blue the mantissa, and red the exponent.)
● For double precision:
S = 0
E' = E + bias = 2 + 1023 = 1025
E' = 10000000001
M = 10101000000000000000000...000

Quick conversion - Double Precision (bit 63 down to bit 0):
0 | 10000000001 | 10101000000000000000000...000
Sign bit (1) | Exponent bits (11) | Mantissa / fraction bits (52)
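The same cross-check works for double precision, packing with the `d` format instead of `f`:

```python
import struct

def float64_bits(x):
    """Return the IEEE 754 double-precision encoding of x as bit fields."""
    (word,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = f"{word:064b}"
    return bits[0], bits[1:12], bits[12:]  # sign, exponent(11), fraction(52)

sign, exp, frac = float64_bits(6.625)
# exp = '10000000001' (1025 = 2 + bias 1023),
# frac = '10101' followed by 47 zeros
assert sign == "0"
assert exp == "10000000001"
assert frac == "10101" + "0" * 47
```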
18. Features of floating point numbers (IEEE 754-2008)
● Every floating point number has a sign; every number is either positive or negative.
● There are two representations for zero: positive zero (i.e., +0.0) and negative zero
(i.e., -0.0).
● There are two representations of infinity: positive infinity (+∞ or +inf) and negative
infinity (-∞ or -inf).
● The exponent may be positive or negative, allowing both very large numbers and
very small numbers.
● There is a special representation called “not a number” (“NaN”). This value can
represent a missing value or the result of an undefined operation, such as 0/0. In
some implementations there are two variations, called “quiet NaN” and
“signaling NaN”.
19. Exponent Bit Pattern for Single and Double Precision
(Tables on the original slide: exponent bit patterns for single precision and double precision.)
21. Range for Single and Double Precision
https://youtu.be/A2HflP5sa_0
Digits of accuracy ≈ log₁₀(2^bits(M))
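A quick sketch of the digits-of-accuracy formula, counting the implicit leading 1 (24 significand bits for single precision, 53 for double):

```python
import math

# Decimal digits of accuracy = log10(2 ** significand_bits), where the
# significand includes the implicit leading 1 (23 + 1 for single,
# 52 + 1 for double).
single_digits = 24 * math.log10(2)   # about 7.22 decimal digits
double_digits = 53 * math.log10(2)   # about 15.95 decimal digits
```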
22. Floating point only mimics mathematical arithmetic
● The exact value or result of an operation is not always representable, so the
computed answer is often not mathematically correct.
● Floating point addition is not always associative, due to rounding errors.
That is, (x + y) + z is not always equal to x + (y + z).
● Floating point multiplication is not always associative. That is, (x * y) * z is
not always equal to x * (y * z).
● Floating point multiplication does not always distribute over addition with the
exact same results. That is, x * (y + z) is not always equal to (x * y) + (x * z).
● Floating point addition and multiplication are commutative, like math. For
example, x+y = y+x, so you don’t have to worry about the order of operands
for a single operation.
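The non-associativity claim is easy to reproduce with ordinary doubles, because a rounding step happens after each operation:

```python
# Rounding after each addition makes float addition non-associative.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
assert a != b

# Commutativity still holds for a single operation.
assert 0.1 + 0.2 == 0.2 + 0.1
```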
23. Representations
● Positive zero (+0.0)
● Negative zero (-0.0)
● Positive infinity (+∞ or +inf)
● Negative infinity (-∞ or -inf)
● Not-a-number (NaN)
○ Quiet NaN (qNaN)
○ Signaling NaN (sNaN)
● Normal numbers (or “normalized numbers”)
● Denormalized numbers (or “denormals”)
● Subnormal numbers
24. Positive and Negative Zero (+0, -0)
● 1/+0 yields +∞
● 1/-0 yields –∞
● +0 will normally compare as equal to -0
● -0/-∞ yields +0
● Although +0 and -0 may compare as equal, they may also result in different
outcomes in some computations. This challenges our understanding of the meaning
of “equal”, to say the least.
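A small Python sketch of signed-zero behavior. (Note that Python raises ZeroDivisionError for float division by a zero literal, so 1/±0 → ±∞ is not shown directly here; the sign distinction and -0/-∞ are.)

```python
import math

pos_zero, neg_zero = 0.0, -0.0

# +0 and -0 compare as equal...
assert pos_zero == neg_zero

# ...but they carry different signs, visible via copysign:
assert math.copysign(1.0, pos_zero) == 1.0
assert math.copysign(1.0, neg_zero) == -1.0

# -0 / -inf yields +0 (the signs cancel):
q = neg_zero / -math.inf
assert q == 0.0 and math.copysign(1.0, q) == 1.0
```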
25. Positive and Negative Infinity (+∞, -∞)
● Positive infinity is represented as “all bits in the exponent field set high, all bits in the fraction field set low”, with the sign bit defining the sign.
Single precision (bit 31 down to bit 0):
0/1 | 11111111 | 00000000000000000000000
Double precision (bit 63 down to bit 0):
0/1 | 11111111111 | 00000000000000000000000...000
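The infinity bit patterns above can be verified by packing Python floats into single precision:

```python
import struct

def float32_word(x):
    """Return the IEEE 754 single-precision encoding of x as a 32-bit int."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return word

# +inf: sign 0, exponent all ones, fraction all zeros -> 0x7F800000
assert float32_word(float("inf")) == 0x7F800000
# -inf: same pattern with the sign bit set -> 0xFF800000
assert float32_word(float("-inf")) == 0xFF800000
```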
26. ● An expression whose result is undefined in mathematics produces NaN:
○ 0/0 , ∞/∞ , 0 × ∞ , sqrt(negative) , log(negative)
● There are two types of NaN: signaling and quiet.
● A signaling NaN raises an exception when it appears as an operand of an operation.
● A quiet NaN propagates through operations and does not raise an error.
● In the quiet case, a single value moves forward as the result; it is known as the
canonical NaN and is discussed further on the next slide.
● RISC-V makes the implementation of signaling NaN optional.
● When double precision is implemented and an F-extension (single-precision)
instruction writes a result, the result is stored in the lower 32 bits of the register
and the upper bits are set high, so that a D-extension instruction reading the
value finds it as a NaN.
Not a Number - NaN
27. Not a Number - NaN
● NaN is represented by all 1s in the exponent field and anything other than all 0s in the mantissa/fraction field (all 0s would represent +∞ or -∞).
● The MSB of the fraction field is the quiet bit: when set, the NaN is quiet; the canonical NaN has only this bit set in the fraction.
Single precision (bit 31 down to bit 0):
X | 11111111 | 10000000000000000000000
Double precision (bit 63 down to bit 0):
X | 11111111111 | 10000000000000000000000...000
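A sketch checking the NaN properties above in Python. (The exact quiet-NaN payload can vary by platform, so only the exponent-all-ones and nonzero-fraction properties are asserted.)

```python
import struct

nan = float("nan")

# A NaN never compares equal, not even to itself.
assert nan != nan

# Pack into single precision and inspect the fields.
(word,) = struct.unpack(">I", struct.pack(">f", nan))
assert word & 0x7F800000 == 0x7F800000   # exponent field all ones
assert word & 0x007FFFFF != 0            # nonzero fraction -> NaN, not inf
```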
28. Normalized and Denormalized Numbers
● Most floating point numbers are normalized numbers.
● The greater the integer part, the less space is left for fractional precision.
● All denormalized numbers are very close to zero.
● Denormalized numbers extend on both the positive and negative sides of zero.
● +0.0 and -0.0 are themselves represented as denormalized numbers.
● The largest denormalized number is just less than the smallest positive normal
number.
● Likewise, the most negative denormalized number is just greater than the least
negative normal number.
● It is generally safe to ignore the distinction between normalized and denormalized
numbers when using floating point in your applications.
● Computation on very small values (denormalized numbers) may lose precision.
29. Subnormal Numbers
● A subnormal number is a nonzero floating-point number with magnitude less than the
magnitude of that format's smallest normal number.
● As a result, a subnormal number in a given format fails to use the full precision
available to normal numbers of the same format.
● 0.0 uses the same all-zeros exponent encoding, although by the definition above it is
not itself subnormal.
● Subnormal numbers always have an exponent field equal to 0 (and an implicit leading
bit of 0).
● The largest subnormal number (single precision, hexadecimal significand) is:
○ 0.FFFFFE₁₆ × 2⁻¹²⁶
● The smallest positive normal number is:
○ 1.0 × 2⁻¹²⁶
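A double-precision sketch of subnormal behavior (2⁻¹⁰²² is the smallest normal double, 2⁻¹⁰⁷⁴ the smallest subnormal):

```python
import sys

# The smallest positive normal double is 2**-1022; anything smaller
# and nonzero is subnormal.
smallest_normal = sys.float_info.min      # 2.0 ** -1022
smallest_subnormal = 2.0 ** -1074         # one step above zero

assert 0.0 < smallest_subnormal < smallest_normal
# Halving the smallest subnormal underflows all the way to zero.
assert smallest_subnormal / 2 == 0.0
```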
38. RISC-V Floating Extensions
● There are four floating point extensions:
● F- single precision (32-bit)
● D- double precision (64-bit)
● Q- quad precision (128-bit)
● L- decimal floating point (64/128-bits)
❏ Note: RISC-V floating point is compliant with IEEE 754-2008
39. Floating Point Register File
● The register file contains 32 registers, each of width:
○ 32 - while implementing F extension or Single precision
○ 64 - while implementing D extension or Double precision
● It has 3 read and 2 write ports because a few instructions take 3 source operands.
● It is a separate register file from integer register file.
● Things to remember while implementing D extension / Double precision:
○ Single precision or F extension is a prerequisite.
○ A single precision instruction stores its result in the lower 32 bits of the register,
with the upper 32 bits set to all 1s (NaN-boxing), so a double-precision read sees
the value as a NaN.
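The NaN-boxing rule can be sketched with plain integers (`nan_box` is an illustrative helper, not part of any RISC-V toolchain):

```python
def nan_box(bits32):
    """Box a 32-bit single-precision pattern into a 64-bit FP register
    value by setting the upper 32 bits to all ones (RISC-V NaN-boxing)."""
    return (0xFFFFFFFF << 32) | (bits32 & 0xFFFFFFFF)

boxed = nan_box(0x40D40000)          # single-precision 6.625
assert boxed >> 32 == 0xFFFFFFFF     # upper bits all ones
assert boxed & 0xFFFFFFFF == 0x40D40000

# Interpreted as a double, the boxed pattern has an all-ones exponent
# field and a nonzero fraction, so a D-extension read sees a NaN.
assert boxed & 0x7FF0000000000000 == 0x7FF0000000000000
assert boxed & 0x000FFFFFFFFFFFFF != 0
```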
40. Rounding modes
● There are two ways to select the rounding mode: dynamic or static.
● The rm field in the instruction decides the mode.
● An rm value of 111 selects dynamic mode.
● In dynamic mode the rounding mode comes from the frm field of fcsr; in static mode
it comes from the instruction itself.
41. Rounding modes
● Round to nearest: The system chooses the nearer of the two possible outputs. If the
correct answer is exactly halfway between the two, the system chooses the output where
the least significant bit of Frac is zero. This behavior (round-to-even) prevents various
undesirable effects. This is the default mode when an application starts up. It is the only
mode supported by the ordinary floating-point libraries. Hardware floating-point
environments and the enhanced floating-point libraries support all four rounding modes.
● Round up, or round toward plus infinity: The system chooses the larger of the two
possible outputs (that is, the one further from zero if they are positive, and the one closer
to zero if they are negative).
● Round down, or round toward minus infinity: The system chooses the smaller of the
two possible outputs (that is, the one closer to zero if they are positive, and the one
further from zero if they are negative).
● Round toward zero, or chop, or truncate: The system chooses the output that is closer
to zero, in all cases.
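Python's built-in rounding illustrates the tie-to-even rule of the default mode, plus the directed modes. (This is decimal rounding on doubles, not a change of the hardware rounding mode, but the tie-breaking rule is the same.)

```python
import math

# Ties go to the even neighbor, as in round-to-nearest-even.
assert round(0.5) == 0   # halfway -> even neighbor 0
assert round(1.5) == 2   # halfway -> even neighbor 2
assert round(2.5) == 2   # halfway -> even neighbor 2

# trunc / floor / ceil correspond to round-toward-zero,
# round toward minus infinity, and round toward plus infinity.
assert math.trunc(-1.7) == -1    # toward zero
assert math.floor(-1.7) == -2    # toward minus infinity
assert math.ceil(-1.7) == -1     # toward plus infinity
```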
42. FCSR (frm+fflags)
● The fcsr register has CSR address 0x003 and is a read/write register.
● It has two fields: frm (rounding mode) and fflags (exception flags).
● Both fields can be accessed individually through their own CSR addresses (frm at 0x002, fflags at 0x001).
● The upper bits read as zero and should be written as zero.
43. Exception Flags
● The status (exception) flags report the outcome of operations performed in the FPU.
● The base RISC-V ISA does not support generating a trap on the setting of a floating-
point exception flag.
44. Implementing in Decoder and Controller
● An illegal-instruction exception is raised if the incoming instruction does not match
any defined encoding.
● If the rounding mode field (rm) in the instruction is 101 or 110 (reserved values), the
instruction is treated as illegal.
● If the rounding mode is dynamic, an frm value of 101, 110, or 111 in fcsr makes the
instruction illegal.
● Make sure the following instructions write back to the integer register file:
○ FCVT.W.S, FCVT.WU.S
○ FMV.X.W
○ FLT.S, FLE.S, FEQ.S
○ FCLASS.S
● The following instructions read rs1 from the integer register file:
○ FCVT.S.W, FCVT.S.WU
○ FMV.W.X
○ FLW, FSW (base address in rs1)