IEEE Floating Point

IEEE 754 Standard for Floating Point Representation
of Real Numbers.
There are four pieces of info to be represented:
Sign of the number
(Always the high order bit; 0=positive, 1=negative.)
Magnitude of the number (Stored in binary with leading "1" understood. See below.)
Sign of the exponent
(Stored as an offset "bias" on value of exponent. See below.)
Magnitude of the exponent. (Stored as unsigned binary added to offset bias.)
I. Float. (32 bits) Binary template: SEEE EEEE EMMM MMMM ... MMMM
where S = sign bit, E = exponent bits (8 of these), M = mantissa bits (23 of these).
Process to represent:
A. Convert decimal number to binary.
B. Move radix point to 1.xxx... * 2^exp representation.
C. Now, do the work:
i.
S. Assign the sign bit. positive --> 0, negative --> 1.
ii.

Assign the mantissa. This is the fractional part of the value of the number.
--ignore the leading "1". It is always 1 in this scientific notation, so there is no need to
store it. The system will re-insert it later when the number is used.
--form the 23 MMM...MMM bits from the first 23 of the remaining "xxx..." bits above.
If there are not 23 of them, fill out with trailing zeros.

iii.

The exponent "exp" may be positive or negative. Float is stored with excess-127
notation. That is, you ADD 127 to your exponent and store as a pure (unsigned)
binary. When the number is re-created for use later, 127 will be subtracted from the
exponent. This method saves having to use 2's complement for negative exponents.

iv: Combine in recipe SEEEEEEEEMMMM...MMM format.
Example:
Find the representation of decimal -10.5 in float IEEE-754 representation.
A. Convert the number: -10.5 decimal --> -1010.1 binary
B. Move radix to get scientific notation:
-1010.1 binary --> -1.0101 * 2^(+3) (binary)
C. Now, do the work:
i. Sign bit will be 1 since the number is negative.
ii. Mantissa.
(101010000...0)
Ignore leading one: 01010000...0
Use first 23 bits:
01010000000000000000000 -->This is mantissa.

iii. Exponent is +3.
ADD excess-127 offset: 127 + (+3) = 130 decimal = 10000010 binary.
These are the 8 bits for the exponent: 10000010.
D. Combine in SEEEEEEEEMMMMMMMMMMMMMMMMMMMMMMM format:
11000001001010000000000000000000
To make it easier to read, group in 4's and convert each group to hex:
original
grouped:

11000001001010000000000000000000
1100 0001 0010 1000 0000 0000 0000 0000
C
1
2
8
0
0
0
0

binary
binary
hex

or,
C128 0000

hex

II. double.
The process to convert to "double" is similar to "float":
1. a total of 64 bits are used to represent the number.
2. The sign bit still is one digit, always the high order digit.
3. The mantissa is stored with 52 bits instead of 23. This gives doubles greater precision.
4. The exponent is stored with 11 bits instead of 8. This gives doubles a wider range of
representable numbers.
Example: Decimal +10.5 --> 0100 0000 0010 0101 0000 ... 0000 binary (64 bits)
4025 0000 0000 0000 hex

IEEE Floating Point

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to IEEE Floating Point

Similar to IEEE Floating Point (20)

More from Jason Ricardo Thomas

More from Jason Ricardo Thomas (8)

Recently uploaded

Recently uploaded (20)

IEEE Floating Point