IEEE 754 FLOATING POINT
REPRESENTATION
Prof. Tanvi Goswami
Dept. Of Information Technology
DDU, Nadiad
Fixed Point and Floating Point Number
Representations
Storing Real Number
There are two major approaches to storing real numbers (i.e.,
numbers with a fractional component) in modern computing.
These are
(i) Fixed Point Notation and
(ii) Floating Point Notation.
In fixed point notation, there is a fixed number of digits after
the decimal point, whereas floating point notation allows for
a varying number of digits after the decimal point.
Floating point decimal number
• The same number has several different representations,
and there is no fixed position for the decimal point.
• Given a fixed number of digits, there may be a
loss of precision.
• Three pieces of information represent a
number: the sign of the number, the significand
(significant digits) and the signed exponent of 10.
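The three pieces of information listed above can be seen directly in Python's standard `decimal` module. This is only an illustrative sketch; the sample value 123.45 is an assumption chosen for the demonstration and does not come from the slides.

```python
from decimal import Decimal

# Two spellings of the same value: the point "floats" with the exponent.
a = Decimal("123.45")
b = Decimal("1.2345E+2")
print(a == b)

# as_tuple() exposes the sign, the significant digits and the exponent of 10.
sign, digits, exponent = Decimal("-123.45").as_tuple()
print(sign, digits, exponent)
```

Here `sign` is 1 for a negative number, `digits` holds the significant digits, and `exponent` is the signed power of 10.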
Note
• Given a fixed number of digits, the floating-point
representation covers a wider range of
values than a fixed-point
representation.
IEEE 754 standard
• Most binary floating-point
representations follow the IEEE 754 standard.
→ In languages such as C and Java, the data type float typically uses the IEEE 32-bit single
precision format and the data type double
uses the IEEE 64-bit double precision format.
IEEE 754 32 bit format
N = (-1)^s * (1.M) * 2^(E-127)
S is the sign bit (1 bit)
M is the mantissa (23 bits)
E is the biased exponent (8 bits)
IEEE 754 32 bit format
Example:
3.5
Binary of 3.5 = 11.1
Sign bit = 0 (the number is positive), i.e. (-1)^0 = 1
Normalize: 11.1 => 1.11 x 2^1
Compare the exponent:
=> E-127 = 1
=> E = 1+127
=> E = 128
=> Binary of 128 = 1000 0000
Mantissa is => .11
Therefore 1.M is 1.11
IEEE 754 representation is:
sign   8-bit exponent   23-bit mantissa
0      1000 0000        11 000 … (zeros up to 23 bits)
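The encoding of 3.5 worked out above can be checked with a short Python sketch. It relies on the standard `struct` module to obtain the 32-bit pattern; the helper name `float_to_fields` is hypothetical and introduced here only for illustration.

```python
import struct

def float_to_fields(x):
    """Return the (sign, exponent, mantissa) bit strings of x in single precision."""
    # Pack as a big-endian 32-bit float, then reinterpret the 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = format(bits, "032b")
    return s[0], s[1:9], s[9:]

sign, exponent, mantissa = float_to_fields(3.5)
print(sign, exponent, mantissa)
```

The printed fields should match the sign, the 8-bit exponent and the 23-bit mantissa derived by hand.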
IEEE 754 32 bit format
Reverse Example:
• 0 1000 0000 11 000 …
=> S bit is 0, i.e. the number is positive
=> E = 1000 0000 = 128, so E-127 = 128 - 127 = 1
=> M = .11, so 1.M = 1.11
N = (-1)^s * (1.M) * 2^(E-127)
  = (-1)^0 x (1.11) x 2^1
  = 1.11 x 2^1
  = 11.1 (binary)
Therefore the number is 3.5
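The reverse direction can be sketched the same way by applying N = (-1)^s * (1.M) * 2^(E-127) to the three fields. The function name `fields_to_float` is hypothetical, and the sketch assumes a normalized number (exponent field neither all zeros nor all ones).

```python
def fields_to_float(sign, exponent, mantissa):
    """Decode a normalized single-precision value from its three bit strings."""
    s = int(sign, 2)
    e = int(exponent, 2)
    # Mantissa bit i (counting from 1) contributes 2^-i; the leading 1 is implicit.
    frac = sum(int(b) * 2.0 ** -(i + 1) for i, b in enumerate(mantissa))
    return (-1) ** s * (1 + frac) * 2.0 ** (e - 127)

print(fields_to_float("0", "10000000", "11" + "0" * 21))
```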
IEEE 754 32 bit format
Example: 85.125
85 = 1010101
0.125 = .001
85.125 = 1010101.001
= 1.010101001 x 2^6
Sign = 0
Single precision:
Exponent: E-127 = 6 => E = 127+6 = 133
133 = 10000101
Normalised mantissa = 010101001; we append 0's to complete the 23
bits.
The IEEE 754 single precision representation is:
sign   8-bit exponent   23-bit mantissa
0      1000 0101        01010100100000000000000
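The hand procedure used for 85.125 (convert the integer and fractional parts to binary, normalize, bias the exponent, pad the mantissa) can be sketched as follows. The names `to_binary` and `encode_single` are hypothetical, and the sketch is deliberately limited: it assumes a positive input of at least 1 and truncates instead of rounding.

```python
def to_binary(x, max_frac_bits=23):
    """Integer part via bin(); fraction part by repeated doubling."""
    whole = int(x)
    frac = x - whole
    int_bits = bin(whole)[2:]
    frac_bits = ""
    while frac and len(frac_bits) < max_frac_bits:
        frac *= 2
        bit = int(frac)
        frac_bits += str(bit)
        frac -= bit
    return int_bits, frac_bits

def encode_single(x):
    """Build sign/exponent/mantissa strings by hand (x >= 1, no rounding)."""
    int_bits, frac_bits = to_binary(x)
    exponent = len(int_bits) - 1  # places the point moves left to normalize
    mantissa = (int_bits[1:] + frac_bits)[:23].ljust(23, "0")
    return "0", format(exponent + 127, "08b"), mantissa

print(encode_single(85.125))
```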
Single precision range
IEEE 754 64 bit format
N = (-1)^s * (1.M) * 2^(E-1023)
S is the sign bit (1 bit)
M is the mantissa (52 bits)
E is the biased exponent (11 bits)
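As a rough illustration of the 64-bit layout, the sketch below splits a double into its 1-bit sign, 11-bit exponent and 52-bit mantissa using the standard `struct` module. The helper name `double_to_fields` is hypothetical, and the value 85.125 is reused from the single precision example for comparison.

```python
import struct

def double_to_fields(x):
    """Return the (sign, exponent, mantissa) bit strings of x in double precision."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    s = format(bits, "064b")
    return s[0], s[1:12], s[12:]

sign, exponent, mantissa = double_to_fields(85.125)
# The true exponent is the field value minus the bias 1023.
print(sign, int(exponent, 2) - 1023, mantissa[:9])
```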
What is 127 in IEEE 754?
• In “excess 127 form”, exponent fields 1 to 126 represent
negative exponents (-126 to -1), and fields 128 to 254 represent
positive exponents (1 to 127). The field value 127, right in the
middle, represents an exponent of zero. The fields 0 and 255 are
reserved for special values.
• The eight-bit exponent uses excess 127 notation.
This means that the exponent is represented in
the field by a number 127 greater than its value.
Why?
Because it lets us use an integer comparison to tell whether one
floating point number is larger than another, so long as
both are positive (for two negative numbers the order is reversed).
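The comparison property can be observed with a small sketch: for positive values, sorting the raw 32-bit patterns as unsigned integers gives the same order as sorting the numbers themselves. The helper `bits_of` and the sample values are assumptions made for this illustration.

```python
import struct

def bits_of(x):
    """Raw 32-bit single-precision pattern of x as an unsigned integer."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

values = [0.5, 1.0, 3.5, 85.125, 1.0e10]  # positive and increasing
patterns = [bits_of(v) for v in values]
print(patterns == sorted(patterns))

# Exponent fields of 0.5, 1.0 and 2.0: true exponents -1, 0 and +1 plus the bias 127.
print([(bits_of(v) >> 23) & 0xFF for v in (0.5, 1.0, 2.0)])
```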
Special conditions
• Special Values: IEEE 754 reserves some bit patterns for special values.
• Zero –
Zero is a special value denoted with an exponent and mantissa of 0. -0 and +0 are
distinct bit patterns, though they compare as equal.
• Denormalised –
If the exponent is all zeros but the mantissa is not, then the value is a
denormalized number. Such a number does not have an assumed leading
one before the binary point.
• Infinity –
The values +infinity and -infinity are denoted with an exponent of all ones and a
mantissa of all zeros. The sign bit distinguishes between negative infinity and
positive infinity. Operations with infinite values are well defined in IEEE 754.
• Not A Number (NaN) –
The value NaN is used to represent the result of an invalid operation (for example 0/0).
It is represented by an exponent field of all ones together with a non-zero mantissa;
the sign bit may be either 0 or 1. This special value might also be used to denote a variable
that doesn’t yet hold a value.
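The special cases above depend only on the exponent and mantissa fields, so they can be sketched as a small classifier over a 32-bit pattern. The function name `classify` and the returned labels are hypothetical choices made for this illustration.

```python
def classify(bits):
    """Classify a 32-bit single-precision pattern by its exponent and mantissa fields."""
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0:
        # All-zero exponent: zero if the mantissa is also zero, otherwise denormalized.
        return "zero" if mantissa == 0 else "denormalized"
    if exponent == 0xFF:
        # All-ones exponent: infinity if the mantissa is zero, otherwise NaN.
        return "infinity" if mantissa == 0 else "NaN"
    return "normalized"

for pattern in (0x00000000, 0x80000000, 0x00000001, 0x7F800000, 0x7FC00000, 0x40600000):
    print(format(pattern, "08X"), classify(pattern))
```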
Example
• The following scheme is used for floating point number
representation using 16 bits:
Sign    Exponent   Mantissa
1 bit   6 bits     9 bits
• Let the floating point number be represented as
N = (-1)^s * (1 + m * 2^(-9)) * 2^(e-31), if the
exponent is not equal to 111111, and 0 otherwise.
• What is the maximum difference between two
successive real numbers that can be represented in this
system?
Solution
The gap between successive numbers is largest at the largest usable exponent.
For 1st number:
Let s = 0, e = 62, m = 511
(as e != 111111, we take e =
111110, m = 111 111 111)
N1 = (1 + 511 * 2^-9) * 2^(62-31)
   = 2^31 + 511 * 2^22
For 2nd number:
Let s = 0, e = 62, m = 510
(as e != 111111, we take e =
111110, m = 111 111 110)
N2 = (1 + 510 * 2^-9) * 2^(62-31)
   = 2^31 + 510 * 2^22
Difference between the two successive real numbers = N1 - N2
= 2^31 + 511 * 2^22 - (2^31 + 510 * 2^22)
= 2^22
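The result can be double-checked with exact rational arithmetic. The sketch below evaluates the formula of the 16-bit scheme for the two largest mantissas at e = 62; the function name `value` is a hypothetical helper.

```python
from fractions import Fraction

def value(s, e, m):
    """(-1)^s * (1 + m * 2^-9) * 2^(e-31) for the 16-bit scheme of the example."""
    return (-1) ** s * (1 + Fraction(m, 2 ** 9)) * Fraction(2) ** (e - 31)

n1 = value(0, 62, 511)
n2 = value(0, 62, 510)
gap = n1 - n2
print(gap == 2 ** 22)
```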
Example
• The following scheme is used for floating point number
representation using 16 bits:
Sign    Exponent   Mantissa
1 bit   6 bits     9 bits
• Let the floating point number be represented as
N = (-1)^s * (1 + m * 2^(-9)) * 2^(e-31), if the exponent is not equal to
000000, and 0 otherwise.
• What is the difference between the two
SMALLEST positive real numbers that can be represented in
this system?
Solution
For 1st number:
Let s = 0, e = 1, m = 0
(as e != 000000, we take e =
000001, m = 000000000)
N1 = (1 + 0 * 2^-9) * 2^(1-31)
   = 2^-30
For 2nd number:
Let s = 0, e = 1, m = 1
(as e != 000000, we take e =
000001, m = 000000001)
N2 = (1 + 1 * 2^-9) * 2^(1-31)
   = 2^-30 + 2^-39
Difference between the two successive real numbers = N2 - N1
= 2^-39
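For the two smallest positive numbers of this scheme (e = 1 with m = 0 and m = 1), the difference can be evaluated exactly as sketched below. The function name `value` is a hypothetical helper that simply restates the formula of the example.

```python
from fractions import Fraction

def value(s, e, m):
    """(-1)^s * (1 + m * 2^-9) * 2^(e-31) for the 16-bit scheme of the example."""
    return (-1) ** s * (1 + Fraction(m, 2 ** 9)) * Fraction(2) ** (e - 31)

smallest = value(0, 1, 0)      # e = 000001, m = 000000000
next_smallest = value(0, 1, 1) # e = 000001, m = 000000001
gap = next_smallest - smallest
print(gap == Fraction(1, 2 ** 39))
```

Under these assumptions the gap works out to 2^-39, since one step of the 9-bit mantissa is 2^-9 and the scale factor at e = 1 is 2^-30.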
Example
(I) Convert the following IEEE-754 32 bit number
to decimal: 46800380 (Hex)
(II) Convert the following IEEE-754 64 bit
number to decimal: 4041E00000000000 (Hex)
Solution
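(I) Convert the following IEEE-754 32 bit number
to decimal: 46800380 (Hex)
Binary: 0100 0110 1000 0000 0000 0011 1000 0000
Sign   Exponent    Mantissa
0      1000 1101   000 0000 0000 0011 1000 0000
E = 1000 1101 = 141, so E-127 = 14
1.M = 1.0000000000000111
N = 1.0000000000000111 x 2^14 = 100000000000001.11 (binary)
Therefore the number is 16385.75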
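The field split for 46800380 (Hex) can be verified with a short sketch that extracts the sign, exponent and mantissa by shifting and masking, then cross-checks the value against the standard `struct` module. The variable names are illustrative only.

```python
import struct

bits = 0x46800380
sign = bits >> 31                # 1 bit
exponent = (bits >> 23) & 0xFF   # 8 bits
mantissa = bits & 0x7FFFFF       # 23 bits

# N = (-1)^s * (1.M) * 2^(E-127)
value = (-1) ** sign * (1 + mantissa / 2 ** 23) * 2.0 ** (exponent - 127)

# Cross-check against the library's own decoding of the same 4 bytes.
check = struct.unpack(">f", bits.to_bytes(4, "big"))[0]
print(sign, format(exponent, "08b"), format(mantissa, "023b"), value, check)
```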
Example
(II) Convert the following IEEE-754 64 bit
number to decimal: 4041E00000000000 (Hex)
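One possible way to work through the 64-bit conversion is sketched below: the pattern is split into a 1-bit sign, an 11-bit exponent and a 52-bit mantissa, and N = (-1)^s * (1.M) * 2^(E-1023) is evaluated. The variable names are illustrative, and the cross-check uses the standard `struct` module.

```python
import struct

bits = 0x4041E00000000000
sign = bits >> 63                    # 1 bit
exponent = (bits >> 52) & 0x7FF      # 11 bits
mantissa = bits & ((1 << 52) - 1)    # 52 bits

# N = (-1)^s * (1.M) * 2^(E-1023)
value = (-1) ** sign * (1 + mantissa / 2 ** 52) * 2.0 ** (exponent - 1023)

# Cross-check against the library's own decoding of the same 8 bytes.
check = struct.unpack(">d", bits.to_bytes(8, "big"))[0]
print(sign, exponent - 1023, value, check)
```

With these fields the true exponent is 1028 - 1023 = 5 and 1.M is 1.0001111 in binary, so the value is 100011.11 in binary, i.e. 35.75.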
