Finite word length effects

Finite Word Length
Effects in Digital Filters
V Semester B.E. ECE
Dr.S.Periyanayagi
Associate Professor/ ECE
Ramco Institute of Technology
Academic year: 2016-2017

Fixed Point and Floating
Point number
Representations

Introduction
• Digital Signal Processing algorithms are realized either with special
purpose hardware or general purpose digital computer.
• In both cases the numbers and coefficients are stored in finite length
registers.
• The coefficients and numbers are quantized by truncation or
rounding off when they are stored.
• The following errors arise due to quantization of numbers.
– Input quantization error
– Product quantization error
– Coefficient quantization error

Review of number systems
• It is a system in which quantities are expressed in numeric
symbols.
• Numerous number systems are available.
• Knowledge of number systems is essential for understanding and
designing digital circuits.
• Generally a number has 2 parts, set apart by the radix point:
– Integer
– Fractional

Most Common Number systems
• Decimal system: digits 0 to 9, written as ( )_10
• Binary system: digits 0 and 1, written as ( )_2
• Octal system: digits 0 to 7, written as ( )_8
• Hexadecimal system: digit values 0 to 15, written as ( )_16
– 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F

Decimal system
• Most common number system used in day to day life.
• It is also known as base 10 system.
• Decimal to -------?------- ( Binary or Octal or Hexadecimal)
• Binary to -------?------- (Decimal or Octal or Hexadecimal)
• Octal to -------?------- (Decimal or Binary or Hexadecimal)
• Hexadecimal to -------?------- (Decimal or Binary or Octal)

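(Decimal) to (??)
Number system     Integer part                        Fractional part
Dec to Bin        repeated ÷ 2 (collect remainders)   repeated x 2 (collect integer parts)
Dec to Octal      repeated ÷ 8 (collect remainders)   repeated x 8 (collect integer parts)
Dec to Hexadec    repeated ÷ 16 (collect remainders)  repeated x 16 (collect integer parts)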

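(Binary) to (??)
Bin to Dec
Integer bits are weighted by increasing powers of 2 (x 2 per position),
fractional bits by decreasing powers of 2 (÷ 2 per position), and the
weighted bits are summed
Bin to Octal
Starting from the Least Significant Bit of the integer part, each
group of 3 bits is replaced by its octal digit
Bin to Hexadec
Starting from the Least Significant Bit of the integer part, each
group of 4 bits is replaced by its hexadecimal digit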

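(Octal) to (??)
Oct to Dec
Integer digits are weighted by increasing powers of 8 (x 8 per position),
fractional digits by decreasing powers of 8 (÷ 8 per position)
Oct to Binary
Each octal digit is replaced by its 3 bit binary equivalent
Oct to Hexadec
i). First convert Octal to Binary
ii). Then convert Binary to Hexadecimal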

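(Hexadecimal) to (??)
Hexadec to Dec
Integer digits are weighted by increasing powers of 16 (x 16 per position),
fractional digits by decreasing powers of 16 (÷ 16 per position)
Hexadec to Binary
Each digit in the given number is replaced by its 4 bit binary equivalent
Hexadec to Oct
i). First convert Hexadec to Binary
ii). Then convert Binary to Octal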
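The conversion rules on the preceding slides can be checked with a short sketch. It handles integer parts only and always goes through the numeric value, which is an assumption made for brevity; the helper names `to_base` and `convert` are illustrative and do not come from the slides.

```python
DIGITS = "0123456789ABCDEF"

def to_base(n: int, base: int) -> str:
    """Convert a non-negative integer to a digit string by repeated division."""
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, remainder = divmod(n, base)
        out.append(DIGITS[remainder])  # remainders appear least significant first
    return "".join(reversed(out))

def convert(number: str, src: int, dst: int) -> str:
    """Convert an integer digit string from base src to base dst."""
    return to_base(int(number, src), dst)

print(convert("1101", 2, 10))  # binary to decimal
print(convert("1101", 2, 8))   # binary to octal
print(convert("255", 10, 16))  # decimal to hexadecimal
print(convert("17", 8, 16))    # octal to hexadecimal
```

Grouping bits in threes or fours, as the slides describe, gives the same octal and hexadecimal results without computing the decimal value.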

Analog and Digital Signal
• Analog system
– The physical quantities or signals may vary continuously over a
specified range.
• Digital system
– The physical quantities or signals can assume only discrete values.
• Greater accuracy

Number System
• The decimal number system is commonly used.
• Any number can be expressed in any base or radix “r”.
• In general, any number with radix r, having m digits to the left
and n digits to the right of the radix point, can be expressed as
N = a_m r^(m-1) + ... + a_2 r^1 + a_1 r^0 + b_1 r^(-1) + b_2 r^(-2) + ... + b_n r^(-n)
• where a_m is the digit in the mth position to the left of the radix point.
• The coefficient a_m is termed the Most Significant Digit (MSD).
• b_n is termed the Least Significant Digit (LSD).

Base-10 (decimal) arithmetic
• Uses the ten digits from 0 to 9
• Each column represents a power of 10
Thousands (10^3) column | Hundreds (10^2) column | Tens (10^1) column | Ones (10^0) column
(1999)_10 = 1x10^3 + 9x10^2 + 9x10^1 + 9x10^0

Base-10 (decimal) arithmetic
• Uses the ten digits from 0 to 9
• Each column represents a power of 10
Tens (10^1) column | Ones (10^0) column | Tenths (10^-1) column | Hundredths (10^-2) column
(19.99)_10 = 1x10^1 + 9x10^0 + 9x10^-1 + 9x10^-2

Decimal Number System
• Base (also called radix) = 10
– 10 digits { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }
• Digit Position
– Integer & fraction
• Digit Weight
– Weight = (Base)^Position
• Magnitude
– Sum of “Digit x Weight”
• Formal Notation
– d2*B^2 + d1*B^1 + d0*B^0 + d-1*B^-1 + d-2*B^-2
Example: (512.74)_10
Digit             5     1     2     7     4
Position          2     1     0    -1    -2
Weight          100    10     1   0.1  0.01
Digit x Weight  500    10     2   0.7  0.04

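Addition
• Decimal Addition
  1 1        Carry
    5 5
+   5 5
-------
  1 1 0
– Each column sum is ten or more (≥ Base), so the base is subtracted
and a carry of 1 goes to the next column.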

Standard binary representation
• Uses the two digits 0 and 1
• Every column represents a power of 2
Eights (2^3) column | Fours (2^2) column | Twos (2^1) column | Ones (2^0) column
(1001)_2 = 1x2^3 + 0x2^2 + 0x2^1 + 1x2^0

Fixed-point representation
• Uses the two digits 0 and 1
• Every column represents a power of 2
Twos (2^1) column | Ones (2^0) column | Halves (2^-1) column | Fourths (2^-2) column
(10.01)_2 = 1x2^1 + 0x2^0 + 0x2^-1 + 1x2^-2

Binary Number System
• Base = 2
– 2 digits { 0, 1 }, called binary digits or “bits”
• Weights
– Weight = (Base)^Position
• Magnitude
– Sum of “Bit x Weight”
• Formal Notation
– (101.01)_2 = 1*2^2 + 0*2^1 + 1*2^0 + 0*2^-1 + 1*2^-2 = (5.25)_10
Bit        1    0    1    0    1
Position   2    1    0   -1   -2
Weight     4    2    1  1/2  1/4
• Groups of bits
– 4 bits = Nibble, e.g. 1 0 1 1
– 8 bits = Byte, e.g. 1 1 0 0 0 1 0 1
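The "sum of digit x weight" rule can be sketched for any base as follows. Exact fractions are used so that (512.74)_10 does not pick up binary rounding error; the function name `positional_value` is an illustrative assumption.

```python
from fractions import Fraction

def positional_value(number: str, base: int) -> Fraction:
    """Return the magnitude of a digit string such as '101.01' in the given base."""
    integer, _, fraction = number.partition(".")
    total = Fraction(0)
    # Integer digits: positions 0, 1, 2, ... counted leftwards from the radix point.
    for position, digit in enumerate(reversed(integer)):
        total += int(digit, base) * Fraction(base) ** position
    # Fraction digits: positions -1, -2, ... counted rightwards from the radix point.
    for position, digit in enumerate(fraction, start=1):
        total += int(digit, base) * Fraction(base) ** -position
    return total

print(positional_value("101.01", 2))   # 21/4, i.e. 5.25
print(positional_value("512.74", 10))  # 25637/50, i.e. 512.74
```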

Binary Addition
• Column Addition
  1 1 1 1 1 1          Carry
    1 1 1 1 0 1        = 61
+   0 1 0 1 1 1        = 23
  -------------
  1 0 1 0 1 0 0        = 84
– A carry is generated whenever a column sum is ≥ (2)_10.
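Column addition with carries can be sketched as below for unsigned binary strings. The function name `add_binary` is illustrative.

```python
def add_binary(a: str, b: str) -> str:
    """Add two unsigned binary strings column by column, right to left."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry = 0
    out = []
    for x, y in zip(reversed(a), reversed(b)):
        column = int(x) + int(y) + carry
        out.append(str(column % 2))  # sum bit for this column
        carry = column // 2          # carry when the column sum is >= 2
    if carry:
        out.append("1")
    return "".join(reversed(out))

print(add_binary("111101", "10111"))  # 61 + 23
```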

Range of values in a byte
Lowest exponent   Min   Step     Max       Value of 00110001
 0                0     1        255       49
-1                0     0.5      127.5     24.5
-2                0     0.25     63.75     12.25
-4                0     0.0625   15.9375   3.0625

Decimal (Integer) to Binary Conversion
• Divide the number by the 'Base' (=2)
• Take the remainder (either 0 or 1) as a coefficient
• Take the quotient and repeat the division
Example: (13)_10
            Quotient  Remainder  Coefficient
13 / 2 =    6         1          a0 = 1
 6 / 2 =    3         0          a1 = 0
 3 / 2 =    1         1          a2 = 1
 1 / 2 =    0         1          a3 = 1
Answer: (13)_10 = (a3 a2 a1 a0)_2 = (1101)_2, where a3 is the MSB and a0 is the LSB

Decimal (Fraction) to Binary Conversion
• Multiply the number by the 'Base' (=2)
• Take the integer (either 0 or 1) as a coefficient
• Take the resultant fraction and repeat the multiplication
Example: (0.625)_10
               Integer  Fraction  Coefficient
0.625 * 2 =    1      . 25        b-1 = 1
0.25  * 2 =    0      . 5         b-2 = 0
0.5   * 2 =    1      . 0         b-3 = 1
Answer: (0.625)_10 = (0.b-1 b-2 b-3)_2 = (0.101)_2, where b-1 is the MSB and b-3 is the LSB
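Both conversion procedures (repeated division for the integer part, repeated multiplication for the fraction) can be sketched as below. The function names and the `max_bits` cut-off are assumptions added here, since many decimal fractions do not terminate in binary.

```python
def int_to_binary(n: int) -> str:
    """Repeated division by 2; the remainders are a0, a1, ... (LSB first)."""
    bits = []
    while n > 0:
        n, remainder = divmod(n, 2)
        bits.append(str(remainder))
    return "".join(reversed(bits)) or "0"

def frac_to_binary(x: float, max_bits: int = 8) -> str:
    """Repeated multiplication by 2; the integer parts are b-1, b-2, ... (MSB first)."""
    bits = []
    while x > 0 and len(bits) < max_bits:
        x *= 2
        bit = int(x)   # integer part is the next coefficient
        bits.append(str(bit))
        x -= bit       # keep only the fraction for the next step
    return "0." + ("".join(bits) or "0")

print(int_to_binary(13))      # 1101
print(frac_to_binary(0.625))  # 0.101
```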

Types of Number Representation
• There are 3 types
• These are used to represent the numbers in digital computer or
any other digital hardware.
• Fixed Point Representation
• Floating Point Representation
• Block Floating Point Representation

Fixed Point Representation
• In this arithmetic the position of the binary point is fixed.
• The bits to the right of the binary point represent the fractional part.
• The bits to the left of the binary point represent the integer part.
(e.g.) The binary number 01.1100 has the value 1.75 in decimal.
• Negative numbers can be represented in three different forms in
fixed point arithmetic:
• Sign-Magnitude form
• One's complement form
• Two's complement form

Sign-Magnitude form
• The most significant bit is set to 1 to represent the negative
sign.
(e.g.)
• The decimal number -1.75 is represented as 11.110000
• 1.75 is represented as 01.110000

One’s complement form
• Here the positive number is represented as in the sign-magnitude
notation.
• The negative number is obtained by complementing all the bits of
the positive number.
(0.875)_10 = (0.111000)_2
(-0.875)_10 = (1.000111)_2

Two’s complement form
• Here the positive numbers are represented as in the sign-magnitude
and one's complement forms.
• The negative number is obtained by complementing all the bits of
the positive number and adding one to the least significant bit.
(0.875)_10 = (0.111000)_2
(-0.875)_10 = (1.000111)_2 + (0.000001)_2 = (1.001000)_2

Complements
• 1's Complement (Diminished Radix Complement)
– All '0's become '1's
– All '1's become '0's
Example: (10110000)_2 → (01001111)_2
If you add a number and its 1's complement, every bit of the sum is 1:
  1 0 1 1 0 0 0 0
+ 0 1 0 0 1 1 1 1
-----------------
  1 1 1 1 1 1 1 1

Complements
• 2's Complement (Radix Complement)
– Take 1's complement then add 1
– OR: toggle all bits to the left of the first '1' from the right
Example:
Number:      1 0 1 1 0 0 0 0
1's Comp.:   0 1 0 0 1 1 1 1
                         + 1
2's Comp.:   0 1 0 1 0 0 0 0
OR
Number:      1 0 1 1 0 0 0 0
2's Comp.:   0 1 0 1 0 0 0 0   (bits to the left of the rightmost 1 are toggled)
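Both routes to the 2's complement can be sketched on bit strings as below. The function names are illustrative; the word length is taken from the length of the input string.

```python
def ones_complement(bits: str) -> str:
    """Flip every bit."""
    return "".join("1" if b == "0" else "0" for b in bits)

def twos_complement(bits: str) -> str:
    """Take the 1's complement, then add 1 (modulo the word length)."""
    n = len(bits)
    value = (int(ones_complement(bits), 2) + 1) % (2 ** n)
    return format(value, f"0{n}b")

def twos_complement_shortcut(bits: str) -> str:
    """Toggle all bits to the left of the rightmost '1'."""
    i = bits.rfind("1")
    if i == -1:          # all zeros: the complement is also all zeros
        return bits
    return ones_complement(bits[:i]) + bits[i:]

print(ones_complement("10110000"))           # 01001111
print(twos_complement("10110000"))           # 01010000
print(twos_complement_shortcut("10110000"))  # 01010000
```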

Complements
• Subtraction of unsigned numbers can also be done by means of the
(r - 1)'s complement.
• Example: Y - X using 1's complement, with X = 1010100 and Y = 1000011.
  Y                    = 1000011
  1's complement of X  = 0101011
  Sum                  = 1101110
There is no end carry.
Therefore, the answer is Y - X = -(1's complement of 1101110) = -0010001.

Complements
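• Example
– Given the two binary numbers X = 1010100 and Y = 1000011,
perform the subtraction (a) X - Y and (b) Y - X by using 2's
complement.
(a) X = 1010100, 2's complement of Y = 0111101, sum = 10010001.
Discarding the end carry, X - Y = 0010001.
(b) Y = 1000011, 2's complement of X = 0101100, sum = 1101111.
There is no end carry.
Therefore, the answer is Y - X = -(2's complement of 1101111) = -0010001.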
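The end-carry rule for subtracting unsigned numbers with complements can be sketched as follows. The function name and the `kind` argument are illustrative assumptions; both operands are assumed to have the same number of bits.

```python
def subtract_with_complement(minuend: str, subtrahend: str, kind: str) -> str:
    """Subtract unsigned binary strings using the 1's ('ones') or 2's ('twos') complement."""
    n = len(minuend)
    mask = (1 << n) - 1
    m, s = int(minuend, 2), int(subtrahend, 2)
    complement = (~s & mask) if kind == "ones" else ((~s + 1) & mask)
    total = m + complement
    end_carry = total >> n
    total &= mask
    if end_carry:
        if kind == "ones":
            total = (total + 1) & mask  # end-around carry
        return format(total, f"0{n}b")  # 2's complement: end carry is discarded
    # No end carry: the result is negative, so complement the sum and attach a minus sign.
    magnitude = (~total & mask) if kind == "ones" else ((~total + 1) & mask)
    return "-" + format(magnitude, f"0{n}b")

X, Y = "1010100", "1000011"
print(subtract_with_complement(X, Y, "twos"))  # X - Y
print(subtract_with_complement(Y, X, "twos"))  # Y - X
print(subtract_with_complement(Y, X, "ones"))  # Y - X
```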

Signed Binary Numbers
• To represent negative integers, we need a notation for negative
values.
• It is customary to represent the sign with a bit placed in the leftmost
position of the number.
• The convention is to make the sign bit 0 for positive and 1 for
negative.
• Example:
Table 1 lists all possible four-bit signed binary numbers in the three
representations.

Different Number Representation for 3-bit word length
(excluding sign bit)
Binary number Sign-magnitude Two’s Complement One’s Complement
0.111 7/8 7/8 7/8
0.110 6/8 6/8 6/8
0.101 5/8 5/8 5/8
0.100 4/8 4/8 4/8
0.011 3/8 3/8 3/8
0.010 2/8 2/8 2/8
0.001 1/8 1/8 1/8
0.000 0 0 0
1.000 0 -1 -7/8
1.001 -1/8 -7/8 -6/8
1.010 -2/8 -6/8 -5/8
1.011 -3/8 -5/8 -4/8
1.100 -4/8 -4/8 -3/8
1.101 -5/8 -3/8 -2/8
1.110 -6/8 -2/8 -1/8
1.111 -7/8 -1/8 0
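The table entries can be reproduced with a small decoder. It assumes patterns written as a sign bit, a point, and the fraction bits (for example '1.011'); the function name `decode` and the form labels are illustrative.

```python
from fractions import Fraction

def decode(bits: str, form: str) -> Fraction:
    """Interpret a pattern such as '1.011' in the given signed representation."""
    sign, frac = bits.split(".")
    b = len(frac)
    magnitude = Fraction(int(frac, 2), 2 ** b)
    if sign == "0":
        return magnitude                 # positive numbers agree in all three forms
    if form == "sign-magnitude":
        return -magnitude
    if form == "ones":
        # Complementing the fraction bits recovers the magnitude.
        return -(Fraction(2 ** b - 1, 2 ** b) - magnitude)
    if form == "twos":
        return magnitude - 1             # the sign bit carries weight -1
    raise ValueError(f"unknown form: {form}")

for pattern in ("1.000", "1.011", "1.111"):
    print(pattern, [str(decode(pattern, f)) for f in ("sign-magnitude", "twos", "ones")])
```

Values print in lowest terms, so -4/8 appears as -1/2.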

• Arithmetic addition
– The addition of two numbers in the signed-magnitude system follows
the rules of ordinary arithmetic. If the signs are the same, we add the
two magnitudes and give the sum the common sign. If the signs are
different, we subtract the smaller magnitude from the larger and give
the difference the sign of the larger magnitude.
– The addition of two signed binary numbers with negative numbers
represented in signed-2's-complement form is obtained from the
addition of the two numbers, including their sign bits.
– A carry out of the sign-bit position is discarded.
• Example:

• Arithmetic Subtraction
– In 2's-complement form:
1. Take the 2's complement of the subtrahend (including the
sign bit) and add it to the minuend (including the sign bit).
2. A carry out of the sign-bit position is discarded.
(±A) - (+B) = (±A) + (-B)
(±A) - (-B) = (±A) + (+B)
• Example:
(-6) - (-13)   (11111010 - 11110011)
               (11111010 + 00001101)
               00000111 (+7)
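Signed 2's-complement addition and subtraction can be sketched for an 8-bit word as below. The word length `BITS` and the helper names are assumptions chosen to match the 8-bit example.

```python
BITS = 8
MASK = (1 << BITS) - 1

def encode(value: int) -> str:
    """Return the 8-bit 2's-complement pattern of a signed integer."""
    return format(value & MASK, f"0{BITS}b")

def decode(bits: str) -> int:
    """Return the signed integer represented by an 8-bit 2's-complement pattern."""
    raw = int(bits, 2)
    return raw - (1 << BITS) if raw & (1 << (BITS - 1)) else raw

def add(a: str, b: str) -> str:
    """Add including sign bits; the carry out of the sign position is discarded."""
    return format((int(a, 2) + int(b, 2)) & MASK, f"0{BITS}b")

def subtract(a: str, b: str) -> str:
    """Add the 2's complement of the subtrahend to the minuend."""
    negated_b = format((-int(b, 2)) & MASK, f"0{BITS}b")
    return add(a, negated_b)

minuend, subtrahend = encode(-6), encode(-13)
difference = subtract(minuend, subtrahend)
print(minuend, subtrahend, difference, decode(difference))
```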

Addition of two fixed point numbers
• The two numbers are added bit by bit starting from the right, with the
carry bit being added to the next bit.
E.g. (0.5)_10 + (0.125)_10
• Assuming the total number of bits b+1 = 4 (including sign bit),
we obtain
  (0.5)_10   = (0.100)_2
  (0.125)_10 = (0.001)_2
  Sum        = (0.101)_2 = (0.625)_10   (the bit before the binary point is the sign bit)
• When two numbers of b bits are added and the sum cannot be
represented by b bits, an overflow is said to occur.

E.g.
  (0.5)_10   = (0.100)_2
  (0.625)_10 = (0.101)_2
  Sum        = (1.001)_2, which reads as (-0.125)_10 in sign magnitude
• Overflow occurs in the above result because (1.125)_10 cannot be
represented in a number system with 3 magnitude bits.
• Thus, in general, addition of fixed point numbers can cause an overflow.
• The subtraction of two fixed point numbers can be performed easily
by using two's complement representation.
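The two addition examples can be reproduced with a sketch that adds (b+1)-bit patterns and flags overflow. It assumes both operands are non-negative (sign bit 0), as in the examples; the function name is illustrative.

```python
def add_fractions(a_bits: str, b_bits: str):
    """Add two patterns like '0.100'; return (sum pattern, overflow flag)."""
    n = len(a_bits) - 1                      # word length including the sign bit
    a = int(a_bits.replace(".", ""), 2)
    b = int(b_bits.replace(".", ""), 2)
    total = (a + b) & ((1 << n) - 1)
    pattern = format(total, f"0{n}b")
    # With two non-negative operands, a 1 in the sign position signals overflow.
    overflow = pattern[0] == "1"
    return pattern[0] + "." + pattern[1:], overflow

print(add_fractions("0.100", "0.001"))  # 0.5 + 0.125
print(add_fractions("0.100", "0.101"))  # 0.5 + 0.625
```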

Multiplication in fixed point arithmetic
• In multiplication of two fixed point numbers, first the sign and
magnitude components are separated.
• The magnitudes of the numbers are multiplied first, then the sign of the
product is determined and applied to the result.
• When a b bit number is multiplied by another b bit number, the
product may contain 2b bits.
For e.g. (11)_2 x (11)_2 = (1001)_2.
• If the b bits are organized as b = bi + bj,
where bi is the integer part and bj is the fraction part,
the product may contain 2bi + 2bj bits.
• In fixed point arithmetic, multiplication of two fractions results in a fraction.
• For multiplication of fractions, overflow can never occur.
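The growth of the product to 2b bits can be seen in a short sketch for unsigned binary fractions. The function name is illustrative and sign handling is left out.

```python
def multiply_fractions(a_bits: str, b_bits: str) -> str:
    """Multiply two unsigned binary fractions written like '0.1100'."""
    fa = a_bits.split(".")[1]
    fb = b_bits.split(".")[1]
    product = int(fa, 2) * int(fb, 2)
    width = len(fa) + len(fb)           # b bits times b bits needs 2b bits
    return "0." + format(product, f"0{width}b")

print(multiply_fractions("0.1100", "0.1010"))  # 0.75 x 0.625
print(multiply_fractions("0.111", "0.111"))    # largest 3 bit fraction squared
```

The integer part of the result is always 0, which illustrates why a product of fractions cannot overflow.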

Floating Point representation
• In floating point representation a positive number is represented as
F = 2^C x M
• where M, the mantissa, is a fraction such that 1/2 ≤ M < 1
• and C, the exponent, is either positive or negative.
• The decimal numbers 4.5, 1.5, 6.5 and 0.625 have floating point
representations
2^3 x 0.5625, 2^1 x 0.75, 2^3 x 0.8125 and 2^0 x 0.625 respectively.
• Equivalently, with exponent and mantissa written in binary:
2^3 x 0.5625 = 2^011 x 0.1001    since (3)_10 = (011)_2 and (0.5625)_10 = (0.1001)_2
2^1 x 0.75   = 2^001 x 0.1100    since (1)_10 = (001)_2 and (0.75)_10 = (0.1100)_2
2^3 x 0.8125 = 2^011 x 0.1101    since (0.8125)_10 = (0.1101)_2
2^0 x 0.625  = 2^000 x 0.1010    since (0.625)_10 = (0.1010)_2

• Negative floating point numbers are generally represented by
considering the mantissa as a fixed point number.
• The sign of the floating point number is obtained from the first bit of
mantissa.
• Floating point multiplication is carried out as follows.
Let F1 = 2^C1 x M1 and F2 = 2^C2 x M2.
• Then the product F3 = F1 x F2 = (M1 x M2) x 2^(C1+C2),
(i.e.) the mantissas are multiplied using fixed point arithmetic and the
exponents are added.
• The product (M1 x M2) lies in the range 0.25 to 1.0.
If it is less than 0.5, the mantissa must be renormalized and the
exponent (C1 + C2) altered accordingly.

• (1.5)_10 = 2^1 x 0.75 = 2^001 x 0.1100
• (1.25)_10 = 2^1 x 0.625 = 2^001 x 0.1010
• (1.5)_10 x (1.25)_10 = (2^001 x 0.1100) x (2^001 x 0.1010)
= 2^010 x 0.01111
– Mantissa product: (0.1100)_2 x (0.1010)_2 = (0.01111000)_2,
using (0.75)_10 = (0.1100)_2 and (0.625)_10 = (0.1010)_2
• Addition and subtraction of two floating point numbers are more
difficult than the addition and subtraction of two fixed point numbers.
• To carry out addition, first adjust the exponent of the smaller number
until it matches the exponent of the larger number. The mantissas are then
added or subtracted.
• Finally, the resulting representation is rescaled so that its mantissa lies
in the range 0.5 to 1.
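The multiplication rule (multiply mantissas, add exponents, renormalize) can be sketched with Python's `math.frexp`, which returns a mantissa in the range 0.5 to 1 and a power-of-two exponent. The function name `fp_multiply` is illustrative, and finite word length of the mantissa is not modelled.

```python
import math

def fp_multiply(x: float, y: float):
    """Return (mantissa, exponent) of x * y with 0.5 <= |mantissa| < 1."""
    m1, c1 = math.frexp(x)   # x = m1 * 2**c1
    m2, c2 = math.frexp(y)   # y = m2 * 2**c2
    m, c = m1 * m2, c1 + c2  # mantissas multiply, exponents add
    if 0 < abs(m) < 0.5:     # product fell into 0.25..0.5: shift left once
        m, c = m * 2, c - 1
    return m, c

m, c = fp_multiply(1.5, 1.25)
print(m, c, m * 2 ** c)  # normalized mantissa, exponent, value
```

For 1.5 x 1.25 the unnormalized result 2^010 x 0.01111 becomes 2^001 x 0.1111 after the shift.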

• Suppose we are adding 3.0 and 0.125
3.0 = 2^010 x 0.11000
0.125 = 2^000 x 0.001000
• Now we adjust the exponent of the smaller number so that both
exponents are equal.
0.125 = 2^010 x 0.0000100
• Now the sum is equal to 2^010 x 0.110010
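The alignment step can be sketched with `math.frexp` and `math.ldexp`. Note one assumption: `math.frexp` normalizes 0.125 as 2^-2 x 0.5 rather than the unnormalized 2^000 x 0.001 used above, but the aligned mantissa and the sum are the same. The function name `fp_add` is illustrative.

```python
import math

def fp_add(x: float, y: float):
    """Return (mantissa, exponent) of x + y with 0.5 <= |mantissa| < 1."""
    m1, c1 = math.frexp(x)
    m2, c2 = math.frexp(y)
    c = max(c1, c2)                 # align to the larger exponent
    m1 = math.ldexp(m1, c1 - c)     # shift the mantissa right by the exponent gap
    m2 = math.ldexp(m2, c2 - c)
    m, shift = math.frexp(m1 + m2)  # rescale the sum back into 0.5..1
    return m, c + shift

m, c = fp_add(3.0, 0.125)
print(m, c, m * 2 ** c)  # 0.78125 is (0.11001)_2
```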

Fixed Point Arithmetic
• Fast Operation
• Relatively economical
• Small dynamic range
• Round off errors occur only
in multiplication
• Overflow occur in addition
• Used in small computers
Floating point Arithmetic
• Slow Operation
• More Expensive because of
costlier hardware
• Increased dynamic range
• Roundoff errors can occur with
both addition and
multiplication
• Overflow is very unlikely because of the large dynamic range.
• Used in larger, general purpose
computers.

Quantization
• For most applications the input signal is continuous in time, i.e. an
analog waveform.
• The signal is converted into digital form by using an ADC.
• The signal x(t) is sampled at regular intervals t = nT, where n = 0, 1, 2, 3, …,
to create a sequence x(n). This is done by a sampler.
• The numeric equivalent of each sample x(n) is expressed by a finite
number of bits, giving the sequence xq(n).
• The difference e(n) = xq(n) - x(n) is called quantization noise or A/D
conversion noise.
x(t) → Sampler → x(n) → Quantizer → xq(n)

• Let us assume a sinusoidal signal varying between +1 and -1, having a
dynamic range of 2.
• If an ADC is used to convert the sinusoidal signal, it employs (b+1) bits
including the sign bit.
• Then the number of levels available for quantizing x(n) is 2^(b+1).
• The interval between successive levels is
q = 2 / 2^(b+1) = 2^(-b), where q is the quantization step size.
• If b = 3 bits then q = 2^(-3) = 0.125
• The common methods of quantization are
– Truncation
– Rounding
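The step size formula can be checked numerically. The sketch below also lists the levels under the assumption that they start at -1 and are spaced q apart (a two's complement layout); the function names are illustrative.

```python
def quantization_step(b: int, dynamic_range: float = 2.0) -> float:
    """Step size q for (b+1) bits including the sign bit."""
    levels = 2 ** (b + 1)
    return dynamic_range / levels

def quantization_levels(b: int):
    """Levels from -1 upward in steps of q (assumed two's complement layout)."""
    q = quantization_step(b)
    return [-1.0 + k * q for k in range(2 ** (b + 1))]

print(quantization_step(3))         # 0.125
print(len(quantization_levels(3)))  # 16 levels
print(quantization_levels(3)[-1])   # largest level, 1 - q
```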

Truncation
• Truncation is the process of discarding all bits less significant than
the least significant bit that is retained.
• Suppose we truncate the following binary numbers from 8 bits to 4
bits; we obtain
0.00110011 (8 bits) to 0.0011 (4 bits)
1.01001001 (8 bits) to 1.0100 (4 bits)
• When we truncate a number, the signal value is approximated by the
highest quantization level that is not greater than the signal.

Rounding
• Rounding a number to b bits is accomplished by choosing the
rounded result as the b bit number closest to the original unrounded
number.
E.g.
• 0.11010 rounded to three bits is either 0.110 or 0.111, since it lies
exactly midway between them.
• If the number 0.110111111 is rounded to 8 bits then the result
may be 0.11011111 or 0.11100000.
• In such midway cases, rounding up or down has a negligible effect on
the accuracy of computation.
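Truncation and rounding to b fractional bits can be sketched numerically as below. Two assumptions are made here: truncation uses the floor (the highest level not greater than the value, as for two's complement), and midway cases round up. The function names are illustrative and `parse` handles non-negative patterns only.

```python
import math

def parse(bits: str) -> float:
    """Value of a non-negative binary fraction such as '0.00110011'."""
    frac = bits.split(".")[1]
    return int(frac, 2) / 2 ** len(frac)

def truncate(x: float, b: int) -> float:
    """Keep b fractional bits, discarding the rest."""
    q = 2.0 ** -b
    return math.floor(x / q) * q

def round_half_up(x: float, b: int) -> float:
    """Nearest multiple of 2**-b; exact midway cases go up."""
    q = 2.0 ** -b
    return math.floor(x / q + 0.5) * q

print(truncate(parse("0.00110011"), 4))   # 0.1875, i.e. 0.0011
print(round_half_up(parse("0.11010"), 3)) # 0.875, i.e. 0.111
print(truncate(parse("0.11010"), 3))      # 0.75, i.e. 0.110
```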

DSP-CIS / Chapter-6: Filter Implementation / Version 2012-2013 p. 52
Quantization Noise
Quantization mechanisms:
Mechanism              Mean                    Variance
Rounding               0                       (1/12) LSB^2
Truncation             (-0.5) LSB (biased!)    (1/12) LSB^2
Magnitude Truncation   0                       (1/6) LSB^2
(Figure: input/output characteristic and error probability for each mechanism)
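The mean and variance quoted for rounding and truncation can be checked numerically. The sketch below evaluates the error over a fine, evenly spaced grid of inputs in [-1, 1), which stands in for a uniformly distributed signal; the grid resolution, the step q = 2^-3 and the function names are all assumptions. Magnitude truncation is not covered.

```python
import math
from statistics import fmean, pvariance

def round_q(x: float, q: float) -> float:
    """Round to the nearest multiple of q (midway cases go up)."""
    return math.floor(x / q + 0.5) * q

def truncate_q(x: float, q: float) -> float:
    """Truncate to the highest multiple of q not greater than x."""
    return math.floor(x / q) * q

def error_stats(quantize, q: float, resolution: int = 4096):
    """Return (mean / q, variance / q**2) of the error quantize(x) - x."""
    xs = [k / resolution for k in range(-resolution, resolution)]
    errors = [quantize(x, q) - x for x in xs]
    return fmean(errors) / q, pvariance(errors) / q ** 2

q = 2.0 ** -3
print(error_stats(round_q, q))     # mean near 0, variance near 1/12
print(error_stats(truncate_q, q))  # mean near -0.5, variance near 1/12
```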
