Knowledge of floating-point to fixed-point conversion of DSP code is a must for every aspiring DSP engineer. This is a quick course on how to convert DSP code to fixed point. After reading it, you should be able to write robust fixed-point DSP code, e.g. FIR/IIR digital filters and audio compression code such as an MP3 codec.
1. Floating Point to Fixed Point Conversion
Audio DSP Ksharing: Rajesh Sharma
www.rajeshsharma.co.in sharma.rajesh@gmail.com
7/17/2010
2. Numbers
•Real Numbers –
•Rational, irrational, positive, negative, zero…
•An infinite decimal representation such as
2.487154145934798874502350989093….
•Any point on an infinitely long number line
•Integer Numbers-
•whole numbers, no fractional part
•Discrete points on number line… e.g. -10, -4, 0, 7, 9
3. Number Representation
•Computer
•Finite Word Length is a major issue
•Processor Register Length: 8Bit, 16Bit, 24Bit, 32Bit, 64Bit..
•A limitation on the possible representations in a register
•Floating Point Numbers:
•IEEE754 Standard
•Pseudo Float
•Fixed Point Numbers: No Defined Standard
•Fractional Numbers: Q Format
•Integer Numbers
4. Few Terms
•Format- Representing numbers in binary system.
•e.g. for signed number we have…
•Float: Mantissa Bits & Exponent Bits
•Fixed: Integer Bits & Fractional Bits
•Total Bits = Mantissa Bits + Exponent Bits + Sign Bit (float), or Integer Bits + Fractional Bits + Sign Bit (fixed)
•Precision: effective word length = N
•Accuracy: maximum error between the actual number and its representation
 e.g. the representation is accurate to the 8th decimal place
 e.g. the representation has a maximum error of 0.001%
•Resolution: smallest non-zero magnitude representable in the given format
5. Dynamic Range
• While Processing/Storing variables: we need to find
•Varmax absolute maximum value a variable can have
•Varmin absolute minimum value a variable can have
•Range of a Variable: ℜ : -Varmax to +Varmax
•Resolution of a Variable: ∆ : Varres
•Resolution can be determined from…..
•Absolute Minimum value a variable can have: Varmin
•How precisely two consecutive values of a variable may differ
•what is the accuracy required (e.g. till 8th decimal place)
6. Dynamic Range
•The ratio of the largest and smallest possible values of the
variable quantity
•For Fixed point, resolution of the representation becomes the
smallest value of the variable.
•Bits for signed representation = B = N + 1
•Bits for unsigned representation = B = N
$$\text{Dynamic Range (Ratio)} = \frac{Var_{max}}{\Delta} = \frac{Var_{max}}{Var_{res}}$$
$$\text{Dynamic Range (Bits)} = \log_2\left(\frac{Var_{max}}{Var_{res}}\right) = N \text{ Bits}$$
Dynamic range is most often expressed as a ratio (e.g. 5000:1), in dB (e.g. 75 dB), or in bits (e.g. 12 bits).
7. Dynamic Range: Audio system
•Audio system specification uses unsigned representation
•The ratio of amplitude of loudest possible undistorted sine wave to rms
noise amplitude
•e.g. of microphone, loudspeaker, ADC, DAC etc..
•For display devices it is often called contrast ratio
$$\text{Dynamic Range (dB)} = 20 \log_{10}\left(\frac{Var_{max}}{\Delta}\right)$$
$$\text{Dynamic Range (dB)} = 6.02 \cdot N + 1.76$$
For fixed point conversion, the signed representation method is normally used to calculate dynamic range.
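The different ways of quoting dynamic range can be cross-checked numerically. The following is a minimal sketch, assuming Python; the helper names `dynamic_range` and `sine_dynamic_range_db` are hypothetical and only illustrate the formulas on the slides.

```python
import math

def dynamic_range(var_max, var_res):
    """Return the dynamic range as (ratio, bits, dB) for a maximum value and a resolution."""
    ratio = var_max / var_res
    bits = math.log2(ratio)          # Dynamic Range (Bits) = log2(Var_max / Var_res)
    db = 20.0 * math.log10(ratio)    # Dynamic Range (dB) = 20 * log10(Var_max / delta)
    return ratio, bits, db

def sine_dynamic_range_db(n_bits):
    """Audio-style figure from the slide: 6.02 * N + 1.76 dB."""
    return 6.02 * n_bits + 1.76

ratio, bits, db = dynamic_range(5000.0, 1.0)
print(f"ratio {ratio:.0f}:1 = {bits:.2f} bits = {db:.2f} dB")
print(f"N = 16: {sine_dynamic_range_db(16):.2f} dB")
```

A ratio of 5000:1 comes out a little above 12 bits, so 13 whole bits would be needed to cover it without loss.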
8. Finite Word Length
•Computers- µP, DSP, µController
•Store results in Registers having finite word length
•N-bit storage can hold only 2^N distinct values
•How to represent a variable ?
•Floating point format
•Fixed point format
•Choice is governed by…
•Dynamic range of the variable (most important)
•Power consumption of the specific type of ALU
•Cost of implementation of the ALU
•Other possibilities…
9. Floating Point
•The decimal point or binary point is floating
•Economical for very large/small number storage
•FPU silicon implementation is costly
•Usually represented in one of two ways:
$$r = (-1)^{s} \cdot 2^{(e-B)} \cdot 0.f \quad \text{or} \quad r = (-1)^{s} \cdot 2^{(e-B)} \cdot 1.f$$
•s: sign bit value; B: bias; e: exponent; f: mantissa
•IEEE 754 single precision: s: 1 bit; B = 127; e: 8 bits; f: 23 bits
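To make the sign, exponent and mantissa fields concrete, the sketch below unpacks an IEEE 754 single precision value with Python's standard `struct` module and rebuilds it with the 1.f form, which is the form IEEE 754 uses for normalized numbers. The function names `decode_float32` and `rebuild_normal` are hypothetical.

```python
import struct

def decode_float32(x):
    """Split x (stored as IEEE 754 single precision) into its s, e and f fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31             # 1 sign bit
    e = (bits >> 23) & 0xFF    # 8 exponent bits
    f = bits & 0x7FFFFF        # 23 mantissa bits
    return s, e, f

def rebuild_normal(s, e, f, bias=127):
    """r = (-1)^s * 2^(e - B) * 1.f, valid for normalized numbers only (0 < e < 255)."""
    return (-1.0) ** s * 2.0 ** (e - bias) * (1.0 + f / 2.0 ** 23)

s, e, f = decode_float32(-0.15625)
print(s, e, hex(f), rebuild_normal(s, e, f))
```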
10. Floating Point Variable
•The Range of variable is very high
•Still there are only 2^N distinct values for N bits
•The accuracy of variable varies with values
•Spacing between numbers is not constant
•IEEE 754 supports four precisions
•Half precision, 16 Bit; Single precision, 32 Bit
•Double precision, 64 Bit; Quadruple precision, 128 Bit
[Figure: single precision example]
11. Floating Point Arithmetic
•Floating point arithmetic is very easy to use
•Floating point arithmetic: problems
•It may not be distributive (processor implementation)*
• a * (b + c) != a*b + a*c
•Same mathematical operations on different processors
may not produce exactly same result*
•Operations on large numbers have larger absolute error.
•Mantissa limits the resolution
•Exponent limits the largest possible number
*PS: Different floating point implementations of
one DSP algorithm may not be Bit Exact.
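The distributive-law failure is easy to observe. The sketch below is only an illustration, assuming Python floats (IEEE 754 double precision); it counts how many random triples violate a * (b + c) == a*b + a*c. The function name and the value range are arbitrary choices.

```python
import random

def count_distributive_mismatches(trials=1000, seed=0):
    """Count random (a, b, c) triples for which a*(b+c) != a*b + a*c in floating point."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(trials):
        a = rng.uniform(-10.0, 10.0)
        b = rng.uniform(-10.0, 10.0)
        c = rng.uniform(-10.0, 10.0)
        if a * (b + c) != a * b + a * c:
            mismatches += 1
    return mismatches

print(count_distributive_mismatches(), "of 1000 triples are not bit-exact")
```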
12. Number line
• Double Precision: 64 bit ; Mantissa = 52 Bits; exponent 11 Bits
Variable spacing example: 4 digit representation
Number         Next Number    Difference
1.7687*10^-2   1.7688*10^-2   0.0001*10^-2 = 0.000001
1.7687*10^7    1.7688*10^7    0.0001*10^7 = 1000
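The same variable spacing can be measured on real double precision numbers. The sketch below, assuming Python and its standard `struct` module, finds the next representable double above a positive value by incrementing its bit pattern; `next_up` and `spacing` are hypothetical helper names.

```python
import struct

def next_up(x):
    """Next representable double above a positive, finite x."""
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    return struct.unpack("<d", struct.pack("<q", bits + 1))[0]

def spacing(x):
    """Gap between x and the next representable double."""
    return next_up(x) - x

for value in (1.7687e-2, 1.0, 1.7687e7):
    print(value, spacing(value))
```

The gap grows with the magnitude of the number, which is why the absolute error is larger for large values.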
14. Float to Fix: Motivation
•To achieve higher speed of operation at lower cost
•To reduce silicon area of hardware -> cost
•Power saving during code execution
•Fixed point implementations may be directly used in
many DSPs
•Floating point implementations on different processors may not be bit-exact
•Depending on the given input range, it may be beneficial to use fixed point.
15. Fixed Point Numbers
•The binary point is fixed for a given format
•Integer numbers
•Fractional numbers: Q format
•Range is directly limited by the number of bits
•The spacing between numbers is constant
•Q format: written as MQN or M.N
•K bits for the integer part
•N bits for the fractional part
•M = K + sign bit = K + 1
18. How to Convert
•A fractional number can be converted to a fixed point number for a given signed Q format MQN
•Set all N LSBs to 1 and the M MSBs to 0; call this value maxN
•e.g. for 1Q31 format, max31 = 0x7FFFFFFF
•For 1Q15: 0x7FFF = 32767; for 3Q13: 0x1FFF = 8191
•Now multiply the given fractional number by maxN and keep the integer part
Number             0.375          -0.625          1              1.25           2.005          0.0003
1Q15 (hex / dec)   2FFF / 12287   B001 / -20479   7FFF / 32767   overflow       overflow       0009 / 9
3Q13 (hex / dec)   0BFF / 3071    EC01 / -5119    1FFF / 8191    27FE / 10238   4026 / 16422   0002 / 2
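The conversion rule can be sketched in a few lines. This is a minimal illustration, assuming Python and truncation toward zero (which matches the table values); `float_to_q` and `to_hex` are hypothetical names, and the overflow check simply tests whether the result fits in M + N bits of two's complement.

```python
def float_to_q(x, m, n):
    """Convert x to signed MQN by multiplying with maxN = 2^N - 1 and truncating."""
    max_n = (1 << n) - 1
    q = int(x * max_n)                   # int() truncates toward zero
    total_bits = m + n
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    if not lowest <= q <= highest:
        raise OverflowError(f"{x} does not fit in {m}Q{n}")
    return q

def to_hex(q, m, n):
    """Two's complement hex string of q in an (M + N)-bit word."""
    total_bits = m + n
    digits = (total_bits + 3) // 4
    return format(q & ((1 << total_bits) - 1), f"0{digits}X")

for value in (0.375, -0.625, 1.0, 2.005, 0.0003):
    q = float_to_q(value, 3, 13)
    print(value, to_hex(q, 3, 13), q)
```

Calling `float_to_q(1.25, 1, 15)` raises `OverflowError`, which corresponds to the overflow entries in the 1Q15 row.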
19. Setting Q Format
•The variable to be converted is explored for
•Range of the variable value
•Absolute Maximum / Minimum values of variable
•Resolution of variable: smallest non zero number
•Accuracy: Maximum error for a representation
•Accuracy required highly depends on the arithmetic involved.
•It also depends on the reference for final comparison
•Calculate accuracy from tolerable %error in representation.
20. Exploration Process
•If the variable is a part of code
•Run the code for various test cases
•Find the previously described values (range, resolution)
•Range & Accuracy can be also found from arithmetic involved
•Find the Resolution ∆
•If the variable value comes from a read-only (RO) table, then
•These values can be found directly from the table
21. 7/17/201021
www.rajeshsharma.co.in
Setting Q Format: MQN
•Now calculate the Dynamic range required
•Determine the K Integer bits from MaxInt of the Range
•DR bits = B = K + N
•Total Bits = B + 1(S) => 1(S) + K + N; => M + N
•S is for Sign Bit
$$\text{Dynamic Range (Ratio)} = \frac{Var_{max}}{Var_{res}}$$
$$\text{Dynamic Range (Bits)} = \log_2\left(\frac{Var_{max}}{Var_{res}}\right) = B \text{ bits}$$
$$MaxInt = \lceil Var_{max} \rceil$$
$$K = \lceil \log_2(MaxInt) \rceil$$
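The bit budget above can be turned into a small helper. This is a sketch under stated assumptions: Python, a hypothetical function name `choose_q_format`, and B rounded up to whole bits. It follows the slide formula for K literally, so a Var_max that is an exact power of two would need one more integer bit than the formula gives.

```python
import math

def choose_q_format(var_max, var_res):
    """Return (M, N) for a signed MQN format from the range and the resolution."""
    ratio = var_max / var_res
    b = math.ceil(math.log2(ratio))       # dynamic range bits, B = K + N
    max_int = math.ceil(var_max)          # MaxInt = ceil(Var_max)
    k = math.ceil(math.log2(max_int))     # K = ceil(log2(MaxInt))
    n = max(0, b - k)                     # remaining bits go to the fraction
    m = k + 1                             # M = K + sign bit
    return m, n

m, n = choose_q_format(4.4628159401, 0.09154931386)
print(f"{m}Q{n}, total bits = {m + n}")
```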
22. RO Table Example
const float Table[5] = {
    (5.0979557966e-001),
    (6.0134489352e-001),
    (8.9997625993e-001),
    (2.5629159405e+000),
    (-4.4628159401e+000)
};
•Range: ±4.4628159401e+000 (signed representation)
•Resolution (from the variable minimum value): 5.0979557966e-001
•Resolution (from consecutive values): Table[1] - Table[0] = 0.09154931386
•MaxInt = 5
We will set the Q format for this RO table and compare two choices of resolution.
23. Example cont…
•DR = 4.4628159401 / 0.09154931386 = 48.74
•DR Bits B = 6; K = 3 bits (as MaxInt = 5)
•M = 4 bits; N = 3 bits
•Total Bits = 6 (DR) + 1 (S) => 1 (S) + 3 (K) + 3 (N)
•The conversion format MQN is 4Q3
•With N = 3 bits, max3 = 7
•5.0979557966e-001 * max3 = 3.57, which truncates to 3
•3 / max3 = 0.42857142857…
•Representation Error = 15.93% (a very large error)
This error is not acceptable.
24. Example cont…
•The best would be to correctly represent the last digit of 0.50979557966 (accuracy to the 11th decimal place)
•i.e. Resolution = 0.00000000006
•DR = 4.4628159401 / 0.00000000006 = 74380265668
•DR Bits = log2(74380265668) ≈ 36.1, taken here as 36 bits
•Total Bits = 36 (DR) + 1 (S) => 1 (S) + 3 (K) + 33 (N)
•max33 = 0x1FFFFFFFF = 8589934591
•5.0979557966e-001 * max33 = 4379110684
•4379110684 / max33 = 0.509795579652976…
•Representation Error = 1.37e-9%
•Q Format is set as 4Q33
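The representation error for a given number of fractional bits can be checked directly. The sketch below assumes Python and the same multiply-by-maxN, truncate-toward-zero rule used in the example; `quantize` and `error_percent` are hypothetical names.

```python
def quantize(x, n):
    """Return (integer code, value it represents) for N fractional bits."""
    max_n = (1 << n) - 1
    q = int(x * max_n)        # truncate toward zero
    return q, q / max_n

def error_percent(x, n):
    """Relative representation error of x with N fractional bits, in percent."""
    _, approx = quantize(x, n)
    return abs(x - approx) / abs(x) * 100.0

for n in (3, 33):
    code, approx = quantize(0.50979557966, n)
    print(f"N = {n}: code {code}, error {error_percent(0.50979557966, n):.3g}%")
```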
25. Final Table in 4Q33 Format
• PolyPhase4ptDCT4toDCT3[]={
•4379110684, Error = 1.37e-9%
•5165513301, Error = 1.87e-8%
•7730737206, Error = 3.25e-9%
•-38335297017, Error = 1.53e-9% }
•This is a very good representation,
•But many things still matter, e.g.
•Precision of the arithmetic involved
•Bits available for representation, e.g. 32 bits
The representation error of the 37 bit format is very small, and that much accuracy may not be needed for our application.
We may try 32 bits and check whether it fits the application.
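The 32 bit trial suggested here can be sketched as follows. This is an illustration only, assuming Python, the RO table values from the earlier slide, and the hypothetical choice of 4Q28 as the 32 bit candidate; the helper names are not from the slides.

```python
TABLE = [0.50979557966, 0.60134489352, 0.89997625993, 2.5629159405, -4.4628159401]

def to_q(x, n):
    """Integer code of x for N fractional bits (multiply by maxN, truncate)."""
    return int(x * ((1 << n) - 1))

def bits_needed(codes):
    """Word length needed for the largest magnitude, including the sign bit."""
    return max(abs(c) for c in codes).bit_length() + 1

def worst_error_percent(values, n):
    """Largest relative representation error over the table, in percent."""
    max_n = (1 << n) - 1
    return max(abs(x - to_q(x, n) / max_n) / abs(x) * 100.0 for x in values)

q33 = [to_q(x, 33) for x in TABLE]   # 4Q33
q28 = [to_q(x, 28) for x in TABLE]   # 4Q28, a 32 bit candidate
print(bits_needed(q33), worst_error_percent(TABLE, 33))
print(bits_needed(q28), worst_error_percent(TABLE, 28))
```

Whether the larger error of the 32 bit format is acceptable still depends on the arithmetic that uses the table.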
26. Shifting Binary Point
•Right shift of a fractional number in MQN format
•Binary point shifts to the right
•MQN >> n → (M + n)Q(N - n)
•Left shift of a fractional number in MQN format
•Binary point shifts to the left
•MQN << n → (M - n)Q(N + n)
•For example: 2Q30 << 1 is 1Q31
Binary point shifting only indicates the change in the maximum/minimum value representable in the resulting format.
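A short numeric check of the 2Q30 << 1 example is given below. It is a sketch that assumes Python and, unlike the maxN convention used for conversion earlier, a scale factor of exactly 2^N so that the shift is exact; `q_format_after_shift` and `to_real` are hypothetical helpers.

```python
def q_format_after_shift(m, n, left_shift):
    """Format after shifting left by `left_shift` bits (negative means right shift)."""
    return m - left_shift, n + left_shift

def to_real(raw, n):
    """Interpret an integer code with N fractional bits (scale 2^N assumed)."""
    return raw / (1 << n)

raw_2q30 = int(0.75 * (1 << 30))   # 0.75 stored in 2Q30
raw_1q31 = raw_2q30 << 1           # same bits, one place to the left
print(q_format_after_shift(2, 30, 1), to_real(raw_2q30, 30), to_real(raw_1q31, 31))
```

The value is unchanged, but the largest representable magnitude drops from just under 2 to just under 1.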
29. Overflow Vs Carry Flag
•The carry flag is used for unsigned numbers
•It is set when the result of adding/subtracting two unsigned operands does not fit into the given bits
•The overflow flag is for signed arithmetic
•It is set when an add/sub/mult of two numbers does not give the expected sign bit in the result
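The difference between the two flags can be seen on an 8 bit addition. The sketch below assumes Python and models a generic 8 bit adder; `add8_flags` is a hypothetical helper and does not describe any particular processor.

```python
def add8_flags(a, b):
    """Add two 8 bit operands and return (result, carry, overflow)."""
    raw = (a & 0xFF) + (b & 0xFF)
    result = raw & 0xFF
    carry = raw > 0xFF                     # unsigned result does not fit in 8 bits
    sign_a, sign_b, sign_r = a & 0x80, b & 0x80, result & 0x80
    # Signed overflow: operands share a sign but the result has the other sign.
    overflow = (sign_a == sign_b) and (sign_r != sign_a)
    return result, carry, overflow

print(add8_flags(0x7F, 0x01))   # +127 + 1 as signed numbers
print(add8_flags(0xFF, 0x01))   # 255 + 1 as unsigned numbers
```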
30. Fractional Addition
•Two fixed point numbers a & b can be added directly if
• Binary point at same place for both numbers, “a & b”
•Resulting format of fractional addition will have
• Binary point at same place for both numbers, input & output
31. Fractional Addition
•Result format of output is same as that of input
•i.e. Binary point at same place for both numbers, input & output
•There is a possibility of growth of the integer part
•Example: accumulation of an array with data in MQN format
for (i = 0; i < L; i++)
    sum = sum + ar[i];
•sum will have an output format of (M+m)QN, where
$$m = \lceil \log_2(L) \rceil$$
32. Headroom/Guard Bits
•Overflow during an addition/subtraction can be handled with headroom
•Example addition:
for (i = 0; i < 256; i++)
    sum = sum + ar[i];
•The sum can grow by up to log2(256) = 8 bits
•The variable sum should have 8 bits of headroom in its MSBs to accommodate the overflow
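The guard bit count can be verified with a worst-case accumulation. This is a minimal sketch, assuming Python and 16 bit input data; `guard_bits` and `fits_signed` are hypothetical helper names.

```python
import math

def guard_bits(length):
    """Headroom needed to accumulate `length` values: m = ceil(log2(L))."""
    return math.ceil(math.log2(length))

def fits_signed(value, bits):
    """True if value fits in a two's complement word of the given width."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1

ar = [32767] * 256            # worst case: every sample at positive full scale
total = sum(ar)
m = guard_bits(len(ar))
print(m, fits_signed(total, 16), fits_signed(total, 16 + m))
```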
37. Normalizing Fractions
•A product of two two's complement numbers has two sign bits
•The result can be left shifted by 1 bit to remove the redundant sign bit
•If one of the operands is in 1QN format, i.e. N+1 bits, then
•The result of the left shift will have the same integer bits as the other operand
•If the other operand is 5Q11, then the result of the multiplication is 6Q(11+N)
•After a 1 bit left shift, the result is 5Q(12+N)
•i.e. 5Q11 + (N+1) LSBs
•To make the most of the available bits, it is good to left shift the product
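The normalization step can be followed on a 1Q15 by 1Q15 multiplication. The sketch below assumes Python and a scale factor of exactly 2^15 (rather than the maxN convention used for conversion earlier) so that the arithmetic is exact; `to_q15` and `mul_q15` are hypothetical names, and the single overflow case (-1 times -1) is not handled.

```python
def to_q15(x):
    """Integer code of x in 1Q15, assuming a scale factor of 2^15."""
    return int(x * (1 << 15))

def mul_q15(a, b):
    """Multiply two 1Q15 codes and return a 1Q15 code."""
    product = a * b               # 1Q15 * 1Q15 -> 2Q30, two sign bits
    normalized = product << 1     # 2Q30 << 1 -> 1Q31, redundant sign bit removed
    return normalized >> 16       # keep the 16 MSBs -> 1Q15 (low bits truncated)

result = mul_q15(to_q15(0.5), to_q15(0.25))
print(result, result / (1 << 15))
```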
38. Rounding
• Why Rounding ………
•e.g. replacing 45.6782 with 45.68
•Rounding is a necessary evil... (..remember quantization..)
•Many times it is unavoidable
•It introduces round-off errors
•Required in float to fixed conversion and..
•floating & fixed point arithmetic and..
•for function approximation on fixed point processors
To replace a numerical value with an approximately equal and shorter, i.e. lower precision, representation
39. Rounding Methods
•When rounding a number x to an integer q
•Round to nearest: q is the integer closest to x
•Round towards zero (truncate): q is the integer part of x
•Round down (floor): q is the largest integer that does not exceed x
•Round up (ceiling): q is the smallest integer that is not less than x
•Tie break during rounding
•Needed when the fractional part to be rounded is exactly at the half boundary
•e.g. rounding of 23.5 requires tie breaking
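The directed rounding modes map onto functions in Python's standard `math` module. The sketch below only illustrates how they differ for negative inputs; `directed_roundings` is a hypothetical helper.

```python
import math

def directed_roundings(x):
    """Return the truncate, floor and ceiling results for x."""
    return {
        "truncate": math.trunc(x),   # round towards zero
        "floor": math.floor(x),      # largest integer not exceeding x
        "ceiling": math.ceil(x),     # smallest integer not less than x
    }

for x in (2.7, -2.7):
    print(x, directed_roundings(x))
```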
40. Tie Breaking
•Round half up: q = floor(x + 0.5) (x positive or negative)
•It is asymmetric rounding, i.e. biased (with round to nearest)
•Round half away from zero:
•q = floor(x + 0.5) for positive x; q = ceil(x - 0.5) for negative x
•This method is free of bias for data symmetric about zero
•Round half to even
•q is the even integer nearest to x
•This method is less biased still
•+33.5 → +34; +32.5 → +32; -32.5 → -32; -33.5 → -34
•Known as: unbiased rounding , convergent rounding,
statistician's rounding, Dutch rounding, Gaussian rounding,
or bankers' rounding
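The three tie-breaking rules can be compared side by side. This is a minimal sketch, assuming Python, whose built-in `round` applies round half to even to floats; the function names are hypothetical.

```python
import math

def round_half_up(x):
    """Ties go towards +infinity: q = floor(x + 0.5)."""
    return math.floor(x + 0.5)

def round_half_away(x):
    """Ties go away from zero, symmetric for positive and negative x."""
    return math.floor(x + 0.5) if x >= 0 else math.ceil(x - 0.5)

def round_half_even(x):
    """Ties go to the nearest even integer (Python's built-in behaviour)."""
    return round(x)

for x in (33.5, 32.5, -32.5, -33.5):
    print(x, round_half_up(x), round_half_away(x), round_half_even(x))
```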
41. Dithering
•Stochastic rounding:
•Choose q randomly between x - 0.5 and x + 0.5
•It is also bias free because of the random component
•Dithering: when to use this method
•When the signal being rounded / quantized is slowly varying
•All deterministic rounding methods produce monotonous round-off errors
•The error pattern depends on the particular rounding method being used
•All of them introduce a non-linear response in a system
•e.g. harmonics in the filter response because of the rounding method
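One common reading of stochastic rounding is to add a uniform random offset before rounding down, so that x rounds up with a probability equal to its fractional part. The sketch below takes that interpretation as an assumption; it uses Python's standard `random` module and a hypothetical function name.

```python
import math
import random

def stochastic_round(x, rng):
    """Round x up with probability frac(x), otherwise down."""
    return math.floor(x + rng.random())   # rng.random() is uniform in [0, 1)

rng = random.Random(1)
samples = [stochastic_round(0.3, rng) for _ in range(100000)]
print(sum(samples) / len(samples))        # the average stays close to 0.3
```

Round to nearest would return 0 every time for this input, which is the constant error that dithering is meant to break up.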
42. Block Floating Point Processing
•When Dynamic Range of a signal is very high
•When the input is fluctuating between very high and very
small values.
•When very small values are as important as larger
values…
•Instructions like CLZ & NORM are helpful.
A block of data values shares a common radix point for processing.
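The idea of a shared radix point can be sketched as a block normalization step. The example assumes Python and 16 bit words; `block_normalize` is a hypothetical helper, and `int.bit_length` stands in for what a CLZ or NORM instruction would provide on a DSP.

```python
def block_normalize(block, word_bits=16):
    """Shift every value left by the same amount so the block peak uses the full word."""
    peak = max(abs(v) for v in block)
    used_bits = peak.bit_length()                      # magnitude bits of the peak
    shift = max(0, (word_bits - 1) - used_bits)        # keep one bit for the sign
    return [v << shift for v in block], shift          # shift acts as the block exponent

normalized, shift = block_normalize([3, -5, 12, 1])
print(normalized, shift)
```

The common shift has to be stored alongside the block so that results can be scaled back after processing.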
43. Exercise1
•Assume you are working on a 32 bit DSP
•Convert the following table into fixed point:
const float gain_table[8] = {
    (8.2387952833e-002),
    (9.2387952833e-001),
    (7.0710676573e-001),
    (3.8268340208e-001),
    (1.9615705587e+000),
    (1.6629392064e+000),
    (1.1111404206e+000),
    (3.9018056901e+001)
};
46. Exercise4
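•The following IIR filter is used for filtering 16 bit PCM input
•Convert it into fixed point and ensure that the filter response is linear
[Figure: filter block diagram. x[n], x[n-1] and x[n-2] (through two Z-1 delays) are weighted by b0, b1 and b2; y[n-1] and y[n-2] (through two Z-1 delays) are weighted by a1 and a2; the adder output is y[n].]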
y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
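As a possible starting point for this exercise, the difference equation can be written with integer arithmetic. This is only a sketch under assumptions that are not in the slides: Python, coefficients stored in Q14, rounding before the final shift, saturation to 16 bits, and made-up example coefficients. It does not address the linearity requirement of the exercise.

```python
def sat16(v):
    """Saturate to the 16 bit PCM range."""
    return max(-32768, min(32767, v))

def to_q14(c):
    """Coefficient in Q14 (assumed format, covers roughly -2 to +2)."""
    return int(round(c * (1 << 14)))

def biquad_fixed(x, b, a):
    """y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2], Q14 coefficients."""
    b0, b1, b2 = b
    a1, a2 = a
    x1 = x2 = y1 = y2 = 0
    out = []
    for xn in x:
        acc = b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2   # 14 fractional bits
        yn = sat16((acc + (1 << 13)) >> 14)                     # round, rescale, saturate
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        out.append(yn)
    return out

def biquad_float(x, b, a):
    """Floating point reference of the same difference equation."""
    b0, b1, b2 = b
    a1, a2 = a
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for xn in x:
        yn = b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        out.append(yn)
    return out

B, A = (0.2, 0.4, 0.2), (-0.5, 0.25)        # hypothetical stable example coefficients
impulse = [16384] + [0] * 31
fixed = biquad_fixed(impulse, tuple(to_q14(c) for c in B), tuple(to_q14(c) for c in A))
reference = biquad_float(impulse, B, A)
print(max(abs(f - r) for f, r in zip(fixed, reference)))
```

Comparing against the floating point reference over many test signals is one way to judge whether the chosen Q format is accurate enough.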