Fixed Point Conversion

Floating Point to
Fixed Point Conversion
Audio DSP Ksharing: Rajesh Sharma
www.rajeshsharma.co.in sharma.rajesh@gmail.com

7/17/20102
www.rajeshsharma.co.in
Numbers
•Real Numbers –
•Rational, irrational, positive, negative, zero…
•An infinite decimal representation such as
2.487154145934798874502350989093….
•Any point on a infinitely long number line
•Integer Numbers-
•whole numbers, no fractional part
•Discrete points on number line… e.g. -10, -4, 0, 7, 9

7/17/20103
Number Representation
•Computer
•Finite Word Length is a major issue
•Processor Register Length: 8Bit, 16Bit, 24Bit, 32Bit, 64Bit..
•A limitation on the possible representations in a register
•Floating Point Numbers:
•IEEE754 Standard
•Pseudo Float
•Fixed Point Numbers: No Defined Standard
•Fractional Numbers: Q Format
•Integer Numbers

7/17/20104
Few Terms
•Format- Representing numbers in binary system.
•e.g. for signed number we have…
•Float: Mantissa Bits & Exponent Bits
•Fixed: Integer Bits & Fractional Bits
•Total Bits = Mantissa / Integer + Exponent /fractional + sign bit
•Precision - Effective Word Length = N
•Accuracy – Maximum error between actual number and its
representation.
•Resolution- smallest non zero magnitude representable with given
format
e.g. representation is accurate till 8th decimal place
e.g. representation has maximum of 0.001% error

7/17/20105
Dynamic Range
• While Processing/Storing variables: we need to find
•Varmax absolute maximum value a variable can have
•Varmin absolute minimum value a variable can have
•Range of a Variable: ℜ : -Varmax to +Varmax
•Resolution of a Variable: ∆ : Varres
•Resolution can be determined from…..
•Absolute Minimum value a variable can have: Varmin
•How precisely two consecutive values of a variable may differ
•what is the accuracy required (e.g. till 8th decimal place)

7/17/20106
Dynamic Range
•The ratio of the largest and smallest possible values of the
variable quantity
•For Fixed point, resolution of the representation becomes the
smallest value of the variable.
•Bits for signed representation = B = N + 1
•Bits for unsigned representation = B = N
BitsN
Var
Var
BitsRangeDynamic
Var
VarVar
RatioRangeDynamic
res
res
)(log)(
)(
max
2
maxmax
==
=
∆
=
Dynamic Range is more often expressed
as Ratio 5000:1 or as dB 75 dB
Or in Bits 12 Bits

7/17/20107
Dynamic Range: Audio system
•Audio system specification uses unsigned representation
•The ratio of amplitude of loudest possible undistorted sine wave to rms
noise amplitude
•e.g. of microphone, loudspeaker, ADC, DAC etc..
•For display devices it is often called contrast ratio
76.1*02.6)(
log20)( max
10
+=






∆
=
NdBRangeDynamic
Var
dBRangeDynamic
For Fixed point conversion normallyFor Fixed point conversion normallyFor Fixed point conversion normallyFor Fixed point conversion normally
Signed representation methodSigned representation methodSigned representation methodSigned representation method
is used to calculate dynamic rangeis used to calculate dynamic rangeis used to calculate dynamic rangeis used to calculate dynamic range

7/17/20108
Finite Word Length
•Computers- µP, DSP, µController
•Store results in Registers having finite word length
•N Bit storage can have only 2N possibilities
•How to represent a variable ?
•Floating point format
•Fixed point format
•Choice is governed by…
Dynamic Range of Variable (Most Imp…)
Power consumption by specific type of ALU
Cost of implementation of ALU..
Other possibilities….

7/17/20109
Floating Point
•The decimal point or binary point is floating
•Economical for very large/small number storage
•FPU silicon implementation is costly
•Usually represented as
•Two ways of finding values….
•S : sign Bit value; B- Bias ; e – exponent; f - Mantissa
frfr BesBes
.12)1(or.02)1( )()( −−
−=−=
IEEE754, Single Precision: s- 1; Bit; B-127; e-8Bits; f-23Bits

7/17/201010
Floating Point Variable
•The Range of variable is very high
•Still there are only 2N distinct values for N Bits
•The accuracy of variable varies with values
•Spacing between numbers is not constant
•IEEE 754 supports four precisions
•Half precision, 16 Bit; Single precision, 32 Bit
•Double precision, 64 Bit; Quadruple precision, 128 Bit
Single Precision example

7/17/201011
Floating Point Arithmetic
•Floating point arithmetic is very easy to use
•Floating point arithmetic: problems
•It may not be distributive (processor implementation)*
• a * (b + c) != a*b + a*c
•Same mathematical operations on different processors
may not produce exactly same result*
•Operations on large numbers has more error.
•Mantissa limits the resolution
•Exponent limits the largest possible number
*PS: Different floating point implementations of
one DSP algorithm may not be Bit Exact.

7/17/201012
Number line
• Double Precision: 64 bit ; Mantissa = 52 Bits; exponent 11 Bits
0.0001*10-2
=0.000001
0.0001*107
=1000
Difference
1.7688*10-21.7687*10-2
1.7688*1071.7687*107
Next
Number
Number
Variable spacing example:
4 digit representation

7/17/201013
Number line
Microsoft
Equation 3.0
adevalues/dec224
This is for decimal Representation
For binary representation it is 224 values/octave

7/17/201014
Float to Fix: Motivation
•To achieve higher speed of operation at lower cost
•To reduce silicon area of hardware -> cost
•Power saving during code execution
•Fixed point implementations may be directly used in
many DSPs
•Floating point implementation on different processor
may not be bit-exact
•Depending on given input range
•it may be beneficial to use fixed point.

7/17/201015
Fixed Point Numbers
•The binary point is fixed for a given format
•Integer numbers
•Fractional numbers, Q format fractional format
•Range is directly limited by the number of Bits
•The spacing between numbers is constant
•Q format :written as MQN format or M.N
•K bits for Integer part
•N bits for fractional part
•M = K + Sign bit = K + 1

7/17/201016
Fractional Format
• signed Q format 13.3
13Q313Q313Q313Q3
1Q3 number1Q3 number1Q3 number1Q3 number

7/17/201017
Fractional Formats: Range (example)
•

7/17/201018
How to Convert
•A fractional number can be converted to fixed point number for a
given signed Q format MQN
•Set all the N LSB bits to 1 and M Bits to 0; call it maxN
•e.g. for 1Q31 format, max31 = 0X7FFFFFFF
•For 1Q15->0X7FFF = 32767; for 3Q13->0X1FFF= 8191
•Now multiply the given fractional number with this number
2
2
4026
16422
27FE
10238
1FFF
8191
EC01
-5119
0BFF
3071
5Q11
HEX/Dec
0009
9
Over
Flow
Over
Flow
7FFF
32767
B001
-20479
2FFF
12287
1Q15
HEX/Dec
0.00032.0051.251-0.6250.375Number
------------
Format

7/17/201019
Setting Q Format
•The variable to be converted is explored for
•Range of the variable value
•Absolute Maximum / Minimum values of variable
•Resolution of variable: smallest non zero number
•Accuracy: Maximum error for a representation
•Accuracy required highly depends on the arithmetic involved.
•It also depends on the reference for final comparison
•Calculate accuracy from tolerable %error in representation.

7/17/201020
Exploration Process
•If the variable is a part of code
•Run the code for various test cases
•Find the previously described values (range, resolution)
•Range & Accuracy can be also found from arithmetic involved
•Find the Resolution ∆
•If the variable value comes from RO table then
•These values values can be found directly

7/17/201021
Setting Q Format: MQN
•Now calculate the Dynamic range required
•Determine the K Integer bits from MaxInt of the Range
•DR bits = B = K + N
•Total Bits = B + 1(S) => 1(S) + K + N; => M + N
•S is for Sign Bit
bitsB
Var
Var
BitsRangeDynamic
Var
Var
RatioRangeDynamic
res
res
)(log)(
)(
max
2
max
==
=
))((log
)(
2
max
MaxIntceilK
VarceilMaxInt
=
=

7/17/201022
RO Table Example
•Const float Table[5]={
•(5.0979557966e-001),
•(6.0134489352e-001),
•(8.9997625993e-001),
•(2.5629159405e+000)
•(-4.4628159401e+000)
•}
•Range: ±4.4628159401e+000 (signed representation)
•Resolution: 5.0979557966e-001 = Variable minimum value
•Resolution = Table[1]-Table[0] = 0.09154931386
•MaxInt = 5
We will set the Q format
For this RO table and
compare two choices of
resolution

7/17/201023
Example cont…
•DR = (4.4628159401) / 0.09154931386 = 48.74
•DR Bits B = 6 ; K = 3 Bits (as MaxInt = 5)
•M = 4 Bits, N = 3 Bits;
•Total Bits = 6(DR) + 1(S) => 1(S) + 3(K) + 3(N)
•The conversion format MQN is 4Q3
•K = 4 Bits ; N = 3 Bits; max3 = 7;
•5.0979557966e-001 * max3 = 3 ;
•3 / max3 is equivalent to = 0.42857142857….
•Representation Error = 15.93% (very huge error)
This error is not acceptable

7/17/201024
Example cont…
•The best would be to correctly represent the last digit of
•0.50979557966 (accuracy till 11th digit) =
•i.e. Resolution = 0.00000000006
•DR = (4.4628159401) / 0.00000000006 = 74380265668
•DR Bits = 36 bits
•Total Bits = 36(DR) + 1(S) => 1(S) + 3(K) + 33(N)
•max33 = 0x1FFFFFFFF = 8589934591
• 5.0979557966e-001 * max33 = 4379110684 ;
•4379110684 /max33 = 0.509795579652976….
•Representation Error = 1.37e-9%
•Q Format is set as 4Q33

7/17/201025
Final Table in 4Q33 Format
• PolyPhase4ptDCT4toDCT3[]={
•4379110684, Error = 1.37e-9%
•5165513301, Error = 1.87e-8%
•7730737206, Error = 3.25e-9%
•-38335297017, Error = 1.53e-9% }
•This is very good representation,
•But still many things matters e.g.
•Precision of arithmetic involved
•Bits available for representation, e.g. 32 Bits
The 37 bits representation error
is very small and may not be
needed for our application.
We may try with 32 Bits and check
if it fits into application or not??

7/17/201026
Shifting Binary Point
•Right Shift to a fractional number in MQN format
•Binary point shift to right
•MQN >> n (M + n)Q( N - n)
•Left Shift to a fractional number in MQN format
•Binary point shift to left
•MQN << n (M - n)Q( N + n)
•For example: 2Q30 << 1 is 1Q31
Binary point shifting only indicate the change in the
maximum/minimum value representable with resulting format

7/17/201027
Exercise
•If array num[] contains 2Q13 format data
•for(i=0;i<200;i++){
•sum = sum + num[i]
•}
•What is the output format of sum ??

7/17/201029
Overflow Vs Carry Flag
•Carry Flag is used for unsigned numbers
•It is set when…
•Addition/subtraction of two unsigned operands doesn’t fit into
the given Bits
•Overflow flag is for signed arithmetic
•It is set when
•When Add/sub/mult of two numbers does not give the expected
sign bit at the result

7/17/201030
Fractional Addition
•Two fixed point numbers a & b can be added directly if
• Binary point at same place for both numbers, “a & b”
•Resulting format of fractional addition will have
• Binary point at same place for both numbers, input & output

7/17/201031
Fractional Addition
•Result format of output is same as that of input
•i.e. Binary point at same place for both numbers, input & output
•There is a possibility of growth of Integer Part
•Example accumulation of an array with data in MQN format
•for( i=0; I < L; i++ )
sum = sum + ar[i];
•Sum will have output format of (M+m)QN
))((log2 Lceilm =

7/17/201032
Headroom/Guard Bits
•Overflow during a addition/subtraction can be handled
with headroom
• example addition…
• for( i=0; i<256; i++ )
sum = sum + ar[i];
There is possibility of 8 bit overflow
•Variable sum should have 8 bit headroom in MSBs to
accommodate the overflow

7/17/201033
2’s Complement Multiplication

7/17/201034
Fractional Multiplication
•The product of two N bit numbers is 2N bits
•Example product of MQN and KQL formats
•Output format is: (M+K)Q(N+L) format

7/17/201035
Fractional Multiplication
•
1Q3 Numbers
2Q6 Numbers

7/17/201036
Fractional Multiplication Cont..
•

7/17/201037
Normalizing Fractions
•A product of 2 two’s complement numbers has two sign
bits.
•result can be left shifted by 1 bit.
•If one of the format is 1QN i.e. N+1 Bits, then
•Result of left shift will have the format of other operand
•If other operand is 5Q11 then result of mult is 6Q(11+N)
•After 1 bit left shift, result is 5Q(11+N)
•i.e. 5Q11 + (N+1)LSBs
•To make most out of available bits,
•it is good to left shift the product

7/17/201038
Rounding
• Why Rounding ………
•e.g. replacing 45.6782 with 45.68
•Rounding is necessary evil... (..remember quantization..)
•Many times it is unavoidable
•It introduces round-off errors
•Required in float to fixed conversion and..
•floating & fixed point arithmetic and..
•for function approximation on fixed point processors
To replace a Numerical valueTo replace a Numerical valueTo replace a Numerical valueTo replace a Numerical value
with approx. equal & shorter i.e. low precision representationwith approx. equal & shorter i.e. low precision representationwith approx. equal & shorter i.e. low precision representationwith approx. equal & shorter i.e. low precision representation

7/17/201039
Rounding Methods
•When rounding a number x to q an integer
•Round to nearest : q is integer close to x
•Round towards zero (truncate): q is the integer part of x
•Round down (floor): q is largest integer that does not exceed x
•Round up (ceiling): q is smallest integer that is not less than x
•Tie Break during rounding
•When fractional part to be rounded is exactly half boundary
•e.g. rounding of 23.5 requires tie breaking

7/17/201040
Tie Breaking
•Round half up: i.e. q = x + 0.5 (x is positive or negative )
•It is asymmetric rounding i.e. biased (with round to nearest)
•Round half away from zero:
•q = x + 0.5 ; x is positive ; q = x – 0.5 ; x is negative
•This method is free of bias
•Round half to even
•q is the even integer nearest to x
•This method is more unbiased
•+33.5 +34 ; +32.5 +32 ; -32.5 -32 ; -33.5 -34
•Known as: unbiased rounding , convergent rounding,
statistician's rounding, Dutch rounding, Gaussian rounding,
or bankers' rounding

7/17/201041
Dithering
•Stochastic Rounding:
•Choose q randomly between x+0.5 and x-0.5
•It is also bias free because of random component
•Dithering: when to use this method
•when the signal being rounded / quantized is slowly varying
•All rounding methods produce monotonous round-off errors
•Moreover monotonous round-off error because of particular
rounding methods being used
•All of them introduces non-linear response in a system
•Harmonics in the filter response because of rounding method

7/17/201042
Block Floating Point Processing
•When Dynamic Range of a signal is very high
•When the input is fluctuating between very high and very
small values.
•When very small values are as important as larger
values…
•Instruction like CLZ & NORM are helpful.
A block of data values share a common
radix point for processing

7/17/201043
Exercise1
•Assuming you are working on a 32 bit DSP…
•Convert following table into fixed point….
• const float gain_table[8] ={
• (8.2387952833e-002),
• (9.2387952833e-001),
• (7.0710676573e-001),
• (3.8268340208e-001)
• (1.9615705587e+000),
• (1.6629392064e+000),
• (1.1111404206e+000),
• (3.9018056901e+001)
•};

7/17/201044
Exercise2
•Assume that inp[ ] array contains 16 Bit PCM data
•Find the o/p format of out[ ] ?
•Convert following loop into fixed point….
•for (i=0;i<8;i++){
•out[i] = inp[i] * gain_table[i];
•}

7/17/201045
Exercise3
•Now given below is the DCTIII code, with gain applied at end…
•Convert this into a fixed point code…
•

7/17/201046
Exercise4
•Following IIR filter is used for filtering 16 bit PCM ip
•Convert it into fixed point and ensure that the filter
response is linear….
Z-1
Z-1
+
b0
b1
b2
x[n]
x[n-1]
y[n]
x[n-2]
Z-1
Z-1a1
a2
y[n-1]
y[n-2]
y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]

7/17/201047
Exercise4: C code
•

7/17/201048
Exercise4: C code Filter Loop
•

7/17/201049
References
•Fixed Point Arithmetic: an Introduction
•By Randy Yates
•Application Note 33
•Fixed point arithmetic on the ARM
•Document number ARM DAI 0033A
•Wiki

Fixed Point Conversion

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Fixed Point Conversion

Similar to Fixed Point Conversion (20)

Recently uploaded

Recently uploaded (20)

Fixed Point Conversion