Floating Point to
Fixed Point Conversion
Audio DSP Ksharing: Rajesh Sharma
www.rajeshsharma.co.in sharma.rajesh@gmail.com
7/17/2010
Numbers
•Real Numbers –
•Rational, irrational, positive, negative, zero…
•An infinite decimal representation such as
2.487154145934798874502350989093….
•Any point on an infinitely long number line
•Integer Numbers-
•whole numbers, no fractional part
•Discrete points on number line… e.g. -10, -4, 0, 7, 9
Number Representation
•Computer
•Finite Word Length is a major issue
•Processor Register Length: 8Bit, 16Bit, 24Bit, 32Bit, 64Bit..
•A limitation on the possible representations in a register
•Floating Point Numbers:
•IEEE754 Standard
•Pseudo Float
•Fixed Point Numbers: No Defined Standard
•Fractional Numbers: Q Format
•Integer Numbers
Few Terms
•Format- Representing numbers in binary system.
•e.g. for signed number we have…
•Float: Mantissa Bits & Exponent Bits
•Fixed: Integer Bits & Fractional Bits
•Total Bits = Mantissa / Integer + Exponent /fractional + sign bit
•Precision - Effective Word Length = N
•Accuracy – maximum error between the actual number and its
representation,
e.g. the representation is accurate to the 8th decimal place
e.g. the representation has a maximum error of 0.001%
•Resolution – smallest non-zero magnitude representable in the given
format
Dynamic Range
• While Processing/Storing variables: we need to find
•Varmax absolute maximum value a variable can have
•Varmin absolute minimum value a variable can have
•Range of a Variable: ℜ : -Varmax to +Varmax
•Resolution of a Variable: ∆ : Varres
•Resolution can be determined from…..
•Absolute Minimum value a variable can have: Varmin
•How precisely two consecutive values of a variable may differ
•what is the accuracy required (e.g. till 8th decimal place)
Dynamic Range
•The ratio of the largest and smallest possible values of the
variable quantity
•For Fixed point, resolution of the representation becomes the
smallest value of the variable.
•Bits for signed representation = B = N + 1
•Bits for unsigned representation = B = N
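$$\text{Dynamic Range (Ratio)} = \frac{Var_{max}}{\Delta} = \frac{Var_{max}}{Var_{res}}$$
$$\text{Dynamic Range (Bits)} = \log_2\left(\frac{Var_{max}}{Var_{res}}\right) = N \text{ Bits}$$
Dynamic range is most often expressed as a ratio (e.g. 5000:1), in dB (about 74 dB for that ratio), or in bits (about 12 bits).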
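The two formulas above can be sketched in C as follows. The names `dynamic_range` and `dynamic_range_of` are hypothetical, and rounding the bit count up to a whole number is an assumption of this sketch, not something the slide states.

```c
#include <assert.h>

typedef struct {
    double ratio;  /* Var_max / Var_res */
    int    bits;   /* smallest B with 2^B >= ratio */
} dynamic_range;

/* Dynamic range of a variable from its largest magnitude and its resolution. */
static dynamic_range dynamic_range_of(double var_max, double var_res)
{
    dynamic_range dr;
    double power = 1.0;

    assert(var_res > 0.0 && var_max >= var_res);
    dr.ratio = var_max / var_res;
    dr.bits = 0;
    while (power < dr.ratio) {   /* log2 rounded up, without needing libm */
        power *= 2.0;
        dr.bits++;
    }
    return dr;
}
```

With this rounding, a 4096:1 ratio needs 12 bits, while 5000:1 needs 13 because 2^12 = 4096 falls just short of 5000.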
Dynamic Range: Audio system
•Audio system specification uses unsigned representation
•The ratio of amplitude of loudest possible undistorted sine wave to rms
noise amplitude
•e.g. of microphone, loudspeaker, ADC, DAC etc..
•For display devices it is often called contrast ratio
$$\text{Dynamic Range (dB)} = 20 \log_{10}\left(\frac{Var_{max}}{\Delta}\right)$$
$$\text{Dynamic Range (dB)} = 6.02 \cdot N + 1.76$$
For fixed point conversion, the signed representation method is normally used to calculate dynamic range.
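As a small illustration of the 6.02 * N + 1.76 rule, the sketch below evaluates it in both directions. The function names are hypothetical.

```c
#include <assert.h>

/* Dynamic range in dB for N effective bits: 6.02 * N + 1.76. */
static double dynamic_range_db(int n_bits)
{
    assert(n_bits > 0);
    return 6.02 * n_bits + 1.76;
}

/* Smallest N whose 6.02 * N + 1.76 reaches target_db. */
static int bits_for_db(double target_db)
{
    int n_bits = 1;

    while (dynamic_range_db(n_bits) < target_db)
        n_bits++;
    return n_bits;
}
```

For 16 bits the rule gives about 98 dB; asking for 96 dB returns 16 bits.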
Finite Word Length
•Computers- µP, DSP, µController
•Store results in Registers having finite word length
•N-bit storage can hold only 2^N distinct values
•How to represent a variable ?
•Floating point format
•Fixed point format
•Choice is governed by…
Dynamic Range of Variable (Most Imp…)
Power consumption by specific type of ALU
Cost of implementation of ALU..
Other possibilities….
Floating Point
•The decimal point or binary point is floating
•Economical for very large/small number storage
•FPU silicon implementation is costly
•Usually represented in one of two ways:
$$r = (-1)^s \cdot 2^{(e-B)} \cdot 0.f \quad \text{or} \quad r = (-1)^s \cdot 2^{(e-B)} \cdot 1.f$$
•s: sign bit value; B: bias; e: exponent; f: mantissa
IEEE 754 single precision: s: 1 bit; B: 127; e: 8 bits; f: 23 bits
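The single precision layout can be inspected with a short sketch like the one below. It assumes that `float` is an IEEE 754 32-bit type on the target. The type `float_fields` and both functions are hypothetical helpers, and the rebuild step covers only the 1.f form for normalized numbers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t sign;      /* s: 1 bit */
    uint32_t exponent;  /* e: 8 bits, stored with bias B = 127 */
    uint32_t fraction;  /* f: 23 bits */
} float_fields;

/* Split a 32-bit float into its s, e and f fields. */
static float_fields decode_float(float x)
{
    uint32_t bits;
    float_fields out;

    memcpy(&bits, &x, sizeof bits);   /* well-defined way to read the bit pattern */
    out.sign     = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.fraction = bits & 0x7FFFFFu;
    return out;
}

/* Rebuild r = (-1)^s * 2^(e-B) * 1.f for a normalized number (0 < e < 255). */
static double rebuild_normalized(float_fields in)
{
    double r = 1.0 + (double)in.fraction / 8388608.0;  /* 1.f, 8388608 = 2^23 */
    int shift = (int)in.exponent - 127;                /* e - B */

    assert(in.exponent > 0 && in.exponent < 255);
    for (; shift > 0; shift--) r *= 2.0;
    for (; shift < 0; shift++) r *= 0.5;
    return in.sign ? -r : r;
}
```

For -6.5f this yields s = 1, e = 129 and f = 0x500000, and rebuilding those fields returns -6.5.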
Floating Point Variable
•The Range of variable is very high
•Still, there are only 2^N distinct values for N bits
•The accuracy of variable varies with values
•Spacing between numbers is not constant
•IEEE 754 supports four precisions
•Half precision, 16 Bit; Single precision, 32 Bit
•Double precision, 64 Bit; Quadruple precision, 128 Bit
Single Precision example
Floating Point Arithmetic
•Floating point arithmetic is very easy to use
•Floating point arithmetic: problems
•It may not be distributive (processor implementation)*
• a * (b + c) != a*b + a*c
•Same mathematical operations on different processors
may not produce exactly same result*
•Operations on large numbers have more error.
•Mantissa limits the resolution
•Exponent limits the largest possible number
*PS: Different floating point implementations of
one DSP algorithm may not be Bit Exact.
Number line
• Double Precision: 64 bit ; Mantissa = 52 Bits; exponent 11 Bits
Variable spacing example: 4 digit representation

Number          Next number     Difference
1.7687*10^-2    1.7688*10^-2    0.0001*10^-2 = 0.000001
1.7687*10^7     1.7688*10^7     0.0001*10^7 = 1000
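The same effect can be observed in binary floating point. The sketch below uses single precision rather than the double precision mentioned on the slide, and assumes `float` is IEEE 754 32-bit. It steps to the neighbouring bit pattern to measure the gap to the next representable value. `next_float_up` and `spacing_above` are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Next representable float above a positive, finite x. */
static float next_float_up(float x)
{
    uint32_t bits;

    assert(x > 0.0f);
    memcpy(&bits, &x, sizeof bits);
    assert(bits < 0x7F7FFFFFu);   /* stay below FLT_MAX so the step stays finite */
    bits += 1u;                   /* neighbouring bit pattern = neighbouring value */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Gap between x and the next representable float above it. */
static float spacing_above(float x)
{
    return next_float_up(x) - x;
}
```

Near 1.0 the gap is 2^-23, while near 2^24 = 16777216 it has grown to 2.0, so consecutive integers are no longer all representable.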
Number line
2^24 values/decade
This is for decimal representation
For binary representation it is 2^24 values/octave
Float to Fix: Motivation
•To achieve higher speed of operation at lower cost
•To reduce silicon area of hardware -> cost
•Power saving during code execution
•Fixed point implementations may be directly used in
many DSPs
•Floating point implementations on different processors
may not be bit-exact
•Depending on given input range
•it may be beneficial to use fixed point.
Fixed Point Numbers
•The binary point is fixed for a given format
•Integer numbers
•Fractional numbers, Q format fractional format
•Range is directly limited by the number of Bits
•The spacing between numbers is constant
•Q format: written as MQN or M.N
•K bits for Integer part
•N bits for fractional part
•M = K + Sign bit = K + 1
Fractional Format
• Signed Q format 13.3 (13Q3)
[Figure: bit layout of a 13Q3 number and of a 1Q3 number]
Fractional Formats: Range (example)
How to Convert
•A fractional number can be converted to fixed point number for a
given signed Q format MQN
•Set all the N LSB bits to 1 and M Bits to 0; call it maxN
•e.g. for 1Q31 format, max31 = 0X7FFFFFFF
•For 1Q15->0X7FFF = 32767; for 3Q13->0X1FFF= 8191
•Now multiply the given fractional number by this number

Number            0.375        -0.625        1            1.25         2.005        0.0003
1Q15 (HEX/Dec)    2FFF/12287   B001/-20479   7FFF/32767   Overflow     Overflow     0009/9
3Q13 (HEX/Dec)    0BFF/3071    EC01/-5119    1FFF/8191    27FE/10238   4026/16422   0002/2
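The conversion steps above can be sketched as follows. The function `float_to_q` is hypothetical. It follows the slide in scaling by maxN = 2^N - 1 and dropping the fractional part. Reporting overflow through a flag and clamping the result are choices of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Convert x to a signed MQN word: multiply by maxN = 2^N - 1 and drop the
 * fractional part (truncation toward zero). *overflow is set, and the result
 * clamped, when the product does not fit in the M + N bit word. */
static int32_t float_to_q(double x, int int_bits, int frac_bits, int *overflow)
{
    int total_bits = int_bits + frac_bits;
    double max_n, hi, lo, scaled;

    assert(int_bits >= 1 && frac_bits >= 1 && total_bits <= 32);
    assert(overflow != 0);
    assert(x == x);                                       /* reject NaN */

    max_n = (double)(((int64_t)1 << frac_bits) - 1);      /* e.g. 0x7FFF for N = 15 */
    hi = (double)(((int64_t)1 << (total_bits - 1)) - 1);  /* largest word */
    lo = -hi - 1.0;                                       /* smallest word */
    scaled = x * max_n;

    *overflow = (scaled >= hi + 1.0) || (scaled <= lo - 1.0);
    if (scaled >= hi + 1.0) return (int32_t)hi;
    if (scaled <= lo - 1.0) return (int32_t)lo;
    return (int32_t)scaled;   /* the cast truncates toward zero */
}
```

For example, `float_to_q(0.375, 1, 15, &ovf)` returns 12287 (0x2FFF) and `float_to_q(2.005, 3, 13, &ovf)` returns 16422 (0x4026), while 1.25 in 1Q15 sets the overflow flag.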
Setting Q Format
•The variable to be converted is explored for
•Range of the variable value
•Absolute Maximum / Minimum values of variable
•Resolution of variable: smallest non zero number
•Accuracy: Maximum error for a representation
•Accuracy required highly depends on the arithmetic involved.
•It also depends on the reference for final comparison
•Calculate accuracy from tolerable %error in representation.
Exploration Process
•If the variable is a part of code
•Run the code for various test cases
•Find the previously described values (range, resolution)
•Range & Accuracy can be also found from arithmetic involved
•Find the Resolution ∆
•If the variable's values come from a read-only (RO) table, then
•these values can be found directly
Setting Q Format: MQN
•Now calculate the Dynamic range required
•Determine the K Integer bits from MaxInt of the Range
•DR bits = B = K + N
•Total Bits = B + 1(S) => 1(S) + K + N; => M + N
•S is for Sign Bit
$$\text{Dynamic Range (Ratio)} = \frac{Var_{max}}{Var_{res}}$$
$$\text{Dynamic Range (Bits)} = \log_2\left(\frac{Var_{max}}{Var_{res}}\right) = B \text{ bits}$$
$$MaxInt = \text{ceil}(Var_{max})$$
$$K = \text{ceil}(\log_2(MaxInt))$$
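The procedure above can be sketched as one small routine. The names `q_format`, `ceil_log2` and `choose_q_format` are hypothetical. Rounding B up to a whole number of bits is an assumption of this sketch, since the slide formula does not state the rounding, and the sketch does not guard against a resolution so coarse that N would come out negative.

```c
#include <assert.h>

typedef struct {
    int k;  /* integer bits */
    int n;  /* fractional bits */
    int m;  /* K + 1 sign bit */
    int b;  /* dynamic range bits */
} q_format;

/* Smallest e with 2^e >= v, i.e. ceil(log2(v)) for v >= 1, without libm. */
static int ceil_log2(double v)
{
    int e = 0;
    double power = 1.0;

    while (power < v) {
        power *= 2.0;
        e++;
    }
    return e;
}

/* Derive MQN from the largest magnitude and the required resolution. */
static q_format choose_q_format(double var_max, double var_res)
{
    q_format f;
    long max_int;

    assert(var_res > 0.0 && var_max >= var_res && var_max < 1.0e9);
    max_int = (long)var_max;               /* truncates ... */
    if ((double)max_int < var_max)
        max_int++;                         /* ... so step up: MaxInt = ceil(Var_max) */

    f.k = ceil_log2((double)max_int);      /* K = ceil(log2(MaxInt)) */
    f.b = ceil_log2(var_max / var_res);    /* B = log2(Var_max / Var_res), rounded up */
    f.n = f.b - f.k;                       /* B = K + N */
    f.m = f.k + 1;                         /* M = K + sign bit */
    return f;
}
```

A range of ±4.4628159401 with a resolution of 0.09154931386 gives K = 3, B = 6, N = 3 and M = 4, i.e. 4Q3.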
RO Table Example
const float Table[5] = {
    5.0979557966e-001f,
    6.0134489352e-001f,
    8.9997625993e-001f,
    2.5629159405e+000f,
    -4.4628159401e+000f
};
•Range: ±4.4628159401e+000 (signed representation)
•Resolution: 5.0979557966e-001 = Variable minimum value
•Resolution = Table[1]-Table[0] = 0.09154931386
•MaxInt = 5
We will set the Q format for this RO table and compare two choices of resolution.
Example cont…
•DR = (4.4628159401) / 0.09154931386 = 48.74
•DR Bits B = 6 ; K = 3 Bits (as MaxInt = 5)
•M = 4 Bits, N = 3 Bits;
•Total Bits = 6(DR) + 1(S) => 1(S) + 3(K) + 3(N)
•The conversion format MQN is 4Q3
•M = 4 Bits; N = 3 Bits; max3 = 7;
•5.0979557966e-001 * max3 = 3.57, truncated to 3;
•3 / max3 is equivalent to 0.42857142857…
•Representation Error = 15.93% (a very large error)
This error is not acceptable.
Example cont…
•The best would be to correctly represent the last digit of
•0.50979557966 (accurate to the 11th decimal place),
•i.e. Resolution = 0.00000000006
•DR = (4.4628159401) / 0.00000000006 = 74380265668
•DR Bits = 36 bits
•Total Bits = 36(DR) + 1(S) => 1(S) + 3(K) + 33(N)
•max33 = 0x1FFFFFFFF = 8589934591
• 5.0979557966e-001 * max33 = 4379110684 ;
•4379110684 /max33 = 0.509795579652976….
•Representation Error = 1.37e-9%
•Q Format is set as 4Q33
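The representation errors quoted for the two formats can be reproduced with a sketch like the one below. It assumes the same scheme as the example: scale by maxN = 2^N - 1, truncate, divide back, and compare with the original value. `representation_error_percent` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Relative error, in percent, of storing x with frac_bits fractional bits. */
static double representation_error_percent(double x, int frac_bits)
{
    double max_n, restored, err;
    int64_t q;

    assert(frac_bits > 0 && frac_bits <= 48);
    assert(x != 0.0 && x > -1024.0 && x < 1024.0);   /* keeps the cast in range */

    max_n = (double)(((int64_t)1 << frac_bits) - 1); /* maxN */
    q = (int64_t)(x * max_n);                        /* truncated fixed point word */
    restored = (double)q / max_n;                    /* value the word stands for */
    err = (x - restored) / x * 100.0;
    return err < 0.0 ? -err : err;
}
```

For 0.50979557966 this gives roughly 15.93% with 3 fractional bits and roughly 1.4e-9% with 33 fractional bits.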
Final Table in 4Q33 Format
• PolyPhase4ptDCT4toDCT3[]={
•4379110684, Error = 1.37e-9%
•5165513301, Error = 1.87e-8%
•7730737206, Error = 3.25e-9%
•-38335297017, Error = 1.53e-9% }
•This is a very good representation,
•but many things still matter, e.g.
•Precision of the arithmetic involved
•Bits available for representation, e.g. 32 bits
The error of the 37 bit representation
is very small, and this precision may not be
needed for our application.
We may try 32 bits and check
whether it suits the application.
Shifting Binary Point
•Right shift of a fractional number in MQN format
•Binary point shifts to the right
•MQN >> n → (M + n)Q(N - n)
•Left shift of a fractional number in MQN format
•Binary point shifts to the left
•MQN << n → (M - n)Q(N + n)
•For example: 2Q30 << 1 is 1Q31
Binary point shifting only indicates the change in the
maximum/minimum value representable with the resulting format
Exercise
•If array num[] contains 2Q13 format data
for (i = 0; i < 200; i++) {
    sum = sum + num[i];
}
•What is the output format of sum?
Fixed Point Arithmetic
Overflow Vs Carry Flag
•Carry Flag is used for unsigned numbers
•It is set when…
•the addition/subtraction of two unsigned operands doesn't fit into
the given bits
•Overflow flag is for signed arithmetic
•It is set when…
•the add/sub/mult of two numbers does not give the expected
sign bit in the result
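To make the distinction concrete, the sketch below models a 16-bit adder in portable C and derives both flags from the same bit pattern. The type `add_flags` and the function `add16` are hypothetical. Real processors provide these flags in a status register.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint16_t result;    /* low 16 bits of the sum */
    bool     carry;     /* unsigned result did not fit in 16 bits */
    bool     overflow;  /* signed result has an unexpected sign bit */
} add_flags;

/* Add two 16-bit patterns and report both flags. */
static add_flags add16(uint16_t a, uint16_t b)
{
    add_flags out;
    uint32_t wide = (uint32_t)a + (uint32_t)b;

    out.result = (uint16_t)wide;
    out.carry = (wide >> 16) != 0;
    /* Signed overflow: operands share a sign and the result's sign differs. */
    out.overflow = ((~(a ^ b) & (a ^ out.result)) & 0x8000u) != 0;
    return out;
}
```

0x7FFF + 0x0001 sets only the overflow flag (32767 + 1 as signed), whereas 0xFFFF + 0x0001 sets only the carry flag (65535 + 1 as unsigned, but -1 + 1 = 0 as signed).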
Fractional Addition
•Two fixed point numbers a & b can be added directly if
• Binary point at same place for both numbers, “a & b”
•Resulting format of fractional addition will have
• Binary point at same place for both numbers, input & output
Fractional Addition
•Result format of output is same as that of input
•i.e. Binary point at same place for both numbers, input & output
•There is a possibility of growth of Integer Part
•Example accumulation of an array with data in MQN format
for (i = 0; i < L; i++)
    sum = sum + ar[i];
•Sum will have an output format of (M+m)QN, where
$$m = \text{ceil}(\log_2(L))$$
Headroom/Guard Bits
•Overflow during an addition/subtraction can be handled
with headroom
•Example addition:
for (i = 0; i < 256; i++)
    sum = sum + ar[i];
The sum can overflow by up to 8 bits
•Variable sum should have 8 bits of headroom in the MSBs to
accommodate the overflow
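A minimal sketch of accumulation with headroom, assuming 16-bit input data and a 32-bit accumulator. `guard_bits` and `accumulate_q15` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* m = ceil(log2(count)): guard bits needed to add up `count` values. */
static int guard_bits(size_t count)
{
    int m = 0;

    assert(count >= 1 && count <= 65536);
    while (((size_t)1 << m) < count)
        m++;
    return m;
}

/* Sum 16-bit samples into a 32-bit accumulator; the binary point stays put
 * and the integer part grows by guard_bits(len) bits. */
static int32_t accumulate_q15(const int16_t *ar, size_t len)
{
    int32_t sum = 0;
    size_t i;

    assert(guard_bits(len) <= 16);   /* 16 data bits + m guard bits fit in 32 */
    for (i = 0; i < len; i++)
        sum += ar[i];
    return sum;
}
```

Adding 256 values needs 8 guard bits, so 16-bit inputs need at least a 24-bit accumulator. The 32-bit accumulator here leaves room for up to 65536 inputs.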
2’s Complement Multiplication
Fractional Multiplication
•The product of two N bit numbers is 2N bits
•Example product of MQN and KQL formats
•Output format is: (M+K)Q(N+L) format
Fractional Multiplication
[Figure: multiplication of 1Q3 numbers giving 2Q6 numbers]
Fractional Multiplication Cont..
Normalizing Fractions
•A product of two two's complement numbers has two sign
bits,
•so the result can be left shifted by 1 bit.
•If one of the formats is 1QN, i.e. N+1 bits, then
•the result of the left shift will have the integer format of the other operand
•If the other operand is 5Q11 then the result of the mult is 6Q(11+N)
•After a 1 bit left shift, the result is 5Q(11+N+1),
•i.e. 5Q11 + (N+1) LSBs
•To make the most of the available bits,
•it is good to left shift the product
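The normalization step can be sketched for the common 1Q15 by 1Q15 case. The sketch assumes the usual convention that a 1Q15 word w stands for w / 2^15. The function names are hypothetical, and saturating the single unrepresentable product is a choice of this sketch.

```c
#include <stdint.h>

/* 1Q15 * 1Q15 -> 2Q30: the full 32-bit product carries two sign bits. */
static int32_t mul_q15_full(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}

/* 1Q15 * 1Q15 -> 1Q15: drop the redundant sign bit, keep the top 16 bits. */
static int16_t mul_q15(int16_t a, int16_t b)
{
    int32_t product = mul_q15_full(a, b);   /* 2Q30 */
    int32_t q31;

    if (product == 0x40000000)   /* only -1.0 * -1.0: +1.0 is not representable */
        return INT16_MAX;        /* saturate */

    q31 = product * 2;           /* one-bit left shift: 2Q30 -> 1Q31 */
    /* Keep the 16 MSBs: 1Q31 -> 1Q15 by truncation. This assumes that >> on a
     * negative int is an arithmetic shift, as on mainstream compilers. */
    return (int16_t)(q31 >> 16);
}
```

With these definitions 0.5 * 0.5 (16384 * 16384) yields 8192, i.e. 0.25 in 1Q15.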
Rounding
•Why rounding?
•To replace a numerical value with an approximately equal but shorter, i.e. lower precision, representation
•e.g. replacing 45.6782 with 45.68
•Rounding is a necessary evil (remember quantization)
•Many times it is unavoidable
•It introduces round-off errors
•Required in float to fixed conversion,
•in floating & fixed point arithmetic, and
•for function approximation on fixed point processors
Rounding Methods
•When rounding a number x to an integer q
•Round to nearest: q is the integer closest to x
•Round towards zero (truncate): q is the integer part of x
•Round down (floor): q is the largest integer that does not exceed x
•Round up (ceiling): q is the smallest integer that is not less than x
•Tie break during rounding
•Needed when the fractional part to be rounded lies exactly on the half boundary
•e.g. rounding 23.5 requires tie breaking
Tie Breaking
•Round half up: i.e. q = floor(x + 0.5) (x positive or negative)
•It is asymmetric rounding, i.e. biased (with round to nearest)
•Round half away from zero:
•q = trunc(x + 0.5) for positive x; q = trunc(x – 0.5) for negative x
•This method is free of bias for data distributed symmetrically about zero
•Round half to even
•On a tie, q is the even integer nearest to x
•This method is less biased still
•+33.5 → +34 ; +32.5 → +32 ; -32.5 → -32 ; -33.5 → -34
•Known as: unbiased rounding, convergent rounding,
statistician's rounding, Dutch rounding, Gaussian rounding,
or bankers' rounding
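The three tie-breaking rules can be compared on fixed point data with a sketch like the following. Here x is a word with n fractional bits that is rounded to an integer. All function names are hypothetical, and the arithmetic is done in 64 bits to avoid intermediate overflow.

```c
#include <assert.h>
#include <stdint.h>

/* floor(x / 2^n) without relying on >> of negative values. */
static int64_t floor_div_pow2(int64_t x, int n)
{
    int64_t d = (int64_t)1 << n;
    int64_t q = x / d;           /* C division truncates toward zero */

    if (x % d < 0)
        q--;                     /* step down for negative non-multiples */
    return q;
}

/* Round half up: q = floor(x + 0.5). */
static int32_t round_half_up(int32_t x, int n)
{
    assert(n >= 1 && n <= 30);
    return (int32_t)floor_div_pow2((int64_t)x + ((int64_t)1 << (n - 1)), n);
}

/* Round half away from zero: round the magnitude half up, restore the sign. */
static int32_t round_half_away(int32_t x, int n)
{
    int64_t mag = x < 0 ? -(int64_t)x : (int64_t)x;
    int64_t q;

    assert(n >= 1 && n <= 30);
    q = floor_div_pow2(mag + ((int64_t)1 << (n - 1)), n);
    return (int32_t)(x < 0 ? -q : q);
}

/* Round half to even: ties go to the even neighbour. */
static int32_t round_half_even(int32_t x, int n)
{
    int64_t half, q, r;

    assert(n >= 1 && n <= 30);
    half = (int64_t)1 << (n - 1);
    q = floor_div_pow2(x, n);
    r = x - q * ((int64_t)1 << n);   /* remainder, 0 <= r < 2^n */
    if (r > half || (r == half && q % 2 != 0))
        q++;
    return (int32_t)q;
}
```

With one fractional bit, -32.5 is the word -65: half up gives -32, half away from zero gives -33, and half to even gives -32.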
Dithering
•Stochastic rounding:
•On a tie, choose q randomly between x+0.5 and x-0.5
•It is also bias free because of the random component
•Dithering: when to use this method
•when the signal being rounded / quantized is slowly varying
•On such signals, every fixed rounding method produces monotonous round-off errors,
•and the error pattern depends on the particular
rounding method being used
•All of them introduce a non-linear response in the system,
•e.g. harmonics in the filter response because of the rounding method
Block Floating Point Processing
•When Dynamic Range of a signal is very high
•When the input is fluctuating between very high and very
small values.
•When very small values are as important as larger
values…
•Instructions like CLZ & NORM are helpful.
A block of data values shares a common
radix point for processing
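A minimal sketch of the idea for a block of 16-bit samples: find how far the largest magnitude can be shifted left, apply that shift to the whole block, and keep the shift as the shared block exponent. Both function names are hypothetical. The counting loop is a portable stand-in for a CLZ or NORM instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Left shifts the whole block can take before its largest magnitude
 * would leave the 16-bit range. */
static int block_headroom(const int16_t *x, size_t len)
{
    int32_t max_mag = 0;
    int shift = 0;
    size_t i;

    assert(x != NULL || len == 0);
    for (i = 0; i < len; i++) {
        int32_t mag = x[i] < 0 ? -(int32_t)x[i] : (int32_t)x[i];
        if (mag > max_mag)
            max_mag = mag;
    }
    if (max_mag == 0)
        return 0;                          /* all-zero block: nothing to gain */
    while (max_mag * 2 <= INT16_MAX) {     /* one more shift still fits */
        max_mag *= 2;
        shift++;
    }
    return shift;
}

/* Scale every sample by 2^shift in place and return the shared block exponent. */
static int block_normalize(int16_t *x, size_t len)
{
    int shift = block_headroom(x, len);
    size_t i;

    for (i = 0; i < len; i++)
        x[i] = (int16_t)(x[i] * (1 << shift));
    return shift;
}
```

For the block {100, -200, 50} the shared exponent is 7 and the normalized samples are {12800, -25600, 6400}. A caller would undo the scaling by shifting results right by the same exponent.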
Exercise1
•Assuming you are working on a 32 bit DSP…
•Convert following table into fixed point….
const float gain_table[8] = {
    8.2387952833e-002f,
    9.2387952833e-001f,
    7.0710676573e-001f,
    3.8268340208e-001f,
    1.9615705587e+000f,
    1.6629392064e+000f,
    1.1111404206e+000f,
    3.9018056901e+001f
};
Exercise2
•Assume that inp[ ] array contains 16 Bit PCM data
•Find the o/p format of out[ ] ?
•Convert following loop into fixed point….
for (i = 0; i < 8; i++) {
    out[i] = inp[i] * gain_table[i];
}
Exercise3
•Now given below is the DCTIII code, with gain applied at end…
•Convert this into a fixed point code…
Exercise4
•The following IIR filter is used for filtering 16 bit PCM input
•Convert it into fixed point and ensure that the filter
response is linear….
[Figure: block diagram of the filter, with z^-1 delays feeding taps b0, b1, b2 from x[n], x[n-1], x[n-2] and taps a1, a2 from y[n-1], y[n-2], summed into y[n]]
y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
Exercise4: C code
Exercise4: C code Filter Loop
References
•Randy Yates, "Fixed Point Arithmetic: an Introduction"
•ARM Application Note 33, "Fixed point arithmetic on the ARM", document number ARM DAI 0033A
•Wiki
THANK YOU

More Related Content

What's hot (20)

Unipolar Encoding Techniques: NRZ & RZ
Unipolar Encoding Techniques: NRZ & RZUnipolar Encoding Techniques: NRZ & RZ
Unipolar Encoding Techniques: NRZ & RZ
 
verilog code
verilog codeverilog code
verilog code
 
Speech coding techniques
Speech coding techniquesSpeech coding techniques
Speech coding techniques
 
data representation
 data representation data representation
data representation
 
Verilog coding of mux 8 x1
Verilog coding of mux  8 x1Verilog coding of mux  8 x1
Verilog coding of mux 8 x1
 
Hamming codes
Hamming codesHamming codes
Hamming codes
 
Sequential multiplication
Sequential multiplicationSequential multiplication
Sequential multiplication
 
Pulse code modulation and Demodulation
Pulse code modulation and DemodulationPulse code modulation and Demodulation
Pulse code modulation and Demodulation
 
Subband Coding
Subband CodingSubband Coding
Subband Coding
 
Array multiplier
Array multiplierArray multiplier
Array multiplier
 
Hamming code system
Hamming code systemHamming code system
Hamming code system
 
Channel coding
Channel coding  Channel coding
Channel coding
 
Encoding Techniques
Encoding TechniquesEncoding Techniques
Encoding Techniques
 
Pn sequence
Pn sequencePn sequence
Pn sequence
 
Lecture2
Lecture2Lecture2
Lecture2
 
Verilog HDL - 3
Verilog HDL - 3Verilog HDL - 3
Verilog HDL - 3
 
Ripple Carry Adder
Ripple Carry AdderRipple Carry Adder
Ripple Carry Adder
 
Lecture 05 pic io port programming
Lecture 05 pic io port programmingLecture 05 pic io port programming
Lecture 05 pic io port programming
 
floating point multiplier
floating point multiplierfloating point multiplier
floating point multiplier
 
Convolution Codes
Convolution CodesConvolution Codes
Convolution Codes
 

Similar to Fixed Point Conversion

Implementation of character translation integer and floating point values
Implementation of character translation integer and floating point valuesImplementation of character translation integer and floating point values
Implementation of character translation integer and floating point valuesغزالة
 
Bitwise Operations(1).pdf
Bitwise Operations(1).pdfBitwise Operations(1).pdf
Bitwise Operations(1).pdfDalvinCalvin
 
Only floating point lecture 7 (1)
Only floating point lecture 7 (1)Only floating point lecture 7 (1)
Only floating point lecture 7 (1)talhashahid40
 
Modern block cipher
Modern block cipherModern block cipher
Modern block cipherUdit Mishra
 
Finite word length effects
Finite word length effectsFinite word length effects
Finite word length effectsPeriyanayagiS
 
error_detection_correction.pptx
error_detection_correction.pptxerror_detection_correction.pptx
error_detection_correction.pptxssuser50f4fd1
 
project ppt on anti counterfeiting technique for credit card transaction system
project ppt on anti counterfeiting technique for credit card transaction systemproject ppt on anti counterfeiting technique for credit card transaction system
project ppt on anti counterfeiting technique for credit card transaction systemRekha dudiya
 
6_2018_11_23!09_24_56_PM (1).pptx
6_2018_11_23!09_24_56_PM (1).pptx6_2018_11_23!09_24_56_PM (1).pptx
6_2018_11_23!09_24_56_PM (1).pptxHebaEng
 
Digital Electronics – Unit I.pdf
Digital Electronics – Unit I.pdfDigital Electronics – Unit I.pdf
Digital Electronics – Unit I.pdfKannan Kanagaraj
 
Digital Communication: Channel Coding
Digital Communication: Channel CodingDigital Communication: Channel Coding
Digital Communication: Channel CodingDr. Sanjay M. Gulhane
 
Lecture6 Chapter4- Design Magnitude Comparator Circuit, Introduction to Decod...
Lecture6 Chapter4- Design Magnitude Comparator Circuit, Introduction to Decod...Lecture6 Chapter4- Design Magnitude Comparator Circuit, Introduction to Decod...
Lecture6 Chapter4- Design Magnitude Comparator Circuit, Introduction to Decod...UmerKhan147799
 
Error Detection N Correction
Error Detection N CorrectionError Detection N Correction
Error Detection N CorrectionAnkan Adhikari
 
Lecture-2(2): Number System & Conversion
Lecture-2(2): Number System & ConversionLecture-2(2): Number System & Conversion
Lecture-2(2): Number System & ConversionMubashir Ali
 

Similar to Fixed Point Conversion (20)

Implementation of character translation integer and floating point values
Implementation of character translation integer and floating point valuesImplementation of character translation integer and floating point values
Implementation of character translation integer and floating point values
 
Bitwise Operations(1).pdf
Bitwise Operations(1).pdfBitwise Operations(1).pdf
Bitwise Operations(1).pdf
 
Only floating point lecture 7 (1)
Only floating point lecture 7 (1)Only floating point lecture 7 (1)
Only floating point lecture 7 (1)
 
Digital Logic
Digital LogicDigital Logic
Digital Logic
 
Modern block cipher
Modern block cipherModern block cipher
Modern block cipher
 
Finite word length effects
Finite word length effectsFinite word length effects
Finite word length effects
 
Digital Electronics Notes.pdf
Digital Electronics Notes.pdfDigital Electronics Notes.pdf
Digital Electronics Notes.pdf
 
error_detection_correction.pptx
error_detection_correction.pptxerror_detection_correction.pptx
error_detection_correction.pptx
 
project ppt on anti counterfeiting technique for credit card transaction system
project ppt on anti counterfeiting technique for credit card transaction systemproject ppt on anti counterfeiting technique for credit card transaction system
project ppt on anti counterfeiting technique for credit card transaction system
 
Implementation
ImplementationImplementation
Implementation
 
6_2018_11_23!09_24_56_PM (1).pptx
6_2018_11_23!09_24_56_PM (1).pptx6_2018_11_23!09_24_56_PM (1).pptx
6_2018_11_23!09_24_56_PM (1).pptx
 
Digital Electronics – Unit I.pdf
Digital Electronics – Unit I.pdfDigital Electronics – Unit I.pdf
Digital Electronics – Unit I.pdf
 
Unit-1.pptx
Unit-1.pptxUnit-1.pptx
Unit-1.pptx
 
Digital Communication: Channel Coding
Digital Communication: Channel CodingDigital Communication: Channel Coding
Digital Communication: Channel Coding
 
Lecture6 Chapter4- Design Magnitude Comparator Circuit, Introduction to Decod...
Lecture6 Chapter4- Design Magnitude Comparator Circuit, Introduction to Decod...Lecture6 Chapter4- Design Magnitude Comparator Circuit, Introduction to Decod...
Lecture6 Chapter4- Design Magnitude Comparator Circuit, Introduction to Decod...
 
Error Detection N Correction
Error Detection N CorrectionError Detection N Correction
Error Detection N Correction
 
DLD_Lecture_notes2.ppt
DLD_Lecture_notes2.pptDLD_Lecture_notes2.ppt
DLD_Lecture_notes2.ppt
 
Lecture-2(2): Number System & Conversion
Lecture-2(2): Number System & ConversionLecture-2(2): Number System & Conversion
Lecture-2(2): Number System & Conversion
 
DESIGN OF COMBINATIONAL LOGIC
DESIGN OF COMBINATIONAL LOGICDESIGN OF COMBINATIONAL LOGIC
DESIGN OF COMBINATIONAL LOGIC
 
Source coding
Source codingSource coding
Source coding
 

Recently uploaded

Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 

Recently uploaded (20)

Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames

Fixed Point Conversion

  • 1. Floating Point to Fixed Point Conversion Audio DSP Ksharing: Rajesh Sharma www.rajeshsharma.co.in sharma.rajesh@gmail.com
  • 2. Numbers (7/17/2010, www.rajeshsharma.co.in)
      • Real numbers
          • Rational, irrational, positive, negative, zero…
          • An infinite decimal representation such as 2.487154145934798874502350989093…
          • Any point on an infinitely long number line
      • Integer numbers
          • Whole numbers, no fractional part
          • Discrete points on the number line, e.g. -10, -4, 0, 7, 9
  • 3. Number Representation
      • Computer
          • Finite word length is a major issue
          • Processor register length: 8 bit, 16 bit, 24 bit, 32 bit, 64 bit…
          • A limitation on the possible representations in a register
      • Floating point numbers
          • IEEE 754 standard
          • Pseudo float
      • Fixed point numbers: no defined standard
          • Fractional numbers: Q format
          • Integer numbers
  • 4. Few Terms
      • Format: how a number is represented in the binary system, e.g. for a signed number we have…
          • Float: mantissa bits & exponent bits
          • Fixed: integer bits & fractional bits
          • Total bits = (mantissa or integer bits) + (exponent or fractional bits) + sign bit
      • Precision: effective word length = N
      • Accuracy: maximum error between the actual number and its representation
          • e.g. the representation is accurate to the 8th decimal place
          • e.g. the representation has a maximum error of 0.001%
      • Resolution: smallest non-zero magnitude representable with the given format
  • 5. Dynamic Range
      • While processing/storing variables we need to find
          • Varmax: absolute maximum value a variable can have
          • Varmin: absolute minimum value a variable can have
      • Range of a variable: ℜ = -Varmax to +Varmax
      • Resolution of a variable: ∆ = Varres
      • Resolution can be determined from…
          • The absolute minimum value a variable can have: Varmin
          • How precisely two consecutive values of a variable may differ
          • The accuracy required (e.g. to the 8th decimal place)
  • 6. Dynamic Range
      • The ratio of the largest to the smallest possible value of the variable quantity
      • For fixed point, the resolution of the representation becomes the smallest value of the variable
      • Bits for signed representation: B = N + 1
      • Bits for unsigned representation: B = N
      • Dynamic Range (Ratio) $= \frac{Var_{max}}{\Delta} = \frac{Var_{max}}{Var_{res}}$
      • Dynamic Range (Bits) $= \log_2\left(\frac{Var_{max}}{Var_{res}}\right) = N$ Bits
      • Dynamic range is most often expressed as a ratio (e.g. 5000:1), in dB (e.g. 75 dB), or in bits (e.g. 12 bits)
  • 7. Dynamic Range: Audio system
      • Audio system specifications use the unsigned representation
      • The ratio of the amplitude of the loudest possible undistorted sine wave to the rms noise amplitude
      • e.g. of a microphone, loudspeaker, ADC, DAC etc.
      • For display devices it is often called contrast ratio
      • Dynamic Range (dB) $= 20 \log_{10}\left(\frac{Var_{max}}{\Delta}\right)$
      • Dynamic Range (dB) $= 6.02 \times N + 1.76$
      • For fixed point conversion, the signed representation method is normally used to calculate dynamic range
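The 6.02 and 1.76 constants on this slide can be recovered with the usual textbook argument. The sketch below assumes an N-bit quantizer with step ∆, a full-scale sine wave as the signal, and a rounding error spread uniformly over one step; these assumptions are not stated in the deck.

```latex
\begin{align*}
A_{\mathrm{rms}} &= \frac{2^{N-1}\,\Delta}{\sqrt{2}}
  && \text{(full-scale sine, peak } 2^{N-1}\Delta\text{)} \\
e_{\mathrm{rms}} &= \frac{\Delta}{\sqrt{12}}
  && \text{(error uniform over } \pm\Delta/2\text{)} \\
\frac{A_{\mathrm{rms}}}{e_{\mathrm{rms}}} &= 2^{N-1}\sqrt{6} = 2^{N}\sqrt{\tfrac{3}{2}} \\
20\log_{10}\!\left(2^{N}\sqrt{\tfrac{3}{2}}\right)
  &= 20\,N\log_{10} 2 + 10\log_{10}\tfrac{3}{2}
  \approx 6.02\,N + 1.76\ \mathrm{dB}
\end{align*}
```

For N = 16 this gives roughly 98 dB.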
  • 8. Finite Word Length
      • Computers: µP, DSP, µController
          • Store results in registers having finite word length
          • N-bit storage can have only 2^N possibilities
      • How to represent a variable?
          • Floating point format
          • Fixed point format
      • Choice is governed by…
          • Dynamic range of the variable (most important)
          • Power consumption of the specific type of ALU
          • Cost of implementation of the ALU
          • Other possibilities…
  • 9. Floating Point
      • The decimal point or binary point is floating
      • Economical for storing very large/small numbers
      • FPU silicon implementation is costly
      • Usually represented as (two ways of finding the value)
          • $r = (-1)^{s} \, 2^{(e-B)} \times 0.f$ or $r = (-1)^{s} \, 2^{(e-B)} \times 1.f$
          • s: sign bit value; B: bias; e: exponent; f: mantissa
      • IEEE 754 single precision: s: 1 bit; B = 127; e: 8 bits; f: 23 bits
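The s, e and f fields of a single precision number can be inspected directly. The following is a minimal sketch that assumes `float` is a 32 bit IEEE 754 value on the target; the names `float_fields` and `split_float` are made up for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Fields of an IEEE 754 single precision number. */
typedef struct {
    uint32_t sign;      /* s: 1 bit */
    uint32_t exponent;  /* e: 8 bits, stored with bias B = 127 */
    uint32_t fraction;  /* f: 23 bits */
} float_fields;

/* Split a float into its sign, biased exponent and fraction fields.
   Assumes float is 32 bits wide and follows IEEE 754. */
float_fields split_float(float x)
{
    uint32_t bits;
    float_fields out;

    memcpy(&bits, &x, sizeof bits);   /* well-defined way to read the bit pattern */
    out.sign     = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.fraction = bits & 0x7FFFFFu;
    return out;
}
```

For example, -0.625 is -1.25 × 2^-1, so the stored exponent is 127 - 1 = 126 and the fraction holds the 0.25 that follows the implicit leading 1.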
  • 10. Floating Point Variable
      • The range of the variable is very high
      • Still, there are only 2^N distinct values for N bits
      • The accuracy of the variable varies with its value
      • Spacing between numbers is not constant
      • IEEE 754 supports four precisions
          • Half precision, 16 bit; single precision, 32 bit
          • Double precision, 64 bit; quadruple precision, 128 bit
      • (Figure: single precision example)
  • 11. Floating Point Arithmetic
      • Floating point arithmetic is very easy to use
      • Floating point arithmetic: problems
          • It may not be distributive (processor implementation)*
              • a * (b + c) != a*b + a*c
          • The same mathematical operations on different processors may not produce exactly the same result*
          • Operations on large numbers have more error
          • The mantissa limits the resolution
          • The exponent limits the largest possible number
      • *PS: Different floating point implementations of one DSP algorithm may not be bit exact.
  • 12. Number line
      • Double precision: 64 bit; mantissa = 52 bits; exponent = 11 bits
      • Variable spacing example: 4 digit representation
          • Number 1.7687×10^-2, next number 1.7688×10^-2, difference 0.0001×10^-2 = 0.000001
          • Number 1.7687×10^7, next number 1.7688×10^7, difference 0.0001×10^7 = 1000
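The same widening of the gap between neighbouring values can be observed in binary. This sketch uses single precision rather than the double precision named on the slide, assumes a 32 bit IEEE 754 `float`, and only handles positive finite inputs below the largest float; `spacing_above` is a hypothetical helper name.

```c
#include <stdint.h>
#include <string.h>

/* Gap between a positive finite float and the next representable float above it.
   Assumes float is a 32 bit IEEE 754 value and x is below the largest finite float. */
float spacing_above(float x)
{
    uint32_t bits;
    float next;

    memcpy(&bits, &x, sizeof bits);
    bits += 1u;                        /* next bit pattern is the next larger value */
    memcpy(&next, &bits, sizeof next);
    return next - x;
}
```

Near 1.0 the gap is 2^-23, while near 16777216 (2^24) it has grown to 2, so the absolute spacing scales with the magnitude of the number.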
  • 13. Number line
      • $2^{24}$ values/decade: this is for the decimal representation
      • For the binary representation it is $2^{24}$ values/octave
  • 14. Float to Fix: Motivation
      • To achieve higher speed of operation at lower cost
      • To reduce the silicon area of hardware, and hence cost
      • Power saving during code execution
      • Fixed point implementations may be used directly on many DSPs
      • Floating point implementations on different processors may not be bit-exact
      • Depending on the given input range, it may be beneficial to use fixed point
  • 15. Fixed Point Numbers
      • The binary point is fixed for a given format
          • Integer numbers
          • Fractional numbers: Q format (fractional format)
      • Range is directly limited by the number of bits
      • The spacing between numbers is constant
      • Q format: written as MQN format or M.N
          • K bits for the integer part
          • N bits for the fractional part
          • M = K + sign bit = K + 1
  • 16. Fractional Format
      • Signed Q format 13.3 (13Q3)
      • (Figure labels: 13Q3; 1Q3 number)
  • 18. How to Convert
      • A fractional number can be converted to a fixed point number for a given signed Q format MQN
      • Set all N LSBs to 1 and the M bits to 0; call the result maxN
          • e.g. for 1Q31 format, max31 = 0x7FFFFFFF
          • For 1Q15: 0x7FFF = 32767; for 3Q13: 0x1FFF = 8191
      • Now multiply the given fractional number by maxN
      • Conversion table (HEX / Dec)
          • 0.375: 1Q15 = 2FFF / 12287; 3Q13 = 0BFF / 3071
          • -0.625: 1Q15 = B001 / -20479; 3Q13 = EC01 / -5119
          • 1: 1Q15 = 7FFF / 32767; 3Q13 = 1FFF / 8191
          • 1.25: 1Q15 = overflow; 3Q13 = 27FE / 10238
          • 2.005: 1Q15 = overflow; 3Q13 = 4026 / 16422
          • 0.0003: 1Q15 = 0009 / 9; 3Q13 = 0002 / 2
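The conversion recipe on this slide can be sketched in a few lines. The code follows the slide's scale of maxN = 2^N - 1 (many references scale by 2^N instead) and truncates toward zero, which reproduces the table values. The function names are hypothetical, and the sketch assumes a word length of at most 32 bits.

```c
#include <stdint.h>

/* Scale maxN = 2^N - 1: the N fractional LSBs set to 1. */
double max_n(int frac_bits)
{
    return (double)((INT64_C(1) << frac_bits) - 1);
}

/* Nonzero when x, scaled by maxN, fits a signed word of total_bits = M + N bits. */
int fits_q(double x, int frac_bits, int total_bits)
{
    double limit  = (double)((INT64_C(1) << (total_bits - 1)) - 1);
    double scaled = x * max_n(frac_bits);

    return scaled <= limit && scaled >= -limit - 1.0;
}

/* Convert x to its integer code; call fits_q first, since an out-of-range
   value cannot be represented and the cast would be undefined. */
int32_t float_to_q(double x, int frac_bits)
{
    return (int32_t)(x * max_n(frac_bits));   /* the cast truncates toward zero */
}
```

With `frac_bits = 15` and `total_bits = 16`, `fits_q(1.25, 15, 16)` reports the overflow shown in the 1Q15 row, while `float_to_q(0.375, 15)` yields 12287.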
  • 19. Setting Q Format
      • The variable to be converted is explored for
          • Range of the variable value
              • Absolute maximum / minimum values of the variable
          • Resolution of the variable: smallest non-zero number
          • Accuracy: maximum error for a representation
      • The accuracy required depends highly on the arithmetic involved
      • It also depends on the reference used for the final comparison
      • Calculate accuracy from the tolerable % error in representation
  • 20. Exploration Process
      • If the variable is part of the code
          • Run the code for various test cases
          • Find the previously described values (range, resolution)
          • Range & accuracy can also be found from the arithmetic involved
          • Find the resolution ∆
      • If the variable value comes from a RO table, then
          • These values can be found directly
  • 21. Setting Q Format: MQN
      • Now calculate the dynamic range required
          • Dynamic Range (Ratio) $= \frac{Var_{max}}{Var_{res}}$
          • Dynamic Range (Bits) $= \log_2\left(\frac{Var_{max}}{Var_{res}}\right) = B$ bits
      • Determine the K integer bits from MaxInt of the range
          • $MaxInt = \mathrm{ceil}(Var_{max})$
          • $K = \mathrm{ceil}(\log_2(MaxInt))$
      • DR bits = B = K + N
      • Total bits = B + 1(S) => 1(S) + K + N => M + N
      • S is the sign bit
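The procedure on this slide can be written out as a small helper. This is a sketch only: it assumes Varmax ≥ Varres > 0, rounds the dynamic range up to a whole number of bits, and computes the logarithm by repeated doubling so that no math library is needed. The names `q_format`, `ceil_log2` and `choose_q_format` are invented for the example.

```c
typedef struct {
    int m;  /* sign bit + integer bits: M = K + 1 */
    int n;  /* fractional bits */
} q_format;

/* ceil(log2(v)) for v >= 1 (returns 0 for smaller v), by repeated doubling. */
int ceil_log2(double v)
{
    int bits = 0;
    double span = 1.0;

    while (span < v) {
        span *= 2.0;
        bits++;
    }
    return bits;
}

/* Pick MQN from the largest magnitude and the required resolution. */
q_format choose_q_format(double var_max, double var_res)
{
    q_format f;
    int b = ceil_log2(var_max / var_res);      /* dynamic range bits B */
    long max_int = (long)var_max;              /* MaxInt = ceil(Varmax) ... */
    int k;

    if ((double)max_int < var_max)
        max_int++;                             /* ... rounding up by hand */
    k = ceil_log2((double)max_int);            /* K integer bits */

    f.m = k + 1;
    f.n = b - k;                               /* B = K + N */
    return f;
}
```

For a maximum of 4.4628159401 and a resolution of 0.09154931386 this returns M = 4 and N = 3, i.e. 4Q3.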
  • 22. RO Table Example
      • We will set the Q format for this RO table and compare two choices of resolution
            const float Table[5] = {
                (5.0979557966e-001),
                (6.0134489352e-001),
                (8.9997625993e-001),
                (2.5629159405e+000),
                (-4.4628159401e+000)
            };
      • Range: ±4.4628159401e+000 (signed representation)
      • Resolution, first choice: 5.0979557966e-001 = variable minimum value
      • Resolution, second choice: Table[1] - Table[0] = 0.09154931386
      • MaxInt = 5
  • 23. Example cont…
      • DR = 4.4628159401 / 0.09154931386 = 48.74
      • DR bits B = 6; K = 3 bits (as MaxInt = 5)
      • M = 4 bits, N = 3 bits
      • Total bits = 6(DR) + 1(S) => 1(S) + 3(K) + 3(N)
      • The conversion format MQN is 4Q3
      • M = 4 bits; N = 3 bits; max3 = 7
      • 5.0979557966e-001 * max3 = 3.57, truncated to 3
      • 3 / max3 = 0.42857142857…
      • Representation error = 15.93% (a very large error)
      • This error is not acceptable
  • 24. Example cont…
      • The best would be to represent correctly the last digit of 0.50979557966 (accurate to the 11th digit)
      • i.e. resolution = 0.00000000006
      • DR = 4.4628159401 / 0.00000000006 = 74380265668
      • DR bits ≈ 36 bits ($\log_2$ of the ratio is about 36.1)
      • Total bits = 36(DR) + 1(S) => 1(S) + 3(K) + 33(N)
      • max33 = 0x1FFFFFFFF = 8589934591
      • 5.0979557966e-001 * max33 = 4379110684
      • 4379110684 / max33 = 0.509795579652976…
      • Representation error = 1.37e-9%
      • Q format is set as 4Q33
  • 25. Final Table in 4Q33 Format
            PolyPhase4ptDCT4toDCT3[] = {
                4379110684,      error = 1.37e-9%
                5165513301,      error = 1.87e-8%
                7730737206,      error = 3.25e-9%
                -38335297017,    error = 1.53e-9%
            }
      • This is a very good representation, but many things still matter, e.g.
          • Precision of the arithmetic involved
          • Bits available for representation, e.g. 32 bits
      • The representation error with 37 bits is very small, and such precision may not be needed for our application. We may try 32 bits and check whether that fits the application.
  • 26. Shifting Binary Point
      • Right shift of a fractional number in MQN format
          • Binary point shifts to the right
          • MQN >> n → (M + n)Q(N - n)
      • Left shift of a fractional number in MQN format
          • Binary point shifts to the left
          • MQN << n → (M - n)Q(N + n)
      • For example: 2Q30 << 1 is 1Q31
      • Binary point shifting only indicates the change in the maximum/minimum value representable with the resulting format
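The two shift rules are pure bookkeeping on M and N, which the following sketch makes explicit. The type and function names are hypothetical, and no range checking is done on the shift amount.

```c
typedef struct {
    int m;  /* sign bit + integer bits */
    int n;  /* fractional bits */
} q_format;

/* MQN >> shift gives (M + shift)Q(N - shift). */
q_format format_after_right_shift(int m, int n, int shift)
{
    q_format f;
    f.m = m + shift;
    f.n = n - shift;
    return f;
}

/* MQN << shift gives (M - shift)Q(N + shift). */
q_format format_after_left_shift(int m, int n, int shift)
{
    q_format f;
    f.m = m - shift;
    f.n = n + shift;
    return f;
}
```

Calling `format_after_left_shift(2, 30, 1)` gives M = 1 and N = 31, matching the 2Q30 << 1 example; the total word length M + N is unchanged by either shift.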
  • 27. Exercise
      • If the array num[] contains 2Q13 format data
            for (i = 0; i < 200; i++) {
                sum = sum + num[i];
            }
      • What is the output format of sum?
  • 29. Overflow Vs Carry Flag
      • The carry flag is used for unsigned numbers
          • It is set when the addition/subtraction of two unsigned operands doesn't fit into the given bits
      • The overflow flag is for signed arithmetic
          • It is set when the add/sub/mult of two numbers does not give the expected sign bit in the result
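Portable C has no direct access to the processor flags, but the two conditions can be imitated by doing the addition in a wider type. A minimal sketch for 16 bit operands, with invented function names:

```c
#include <stdint.h>

/* Carry: the unsigned sum does not fit in 16 bits. */
int add16_carry(uint16_t a, uint16_t b)
{
    uint32_t wide = (uint32_t)a + (uint32_t)b;
    return wide > 0xFFFFu;
}

/* Overflow: the signed sum does not fit in 16 bits, which is the case where
   the 16 bit result would carry an unexpected sign bit. */
int add16_overflow(int16_t a, int16_t b)
{
    int32_t wide = (int32_t)a + (int32_t)b;
    return wide > INT16_MAX || wide < INT16_MIN;
}
```

The same bit pattern can trigger one condition and not the other: 0xFFFF + 0x0001 produces a carry when read as unsigned, but as signed values it is -1 + 1 = 0 with no overflow.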
  • 30. Fractional Addition
      • Two fixed point numbers a & b can be added directly if
          • The binary point is at the same place in both numbers, a & b
      • The resulting format of fractional addition will have
          • The binary point at the same place in both input & output
  • 31. Fractional Addition
      • The format of the output is the same as that of the input
          • i.e. the binary point is at the same place in both input & output
      • There is a possibility of growth of the integer part
      • Example: accumulation of an array with data in MQN format
            for (i = 0; i < L; i++) sum = sum + ar[i];
      • sum will have an output format of (M+m)QN, where $m = \mathrm{ceil}(\log_2(L))$
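The growth of the integer part during accumulation can be sketched as follows, assuming 16 bit input samples and a 32 bit accumulator (so up to 16 bits of growth are available). Both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* m = ceil(log2(L)): extra integer bits needed to add L values safely.
   Assumes length is small enough that doubling span cannot wrap around. */
int growth_bits(size_t length)
{
    int m = 0;
    size_t span = 1;

    while (span < length) {
        span <<= 1;
        m++;
    }
    return m;
}

/* Sum 16 bit MQN samples in a 32 bit accumulator; the result is (M+m)QN. */
int32_t accumulate_q(const int16_t *ar, size_t length)
{
    int32_t sum = 0;
    size_t i;

    for (i = 0; i < length; i++)
        sum += ar[i];
    return sum;
}
```

Adding 256 values needs m = 8 extra bits, so 16 bit data summed 256 times occupies at most 24 bits of the accumulator.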
  • 32. Headroom/Guard Bits
      • Overflow during an addition/subtraction can be handled with headroom
      • Example addition:
            for (i = 0; i < 256; i++) sum = sum + ar[i];
      • There is a possibility of 8 bit overflow
      • The variable sum should have 8 bits of headroom in the MSBs to accommodate the overflow
  • 34. Fractional Multiplication
      • The product of two N-bit numbers is 2N bits
      • Example: product of MQN and KQL formats
      • Output format is (M+K)Q(N+L)
  • 37. Normalizing Fractions
      • A product of two two's complement numbers has two sign bits
          • The result can be left shifted by 1 bit
      • If one of the formats is 1QN, i.e. N+1 bits, then
          • The result of the left shift will have the format of the other operand
          • If the other operand is 5Q11, then the result of the multiplication is 6Q(11+N)
          • After a 1 bit left shift, the result is 5Q(11+N)
          • i.e. 5Q11 + (N+1) LSBs
      • To make the most of the available bits, it is good to left shift the product
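For two 1Q15 operands the multiply and normalize steps look like this. The sketch assumes the common interpretation of a 1Q15 code as code / 2^15 (not the maxN scale used earlier in the deck) and an arithmetic right shift for negative values, which is what mainstream compilers provide. The left shift by 1 followed by keeping the upper 16 bits is folded into a single right shift by 15, because left shifting a negative value is undefined in C.

```c
#include <stdint.h>

/* 1Q15 x 1Q15 gives a 2Q30 product held in 32 bits (two sign bits). */
int32_t mul_q15_raw(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}

/* Normalized 1Q15 result: drop the redundant sign bit and the 15 extra LSBs. */
int16_t mul_q15(int16_t a, int16_t b)
{
    int32_t p;

    /* -1.0 * -1.0 = +1.0 is the one product that does not fit 1Q15: saturate. */
    if (a == INT16_MIN && b == INT16_MIN)
        return INT16_MAX;

    p = mul_q15_raw(a, b);
    return (int16_t)(p >> 15);   /* same bits as (p << 1) >> 16; truncates LSBs */
}
```

With 16384 standing for 0.5, `mul_q15(16384, 16384)` returns 8192, i.e. 0.25.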
  • 38. Rounding
      • Why rounding?
          • To replace a numerical value with an approximately equal & shorter, i.e. lower precision, representation
          • e.g. replacing 45.6782 with 45.68
      • Rounding is a necessary evil (remember quantization)
          • Many times it is unavoidable
          • It introduces round-off errors
      • Required in float to fixed conversion,
      • in floating & fixed point arithmetic, and
      • for function approximation on fixed point processors
  • 39. Rounding Methods
      • When rounding a number x to an integer q
          • Round to nearest: q is the integer closest to x
          • Round towards zero (truncate): q is the integer part of x
          • Round down (floor): q is the largest integer that does not exceed x
          • Round up (ceiling): q is the smallest integer that is not less than x
      • Tie break during rounding
          • Needed when the fractional part to be rounded is exactly on the half boundary
          • e.g. rounding of 23.5 requires tie breaking
  • 40. Tie Breaking
      • Round half up: q = floor(x + 0.5) (x positive or negative)
          • It is asymmetric rounding, i.e. biased (with round to nearest)
      • Round half away from zero
          • q = trunc(x + 0.5) for positive x; q = trunc(x - 0.5) for negative x
          • This method is free of bias
      • Round half to even
          • q is the even integer nearest to x
          • This method is more unbiased
          • +33.5 → +34; +32.5 → +32; -32.5 → -32; -33.5 → -34
          • Known as: unbiased rounding, convergent rounding, statistician's rounding, Dutch rounding, Gaussian rounding, or bankers' rounding
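In fixed point code these tie-breaking rules usually appear when n fractional LSBs are dropped from a wider result. The sketch below shows round half up and round half to even on 32 bit integers; it assumes n ≥ 1, two's complement integers, an arithmetic right shift for negative values, and inputs far enough from INT32_MAX that adding half an LSB cannot overflow. The function names are made up.

```c
#include <stdint.h>

/* Round half up: add half of the new LSB, then floor by shifting. */
int32_t round_half_up(int32_t x, int n)
{
    return (x + (INT32_C(1) << (n - 1))) >> n;
}

/* Round half to even (convergent rounding). */
int32_t round_half_even(int32_t x, int n)
{
    int32_t half    = INT32_C(1) << (n - 1);
    int32_t mask    = (INT32_C(1) << n) - 1;
    int32_t floor_q = x >> n;      /* floor(x / 2^n) */
    int32_t rem     = x & mask;    /* dropped bits, 0 .. 2^n - 1 */

    /* Round up when above the half point, or exactly on it with an odd floor. */
    if (rem > half || (rem == half && (floor_q & 1)))
        floor_q += 1;
    return floor_q;
}
```

With n = 1 the inputs 67, 65, -65 and -67 stand for +33.5, +32.5, -32.5 and -33.5. `round_half_even` maps them to 34, 32, -32 and -34 as on the slide, whereas `round_half_up` gives 34, 33, -32 and -33, always moving ties in the positive direction.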
  • 41. Dithering
      • Stochastic rounding
          • Choose q randomly between x+0.5 and x-0.5
          • It is also bias free because of the random component
      • Dithering: when to use this method
          • When the signal being rounded / quantized is slowly varying
      • All rounding methods produce monotonous round-off errors
          • Moreover, the round-off error is monotonous because of the particular rounding method being used
          • All of them introduce a non-linear response in a system
          • Harmonics in the filter response because of the rounding method
  • 42. Block Floating Point Processing
      • A block of data values shares a common radix point for processing
      • When the dynamic range of a signal is very high
      • When the input is fluctuating between very high and very small values
      • When very small values are as important as larger values…
      • Instructions like CLZ & NORM are helpful
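One way to picture the shared radix point is to find the largest left shift that every sample in a block tolerates and apply it to the whole block. The sketch below does this for 16 bit data; the loop in `headroom16` plays the role that a NORM or CLZ instruction would play on a DSP, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of left shifts x tolerates before its sign bit would change. */
int headroom16(int16_t x)
{
    int32_t mag = x < 0 ? ~(int32_t)x : (int32_t)x;  /* fold negatives onto non-negatives */
    int shifts = 0;

    while (shifts < 15 && (mag & 0x4000) == 0) {
        mag <<= 1;
        shifts++;
    }
    return shifts;
}

/* Shared block exponent: the smallest headroom of any sample in the block. */
int block_exponent(const int16_t *block, size_t length)
{
    int shift = 15;
    size_t i;

    for (i = 0; i < length; i++) {
        int h = headroom16(block[i]);
        if (h < shift)
            shift = h;
    }
    return shift;
}

/* Scale every sample by 2^shift so the block uses the full word length. */
void normalize_block(int16_t *block, size_t length, int shift)
{
    size_t i;

    for (i = 0; i < length; i++)
        block[i] = (int16_t)(block[i] * (1 << shift));  /* multiply keeps negatives well defined */
}
```

The shift returned by `block_exponent` has to be stored alongside the block, so that later stages can undo the scaling or account for it in the output format.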
  • 43. Exercise1
      • Assuming you are working on a 32 bit DSP…
      • Convert the following table into fixed point…
            const float gain_table[8] = {
                (8.2387952833e-002),
                (9.2387952833e-001),
                (7.0710676573e-001),
                (3.8268340208e-001),
                (1.9615705587e+000),
                (1.6629392064e+000),
                (1.1111404206e+000),
                (3.9018056901e+001)
            };
  • 44. Exercise2
      • Assume that the inp[ ] array contains 16 bit PCM data
      • Find the o/p format of out[ ]
      • Convert the following loop into fixed point…
            for (i = 0; i < 8; i++) {
                out[i] = inp[i] * gain_table[i];
            }
  • 45. Exercise3
      • Given below is the DCTIII code, with gain applied at the end…
      • Convert this into fixed point code…
  • 46. Exercise4
      • The following IIR filter is used for filtering 16 bit PCM input
      • Convert it into fixed point and ensure that the filter response is linear…
      • (Figure: block diagram with Z^-1 delay elements on x[n] and y[n], coefficients b0, b1, b2, a1, a2)
      • y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
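As a possible starting point for this exercise, not a worked solution, the difference equation can be evaluated with integer arithmetic as sketched below. The choice of 2Q14 coefficients (code = value × 2^14), the 64 bit accumulator, round half up and output saturation are all assumptions made for this illustration, as are the type and function names. An arithmetic right shift is assumed for negative values.

```c
#include <stdint.h>

typedef struct {
    int16_t b0, b1, b2, a1, a2;  /* coefficients in 2Q14 (assumed format) */
    int16_t x1, x2, y1, y2;      /* delayed inputs and outputs, 16 bit PCM */
} biquad_q14;

/* Clamp a wide value to the 16 bit PCM range. */
int16_t saturate16(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* One sample of y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]. */
int16_t biquad_step(biquad_q14 *f, int16_t x)
{
    int64_t acc = (int64_t)f->b0 * x
                + (int64_t)f->b1 * f->x1
                + (int64_t)f->b2 * f->x2
                - (int64_t)f->a1 * f->y1
                - (int64_t)f->a2 * f->y2;
    int16_t y;

    acc += 1 << 13;              /* half of the new LSB: round half up */
    y = saturate16(acc >> 14);   /* drop the 14 coefficient fraction bits */

    f->x2 = f->x1;  f->x1 = x;   /* update the delay lines */
    f->y2 = f->y1;  f->y1 = y;
    return y;
}
```

Rounding and saturation are themselves non-linear operations, so checking the response against the floating point reference remains part of the exercise.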
  • 49. References
      • Fixed Point Arithmetic: an Introduction, by Randy Yates
      • Application Note 33: Fixed point arithmetic on the ARM, document number ARM DAI 0033A
      • Wiki