# Computation with Fixed Point versus Floating Point


http://www.xplace.com/ArchitecTerra · info@architecterra.com · P.O.B 10124, Petah Tiqwa 49001, Israel · Fax: +972-3-9214577
© 2013, ArchitecTerra Ltd. PROPRIETARY – referencing is mandatory

## 1 Introduction

This paper is a brief introduction to the world of fixed point and floating point computation. Developers, in DSP and beyond, often face a dilemma: which computational model is best for implementing an algorithm? A second question immediately follows: what does "best" mean? The answer is not trivial. Floating point computation may be less efficient on a given processing platform, since its FLOPS figure is lower than its MOPS figure for integer computation. Likewise, a modern GPU or vector processor may deliver terrific FLOPS performance, but only in so-called half precision; single precision is at least half as fast, and double precision a quarter as fast, if it is supported at all.

After that statement one may ask: what good is all this efficiency without precision? Again, the answer is not obvious. What does precision mean here? Must it be ideal, or merely acceptable? How far apart are "acceptable" and "ideal", and how can one know without running thousands of tests? This article provides simple, intuitive tools for answering most of these questions. Efficiency and processor architecture comparisons are not the subject of this paper; those decisions become simple once the computation accuracy tradeoffs are clear.

## 2 Floating Point vs. Fixed Point

The general representation of a floating point value looks like:

1.m × 2^Exp    (Equation 1)

while a fixed point value is represented as:

I.f    (Equation 2)

where:

- m – mantissa
- Exp – exponent
- I – integer part
- f – fractional part

One may say that a floating point value always has the integer part "1", while the fixed point "I" field may have any length and value, and the floating point mantissa "m" plays the same role as the fixed point fraction "f". Despite this similarity of representation, there are very significant differences.

### 2.1 Floating Point Format

The 32-bit single precision format is chosen for demonstration purposes:

| bit 31 | bits 30…23 | bits 22…0 |
|--------|------------|-----------|
| S      | Exp        | m         |

The figure above represents the single precision floating point value layout. The fractional point is "located" between the Exp and m fields; Exp and m are always positive values. The floating point conversion formula is:

(-1)^S × 1.m × 2^(Exp-127)    (Equation 3)
The integer "1" is virtual: since it is predefined, no dedicated bit needs to be reserved for it.

### 2.2 Fixed Point Format

For a correct comparison, 32-bit values are again chosen for demonstration:

| bit 31 | bits 30…0            |
|--------|----------------------|
| S      | 2's complement value |

The main difference between the fixed point format and the floating point format is the fractional point location: for fixed point values it is undefined. Since the fractional point is virtual, the SW developer may imagine it at any position; with n fractional bits the conversion formula looks like:

V = I + f × 2^(-n)    (Equation 4)

Because fixed point formats lack any standardization, fixed point computation relies on an agreement that is valid for a particular implementation or working team. Texas Instruments introduced a terminology for fixed point formatting known as Q-format, designated Qm.n, for example:

| bit 31 | bits 30…22 | bits 21…0   |
|--------|------------|-------------|
| S      | m (9 bits) | n (22 bits) |

The figure above represents the format Q9.22. In the designation Qm.n, "m" is the number of integer bits and "n" the number of fractional bits; the overall length of a Qm.n value is (m+n+1) bits, where the additional bit is dedicated to the sign. Finally, it is easy to see that a fixed point value is in fact a regular integer number, whose virtual fractional point location serves only as a computation precision agreement. In DSP computing, a simplified fixed point format is used as such an agreement.
This simplified Qn format has no integer part, only a fraction: the fractional point is located between the sign bit and the most significant fraction bit. The three most commonly used formats are:

- Q15: bit 15 is the sign, bits 14…0 are the fraction (16-bit word)
- Q23: bit 23 is the sign, bits 22…0 are the fraction (24-bit word)
- Q31: bit 31 is the sign, bits 30…0 are the fraction (32-bit word)

It is important to note that every Qn representation keeps the fractional point at a fixed location; this is what makes computations mutually compatible.

## 3 Theory of Operation

When operating with fixed point and floating point values, two aspects must be understood: precision and dynamic range. In the dynamic range competition, a clear and undoubted advantage belongs to the floating point format. Comparing the 32-bit formats, the difference is easy to see:
Floating point dynamic range (normalized values, approximately): 2^-126 ≤ |V| < 2^128

Q31 fixed point dynamic range: -1 + 2^-31 ≤ V ≤ 1 - 2^-31

For precision, however, the situation changes. Indeed, the finest precision of a floating point number is defined by the LSb of the mantissa, i.e. 2^-23 × 2^(Exp-127), while for Q31 fixed point values the precision is a constant 2^-31. How can these numbers be compared? Does fixed point have the advantage? Yes, but only in certain circumstances. Is floating point perhaps more applicable after all? The same answer applies. Ultimately it depends on the nature of the operations the algorithm implementation requires. The right answer is that fixed point values provide a narrower but more precise computation window within a comparable dynamic range, and that window may lie inside or even outside the floating point dynamic range. This is not intuitive, but the next equation explains everything:

(-2^31 + 1) × 2^N ≤ (V × 2^31) × 2^N ≤ (2^31 - 1) × 2^N    (Equation 5)

For fixed point computation it is only important that the values within a set or group do not mutually exceed the fixed point dynamic range; otherwise the smallest of them are interpreted as zeroes. The power factor N is called the scaling factor: it brings all computed values into the chosen format, here Q31. It may be treated as a kind of virtual floating point. Thanks to its very wide dynamic range, the floating point format is ideal for multiplying very big values by very small ones without losing information. Conversely, if the ratio between two fixed point multiplicands exceeds 2^31, the smaller multiplicand is simply interpreted as 0.
The fixed point format may provide more precision in addition and subtraction. This statement deserves a deeper explanation. Let us try to add 1 and 2^-24. Both values are easily represented in floating point:

| value | S | Exp (biased)    | m     |
|-------|---|-----------------|-------|
| 1     | 0 | 0111 1111 (127) | 000…0 |
| 2^-24 | 0 | 0110 0111 (103) | 000…0 |

But an add operation between these two values loses the smaller one, because the 23-bit mantissa has no way to acquire a 24th bit. The same example works fine in fixed point. First, both values are scaled to fit the Q31 agreement with the scaling factor 2^30:

| value        | Q31 pattern             |
|--------------|-------------------------|
| 1 × 2^30     | bit 30 set (0x40000000) |
| 2^-24 × 2^30 | bit 6 set (0x00000040)  |

After conversion, an add operation between these two 32-bit integers gives a result with ideal precision. The scaling factor 2^30 means the fractional point is located between bits 30 and 29.

## 4 Computation Error Estimation

Now that the precision differences are explained, it is easy to estimate the computational error. In general, the error does not exceed half of the least significant bit of either the floating point mantissa or the fixed point representation. So:

- For the 32-bit floating point format the error cannot be greater than half of 2^-23 × 2^(Exp-127), i.e. 2^-24 × 2^(Exp-127), where Exp is the exponent of the operation result.
- For the Q31 fixed point format the error cannot be greater than half of 2^-31, i.e. 2^-32, per operation.

There are several important rules to consider whenever a format choice is made for computation:

1. Addition and subtraction operations should be avoided when one operand lies outside the dynamic range of the other. When this happens, the computation error equals the value of the smaller operand.
2. The 32-bit floating point format cannot represent fractional values starting from 2^24; half precision floating point hits a similar limit already at 2^11.
3. Floating point operations may be insufficiently precise within a given or chosen dynamic range, but they are standard, while fixed point computation requires special HW or SW support. The rules of fixed point arithmetic differ from integer arithmetic, though they are based on it.

ArchitecTerra Ltd. is your address for improving DSP/RISC platform SW/FW efficiency, implementation, and platform-dedicated optimization, including low-level Assembly techniques. ArchitecTerra provides consulting and training on efficient work with SIMD and MIMD architectures, and introductions to DSP and RISC processor architecture. Contact information: info@architecterra.com