# (Slides) Efficient Evaluation Methods of Elementary Functions Suitable for SIMD Computation


Naoki Shibata: Efficient Evaluation Methods of Elementary Functions Suitable for SIMD Computation, Computer Science - Research and Development (Proceedings of the International Supercomputing Conference ISC'10), Volume 25, Numbers 1-2, pp. 25-32, May 2010. DOI: 10.1007/s00450-010-0108-2

http://freshmeat.net/projects/sleef

Data-parallel architectures like SIMD (Single Instruction Multiple Data) or SIMT (Single Instruction Multiple Thread) have been adopted in many recent CPU and GPU architectures. Although some SIMD and SIMT instruction sets include double-precision arithmetic and bitwise operations, there are no instructions dedicated to evaluating elementary functions like trigonometric functions in double precision. Thus, these functions have to be evaluated one by one using an FPU or using a software library. However, traditional algorithms for evaluating these elementary functions involve heavy use of conditional branches and/or table look-ups, which are not suitable for SIMD computation. In this paper, efficient methods are proposed for evaluating the sine, cosine, arc tangent, exponential and logarithmic functions in double precision without table look-ups, scattering from, or gathering into SIMD registers, or conditional branches. We implemented these methods using the Intel SSE2 instruction set to evaluate their accuracy and speed. The results showed that the average error was less than 0.67 ulp, and the maximum error was 6 ulps. The computation speed was faster than the FPUs on Intel Core 2 and Core i7 processors.


1. Efficient Evaluation Methods of Elementary Functions Suitable for SIMD Computation (Naoki Shibata, Shiga University)
2. Overview
    - Methods for evaluating the following functions in double precision using SIMD instructions:
      - sin, cos, tan
      - log, exp
      - asin, acos, atan
    - Fast
      - Twice as fast as FPU evaluation
    - Accurate
      - The maximum error is within 6 ulps of the true value
3. Overview (contd.)
    - Advantages over existing methods:
      - No conditional branches
      - No gathering/scattering
      - No table look-ups
    - [Figure: apply exactly the same series of SIMD operations to x and y, and we get sin x and sin y]
4. Outline
    - Overview
    - Background
      - Care needed for SIMD optimization
      - Related work
    - Proposed method
      - Trigonometric functions
      - Inverse trigonometric functions
      - Exponential function
      - Logarithmic function
    - Evaluation
      - Accuracy
      - Speed
      - Code size
    - Conclusion
5. Background
    - SIMD instructions are now pervasive:
      - SSE in x86 processors
      - AltiVec in Power/PowerPC processors
      - NEON in ARM processors
      - Cell Broadband Engine
      - Many GPU models
    - SIMD registers are getting wider:
      - 256 bits in Sandy Bridge
      - 512 bits in Larrabee (or Knights Ferry?) GPUs
6. Background (contd.)
    - SIMD instruction sets do not include instructions for evaluating elementary functions:
      - sin, cos, tan, log, exp, asin, acos, atan
    - Two options:
      - FPU
        - Elementary function instructions are available only on limited architectures.
      - Software library
        - Many are not optimized for SIMD computation.
7. Care needed in SIMD optimization
    - SIMD optimization needs special care:
      - Memory accesses and conditional branches are slow
        - on modern processors with long pipelines
      - Table look-ups are slow
      - Gathering and scattering operations are slow
    - Some traditionally slow operations are no longer slow:
      - Division and square root can now be evaluated with one instruction each
      - No extra registers needed
    - We need new, specialized algorithms to make efficient use of the SIMD unit
8. Gathering and scattering operations are slow
    - We need to compose/decompose each element of a vector register
      - At least one instruction is needed per element
      - The SIMD ALU may sit idle during this operation
    - Register spills may happen
      - This requires executing extra instructions
      - This may cause memory accesses
9. Table look-up is slow
    - Table look-ups are frequently used in traditional implementations of elementary functions
      - both in hardware and software
    - [Figure: scatter each element, perform a table look-up per element, then gather; this part must be repeated for each element]
10. Division and SQRT are not too slow
    - 69 clocks of latency and throughput for one execution of the SQRTPD instruction [*]
    - 69 clocks of latency and throughput for one execution of the DIVPD instruction [*]
    - Memory access latency is 165 clocks
      - (C2Q 9300, measured by CPU-Z)
    - The good point is that they require neither extra instruction executions nor extra registers
    - [*] Intel 64 and IA-32 Architectures Optimization Reference Manual
11. Challenge
    - Usually, elementary functions are evaluated using arithmetic with extra precision
      - e.g. the x86 FPU evaluates using 80-bit arithmetic and rounds the result to 64 bits.
    - With extra-precision arithmetic, obtaining accurate output is not very hard.
    - In the proposed method, however, only 64-bit-precision arithmetic is available
    - Thus, an error in any step of the evaluation leads to error in the output
    - So, we cannot tolerate error in any step
12. Related work
    - The GNU C Library and the FPU emulator in the Linux kernel use conditional branches
      - Thus, they are not suitable for SIMD computation
    - There is much research on evaluating elementary functions in hardware
      - Many of these methods use table look-ups
      - Not suitable for SIMD computation
    - There are several multiple-precision floating-point computation libraries
      - Their design is very different from our implementation
13. Outline
    - Overview
    - Background
      - Care needed for SIMD optimization
      - Related work
    - Proposed method
      - Trigonometric functions
      - Inverse trigonometric functions
      - Exponential function
      - Logarithmic function
    - Evaluation
      - Accuracy
      - Speed
      - Code size
    - Conclusion
14. Trigonometric functions
    - Find sin x and cos x at the same time
    - Consists of two steps:
      - Step 1: the argument is reduced to between 0 and π/4
        - using symmetry and periodicity
      - Step 2: evaluate sin and cos on the reduced argument
    - tan x can be evaluated by simply computing sin x / cos x
15. Reducing the argument to between 0 and π/4
    - Given the argument d, find s and an integer q so that d = q(π/4) + s and 0 <= s < π/4
    - Is it as easy as just dividing d by π/4? No!
    - Finding q is easy, but s may be inaccurate because of cancellation error
      - when d is a large number close to a multiple of π/4
    - Some implementations, such as the FPU on x86 processors, exhibit this problem
16. Cancellation error
    - Cancellation error happens when we subtract one number from another where
      - the two numbers are very close, and
      - both numbers have limited precision and have already been rounded
    - Suppose we are calculating 1 - x with 5 digits of precision:
      - The true value of x is 0.999985, which is rounded to 0.99999
      - True value: 1 - 0.999985 = 0.000015
      - Rounded: 1 - 0.99999 = 0.00001
      - In this case, the result has only 1 significant digit of accuracy. The loss of accuracy is caused by cancellation
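The slide's 5-digit example can be reproduced directly in double precision; a minimal sketch (variable names are mine, not from the slides):

```python
# Computing 1 - x after x has already been rounded to 5 significant
# digits, as in the slide's example. Both subtractions below are exact
# in binary; the damage was done when x was rounded beforehand.
exact = 1.0 - 0.999985   # using the true value of x: 1.5e-05
rounded = 1.0 - 0.99999  # using the pre-rounded value: 1e-05
rel_error = abs(rounded - exact) / exact
print(rel_error)         # about 0.33: a third of the value is lost
```

The subtraction itself introduces no new rounding; cancellation merely exposes the error already present in the rounded operand.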
17. Solution to this problem
    - Basic idea: compute this part with extra precision, utilizing properties of IEEE 754
    - q is assumed to fit in 26 bits, which is half of the mantissa
    - π/4 is split into three parts
      - so that all of the bits in the lower half of each part's mantissa are zero
18. Solution to this problem (contd.)
    - Multiplying these numbers by q does not produce an error
      - because the lower halves of their mantissas are zero
    - Subtracting the resulting numbers from d in descending order does not produce cancellation error
      - The result of each step is accurate
    - In this way, we can obtain s accurately.
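The splitting trick can be sketched in scalar Python. The three constants below are one possible three-way split of π/4 whose upper parts have zeroed low-order mantissa bits; treat them as illustrative rather than necessarily the paper's exact values:

```python
import math

# A possible split of pi/4: PI4_A and PI4_B have short mantissas, so
# multiplying them by a q that fits in 26 bits is exact.
PI4_A = 0.78539816290140151978
PI4_B = 4.9604678871439933374e-10
PI4_C = 1.1258708853173288931e-18

def reduce_arg(d):
    """Return (q, s) with d ~ q*(pi/4) + s and 0 <= s < pi/4."""
    q = math.floor(d / (math.pi / 4.0))
    s = d
    s -= q * PI4_A   # exact product, subtraction cancels harmlessly
    s -= q * PI4_B   # removes the next chunk of q*pi/4
    s -= q * PI4_C   # removes the remaining tail
    return q, s
```

Each subtraction removes a progressively smaller chunk of q·π/4, so the intermediate results stay representable and no accuracy is lost to cancellation.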
19. Step 2 in evaluating trigonometric functions
    - Simply summing the terms of the Taylor series does not produce an accurate result
      - because errors accumulate
    - Using another kind of reduction is common:
      - use the triple-angle formula for sine
      - e.g. first divide x by 3, evaluate the Taylor series, then apply the triple-angle formula
      - This speeds up the calculation by reducing x
20. Step 2 (contd.)
    - This method is not good in our case:
      - When sin x is small, the difference between two terms is so large that it produces rounding error
    - So we use the double-angle formula for cosine instead
    - sin x can then be found from cos x by sin x = sqrt(1 - cos²x) on the reduced range
21. Step 2 (contd.)
    - But we have another problem:
      - When x is close to 0, cos 2x gets close to 1, and we get cancellation error
    - So we use the double-angle formula for (1 - cos x) instead
    - Let f(x) = 1 - cos x; then f(2x) = 2 f(x) (2 - f(x)), and sin x = sqrt(f(x) (2 - f(x)))
    - These do not produce cancellation or rounding errors
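The whole of Step 2 can be sketched in scalar form. The function names, the number of Taylor terms, and the number of halving steps below are my choices for illustration, not the paper's:

```python
import math

def one_minus_cos(x, terms=11):
    """Taylor series of f(x) = 1 - cos x = x^2/2! - x^4/4! + ..."""
    s, term = 0.0, 1.0
    for k in range(1, terms + 1):
        term *= x * x / ((2 * k - 1) * (2 * k))
        s += term if k % 2 == 1 else -term
    return s

def sin_via_f(x, halvings=3):
    """sin x for 0 <= x <= pi/2, via the double-angle formula of f."""
    f = one_minus_cos(x / 2.0 ** halvings)
    for _ in range(halvings):
        f = 2.0 * f * (2.0 - f)          # f(2t) = 2 f(t) (2 - f(t))
    return math.sqrt(f * (2.0 - f))      # sin^2 t = f(t) (2 - f(t))
```

Because f stays small and positive for small arguments, neither the doubling step nor the final square root ever subtracts two nearly equal quantities.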
22. Inverse trigonometric functions
    - We find atan x, and obtain asin x and acos x from atan x
    - Again, simply evaluating the terms of the Taylor series accumulates errors
    - There seems to be no good existing way to reduce the argument for the inverse trigonometric functions
      - We propose a new argument reduction for atan
23. New argument reduction for evaluating arc tangent (1/3)
    - Suppose x = atan d, thus tan x = d
    - If d <= 1, we evaluate the atan function on the argument 1/d instead of d
    - Then we subtract the result from π/2
    - Now we can assume that d > 1 and π/4 < x < π/2
24. New argument reduction for evaluating arc tangent (2/3)
    - We use the following formulas:
      - (13) cot x = 1 / tan x
      - (14) cot(x/2) = cot x + sqrt(1 + cot²x)
    - We use (13) to compute the cotangent of the argument d
    - Then we repeatedly use (14) to find the cotangent of (x / 2^n)
      - which actually "enlarges" the cotangent
    - By (13), enlarging the cotangent corresponds to reducing the tangent
25. New argument reduction for evaluating arc tangent (3/3)
    - Suppose that atan 2 = θ, i.e. tan θ = 2
    - Apply (13) to get cot θ = 1/2 = 0.5
    - Apply (14) to get cot(θ/2) = 0.5 + sqrt(1 + 0.25) = 1.618...
    - Apply (13) to get tan(θ/2) = 1/1.618... = 0.618...
    - 0.618 is less than 2, so we have reduced the argument
    - After argument reduction, we evaluate the Taylor series of atan
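Putting (13) and (14) together gives the following scalar sketch (the function name and the number of halving steps are my choices):

```python
import math

def atan_gt1(d, halvings=2):
    """atan d for d > 1, via the cotangent-based reduction (a sketch)."""
    c = 1.0 / d                       # (13): cot x = 1 / tan x
    for _ in range(halvings):
        c += math.sqrt(1.0 + c * c)   # (14): cot(x/2) = cot x + sqrt(1 + cot^2 x)
    t = 1.0 / c                       # tan of the reduced angle, now well below 1
    # Taylor series: atan t = t - t^3/3 + t^5/5 - ...
    s, p = 0.0, t
    for k in range(30):
        s += p / (2 * k + 1) if k % 2 == 0 else -p / (2 * k + 1)
        p *= t * t
    return s * 2.0 ** halvings        # undo the angle halvings
```

With the argument reduced this way, the alternating Taylor series converges quickly and its terms shrink monotonically, so errors do not accumulate.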
26. asin and acos functions
    - We have a problem if we simply use the following identities:
      - asin x = atan(x / sqrt(1 - x²)),  acos x = atan(sqrt(1 - x²) / x)
      - These produce cancellation errors when |x| is close to 1
    - This problem can be avoided by small modifications to the formulas
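One common form of such a modification (the slides do not show theirs, so this is only a plausible illustration) is to factor 1 - x² as (1 - x)(1 + x), avoiding the subtraction of two nearly equal squares, and to use atan2 so the x = ±1 endpoints need no special case:

```python
import math

def asin_via_atan(x):
    # (1 - x) is an exact subtraction for x near 1, so the factored
    # form (1 - x)*(1 + x) softens the cancellation in 1 - x*x.
    return math.atan2(x, math.sqrt((1.0 - x) * (1.0 + x)))

def acos_via_atan(x):
    # atan2 also gives the correct branch for negative x.
    return math.atan2(math.sqrt((1.0 - x) * (1.0 + x)), x)
```

For x in [0.5, 1], the subtraction 1 - x is exact in IEEE 754 (Sterbenz's lemma), which is what makes the factored form better behaved near the endpoints.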
27. Exponential function
    - Consists of two steps
      - similar to the trigonometric functions
    - Step 1: the argument is reduced to between 0 and log_e 2
    - Step 2: further reduce the argument and evaluate the Taylor series
28. Step 1
    - We find s and an integer q so that 0 < s <= log_e 2
    - This step is very similar to the first step for the trigonometric functions
    - The same problem arises, and we solve it in the same way
29. Step 2
    - We can use the formula exp x = (exp(x/2))² to further reduce the argument
    - And then evaluate the Taylor series
    - But if x is close to 0, we get rounding errors
      - since the difference between 1 and the other terms is large
    - So we compute (exp x) - 1 instead of exp x
    - The remaining part is similar to sin and cos
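A scalar sketch of this step (names and term counts are mine): since exp 2t - 1 = (exp t - 1)² + 2(exp t - 1), the halving-and-squaring reduction can be carried out entirely on exp x - 1, never forming the badly scaled sum 1 + x + ... directly:

```python
import math

def expm1_taylor(x, terms=20):
    """Taylor series of exp(x) - 1 = x + x^2/2! + x^3/3! + ..."""
    s, term = 0.0, 1.0
    for k in range(1, terms + 1):
        term *= x / k
        s += term
    return s

def exp_reduced(x, halvings=4):
    """exp x for 0 < x <= log 2, via repeated squaring of expm1."""
    e = expm1_taylor(x / 2.0 ** halvings)
    for _ in range(halvings):
        e = e * e + 2.0 * e   # expm1(2t) = expm1(t)^2 + 2 expm1(t)
    return e + 1.0
```

Keeping the quantity exp t - 1 throughout preserves its small leading digits; the constant 1 is added back only once, at the very end.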
30. Logarithmic function
    - We use the following series rather than the Taylor series:
      - log x = 2 (t + t³/3 + t⁵/5 + ...),  where t = (x - 1)/(x + 1)
    - This series converges faster than the Taylor series
    - Just evaluating this series is enough.
    - This series is well known.
31. Outline
    - Overview
    - Background
      - Care needed for SIMD optimization
      - Related work
    - Proposed method
      - Trigonometric functions
      - Inverse trigonometric functions
      - Exponential function
      - Logarithmic function
    - Evaluation
      - Accuracy
      - Speed
      - Code size
    - Conclusion
32. Evaluation
    - Accuracy
      - We compared the output of the proposed method with the output of the MPFR library
      - We measured the evaluation accuracy over a few argument ranges for each function
    - Speed
      - We compared the speed of the proposed methods, an FPU, and the MPFR library
      - We used a Core i7 920 (2.66 GHz) and a Core 2 Duo E7200 (2.53 GHz)
33. Accuracy
    - [Tables: accuracy results for sin, cos and tan; for atan; for asin and acos]
34. Accuracy (contd.)
    - [Tables: accuracy results for exp and for log]
    - In all cases, the error does not exceed 6 ulps
35. Speed
    - The proposed method is about twice as fast as FPU calculation
36. Code size
    - The total code size is very small
    - Suitable for the Cell B.E.
      - which has only 256 KB of directly accessible scratch-pad memory in each SPE
37. Conclusion
    - We proposed efficient methods for evaluating elementary functions in double precision
    - They involve no table look-ups, no scattering from or gathering into SIMD registers, and no conditional branches
    - The average and maximum errors were less than 0.67 ulp and 6 ulps, respectively
    - The evaluation speed was faster than the FPUs on Intel Core 2 and Core i7 processors
38. Thank you
    - An implementation of the proposed methods is available as public domain software.
    - Contact: http://freshmeat.net/projects/sleef