
- 1. A fast implementation of matrix-matrix product in double-double precision on NVIDIA C2050 and application to semidefinite programming. Nakata Maho∗† (maho@riken.jp), Yasuyoshi Takao††, Noda Shigeho†, Himeno Ryutaro†. †RIKEN, Advanced Center for Computing and Communication; ††JFE Tech. International Conference on Networking and Computing, 2012/12/5, Okinawa, 14:45-15:15.
- 2. Overview
  - Introduction of this research in one slide.
  - Importance of high precision arithmetic.
  - The double-double precision: a cheap and easy solution for quadruple precision, and its details.
  - Matrix-matrix multiplication (Rgemm) in MPACK (high precision version of BLAS and LAPACK).
  - Implementation of a fast Rgemm on the C2050 GPU: 150 times faster than CPU.
  - Application: acceleration of the semidefinite programming solver "SDPA-DD": 10 times faster than CPU.
  - Summary.
- 3. Introduction of this research in one slide. Matrix-matrix multiplication in double-double precision on an NVIDIA C2050 GPU: GPU = CPU x150, peak performance 26 GFLOPS. Application: semidefinite programming, GPU = CPU x10. [Figure: GFLOPS versus matrix dimension (0 to 6000); kernel and total performance for each combination of QuadMul-Sloppy or QuadMul-FMA with QuadAdd-Cray or QuadAdd-IEEE.]
- 5. More accuracy is needed towards PETA and EXA scale computing. EXA scale computing: about $10^{23}$ FLOP for just one week of calculation ($10^{18}$ FLOP/s over roughly $6 \times 10^{5}$ s). Scientific computing may suffer from loss of accuracy.
- 8. More accuracy is needed towards PETA and EXA scale computing. Iterative methods in double precision calculation sometimes do not even converge. [Hasegawa 2007]
- 10. More accuracy is needed towards PETA and EXA scale computing. Semidefinite programming (SDP): the condition number diverges at the optimum, so it can be very hard to obtain an accurate solution [Nakata et al 2008], [Nakata 2009], [Waki-Nakata-Muramatsu]. [Figure: the 1-norm and the estimated 1-norm condition number of the Schur complement matrix versus number of iterations (0 to 90), log scale from 1e-10 to 1e+20.]
- 14. Accelerating high precision operations on GPU is a good idea
  - Double-double precision is a cheap and fast solution for high precision:
    - accurate enough for many purposes: almost as accurate as quadruple precision.
    - fast: each operation takes only 8 ∼ 24 double precision operations.
    - operation intensive: limited by FLOPS rather than by memory bandwidth.
  - Implementing on GPU is a good idea:
    - fast: 515 GFLOPS by NVIDIA C2050, CPU 100 ∼ 200 GFLOPS.
    - cheap: NVIDIA C2050 $2000, workstation $5000 ∼ $10000.
    - does not require complex operations: suitable for GPU.
- 22. The double-double precision: handy and easy quadruple precision. "754-2008 IEEE Standard for Floating-Point Arithmetic": the binary64 (aka double precision) format has 16 significant decimal digits. It is widely used and very fast (Core i7 920: ∼40 GFLOPS; RADEON HD7970: ∼1000 GFLOPS; K computer: over 10 PFLOPS). Rounding error may occur for every arithmetic operation.
- 23. The double-double precision: handy and easy quadruple precision. A double-double precision number $a$ is expressed by two double precision numbers $a_{hi}$ and $a_{lo}$, whose unevaluated sum represents the value: $a = (a_{hi}, a_{lo})$.
- 24. The double-double precision: handy and easy quadruple precision. Knuth's Theorem: error-free transformation of two floating point numbers $a, b$: $a + b = (a \oplus b) + e$, where $\oplus$ is addition including rounding errors, $+$ is exact addition, and $e$ is a floating point number. We can evaluate the rounding error exactly for addition!
- 25. The double-double precision: handy and easy quadruple precision. Dekker's Theorem: error-free transformation of two floating point numbers $a, b$: $a \times b = (a \otimes b) + e$, where $\otimes$ is the multiplication operator with rounding errors, $\times$ is the exact multiplication operator, and $e$ is a floating point number. We can evaluate the rounding error exactly for multiplication!
- 26. The double-double precision: handy and easy quadruple precision. Based on Knuth's Theorem, we can define "Quick-Two-Sum(a, b)", where a, b are floating point numbers and ⊕, ⊖ are operators including rounding errors. When |a| ≥ |b|, we can calculate s = (a ⊕ b) and e = a + b − (a ⊕ b) exactly in three operations.
  - Quick-Two-Sum(a, b):
    1. s ← a ⊕ b
    2. e ← b ⊖ (s ⊖ a)
    3. return (s, e)
  - (s, e) = Quick-Two-Sum(a, b)
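The three Quick-Two-Sum steps can be tried directly, since Python floats are IEEE binary64 values. The following is a minimal sketch; the function name `quick_two_sum` is an illustrative choice, not part of the slides or of any library. `Fraction` converts a float exactly, which lets the error-free property be checked.

```python
from fractions import Fraction


def quick_two_sum(a, b):
    """Return (s, e) with s = fl(a + b) and s + e == a + b, assuming |a| >= |b|."""
    s = a + b        # rounded sum
    e = b - (s - a)  # rounding error of the sum, exact when |a| >= |b|
    return s, e


a, b = 1.0, 2.0 ** -60
s, e = quick_two_sum(a, b)
# The pair (s, e) carries the exact sum even though s alone has lost b.
print(s, e, Fraction(s) + Fraction(e) == Fraction(a) + Fraction(b))
```

The precondition |a| ≥ |b| matters: without it the second step may itself round, which is what the six-operation Two-Sum avoids.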
- 27. The double-double precision: handy and easy quadruple precision. Based on Knuth's Theorem, we can define "Two-Sum(a, b)", where a, b are floating point numbers and ⊕, ⊖ are operators including rounding errors. Without any assumption on the magnitudes of a and b, we can calculate s = (a ⊕ b) and e = a + b − (a ⊕ b) exactly in six operations.
  - Two-Sum(a, b):
    1. s ← a ⊕ b
    2. v ← s ⊖ a
    3. e ← (a ⊖ (s ⊖ v)) ⊕ (b ⊖ v)
    4. return (s, e)
  - (s, e) = Two-Sum(a, b)
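A matching sketch for the six-operation Two-Sum, again using plain Python floats as binary64 values. The name `two_sum` is illustrative only. The example deliberately passes the smaller number first, a case where Quick-Two-Sum's precondition would not hold.

```python
from fractions import Fraction


def two_sum(a, b):
    """Return (s, e) with s = fl(a + b) and s + e == a + b, for any ordering of a, b."""
    s = a + b
    v = s - a                  # the part of s that came from b
    e = (a - (s - v)) + (b - v)  # error contributed by a plus error contributed by b
    return s, e


a, b = 2.0 ** -60, 1.0  # |a| < |b|
s, e = two_sum(a, b)
print(s, e, Fraction(s) + Fraction(e) == Fraction(a) + Fraction(b))
```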
- 28. The double-double precision: handy and easy quadruple precision. Basics: Dekker's Theorem. There exists an algorithm which calculates p = (a ⊗ b) and e = a × b − (a ⊗ b), where ⊗ is the multiplication operator with rounding errors, using the following "Split(a)" in four operations and "Two-Prod(a, b)" in 17 operations.
  - Split(a):
    1. t ← ($2^{27}$ + 1) ⊗ a
    2. a_hi ← t ⊖ (t ⊖ a)
    3. a_lo ← a ⊖ a_hi
    4. return (a_hi, a_lo)
  - Two-Prod(a, b):
    1. p ← a ⊗ b
    2. (a_hi, a_lo) ← Split(a)
    3. (b_hi, b_lo) ← Split(b)
    4. e ← ((a_hi ⊗ b_hi ⊖ p) ⊕ a_hi ⊗ b_lo ⊕ a_lo ⊗ b_hi) ⊕ a_lo ⊗ b_lo
    5. return (p, e)
  - (p, e) = Two-Prod(a, b)
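Split and Two-Prod can be sketched the same way. The names `split` and `two_prod` are illustrative, and the sketch assumes inputs far from overflow and underflow, where the transformation is error-free.

```python
from fractions import Fraction

SPLITTER = 2.0 ** 27 + 1.0  # splitting constant for the 53-bit binary64 significand


def split(a):
    """Split a into a_hi + a_lo, each half narrow enough that products of halves are exact."""
    t = SPLITTER * a
    a_hi = t - (t - a)
    a_lo = a - a_hi
    return a_hi, a_lo


def two_prod(a, b):
    """Return (p, e) with p = fl(a * b) and p + e == a * b (no overflow or underflow)."""
    p = a * b
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    e = ((a_hi * b_hi - p) + a_hi * b_lo + a_lo * b_hi) + a_lo * b_lo
    return p, e


a, b = 1.0 + 2.0 ** -30, 1.0 - 2.0 ** -30
p, e = two_prod(a, b)
# The exact product is 1 - 2**-60; p rounds to 1.0 and e holds the lost part.
print(p, e, Fraction(p) + Fraction(e) == Fraction(a) * Fraction(b))
```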
- 29. The double-double precision: handy and easy quadruple precision. Addition in double-double precision can be done in 20 FLOPS by the following "QuadAdd-IEEE".
  - QuadAdd-IEEE(a, b):
    1. (s_hi, e_hi) = Two-Sum(a_hi, b_hi)
    2. (s_lo, e_lo) = Two-Sum(a_lo, b_lo)
    3. e_hi = e_hi ⊕ s_lo
    4. (s_hi, e_hi) = Quick-Two-Sum(s_hi, e_hi)
    5. e_hi = e_hi ⊕ e_lo
    6. (c_hi, c_lo) = Quick-Two-Sum(s_hi, e_hi)
    7. return (c)
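A sketch of the 20-operation addition, representing a double-double number as a Python tuple `(hi, lo)`. This representation and the function names are assumptions made for illustration. The step order follows the IEEE-style addition of the QD library as far as I know it: two Two-Sums, then two renormalizing Quick-Two-Sums.

```python
from fractions import Fraction


def quick_two_sum(a, b):
    s = a + b
    return s, b - (s - a)


def two_sum(a, b):
    s = a + b
    v = s - a
    return s, (a - (s - v)) + (b - v)


def quad_add_ieee(a, b):
    """Add two double-double numbers given as (hi, lo) tuples."""
    s_hi, e_hi = two_sum(a[0], b[0])        # sum of the high parts and its error
    s_lo, e_lo = two_sum(a[1], b[1])        # sum of the low parts and its error
    e_hi += s_lo
    s_hi, e_hi = quick_two_sum(s_hi, e_hi)  # first renormalization
    e_hi += e_lo
    return quick_two_sum(s_hi, e_hi)        # final (c_hi, c_lo)


a = (1.0, 2.0 ** -60)
b = (1.0, 2.0 ** -61)
c = quad_add_ieee(a, b)
# Plain double addition of the high parts alone would drop both low parts.
print(c, Fraction(c[0]) + Fraction(c[1]) == sum(map(Fraction, a + b)))
```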
- 30. The double-double precision: handy and easy quadruple precision. Multiplication in double-double precision can be done in 24 FLOPS by the following "QuadMul".
  - QuadMul(a, b):
    1. (p_hi, p_lo) = Two-Prod(a_hi, b_hi)
    2. p_lo = p_lo ⊕ (a_hi ⊗ b_lo ⊕ a_lo ⊗ b_hi)
    3. (c_hi, c_lo) = Quick-Two-Sum(p_hi, p_lo)
    4. return (c)
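The QuadMul steps can be sketched on `(hi, lo)` tuples as well. The tuple representation and all function names are illustrative assumptions; Two-Prod here is the FMA-free Dekker version.

```python
from fractions import Fraction

SPLITTER = 2.0 ** 27 + 1.0


def quick_two_sum(a, b):
    s = a + b
    return s, b - (s - a)


def split(a):
    t = SPLITTER * a
    hi = t - (t - a)
    return hi, a - hi


def two_prod(a, b):
    p = a * b
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    return p, ((a_hi * b_hi - p) + a_hi * b_lo + a_lo * b_hi) + a_lo * b_lo


def quad_mul(a, b):
    """Multiply two double-double numbers given as (hi, lo) tuples."""
    p_hi, p_lo = two_prod(a[0], b[0])      # exact product of the high parts
    p_lo += a[0] * b[1] + a[1] * b[0]      # cross terms; the lo * lo term is dropped
    return quick_two_sum(p_hi, p_lo)


a = (1.0 + 2.0 ** -30, 0.0)
b = (1.0 - 2.0 ** -30, 0.0)
c = quad_mul(a, b)
# The exact product 1 - 2**-60 does not fit in one double but fits in the pair.
print(c, Fraction(c[0]) + Fraction(c[1]) == Fraction(a[0]) * Fraction(b[0]))
```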
- 31. The double-double precision: handy and easy quadruple precision. The FMA (fused multiply-add) operation calculates a × b + c in one instruction: a × b + c is computed exactly and then rounded once to double precision.
- 32. The double-double precision: handy and easy quadruple precision. Faster: using the FMA instruction, Two-Prod becomes 3 operations (17 ops without FMA), and QuadMul(-FMA) can be done in only 10 operations (24 ops without FMA).
  - Two-Prod-FMA(a, b):
    1. p ← a ⊗ b
    2. e ← FMA(a × b − p)
    3. return (p, e)
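Two-Prod-FMA can be illustrated without hardware FMA by emulating the fused operation with exact rational arithmetic. The helper `fma` below is a hypothetical stand-in written for this sketch: it forms a × b + c exactly as a `Fraction` and rounds once when converting back to float. Python 3.13 added `math.fma`, which could replace it where available.

```python
from fractions import Fraction


def fma(a, b, c):
    """Emulated fused multiply-add: exact a * b + c, rounded once to binary64."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def two_prod_fma(a, b):
    """Return (p, e) with p = fl(a * b) and p + e == a * b, using one FMA."""
    p = a * b
    e = fma(a, b, -p)  # exact product minus its rounded value
    return p, e


a, b = 0.1, 0.3
p, e = two_prod_fma(a, b)
print(p, e, Fraction(p) + Fraction(e) == Fraction(a) * Fraction(b))
```

The emulation is slow and only serves to show why the error term comes out exact; on the GPU the same step is a single instruction.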
- 33. The double-double precision: handy and easy quadruple precision. Faster: lower accuracy operations.
  - QuadMul-Sloppy(a, b):
    1. p = a_hi ⊗ b_lo
    2. q = a_lo ⊗ b_hi
    3. t = p ⊕ q
    4. c_hi = FMA(a_hi × b_hi + t)
    5. e = FMA(a_hi × b_hi − c_hi)
    6. c_lo = e ⊕ t
    7. return (c)
  - QuadAdd-Cray(a, b):
    1. (c_hi, c_lo) = Two-Sum(a_hi, b_hi)
    2. c_lo = c_lo ⊕ (a_lo ⊕ b_lo)
    3. (c_hi, c_lo) = Quick-Two-Sum(c_hi, c_lo)
    4. return (c)
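The two lower-accuracy variants can be sketched in the same style. Double-double values are `(hi, lo)` tuples, all names are illustrative, and `fma` is an emulation via exact rationals rather than a hardware instruction.

```python
from fractions import Fraction


def fma(a, b, c):
    """Emulated fused multiply-add: exact a * b + c, rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def quick_two_sum(a, b):
    s = a + b
    return s, b - (s - a)


def two_sum(a, b):
    s = a + b
    v = s - a
    return s, (a - (s - v)) + (b - v)


def quad_add_cray(a, b):
    """Cheaper double-double addition: low parts are simply added in double."""
    c_hi, c_lo = two_sum(a[0], b[0])
    c_lo += a[1] + b[1]
    return quick_two_sum(c_hi, c_lo)


def quad_mul_sloppy(a, b):
    """Cheaper double-double multiplication using two FMAs."""
    t = a[0] * b[1] + a[1] * b[0]  # cross terms p + q
    c_hi = fma(a[0], b[0], t)
    e = fma(a[0], b[0], -c_hi)
    return c_hi, e + t


print(quad_add_cray((1.0, 2.0 ** -60), (1.0, 2.0 ** -61)))
print(quad_mul_sloppy((1.0 + 2.0 ** -30, 0.0), (1.0 - 2.0 ** -30, 0.0)))
```

On these simple inputs the results agree with the more careful variants; the accuracy difference only shows on inputs where the dropped error terms are non-zero.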
- 34. The double-double precision: handy and easy quadruple precision. Summary: operation count of each double-double arithmetic algorithm.

  | Algorithm | # of operations |
  |---|---|
  | Quick-Two-Sum | 3 |
  | Two-Sum | 6 |
  | Split | 4 |
  | Two-Prod | 17 |
  | Two-Prod-FMA | 3∗ |
  | QuadAdd-IEEE | 20 |
  | QuadAdd-Cray | 11 |
  | QuadMul | 24 |
  | QuadMul-FMA | 10∗ |
  | QuadMul-FMA-Sloppy | 8∗ |

  ∗ FMA is counted as 2 FLOPS. We used QuadAdd-IEEE and QuadMul-FMA when not explicitly stated.
- 35. The double-double precision: handy and easy quadruple precision. QD library.
  - Features: C++ classes. The double-double precision type is "dd_real". Free software.
  - Authors: Yozo Hida, Xiaoye S. Li, David H. Bailey.
  - Download: http://crd.lbl.gov/~dhbailey/mpdist/
  - Paper: http://crd.lbl.gov/~dhbailey/dhbpapers/arith15.pdf
  - Yozo Hida, Xiaoye S. Li, David H. Bailey, "Quad-Double Arithmetic: Algorithms, Implementation, and Application", Technical Report LBNL-46996, Lawrence Berkeley National Laboratory, 2000.
- 36. Implementation on GPU and performance evaluation. We accelerated the matrix-matrix multiplication routine called "Rgemm". Prototype definition of Rgemm:
  `void Rgemm(const char *transa, const char *transb, mpackint m, mpackint n, mpackint k, dd_real alpha, dd_real *A, mpackint lda, dd_real *B, mpackint ldb, dd_real beta, dd_real *C, mpackint ldc)`
  "MPACK" by M. Nakata is a multiple precision version of BLAS and LAPACK (the de facto standard linear algebra packages): http://mplapack.sourceforge.net/. "Rgemm" corresponds to "dgemm" and "sgemm" of BLAS.
- 37. Implementation on GPU and performance evaluation: related studies.
  - D. Mukunoki and D. Takahashi: Implementation of double-double matrix matrix multiplication on GPU, HPCS, p. 148-156 (2011). → Matrix size must be a multiple of 64, and it is slower than our implementation.
  - N. Nakasato: "A Fast GEMM Implementation On a Cypress GPU", Performance Modeling, Benchmark and Simulation of High Performance Computing Systems, Louisiana, USA, 2010. → Matrix size must be a multiple of 64, and it is faster than our implementation.
  - Both implementations are not practical → we implemented ours for general use.
- 38. Implementation on GPU and evaluation. [Figure: NVIDIA C2050 architecture.]
- 39. Implementation on GPU and evaluation. Block algorithm: we divide the matrices into small blocks of sizes $b_K$, $b_M$, $b_N$. We used $b_M = b_K = 16$ and $b_N = 64$.
- 40. Implementation on GPU and evaluation. Basic algorithm:
  1. Transfer the A, B, C matrices from CPU memory to GPU global memory.
  2. Blocking: Ab: 16 × 16 and Bb: 16 × 64 was the most efficient.
  3. Assign a thread block of 16 × 16 = 256 threads to each block. The (i, j)-th thread in a thread block works on the i-th row of Ab and the j, j + 16, j + 32, j + 48-th columns of Bb (four columns at the same time).
- 41. Implementation on GPU and evaluation. Operation of each thread in detail:
  1. Multiply by beta the elements c0, c1, c2, c3 of the C matrix which correspond to the i-th row of Ab and the j, j + 16, j + 32, j + 48-th columns of Bb.
  2. Read the first blocks Ab and Bb from global memory to shared memory. Each thread of the thread block reads its own elements.
  3. Calculate the inner products of the row vector $a_i$ of Ab with the columns $b_j, b_{j+16}, b_{j+32}, b_{j+48}$ of Bb as $p_0, p_1, p_2, p_3$.
  4. Update c0, c1, c2, c3 like c0 ← c0 + α p0.
  5. Read the next blocks Ab, Bb and repeat steps 3 and 4 until no further blocks are available.
  6. Update the C matrix with c0, c1, c2, c3.
  7. Finally, transfer the C matrix from GPU global memory to the CPU.
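The blocking scheme can be sketched on the CPU to show the order of operations: scale C by beta once, then accumulate alpha times a partial inner product for each block along k. This is only an illustration under simplifying assumptions, not the authors' CUDA kernel. It uses plain Python floats in place of double-double values, nested lists in place of device memory, and sequential loops in place of the 16 × 16 thread block; the name `gemm_blocked` and its parameters are hypothetical.

```python
def gemm_blocked(alpha, A, B, beta, C, bm=16, bk=16, bn=64):
    """Return alpha * A @ B + beta * C, computed block by block."""
    m, k, n = len(A), len(B), len(B[0])
    # Step 1: multiply C by beta once, before any accumulation.
    out = [[beta * C[i][j] for j in range(n)] for i in range(m)]
    for i0 in range(0, m, bm):          # block row of A and C
        for j0 in range(0, n, bn):      # block column of B and C
            for k0 in range(0, k, bk):  # steps 2 and 5: next blocks Ab, Bb
                for i in range(i0, min(i0 + bm, m)):
                    for j in range(j0, min(j0 + bn, n)):
                        # Step 3: partial inner product within the block.
                        p = 0.0
                        for kk in range(k0, min(k0 + bk, k)):
                            p += A[i][kk] * B[kk][j]
                        # Step 4: c <- c + alpha * p.
                        out[i][j] += alpha * p
    return out


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]
print(gemm_blocked(2.0, A, B, 3.0, C))
```

The `min(...)` bounds handle sizes that are not multiples of the block size; the deck handles the same edge case on the GPU with pointer redirecting.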
- 42. Implementation on GPU and evaluation. The performance of the matrix-matrix operation in double-double precision for square matrices (m = n = k), varying m. The maximum kernel performance was 16.4 GFLOPS, and 16.1 GFLOPS with CPU-GPU transfer included. [Figure: GFLOPS versus dimension (0 to 6000), NN-Kernel and NN-Total.]
- 43. Implementation on GPU and evaluation. The performance of the matrix-matrix operation in double-double precision with matrix transposes, for square matrices (m = n = k), varying m. No performance loss with matrix transposes is observed. [Figure: GFLOPS versus dimension (0 to 6000), kernel and total performance for the NN, NT, TN and TT cases.]
- 44. Implementation on GPU and evaluation. We observed no performance loss with matrix transposes because we read the matrices through texture memory.
  - Global memory and texture memory are essentially the same memory. However, with texture memory the performance loss from non-coalesced memory access was small.
  - It is also relatively easy to hide the latency of memory transfer in double-double precision, since it is operation intensive (cf. QuadAdd-IEEE requires 20 FLOPS, QuadMul-FMA requires 10 FLOPS).
- 45. Implementation on GPU and evaluation. "Pointer redirecting" from "Accelerating GPU kernels for dense linear algebra", Rajib Nath, Stanimire Tomov, and Jack Dongarra. A large performance loss (∼35%) is observed for matrix sizes that are not a multiple of 64.
- 46. Implementation on GPU and evaluation. "Pointer redirecting" from "Accelerating GPU kernels for dense linear algebra", Rajib Nath, Stanimire Tomov, and Jack Dongarra. Simple algorithm: if the pointer is out of the block, then return the value of the nearest edge. Very simple program, with a small amount of performance loss. Breakthrough!!
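The redirecting rule as described on the slide amounts to clamping an out-of-range index to the nearest valid one, so every tile can be loaded at full size without branching on the matrix edge. The sketch below illustrates that idea on nested lists; the names `redirect` and `load_tile` are hypothetical, and this is not the code of Nath, Tomov, and Dongarra.

```python
def redirect(index, size):
    """Clamp an index into [0, size - 1], i.e. redirect it to the nearest edge."""
    return min(max(index, 0), size - 1)


def load_tile(M, row0, col0, tile_rows, tile_cols):
    """Load a full-size tile starting at (row0, col0), redirecting reads past the edge."""
    rows, cols = len(M), len(M[0])
    return [
        [M[redirect(row0 + r, rows)][redirect(col0 + c, cols)] for c in range(tile_cols)]
        for r in range(tile_rows)
    ]


M = [[1, 2, 3],
     [4, 5, 6]]
# A 2 x 2 tile starting at column 2 overhangs the matrix by one column;
# the overhanging reads return the last valid column instead of failing.
print(load_tile(M, 0, 2, 2, 2))  # [[3, 3], [6, 6]]
```

Values read through a redirected index are duplicates and must not be written back; only results that correspond to real matrix elements are stored.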
- 47. Implementation on GPU and evaluation. Performance loss was reduced from 35% to 6%!! [Figure: GFLOPS (about 14.6 to 16.4) versus dimension (2050 to 2250), Kernel and Total.]
- 48. Implementation on GPU and evaluation. Performance varied by only 0.1% over repeated calculations. [Figure: GFLOPS (Total), between about 15.5535 and 15.5575, versus the index of the measurement (1 to 100).]
- 49. Implementation on GPU and evaluation. Using less accurate operations, we attained 26.4 GFLOPS. [Figure: GFLOPS versus dimension (0 to 6000); kernel and total performance for each combination of QuadMul-Sloppy or QuadMul-FMA with QuadAdd-Cray or QuadAdd-IEEE.]
- 50. Implementation on GPU and evaluation. Using less accurate operations, we attained 26.4 GFLOPS. "CPU" denotes a measurement on Xeon 3470 + DDR3-1066.

  | Algorithm | Performance |
  |---|---|
  | QuadAdd-Cray, QuadMul-Sloppy kernel | 26.4 GFLOPS |
  | QuadAdd-Cray, QuadMul-Sloppy total | 25.7 GFLOPS |
  | QuadAdd-Cray, QuadMul kernel | 23.0 GFLOPS |
  | QuadAdd-Cray, QuadMul total | 22.4 GFLOPS |
  | QuadAdd-IEEE, QuadMul-Sloppy kernel | 18.1 GFLOPS |
  | QuadAdd-IEEE, QuadMul-Sloppy total | 17.8 GFLOPS |
  | QuadAdd-IEEE, QuadMul kernel | 16.4 GFLOPS |
  | QuadAdd-IEEE, QuadMul total | 16.1 GFLOPS |
  | QuadAdd-IEEE, QuadMul CPU | 100 MFLOPS |
  | QuadAdd-IEEE, QuadMul OpenMP CPU | 400 MFLOPS |
- 51. Implementation on GPU and evaluation. 16.1 GFLOPS = ??2.4% (or 46.2%) of peak performance (QuadAdd-IEEE, QuadMul-FMA).
  - Average number of double precision operations per double-double operation: QuadAdd-IEEE takes 20 ops and QuadMul-FMA takes 10 ops, and in Rgemm the same number of multiplications and additions appear: $(20 + 10 - 1)/2 = 14.5$.
  - The approximate theoretical peak should be $515\ \mathrm{GFLOPS} / 14.5 = 35.5\ \mathrm{GFLOPS}$.
  - However, on the C2050 the peak performance is calculated assuming full use of FMA, which is not the case in our calculation, thus $515\ \mathrm{GFLOPS} / 14.5 / 2 = 17.8\ \mathrm{GFLOPS}$.
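The peak estimate on this slide is plain arithmetic and can be reproduced in a few lines. The variable names are illustrative; the input figures (515 GFLOPS, 20 and 10 operations, the −1 correction and the factor 2 for FMA) are taken from the slide as given.

```python
peak_double_gflops = 515.0  # C2050 double precision peak quoted on the slide
ops_add = 20                # QuadAdd-IEEE
ops_mul = 10                # QuadMul-FMA

# Average double precision operations per double-double operation in Rgemm.
ops_per_dd_op = (ops_add + ops_mul - 1) / 2

# Estimated double-double peak if every instruction were an FMA.
peak_dd_full_fma = peak_double_gflops / ops_per_dd_op

# Estimated double-double peak without full use of FMA.
peak_dd_no_fma = peak_dd_full_fma / 2

print(ops_per_dd_op)                # 14.5
print(round(peak_dd_full_fma, 1))   # 35.5
print(round(peak_dd_no_fma, 1))     # 17.8
```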
- 52. Application: x10 acceleration for the semidefinite programming solver "SDPA-DD".
- 53. Application: x10 acceleration for the semidefinite programming solver "SDPA-DD". Semidefinite programming:
  - Primal: $\min\ A_0 \bullet X$ subject to $A_i \bullet X = b_i\ (i = 1, 2, \cdots, m)$, $X \succeq 0$.
  - Dual: $\max\ \sum_{i=1}^{m} b_i z_i$ subject to $\sum_{i=1}^{m} A_i z_i + Y = A_0$, $Y \succeq 0$.
  - $A_i$: $n \times n$ symmetric matrices; $X$: $n \times n$ symmetric variable matrix; $b$: $m$-dimensional vector; $Y$: $n \times n$ symmetric variable matrix; $X \bullet Y := \sum_{ij} X_{ij} Y_{ij}$.
  - $X \succeq 0$: $X$ is positive semidefinite, i.e. its eigenvalues are larger than or equal to 0.
- 54. Application: x10 acceleration for the semidefinite programming solver "SDPA-DD". Nature of optimality. Theorem (Complementary slackness theorem): when $(X^*, Y^*, z^*)$ is a feasible solution, i.e. it satisfies the conditions of the primal and dual SDP, and an interior point exists, then a necessary and sufficient condition for optimality of $(X^*, Y^*, z^*)$ is $X^* \bullet Y^* = 0$.
- 55. Application: x10 acceleration for the semidefinite programming solver "SDPA-DD". When $X^*, Y^*$ are optimal, $X^* \bullet Y^* = 0$. Then $\operatorname{rank} X^* + \operatorname{rank} Y^* \le n$ (1) also follows. At least one of $X^*, Y^*$ is singular. Usually both $X^*$ and $Y^*$ are singular → unstable and/or less accurate at the optimum.
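A tiny hand-made example shows what the complementarity condition looks like. The matrices below are chosen only for illustration and do not come from any SDP in the talk; the helper names `inner` and `det2` are hypothetical.

```python
def inner(X, Y):
    """Matrix inner product: sum over i, j of X[i][j] * Y[i][j]."""
    return sum(X[i][j] * Y[i][j] for i in range(len(X)) for j in range(len(X[0])))


def det2(M):
    """Determinant of a 2 x 2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


# Two positive semidefinite 2 x 2 matrices of rank 1 each (n = 2).
X = [[1.0, 0.0],
     [0.0, 0.0]]
Y = [[0.0, 0.0],
     [0.0, 2.0]]

print(inner(X, Y))       # 0.0: the complementarity condition holds
print(det2(X), det2(Y))  # 0.0 0.0: both matrices are singular
```

Here rank X + rank Y = 2 = n, and both matrices are singular, which is the situation that makes linear algebra near the optimum ill-conditioned.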
- 56. How to solve SDP: the interior point primal-dual path following method. The world's best implementations, SDPA and SDPARA, are available from the SDPA group led by Prof. Fujisawa.
  - Step 0: Set the initial points $x^0, X^0, Y^0$ with $X^0 \succ 0$, $Y^0 \succ 0$. Let $h = 0$ and choose the parameter $\gamma \in (0, 1)$.
  - Step 1: Calculate the Schur complement matrix $B \in S^n$: $B_{ij} = ((X^h)^{-1} F_i Y^h) \bullet F_j$.
  - Step 2: Solve the linear equation $B\,dx = r$ and calculate $dX, dY$ from the solution $dx$; this gives the next step $(dx, dX, dY)$.
  - Step 3: Determine the step size $\alpha$ keeping the positive semidefiniteness of the matrices: $\alpha = \max\{\alpha \in [0, 1] : X^h + \alpha\,dX \succeq 0,\ Y^h + \alpha\,dY \succeq 0\}$.
  - Step 4: Update the current point: $(x^{h+1}, X^{h+1}, Y^{h+1}) = (x^h, X^h, Y^h) + \gamma\alpha\,(dx, dX, dY)$.
  - Step 5: If $(x^{h+1}, X^{h+1}, Y^{h+1})$ satisfies some requirements, the iteration ends. Otherwise, increment $h = h + 1$ and go back to Step 1.
- 57. The Schur complement matrix becomes singular. B is called the "Schur complement matrix". We solve the linear equation $B\,dx = r$ to determine the next step. This linear equation becomes singular! Multiple precision arithmetic is needed for accurate solutions! [Figure: the 1-norm and the estimated 1-norm condition number of the Schur complement matrix versus number of iterations (0 to 90), log scale from 1e-10 to 1e+20.]
- 58. Application: x10 acceleration for the semidefinite programming solver "SDPA-DD". Benchmark result: larger problems from SDPLIB (problem archive). CPU: Xeon 3470, DDR3-1066.

  | Problem | CPU (sec) | GPU (sec) | Acceleration |
  |---|---|---|---|
  | equalG51 | 6531.9 | 573.2 | 11.4 |
  | gpp500-1 | 902.0 | 72.2 | 12.5 |
  | gpp500-4 | 638.0 | 74.8 | 8.5 |
  | maxG32 | 36284.4 | 4373.1 | 8.3 |
  | maxG55 | 521575.4 | 53413.1 | 9.8 |
  | mcp500-4 | 539.1 | 65.2 | 8.3 |
  | qpG11 | 16114.7 | 1408.0 | 11.4 |
  | qpG51 | 39678.9 | 3299.2 | 12.0 |
  | ss30 | 310.7 | 138.6 | 2.2 |
  | theta5 | 3250.0 | 239.8 | 13.6 |
  | theta6 | 9028.2 | 623.6 | 14.5 |
  | thetaG51 | 49161.5 | 4870.4 | 10.1 |
- 59. Summary. http://mplapack.sourceforge.net/ Matrix-matrix multiplication in double-double precision on an NVIDIA C2050 GPU: GPU = CPU x150, peak performance 26 GFLOPS. [Figure: GFLOPS versus dimension (0 to 6000); kernel and total performance for each combination of QuadMul-Sloppy or QuadMul-FMA with QuadAdd-Cray or QuadAdd-IEEE.]
