A fast implementation of matrix-matrix product in double-double precision on NVIDIA C2050 and application to semidefinite programming

Nakata Maho*† (maho@riken.jp*), Yasuyoshi Takao††, Noda Shigeho†, Himeno Ryutaro†
† RIKEN, Advanced Center for Computing and Communication; †† JFE Tech

International Conference on Networking and Computing, 2012/12/5 @ Okinawa, 14:45-15:15
Overview

Introduction of this research in a slide.
Importance of high precision arithmetic.
The double-double precision: a cheap and easy solution for quadruple precision, and its details.
Matrix-matrix multiplication (Rgemm) in MPACK (a high precision version of BLAS and LAPACK).
Implementation of a fast Rgemm on the C2050 GPU: 150 times faster than the CPU.
Application: accelerating the semidefinite programming solver "SDPA-DD": 10 times faster than the CPU.
Summary.
Introduction of this research in a slide.

Matrix-matrix multiplication in double-double precision on the NVIDIA C2050 GPU: GPU = CPU x 150, peak performance 26 GFLOPS.

[Figure: GFLOPS vs. dimension (0-6000), y axis 0-25, for the eight variants QuadMul-{Sloppy, FMA} x QuadAdd-{Cray, IEEE}, kernel and total.]

+ Application: semidefinite programming, GPU = CPU x 10.
More accuracy is needed towards PETA and EXA scale computing

Exa-scale computing: 10^23 FLOP for just one week of calculation!
Scientific computing may suffer from accuracy problems.
More accuracy is needed towards PETA and EXA scale computing

Iterative methods in double precision sometimes do not even converge. [Hasegawa 2007]
More accuracy is needed towards PETA and EXA scale computing

Semidefinite programming (SDP): the condition number diverges at the optimum. Therefore, it can be very hard to obtain an accurate solution.
[Nakata et al. 2008], [Nakata 2009], [Waki-Nakata-Muramatsu]

[Figure: the 1-norm and the estimated 1-norm condition number (legend: 1-cond, 1-norm) of the Schur complement matrix vs. # of iterations (0-90), log scale from 1e-10 to 1e+20.]
Accelerating high precision operations on the GPU is a good idea

Double-double precision is a cheap and fast solution for high precision:
    accurate enough for many purposes: almost as accurate as quadruple precision.
    fast: each operation takes only 8-24 double precision operations.
    operation intensive: demands FLOPS rather than memory bandwidth.
Implementing on the GPU is a good idea:
    fast: 515 GFLOPS on the NVIDIA C2050 vs. 100-200 GFLOPS on a CPU.
    cheap: NVIDIA C2050 ~$2000; workstation: $5000-$10000.
    does not require complex operations: suitable for the GPU.
The double-double precision: handy and easy quadruple precision

"754-2008 IEEE Standard for Floating-Point Arithmetic":
The binary64 (aka double precision) format has 16 significant decimal digits.

Widely used and very fast (Core i7 920: ~40 GFLOPS; RADEON HD7970: ~1000 GFLOPS; K computer: over 10 PFLOPS).

Rounding error may occur at every arithmetic operation.
The double-double precision: handy and easy quadruple precision

A double-double precision number a is expressed by two double precision numbers a_hi, a_lo:

    a = (a_hi, a_lo).
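As a concrete illustration, such a pair can be held in a plain struct. A minimal sketch (the struct name dd is ours; the QD library introduced later provides the real implementation, dd_real):

    struct dd {
        double hi;   // leading part: the double closest to the represented value
        double lo;   // trailing part: the remainder, |lo| <= 0.5 ulp(hi)
    };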
The double-double precision: handy and easy quadruple precision

Knuth's Theorem
Error-free transformation of two floating point numbers a, b:

    a + b = (a ⊕ b) + e,

where ⊕ is addition including rounding errors, + is exact addition, and e is a floating point number.
We can evaluate the rounding error exactly for addition!
The double-double precision: handy and easy quadruple precision

Dekker's Theorem
Error-free transformation of two floating point numbers a, b:

    a × b = (a ⊗ b) + e,

where ⊗ is multiplication including rounding errors, × is exact multiplication, and e is a floating point number.
We can evaluate the rounding error exactly for multiplication!
The double-double precision: handy and easy quadruple precision

Based on Knuth's Theorem, we can define "Quick-Two-Sum(a, b)", where a, b are floating point numbers and ⊕, ⊖ are operators including rounding errors. When |a| ≄ |b|, we can calculate s = a ⊕ b and e = a + b − (a ⊕ b) exactly in three operations.

    Quick-Two-Sum(a, b):
    1. s ← a ⊕ b
    2. e ← b ⊖ (s ⊖ a)
    3. return (s, e)
The double-double precision: handy and easy quadruple precision

Based on Knuth's Theorem, we can also define "Two-Sum(a, b)", with no constraint on the magnitudes of a, b; we can calculate s = a ⊕ b and e = a + b − (a ⊕ b) exactly in six operations.

    Two-Sum(a, b):
    1. s ← a ⊕ b
    2. v ← s ⊖ a
    3. e ← (a ⊖ (s ⊖ v)) ⊕ (b ⊖ v)
    4. return (s, e)
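Both transformations translate line-for-line into C/CUDA. A minimal sketch (function names are ours; compile with strict IEEE semantics, i.e. no -ffast-math or unsafe reassociation, or the error terms are optimized away):

    __host__ __device__ inline double quick_two_sum(double a, double b, double *e) {
        // Requires |a| >= |b|.  s is the rounded sum, *e the exact error.
        double s = a + b;
        *e = b - (s - a);
        return s;
    }

    __host__ __device__ inline double two_sum(double a, double b, double *e) {
        // No ordering constraint on a, b; six operations.
        double s = a + b;
        double v = s - a;
        *e = (a - (s - v)) + (b - v);
        return s;
    }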
The double-double precision: handy and easy quadruple precision

Basics: Dekker's Theorem
There exists an algorithm which calculates s = a ⊗ b and e = a × b − (a ⊗ b), where ⊗ is the multiplication operator with rounding errors, using the following "Split(a)" in four operations and "Two-Prod(a, b)" in 17 operations.

    Split(a):
    1. t ← (2^27 + 1) ⊗ a
    2. a_hi ← t ⊖ (t ⊖ a)
    3. a_lo ← a ⊖ a_hi
    4. return (a_hi, a_lo)

    Two-Prod(a, b):
    1. p ← a ⊗ b
    2. (a_hi, a_lo) ← Split(a)
    3. (b_hi, b_lo) ← Split(b)
    4. e ← ((a_hi ⊗ b_hi ⊖ p) ⊕ a_hi ⊗ b_lo ⊕ a_lo ⊗ b_hi) ⊕ a_lo ⊗ b_lo
    5. return (p, e)
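The same transcription for Split and Two-Prod, a minimal sketch under the same assumptions:

    __host__ __device__ inline void split(double a, double *hi, double *lo) {
        // 2^27 + 1 = 134217729 splits a 53-bit significand into two 26-bit halves.
        double t = 134217729.0 * a;
        *hi = t - (t - a);
        *lo = a - *hi;
    }

    __host__ __device__ inline double two_prod(double a, double b, double *e) {
        // p is the rounded product, *e the exact error a*b - p.
        double p = a * b;
        double a_hi, a_lo, b_hi, b_lo;
        split(a, &a_hi, &a_lo);
        split(b, &b_hi, &b_lo);
        *e = ((a_hi * b_hi - p) + a_hi * b_lo + a_lo * b_hi) + a_lo * b_lo;
        return p;
    }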
The double-double precision: handy and easy quadruple precision

Addition of two double-double numbers can be done in 20 FLOPS by the following "QuadAdd-IEEE".

    QuadAdd-IEEE(a, b):
    1. (s_hi, e_hi) = Two-Sum(a_hi, b_hi)
    2. (s_lo, e_lo) = Two-Sum(a_lo, b_lo)
    3. e_hi = e_hi ⊕ s_lo
    4. (s_hi, e_hi) = Quick-Two-Sum(s_hi, e_hi)
    5. e_hi = e_hi ⊕ e_lo
    6. (c_hi, c_lo) = Quick-Two-Sum(s_hi, e_hi)
    7. return c = (c_hi, c_lo)
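A direct transcription of QuadAdd-IEEE, assuming the dd struct and the two_sum/quick_two_sum sketches above:

    __host__ __device__ inline dd quad_add_ieee(dd a, dd b) {
        double e_hi, e_lo;
        double s_hi = two_sum(a.hi, b.hi, &e_hi);   // high parts, 6 ops
        double s_lo = two_sum(a.lo, b.lo, &e_lo);   // low parts, 6 ops
        e_hi += s_lo;                               // 1 op
        s_hi = quick_two_sum(s_hi, e_hi, &e_hi);    // renormalize, 3 ops
        e_hi += e_lo;                               // 1 op
        dd c;
        c.hi = quick_two_sum(s_hi, e_hi, &c.lo);    // final renormalize, 3 ops
        return c;                                   // 20 ops total
    }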
The double-double precision: handy and easy quadruple precision

Multiplication of two double-double numbers can be done in 24 FLOPS by the following "QuadMul".

    QuadMul(a, b):
    1. (p_hi, p_lo) = Two-Prod(a_hi, b_hi)
    2. p_lo = p_lo ⊕ (a_hi ⊗ b_lo ⊕ a_lo ⊗ b_hi)
    3. (c_hi, c_lo) = Quick-Two-Sum(p_hi, p_lo)
    4. return c = (c_hi, c_lo)
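Likewise QuadMul, assuming the sketches above:

    __host__ __device__ inline dd quad_mul(dd a, dd b) {
        double p_lo;
        double p_hi = two_prod(a.hi, b.hi, &p_lo);   // 17 ops
        p_lo += a.hi * b.lo + a.lo * b.hi;           // cross terms, 4 ops
        dd c;
        c.hi = quick_two_sum(p_hi, p_lo, &c.lo);     // renormalize, 3 ops
        return c;                                    // 24 ops total
    }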
The double-double precision: handy and easy quadruple precision

The FMA (fused multiply-add) operation calculates

    a × b + c

in one instruction: a × b + c is computed exactly, then rounded once to double precision.
The double-double precision: handy and easy quadruple precision

Faster: using the FMA instruction, Two-Prod becomes 3 operations (17 ops without FMA), and QuadMul(-FMA) can be done in only 10 operations (24 ops without FMA).

    Two-Prod-FMA(a, b):
    1. p ← a ⊗ b
    2. e ← FMA(a × b − p)
    3. return (p, e)
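With FMA, the 17-operation Two-Prod collapses to three operations. A minimal sketch; fma() is the standard C/CUDA fused multiply-add:

    __host__ __device__ inline double two_prod_fma(double a, double b, double *e) {
        double p = a * b;
        *e = fma(a, b, -p);   // exact residual a*b - p in one fused operation
        return p;
    }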
The double-double precision: handy and easy quadruple precision

Faster: lower accuracy operations.

    QuadAdd-Cray(a, b):
    1. (c_hi, c_lo) = Two-Sum(a_hi, b_hi)
    2. c_lo = c_lo ⊕ (a_lo ⊕ b_lo)
    3. (c_hi, c_lo) = Quick-Two-Sum(c_hi, c_lo)
    4. return c

    QuadMul-Sloppy(a, b):
    1. p = a_hi ⊗ b_lo
    2. q = a_lo ⊗ b_hi
    3. t = p ⊕ q
    4. c_hi = FMA(a_hi × b_hi + t)
    5. e = FMA(a_hi × b_hi − c_hi)
    6. c_lo = e ⊕ t
    7. return c
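The two lower-accuracy variants, transcribed under the same assumptions (QuadAdd-Cray drops the error capture on the low parts; QuadMul-Sloppy folds the cross terms in with two FMAs):

    __host__ __device__ inline dd quad_add_cray(dd a, dd b) {
        dd c;
        c.hi = two_sum(a.hi, b.hi, &c.lo);          // 6 ops
        c.lo += a.lo + b.lo;                        // low parts, rounded, 2 ops
        c.hi = quick_two_sum(c.hi, c.lo, &c.lo);    // 3 ops -> 11 total
        return c;
    }

    __host__ __device__ inline dd quad_mul_sloppy(dd a, dd b) {
        double t = (a.hi * b.lo) + (a.lo * b.hi);   // cross terms, 3 ops
        dd c;
        c.hi = fma(a.hi, b.hi, t);                  // a.hi*b.hi + t, rounded once
        double e = fma(a.hi, b.hi, -c.hi);          // residual of a.hi*b.hi
        c.lo = e + t;                               // 8 ops total (FMA = 2)
        return c;
    }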
The double-double precision: handy and easy quadruple precision

Summary: operation count of each double-double arithmetic algorithm.

    Algorithm              # of operations
    Quick-Two-Sum                 3
    Two-Sum                       6
    Split                         4
    Two-Prod                     17
    Two-Prod-FMA                  3*
    QuadAdd-IEEE                 20
    QuadAdd-Cray                 11
    QuadMul                      24
    QuadMul-FMA                  10*
    QuadMul-FMA-Sloppy            8*

* FMA counted as 2 FLOPS.
We used QuadAdd-IEEE and QuadMul-FMA when not explicitly stated.
The double-double precision: handy and easy quadruple precision

The QD library
Features: C++ classes; the double-double precision type is "dd_real". Free software. Authors: Yozo Hida, Xiaoye S. Li, David H. Bailey.
Download: http://crd.lbl.gov/~dhbailey/mpdist/
Paper: http://crd.lbl.gov/~dhbailey/dhbpapers/arith15.pdf
Yozo Hida, Xiaoye S. Li, David H. Bailey, "Quad-Double Arithmetic: Algorithms, Implementation, and Application", Technical Report LBNL-46996, Lawrence Berkeley National Laboratory, 2000.
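A minimal usage sketch of the QD library on the CPU side (header and function names as documented by the QD authors; the fpu_fix calls set the x86 FPU to round-to-double, which QD requires):

    #include <iostream>
    #include <qd/dd_real.h>
    #include <qd/fpu.h>     // fpu_fix_start / fpu_fix_end

    int main() {
        unsigned int old_cw;
        fpu_fix_start(&old_cw);              // required on x86
        dd_real a = dd_real("0.1");          // ~32 significant decimal digits
        dd_real b = sqrt(dd_real(2.0));
        std::cout << (a * b).to_string() << std::endl;
        fpu_fix_end(&old_cw);
        return 0;
    }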
Implementation on GPU and performance evaluation

We accelerated the matrix-matrix multiplication routine called "Rgemm". Prototype of Rgemm:

void Rgemm(const char *transa, const char *transb,
mpackint m, mpackint n, mpackint k, dd_real alpha,
dd_real * A, mpackint lda, dd_real * B, mpackint ldb,
dd_real beta, dd_real * C, mpackint ldc)

"MPACK" by M. Nakata: a multiple precision version of BLAS and LAPACK (the de facto standard linear algebra packages).
http://mplapack.sourceforge.net/
"Rgemm" corresponds to "dgemm" and "sgemm" of BLAS.
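A usage sketch of Rgemm following the prototype above (the header name mblas_dd.h is our assumption for MPACK's double-double BLAS header; matrices are column-major as in BLAS):

    #include <mblas_dd.h>   // assumed MPACK header declaring Rgemm for dd_real

    int main() {
        mpackint n = 512;
        dd_real *A = new dd_real[n * n];
        dd_real *B = new dd_real[n * n];
        dd_real *C = new dd_real[n * n];
        for (mpackint i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }
        dd_real alpha = 1.0, beta = 0.0;
        // C <- alpha * A * B + beta * C, no transposes, leading dimension n
        Rgemm("n", "n", n, n, n, alpha, A, n, B, n, beta, C, n);
        delete[] A; delete[] B; delete[] C;
        return 0;
    }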
Implementation on GPU and performance evaluation

Related studies:
    D. Mukunoki and D. Takahashi: "Implementation of double-double matrix matrix multiplication on GPU", HPCS, pp. 148-156 (2011). → Matrix size must be a multiple of 64; slower than our implementation.
    N. Nakasato: "A Fast GEMM Implementation On a Cypress GPU", Performance Modeling, Benchmark and Simulation of High Performance Computing Systems, Louisiana, USA, 2010. → Matrix size must be a multiple of 64; faster than our implementation.

Neither implementation is practical → we implemented for general use.
Implementation on GPU and evaluation

[Figure: NVIDIA C2050 architecture.]
Implementation on GPU and evaluation

Block algorithm: we divide the matrices into small blocks of sizes b_K, b_M, b_N. We used b_M = b_K = 16 and b_N = 64.
Implementation on GPU and evaluation

Basic algorithm:
1. Transfer the A, B, C matrices from CPU memory to GPU global memory.
2. Blocking: Ab: 16 × 16 and Bb: 16 × 64 were the most efficient.
3. Apply thread blocks of 16 × 16 = 256 threads to the blocks. Each (i, j)-th thread in a thread block works on the i-th row of Ab and the j, j+16, j+32, j+48-th columns of Bb (four columns at the same time).
Implementation on GPU and evaluation

Operation of each thread in detail (a simplified kernel sketch follows):
1. Multiply beta into c0, c1, c2, c3 of the C matrix, which correspond to the i-th row of Ab and the j, j+16, j+32, j+48-th columns of Bb.
2. Read the first blocks Ab and Bb from global memory to shared memory; each thread of the block reads its own elements.
3. Calculate the inner products of row vector a_i of Ab with columns b_j, b_j+16, b_j+32, b_j+48 of Bb as p0, p1, p2, p3.
4. Update c0, c1, c2, c3 as c0 ← c0 + α p0, etc.
5. Read the next blocks Ab, Bb and repeat steps 3 and 4 until no further blocks are available.
6. Update the C matrix by c0, c1, c2, c3.
7. Finally, transfer the C matrix from GPU global memory back to the CPU.
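A simplified sketch of the kernel structure, using square 16 × 16 tiles and one output element per thread instead of the paper's 16 × 64 blocking with four columns per thread, and assuming dimensions that are multiples of 16 (dd and the quad_* helpers are the earlier sketches; matrices are column-major with leading dimension n):

    #define BS 16

    __global__ void rgemm_nn_sketch(int n, dd alpha, const dd *A, const dd *B,
                                    dd beta, dd *C) {
        __shared__ dd Ab[BS][BS];            // one tile of A
        __shared__ dd Bb[BS][BS];            // one tile of B
        int row = blockIdx.y * BS + threadIdx.y;
        int col = blockIdx.x * BS + threadIdx.x;
        dd acc = {0.0, 0.0};
        for (int kb = 0; kb < n; kb += BS) { // step 5: loop over k-blocks
            // step 2: each thread stages one element of Ab and of Bb
            Ab[threadIdx.y][threadIdx.x] = A[row + (size_t)(kb + threadIdx.x) * n];
            Bb[threadIdx.y][threadIdx.x] = B[(kb + threadIdx.y) + (size_t)col * n];
            __syncthreads();
            // steps 3-4: inner product of a row of Ab with a column of Bb
            for (int k = 0; k < BS; k++)
                acc = quad_add_ieee(acc,
                          quad_mul(Ab[threadIdx.y][k], Bb[k][threadIdx.x]));
            __syncthreads();
        }
        // steps 1 and 6: C(row,col) <- alpha*acc + beta*C(row,col)
        size_t idx = row + (size_t)col * n;
        C[idx] = quad_add_ieee(quad_mul(alpha, acc), quad_mul(beta, C[idx]));
    }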
Implementation on GPU and evaluation

The performance of the matrix-matrix operation in double-double precision, for square matrices (m = n = k), varying m. The maximum kernel performance was 16.4 GFLOPS, or 16.1 GFLOPS with the CPU-GPU transfer included.

[Figure: NN-Kernel and NN-Total GFLOPS vs. dimension (0-6000), y axis 0-16 GFLOPS.]
Implementation on GPU and evaluation

The performance of the matrix-matrix operation in double-double precision with matrix transposes, for square matrices (m = n = k), varying m. No performance loss with matrix transposes is observed.

[Figure: kernel and total GFLOPS vs. dimension (0-6000) for the NN, NT, TN, and TT cases.]
Implementation on GPU and evaluation

We observed no performance loss with matrix transposes; the reason is that we use texture memory instead of plain global memory loads.
    Global memory and texture memory are essentially the same; however, with texture memory the performance loss without coalesced memory access is small.
    Also, it is relatively easy to hide the latency of memory transfer in double-double precision, since it is operation intensive (cf. QuadAdd-IEEE requires 20 FLOPS, QuadMul-FMA requires 10 FLOPS).
Implementation on GPU and evaluation

"Pointer redirecting", from "Accelerating GPU kernels for dense linear algebra", Rajib Nath, Stanimire Tomov, and Jack Dongarra:
    A large performance loss (~35%) is observed for matrix sizes that are not multiples of 64.
Implementation on GPU and evaluation

"Pointer redirecting", from "Accelerating GPU kernels for dense linear algebra", Rajib Nath, Stanimire Tomov, and Jack Dongarra:
    Simple algorithm: if a pointer falls outside the block, return the value at the nearest edge.
    Very simple program.
    Small performance loss.
    Breakthrough!!
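The idea in code, a minimal sketch (names are ours): instead of branching around out-of-range loads, clamp the index so threads past the edge re-read the nearest valid element, keeping the loads coalesced-friendly:

    __device__ inline int redirect(int i, int n) {
        // Out-of-range indices are redirected to the nearest edge.
        return (i < n) ? i : n - 1;
    }

    // Used inside the tile load, e.g.:
    //   Ab[ty][tx] = A[redirect(row, m) + (size_t)redirect(kb + tx, k) * lda];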
Implementation on GPU and evaluation

The performance loss was reduced from 35% to 6%!

[Figure: kernel and total GFLOPS vs. dimension (2050-2250), y axis 14.6-16.4 GFLOPS.]
Implementation on GPU and evaluation

Performance varied by only 0.1% over repeated measurements.

[Figure: total GFLOPS over 100 repeated measurements, all within 15.5535-15.5575 GFLOPS.]
Implementation on GPU and evaluation

Using less accurate operations, we attained 26.4 GFLOPS.

[Figure: GFLOPS vs. dimension (0-6000), y axis 0-25, for the eight variants QuadMul-{Sloppy, FMA} x QuadAdd-{Cray, IEEE}, kernel and total.]
Implementation on GPU and evaluation

Using less accurate operations, we attained 26.4 GFLOPS. "CPU" denotes measurement on a Xeon 3470 + DDR3-1066.

    Algorithm                                Performance
    QuadAdd-Cray, QuadMul-Sloppy kernel      26.4 GFLOPS
    QuadAdd-Cray, QuadMul-Sloppy total       25.7 GFLOPS
    QuadAdd-Cray, QuadMul kernel             23.0 GFLOPS
    QuadAdd-Cray, QuadMul total              22.4 GFLOPS
    QuadAdd-IEEE, QuadMul-Sloppy kernel      18.1 GFLOPS
    QuadAdd-IEEE, QuadMul-Sloppy total       17.8 GFLOPS
    QuadAdd-IEEE, QuadMul kernel             16.4 GFLOPS
    QuadAdd-IEEE, QuadMul total              16.1 GFLOPS
    QuadAdd-IEEE, QuadMul CPU                 100 MFLOPS
    QuadAdd-IEEE, QuadMul OpenMP CPU          400 MFLOPS
Implementation on GPU and evaluation

16.4 GFLOPS = 92.4% (or 46.2%) of peak performance (QuadAdd-IEEE, QuadMul-FMA).
Average double-precision operation count: QuadAdd-IEEE takes 20 ops and QuadMul-FMA takes 10 ops, and in Rgemm the same number of multiplications and additions appear, so

    (20 + 10 − 1)/2 = 14.5.

The approximate theoretical peak should then be

    515 GFLOPS / 14.5 = 35.5 GFLOPS.

However, the C2050's peak performance assumes full use of FMA, which our calculation does not achieve, thus

    515 GFLOPS / 14.5 / 2 = 17.8 GFLOPS.
Application: x10 acceleration of the semidefinite programming solver "SDPA-DD"
Application: x10 acceleration of the semidefinite programming solver "SDPA-DD"

Semidefinite programming:

    Primal   min:  A_0 ‱ X
             s.t.: A_i ‱ X = b_i   (i = 1, 2, ..., m)
                   X ⪰ 0

    Dual     max:  sum_{i=1}^m b_i z_i
             s.t.: sum_{i=1}^m A_i z_i + Y = A_0
                   Y ⪰ 0

A_i: n × n symmetric matrices; X: n × n symmetric variable matrix; b: m-dimensional vector; Y: n × n symmetric variable matrix; X ‱ Y := sum_{ij} X_ij Y_ij. X ⪰ 0 means X is positive semidefinite: all eigenvalues are larger than or equal to 0.
Application: x10 acceleration of the semidefinite programming solver "SDPA-DD"

Nature of optimality.
Theorem (Complementary slackness theorem)
When (X*, Y*, z*) is a feasible interior point of the primal and dual SDPs, the necessary and sufficient condition for optimality of (X*, Y*, z*) is:

    X* ‱ Y* = 0.
Application: x10 acceleration of the semidefinite programming solver "SDPA-DD"

When X*, Y* are optimal,

    X* ‱ Y* = 0.

Then

    rank X* + rank Y* ≀ n      (1)

also follows.
At least one of X*, Y* is singular; usually both X* and Y* are singular → unstable and/or less accurate at the optimum.
How to solve SDP: the interior point primal-dual path following method

World's best implementations, SDPA and SDPARA, are available from the SDPA group led by Prof. Fujisawa.

    Step 0: Set the initial point x^0, X^0, Y^0 with X^0 ≻ 0, Y^0 ≻ 0; let h = 0 and choose a parameter γ ∈ (0, 1).
    Step 1: Calculate the Schur complement matrix B ∈ S^m:  B_ij = ((X^h)^{-1} F_i Y^h) ‱ F_j.
    Step 2: Solve the linear equation B dx = r, and calculate dX, dY from the solution dx; we obtain the step (dx, dX, dY).
    Step 3: Determine the step size α keeping positive semidefiniteness of the matrices: α = max{α ∈ [0, 1] : X^h + α dX ⪰ 0, Y^h + α dY ⪰ 0}.
    Step 4: Update the current point: (x^{h+1}, X^{h+1}, Y^{h+1}) = (x^h, X^h, Y^h) + γα(dx, dX, dY).
    Step 5: If (x^{h+1}, X^{h+1}, Y^{h+1}) satisfies the stopping criteria, the iteration ends. Otherwise, go back to Step 1 with h = h + 1.
The Schur complement matrix becomes singular

B is called the "Schur complement matrix". We solve the linear equation B dx = r to determine the next step. This linear equation becomes singular!

Multiple precision arithmetic is needed for accurate solutions!

[Figure: the 1-norm and the estimated 1-norm condition number (legend: 1-cond, 1-norm) of the Schur complement matrix vs. # of iterations (0-90), log scale from 1e-10 to 1e+20.]
Application: x10 acceleration of the semidefinite programming solver "SDPA-DD"

Benchmark results for larger problems from SDPLIB (a problem archive). CPU: Xeon 3470, DDR3-1066.

    Problem     CPU (sec)    GPU (sec)    Acceleration
    equalG51      6531.9        573.2        11.4
    gpp500-1       902.0         72.2        12.5
    gpp500-4       638.0         74.8         8.5
    maxG32       36284.4       4373.1         8.3
    maxG55      521575.4      53413.1         9.8
    mcp500-4       539.1         65.2         8.3
    qpG11        16114.7       1408.0        11.4
    qpG51        39678.9       3299.2        12.0
    ss30           310.7        138.6         2.2
    theta5        3250.0        239.8        13.6
    theta6        9028.2        623.6        14.5
    thetaG51     49161.5       4870.4        10.1
Summary

http://mplapack.sourceforge.net/

Matrix-matrix multiplication in double-double precision on the NVIDIA C2050 GPU: CPU x 150, peak performance 26 GFLOPS.

[Figure: GFLOPS vs. dimension (0-6000), y axis 0-25, for the eight variants QuadMul-{Sloppy, FMA} x QuadAdd-{Cray, IEEE}, kernel and total.]

HPCS2015 pythonă‚’ç”šă„ăŸé‡ć­ćŒ–ć­Šăƒ—ăƒ­ă‚°ăƒ©ăƒ ăźé–‹ç™șべ濜甹Maho Nakata
 
HPCS2015 ć€§èŠæšĄé‡ć­ćŒ–ć­Šèšˆçź—ăƒ—ăƒ­ă‚°ăƒ©ăƒ SMASHぼ開ç™șべ慬開(çŸłæ‘)
HPCS2015 ć€§èŠæšĄé‡ć­ćŒ–ć­Šèšˆçź—ăƒ—ăƒ­ă‚°ăƒ©ăƒ SMASHぼ開ç™șべ慬開(çŸłæ‘)HPCS2015 ć€§èŠæšĄé‡ć­ćŒ–ć­Šèšˆçź—ăƒ—ăƒ­ă‚°ăƒ©ăƒ SMASHぼ開ç™șべ慬開(çŸłæ‘)
HPCS2015 ć€§èŠæšĄé‡ć­ćŒ–ć­Šèšˆçź—ăƒ—ăƒ­ă‚°ăƒ©ăƒ SMASHぼ開ç™șべ慬開(çŸłæ‘)Maho Nakata
 
The PubChemQC Project
The PubChemQC ProjectThe PubChemQC Project
The PubChemQC ProjectMaho Nakata
 
3DプăƒȘăƒłă‚żć°Žć…„èš˜ă€€ă‚żăƒłăƒ‘ă‚ŻèłȘăźæšĄćž‹ă‚’ăƒ—ăƒȘントする
3DプăƒȘăƒłă‚żć°Žć…„èš˜ă€€ă‚żăƒłăƒ‘ă‚ŻèłȘăźæšĄćž‹ă‚’ăƒ—ăƒȘントする3DプăƒȘăƒłă‚żć°Žć…„èš˜ă€€ă‚żăƒłăƒ‘ă‚ŻèłȘăźæšĄćž‹ă‚’ăƒ—ăƒȘントする
3DプăƒȘăƒłă‚żć°Žć…„èš˜ă€€ă‚żăƒłăƒ‘ă‚ŻèłȘăźæšĄćž‹ă‚’ăƒ—ăƒȘントするMaho Nakata
 

More from Maho Nakata (20)

quantum chemistry on quantum computer handson by Q# (2019/8/4@MDR Hongo, Tokyo)
quantum chemistry on quantum computer handson by Q# (2019/8/4@MDR Hongo, Tokyo)quantum chemistry on quantum computer handson by Q# (2019/8/4@MDR Hongo, Tokyo)
quantum chemistry on quantum computer handson by Q# (2019/8/4@MDR Hongo, Tokyo)
 
Lie-Trotter-Suzukićˆ†è§Łă€ç‰čă«ăƒ•ăƒ©ă‚Żă‚żăƒ«ćˆ†è§Łă«ă€ă„ăŠ
Lie-Trotter-Suzukićˆ†è§Łă€ç‰čă«ăƒ•ăƒ©ă‚Żă‚żăƒ«ćˆ†è§Łă«ă€ă„ăŠLie-Trotter-Suzukićˆ†è§Łă€ç‰čă«ăƒ•ăƒ©ă‚Żă‚żăƒ«ćˆ†è§Łă«ă€ă„ăŠ
Lie-Trotter-Suzukićˆ†è§Łă€ç‰čă«ăƒ•ăƒ©ă‚Żă‚żăƒ«ćˆ†è§Łă«ă€ă„ăŠ
 
LiHăźăƒăƒ†ăƒłă‚·ăƒŁăƒ«ă‚šăƒăƒ«ă‚źăƒŒæ›Č靱 ă‚’é‡ć­ă‚łăƒłăƒ”ăƒ„ăƒŒă‚żă§èĄŒă† Q#+äœç›žæŽšćźšç·š
LiHăźăƒăƒ†ăƒłă‚·ăƒŁăƒ«ă‚šăƒăƒ«ă‚źăƒŒæ›Č靱 ă‚’é‡ć­ă‚łăƒłăƒ”ăƒ„ăƒŒă‚żă§èĄŒă† Q#+äœç›žæŽšćźšç·šLiHăźăƒăƒ†ăƒłă‚·ăƒŁăƒ«ă‚šăƒăƒ«ă‚źăƒŒæ›Č靱 ă‚’é‡ć­ă‚łăƒłăƒ”ăƒ„ăƒŒă‚żă§èĄŒă† Q#+äœç›žæŽšćźšç·š
LiHăźăƒăƒ†ăƒłă‚·ăƒŁăƒ«ă‚šăƒăƒ«ă‚źăƒŒæ›Č靱 ă‚’é‡ć­ă‚łăƒłăƒ”ăƒ„ăƒŒă‚żă§èĄŒă† Q#+äœç›žæŽšćźšç·š
 
Q#ă«ă‚ˆă‚‹é‡ć­ćŒ–ć­Šèšˆçź— : æ°ŽçŽ ćˆ†ć­ăźäœç›žæŽšćźšă«ă€ă„ăŠ
Q#ă«ă‚ˆă‚‹é‡ć­ćŒ–ć­Šèšˆçź— : æ°ŽçŽ ćˆ†ć­ăźäœç›žæŽšćźšă«ă€ă„ăŠQ#ă«ă‚ˆă‚‹é‡ć­ćŒ–ć­Šèšˆçź— : æ°ŽçŽ ćˆ†ć­ăźäœç›žæŽšćźšă«ă€ă„ăŠ
Q#ă«ă‚ˆă‚‹é‡ć­ćŒ–ć­Šèšˆçź— : æ°ŽçŽ ćˆ†ć­ăźäœç›žæŽšćźšă«ă€ă„ăŠ
 
é‡ć­ă‚łăƒłăƒ”ăƒ„ăƒŒă‚żăźé‡ć­ćŒ–ć­Šèšˆçź—ăžăźćżœç”šăźçŸçŠ¶ăšć±•æœ›
é‡ć­ă‚łăƒłăƒ”ăƒ„ăƒŒă‚żăźé‡ć­ćŒ–ć­Šèšˆçź—ăžăźćżœç”šăźçŸçŠ¶ăšć±•æœ›é‡ć­ă‚łăƒłăƒ”ăƒ„ăƒŒă‚żăźé‡ć­ćŒ–ć­Šèšˆçź—ăžăźćżœç”šăźçŸçŠ¶ăšć±•æœ›
é‡ć­ă‚łăƒłăƒ”ăƒ„ăƒŒă‚żăźé‡ć­ćŒ–ć­Šèšˆçź—ăžăźćżœç”šăźçŸçŠ¶ăšć±•æœ›
 
qubită«ă‚ˆă‚‹æłąć‹•é–ąæ•°ăźè™šæ™‚é–“ç™șć±•ăźă‚·ăƒŸăƒ„ăƒŹăƒŒă‚·ăƒ§ăƒł: a review
qubită«ă‚ˆă‚‹æłąć‹•é–ąæ•°ăźè™šæ™‚é–“ç™șć±•ăźă‚·ăƒŸăƒ„ăƒŹăƒŒă‚·ăƒ§ăƒł: a reviewqubită«ă‚ˆă‚‹æłąć‹•é–ąæ•°ăźè™šæ™‚é–“ç™șć±•ăźă‚·ăƒŸăƒ„ăƒŹăƒŒă‚·ăƒ§ăƒł: a review
qubită«ă‚ˆă‚‹æłąć‹•é–ąæ•°ăźè™šæ™‚é–“ç™șć±•ăźă‚·ăƒŸăƒ„ăƒŹăƒŒă‚·ăƒ§ăƒł: a review
 
Openfermionă‚’äœżăŁăŸćˆ†ć­ăźèšˆçź— part I
Openfermionă‚’äœżăŁăŸćˆ†ć­ăźèšˆçź— part IOpenfermionă‚’äœżăŁăŸćˆ†ć­ăźèšˆçź— part I
Openfermionă‚’äœżăŁăŸćˆ†ć­ăźèšˆçź— part I
 
é‡ć­ă‚łăƒłăƒ”ăƒ„ăƒŒă‚żă§é‡ć­ćŒ–ć­ŠăźfullCIăŒè¶…é«˜é€Ÿă«ăȘる(かも
é‡ć­ă‚łăƒłăƒ”ăƒ„ăƒŒă‚żă§é‡ć­ćŒ–ć­ŠăźfullCIăŒè¶…é«˜é€Ÿă«ăȘる(ă‹ă‚‚é‡ć­ă‚łăƒłăƒ”ăƒ„ăƒŒă‚żă§é‡ć­ćŒ–ć­ŠăźfullCIăŒè¶…é«˜é€Ÿă«ăȘる(かも
é‡ć­ă‚łăƒłăƒ”ăƒ„ăƒŒă‚żă§é‡ć­ćŒ–ć­ŠăźfullCIăŒè¶…é«˜é€Ÿă«ăȘる(かも
 
20180723 é‡ć­ă‚łăƒłăƒ”ăƒ„ăƒŒă‚żăźé‡ć­ćŒ–ć­Šăžăźćżœç”š; Bravyi-KitaevćŸșćș•ăźćźŸèŁ…
20180723 é‡ć­ă‚łăƒłăƒ”ăƒ„ăƒŒă‚żăźé‡ć­ćŒ–ć­Šăžăźćżœç”š; Bravyi-KitaevćŸșćș•ăźćźŸèŁ…20180723 é‡ć­ă‚łăƒłăƒ”ăƒ„ăƒŒă‚żăźé‡ć­ćŒ–ć­Šăžăźćżœç”š; Bravyi-KitaevćŸșćș•ăźćźŸèŁ…
20180723 é‡ć­ă‚łăƒłăƒ”ăƒ„ăƒŒă‚żăźé‡ć­ćŒ–ć­Šăžăźćżœç”š; Bravyi-KitaevćŸșćș•ăźćźŸèŁ…
 
第11曞戆歐科歩 2017/9/17 Pubchemqcプロゾェクト
第11曞戆歐科歩 2017/9/17 Pubchemqcプロゾェクト第11曞戆歐科歩 2017/9/17 Pubchemqcプロゾェクト
第11曞戆歐科歩 2017/9/17 Pubchemqcプロゾェクト
 
Kobeworkshop pubchemqc project
Kobeworkshop pubchemqc projectKobeworkshop pubchemqc project
Kobeworkshop pubchemqc project
 
èšˆçź—ćŒ–ć­ŠćźŸçż’èŹ›ćș§:珏äșŒć›ž
 èšˆçź—ćŒ–ć­ŠćźŸçż’èŹ›ćș§:珏äșŒć›ž èšˆçź—ćŒ–ć­ŠćźŸçż’èŹ›ćș§:珏äșŒć›ž
èšˆçź—ćŒ–ć­ŠćźŸçż’èŹ›ćș§:珏äșŒć›ž
 
èšˆçź—ćŒ–ć­ŠćźŸçż’èŹ›ćș§:第侀曞
èšˆçź—ćŒ–ć­ŠćźŸçż’èŹ›ćș§:çŹŹäž€ć›žèšˆçź—ćŒ–ć­ŠćźŸçż’èŹ›ćș§:第侀曞
èšˆçź—ćŒ–ć­ŠćźŸçż’èŹ›ćș§:第侀曞
 
HOKUSAIăźăƒ™ăƒłăƒăƒžăƒŒă‚Ż ç†ç ”ă‚·ăƒłăƒă‚žă‚Šăƒ  侭田戆
HOKUSAIăźăƒ™ăƒłăƒăƒžăƒŒă‚Ż ç†ç ”ă‚·ăƒłăƒă‚žă‚Šăƒ  侭田戆HOKUSAIăźăƒ™ăƒłăƒăƒžăƒŒă‚Ż ç†ç ”ă‚·ăƒłăƒă‚žă‚Šăƒ  侭田戆
HOKUSAIăźăƒ™ăƒłăƒăƒžăƒŒă‚Ż ç†ç ”ă‚·ăƒłăƒă‚žă‚Šăƒ  侭田戆
 
ç‚șæ›żć–ćŒ•(FX)でたtickdataた抠淄ずMySQLで缡理
ç‚șæ›żć–ćŒ•(FX)でたtickdataた抠淄ずMySQLで缡理ç‚șæ›żć–ćŒ•(FX)でたtickdataた抠淄ずMySQLで缡理
ç‚șæ›żć–ćŒ•(FX)でたtickdataた抠淄ずMySQLで缡理
 
ç‚șæ›żăźTickdataをDukascopyă‹ă‚‰ăƒ€ă‚Šăƒłăƒ­ăƒŒăƒ‰ă™ă‚‹
ç‚șæ›żăźTickdataをDukascopyă‹ă‚‰ăƒ€ă‚Šăƒłăƒ­ăƒŒăƒ‰ă™ă‚‹ç‚șæ›żăźTickdataをDukascopyă‹ă‚‰ăƒ€ă‚Šăƒłăƒ­ăƒŒăƒ‰ă™ă‚‹
ç‚șæ›żăźTickdataをDukascopyă‹ă‚‰ăƒ€ă‚Šăƒłăƒ­ăƒŒăƒ‰ă™ă‚‹
 
HPCS2015 pythonă‚’ç”šă„ăŸé‡ć­ćŒ–ć­Šăƒ—ăƒ­ă‚°ăƒ©ăƒ ăźé–‹ç™șべ濜甹
HPCS2015 pythonă‚’ç”šă„ăŸé‡ć­ćŒ–ć­Šăƒ—ăƒ­ă‚°ăƒ©ăƒ ăźé–‹ç™șべ濜甹HPCS2015 pythonă‚’ç”šă„ăŸé‡ć­ćŒ–ć­Šăƒ—ăƒ­ă‚°ăƒ©ăƒ ăźé–‹ç™șべ濜甹
HPCS2015 pythonă‚’ç”šă„ăŸé‡ć­ćŒ–ć­Šăƒ—ăƒ­ă‚°ăƒ©ăƒ ăźé–‹ç™șべ濜甹
 
HPCS2015 ć€§èŠæšĄé‡ć­ćŒ–ć­Šèšˆçź—ăƒ—ăƒ­ă‚°ăƒ©ăƒ SMASHぼ開ç™șべ慬開(çŸłæ‘)
HPCS2015 ć€§èŠæšĄé‡ć­ćŒ–ć­Šèšˆçź—ăƒ—ăƒ­ă‚°ăƒ©ăƒ SMASHぼ開ç™șべ慬開(çŸłæ‘)HPCS2015 ć€§èŠæšĄé‡ć­ćŒ–ć­Šèšˆçź—ăƒ—ăƒ­ă‚°ăƒ©ăƒ SMASHぼ開ç™șべ慬開(çŸłæ‘)
HPCS2015 ć€§èŠæšĄé‡ć­ćŒ–ć­Šèšˆçź—ăƒ—ăƒ­ă‚°ăƒ©ăƒ SMASHぼ開ç™șべ慬開(çŸłæ‘)
 
The PubChemQC Project
The PubChemQC ProjectThe PubChemQC Project
The PubChemQC Project
 
3DプăƒȘăƒłă‚żć°Žć…„èš˜ă€€ă‚żăƒłăƒ‘ă‚ŻèłȘăźæšĄćž‹ă‚’ăƒ—ăƒȘントする
3DプăƒȘăƒłă‚żć°Žć…„èš˜ă€€ă‚żăƒłăƒ‘ă‚ŻèłȘăźæšĄćž‹ă‚’ăƒ—ăƒȘントする3DプăƒȘăƒłă‚żć°Žć…„èš˜ă€€ă‚żăƒłăƒ‘ă‚ŻèłȘăźæšĄćž‹ă‚’ăƒ—ăƒȘントする
3DプăƒȘăƒłă‚żć°Žć…„èš˜ă€€ă‚żăƒłăƒ‘ă‚ŻèłȘăźæšĄćž‹ă‚’ăƒ—ăƒȘントする
 

Recently uploaded

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Bun (KitWorks Team Study 녾별마룹 발표 2024.4.22)
Bun (KitWorks Team Study 녾별마룹 발표 2024.4.22)Bun (KitWorks Team Study 녾별마룹 발표 2024.4.22)
Bun (KitWorks Team Study 녾별마룹 발표 2024.4.22)Wonjun Hwang
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 

Recently uploaded (20)

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Bun (KitWorks Team Study 녾별마룹 발표 2024.4.22)
Bun (KitWorks Team Study 녾별마룹 발표 2024.4.22)Bun (KitWorks Team Study 녾별마룹 발표 2024.4.22)
Bun (KitWorks Team Study 녾별마룹 발표 2024.4.22)
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 

A fast implementation of matrix-matrix product in double-double precision on NVIDIA C2050 and application to semidefinite programming

  • 5. More accuracy is needed towards peta- and exa-scale computing. Exa-scale computing performs roughly 10^23 FLOP in just one week of calculation, so scientific computing may suffer from accumulated rounding error.
  • 8. More accuracy is needed towards peta- and exa-scale computing. Iterative methods in double precision sometimes do not even converge [Hasegawa 2007].
  • 10. More accuracy is needed towards peta- and exa-scale computing. In semidefinite programming (SDP) the condition number diverges at the optimum, so it may be very hard to obtain an accurate solution [Nakata et al 2008], [Nakata 2009], [Waki-Nakata-Muramatsu]. [Figure: the 1-norm and the estimated 1-norm condition number of the Schur complement matrix vs. number of iterations.]
  • 14. Accelerating high precision operations on a GPU is a good idea. Double-double precision is a cheap and fast route to high precision: accurate enough for many purposes (almost as accurate as quadruple precision); fast, since each operation costs only 8 to 24 double precision operations; and operation intensive, demanding relatively little memory bandwidth per FLOP. Implementing it on a GPU is attractive: fast (515 GFLOPS on an NVIDIA C2050 vs. 100 to 200 GFLOPS on a CPU), cheap (an NVIDIA C2050 costs about $2000, a workstation $5000 to $10000), and the algorithms do not require complex operations, which suits the GPU.
  • 22. The double-double precision: handy and easy quadruple precision. Per the "754-2008 IEEE Standard for Floating-Point Arithmetic", the binary64 (aka double precision) format has about 16 significant decimal digits. It is widely used and very fast (Core i7 920: ~40 GFLOPS; Radeon HD7970: ~1000 GFLOPS; K computer: over 10 PFLOPS). However, a rounding error may occur on every arithmetic operation.
  • 23. The double-double precision: handy and easy quadruple precision. A double-double number a is represented by a pair of double precision numbers a_hi and a_lo: a = (a_hi, a_lo), where the represented value is a_hi + a_lo.
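As a concrete picture, here is a minimal C++ sketch of this pair representation (the struct and field names are illustrative; MPACK actually builds on the QD library's dd_real class):

```cpp
// Minimal sketch of a double-double value: an unevaluated sum of two doubles.
struct dd {
    double hi;  // leading part: the double nearest to the full value
    double lo;  // trailing part: the error, |lo| <= 0.5 ulp(hi)
};
// The represented value is exactly hi + lo, giving ~32 significant digits.
```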
  • 24. The double-double precision: handy and easy quadruple precision. Knuth's theorem (error-free transformation of the sum of two floating point numbers a, b): a + b = (a ⊕ b) + e, where ⊕ is addition including rounding errors, + is exact addition, and e is a floating point number. We can evaluate the rounding error of addition exactly!
  • 25. The double-double precision: handy and easy quadruple precision. Dekker's theorem (error-free transformation of the product of two floating point numbers a, b): a × b = (a ⊗ b) + e, where ⊗ is the multiplication operator with rounding errors, × is exact multiplication, and e is a floating point number. We can evaluate the rounding error of multiplication exactly!
  • 26. The double-double precision: handy and easy quadruple precision. Based on Knuth's theorem we can define "Quick-Two-Sum(a, b)", where a, b are floating point numbers and ⊕, ⊖ are operators including rounding errors: when |a| ≄ |b|, we can calculate s = (a ⊕ b) and e = a + b − (a ⊕ b) exactly in three operations. Quick-Two-Sum(a, b): 1. s ← a ⊕ b; 2. e ← b ⊖ (s ⊖ a); 3. return (s, e).
  • 27. The double-double precision: handy and easy quadruple precision. Also based on Knuth's theorem, "Two-Sum(a, b)" calculates s = (a ⊕ b) and e = a + b − (a ⊕ b) exactly in six operations, without any assumption on the magnitudes of a and b. Two-Sum(a, b): 1. s ← a ⊕ b; 2. v ← s ⊖ a; 3. e ← (a ⊖ (s ⊖ v)) ⊕ (b ⊖ v); 4. return (s, e).
  • 28. The double-double precision: handy and easy quadruple precision. Basics: Dekker's theorem. There exists an algorithm that calculates s = (a ⊗ b) and e = a × b − (a ⊗ b), where ⊗ is the multiplication operator with rounding errors, using the following "Split(a)" in four operations and "Two-Prod(a, b)" in 17 operations.
Split(a): 1. t ← (2^27 + 1) ⊗ a; 2. a_hi ← t ⊖ (t ⊖ a); 3. a_lo ← a ⊖ a_hi; 4. return (a_hi, a_lo).
Two-Prod(a, b): 1. p ← a ⊗ b; 2. (a_hi, a_lo) ← Split(a); 3. (b_hi, b_lo) ← Split(b); 4. e ← ((a_hi ⊗ b_hi ⊖ p) ⊕ a_hi ⊗ b_lo ⊕ a_lo ⊗ b_hi) ⊕ a_lo ⊗ b_lo; 5. return (p, e).
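The three error-free transformations of slides 26-28 translate almost line for line into C++. Below is a minimal, self-contained sketch (function names are mine; QD provides equivalents internally), assuming IEEE double arithmetic in round-to-nearest with no extended precision:

```cpp
#include <cstdio>

// Quick-Two-Sum: requires |a| >= |b|; 3 operations.
static void quick_two_sum(double a, double b, double &s, double &e) {
    s = a + b;
    e = b - (s - a);             // exact error of the rounded sum
}

// Two-Sum: no ordering assumption; 6 operations (Knuth).
static void two_sum(double a, double b, double &s, double &e) {
    s = a + b;
    double v = s - a;
    e = (a - (s - v)) + (b - v);
}

// Split: cut a double into two non-overlapping 26-bit halves; 4 operations.
static void split(double a, double &hi, double &lo) {
    double t = 134217729.0 * a;  // (2^27 + 1) * a
    hi = t - (t - a);
    lo = a - hi;
}

// Two-Prod: p = fl(a*b) and its exact error e; 17 operations (Dekker).
static void two_prod(double a, double b, double &p, double &e) {
    p = a * b;
    double ahi, alo, bhi, blo;
    split(a, ahi, alo);
    split(b, bhi, blo);
    e = ((ahi * bhi - p) + ahi * blo + alo * bhi) + alo * blo;
}

int main() {
    double s, e;
    two_sum(1.0, 1e-30, s, e);   // the tiny addend survives in e
    std::printf("s = %g, e = %g\n", s, e);
    double p, q;
    two_prod(1.0 + 1e-8, 1.0 - 1e-8, p, q);
    std::printf("p = %.17g, e = %.17g\n", p, q);
    return 0;
}
```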
  • 29. The double-double precision: handy and easy quadruple precision. Addition of two double-double numbers can be done in 20 FLOPS by the following "QuadAdd-IEEE". QuadAdd-IEEE(a, b): 1. (s_hi, e_hi) = Two-Sum(a_hi, b_hi); 2. (s_lo, e_lo) = Two-Sum(a_lo, b_lo); 3. e_hi = e_hi ⊕ s_lo; 4. (s_hi, e_hi) = Quick-Two-Sum(s_hi, e_hi); 5. e_hi = e_hi ⊕ e_lo; 6. (c_hi, c_lo) = Quick-Two-Sum(s_hi, e_hi); 7. return c.
  • 30. The double-double precision: handy and easy quadruple precision. Multiplication of two double-double numbers can be done in 24 FLOPS by the following "QuadMul". QuadMul(a, b): 1. (p_hi, p_lo) = Two-Prod(a_hi, b_hi); 2. p_lo = p_lo ⊕ (a_hi ⊗ b_lo ⊕ a_lo ⊗ b_hi); 3. (c_hi, c_lo) = Quick-Two-Sum(p_hi, p_lo); 4. return c.
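Putting slides 29 and 30 together, a compact C++ sketch of the two double-double primitives; the error-free helpers from the previous sketch are repeated so the block compiles on its own, and the names remain illustrative rather than QD's exact internals:

```cpp
#include <cstdio>

struct dd { double hi, lo; };  // unevaluated sum hi + lo

static void quick_two_sum(double a, double b, double &s, double &e) {
    s = a + b; e = b - (s - a);
}
static void two_sum(double a, double b, double &s, double &e) {
    s = a + b; double v = s - a; e = (a - (s - v)) + (b - v);
}
static void two_prod(double a, double b, double &p, double &e) {
    p = a * b;
    double t = 134217729.0 * a, ahi = t - (t - a), alo = a - ahi;
    t = 134217729.0 * b;
    double bhi = t - (t - b), blo = b - bhi;
    e = ((ahi * bhi - p) + ahi * blo + alo * bhi) + alo * blo;
}

// QuadAdd-IEEE (slide 29): 20 FLOPS.
static dd quad_add_ieee(dd a, dd b) {
    double shi, ehi, slo, elo;
    two_sum(a.hi, b.hi, shi, ehi);        // step 1
    two_sum(a.lo, b.lo, slo, elo);        // step 2
    ehi += slo;                           // step 3
    quick_two_sum(shi, ehi, shi, ehi);    // step 4
    ehi += elo;                           // step 5
    dd c;
    quick_two_sum(shi, ehi, c.hi, c.lo);  // step 6
    return c;
}

// QuadMul (slide 30): 24 FLOPS without FMA.
static dd quad_mul(dd a, dd b) {
    double phi, plo;
    two_prod(a.hi, b.hi, phi, plo);       // step 1
    plo += a.hi * b.lo + a.lo * b.hi;     // step 2
    dd c;
    quick_two_sum(phi, plo, c.hi, c.lo);  // step 3
    return c;
}

int main() {
    dd x{1.0, 1e-17}, y{3.0, 0.0};
    dd s = quad_add_ieee(x, y), p = quad_mul(x, y);
    std::printf("sum  = %.17g + %.17g\n", s.hi, s.lo);
    std::printf("prod = %.17g + %.17g\n", p.hi, p.lo);
    return 0;
}
```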
  • 31. The double-double precision: handy and easy quadruple precision. The FMA (fused multiply-add) instruction calculates a × b + c in one command: the product and sum are formed exactly, then rounded once to double precision.
  • 32. The double-double precision: handy and easy quadruple precision. Faster: using the FMA instruction, Two-Prod becomes 3 operations (17 without FMA), and QuadMul(-FMA) can be done in only 10 operations (24 without FMA). Two-Prod-FMA(a, b): 1. p ← a ⊗ b; 2. e ← FMA(a × b − p); 3. return (p, e).
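A minimal sketch of Two-Prod-FMA using std::fma from <cmath>; the residual is exact because the FMA rounds a × b + c only once:

```cpp
#include <cmath>    // std::fma
#include <cstdio>

// Two-Prod-FMA (slide 32): fma(a, b, -p) is exactly the rounding error
// of p = a*b, so the 17-operation Dekker product collapses to 3 ops.
static void two_prod_fma(double a, double b, double &p, double &e) {
    p = a * b;
    e = std::fma(a, b, -p);
}

int main() {
    double p, e;
    two_prod_fma(1.0 + 1e-8, 1.0 - 1e-8, p, e);
    std::printf("p = %.17g, e = %.17g\n", p, e);
    return 0;
}
```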
  • 33. The double-double precision: handy and easy quadruple precision. Faster: lower accuracy operations.
QuadMul-Sloppy(a, b): 1. p = a_hi ⊗ b_lo; 2. q = a_lo ⊗ b_hi; 3. t = p ⊕ q; 4. c_hi = FMA(a_hi × b_hi + t); 5. e = FMA(a_hi × b_hi − c_hi); 6. c_lo = e ⊕ t; 7. return c.
QuadAdd-Cray(a, b): 1. (c_hi, c_lo) = Two-Sum(a_hi, b_hi); 2. c_lo = c_lo ⊕ (a_lo ⊕ b_lo); 3. (c_hi, c_lo) = Quick-Two-Sum(c_hi, c_lo); 4. return c.
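A sketch of the two relaxed primitives, assuming the same dd pair type as before: QuadAdd-Cray drops the rounding error of a_lo ⊕ b_lo, and QuadMul-Sloppy ignores the a_lo ⊗ b_lo term, which is what buys the lower operation counts:

```cpp
#include <cmath>
#include <cstdio>

struct dd { double hi, lo; };

static void quick_two_sum(double a, double b, double &s, double &e) {
    s = a + b; e = b - (s - a);
}
static void two_sum(double a, double b, double &s, double &e) {
    s = a + b; double v = s - a; e = (a - (s - v)) + (b - v);
}

// QuadAdd-Cray (slide 33): 11 FLOPS; the error of a.lo + b.lo is dropped.
static dd quad_add_cray(dd a, dd b) {
    dd c;
    two_sum(a.hi, b.hi, c.hi, c.lo);
    c.lo += a.lo + b.lo;
    quick_two_sum(c.hi, c.lo, c.hi, c.lo);
    return c;
}

// QuadMul-Sloppy (slide 33): 8 FLOPS counting each FMA as 2;
// the a.lo * b.lo term is ignored.
static dd quad_mul_sloppy(dd a, dd b) {
    double p = a.hi * b.lo;
    double q = a.lo * b.hi;
    double t = p + q;
    dd c;
    c.hi = std::fma(a.hi, b.hi, t);          // fl(a.hi*b.hi + t)
    double e = std::fma(a.hi, b.hi, -c.hi);  // exact residual of a.hi*b.hi
    c.lo = e + t;
    return c;
}

int main() {
    dd x{1.0, 1e-17}, y{2.0, 0.0};
    dd s = quad_add_cray(x, y), p = quad_mul_sloppy(x, y);
    std::printf("sum  = %.17g + %.17g\n", s.hi, s.lo);
    std::printf("prod = %.17g + %.17g\n", p.hi, p.lo);
    return 0;
}
```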
  • 34. The double-double precision: handy and easy quadruple precision. Summary of operation counts for each double-double arithmetic primitive: Quick-Two-Sum 3; Two-Sum 6; Split 4; Two-Prod 17; Two-Prod-FMA 3*; QuadAdd-IEEE 20; QuadAdd-Cray 11; QuadMul 24; QuadMul-FMA 10*; QuadMul-FMA-Sloppy 8* (* counting an FMA as 2 FLOPS). We used QuadAdd-IEEE and QuadMul-FMA when not explicitly stated otherwise.
  • 35. The double-double precision: handy and easy quadruple precision. The QD library. Features: a C++ class; the double-double precision type is "dd_real"; free software. Authors: Yozo Hida, Xiaoye S. Li, David H. Bailey. Download: http://crd.lbl.gov/~dhbailey/mpdist/ Paper: http://crd.lbl.gov/~dhbailey/dhbpapers/arith15.pdf (Yozo Hida, Xiaoye S. Li, David H. Bailey, "Quad-Double Arithmetic: Algorithms, Implementation, and Application", Technical Report LBNL-46996, Lawrence Berkeley National Laboratory, 2000).
  • 36. Implementation on GPU and performance evaluation. We accelerated the matrix-matrix multiplication routine called "Rgemm". Prototype of Rgemm:
void Rgemm(const char *transa, const char *transb, mpackint m, mpackint n, mpackint k, dd_real alpha, dd_real *A, mpackint lda, dd_real *B, mpackint ldb, dd_real beta, dd_real *C, mpackint ldc)
"MPACK" by M. Nakata is the multiple precision version of BLAS and LAPACK (the de facto standard linear algebra packages): http://mplapack.sourceforge.net/. "Rgemm" corresponds to "dgemm" and "sgemm" of BLAS.
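A minimal usage sketch of Rgemm with the prototype above. The header names are assumptions (they may differ between MPACK versions), and initialization and error handling are elided:

```cpp
#include <qd/dd_real.h>   // dd_real from the QD library
#include <mblas_dd.h>     // MPACK's double-double BLAS; header name is an
                          // assumption and may differ by MPACK version

int main() {
    mpackint n = 512;
    // Column-major n x n double-double matrices.
    dd_real *A = new dd_real[n * n];
    dd_real *B = new dd_real[n * n];
    dd_real *C = new dd_real[n * n];
    for (mpackint i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    dd_real alpha = 1.0, beta = 0.0;
    // C <- alpha * A * B + beta * C, exactly like dgemm but in dd_real.
    Rgemm("N", "N", n, n, n, alpha, A, n, B, n, beta, C, n);

    delete[] A; delete[] B; delete[] C;
    return 0;
}
```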
  • 37. Implementation on GPU and performance evaluation. Related work: D. Mukunoki and D. Takahashi, "Implementation of double-double matrix matrix multiplication on GPU", HPCS, pp. 148-156 (2011): matrix size must be a multiple of 64; slower than our implementation. N. Nakasato, "A Fast GEMM Implementation On a Cypress GPU", Performance Modeling, Benchmark and Simulation of High Performance Computing Systems, Louisiana, USA, 2010: matrix size must be a multiple of 64; faster than our implementation. Neither restriction is practical, so we implemented Rgemm for general use.
  • 38. Implementation on GPU and evaluation. [Figure: NVIDIA C2050 architecture.]
  • 39. Implementation on GPU and evaluation. Block algorithm: we divide the matrices into small blocks of sizes b_M, b_K, b_N; we used b_M = b_K = 16 and b_N = 64.
  • 40. Implementation on GPU and evaluation. Basic algorithm: 1. Transfer the A, B, C matrices from CPU memory to GPU global memory. 2. Blocking: Ab is 16 × 16 and Bb is 16 × 64 (the most efficient choice). 3. Assign a 16 × 16 = 256-thread block to each tile; the (i, j)-th thread of a block works on the i-th row of Ab and the j, j+16, j+32, j+48-th columns of Bb (four columns at a time).
  • 41. Implementation on GPU and evaluation. Operation of each thread in detail (a sketch follows below): 1. Multiply by beta the elements c0, c1, c2, c3 of the C matrix that correspond to the i-th row of Ab and the j, j+16, j+32, j+48-th columns of Bb. 2. Read the first blocks Ab and Bb from global memory to shared memory; each thread of the block reads its own elements. 3. Compute the inner products of the row vector a_i of Ab with the columns b_j, b_{j+16}, b_{j+32}, b_{j+48} of Bb, giving p0, p1, p2, p3. 4. Update c0, c1, c2, c3, e.g. c0 ← c0 + α p0. 5. Read the next blocks Ab, Bb and repeat steps 3 and 4 until no blocks remain. 6. Write c0, c1, c2, c3 back to the C matrix. 7. Finally, transfer the C matrix from GPU global memory back to the CPU.
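Here is an illustrative CPU-side C++ sketch of steps 3 and 4 for a single (i, j) thread, using plain double instead of double-double so the blocking structure stays visible; the names and layout are mine, not the actual kernel's:

```cpp
// Ab is the 16x16 A-block, Bb the 16x64 B-block (names as on the slides).
const int bM = 16, bK = 16, bN = 64;

void thread_body(const double Ab[bM][bK], const double Bb[bK][bN],
                 double alpha, int i, int j,
                 double &c0, double &c1, double &c2, double &c3) {
    double p0 = 0, p1 = 0, p2 = 0, p3 = 0;
    for (int l = 0; l < bK; l++) {
        double a = Ab[i][l];   // one load of A serves four columns of B
        p0 += a * Bb[l][j];
        p1 += a * Bb[l][j + 16];
        p2 += a * Bb[l][j + 32];
        p3 += a * Bb[l][j + 48];
    }
    c0 += alpha * p0;          // step 4: accumulate into the C entries
    c1 += alpha * p1;
    c2 += alpha * p2;
    c3 += alpha * p3;
}
```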
  • 42. Implementation on GPU and evaluation. Performance of the double-double matrix-matrix product for square matrices (m = n = k, varying m): maximum kernel performance was 16.4 GFLOPS, and 16.1 GFLOPS with CPU-GPU transfer included. [Figure: GFLOPS vs. dimension, NN kernel and total.]
  • 43. Implementation on GPU and evaluation. Performance of the double-double matrix-matrix product with matrix transposes, square matrices (m = n = k, varying m): no performance loss from transposes is observed. [Figure: GFLOPS vs. dimension for NN, NT, TN, TT, kernel and total.]
  • 44. Implementation on GPU and evaluation. We observed no performance loss with matrix transposes because we fetch through texture memory. Global memory and texture memory are physically the same, but with texture fetches the penalty for non-coalesced memory access is small. Moreover, memory-transfer latency is relatively easy to hide in double-double precision because the arithmetic is operation intensive (cf. QuadAdd-IEEE requires 20 FLOPS, QuadMul-FMA 10 FLOPS).
  • 45. Implementation on GPU and evaluation. "Pointer redirecting", from "Accelerating GPU kernels for dense linear algebra" by Rajib Nath, Stanimire Tomov, and Jack Dongarra. A large performance loss (about 35%) is observed when the matrix size is not a multiple of 64.
  • 46. Implementation on GPU and evaluation. "Pointer redirecting", from "Accelerating GPU kernels for dense linear algebra" by Rajib Nath, Stanimire Tomov, and Jack Dongarra. The idea is simple: if a pointer falls outside the matrix block, return the value at the nearest edge instead. A very simple program change with only a small performance loss. A breakthrough!! (A sketch follows below.)
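A minimal C++ sketch of the redirected read, assuming a column-major matrix: an out-of-range index is clamped to the nearest edge so a full 16 × 16 tile can always be loaded without shape-dependent branches, and the kernel simply never writes back results for out-of-range C entries. Names are illustrative:

```cpp
#include <algorithm>

// Redirected read from an m x k column-major matrix A with leading
// dimension lda. Indices past the matrix edge are clamped to the edge,
// mirroring the "return the value of the nearest edge" rule above.
inline double redirected_read(const double *A, int lda, int m, int k,
                              int i, int j) {
    int ii = std::min(i, m - 1);   // redirect past-the-end rows
    int jj = std::min(j, k - 1);   // redirect past-the-end columns
    return A[ii + (long)jj * lda];
}
```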
  • 47. Implementation on GPU and evaluation. With pointer redirecting, the performance loss was reduced from 35% to 6%!! [Figure: kernel and total GFLOPS vs. dimension around 2050-2250.]
  • 48. Implementation on GPU and evaluation. Performance varied by only 0.1% over repeated calculations. [Figure: total GFLOPS over 100 repeated measurements.]
  • 49. Implementation on GPU and evaluation. Using the less accurate operations, we attained 26.4 GFLOPS. [Figure: GFLOPS vs. dimension for all combinations of QuadMul-Sloppy/FMA and QuadAdd-Cray/IEEE, kernel and total.]
  • 50. Implementation on GPU and evaluation. Using the less accurate operations, we attained 26.4 GFLOPS. "CPU" denotes measurements on a Xeon 3470 with DDR3-1066. Algorithm / performance:
QuadAdd-Cray, QuadMul-Sloppy: kernel 26.4 GFLOPS, total 25.7 GFLOPS.
QuadAdd-Cray, QuadMul: kernel 23.0 GFLOPS, total 22.4 GFLOPS.
QuadAdd-IEEE, QuadMul-Sloppy: kernel 18.1 GFLOPS, total 17.8 GFLOPS.
QuadAdd-IEEE, QuadMul: kernel 16.4 GFLOPS, total 16.1 GFLOPS.
QuadAdd-IEEE, QuadMul on CPU: 100 MFLOPS (400 MFLOPS with OpenMP).
  • 51. Implementation on GPU and evaluation. 16.1 GFLOPS is 92.4% (or 46.2%) of peak performance (QuadAdd-IEEE, QuadMul-FMA). Average FLOP count per double-double operation: QuadAdd-IEEE takes 20 ops and QuadMul-FMA 10 ops, and in Rgemm the same number of multiplications and additions appear, giving (20 + 10 - 1)/2 = 14.5. The approximate theoretical peak should then be 515 GFLOPS / 14.5 = 35.5 GFLOPS. However, the C2050's 515 GFLOPS peak assumes full use of FMA, which our calculation does not achieve, so the attainable peak is 515 GFLOPS / 14.5 / 2 = 17.8 GFLOPS.
  • 52. Application: 10x acceleration of the semidefinite programming solver "SDPA-DD".
  • 53. Application: 10x acceleration of the semidefinite programming solver "SDPA-DD". Semidefinite programming:
Primal: min A_0 ‱ X s.t. A_i ‱ X = b_i (i = 1, 2, ..., m), X ⪰ 0.
Dual: max ∑_{i=1}^{m} b_i z_i s.t. ∑_{i=1}^{m} A_i z_i + Y = A_0, Y ⪰ 0.
Here the A_i are n × n symmetric matrices, X and Y are n × n symmetric variable matrices, b is an m-dimensional vector, X ‱ Y := ∑_{ij} X_ij Y_ij, and X ⪰ 0 means X is positive semidefinite (all eigenvalues are larger than or equal to 0).
  • 54. Application: 10x acceleration of the semidefinite programming solver "SDPA-DD". Nature of optimality. Theorem (complementary slackness): if (X*, Y*, z*) is a feasible interior point satisfying the conditions of the primal and dual SDPs, then the necessary and sufficient condition for optimality of (X*, Y*, z*) is X* ‱ Y* = 0.
  • 55. Application: 10x acceleration of the semidefinite programming solver "SDPA-DD". When X*, Y* are optimal, X* ‱ Y* = 0, and hence rank X* + rank Y* ≀ n (1) also follows. Thus at least one of X*, Y* is singular, and usually both are: the computation becomes unstable and/or less accurate near the optimum.
  • 56. How to solve SDP: the interior point primal-dual path-following method. World-class implementations, SDPA and SDPARA, are available from the SDPA group led by Prof. Fujisawa.
Step 0: Set the initial points x^0, X^0, Y^0 with X^0 ≻ 0, Y^0 ≻ 0; let h = 0 and choose a parameter Îł ∈ (0, 1).
Step 1: Calculate the Schur complement matrix B ∈ S^m: B_ij = ((X^h)^{-1} F_i Y^h) ‱ F_j.
Step 2: Solve the linear equation B dx = r, and calculate dX, dY from the solution dx; we then obtain the next step (dx, dX, dY).
Step 3: Determine the step size α keeping the matrices positive semidefinite: α = max{α ∈ [0, 1] : X^h + α dX ⪰ 0, Y^h + α dY ⪰ 0}.
Step 4: Update the current point: (x^{h+1}, X^{h+1}, Y^{h+1}) = (x^h, X^h, Y^h) + γα (dx, dX, dY).
Step 5: If (x^{h+1}, X^{h+1}, Y^{h+1}) satisfies the stopping requirements, the iteration ends; otherwise increment h = h + 1 and go back to Step 1. (A sketch of the Step 1 matrix products appears below.)
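Step 1 is where the double-double Rgemm dominates the runtime. Here is a schematic C++ sketch of forming one row of B with two Rgemm calls; the function name, work arrays, and memory layout are illustrative, not SDPA-DD's actual code:

```cpp
#include <qd/dd_real.h>
#include <mblas_dd.h>   // Rgemm; header name is an assumption

// Schematic: B_ij = ((X^h)^{-1} F_i Y^h) . F_j for one fixed i.
// Xinv, Y, F[0..m-1] are n x n column-major dd_real matrices, B is m x m,
// and T1, T2 are n x n work arrays.
void schur_row(dd_real *B, dd_real **F, dd_real *Xinv, dd_real *Y,
               dd_real *T1, dd_real *T2, mpackint n, mpackint m,
               mpackint i) {
    dd_real one = 1.0, zero = 0.0;
    // T1 <- Xinv * F_i                 (first Rgemm)
    Rgemm("N", "N", n, n, n, one, Xinv, n, F[i], n, zero, T1, n);
    // T2 <- T1 * Y = Xinv * F_i * Y    (second Rgemm)
    Rgemm("N", "N", n, n, n, one, T1, n, Y, n, zero, T2, n);
    // B_ij <- T2 . F_j (elementwise inner product over all n*n entries)
    for (mpackint j = 0; j < m; j++) {
        dd_real s = 0.0;
        for (mpackint l = 0; l < n * n; l++) s += T2[l] * F[j][l];
        B[i + j * m] = s;
    }
}
```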
  • 57. The Schur complement matrix becomes singular. B is called the "Schur complement matrix", and we solve the linear equation B dx = r to determine the next step. This linear equation becomes singular! Multiple precision arithmetic is needed for accurate solutions! [Figure: the 1-norm and the estimated 1-norm condition number of the Schur complement matrix vs. number of iterations.]
  • 58. Application: 10x acceleration of the semidefinite programming solver "SDPA-DD". Benchmark results on the larger problems from SDPLIB (a problem archive); CPU: Xeon 3470, DDR3-1066. Problem / CPU (sec) / GPU (sec) / acceleration:
equalG51: 6531.9 / 573.2 / 11.4
gpp500-1: 902.0 / 72.2 / 12.5
gpp500-4: 638.0 / 74.8 / 8.5
maxG32: 36284.4 / 4373.1 / 8.3
maxG55: 521575.4 / 53413.1 / 9.8
mcp500-4: 539.1 / 65.2 / 8.3
qpG11: 16114.7 / 1408.0 / 11.4
qpG51: 39678.9 / 3299.2 / 12.0
ss30: 310.7 / 138.6 / 2.2
theta5: 3250.0 / 239.8 / 13.6
theta6: 9028.2 / 623.6 / 14.5
thetaG51: 49161.5 / 4870.4 / 10.1
  • 59. Summary. Matrix-matrix multiplication in double-double precision on the NVIDIA C2050 GPU: 150x faster than the CPU, with a peak performance of 26 GFLOPS. Code is available at http://mplapack.sourceforge.net/. [Figure: GFLOPS vs. dimension for all QuadMul/QuadAdd variants, kernel and total.]