CUDA and Fermi
Optimization Techniques
   Hyungon Ryu, PSG SA
Agenda

         CUDA 101 (general overview)
         A toy FDM example (1D heat equation)
         Monte Carlo simulation & time-series analysis
         General CUDA optimization tips




CUDA 101


General Overview
CUDA 101

   CUDA = Graphics (GPU) + Parallel Programming:
   parallel programming on graphics hardware.
CUDA Parallel Programming

         GPU knowledge: Cg, OpenGL, DirectX
         Parallel knowledge: Pthreads / winthreads, MMX / SSE, OpenMP, PVM / MPI
         Heterogeneous knowledge: GPU, parallel DSP, parallel ASIC, parallel FPGA, Cell BE

   CUDA: parallel computing with the GPU
Parallel Model in OpenMP

   [Figure: the CPU program forks into parallel threads and joins back -- the fork/join model.]
CUDA Parallel Model

   [Figure: the CPU program launches a kernel; a grid of GPU threads executes it in parallel.]
Saxpy Example : CPU serial


       for (int i = 0; i < n; ++i) {
             y[i] = a*x[i] + y[i];
       }




Example of Saxpy Parallel : OpenMP
   int i;
   #pragma omp parallel shared(n, a, x, y) private(i)
   #pragma omp for
   for (i = 0; i < n; ++i) {
         y[i] = a*x[i] + y[i];
   }
Example of Saxpy Parallel : MPI
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

/* split [lowest, highest] as evenly as possible across nprocs ranks */
void para_range(int lowest, int highest, int nprocs, int myrank,
                int *start, int *end) {
    int wk1 = (highest - lowest + 1) / nprocs;
    int wk2 = (highest - lowest + 1) % nprocs;
    *start = myrank * wk1 + lowest + ((myrank < wk2) ? myrank : wk2);
    *end = *start + wk1 - 1;
    if (wk2 > myrank) *end = *end + 1;
}

/* para_range returns an inclusive range, so loop through i == end */
for (i = start; i <= end; i++) {
    y[i] = a * x[i] + y[i];
}
Example of Saxpy Parallel : SSE
void saxpy_vector(short *z, short *x, short *y, short a, unsigned n) {
    __m128i *x_ptr = (__m128i*) x;
    __m128i *y_ptr = (__m128i*) y;
    __m128i *z_ptr = (__m128i*) z;
    __m128i a_vec  = _mm_set1_epi16(a);     /* broadcast a into all 8 lanes */
    for (unsigned i = 0; i < n / 8; ++i) {
        __m128i x_vec = x_ptr[i];
        __m128i y_vec = y_ptr[i];
        __m128i z_vec = _mm_add_epi16(_mm_mullo_epi16(x_vec, a_vec), y_vec);
        z_ptr[i] = z_vec;
    }
}
Saxpy Parallel : CUDA

__global__ void Saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

                                   Saxpy <<<N, M>>> (n, 2.0f, x, y);   // N blocks of M threads, one thread per element
CUDA C extension
Launch the kernel
     Function <<< Grid, Block >>> ( parameter );

Additional C standard API for memory control
      cudaXXX (runtime API) : cudaMalloc, cudaMemcpy, ...
      cuXXX   (driver API)  : cuMemAlloc, cuMemcpyHtoD, ...
      cutXXX  (SDK cutil helpers)

For functions
      __global__, __device__, __host__, __device__ __host__
For memory
      __shared__, __device__, registers / local

Pre-defined variables
      blockDim, blockIdx, threadIdx (plus constants such as cudaMemcpyHostToDevice)

Pre-defined functions
      __syncthreads(), __mul24(), etc.
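A minimal end-to-end sketch tying these extensions together (the kernel, sizes, and names here are illustrative, and error checking is omitted):

   #include <cstdio>

   __global__ void scale(float *d, float a)        // __global__: a kernel, launched from the host
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       d[i] = a * d[i];
   }

   int main()
   {
       const int n = 256;
       float h[n], *d;
       for (int i = 0; i < n; ++i) h[i] = (float)i;

       cudaMalloc((void**)&d, n * sizeof(float));
       cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

       scale<<<2, 128>>>(d, 2.0f);                 // grid of 2 blocks, 128 threads each

       cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
       printf("h[10] = %f\n", h[10]);              // expect 20.0
       cudaFree(d);
       return 0;
   }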
Process of CUDA developing

   Serial:
      Algorithm → serial programming → compile → debugging → release

   CUDA parallel:
      Algorithm → serial programming → compile → CUDA convert →
      profile → parallelize → compile → debugging → optimize/profile → release
CUDA is Parallel Computing !!!

   Serial:
      Algorithm → serial programming → compile → debugging → release

   MPI parallel:
      Algorithm → serial programming → compile → parallel programming →
      profile → parallelize → compile → debugging [totalview] → optimize → release → mpirun

   CUDA parallel:
      Algorithm → serial programming → compile → CUDA convert →
      profile → parallelize → compile → debugging → optimize/profile → release
   [Figure: a kernel launch maps thread blocks onto the GPU. Each TPC holds a geometry
   controller, an SMC, and several SMs; each SM has an I-cache, MT issue unit, C-cache,
   8 SPs, 2 SFUs, a DP unit, and shared memory, plus a shared texture unit with its own
   L1. Each block runs on an SM, each thread on an SP.]
Image Processing Diagram without CUDA

   [Figure: the camera feeds a capture stage on the CPU, then a per-pixel for loop runs
   on the CPU; the loop dominates the total time.]
Image Processing Diagram with CUDA

   [Figure: the camera feeds a capture stage on the CPU, the data is copied to the GPU
   (memcpy), and the per-pixel loop is replaced by a CUDA parallel kernel; the total
   time shrinks accordingly.]
1D Heat Equation
       A CUDA toy for undergraduate students

1D Heat Equation

   [Figure: a 1D bar with a heat source at one end.]
1D Heat Equation

   $\dfrac{\partial u}{\partial t} = \dfrac{\partial^2 u}{\partial x^2}$        1D heat equation

   $u(0,t) = u(1,t) = 0$                                                        Boundary condition

   $u(x,0) = u_0$                                                               Initial condition
Discretization

   $\dfrac{\partial u}{\partial t} = \dfrac{\partial^2 u}{\partial x^2}$

Discretization (forward difference in time, second-order central difference in space; i indexes time, j space):

   $\dfrac{u_{j,i+1} - u_{j,i}}{\Delta t} = \dfrac{u_{j+1,i} - 2u_{j,i} + u_{j-1,i}}{\Delta x^2}$

Relation (explicit method):

   $u_{j,i+1} = r\,u_{j-1,i} + (1 - 2r)\,u_{j,i} + r\,u_{j+1,i}$,   with $r = \Delta t / \Delta x^2$

      u[i+1][j] = r*u[i][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1];
Discretization (stencil)

   $u_{j,i+1} = r\,u_{j-1,i} + (1 - 2r)\,u_{j,i} + r\,u_{j+1,i}$

   [Figure: the three-point stencil -- the new value at (j, i+1) combines (j-1, i) and
   (j+1, i) with weight r and (j, i) with weight (1-2r).]
Conceptual Diagram

   [Figure: space (0..13) on the vertical axis with the boundary condition at both
   ends, time (0..600) on the horizontal axis; starting from the initial condition at
   time 0, each update applies the three-point stencil
      u[i+1][j] = r*u[i][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1];
   to advance one time column at a time.]
Explicit Pseudo Code

         Parameter and data initialization
                      stencil, boundary/initial condition

         FOR LOOP (time, i)
            FOR LOOP (stencil, j)
                 update the stencil relation
                         u[i+1][j] = r*u[i][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1];
         Results
CPU code

         u[i][j] vs. u[j]
                      u[i][j] is easy to develop
                         and makes it possible to visualize the whole process

                      u[j] uses memory efficiently
                         and keeps only the latest result
Main algorithm

   for (i = 0; i < N; i++) {                               // time iteration

                for (j = 1; j <= M; j++) {                 // space iteration

                      u[i+1][j] =          r * u[i][j-1]
                                   + (1-2*r) * u[i][j]
                                   +        r * u[i][j+1];
                }
   }
                        Heat relation
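One caveat worth adding (a standard property of this scheme, not stated on the slide): the explicit method is stable only when $r = \Delta t / \Delta x^2 \le 1/2$; a larger time step makes the solution oscillate and diverge.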
Boundary Condition

Method 1 (fixed boundary): copy the prescribed boundary values into the ghost cells each step.

Method 2 (free boundary): copy a computed interior value outward, e.g. u[N+1] = u[N-1].
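A minimal sketch of the two treatments on a u[0..N+1] ghost-cell array (the function and variable names are assumptions, not from the slides):

   void apply_boundary(double *u, int N, double u_left, double u_right, int method)
   {
       if (method == 1) {        /* fixed: ghost cells hold the prescribed values */
           u[0]   = u_left;
           u[N+1] = u_right;
       } else {                  /* free: mirror the interior value outward       */
           u[N+1] = u[N-1];
       }
   }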
How to Parallelize

   Parallelize the space stencil:
   each thread (core) is dedicated to one element of the space stencil.

   [Figure: the same space-time grid as before; all space points of one time step
   update in parallel.]

   It is not possible to parallelize the time stencil: each time step depends on the
   previous one.
How to parallelize

         Parameter and data initialization
                      stencil, boundary/initial condition
         FOR LOOP (time, i)                ← stays serial on the host
             FOR LOOP (stencil, j)         ← parallelized across GPU threads (see the sketch below)
                  update the stencil relation
         Results
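A hedged sketch of what the parallel inner loop becomes: one thread per space point, reading the old time level and writing the new one (the two buffers avoid a read/write race; names are assumed, not from the slides):

   __global__ void heat_step(const double *u_old, double *u_new, int mesh, double r)
   {
       int j = blockIdx.x * blockDim.x + threadIdx.x + 1;   // skip ghost cell 0
       if (j <= mesh) {
           u_new[j] = r * u_old[j-1] + (1.0 - 2.0*r) * u_old[j] + r * u_old[j+1];
       }
   }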
Space parallelization

   [Figure: Time 0 → Time 1 → Time 2 → Time 3; within each time step all space points
   update in parallel, with communication (Comm) between consecutive steps.]
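On the GPU the communication step between time levels comes for free: kernel launches on one stream execute in order, so step i+1 always sees the completed results of step i. A sketch of the host-side time loop using the heat_step kernel above (buffer names assumed):

   for (int i = 0; i < n_time_steps; ++i) {
       heat_step<<<blocks, threads>>>(u_old, u_new, mesh, r);
       double *tmp = u_old; u_old = u_new; u_new = tmp;   // swap buffers for the next step
   }
   cudaMemcpy(u_host, u_old, bytes, cudaMemcpyDeviceToHost);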
CPU code flow

   CPU: source → solve (heat relation + boundary condition) → result
CPU code
do {
    time += dt;

    for (i = 1; i < mesh+1; i++)
        temper[i].new_1 = (1-2*r)*temper[i].old + r*(temper[i-1].old + temper[i+1].old);

    temper[mesh+1].new_1 = temper[mesh-1].new_1;      /* free boundary */

    for (i = 1; i < mesh+2; i++)                      /* was "1 < mesh+2": an infinite loop */
        temper[i].old = temper[i].new_1;

    if ((++count % print_step) == 0) {
        printf("%10.5lf", time);
        for (i = 0; i < mesh; i += 2)
            printf(" %8.4lf", temper[i].new_1);
        if (!(i % 2))
            printf(" %8.4lf\n", temper[mesh].new_1);
    }
} while (time < end_time);

if (count % print_step) {
    printf("%10.5lf", time);
    for (i = 0; i < mesh; i += 2)
        printf(" %8.4lf", temper[i].new_1);
    printf(" %8.4lf\n", temper[mesh].new_1);
}
GPUcode-01 Memory Map

   CPU: source → solve (heat relation + boundary condition) → result
   GPU: source → (bypass without solving) → result

First step: port only the memory round-trip; the GPU receives the data and returns it without solving.
GPUcode-01 Malloc Template
double* init_GPU_data(struct flow *temper, int mesh)
{
    double *u_dev;                                // GPU copy of the data
    double *u_host;                               // temporary host buffer
    size_t gpuMemsize = sizeof(double) * (mesh + 2);

    cudaMalloc((void**)&u_dev, gpuMemsize);                          cudaErr("malloc u_dev");
    u_host = (double*) malloc(gpuMemsize);

    for (int i = 0; i < mesh; i++) {
        u_host[i] = temper[i].old;
        printf("before %d : u_host[%d]=%f temper[%d].old=%f\n", i, i, u_host[i], i, temper[i].old);
    }

    cudaMemcpy(u_dev, u_host, gpuMemsize, cudaMemcpyHostToDevice);   cudaErr("memcpy u_host to u_dev");
    cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost);   cudaErr("memcpy u_dev to u_host");

    for (int i = 0; i < mesh; i++) {
        printf("after %d : u_host[%d]=%f temper[%d].old=%f\n", i, i, u_host[i], i, temper[i].old);
    }

    free(u_host);

    return u_dev;
}
GPUcode-01 Launch Template
__global__ void functionG(double *u_dev, int meshsize, double r, double bound)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int i = idx + 1;
    if (idx < meshsize + 1) {
        u_dev[i] = i * 0.01;      // test pattern only: no solving yet in GPUcode-01
    }
    if (idx == 6) {
        u_dev[idx+1] = u_dev[idx-1] = 11;
    }
}

void compute_GPU(double *u_dev, double *u_host, double dt, double dx, double r, int mesh,
                 int print_step, int count, double time, double end_time, double bound)
{
    size_t gpuMemsize = sizeof(double) * (mesh + 2);
    //for (int i = 0; i < 6000; i++) {   // time step
        functionG<<<4, 5>>>(u_dev, mesh, r, bound);                      cudaErr2("kernel launch", 1, 0);
        cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost);   cudaErr("memcpy u_dev to u_host");
        for (int i = 0; i < mesh + 1; i++) {
            printf(" in kernel - GPU : temper[%d] ==> %f\n", i, u_host[i]);
        }
    //}
    return;
}
GPUcode-02 Solving

   CPU: source → solve (heat relation + boundary condition) → result
   GPU: source → solve (heat relation + boundary condition) → result

Second step: the kernel now solves the heat relation itself.
GPUcode-02 Malloc part
(Identical to the GPUcode-01 malloc template above.)
GPUcode-02 Launch part
__global__ void functionG(double *u_dev, int meshsize, double r, double bound)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int i = idx + 1;
    if (idx < meshsize + 1) {
        /* heat relation; note this updates in place, so neighboring threads can race
           on u_dev[i-1] / u_dev[i+1] -- a robust version uses two buffers            */
        u_dev[i] = (1 - 2*r) * u_dev[i] + r * (u_dev[i-1] + u_dev[i+1]);
        // u_dev[i] = i * 0.01;
    }
    if (idx == 6) {
        u_dev[idx+1] = u_dev[idx-1] = 11;
    }
}

void compute_GPU(double *u_dev, double *u_host, double dt, double dx, double r, int mesh,
                 int print_step, int count, double time, double end_time, double bound)
{
    size_t gpuMemsize = sizeof(double) * (mesh + 2);
    //for (int i = 0; i < 6000; i++) {   // time step
        functionG<<<4, 5>>>(u_dev, mesh, r, bound);                      cudaErr2("kernel launch", 1, 0);
        cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost);   cudaErr("memcpy u_dev to u_host");
        for (int i = 0; i < mesh + 1; i++) {
            printf(" in kernel - GPU : temper[%d] ==> %f\n", i, u_host[i]);
        }
    //}
    return;
}
How to Optimize?




Monte Carlo Simulation
Computation Time

        Malliavin MC results (by number of simulation paths)
        type         Black-Scholes   5000      10000     50000     100000    200000    300000
        Price        12.8216         12.8706   12.8721   12.8413   12.8525   12.8559   12.8480
        Err          0.000%          0.38%     0.39%     0.15%     0.24%     0.27%     0.21%
        Delta        0.5858          0.5724    0.5887    0.5826    0.5826    0.5829    0.5824
        Err          0               2.28%     -0.51%    0.53%     0.53%     0.49%     0.58%
        Gamma        0.0130          0.0112    0.0134    0.0127    0.0127    0.0127    0.0127
        Err          0.00%           13.39%    -3.34%    2.07%     2.26%     2.21%     2.47%
        Total Time   0:00            00:03     00:05     00:25     00:50     01:44     02:33

                                                                                 time: 100 sec
                                                                                 target 1: under 1 sec (100X)
                                                                                 target 2: under 0.001 sec (2000X)
Monte Carlo Simulation for Finance

   Excel Sheet → Excel VBA → C/C++ DLL link → CUDA acceleration
Monte Carlo Code with VBA




Function MC(S As Double, X As Double, T As Double, R As Double, _
            Vol As Double, Q As Double, No As Double, rtype As String) As Double
    Simul_No = No
    dt = 1 'dt = 1 / 365
    For K = 0 To Simul_No - 1
        Juga = S
        For i = 0 To MaxStep - 1
            Juga = Juga * Exp((R - Q - Vol ^ 2 / 2) * dt + Vol * Sqr(dt) * MakeNorsD())
        Next
        price = Exp(-R * T) * Max(Juga - X, 0)
        sum_price = sum_price + price
    Next
    MC = sum_price / Simul_No
End Function
Malliavin Greeks

   Greek computation by finite differences in Monte Carlo simulation:

      $\dfrac{\partial}{\partial S} E[f(S)] \approx \dfrac{E[f(S_0 + \Delta S)] - E[f(S_0)]}{\Delta S}$

   Malliavin approach:

      $\dfrac{\partial}{\partial S} E[f(S)] = E[f(S)\,W(S)]$,   where $W(S)$ is the Malliavin weight

   With the Malliavin approach the Greeks come from the same simulated paths as the
   price, so we save the bumped-input reruns and their computation time.
Problem

 To compare accuracy, we compute the Price, Delta, and Gamma of a vanilla Call option.

 Approaches:
 1. Closed-form solution (VBA, C)
 2. Monte Carlo (VBA, C)
 3. Malliavin (VBA, C, CUDA v1, v2)
Malliavin Monte Carlo Code with VBA

Function Malliavin(S As Double, X As Double, T As Double, R As Double, _
                   Vol As Double, Q As Double, No As Double, rtype As String) As Double
    Simul_No = No
    dt = 1 'dt = 1 / 365
    For K = 0 To Simul_No - 1
        Juga = S
        For i = 0 To MaxStep - 1
            Juga = Juga * Exp((R - Q - Vol ^ 2 / 2) * dt + Vol * Sqr(dt) * MakeNorsD())
        Next
        WT = (Log(Juga) - Log(S) - (R - Q - 1 / 2 * Vol ^ 2) * T) / Vol
        WT_delta = WT / (S * Vol * T)
        WT_gamma = (1 / (Vol * T * S ^ 2)) * (WT ^ 2 / (Vol * T) - WT - 1 / Vol)

        price = Exp(-R * T) * Max(Juga - X, 0)
        delta = price * WT_delta
        gamma = price * WT_gamma

        sum_price = sum_price + price
        sum_delta = sum_delta + delta
        sum_gamma = sum_gamma + gamma
    Next

    Malliavin = sum_delta / Simul_No
End Function
Step1 Malliavin Monte Carlo C language sketch
void Malliavin(double S, double X, double T, double R, double Vol, double Q, long No) {
    long Simul_No = No;
    double dt = 1;   // dt = 1 / 365
    for (long K = 0; K < Simul_No; K++) {
        Juga = S;
        for (int i = 0; i < MaxStep; i++) {
            Juga = Juga * exp((R - Q - Vol*Vol/2) * dt + Vol * sqrt(dt) * norm());  // rand with Box-Muller
        }

        WT = (log(Juga) - log(S) - (R - Q - Vol*Vol/2) * T) / Vol;
        WT_delta = WT / (S * Vol * T);
        WT_gamma = (1 / (Vol * T * S*S)) * (WT*WT / (Vol * T) - WT - 1 / Vol);

        price = exp(-R * T) * max(Juga - X, 0);
        delta = price * WT_delta;
        gamma = price * WT_gamma;

        sum.price = sum.price + price;
        sum.delta = sum.delta + delta;
        sum.gamma = sum.gamma + gamma;
    }
    r.price = sum.price / Simul_No;
    r.delta = sum.delta / Simul_No;
    r.gamma = sum.gamma / Simul_No;
}
Step2 Sketch (kernel part)
__global__ void Malliavin_compute(double S, double X, double T, double R, double Vol, double Q, long No) {
    long Simul_No = No;                      // Simul_No = total sims / (N threads * M blocks)
    double dt = 1;   // dt = 1 / 365
    for (long K = 0; K < Simul_No; K++) {
        Juga = S;
        for (int i = 0; i < MaxStep; i++) {
            Juga = Juga * exp((R - Q - Vol*Vol/2) * dt + Vol * sqrt(dt) * norm(k, i));
        }

        WT = (log(Juga) - log(S) - (R - Q - Vol*Vol/2) * T) / Vol;
        WT_delta = WT / (S * Vol * T);
        WT_gamma = (1 / (Vol * T * S*S)) * (WT*WT / (Vol * T) - WT - 1 / Vol);

        price = exp(-R * T) * max(Juga - X, 0);
        delta = price * WT_delta;
        gamma = price * WT_gamma;

        sum.price = sum.price + price;
        sum.delta = sum.delta + delta;
        sum.gamma = sum.gamma + gamma;
    }
    r.price = sum.price / Simul_No;          // real price = sum of r.price over all threads / (N*M)
    r.delta = sum.delta / Simul_No;
    r.gamma = sum.gamma / Simul_No;
}
Step2 Parallel Memory Map

   Parallel cores: RNG_compute1() produces uniforms, RNG_compute2() turns them into
   normals, and Malliavin_compute() consumes normal() to evolve the stock and price
   the option. The uniform, normal, stock, and option data all live in global memory.

      __device__ float* normal(int k, int j, int size_j, float *normal_buf) {
          int index = k * size_j + j;
          return &normal_buf[index];
      }
Step2 Sketch (host part)
#include <stdio.h>

__global__ void RNG_compute1(parameter);
__global__ void RNG_compute2(parameter);
__global__ void Malliavin_compute(parameter);

int main() {

    malloc();       // CPU malloc
    cudaMalloc();   // GPU malloc
    cudaMemcpy();   // transfer

    RNG_compute1 <<<N, M>>> (parameter);       // generate RNG (uniform)

    RNG_compute2 <<<N, M>>> (parameter);       // generate RNG (Box-Muller, Moro)

    Malliavin_compute <<<N, M>>> (parameter);  // simulation

    cudaMemcpy();   // get results

    return 0;
}
Step2 Malliavin Monte Carlo CUDA language sketch (rng part1)

__global__
static void RNG_rand48_get_int(uint2 *state, int *res, int num_blocks, uint2 A, uint2 C)
{
  const int nThreads = blockDim.x*gridDim.x;


    int nOutIdx = threadIdx.x + blockIdx.x*blockDim.x;
    uint2 lstate = state[nOutIdx];
    int i;
    for (i = 0; i < num_blocks; ++i) {

        res[nOutIdx] = ( lstate.x >> 17 ) | ( lstate.y << 7);
        nOutIdx += nThreads;

        lstate = RNG_rand48_iterate_single(lstate, A, C);
    }

    nOutIdx = threadIdx.x + blockIdx.x*blockDim.x;
    state[nOutIdx] = lstate;
}
Step2 Malliavin Monte Carlo CUDA language sketch (rng part2)

#define BLOCK_SIZE 256          /* assumed block size for the shared staging buffers */

__global__ void RNG_compute(int *uniform, float *normal, int length)
{
    __shared__ int   s[BLOCK_SIZE];
    __shared__ float s_r[BLOCK_SIZE];

    if (threadIdx.x == 0) {
        for (int i = 0; i < blockDim.x; i++) {
            s[i] = uniform[blockDim.x * blockIdx.x + i];   // load uniforms
        }
    }
    __syncthreads();

    s_r[threadIdx.x] = (float) moro(s[threadIdx.x]);       // Moro inversion in parallel
    __syncthreads();

    if (threadIdx.x == 0) {
        for (int i = 0; i < blockDim.x; i++) {
            normal[blockDim.x * blockIdx.x + i] = s_r[i];  // store normals
        }
    }
}
Box-Muller vs Moro Inversion

    __device__ void BoxMuller(float& u1, float& u2) {
        float r = sqrtf(-2.0f * logf(u1));
        float phi = 2 * PI * u2;
        u1 = r * __cosf(phi);
        u2 = r * __sinf(phi);
    }

    __device__ float Moro(float u) {
        // constant definitions a1..a4, b1..b4, c1..c9 omitted
        float x = u - 0.5f;
        float r;
        if (fabsf(x) < 0.42f) {
            r = x * x;
            r = x * (((a4*r + a3)*r + a2)*r + a1) / ((((b4*r + b3)*r + b2)*r + b1)*r + 1);
        } else {
            if (x > 0)  r = logf(-logf(1 - u));
            if (x <= 0) r = logf(-logf(u));
            r = c1 + r*(c2 + r*(c3 + r*(c4 + r*(c5 + r*(c6 + r*(c7 + r*(c8 + r*c9)))))));
            if (x <= 0) r = -r;
        }
        return r;
    }
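A usage note (a sketch, assuming the two functions above): Box-Muller consumes uniforms in pairs and writes two normals in place, while Moro maps each uniform to exactly one normal and preserves the ordering of the input stream, which is why inversion is the usual choice with quasi-random (low-discrepancy) sequences.

      float u1 = 0.37f, u2 = 0.81f;   // placeholders for uniforms from the RNG stage
      BoxMuller(u1, u2);              // u1, u2 now hold two standard normals
      float z = Moro(0.42f);          // one normal per uniform, order-preserving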
Step3 New Approach: Direct LCG and Parallel Memory Map

   Each parallel core runs Malliavin_compute() with rand_new(seed) and moro(u) inline,
   so the uniform, normal, stock, and option values all stay in registers instead of
   global memory.

      double rand_new(int next) {
          next = (a * next + b) % M;
          return (double) next / (double) M;   // divide as double: next * (1/M) is always 0 in integer math
      }
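A hedged sketch of how each thread can run its own LCG stream in registers (the constants are the Numerical Recipes 32-bit LCG; the per-thread seeding scheme is an assumption, not from the slides):

   __device__ float lcg_uniform(unsigned int &state)
   {
       state = 1664525u * state + 1013904223u;    // 32-bit LCG, wraps mod 2^32
       return state * (1.0f / 4294967296.0f);     // scale to [0, 1)
   }

   __global__ void Malliavin_compute(/* ... */)
   {
       int tid = blockIdx.x * blockDim.x + threadIdx.x;
       unsigned int state = 12345u + 777u * tid;  // assumed per-thread seed offset
       float u = lcg_uniform(state);              // uniform stays in a register
       // float z = Moro(u);                      // then Moro inversion to a normal
   }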
Platform for finance

   Excel Sheet → Excel VBA → C/C++ DLL link → socket communication → Grid Manager → GPU cluster

   GPU cluster: 12-node system (1/2 for backup), 8 GPUs per node;
   12 nodes * 8 GPUs * 448 cores ≈ 43,000 cores
How to Optimize?




Time Series Analysis

Multivariate Time Series

   [Figure: n series S(1)..S(n), each with m observations; the goal is the pairwise
   correlations Corr(1,2), Corr(1,3), Corr(2,3), ..., Corr(1,n).]
How to parallelize with CUDA (approach 1)

   [Figure: one thread per series S(1)..S(n); the time dimension is serialized inside
   each thread.]

   An inefficient approach for shared-memory usage.
How to parallelize with CUDA (approach 2)

   [Figure: threads parallelize across time within one series; the series S(1)..S(n)
   are processed in turn.]

   An efficient approach for shared-memory usage.
   Needs a reduction technique for the mean etc. (see the sketch below).
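A minimal shared-memory reduction sketch for the per-series sums the mean needs (the block size and names are assumptions, not from the slides):

   __global__ void block_sum(const float *x, float *partial, int n)
   {
       __shared__ float s[256];                    // assumed block size 256
       int tid = threadIdx.x;
       int i = blockIdx.x * blockDim.x + tid;
       s[tid] = (i < n) ? x[i] : 0.0f;
       __syncthreads();
       for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
           if (tid < stride) s[tid] += s[tid + stride];
           __syncthreads();
       }
       if (tid == 0) partial[blockIdx.x] = s[0];   // one partial sum per block
   }

The host (or a second kernel) adds the per-block partials and divides by n to get the mean.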
How to parallelize with CUDA (approach 3)

   [Figure: the series S(1)..S(n) are distributed across multiple nodes and GPUs;
   time is serialized within each GPU.]

   A good method for multi-GPU systems and large clusters.
In pair (i,j), Pearson Correlation Coefficient

   $$\rho = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$

Expanding the sums:

   $$\sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \bar{x}\sum y_i - \bar{y}\sum x_i + n\bar{x}\bar{y} = \sum x_i y_i - n\bar{x}\bar{y}$$
   $$\sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2, \qquad \sum (y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2$$

so

   $$\rho = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum x_i^2 - n\bar{x}^2\right)\left(\sum y_i^2 - n\bar{y}^2\right)}}$$

   We can parallelize the summations! After the summation, find the mean.
How to parallelize with CUDA : Flow Chart

   Method 1:
      Input A, B
      Find the mean of A, B (start by summing Ai, Bi)
      Find Cov(A,B), Cov(A,A), Cov(B,B) (sum AiBi, Ai^2, Bi^2 using the means)
      Benefit: easy to implement with two functions

   Method 2:
      Input A, B
      Sum Ai, Bi, AiBi, Ai^2, Bi^2 in one pass
      Find the mean of A, B
      Find Cov(A,B), Cov(A,A), Cov(B,B)
      Benefit: one-shot sum (speed-up)

                      R+CUDA project
     http://brainarray.mbni.med.umich.edu/Brainarray/rgpgpu/
In pair (i,j), Pearson Correlation Coefficient

 FOR (i,j) pair : serial
       FOR k (time series) : parallel
         compute $\sum x_i$, $\sum y_i$, $\sum x_i y_i$, $\sum x_i^2$, $\sum y_i^2$

                      reduction for results
                        compute mean(i), mean(j)
                        compute cov(i,j), cov(i,i), cov(j,j)
                        compute corr(i,j)
In pair (i,j), Pearson Correlation Coefficient

 FOR (i,j) pair : serial
       FOR k (time series) : parallel
         FOR shared memory
            compute $\sum x_i$, $\sum y_i$, $\sum x_i y_i$, $\sum x_i^2$, $\sum y_i^2$

                      reduction for results
                        compute mean(i), mean(j)
                        compute cov(i,j), cov(i,i), cov(j,j)
                        compute corr(i,j)
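A hedged sketch of the inner step for one (i,j) pair: one thread per time sample accumulates all five sums through shared memory, and block partials are combined with float atomicAdd, which needs a Fermi-class GPU (compute capability 2.0). Block size and names are assumptions:

   __global__ void pair_sums(const float *x, const float *y, float *out, int n)
   {
       __shared__ float sx[128], sy[128], sxy[128], sxx[128], syy[128];  // assumed block size 128
       int t = threadIdx.x;
       int k = blockIdx.x * blockDim.x + t;
       float xv = (k < n) ? x[k] : 0.0f;
       float yv = (k < n) ? y[k] : 0.0f;
       sx[t] = xv;  sy[t] = yv;  sxy[t] = xv * yv;  sxx[t] = xv * xv;  syy[t] = yv * yv;
       __syncthreads();
       for (int s = blockDim.x / 2; s > 0; s >>= 1) {
           if (t < s) {
               sx[t] += sx[t+s];  sy[t] += sy[t+s];  sxy[t] += sxy[t+s];
               sxx[t] += sxx[t+s];  syy[t] += syy[t+s];
           }
           __syncthreads();
       }
       if (t == 0) {   // out[0..4] = sum x, sum y, sum xy, sum x^2, sum y^2
           atomicAdd(&out[0], sx[0]);   atomicAdd(&out[1], sy[0]);
           atomicAdd(&out[2], sxy[0]);  atomicAdd(&out[3], sxx[0]);  atomicAdd(&out[4], syy[0]);
       }
   }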
How to Optimize?




Focus on Optimization

System optimization tips before starting CUDA programming
Remote Use

   VNC protocol: VNC ships snapshots of the real screen, so the WDDM driver on the
   remote host stays in charge of the GPU -- CUDA remains enabled for the client.

   RDP protocol: the RDP driver intercepts the display path and bypasses the WDDM
   driver that CUDA needs -- CUDA is disabled for the client.
TCC driver

   With the TCC (Tesla Compute Cluster) driver, the CUDA application reaches the GPU
   through a non-display (WDM) path, so only the results travel over RDP -- CUDA stays
   enabled on the remote host.

               • works over Remote Desktop
               • lower kernel launch overhead
               • GPU exclusive mode
               • usable from Windows services (Session 0/1)
               • large single malloc
GPU exclusive Mode for multiuser

   [Figure: GPUs 0-3 on a server form a GPU pool; exclusive mode hands each remote
   user a dedicated GPU from the pool.]
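A hedged note on configuration: on Linux the compute mode is set per GPU with nvidia-smi. The invocation below is an assumption for illustration; flag spelling has varied across driver generations, so check nvidia-smi -h on your system:

   nvidia-smi -i 0 -c 1    # GPU 0: exclusive compute mode (0 = default, 2 = prohibited)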
GPUDirect for GPU cluster

   MPI communication: GPUDirect lets the GPU and the network adapter share the same
   system pinned memory, removing the extra staging copy through pageable system memory.
FBO on SDI with Quadro

   [Figure: left -- a third-party SDI input writes to host memory, which the memory
   controller then copies into the graphics card's video memory; right -- the SDI input
   writes directly into an OpenGL Frame Buffer Object in video memory, bypassing host
   memory. Both paths drive the NVIDIA SDI output and DVI.]
Conceptual Tips for CUDA Optimization

SIMT architecture: Single Instruction, Multiple Threads
Abstract overview of a CPU Core

   [Figure: instruction fetch and the L1 instruction cache feed the FPU, ALU, and LSU;
   a register bank (RAX, RBX, RCX, RDX) and the L1 data cache sit between the core and
   the memory bank.]
Threads on a Multicore CPU

   [Figure: one thread per core (general programming); several software threads per
   core (winthread, pthread); two hardware threads per core (Hyper-Threading).]
Threads on a Manycore GPU

   [Figure: hardware multiprocessors (MPs) vs. software active blocks -- each SP runs
   many threads, and blocks of threads are mapped onto MPs.]
Overview of WARP schedule

   [Figure: threads <<<0,0>>> through <<<0,31>>> form warp 0; <<<0,32>>> onward start
   the next warp. The MP schedules 32-thread warps onto its 8 SP cores, with registers
   allocated per thread.]
Overview of Instruction Fetch

   [Figure: one instruction sequence (Load A,B → FP ADD → Store C) is fetched once and
   executed by all 8 cores; whether you follow blockIdx.x=0, threadIdx.x=3 or
   threadIdx.x=4, the same instruction stream drives every thread in the warp.]
Thread schedule within an MP: WARP

   [Figure: up to 1024 threads per MP * 30 MPs = 30K threads in flight look like
   concurrent threads; warps from blocks B1 and B2 are interleaved on the MP.]
Occupancy Calculator on CUDA SDK




CUDA profiler on CUDA toolkit




Parallel Nsight 1.5 Professional




Memory Hierarchy

Managing Memory

   [Figure: the host (CPU, DRAM, chipset) connects to device DRAM holding local and
   global memory; each multiprocessor on the GPU has registers and shared memory,
   backed by L1 and L2 caches. Moving toward the registers, bandwidth rises and size
   shrinks.]
cuBLAS level III

   [Charts: SGEMM (matrix sizes in multiples of 96) and DGEMM (multiples of 64)
   throughput in Gflops versus matrix size, roughly 64 x 64 up to 8448 x 8448.
   cuBLAS 3.2 on NVIDIA Tesla C1060 and Tesla C2050 (Fermi) vs. MKL 10.2.4.32 on a
   quad-core Intel Xeon 5550, 2.67 GHz. CUBLAS 3.2 performs best when the matrix size
   is a multiple of 96 for SGEMM and 64 for DGEMM.]
Roofline Analysis (Arithmetic Intensity)

         Samuel Williams, Andrew Waterman, and David Patterson,
         "Roofline: An Insightful Visual Performance Model for Multicore Architectures"
Perf. on CUDA Application

   [Figure: Gflops versus arithmetic intensity (flops per byte, F/B): achieved
   performance is bounded first by the PCI-e slot bandwidth (for small sizes, Ns) and
   then by global-memory bandwidth, before the compute roof is reached.]
Tips for Optimization

         Consider the algorithm's suitability for parallelism (naïve algorithms often map well to the GPU)
         Consider occupancy (SIMT)
         Consider the memory bottleneck (see the coalescing sketch below)




NVIDIA Confidential
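On the memory-bottleneck tip, the most common single fix is making global-memory accesses coalesced, i.e. having consecutive threads of a warp touch consecutive addresses. A minimal sketch (both kernels are hypothetical, written only to contrast the two access patterns):

    __global__ void copy_coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];               /* a warp touches consecutive addresses */
    }

    __global__ void copy_strided(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        long j = ((long)i * stride) % n;  /* a warp scatters across memory */
        if (i < n)
            out[j] = in[j];
    }

On Fermi, the strided version wastes most of each memory transaction, so restructuring data (structure-of-arrays rather than array-of-structures) to restore coalescing is usually worth more than any instruction-level tuning.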
More Information for CUDA Optimization
         CUDA Zone
          http://www.nvidia.com/CUDA

         Developer Zone
          http://developer.nvidia.com

         GTC 2010 contents
          http://www.nvidia.com/gtc2010-content

         쿠다 카페 (CUDA café in Korea)
          http://cafe.daum.net/KCUG
NVIDIA Confidential
Thanks


  Hyungon Ryu
hryu@nvidia.com

[05][CUDA and Fermi Optimization Techniques] hryu optimization (slide-by-slide transcript)

  • 1. CUDA and Fermi Optimization Techniques Hyungon Ryu, PSG SA
  • 2. Agenda CUDA 101 (General Overview) Toy of FDM example Monte Carlo Simulation & Time Series Analysis General CUDA Optimization Tips NVIDIA Confidential
  • 4. CUDA 101 CUDA Graphics Parallel (GPU) Programming NVIDIA Confidential
  • 5. CUDA Parallel programming Heterogeneous Knowledge GPU Knowledge GPU Cg Parallel DSP OpenGL Parallel ASIC DirectX Parallel FPGA Cell BE Parallel knowledge Pthreads / winthreads MMX, SSE OpenMP PVM / MPI CUDA Parallel Computing with GPU NVIDIA Confidential
  • 6. Parallel Model in OpenMP CPU Program Fork Parallel Join NVIDIA Confidential 6
  • 7. CUDA parallel Model CPU Program Kernel Launch GPU threads NVIDIA Confidential 7
  • 8. Saxpy Example : CPU serial for (int i = 0; i < n; ++i) { y[i] = a*x[i] + y[i]; } i x y NVIDIA Confidential
  • 9. Example of Saxpy Parallel : OpenMP # pragma omp parallel shared (n,a,x,y) private (i) # pragma omp for for (int i = 0; i < n; ++i) { y[i] = a*x[i] + y[i]; } i x y NVIDIA Confidential
  • 10. Example of Saxpy Parallel : MPI for ( i = start ; i < end; i++) { y[i] = a * x[i] + y[i]; } MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD,&rank); MPI_Comm_size(MPI_COMM_WORLD,&size); void para_range(int lowest, int highest, int nprocs, int myrank, int *start, int *end) { int wk1, wk2; wk1 = (highest - lowest + 1) / nprocs; wk2 = (highest - lowest + 1) % nprocs; *start = myrank * wk1 + lowest + ((myrank < wk2) ? myrank : wk2); *end = *start + wk1 - 1; if(wk2 > myrank) *end = *end + 1; } i x y NVIDIA Confidential
  • 11. Example of Saxpy Parallel : SSE void saxpy_vector(short *z, short *x, short *y, short a, unsigned n) { __m128i* x_ptr = (__m128i*) x; __m128i* y_ptr = (__m128i*) y; __m128i* z_ptr = (__m128i*) z; __m128i a_vec = _mm_set1_epi16(a); int i; for (i = 0; i < n/8; ++i) { __m128i x_vec = x_ptr[i]; __m128i y_vec = y_ptr[i]; __m128i z_vec = _mm_add_epi16( _mm_mullo_epi16(x_vec, a_vec), y_vec); z_ptr[i] = z_vec; } } i x y NVIDIA Confidential
  • 12. Saxpy Parallel : CUDA { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) y[i] = a * x[i] + y[i]; } Saxpy<<<N, M>>>(n, 2.0, x, y); i x y NVIDIA Confidential
  • 13. CUDA C extension Launch the kernel Function <<< Grid, Block >>> ( parameter); Additional C standard API for mem control cudaXXX : cudaMalloc, cudaMemcpy, cuXXX : cuMalloc, cuMemcpy cutXXX : cutXXX For Function __global__, __device__, __host__, __device__ __host__ For memory __shared__, __device__, reg/loc pre-defined variables blockDim, blockIdx, threadIdx, cudaMemcpyHostToDevice Pre-defined function __syncthreads(), __mul24(); etc NVIDIA Confidential 13
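Putting the extensions on this slide together, a minimal complete SAXPY program might look like the following (a sketch: the grid/block sizes and test values are illustrative choices, not from the slides):

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* pre-defined variables */
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    int main(void) {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *x = (float *)malloc(bytes), *y = (float *)malloc(bytes);
        float *d_x, *d_y;
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        cudaMalloc((void **)&d_x, bytes);           /* additional C-style memory API */
        cudaMalloc((void **)&d_y, bytes);
        cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

        int block = 256, grid = (n + block - 1) / block;
        saxpy<<<grid, block>>>(n, 2.0f, d_x, d_y);  /* Function<<<Grid, Block>>>(...) */

        cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
        printf("y[0] = %f (expect 4.0)\n", y[0]);

        cudaFree(d_x); cudaFree(d_y); free(x); free(y);
        return 0;
    }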
  • 14. Process of CUDA developing [Figure: two development flows. Serial: algorithm, serial programming, compile, debugging, release. CUDA parallel: algorithm, serial programming, compile, profile, CUDA convert/parallelize, compile, debugging, optimize/profile, release] NVIDIA Confidential 14
  • 15. CUDA is Parallel Computing !!! [Figure: three development flows side by side. Serial: algorithm, serial programming, compile, debugging, release. MPI parallel: algorithm, serial programming, compile, profile, parallelize, parallel programming, compile, debugging [totalview], optimize, release, MPIrun. CUDA parallel: algorithm, serial programming, compile, profile, parallelize/CUDA convert, compile, debugging, optimize/profile, release] NVIDIA Confidential 15
  • 16. CPU Program Kernel Launch [Figure: a kernel launch maps blocks of threads onto the GPU: TPCs with a geometry controller and SMC contain multiprocessors, each with I-cache, MT issue, C-cache, SP cores, SFUs, a DP unit, and shared memory, plus a texture unit with Tex L1] NVIDIA Confidential 16
  • 17. Image Processing Diagram without CUDA CPU Total capture time Camera For Loop start For Loop End NVIDIA Confidential
  • 18. Image Processing Diagram with CUDA CPU Total capture time Camera Memcpy For Loop start CUDA parallel For Loop End NVIDIA Confidential
  • 19. 1D Heat Equation CUDA Toy for undergraduate student
  • 20. 1D Heat Equation 1D bar NVIDIA Confidential Heat Source
  • 21. 1D Heat Equation $\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2}$ (1D heat equation), $u(0,t) = u(1,t) = 0$ (boundary condition), $u(x,0) = u_0$ (initial condition) NVIDIA Confidential
  • 22. Discretization u  2u  t x 2 descretization u j ,i 1  u j ,i u j 1,i  2u j ,i  u j 1,i  Forward difference with second order t x 2 i(time), j(space) relation u j ,i 1  ru j 1,i  (1  2r )u j ,i  ru j 1,i Explicit method r  t / x 2 u[i+1][j] = r*u[i+1][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1]; NVIDIA Confidential
  • 23. Discretization (stencil) $u_{j,i+1} = r\,u_{j-1,i} + (1-2r)\,u_{j,i} + r\,u_{j+1,i}$ [Figure: three-point stencil; the value at (j, i+1) combines (j-1, i), (j, i), (j+1, i) with weights r, (1-2r), r] NVIDIA Confidential
  • 24. Conceptual Diagram [Figure: space-time grid with the initial condition along time 0 and boundary conditions at both ends of the bar; each sweep of the stencil applies u[i+1][j] = r*u[i][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1]; and advances one time level, from time 0 to 600] NVIDIA Confidential
  • 25. Explicit Pseudo Code Parameter and data initialization (stencil, boundary/initial condition); FOR LOOP (time, i) { FOR LOOP (stencil, j) { update the stencil relation u[i+1][j] = r*u[i][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1]; } } Results NVIDIA Confidential
  • 26. CPU code u[i][j] vs. u[j] u[i][j] easy to develop Possible to visualize the process u[j] efficient to use memory Get the last result NVIDIA Confidential
  • 27. Main algorithm for( i = 0; i < N; i++) { // time iteration for( j = 1; j < M-1; j++){ // space iteration (interior points) u[i+1][j] = r * u[i][j-1] + (1-2*r) * u[i][j] + r * u[i][j+1]; } } // heat relation; a self-contained version follows below NVIDIA Confidential
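Fleshing this double loop out, a compact serial implementation might look like the following (a sketch: the mesh size, r, and the fixed-boundary choice are illustrative assumptions; the deck's fuller CPU version appears on slide 33):

    #include <stdio.h>

    #define N 600   /* time steps (illustrative) */
    #define M 16    /* spatial points (illustrative) */

    int main(void) {
        double u[2][M];          /* two time levels: old and new (ping-pong) */
        double r = 0.25;         /* r = dt/dx^2; explicit method stable for r <= 0.5 */

        for (int j = 0; j < M; ++j) u[0][j] = 0.0;  /* initial condition */
        u[0][M / 2] = 1.0;                          /* heat source in the middle */

        for (int i = 0; i < N; ++i) {               /* time iteration */
            int cur = i % 2, nxt = 1 - cur;
            for (int j = 1; j < M - 1; ++j)         /* space iteration (interior) */
                u[nxt][j] = r*u[cur][j-1] + (1 - 2*r)*u[cur][j] + r*u[cur][j+1];
            u[nxt][0] = u[nxt][M-1] = 0.0;          /* fixed boundary */
        }

        for (int j = 0; j < M; ++j)                 /* final temperature profile */
            printf("%8.4f ", u[N % 2][j]);
        printf("\n");
        return 0;
    }

Note the two-row (ping-pong) storage: the whole new time level is computed from the old one, which is exactly the property that makes the space loop parallelizable.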
  • 28. Boundary Condition Method1 copy copy copy N Fixed boundary N-1 Method2 copy Free boundary N+1 N N-1 compute NVIDIA Confidential
  • 29. How to Parallelize: parallelize the space stencil; each thread (core) is dedicated to one spatial stencil element [Figure: the space-time grid of slide 24, with the stencil applied across all spatial points of one time level at once]. It is not possible to parallelize the time stencil. NVIDIA Confidential
  • 30. How to parallelize Parameter and data Initialization Stencil, boundary/initial condition, FOR LOOP (time, i) FOR LOOP (stencil, j) Update the stencil relation Results NVIDIA Confidential
  • 31. Space parallelization Time Sequence Time 0 Comm Time 1 Time 2 Time 3 NVIDIA Confidential
  • 32. CPUcode CPU source Heat Relation solve Boundary Condition result NVIDIA Confidential
  • 33. CPU code do{ time += dt; printf("time %f\n", time); for(i=1; i < mesh+1; i++){ temper[i].new_1 = (double) (1-2*r)*temper[i].old + r*(temper[i-1].old + temper[i+1].old); printf("\nprocessing\t%d %f %f\n", i, temper[i].new_1, temper[i].old); } temper[mesh+1].new_1 = temper[mesh-1].new_1; printf("\tprint results %d %f %f\n", mesh+1, temper[mesh+1].new_1, temper[mesh-1].new_1); for(i=1; i < mesh+2; i++) temper[i].old = temper[i].new_1; if((++count % print_step)==0){ printf("%10.5lf", time); for(i=0; i<mesh; i+=2) printf(" %8.4lf", temper[i].new_1); if(!(i%2)) printf(" %8.4lf\n", temper[mesh].new_1); } getchar(); }while(time < end_time); if((count % print_step)){ printf("%10.5lf", time); for(i=0; i<mesh; i+=2) printf(" %8.4lf", temper[i].new_1); printf(" %8.4lf\n", temper[mesh].new_1); } NVIDIA Confidential
  • 34. GPUcode-01 Memory Map CPU GPU source source Heat Relation solve Bypass without solving Boundary Condition result result NVIDIA Confidential
  • 35. GPUcode-01 Malloc Template double* init_GPU_data(struct flow * temper, int mesh) { double *u_dev; // for GPU data upload size_t gpuMemsize = sizeof(double)*(mesh+2); double *u_host; // temporary host buffer cudaMalloc( (void**)&u_dev, gpuMemsize); cudaErr("malloc u_dev"); u_host = (double *) malloc( gpuMemsize); for(int i=0;i<mesh;i++){ u_host[i] = temper[i].old; printf("before %d : data initial : u_host[%d]= %f temper[%d].old =%f\n", i, i, u_host[i], i, temper[i].old); } cudaMemcpy(u_dev, u_host, gpuMemsize, cudaMemcpyHostToDevice); cudaErr("memcpy u_host to u_dev"); cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost); cudaErr("memcpy u_dev to u_host"); for(int i=0;i<mesh;i++){ printf("after %d : data initial : u_host[%d]= %f temper[%d].old =%f\n", i, i, u_host[i], i, temper[i].old); } free(u_host); return (double *)u_dev; } NVIDIA Confidential
  • 36. GPUcode-01 Launch Template __global__ void functionG(double *u_dev, int meshsize, double r, double bound ) { int idx = blockIdx.x*blockDim.x+threadIdx.x; int i = idx+1; if(idx < meshsize+1 ) { u_dev[i] = i*0.01; } if(idx == 6){ u_dev[idx+1] = u_dev[idx-1] = 11; } } void compute_GPU( double * u_dev, double * u_host, double dt, double dx, double r, int mesh, int print_step, int count, double time, double end_time, double bound ) { size_t gpuMemsize = sizeof(double)*(mesh+2); //for( int i=0; i < 6000 ; i++){ // time step functionG<<<4,5>>>(u_dev, mesh, r, bound); cudaErr2("kernel launch",1,0); cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost); cudaErr("memcpy u_dev to u_host"); for(int i=0; i<mesh+1; i++){ printf( " in kernel - GPU : temper[%d] ==> %f \n", i, u_host[i]); } //} return; } NVIDIA Confidential
  • 37. GPUcode-02 Solving CPU GPU source source Heat Relation Heat Relation solve solve Boundary Condition Boundary Condition result result NVIDIA Confidential
  • 38. GPUcode-02 Malloc part double* init_GPU_data(struct flow * temper, int mesh) { double *u_dev; // for GPU data upload size_t gpuMemsize = sizeof(double)*(mesh+2); double *u_host; // temporary host buffer cudaMalloc( (void**)&u_dev, gpuMemsize); cudaErr("malloc u_dev"); u_host = (double *) malloc( gpuMemsize); for(int i=0;i<mesh;i++){ u_host[i] = temper[i].old; printf("before %d : data initial : u_host[%d]= %f temper[%d].old =%f\n", i, i, u_host[i], i, temper[i].old); } cudaMemcpy(u_dev, u_host, gpuMemsize, cudaMemcpyHostToDevice); cudaErr("memcpy u_host to u_dev"); cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost); cudaErr("memcpy u_dev to u_host"); for(int i=0;i<mesh;i++){ printf("after %d : data initial : u_host[%d]= %f temper[%d].old =%f\n", i, i, u_host[i], i, temper[i].old); } free(u_host); return (double *)u_dev; } NVIDIA Confidential
  • 39. GPUcode-02 Launch part __global__ void functionG(double *u_dev, int meshsize, double r, double bound ) { int idx = blockIdx.x*blockDim.x+threadIdx.x; int i = idx+1; if(idx < meshsize+1 ) { u_dev[i] = (double) (1-2*r)*u_dev[i] + r*(u_dev[i-1] + u_dev[i+1]); // in-place update; see the double-buffered sketch below } if(idx == 6){ u_dev[idx+1] = u_dev[idx-1] = 11; } } void compute_GPU( double * u_dev, double * u_host, double dt, double dx, double r, int mesh, int print_step, int count, double time, double end_time, double bound ) { size_t gpuMemsize = sizeof(double)*(mesh+2); //for( int i=0; i < 6000 ; i++){ // time step functionG<<<4,5>>>(u_dev, mesh, r, bound); cudaErr2("kernel launch",1,0); cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost); cudaErr("memcpy u_dev to u_host"); for(int i=0; i<mesh+1; i++){ printf( " in kernel - GPU : temper[%d] ==> %f \n", i, u_host[i]); } //} return; } NVIDIA Confidential
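One caveat about the kernel above: it updates u_dev in place, so a thread may read a neighbor that another thread has already overwritten within the same time step. The usual fix is double buffering: read from one array, write to another, and swap pointers on the host between steps. A minimal sketch (the names u_in/u_out, run_heat, and the launch shape are illustrative assumptions):

    __global__ void heat_step(const double *u_in, double *u_out, int mesh, double r) {
        int j = blockIdx.x * blockDim.x + threadIdx.x + 1;  /* interior points */
        if (j < mesh + 1)
            u_out[j] = (1 - 2*r) * u_in[j] + r * (u_in[j-1] + u_in[j+1]);
    }

    static void run_heat(double *u_dev_a, double *u_dev_b, int mesh, double r,
                         int nsteps, int grid, int block) {
        for (int i = 0; i < nsteps; ++i) {
            heat_step<<<grid, block>>>(u_dev_a, u_dev_b, mesh, r);
            double *tmp = u_dev_a; u_dev_a = u_dev_b; u_dev_b = tmp;  /* ping-pong */
        }
        cudaDeviceSynchronize();  /* kernels in one stream already run in order */
    }

Because each step reads only the previous time level, no __syncthreads() or atomics are needed; consecutive kernel launches in the same stream provide the step-to-step ordering.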
  • 40. How to Optimize? NVIDIA Confidential
  • 42. Computation Time: Malliavin MC results (closed-form Black-Scholes vs. number of simulations)
    type        Black-Scholes   5000     10000    50000    100000   200000   300000
    Price       12.8216         12.8706  12.8721  12.8413  12.8525  12.8559  12.8480
    Err         0.000%          0.38%    0.39%    0.15%    0.24%    0.27%    0.21%
    Delta       0.5858          0.5724   0.5887   0.5826   0.5826   0.5829   0.5824
    Err         0               2.28%    -0.51%   0.53%    0.53%    0.49%    0.58%
    Gamma       0.0130          0.0112   0.0134   0.0127   0.0127   0.0127   0.0127
    Err         0.00%           13.39%   -3.34%   2.07%    2.26%    2.21%    2.47%
    Total Time  0:00            00:03    00:05    00:25    00:50    01:44    02:33
    time : 100 sec target1 : > 1 sec (100X) Target2 : > 0.001 sec (2000X) NVIDIA Confidential
  • 43. Monte Carlo Simulation for Finance Excel Sheet Excel VBA C/C++ dll Link CUDA Acceleration NVIDIA Confidential
  • 44. Monte Carlo Code with VBA Function MC( S As Double, X As Double, T As Double, R As Double, _ Vol As Double, Q As Double, No As Double, rtype As String) As Double Simul_No = No dt = 1 'dt = 1 / 365 For K = 0 To Simul_No - 1 Juga = S For i = 0 To MaxStep - 1 Juga = Juga * Exp( (R - Q - Vol ^ 2 / 2) * dt + Vol * Sqr(dt) * MakeNorsD() ) Next price = Exp(-R * T) * Max(Juga - X, 0) sum_price = sum_price + price Next MC = sum_price / Simul_No End Function NVIDIA Confidential
  • 45. Malliavin Greeks: Greek computation for Monte Carlo simulation. Finite-difference estimator: $\frac{\partial}{\partial s} E[f(S)] \approx \frac{E[f(S_0 + \Delta S)] - E[f(S_0)]}{\Delta S}$. Malliavin approach: $\frac{\partial}{\partial s} E[f(S)] = E[f(S)\, W(S)]$ with Malliavin weights $W(S)$. With the Malliavin approach, we can save the computation time. NVIDIA Confidential
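For the geometric Brownian motion used in these slides, the weights that slide 47 codes up can be written out explicitly (a standard result, stated here to connect the formulas to the VBA variables WT, WT_delta, and WT_gamma):

    W_T = \frac{\ln S_T - \ln S_0 - (r - q - \sigma^2/2)\,T}{\sigma}

    \Delta = e^{-rT}\, E\!\left[(S_T - K)^+ \cdot \frac{W_T}{S_0\,\sigma\,T}\right]

    \Gamma = e^{-rT}\, E\!\left[(S_T - K)^+ \cdot \frac{1}{\sigma T S_0^2}\left(\frac{W_T^2}{\sigma T} - W_T - \frac{1}{\sigma}\right)\right]

The payoff is only evaluated, never differentiated, which is why a single simulated path can contribute to the price, delta, and gamma sums at the same time.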
  • 46. Problem To compare the accuracy, we compute the Price, Delta and Gamma of a vanilla call option. Approach: 1. Closed-form solution (VBA, C) 2. Monte Carlo (VBA, C) 3. Malliavin (VBA, C, CUDA v1, v2) NVIDIA Confidential
  • 47. Malliavin Monte Carlo Code with VBA Function Malliavin( S As Double, X As Double, T As Double, R As Double, _ Vol As Double, Q As Double, No As Double, rtype As String) As Double Simul_No = No dt = 1 'dt = 1 / 365 For K = 0 To Simul_No - 1 Juga = S For i = 0 To MaxStep - 1 Juga = Juga * Exp( (R - Q - Vol ^ 2 / 2) * dt + Vol * Sqr(dt) * MakeNorsD() ) Next WT = (Log(Juga) - Log(S) - (R - Q - 1 / 2 * Vol ^ 2) * T) / Vol WT_delta = (WT / (S * Vol * T)) WT_gamma = (1 / (Vol * T * S ^ 2)) * (WT ^ 2 / (Vol * T) - WT - 1 / Vol) price = Exp(-R * T) * Max(Juga - X, 0) delta = Exp(-R * T) * Max(Juga - X, 0) * WT_delta gamma = Exp(-R * T) * Max(Juga - X, 0) * WT_gamma sum_price = sum_price + price sum_delta = sum_delta + delta sum_gamma = sum_gamma + gamma Next Malliavin = sum_delta / Simul_No End Function NVIDIA Confidential
  • 48. Step1 Malliavin Monte Carlo C language sketch void Malliavin( double S, double X, double T, double R, double Vol, double Q, long No){ long Simul_No = No; double dt = 1; // dt = 1 / 365 for ( int K = 0; K < Simul_No; K++){ Juga = S; for (int i = 0; i < MaxStep; i++){ Juga = Juga * exp( (R - Q - Vol*Vol / 2) * dt + Vol * sqrt(dt) * norm() ); // rand with Box-Muller } WT = (log(Juga) - log(S) - (R - Q - 0.5 * Vol*Vol) * T) / Vol; WT_delta = WT / (S * Vol * T); WT_gamma = (1 / (Vol * T * S*S)) * (WT*WT / (Vol * T) - WT - 1 / Vol); price = exp(-R * T) * fmax(Juga - X, 0); delta = price * WT_delta; gamma = price * WT_gamma; sum.price = sum.price + price; sum.delta = sum.delta + delta; sum.gamma = sum.gamma + gamma; } r.price = sum.price / Simul_No; r.delta = sum.delta / Simul_No; r.gamma = sum.gamma / Simul_No; return; } NVIDIA Confidential
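For reference, a compilable version of this sketch might look like the following (a sketch: Box-Muller over rand() stands in for the RNG, and the market parameters are illustrative values, not from the slides):

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    /* one standard normal via Box-Muller (illustrative, not a production RNG) */
    static double norm01(void) {
        const double PI = 3.14159265358979323846;
        double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        return sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
    }

    int main(void) {
        double S = 100, X = 100, T = 3, R = 0.05, Vol = 0.3, Q = 0.0; /* assumed */
        long Simul_No = 100000;
        int MaxStep = (int)T;        /* dt = 1, matching the slides */
        double dt = 1.0, sum_p = 0, sum_d = 0, sum_g = 0;

        for (long K = 0; K < Simul_No; ++K) {
            double Juga = S;
            for (int i = 0; i < MaxStep; ++i)       /* simulate one GBM path */
                Juga *= exp((R - Q - 0.5*Vol*Vol)*dt + Vol*sqrt(dt)*norm01());

            double WT = (log(Juga) - log(S) - (R - Q - 0.5*Vol*Vol)*T) / Vol;
            double payoff = exp(-R*T) * fmax(Juga - X, 0.0);
            sum_p += payoff;                                  /* price term   */
            sum_d += payoff * WT / (S*Vol*T);                 /* delta weight */
            sum_g += payoff * (1.0/(Vol*T*S*S))
                            * (WT*WT/(Vol*T) - WT - 1.0/Vol); /* gamma weight */
        }
        printf("price %f delta %f gamma %f\n",
               sum_p/Simul_No, sum_d/Simul_No, sum_g/Simul_No);
        return 0;
    }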
  • 49. Step2 Sketch (kernel part) __global__ void Malliavin_compute( double S, double X, double T, double R, double Vol, double Q, long No){ long Simul_No = No; // per-thread count: Simul_No = total simulations / (N threads * M blocks) double dt = 1; // dt = 1 / 365 for ( int K = 0; K < Simul_No; K++){ Juga = S; for (int i = 0; i < MaxStep; i++){ Juga = Juga * exp( (R - Q - Vol*Vol / 2) * dt + Vol * sqrt(dt) * norm(K,i) ); } WT = (log(Juga) - log(S) - (R - Q - 0.5 * Vol*Vol) * T) / Vol; WT_delta = WT / (S * Vol * T); WT_gamma = (1 / (Vol * T * S*S)) * (WT*WT / (Vol * T) - WT - 1 / Vol); price = exp(-R * T) * fmax(Juga - X, 0); delta = price * WT_delta; gamma = price * WT_gamma; sum.price = sum.price + price; sum.delta = sum.delta + delta; sum.gamma = sum.gamma + gamma; } r.price = sum.price / Simul_No; r.delta = sum.delta / Simul_No; r.gamma = sum.gamma / Simul_No; // real price = Sum(r.price over all threads) / (N*M) return; } NVIDIA Confidential
  • 50. Step2 Parallel Memory Map Parallel cores RNG_compute1() uniform RNG_compute2() normal normal() stock Malliavin_compute() option data in global memory __device__ float * normal(int k, int j, int size_j, float * normal) { int index = k*size_j + j; return &normal[index]; } NVIDIA Confidential
  • 51. Step2 Sketch (host part) #include <stdio.h> __global__ void RNG_compute1(parameter); __global__ void RNG_compute2(parameter); __global__ void Malliavin_compute(parameter); main(){ malloc(); //cpu malloc cudaMalloc(); //GPU malloc cudaMemcpy(); // transfer RNG_compute1<<<N,M>>>(parameter); // generate RNG (uniform) RNG_compute2<<<N,M>>>(parameter); // generate RNG (BM, Moro) Malliavin_compute<<<N,M>>>(parameter); // simulation cudaMemcpy(); //get results return 0; } NVIDIA Confidential
  • 52. Step2 Malliavin Monte Carlo CUDA language sketch (rng part1) __global__ static void RNG_rand48_get_int(uint2 *state, int *res, int num_blocks, uint2 A, uint2 C) { const int nThreads = blockDim.x*gridDim.x; int nOutIdx = threadIdx.x + blockIdx.x*blockDim.x; uint2 lstate = state[nOutIdx]; int i; for (i = 0; i < num_blocks; ++i) { res[nOutIdx] = ( lstate.x >> 17 ) | ( lstate.y << 7); nOutIdx += nThreads; lstate = RNG_rand48_iterate_single(lstate, A, C); } nOutIdx = threadIdx.x + blockIdx.x*blockDim.x; state[nOutIdx] = lstate; } NVIDIA Confidential
  • 53. Step2 Malliavin Monte Carlo CUDA language sketch (rng part2) __global__ void RNG_compute2( int * uniform, float * normal, int length){ int index = blockDim.x*blockIdx.x + threadIdx.x; __shared__ int s[256]; // assumes 256 threads per block __shared__ float s_r[256]; if( threadIdx.x == 0){ for (int i = 0; i < blockDim.x; i++){ s[i] = uniform[blockDim.x*blockIdx.x + i]; // load uniform } } __syncthreads(); // all threads must see the loaded values s_r[threadIdx.x] = (float) moro(s[threadIdx.x]); // Moro inversion in parallel __syncthreads(); if( threadIdx.x == 0){ for (int i = 0; i < blockDim.x; i++){ normal[blockDim.x*blockIdx.x + i] = s_r[i]; // save normal } } } NVIDIA Confidential
  • 54. Box-Muller vs Moro Inversion __device__ void BoxMuller(float& u1, float& u2){ float r = sqrtf(-2.0f * logf(u1)); float phi = 2 * PI * u2; u1 = r * __cosf(phi); u2 = r * __sinf(phi); } __device__ float Moro( float u ){ // skip the const values x = u - 0.5; if (abs(x) < 0.42) { r = x*x; r = x * (((a4 * r + a3) * r + a2) * r + a1) / ((((b4 * r + b3) * r + b2) * r + b1) * r + 1); } else{ if (x > 0) r = log(-log(1 - u)); if (x <= 0) r = log(-log(u)); r = c1 + r * (c2 + r * (c3 + r * (c4 + r * (c5 + r * (c6 + r * (c7 + r * (c8 + r * c9))))))); if (x <= 0) r = -r; } return r; } NVIDIA Confidential
  • 55. Step3 New approach for Direct LCG and Parallel Memory Map Parallel cores Malliavin_compute() rand_new(seed) uniform moro(u) normal stock option data in registers double rand_new( int *next){ *next = (a * (*next) + b) % M; return (double)(*next) * (1.0 / M); } NVIDIA Confidential
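A per-thread version of this direct-LCG idea needs one state word per thread and well-separated seeds. A minimal sketch (the constants are the classic Numerical-Recipes-style LCG and the seeding scheme is an illustrative assumption, not from the slides):

    __device__ unsigned int lcg_next(unsigned int *state) {
        *state = 1664525u * (*state) + 1013904223u;  /* modulo 2^32 is implicit */
        return *state;
    }

    __device__ float lcg_uniform(unsigned int *state) {
        return lcg_next(state) * (1.0f / 4294967296.0f);  /* map to [0, 1) */
    }

    __global__ void uniform_kernel(float *out, int n, unsigned int seed) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int state = seed ^ (2654435761u * (tid + 1)); /* crude per-thread seed */
        if (tid < n)
            out[tid] = lcg_uniform(&state);
    }

Keeping the state in a register is what makes this cheap, but plain LCGs have known statistical weaknesses, so for production Monte Carlo the deck's rand48 generator (or a library generator) is the safer choice.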
  • 56. Platform for finance Excel Sheet Excel VBA C/C++ dll Link Socket communication Grid Manager GPU cluster 12 nodes system (1/2 for backup) 12*8 GPU*448 core = 43,000 core NVIDIA Confidential 8 GPU per node
  • 57. How to Optimize? NVIDIA Confidential
  • 59. Multivariate Time-series m S(1) Corr(1,2) Corr(1,3) S(2) Corr(2,3) S(3) Corr(1,n) S(n) NVIDIA Confidential
  • 60. How to parallelize with CUDA serialize S(1) S(2) S(3) S(n) NVIDIA Confidential An inefficient approach for shared memory usage
  • 61. How to parallelize with CUDA S(1) S(2) serialize S(3) S(n) An efficient approach for shared memory usage NVIDIA Confidential Needs a reduction technique for the mean, etc.
  • 62. How to parallelize with CUDA S(1) S(2) serialize Parallelize S(3) With multiNode multiGPU S(n) Good method for multiGPU & large Cluster NVIDIA Confidential
  • 63. In pair (i, j), Pearson Correlation Coefficient $\rho = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$; expanding, $\sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}$, $\sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2$, $\sum (y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2$, so $\rho = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sqrt{(\sum x_i^2 - n\bar{x}^2)(\sum y_i^2 - n\bar{y}^2)}}$. We can parallelize the summation! After the summation, find the mean. NVIDIA Confidential
  • 64. How to parallelize with CUDA : Flow Chart Method 1: input A, B; find the mean of A, B (start by summing (Ai), (Bi)); then sum (AiBi), (Ai^2), (Bi^2) with the means; find Cov(A,B), Cov(A,A), Cov(B,B). Method 2: input A, B; start by summing (Ai), (Bi), (AiBi), (Ai^2), (Bi^2) in one pass; find the mean of A, B; find Cov(A,B), Cov(A,A), Cov(B,B). Benefit of Method 1: easy to implement. Benefit of Method 2: one-shot sums (speed-up) with two functions. R+CUDA project http://brainarray.mbni.med.umich.edu/Brainarray/rgpgpu/ NVIDIA Confidential
  • 65. In pair (i, j), Pearson Correlation Coefficient FOR (i, j) pair : serial FOR k (time-series) : parallel compute $\sum x_k$, $\sum y_k$, $\sum x_k y_k$, $\sum x_k^2$, $\sum y_k^2$ reduction for results compute mean(i), mean(j), compute cov(i,j), cov(i,i), cov(j,j) compute corr(i,j) NVIDIA Confidential
  • 66. In pair (i, j), Pearson Correlation Coefficient FOR (i, j) pair : serial FOR k (time-series) : parallel, using shared memory compute $\sum x_k$, $\sum y_k$, $\sum x_k y_k$, $\sum x_k^2$, $\sum y_k^2$ reduction for results compute mean(i), mean(j), compute cov(i,j), cov(i,i), cov(j,j) compute corr(i,j) (a kernel sketch follows below) NVIDIA Confidential
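A shared-memory version of that inner loop might look like the sketch below: each thread accumulates a strided slice of one (i, j) pair's series, the block reduces the five running sums in shared memory, and block totals are combined with atomics (the block size, single-pair scope, and float precision are illustrative assumptions):

    #define TPB 256   /* threads per block (assumed) */

    __global__ void corr_sums(const float *x, const float *y, int n, float *out5) {
        __shared__ float sx[TPB], sy[TPB], sxy[TPB], sxx[TPB], syy[TPB];
        int t = threadIdx.x;
        float ax = 0, ay = 0, axy = 0, axx = 0, ayy = 0;

        /* grid-stride loop: each thread accumulates part of the series */
        for (int k = blockIdx.x * blockDim.x + t; k < n; k += gridDim.x * blockDim.x) {
            float xv = x[k], yv = y[k];
            ax += xv; ay += yv; axy += xv*yv; axx += xv*xv; ayy += yv*yv;
        }
        sx[t] = ax; sy[t] = ay; sxy[t] = axy; sxx[t] = axx; syy[t] = ayy;
        __syncthreads();

        /* tree reduction in shared memory */
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (t < s) {
                sx[t] += sx[t+s]; sy[t] += sy[t+s]; sxy[t] += sxy[t+s];
                sxx[t] += sxx[t+s]; syy[t] += syy[t+s];
            }
            __syncthreads();
        }
        if (t == 0) {  /* combine block totals; float atomicAdd needs Fermi (CC 2.0) */
            atomicAdd(&out5[0], sx[0]);  atomicAdd(&out5[1], sy[0]);
            atomicAdd(&out5[2], sxy[0]); atomicAdd(&out5[3], sxx[0]);
            atomicAdd(&out5[4], syy[0]);
        }
    }

With the five sums on the host, the correlation follows directly from the expanded formula on slide 63.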
  • 67. How to Optimize? NVIDIA Confidential
  • 69. System Optimization Tips before starting CUDA programming
  • 70. Remote Use VNC protocol: VNC sends a snapshot of the real screen; WDDM CUDA enabled (Remote Host, Client). RDP protocol: the RDP driver intercepts the CUDA driver; WDDM CUDA disabled (Remote Host, Client). NVIDIA Confidential
  • 71. TCC driver CUDA Application result GPU RDP protocol RDP driver RDP WDM WDDM CUDA enabled Remote Host Client • Remote Desktop • Kernel launch overhead • GPU exclusive mode • Windows service for Session 0/1 • Large single malloc NVIDIA Confidential
  • 72. GPU exclusive Mode for multiuser GPU 3 GPU 2 GPU Pool GPU 1 Remote users GPU 0 Remote users Server Remote users NVIDIA Confidential
  • 73. GPUDirect for GPU cluster MPI Communication System Pinned Memory System Pinned Memory share System Pageable Memory NVIDIA Confidential
  • 74. FBO on SDI with Quadro [Figure: two paths for third-party SDI input to NVIDIA SDI output: one writes through the CPU/system memory controller into GPU video memory, the other writes directly into an OpenGL Frame Buffer Object on the Quadro] Write to host memory and then to GPU memory vs. direct write to the OpenGL Frame Buffer Object NVIDIA Confidential
  • 75. Conceptual Tips for CUDA optimization
  • 77. Abstract overview of CPU Core CPU Memory Bank inst. fetch Register Bank RCX RDX RBX RAX L1 inst. cache L1 FPU ALU LSU data cache NVIDIA Confidential
  • 78. Threads on Multicore CPU CPU Core CPU General Programming reg TH Core CPU Winthread, pthread reg TH TH Core CPU Hyper-threading reg TH TH NVIDIA Confidential TH TH
  • 79. Threads on Many-core GPU H/W Multi Processor vs S/W Active Block MP SP reg TH TH TH TH TH TH TH TH TH TH Block Block Block MP SP reg TH TH TH TH TH NVIDIA Confidential Block Block 79
  • 80. Overview of WARP schedule <<<0,0>>> Warp 0 <<<0,1>>> <<<0,31>>> MP schedule <<<0,32>>> <<<0,33>>> <<<0,34>>> warp threads 8 SP core Register per threads NVIDIA Confidential
  • 81. Overview of Instruction Fetch blockIdx.x=0, threadIdx.x=3 0 1 2 3 4 5 6 7 Load A, B FP operation cores ADD Store C 0 1 2 3 4 5 6 7 NVIDIA Confidential
  • 82. Overview of Instruction Fetch blockIdx.x=0, threadIdx.x=4 0 1 2 3 4 5 6 7 Load A, B FP operation cores ADD Store C 0 1 2 3 4 5 6 7 NVIDIA Confidential
  • 83. Thread schedule within MP : WARP MP 1024 * 30 : 30K threads look like concurrent threads B1 B2 WARP NVIDIA Confidential
  • 84. Occupancy Calculator on CUDA SDK NVIDIA Confidential
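As a worked example of what the calculator reports, take a compute-capability-2.0 (Fermi) multiprocessor, which offers 32768 registers, at most 1536 resident threads, and at most 48 resident warps, and assume a kernel that uses 32 registers per thread with 256-thread blocks (the kernel figures are illustrative inputs, not from the slides):

    registers per block : 32 regs x 256 threads = 8192
    blocks per SM       : 32768 / 8192          = 4 (register-limited)
    resident threads    : 4 x 256 = 1024  (under the 1536 maximum)
    resident warps      : 1024 / 32 = 32
    occupancy           : 32 / 48 = 67%

Cutting register use to 20 per thread would lift the block limit to 6 and reach full occupancy; this is exactly the kind of trade-off the spreadsheet makes visible (shared-memory limits, ignored here, constrain occupancy the same way).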
  • 85. CUDA profiler on CUDA toolkit NVIDIA Confidential
  • 86. Parallel NSight 1.5 Professional NVIDIA Confidential
  • 88. Managing Memory [Figure: host/device memory hierarchy: CPU, chipset, and DRAM on the host; on the GPU, each multiprocessor has registers and L1 cache/shared memory, backed by an L2 cache and local/global memory in device DRAM, ordered by decreasing bandwidth and increasing size] NVIDIA Confidential
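Since the slowest links in that hierarchy are the host-device transfers, it pays to measure them. A minimal timing sketch using CUDA events (the 64 MB buffer size and the pinned-memory choice are illustrative):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        const size_t bytes = 64 << 20;       /* 64 MB test buffer (illustrative) */
        float *h, *d;
        cudaMallocHost((void **)&h, bytes);  /* pinned host memory: faster DMA */
        cudaMalloc((void **)&d, bytes);

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);

        cudaEventRecord(t0, 0);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(t1, 0);
        cudaEventSynchronize(t1);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("Host to Device: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

        cudaEventDestroy(t0); cudaEventDestroy(t1);
        cudaFreeHost(h); cudaFree(d);
        return 0;
    }

Comparing the pinned number against a plain malloc() buffer usually shows a large gap, which is why staging through pinned memory (and the GPUDirect sharing shown on slide 73) matters for cluster codes.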
  • 89. Matrix Size for Best CUBLAS 3.2 Performance [Charts: SGEMM performance (Gflops) peaks at matrix sizes that are multiples of 96, from 96 x 96 up to 8448 x 8448; DGEMM peaks at multiples of 64, from 64 x 64 up to 8384 x 8384. cuBLAS 3.2 on NVIDIA Tesla C1060 and Tesla C2050 (Fermi) vs. MKL 10.2.4.32 on a Quad-Core Intel Xeon 5550, 2.67 GHz] NVIDIA Confidential
  • 90. cuBLAS level III [Chart: DGEMM performance (Gflops) for matrix sizes that are multiples of 64, from 64 x 64 up to 3392 x 3392, same cuBLAS 3.2 vs. MKL 10.2.4.32 setup] NVIDIA Confidential
  • 91. Roofline Analysis (Arithmetic Intensity) Samuel Williams, Andrew Waterman, and David Patterson, Roofline: An Insightful Visual Performance Model for Multicore Architectures NVIDIA Confidential
  • 92. Perf. on CUDA Application [Figure: roofline plot: performance (Gflops) vs. arithmetic intensity (F/B), with ceilings set by the bandwidth of the PCI-e slot and the bandwidth of global memory] NVIDIA Confidential
  • 93. Tips for Optimization Consider the algorithm's suitability for parallelism (naïve algorithms often map well to the GPU) Consider occupancy (SIMT) Consider the memory bottleneck NVIDIA Confidential
  • 94. More Information for CUDA Optimization CUDA Zone http://www.nvidia.com/CUDA Developer Zone http://developer.nvidia.com GTC 2010 contents http://www.nvidia.com/gtc2010-content 쿠다 카페 (CUDA café in Korea) http://cafe.daum.net/KCUG NVIDIA Confidential
  • 95. Thanks Hyungon Ryu hryu@nvidia.com