CUDA and Fermi
Optimization Techniques
   Hyungon Ryu, PSG SA

         CUDA 101 (General Overview)
         Toy FDM Example
         Monte Carlo Simulation & Time Series Analysis
         General CUDA Optimization Tips

CUDA 101

General Overview

[Figure: CUDA at the overlap of graphics (GPU) and parallel programming]

CUDA Parallel programming

         GPU knowledge
                      Cg
                      OpenGL
                      DirectX
         Parallel knowledge
                      Pthreads / winthreads
                      MMX, SSE
                      PVM / MPI
         Heterogeneous knowledge
                      GPU
                      Parallel DSP
                      Parallel ASIC
                      Parallel FPGA
                      Cell BE

         CUDA: parallel computing with the GPU
Parallel Model in OpenMP

[Figure: CPU Program]

CUDA parallel Model

[Figure: CPU Program -> Kernel Launch -> GPU threads]


NVIDIA Confidential
Saxpy Example : CPU serial

       for (int i = 0; i < n; ++i) {
             y[i] = a*x[i] + y[i];
       }

Example of Saxpy Parallel : OpenMP
/* the loop index is declared in the for statement, so it is private automatically */
#pragma omp parallel for shared(n, a, x, y)
for (int i = 0; i < n; ++i) {
      y[i] = a*x[i] + y[i];
}

Example of Saxpy Parallel : MPI
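/* Split the index range [lowest, highest] into nprocs contiguous blocks. */
void para_range(int lowest, int highest, int nprocs, int myrank,
                int *start, int *end) {
    int wk1, wk2;
    wk1 = (highest - lowest + 1) / nprocs;   /* base block size */
    wk2 = (highest - lowest + 1) % nprocs;   /* leftover elements */
    *start = myrank * wk1 + lowest + ((myrank < wk2) ? myrank : wk2);
    *end = *start + wk1 - 1;
    if (wk2 > myrank) *end = *end + 1;       /* the first wk2 ranks take one extra element */
}

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
para_range(0, n - 1, nprocs, myrank, &start, &end);

for (i = start; i <= end; i++)               /* end is inclusive */
    y[i] = a * x[i] + y[i];
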
Example of Saxpy Parallel : SSE
                 #include <emmintrin.h>   /* SSE2 integer intrinsics */

                 /* x, y, z must be 16-byte aligned; n is assumed to be a multiple of 8 */
                 void saxpy_vector(short *z, short *x, short *y, short a, unsigned n) {
                    __m128i* x_ptr = (__m128i*) x;
                    __m128i* y_ptr = (__m128i*) y;
                    __m128i* z_ptr = (__m128i*) z;
                    __m128i a_vec = _mm_set1_epi16(a);   /* broadcast a to all 8 lanes */
                    unsigned i;
                    for (i = 0; i < n/8; ++i) {
                      __m128i x_vec = x_ptr[i];
                      __m128i y_vec = y_ptr[i];
                      __m128i z_vec = _mm_add_epi16(_mm_mullo_epi16(x_vec, a_vec), y_vec);
                      z_ptr[i] = z_vec;
                    }
                 }
Saxpy Parallel : CUDA

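       __global__ void Saxpy(int n, float a, float *x, float *y) {
              int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
              if (i < n)
                     y[i] = a * x[i] + y[i];
       }

       Saxpy<<<N, M>>>(n, 2.0, x, y);   // N blocks of M threads each
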
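To make the index mapping of the kernel launch concrete, the launch can be emulated on the CPU in plain C. This is only a sketch: `saxpy_thread` and `saxpy_emulated_launch` are names invented for illustration and are not part of CUDA. The block and thread indices that a real kernel reads from `blockIdx`, `blockDim` and `threadIdx` are passed as ordinary arguments.

```c
#include <assert.h>

/* Body of one emulated CUDA thread. A real kernel would read the three
   indices from blockIdx.x, blockDim.x and threadIdx.x. */
void saxpy_thread(int block_idx, int block_dim, int thread_idx,
                  int n, float a, const float *x, float *y)
{
    int i = block_idx * block_dim + thread_idx;  /* global element index */
    if (i < n)                                   /* guard the last, partial block */
        y[i] = a * x[i] + y[i];
}

/* Serial stand-in for Saxpy<<<num_blocks, block_dim>>>(n, a, x, y). */
void saxpy_emulated_launch(int num_blocks, int block_dim,
                           int n, float a, const float *x, float *y)
{
    assert(num_blocks > 0 && block_dim > 0);
    for (int b = 0; b < num_blocks; ++b)
        for (int t = 0; t < block_dim; ++t)
            saxpy_thread(b, block_dim, t, n, a, x, y);
}
```

With 2 blocks of 4 threads and n = 5, eight thread bodies run but only five pass the `i < n` guard, which is why the guard is needed whenever n is not a multiple of the block size.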
CUDA C extension
Launch the kernel
     Function <<< Grid, Block >>> ( parameter);

Additional C API for memory control
      cudaXXX (runtime API) : cudaMalloc, cudaMemcpy
      cuXXX   (driver API)  : cuMemAlloc, cuMemcpyHtoD
      cutXXX  (CUTIL helper functions from the SDK)

Function qualifiers
      __global__, __device__, __host__,     __device__ __host__
Memory qualifiers
      __shared__, __device__, registers / local memory

Pre-defined variables and constants
      gridDim, blockDim, blockIdx, threadIdx, cudaMemcpyHostToDevice

Pre-defined functions
      __syncthreads(), __mul24(), etc.
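The Grid and Block arguments of a launch are usually derived from the problem size. A minimal host-side sketch in plain C, assuming one thread per element; `blocks_for` and `idle_threads` are hypothetical helper names, not CUDA functions:

```c
#include <assert.h>

/* Number of blocks so that blocks * threads_per_block >= n. */
int blocks_for(int n, int threads_per_block)
{
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;  /* ceiling division */
}

/* Threads in the last block that have no element to work on. */
int idle_threads(int n, int threads_per_block)
{
    return blocks_for(n, threads_per_block) * threads_per_block - n;
}
```

For example, 1000 elements with 256 threads per block need 4 blocks, and 24 threads of the last block must be masked off by an index check inside the kernel.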
Process of CUDA developing

   Serial                     CUDA parallel

   serial Programming         serial Programming
                              CUDA convert
   Release                    Parallelize

CUDA is Parallel Computing !!!

   Serial                MPI parallel                 CUDA parallel

                         Algorithm                    Algorithm
   serial Programming    serial Programming           serial Programming
                         Compile                      Compile
                         parallel Programming         CUDA convert
                         Profile                      Profile
                         Parallelize                  Parallelize
                         Compile                      Compile
                         Debugging [totalview]        Debugging
                         Optimize                     Optimize/profile
   Release               Release                      Release
[Figure: a CPU program launches a kernel onto the GPU. Each TPC has a geometry controller and multiprocessors, each with an I-cache, an MT issue unit, SP cores, SFUs, a DP unit and shared memory, plus a shared texture unit with Tex L1 cache.]

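Image Processing Diagram without CUDA

[Figure: timeline. Camera capture, then a for loop on the CPU (loop start to loop end); the total time covers both.]

Image Processing Diagram with CUDA

[Figure: timeline. Camera capture, memcpy, then the for loop (loop start to loop end) executed as CUDA parallel work; the total time is shorter.]
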
1D Heat Equation
       CUDA toy example
for undergraduate students

[Figure: 1D bar]

1D Heat Equation

\[ \frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} \]          1D heat equation

\[ u(0, t) = u(1, t) = 0 \]          Boundary condition

\[ u(x, 0) = u_0 \]                  Initial condition

Discretization
\[ \frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} \]

\[ \frac{u_{j,i+1} - u_{j,i}}{\Delta t} = \frac{u_{j+1,i} - 2u_{j,i} + u_{j-1,i}}{\Delta x^2} \]
      Forward difference in time, second-order difference in space; i (time), j (space)

\[ u_{j,i+1} = r\,u_{j-1,i} + (1 - 2r)\,u_{j,i} + r\,u_{j+1,i} \]          Explicit method

\[ r = \Delta t / \Delta x^2 \]
      u[i+1][j] = r*u[i][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1];
Discretization (stencil)

\[ u_{j,i+1} = r\,u_{j-1,i} + (1 - 2r)\,u_{j,i} + r\,u_{j+1,i} \]

[Figure: stencil. The point (j, i+1) is computed from the points around (j, i) at time level i.]

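Conceptual Diagram

[Figure: space-time grid with the initial condition at time 0 and the boundary conditions at both ends. Point j is updated from points j-1, j and j+1 of the previous time step; the center weight is (1-2*r).]
      u[i+1][j] = r*u[i][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1];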
Explicit Pseudo Code

         Parameter and data initialization
                      Stencil, boundary/initial condition

         FOR LOOP (time, i)
            FOR LOOP (stencil, j)
                 Update the stencil relation
                         u[i+1][j] = r*u[i][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1];

CPU code

         u[i][j] vs. u[j]
                      u[i][j] (full time history) is easy to develop
                         Possible to visualize the process

                      u[j] (current time step only) uses memory efficiently
                         Keeps only the last result

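The u[j] layout can be sketched with two buffers that swap roles after every time step. This is a minimal plain C illustration, assuming fixed boundary values at both ends; `heat_step` and `heat_run` are names chosen here, not taken from the slides.

```c
#include <assert.h>

/* One explicit time step on a 1D bar with m points.
   u_old holds the current time level, u_new receives the next one.
   The two end points are fixed boundaries and are copied unchanged. */
void heat_step(const double *u_old, double *u_new, int m, double r)
{
    assert(m >= 3);
    u_new[0] = u_old[0];
    u_new[m - 1] = u_old[m - 1];
    for (int j = 1; j < m - 1; ++j)
        u_new[j] = r * u_old[j - 1] + (1.0 - 2.0 * r) * u_old[j] + r * u_old[j + 1];
}

/* Advance `steps` time levels using only two buffers.
   Returns the buffer that holds the final result. */
double *heat_run(double *a, double *b, int m, double r, int steps)
{
    double *cur = a, *next = b;
    for (int i = 0; i < steps; ++i) {
        heat_step(cur, next, m, r);
        double *tmp = cur; cur = next; next = tmp;  /* swap roles, no copy */
    }
    return cur;
}
```

Memory use is 2*m values regardless of the number of time steps, at the cost of losing the history that u[i][j] keeps. Note that the explicit scheme is only stable for r <= 1/2.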
Main algorithm
   for (i = 0; i < N; i++) {                          /* Time iteration */

       for (j = 1; j < M - 1; j++) {                  /* Space iteration (interior points only) */

           u[i+1][j] =       r   * u[i][j-1]          /* Heat relation */
                     + (1-2*r)   * u[i][j]
                     +       r   * u[i][j+1];
       }
   }

Boundary Condition
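[Figure: boundary handling. Fixed boundary: the boundary values are copied at every time step. Free boundary: the end value is copied from a computed point. Interior points up to N-1 are computed.]
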
How to Parallelize
         Parallelize the space stencil
         Each thread (core) is dedicated to one space stencil element

[Figure: points j-1, j, j+1 at time 0 update point j; the time axis runs from 0 to 600]

         The time stencil cannot be parallelized: each time step needs the result of the previous one.
How to parallelize

         Parameter and data Initialization
                      Stencil, boundary/initial condition,
         FOR LOOP (time, i)
             FOR LOOP (stencil, j)
                  Update the stencil relation

Space parallelization

[Figure: rows for time 0, 1, 2, 3. Within each time step all space points are processed in parallel: first the heat relation, then the boundary condition.]

CPU code
do {
      time += dt; printf("aaaaaa %f\n", time);
      for (i = 1; i < mesh+1; i++) {
            temper[i].new_1 = (double) (1-2*r)*temper[i].old + r*(temper[i-1].old + temper[i+1].old);
            printf("\n processing \t %d %f %f \n", i, temper[i].new_1, temper[i].old);
      }
      temper[mesh+1].new_1 = temper[mesh-1].new_1;   /* boundary value copied from a computed point */
      printf("\t print results %d %f %f \n", mesh+1, temper[mesh+1].new_1, temper[mesh-1].new_1);

      for (i = 1; i < mesh+2; i++) {                 /* new values become old for the next step */
            temper[i].old = temper[i].new_1;
            printf("aatt %d %f %f \n", i, temper[i].new_1, temper[i].new_1);
      }
      if ((++count % print_step) == 0) {
            printf("hh ttt %10.5lf", time);
            for (i = 0; i < mesh; i += 2)
                  printf("df %8.4lf", temper[i].new_1);
            printf("fd %8.4lf\n", temper[mesh].new_1);
            printf("\n\n time print %f\n", time); getchar();
      }
} while (time < end_time);
if ((count % print_step)) {
      printf("bghj %10.5lf", time);
      for (i = 0; i < mesh; i += 2)
            printf("ahg %8.4lf", temper[i].new_1);
      printf("nhgf %8.4lf\n", temper[mesh].new_1);
}
GPUcode-01 Memory Map

[Figure: left, the CPU path solves the heat relation and boundary condition and produces a result; right, the GPU path bypasses the solve and only returns a result.]

GPUcode-01 Malloc Template
double* init_GPU_data(struct flow * temper, int mesh)
{
          double *u_dev;  // device buffer (GPU data upload)
          size_t gpuMemsize = sizeof(double)*(mesh+2);
          double *u_host; // temporary host buffer
          cudaMalloc( (void**)&u_dev, gpuMemsize); cudaErr("malloc u_dev");
          u_host = (double *) malloc( gpuMemsize);
          for(int i=0;i<mesh+2;i++){   // fill all mesh+2 points, boundaries included
                    u_host[i]= temper[i].old;
                    printf("before %d : data initial :u_host[%d]= %f temper[%d].old =%f\n", i, i, u_host[i], i, temper[i].old);
          }
          cudaMemcpy(u_dev, u_host, gpuMemsize, cudaMemcpyHostToDevice); cudaErr("memcpy u_host to u_dev");
          // copy back once to verify the round trip
          cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost); cudaErr("memcpy u_dev to u_host");
          for(int i=0;i<mesh+2;i++){
                    printf("after %d : data initial :u_host[%d]= %f temper[%d].old =%f\n", i, i, u_host[i], i, temper[i].old);
          }
          free(u_host);
          return u_dev;
}

GPUcode-01 Launch Template
__global__ void functionG(double *u_dev, int meshsize, double r, double bound )
{
          int idx = blockIdx.x*blockDim.x+threadIdx.x;
          int i = idx+1;
          if(idx < meshsize+1) {
                    u_dev[i] = i*0.01;   // dummy value: the template does not solve anything yet
          }
          if(idx == 6){
                    /* ... */
          }
}

void compute_GPU(double *u_dev, double *u_host, double dt, double dx, double r, int mesh,
                 int print_step, int count, double time, double end_time, double bound )
{
          size_t gpuMemsize = sizeof(double)*(mesh+2);
          //for( int i=0; i < 6000 ; i++){ //time step
                    functionG<<<4,5>>>(u_dev, mesh, r, bound); cudaErr2("kernel launch",1,0);   // 4 blocks x 5 threads
                    cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost); cudaErr("memcpy u_dev to u_host");
          //}
          for(int i=0; i<mesh+1; i++){
                    printf( " in kernel - GPU : temper[%d] ==> %f \n", i, u_host[i]);
          }
}

GPUcode-02 Solving

[Figure: both the CPU path and the GPU path now solve the heat relation and the boundary condition and produce a result.]

GPUcode-02 Malloc part
          init_GPU_data() is identical to the GPUcode-01 Malloc Template.

GPUcode-02 Launch part
__global__ void functionG(double *u_dev, int meshsize, double r, double bound )
{
          int idx = blockIdx.x*blockDim.x+threadIdx.x;
          int i = idx+1;
          if(idx < meshsize) {   // interior points 1..meshsize, so u_dev[i+1] stays inside the mesh+2 buffer
                    // caution: this in-place update may read neighbors that another thread has already overwritten
                    u_dev[i] = (double) (1-2*r)*u_dev[i] + r*(u_dev[i-1] + u_dev[i+1] );
          //        u_dev[i]= i*0.01;
          }
          if(idx == 6){
                    /* ... */
          }
}

void compute_GPU(double *u_dev, double *u_host, double dt, double dx, double r, int mesh,
                 int print_step, int count, double time, double end_time, double bound )
{
          size_t gpuMemsize = sizeof(double)*(mesh+2);
          //for( int i=0; i < 6000 ; i++){ //time step
                    functionG<<<4,5>>>(u_dev, mesh, r, bound); cudaErr2("kernel launch",1,0);   // 4 blocks x 5 threads
                    cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost); cudaErr("memcpy u_dev to u_host");
          //}
          for(int i=0; i<mesh+1; i++){
                    printf( " in kernel - GPU : temper[%d] ==> %f \n", i, u_host[i]);
          }
}

How to Optimize?

Monte Carlo Simulation
Computation Time

        Malliavin MC results
        Simulations            Black-Scholes    5000     10000     50000       100000     200000    300000
        Price                      12.8216    12.8706   12.8721   12.8413      12.8525    12.8559   12.8480
        Err                         0.00%      0.38%     0.39%     0.15%        0.24%      0.27%     0.21%
        Delta                       0.5858     0.5724    0.5887    0.5826       0.5826     0.5829    0.5824
        Err                         0.00%      2.28%    -0.51%     0.53%        0.53%      0.49%     0.58%
        Gamma                       0.0130     0.0112    0.0134    0.0127       0.0127     0.0127    0.0127
        Err                         0.00%     13.39%    -3.34%     2.07%        2.26%      2.21%     2.47%
        Total Time (mm:ss)          00:00      00:03     00:05     00:25        00:50      01:44     02:33

                                                                                 Time : 100 sec
                                                                             Target 1 : > 1 sec (100X)
                                                                             Target 2 : > 0.001 sec (2000X)

Monte Carlo Simulation for Finance
         Excel Sheet -> Excel VBA -> C/C++ dll Link -> CUDA Acceleration

Monte Carlo Code with VBA

Function MC(S As Double, X As Double, T As Double, R As Double, _
            Vol As Double, Q As Double, No As Double, rtype As String) As Double
  Simul_No = No
  dt = 1 'dt = 1 / 365
  For K = 0 To Simul_No - 1
     Juga = S   ' Juga: simulated stock price
     For i = 0 To MaxStep - 1
         Juga = Juga * Exp((R - Q - Vol ^ 2 / 2) * dt + Vol * Sqr(dt) * MakeNorsD())
     Next i
     price = Exp(-R * T) * Max(Juga - X, 0)
     sum_price = sum_price + price
  Next K
  MC = sum_price / Simul_No
End Function

Malliavin Greeks

         Greek computation for Monte Carlo simulation (finite difference)
\[ \frac{\partial}{\partial S} E[f(S)] \approx \frac{E[f(S_0 + \Delta S)] - E[f(S_0)]}{\Delta S} \]

         Malliavin approach
\[ \frac{\partial}{\partial S} E[f(S)] = E[f(S)\, W(S)] \]          W(S): Malliavin weights

         With the Malliavin approach, we can save computation time.

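For the vanilla call treated in the code that follows, the weights can be written out explicitly. The block below only transcribes the expressions `WT`, `WT_delta` and `WT_gamma` from that code into formulas; the symbol names are chosen here: $S$ is the initial price, $S_T$ the simulated final price, $X$ the strike, $\sigma$ = `Vol`, $r$ = `R`, $q$ = `Q`.

```latex
W_T = \frac{\ln S_T - \ln S - \left(r - q - \tfrac{1}{2}\sigma^2\right) T}{\sigma}

\text{Price} = E\!\left[ e^{-rT} \max(S_T - X, 0) \right]

\Delta = E\!\left[ e^{-rT} \max(S_T - X, 0)\, \frac{W_T}{S \sigma T} \right]

\Gamma = E\!\left[ e^{-rT} \max(S_T - X, 0)\, \frac{1}{\sigma T S^2}
         \left( \frac{W_T^2}{\sigma T} - W_T - \frac{1}{\sigma} \right) \right]
```

All three expectations are averages over the same simulated paths, so Delta and Gamma need no extra simulation runs with a shifted initial price.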

Problem

 To compare the accuracy, we compute the Price, Delta and Gamma of a vanilla
   call option.

 1. Closed-form solution (VBA, C)
 2. Monte Carlo (VBA, C)
 3. Malliavin (VBA, C, CUDA v1, v2)

Malliavin Monte Carlo Code with VBA

Function Malliavin(S As Double, X As Double, T As Double, R As Double, _
                   Vol As Double, Q As Double, No As Double, rtype As String) As Double
  Simul_No = No
  dt = 1 'dt = 1 / 365
  For K = 0 To Simul_No - 1
     Juga = S   ' Juga: simulated stock price
     For i = 0 To MaxStep - 1
         Juga = Juga * Exp((R - Q - Vol ^ 2 / 2) * dt + Vol * Sqr(dt) * MakeNorsD())
     Next i
     WT = (Log(Juga) - Log(S) - (R - Q - 1 / 2 * Vol ^ 2) * T) / Vol
     WT_delta = (WT / (S * Vol * T))
     WT_gamma = (1 / (Vol * T * S ^ 2)) * (WT ^ 2 / (Vol * T) - WT - 1 / Vol)

     price = Exp(-R * T) * Max(Juga - X, 0)
     delta = Exp(-R * T) * Max(Juga - X, 0) * WT_delta
     gamma = Exp(-R * T) * Max(Juga - X, 0) * WT_gamma

     sum_price = sum_price + price
     sum_delta = sum_delta + delta
     sum_gamma = sum_gamma + gamma
  Next K

  Malliavin = sum_delta / Simul_No
End Function
Step1 Malliavin Monte Carlo C language sketch
#include <math.h>

struct result { double price, delta, gamma; };

/* MaxStep and norm() (normal random number via Box-Muller) are defined elsewhere. */
int Malliavin(double S, double X, double T, double R, double Vol, double Q, long No, struct result *r) {
  long Simul_No = No;
  double dt = 1; // dt = 1 / 365
  struct result sum = {0.0, 0.0, 0.0};
  for (long K = 0; K < Simul_No; K++) {
     double Juga = S;   /* Juga: simulated stock price */
     for (int i = 0; i < MaxStep; i++) {
        Juga = Juga * exp((R - Q - Vol * Vol / 2) * dt + Vol * sqrt(dt) * norm());
     }
     /* note: in C, ^ is XOR and 1 / 2 is integer division, so both are written out */
     double WT = (log(Juga) - log(S) - (R - Q - 0.5 * Vol * Vol) * T) / Vol;
     double WT_delta = WT / (S * Vol * T);
     double WT_gamma = (1 / (Vol * T * S * S)) * (WT * WT / (Vol * T) - WT - 1 / Vol);

     double price = exp(-R * T) * fmax(Juga - X, 0.0);
     sum.price = sum.price + price;
     sum.delta = sum.delta + price * WT_delta;
     sum.gamma = sum.gamma + price * WT_gamma;
  }
  r->price = sum.price / Simul_No;
  r->delta = sum.delta / Simul_No;
  r->gamma = sum.gamma / Simul_No;
  return 0;
}
Step2 Sketch (kernel part)
// struct result { double price, delta, gamma; };
// Each thread simulates Simul_No = Total Sim / (N threads * M blocks) paths.
// r is an assumed output buffer with one partial result per thread.
__global__ void Malliavin_compute(double S, double X, double T, double R, double Vol, double Q, long No,
                                  struct result *r) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  long Simul_No = No;
  double dt = 1; // dt = 1 / 365
  struct result sum = {0.0, 0.0, 0.0};
  for (long K = 0; K < Simul_No; K++) {
     double Juga = S;
     for (int i = 0; i < MaxStep; i++) {
        Juga = Juga * exp((R - Q - Vol * Vol / 2) * dt + Vol * sqrt(dt) * norm(K, i));
     }
     double WT = (log(Juga) - log(S) - (R - Q - 0.5 * Vol * Vol) * T) / Vol;
     double WT_delta = WT / (S * Vol * T);
     double WT_gamma = (1 / (Vol * T * S * S)) * (WT * WT / (Vol * T) - WT - 1 / Vol);

     double price = exp(-R * T) * fmax(Juga - X, 0.0);
     sum.price = sum.price + price;
     sum.delta = sum.delta + price * WT_delta;
     sum.gamma = sum.gamma + price * WT_gamma;
  }
  r[tid].price = sum.price / Simul_No;
  r[tid].delta = sum.delta / Simul_No;
  r[tid].gamma = sum.gamma / Simul_No;
}
// Real price = Sum(r.price) / (N*M)
Step2 Parallel Memory Map

     The normal random numbers are stored in global memory.

     __device__ float *normal(int k, int j, int size_j, float *normals) {

         int index = k*size_j + j;   // path k, time step j

         return &normals[index];
     }
Step2 Sketch (host part)
#include <stdio.h>

__global__ void RNG_compute1(parameter);
__global__ void RNG_compute2(parameter);
__global__ void Malliavin_compute(parameter);

int main() {
    malloc(); //cpu malloc
    cudaMalloc(); //GPU malloc
    cudaMemcpy(); // transfer

    RNG_compute1<<<N,M>>>(parameter); // generate RNG (uniform)

    RNG_compute2<<<N,M>>>(parameter); // generate RNG (Box-Muller, Moro)

    Malliavin_compute<<<N,M>>>(parameter); // simulation

    cudaMemcpy(); //get results

    return 0;
}

Step2 Malliavin Monte Carlo CUDA language sketch (rng part1)

__global__ static void RNG_rand48_get_int(uint2 *state, int *res, int num_blocks, uint2 A, uint2 C)
{
    const int nThreads = blockDim.x*gridDim.x;

    int nOutIdx = threadIdx.x + blockIdx.x*blockDim.x;
    uint2 lstate = state[nOutIdx];           // per-thread generator state
    int i;
    for (i = 0; i < num_blocks; ++i) {

        res[nOutIdx] = ( lstate.x >> 17 ) | ( lstate.y << 7);
        nOutIdx += nThreads;                 // stride by the total thread count

        lstate = RNG_rand48_iterate_single(lstate, A, C);
    }

    nOutIdx = threadIdx.x + blockIdx.x*blockDim.x;
    state[nOutIdx] = lstate;                 // save the state for the next call
}
Step2 Malliavin Monte Carlo CUDA language sketch (rng part2)

#define BLOCK_SIZE 256   // assumed upper bound for blockDim.x

__global__ void RNG_compute(int *uniform, float *normal, int length) {

    int index = blockDim.x*blockIdx.x + threadIdx.x;

    __shared__ int   s[BLOCK_SIZE];
    __shared__ float s_r[BLOCK_SIZE];

    if (threadIdx.x == 0) {
        for (int i = 0; i < blockDim.x; i++) {
            s[i] = uniform[blockDim.x*blockIdx.x + i];    // load uniform
        }
    }
    __syncthreads();
    s_r[threadIdx.x] = (float) moro(s[threadIdx.x]);      // Moro inversion in parallel
    __syncthreads();

    if (threadIdx.x == 0) {
        for (int i = 0; i < blockDim.x; i++) {
            normal[blockDim.x*blockIdx.x + i] = s_r[i];   // save normal
        }
    }
}

Box-Muller vs Moro Inversion

    __device__ void BoxMuller(float& u1, float& u2){
        float r = sqrtf(-2.0f * logf(u1));
        float phi = 2 * PI * u2;
        u1 = r * __cosf(phi);
        u2 = r * __sinf(phi);
    }
    __device__ float Moro( float u ){
        // constants a1..a4, b1..b4, c1..c9 are omitted here
        float x = u - 0.5f;
        float r;
        if (fabsf(x) < 0.42f) {
             r = x*x;
             r = x * (((a4 * r + a3) * r + a2) * r + a1) / ((((b4 * r + b3) * r + b2) * r + b1) * r + 1);
        } else {
             if (x > 0) r = logf(-logf(1 - u));
             if (x <= 0) r = logf(-logf(u));
             r = c1 + r * (c2 + r * (c3 + r * (c4 + r * (c5 + r * (c6 + r * (c7 + r * (c8 + r * c9)))))));
             if (x <= 0) r = -r;
        }
        return r;
    }
Step3 New approach for Direct LCG and Parallel Memory Map

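      Malliavin_compute() calls rand_new(seed) directly; the generator state stays in registers.

      __device__ double rand_new(unsigned int *next) {
        *next = (a * (*next) + b) % M;           // advance the LCG state in place
        return (double) (*next) * (1.0 / M);     // 1.0 / M: avoids integer division, which would give 0
      }
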
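A direct LCG with register-resident state can be tried out on the CPU first. The sketch below is plain C and rests on assumptions: the constants are the multiplier and increment that appear in the sample `rand()` of the C standard, the modulus 2^31 is a further choice made here, and `lcg_next` and `lcg_seed_for_thread` are hypothetical names. None of these values come from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative LCG constants (assumption, not taken from the slides). */
#define LCG_A 1103515245u
#define LCG_C 12345u
#define LCG_M 2147483648u   /* 2^31 */

/* Advance the generator in place and return a uniform value in [0, 1). */
double lcg_next(uint32_t *state)
{
    /* 64-bit intermediate so the product cannot overflow before the modulo */
    uint64_t next = ((uint64_t)LCG_A * (*state) + LCG_C) % LCG_M;
    *state = (uint32_t)next;
    return (double)next / (double)LCG_M;
}

/* Give each emulated thread its own starting state. */
uint32_t lcg_seed_for_thread(uint32_t base_seed, uint32_t thread_id)
{
    assert(base_seed < LCG_M);
    return (base_seed + thread_id) % LCG_M;
}
```

Each thread only needs one integer of state, so no global memory traffic is required for random numbers. Neighboring seeds of a plain LCG produce correlated streams, so a production code would need a better per-thread seeding scheme than the one shown.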
Platform for finance

         Excel Sheet -> Excel VBA -> C/C++ dll Link -> Socket communication -> Grid Manager -> GPU cluster

         GPU cluster: 12-node system (1/2 for backup), 8 GPUs per node
         12 nodes * 8 GPUs * 448 cores = 43,008 cores (about 43,000)

How to Optimize?

Time Series Analysis
Multivariate Time-series

[Figure: several time series such as S(2), with pairwise correlations such as Corr(1,3)]

How to parallelize with CUDA

[Figure] Inefficient approach for shared memory usage
[Figure] Efficient approach for shared memory usage
         A reduction technique is needed for the mean, etc.
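The reduction needed for the mean can be sketched as a tree reduction. The plain C version below emulates the threads of one block with an inner loop; `tree_reduce_sum` and `mean_by_reduction` are illustrative names, and the input length is assumed to be a power of two.

```c
#include <assert.h>

/* In-place tree reduction over n values, n a power of two.
   The loop over tid stands in for the threads of one block; on the GPU a
   __syncthreads() barrier would separate the rounds. */
double tree_reduce_sum(double *s, int n)
{
    assert(n > 0 && (n & (n - 1)) == 0);          /* power of two only */
    for (int stride = n / 2; stride > 0; stride /= 2) {
        for (int tid = 0; tid < stride; ++tid)    /* active "threads" this round */
            s[tid] += s[tid + stride];
        /* barrier here on the GPU */
    }
    return s[0];
}

/* Mean of n values, computed from the reduced sum. */
double mean_by_reduction(double *s, int n)
{
    return tree_reduce_sum(s, n) / n;
}
```

The sum of n values takes log2(n) rounds instead of n - 1 sequential additions, and the array `s` plays the role of the shared memory buffer.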
[Figure] Good method for multi-GPU and large clusters
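In pair(I,J), Pearson Correlation Coefficient

\[ \rho = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} \]

\[ \sum_i (x_i - \bar{x})(y_i - \bar{y}) = \sum_i x_i y_i - \bar{x}\sum_i y_i - \bar{y}\sum_i x_i + n\bar{x}\bar{y} = \sum_i x_i y_i - n\bar{x}\bar{y} \]

\[ \sum_i (x_i - \bar{x})^2 = \sum_i x_i^2 - n\bar{x}^2 \]

\[ \sum_i (y_i - \bar{y})^2 = \sum_i y_i^2 - n\bar{y}^2 \]

\[ \rho = \frac{\sum_i x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum_i x_i^2 - n\bar{x}^2\right)\left(\sum_i y_i^2 - n\bar{y}^2\right)}} \]

         We can parallelize the summations!
         After the summations, find the means.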
How to parallelize with CUDA : Flow Chart

Method 1
  Input A, B
  Find the mean of A, B
    start to sum (Ai), (Bi)
  Find Cov(A,B), Cov(A,A), Cov(B,B)
    start to sum (Ai*Bi), (Ai^2), (Bi^2) with the mean
  Benefit : easy to implement with two functions

Method 2
  Input A, B
  Start to sum (Ai), (Bi), (Ai*Bi), (Ai^2), (Bi^2)
  Find the mean of A, B
  Find Cov(A,B), Cov(A,A), Cov(B,B)
  Benefit : one-shot sum (speed-up)

                      R+CUDA project

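Method 2 can be sketched in plain C as a single pass that accumulates the five raw sums and derives the means and centered sums afterwards. The struct and function names are assumptions made for this illustration. The final coefficient is `sxy / sqrt(sxx * syy)`; it is left out so that the sketch needs no math library.

```c
#include <assert.h>

/* Centered sums needed for the Pearson coefficient of two series. */
typedef struct {
    double sxy;  /* sum (x - mean_x)(y - mean_y) */
    double sxx;  /* sum (x - mean_x)^2 */
    double syy;  /* sum (y - mean_y)^2 */
} CenteredSums;

/* Method 2: accumulate all five raw sums in one pass, then derive
   the means and the centered sums from them. */
CenteredSums one_pass_sums(const double *x, const double *y, int n)
{
    assert(n > 0);
    double sx = 0, sy = 0, sxy = 0, sx2 = 0, sy2 = 0;
    for (int k = 0; k < n; ++k) {     /* the loop a parallel reduction replaces */
        sx  += x[k];
        sy  += y[k];
        sxy += x[k] * y[k];
        sx2 += x[k] * x[k];
        sy2 += y[k] * y[k];
    }
    double mx = sx / n, my = sy / n;
    CenteredSums c;
    c.sxy = sxy - n * mx * my;
    c.sxx = sx2 - n * mx * mx;
    c.syy = sy2 - n * my * my;
    return c;
}
```

The data is read only once, which is the speed-up named above. The price is numerical: subtracting two large, nearly equal sums can lose precision, so Method 1 may be the safer choice for series with a large mean and a small variance.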
In pair(i,j), Pearson Correlation Coefficient

 FOR (i, j) pair : serial
       FOR k (time-series) : parallel
         compute the terms x_k, y_k, x_k*y_k, x_k^2, y_k^2

       reduction for results
         compute mean(i), mean(j)
         compute cov(i,j), cov(i,i), cov(j,j)
         compute corr(i,j)

In pair(i,j), Pearson Correlation Coefficient (with shared memory)

 FOR (i, j) pair : serial
       FOR k (time-series) : parallel
         FOR shared memory
           compute the terms x_k, y_k, x_k*y_k, x_k^2, y_k^2

       reduction for results
         compute mean(i), mean(j)
         compute cov(i,j), cov(i,i), cov(j,j)
         compute corr(i,j)

How to Optimize?

Focus on Optimization
System Optimization Tips before
   starting CUDA programming
Remote Use
[Figure: VNC protocol. The client receives a snapshot of the real screen; the remote host keeps running the WDDM driver and CUDA.]

[Figure: RDP protocol. The RDP driver on the remote host intercepts the CUDA driver.]

TCC driver

[Figure: remote host running a CUDA application with WDM and WDDM drivers; an RDP driver serves the client over the RDP protocol]

               • Remote Desktop
               • Kernel launch overhead
               • GPU exclusive mode
               • Windows service for Session 0/1
               • Large single malloc
GPU exclusive Mode for multiuser

[Figure: a GPU pool (GPU 0 to GPU 3) shared among remote users]

GPUDirect for GPU cluster
[Figure: MPI communication. Without sharing: system pinned memory plus system pageable memory. With GPUDirect: a shared system pinned memory.]

FBO on SDI with Quadro

[Figure: two block diagrams, each with a third-party SDI input, CPU, memory controller, system memory, an NVIDIA graphics card (GPU, video memory, DVI) and NVIDIA SDI output.]
         Left: write to host memory, then write to GPU memory
         Right: direct write to an OpenGL Frame Buffer Object

Conceptual Tips for
CUDA optimization
SIMT architecture
Single Instruction Multiple Threads
Abstract overview of CPU Core

                                                                                  CPU    Memory Bank

                      inst. fetch

                                                            Register Bank

                      inst. cache

        FPU                         ALU                           LSU
                                                                            data cache

Threads on Multicore CPU

 Core                               CPU   General Programming


 Core                               CPU
                                           Winthread, pthread

                      TH     TH

 Core                               CPU

                      TH     TH
Threads on Manycore GPU
                                                   H/W Multi Processor vs S/W Active Block


                   TH            TH           TH           TH   TH        TH           TH        TH   TH           TH

                                      Block                                    Block                                    Block


                            TH                      TH               TH                     TH             TH

Overview of WARP schedule

                        <<<0,0>>>              Warp 0

                                 <<<0,31>>>              MP
                      schedule    <<<0,32>>>


                                                        8 SP core

                                                        per threads
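
The warp grouping in this diagram can be sketched in host-side C. This is a minimal illustration, not code from the deck: it assumes the usual warp size of 32 threads, and the helper names are hypothetical. The last helper is a simplified model of the "8 SP core" note: with 8 SP cores per multiprocessor, one warp instruction is spread over several passes.

```c
#include <assert.h>

#define WARP_SIZE 32   /* threads per warp (assumed; true for the GPUs discussed here) */

/* Hypothetical helper: index of the warp that holds a given thread of a block. */
int warp_of_thread(int thread_idx_x)
{
    assert(thread_idx_x >= 0);
    return thread_idx_x / WARP_SIZE;
}

/* Hypothetical helper: position (lane) of the thread inside its warp. */
int lane_of_thread(int thread_idx_x)
{
    assert(thread_idx_x >= 0);
    return thread_idx_x % WARP_SIZE;
}

/* Number of warps a block occupies; a partial warp still counts as a whole one. */
int warps_per_block(int block_dim_x)
{
    assert(block_dim_x > 0);
    return (block_dim_x + WARP_SIZE - 1) / WARP_SIZE;
}

/* Simplified model: passes needed to issue one warp instruction on sp_cores cores. */
int issue_passes(int sp_cores)
{
    assert(sp_cores > 0 && WARP_SIZE % sp_cores == 0);
    return WARP_SIZE / sp_cores;
}
```

Threads 0 to 31 of a block land in warp 0 and thread 32 starts warp 1, which matches the <<<0,31>>> / <<<0,32>>> boundary drawn on the slide.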
Overview of Instruction Fetch

                                                      blockIdx.x=0, threadIdx.x=3
  0           1             2   3     4       5   6    7

                                    Load A, B

FP operation                                                                        cores
                                    Store C

 0          1           2       3    4    5       6    7

Overview of Instruction Fetch

                                                   blockIdx.x=0, threadIdx.x=4
  0           1             2   3   4   5      6      7

                                            Load A, B

FP operation                                                                     cores
                                            Store C

 0          1           2       3   4   5     6       7

Thread schedule within MP : WARP

                                     1024 threads per MP * 30 MPs : 30K threads

                                       Look like

Occupancy Calculator on CUDA SDK
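
What the occupancy calculator estimates can be approximated with a few integer divisions. The sketch below is a simplified model, not the spreadsheet's actual logic: it ignores register and shared-memory allocation granularity, and the per-multiprocessor limits are assumptions based on the published figures for compute capability 1.3 (Tesla C1060) and 2.0 (Tesla C2050). The names mp_limits, CC13, CC20 and occupancy_percent are hypothetical.

```c
#include <assert.h>

/* Per-multiprocessor (MP) resource limits. Assumed values, see the note above. */
typedef struct {
    int max_threads;    /* resident threads per MP */
    int max_blocks;     /* resident blocks per MP */
    int registers;      /* 32-bit registers per MP */
    int shared_bytes;   /* shared memory per MP, in bytes */
} mp_limits;

const mp_limits CC13 = { 1024, 8, 16384, 16384 };   /* e.g. Tesla C1060 */
const mp_limits CC20 = { 1536, 8, 32768, 49152 };   /* e.g. Tesla C2050 (Fermi) */

int min_int(int a, int b) { return a < b ? a : b; }

/* Occupancy in percent: resident warps divided by the maximum resident warps. */
int occupancy_percent(mp_limits mp, int threads_per_block,
                      int regs_per_thread, int shared_per_block)
{
    assert(threads_per_block > 0 && regs_per_thread > 0 && shared_per_block >= 0);

    int max_warps = mp.max_threads / 32;
    int warps_per_block = (threads_per_block + 31) / 32;

    /* The most restrictive resource decides how many blocks fit on one MP. */
    int blocks = mp.max_blocks;
    blocks = min_int(blocks, max_warps / warps_per_block);
    blocks = min_int(blocks, mp.registers / (regs_per_thread * threads_per_block));
    if (shared_per_block > 0)
        blocks = min_int(blocks, mp.shared_bytes / shared_per_block);

    return 100 * blocks * warps_per_block / max_warps;
}
```

With these assumed limits, a 256-thread block using 16 registers per thread and 4 KB of shared memory fills a compute capability 1.3 multiprocessor completely, while doubling the register count to 32 halves the number of resident blocks.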

CUDA profiler on CUDA toolkit

Parallel NSight 1.5 Professional

Memory Hierarchy
Managing Memory

        Host                     Device
                                   DRAM                       Multiprocessor
                                           Local           Multiprocessor
                                          Global                       Registers
         DRAM         Chipset
                                          Memory              Shared Memory

                                                   L2 Cache    L1 Cache



Matrix Size for Best CUBLAS3.2 Performance

         [Chart: Gflops against matrix size]
         SGEMM: Multiples of 96 (96 x 96 to 8448 x 8448)
         DGEMM: Multiples of 64 (64 x 64 to 8384 x 8384)
         cuBLAS 3.2: NVIDIA Tesla C1060, Tesla C2050 (Fermi)
         MKL Quad-Core Intel Xeon 5550, 2.67 GHz

cuBLAS level III

         [Chart: Gflops against matrix size]
         DGEMM: Multiples of 64 (64 x 64 to 3392 x 3392)
         cuBLAS 3.2: NVIDIA Tesla C1060, Tesla C2050 (Fermi)
         MKL Quad-Core Intel Xeon 5550, 2.67 GHz
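
One way to act on these charts is to round a matrix dimension up to the next multiple of 96 (SGEMM) or 64 (DGEMM) and zero-pad the extra rows and columns before calling the library. The C helpers below are a sketch with hypothetical names. Whether the padding pays off depends on the cuBLAS version and the GPU, so treat the tile sizes as what this deck reports for cuBLAS 3.2 rather than as a general rule.

```c
#include <assert.h>
#include <stddef.h>

#define SGEMM_MULTIPLE 96   /* best SGEMM sizes reported on the slide */
#define DGEMM_MULTIPLE 64   /* best DGEMM sizes reported on the slide */

/* Round n up to the next multiple of `multiple` (n itself if already aligned). */
size_t round_up(size_t n, size_t multiple)
{
    assert(multiple > 0);
    return (n + multiple - 1) / multiple * multiple;
}

/* Hypothetical helpers: padded leading dimension for each precision. */
size_t padded_dim_sgemm(size_t n) { return round_up(n, SGEMM_MULTIPLE); }
size_t padded_dim_dgemm(size_t n) { return round_up(n, DGEMM_MULTIPLE); }
```

For a 1000 x 1000 problem this gives 1056 for single precision and 1024 for double precision; the padded area must be filled with zeros so the product of the original sub-matrices is unchanged.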
Roofline Analysis (Arithmetic Intensity)

         Samuel Williams, Andrew Waterman, and David Patterson,
         Roofline: An Insightful Visual Performance Model for Multicore Architectures
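
The roofline bound itself is a single formula: attainable Gflops = min(peak Gflops, memory bandwidth x arithmetic intensity), where arithmetic intensity is flops per byte moved. The C sketch below applies it to single-precision saxpy. The function names are hypothetical, and the hardware numbers in the usage note are illustrative assumptions, not measurements from this deck.

```c
#include <assert.h>

/* Attainable performance under the roofline model, in Gflops.
   peak_gflops:    compute ceiling of the device
   bandwidth_gbs:  memory bandwidth in GB/s
   flops_per_byte: arithmetic intensity of the kernel */
double roofline_gflops(double peak_gflops, double bandwidth_gbs, double flops_per_byte)
{
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0 && flops_per_byte > 0.0);
    double memory_bound = bandwidth_gbs * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

/* Single-precision saxpy, y[i] = a*x[i] + y[i]:
   2 flops per element; 4 bytes read for x, 4 read for y, 4 written for y. */
double saxpy_intensity(void)
{
    return 2.0 / 12.0;
}
```

Assuming a device with a 1030 Gflops single-precision peak and 144 GB/s of global memory bandwidth, roofline_gflops(1030.0, 144.0, saxpy_intensity()) gives about 24 Gflops, so saxpy is limited by memory bandwidth long before it reaches the compute ceiling.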
Perf. on CUDA Application
Gflops                               Bandwidth of PCI-e slot


                              Bandwidth of Global Memory

Tips for Optimization

         Consider an algorithm that parallelizes well (naïve algorithms are often a good fit)
         Consider occupancy (SIMT)
         Consider the memory bottleneck

More Information for CUDA Optimization
         CUDA Zone

         Developer Zone

         GTC 2010 contents

         CUDA Café (Korean CUDA community)


  Hyungon Ryu

[05][CUDA and Fermi optimization techniques] hryu optimization

  • 1. CUDA and Fermi Optimization Techniques Hyungon Ryu, PSG SA
  • 2. Agenda: CUDA 101 (General Overview); Toy FDM example; Monte Carlo Simulation & Time Series Analysis; General CUDA Optimization Tips
  • 4. CUDA 101: CUDA, Graphics (GPU), Parallel Programming
  • 5. CUDA Parallel programming. GPU knowledge: Cg, OpenGL, DirectX. Parallel knowledge: Pthreads / winthreads, MMX, SSE, OpenMP, PVM / MPI. Heterogeneous knowledge: GPU, Parallel DSP, Parallel ASIC, Parallel FPGA, Cell BE. CUDA: Parallel Computing with GPU
  • 6. Parallel Model in OpenMP: CPU Program, Fork, Parallel, Join
  • 7. CUDA parallel Model: CPU Program, Kernel Launch, GPU threads
  • 8. Saxpy Example : CPU serial
        for (int i = 0; i < n; ++i) {
            y[i] = a*x[i] + y[i];
        }
  • 9. Example of Saxpy Parallel : OpenMP
        #pragma omp parallel shared(n, a, x, y)
        #pragma omp for
        for (int i = 0; i < n; ++i) {   /* i is private because it is declared in the loop */
            y[i] = a*x[i] + y[i];
        }
  • 10. Example of Saxpy Parallel : MPI
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* each rank updates only its own slice; end is inclusive */
        for (i = start; i <= end; i++) {
            y[i] = a * x[i] + y[i];
        }

        /* split [lowest, highest] into nprocs nearly equal slices */
        void para_range(int lowest, int highest, int nprocs, int myrank,
                        int *start, int *end) {
            int wk1, wk2;
            wk1 = (highest - lowest + 1) / nprocs;
            wk2 = (highest - lowest + 1) % nprocs;
            *start = myrank * wk1 + lowest + ((myrank < wk2) ? myrank : wk2);
            *end = *start + wk1 - 1;
            if (wk2 > myrank) *end = *end + 1;
        }
  • 11. Example of Saxpy Parallel : SSE
        void saxpy_vector(short *z, short *x, short *y, short a, unsigned n) {
            __m128i *x_ptr = (__m128i *) x;
            __m128i *y_ptr = (__m128i *) y;
            __m128i *z_ptr = (__m128i *) z;
            __m128i a_vec = _mm_set1_epi16(a);   /* broadcast a to all 8 lanes */
            for (unsigned i = 0; i < n/8; ++i) {
                __m128i x_vec = x_ptr[i];
                __m128i y_vec = y_ptr[i];
                __m128i z_vec = _mm_add_epi16(_mm_mullo_epi16(x_vec, a_vec), y_vec);
                z_ptr[i] = z_vec;
            }
        }
  • 12. Saxpy Parallel : CUDA
        { x[i] = a * x[i] + t * y[i]; }
        Saxpy<<<N, M>>>(n, 2.0, x, y);
  • 13. CUDA C extension
        Launch the kernel: Function<<<Grid, Block>>>(parameter);
        Additional C API for memory control: cudaXXX (cudaMalloc, cudaMemcpy), cuXXX (cuMemAlloc, cuMemcpy), cutXXX
        For functions: __global__, __device__, __host__, __device__ __host__
        For memory: __shared__, __device__, reg/loc
        Pre-defined variables: blockDim, blockIdx, threadIdx, cudaMemcpyHostToDevice
        Pre-defined functions: __syncthreads(), __mul24(), etc.
  • 14. Process of CUDA developing
        Serial: Algorithm, serial Programming, Compile, Debugging, Release
        CUDA parallel: Algorithm, serial Programming, Compile, CUDA convert, Profile, Parallelize, Compile, Debugging, Optimize/profile, Release
  • 15. CUDA is Parallel Computing !!!
        Serial: Algorithm, serial Programming, Compile, Debugging, Release
        MPI parallel: Algorithm, serial Programming, Compile, parallel Programming, Profile, Parallelize, Compile, Debugging [totalview], Optimize, Release, MPIrun
        CUDA parallel: Algorithm, serial Programming, Compile, CUDA convert, Profile, Parallelize, Compile, Debugging, Optimize/profile, Release
  • 16. [Diagram: the CPU program launches a kernel; blocks map to SMs and threads map to SPs. A TPC holds a Geometry Controller, an SMC, a Texture Unit with Tex L1, and SMs; each SM has I-Cache, MT Issue, C-Cache, SP cores, SFUs, a DP unit and Shared Memory]
  • 17. Image Processing Diagram without CUDA: CPU, Camera, Total capture time, For Loop start, For Loop End
  • 18. Image Processing Diagram with CUDA: CPU, Camera, Total capture time, Memcpy, For Loop start, CUDA parallel, For Loop End
  • 19. 1D Heat Equation: CUDA Toy for undergraduate student
  • 20. 1D Heat Equation: 1D bar, Heat Source
  • 21. 1D Heat Equation
        1D heat equation: $\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2}$
        Boundary condition: $u(0,t) = u(1,t) = 0$
        Initial condition: $u(x,0) = u_0$
  • 22. Discretization
        $\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2}$ is discretized with a forward difference in time and a second-order difference in space, with indices i (time) and j (space):
        $\frac{u_{j,i+1} - u_{j,i}}{\Delta t} = \frac{u_{j+1,i} - 2u_{j,i} + u_{j-1,i}}{\Delta x^2}$
        Explicit method, with $r = \Delta t / \Delta x^2$:
        $u_{j,i+1} = r\,u_{j-1,i} + (1 - 2r)\,u_{j,i} + r\,u_{j+1,i}$
        u[i+1][j] = r*u[i][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1];
  • 23. Discretization (stencil)
        $u_{j,i+1} = r\,u_{j-1,i} + (1 - 2r)\,u_{j,i} + r\,u_{j+1,i}$
        Stencil weights at time i: r at j+1, (1-2*r) at j, r at j-1; the result is point j at time i+1
  • 24. Conceptual Diagram
        [Diagram: space points 0 to 13 against time 0 to 600, with the initial condition, the boundary condition, and the stencil r, (1-2*r), r at j+1, j, j-1 that updates point j at the next time step]
        u[i+1][j] = r*u[i][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1];
  • 25. Explicit Pseudo Code
        Parameter and data initialization: stencil, boundary/initial condition
        FOR LOOP (time, i)
            FOR LOOP (stencil, j)
                Update the stencil relation
                u[i+1][j] = r*u[i][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1];
        Results
  • 26. CPU code: u[i][j] vs. u[j]
        u[i][j]: easy to develop; possible to visualize the process
        u[j]: efficient use of memory; gets only the last result
  • 27. Main algorithm
        for (i = 0; i < N; i++) {            /* time iteration */
            for (j = 0; j < M; j++) {        /* space iteration */
                /* heat relation */
                u[i+1][j] = r * u[i][j-1] + (1-2*r) * u[i][j] + r * u[i][j+1];
            }
        }
  • 28. Boundary Condition
        Method 1: fixed boundary (copy; points N-1, N)
        Method 2: free boundary (compute, then copy; points N-1, N, N+1)
  • 29. How to Parallelize
        Parallelize the space stencil: each thread (core) is dedicated to one element of the space stencil.
        It is not possible to parallelize the time stencil.
        [Diagram: space points 0 to 13 against time 0 to 600, with the stencil r, (1-2*r), r at j+1, j, j-1]
  • 30. How to parallelize
        Parameter and data initialization: stencil, boundary/initial condition
        FOR LOOP (time, i)
            FOR LOOP (stencil, j)
                Update the stencil relation
        Results
  • 31. Space parallelization: Time Sequence (Time 0, Comm, Time 1, Time 2, Time 3)
  • 32. CPU code: CPU source, Heat Relation, solve, Boundary Condition, result
  • 33. CPU code
        do {
            time += dt;
            printf("aaaaaa %f\n", time);
            for (i = 1; i < mesh+1; i++) {                     /* heat relation */
                temper[i].new_1 = (double) (1-2*r)*temper[i].old + r*(temper[i-1].old + temper[i+1].old);
                printf("\n processing \t %d %f %f \n", i, temper[i].new_1, temper[i].old);
            }
            temper[mesh+1].new_1 = temper[mesh-1].new_1;       /* boundary condition */
            printf("\t print results %d %f %f \n", mesh+1, temper[mesh+1].new_1, temper[mesh-1].new_1);
            for (i = 1; i < mesh+2; i++) {                     /* new values become old values */
                temper[i].old = temper[i].new_1;
                printf("aatt %d %f %f \n", i, temper[i].new_1, temper[i].new_1);
            }
            if ((++count % print_step) == 0) {
                printf("hh ttt %10.5lf", time);
                for (i = 0; i < mesh; i += 2) printf("df %8.4lf", temper[i].new_1);
                if (!(i%2)) printf("fd %8.4lf\n", temper[mesh].new_1);
            }
            printf("\n\n time print %f\n", time);
            getchar();
        } while (time < end_time);
        if ((count % print_step)) {
            printf("bghj %10.5lf", time);
            for (i = 0; i < mesh; i += 2) printf("ahg %8.4lf", temper[i].new_1);
            printf("nhgf %8.4lf\n", temper[mesh].new_1);
        }
  • 34. GPUcode-01 Memory Map
        CPU: source, Heat Relation (solve), Boundary Condition, result
        GPU: source, bypass without solving, result
  • 35. GPUcode-01 Malloc Template
        double *init_GPU_data(struct flow *temper, int mesh) {
            double *u_dev;                                     // for GPU data upload
            size_t gpuMemsize = sizeof(double) * (mesh+2);
            double *u_host;                                    // temporary value
            cudaMalloc((void **)&u_dev, gpuMemsize); cudaErr("malloc u_dev");
            u_host = (double *) malloc(gpuMemsize);
            for (int i = 0; i < mesh+2; i++) {                 // fill all mesh+2 elements that are copied below
                u_host[i] = temper[i].old;
                printf("before %d : data initial :u_host[%d]= %f temper[%d].old =%f\n", i, i, u_host[i], i, temper[i].old);
            }
            cudaMemcpy(u_dev, u_host, gpuMemsize, cudaMemcpyHostToDevice); cudaErr("memcpy u_host to u_dev");
            cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost); cudaErr("memcpy u_dev to u_host");
            for (int i = 0; i < mesh+2; i++) {
                printf("after %d : data initial :u_host[%d]= %f temper[%d].old =%f\n", i, i, u_host[i], i, temper[i].old);
            }
            free(u_host);
            return u_dev;
        }
  • 36. GPUcode-01 Launch Template
        __global__ void functionG(double *u_dev, int meshsize, double r, double bound) {
            int idx = blockIdx.x*blockDim.x + threadIdx.x;
            int i = idx + 1;
            if (idx < meshsize+1) {
                u_dev[i] = i*0.01;                             // test pattern only, no solving yet
            }
            if (idx == 6) {
                u_dev[idx+1] = u_dev[idx-1] = 11;
            }
        }
        void compute_GPU(double *u_dev, double *u_host, double dt, double dx, double r, int mesh,
                         int print_step, int count, double time, double end_time, double bound) {
            size_t gpuMemsize = sizeof(double) * (mesh+2);
            //for (int i = 0; i < 6000; i++) {                 // time step
            functionG<<<4,5>>>(u_dev, mesh, r, bound); cudaErr2("kernel launch", 1, 0);
            cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost); cudaErr("memcpy u_dev to u_host");
            for (int i = 0; i < mesh+1; i++) {
                printf(" in kernel - GPU : temper[%d] ==> %f \n", i, u_host[i]);
            }
            //}
            return;
        }
  • 37. GPUcode-02 Solving
        CPU: source, Heat Relation (solve), Boundary Condition, result
        GPU: source, Heat Relation (solve), Boundary Condition, result
  • 38. GPUcode-02 Malloc part
        double *init_GPU_data(struct flow *temper, int mesh) {
            double *u_dev;                                     // for GPU data upload
            size_t gpuMemsize = sizeof(double) * (mesh+2);
            double *u_host;                                    // temporary value
            cudaMalloc((void **)&u_dev, gpuMemsize); cudaErr("malloc u_dev");
            u_host = (double *) malloc(gpuMemsize);
            for (int i = 0; i < mesh+2; i++) {                 // fill all mesh+2 elements that are copied below
                u_host[i] = temper[i].old;
                printf("before %d : data initial :u_host[%d]= %f temper[%d].old =%f\n", i, i, u_host[i], i, temper[i].old);
            }
            cudaMemcpy(u_dev, u_host, gpuMemsize, cudaMemcpyHostToDevice); cudaErr("memcpy u_host to u_dev");
            cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost); cudaErr("memcpy u_dev to u_host");
            for (int i = 0; i < mesh+2; i++) {
                printf("after %d : data initial :u_host[%d]= %f temper[%d].old =%f\n", i, i, u_host[i], i, temper[i].old);
            }
            free(u_host);
            return u_dev;
        }
  • 39. GPUcode-02 Launch part
        __global__ void functionG(double *u_dev, int meshsize, double r, double bound) {
            int idx = blockIdx.x*blockDim.x + threadIdx.x;
            int i = idx + 1;
            if (idx < meshsize) {                              // i runs over 1..meshsize so that u_dev[i+1] stays in range
                // in-place update: a neighbour may already have been overwritten by another thread
                u_dev[i] = (double) (1-2*r)*u_dev[i] + r*(u_dev[i-1] + u_dev[i+1]);
                // u_dev[i] = i*0.01;
            }
            if (idx == 6) {
                u_dev[idx+1] = u_dev[idx-1] = 11;
            }
        }
        void compute_GPU(double *u_dev, double *u_host, double dt, double dx, double r, int mesh,
                         int print_step, int count, double time, double end_time, double bound) {
            size_t gpuMemsize = sizeof(double) * (mesh+2);
            //for (int i = 0; i < 6000; i++) {                 // time step
            functionG<<<4,5>>>(u_dev, mesh, r, bound); cudaErr2("kernel launch", 1, 0);
            cudaMemcpy(u_host, u_dev, gpuMemsize, cudaMemcpyDeviceToHost); cudaErr("memcpy u_dev to u_host");
            for (int i = 0; i < mesh+1; i++) {
                printf(" in kernel - GPU : temper[%d] ==> %f \n", i, u_host[i]);
            }
            //}
            return;
        }
  • 40. How to Optimize?
  • 42. Computation Time: Malliavin MC results
        type         Black-Scholes   5000      10000     50000     100000    200000    300000
        Price        12.8216         12.8706   12.8721   12.8413   12.8525   12.8559   12.8480
        Err          0.000%          0.38%     0.39%     0.15%     0.24%     0.27%     0.21%
        Delta        0.5858          0.5724    0.5887    0.5826    0.5826    0.5829    0.5824
        Err          0               2.28%     -0.51%    0.53%     0.53%     0.49%     0.58%
        Gamma        0.0130          0.0112    0.0134    0.0127    0.0127    0.0127    0.0127
        Err          0.00%           13.39%    -3.34%    2.07%     2.26%     2.21%     2.47%
        Total Time   0:00            00:03     00:05     00:25     00:50     01:44     02:33
        time : 100 sec; target1 : > 1 sec (100X); Target2 : >0.001 sec (2000X)
  • 43. Monte Carlo Simulation for Finance: Excel Sheet, Excel VBA, C/C++ dll Link, CUDA Acceleration
  • 44. Monte Carlo Code with VBA
        Function MC(S As Double, X As Double, T As Double, R As Double, _
                    Vol As Double, Q As Double, No As Double, rtype As String) As Double
            Simul_No = No
            dt = 1 'dt = 1 / 365
            For K = 0 To Simul_No - 1
                Juga = S 'Juga: stock price
                For i = 0 To MaxStep - 1
                    Juga = Juga * Exp((R - Q - Vol ^ 2 / 2) * dt + Vol * Sqr(dt) * MakeNorsD())
                Next
                price = Exp(-R * T) * Max(Juga - X, 0)
                sum_price = sum_price + price
            Next
            MC = sum_price / Simul_No
        End Function
  • 45. Malliavin Greeks
        Greek computation for Monte Carlo simulation (finite difference): $\frac{\partial}{\partial s} E[f(S)] \approx \frac{E[f(S_0 + \Delta S)] - E[f(S_0)]}{\Delta S}$
        Malliavin approach, with Malliavin weights $W(S)$: $\frac{\partial}{\partial s} E[f(S)] = E[f(S)\,W(S)]$
        With the Malliavin approach, we can save computation time.
  • 46. Problem
        To compare the accuracy, we compute the Price, Delta and Gamma of a vanilla call option.
        Approach: 1. Closed-form solution (VBA, C) 2. Monte Carlo (VBA, C) 3. Malliavin (VBA, C, CUDA v1, v2)
  • 47. Malliavin Monte Carlo Code with VBA
        Function Malliavin(S As Double, X As Double, T As Double, R As Double, _
                           Vol As Double, Q As Double, No As Double, rtype As String) As Double
            Simul_No = No
            dt = 1 'dt = 1 / 365
            For K = 0 To Simul_No - 1
                Juga = S 'Juga: stock price
                For i = 0 To MaxStep - 1
                    Juga = Juga * Exp((R - Q - Vol ^ 2 / 2) * dt + Vol * Sqr(dt) * MakeNorsD())
                Next
                WT = (Log(Juga) - Log(S) - (R - Q - 1 / 2 * Vol ^ 2) * T) / Vol
                WT_delta = (WT / (S * Vol * T))
                WT_gamma = (1 / (Vol * T * S ^ 2)) * (WT ^ 2 / (Vol * T) - WT - 1 / Vol)
                price = Exp(-R * T) * Max(Juga - X, 0)
                delta = Exp(-R * T) * Max(Juga - X, 0) * WT_delta
                gamma = Exp(-R * T) * Max(Juga - X, 0) * WT_gamma
                sum_price = sum_price + price
                sum_delta = sum_delta + delta
                sum_gamma = sum_gamma + gamma
            Next
            Malliavin = sum_delta / Simul_No
        End Function
  • 48. Step1 Malliavin Monte Carlo C language sketch
        void Malliavin(double S, double X, double T, double R, double Vol, double Q, long No) {
            long Simul_No = No;
            double dt = 1;   // dt = 1 / 365
            for (long K = 0; K < Simul_No; K++) {
                Juga = S;
                for (int i = 0; i < MaxStep; i++) {
                    // Vol*Vol instead of Vol ^ 2: in C, ^ is XOR
                    Juga = Juga * exp((R - Q - Vol*Vol / 2) * dt + Vol * sqrt(dt) * norm());   // rand with Box-Muller
                }
                // 0.5 instead of 1 / 2: integer division would give 0
                WT = (log(Juga) - log(S) - (R - Q - 0.5 * Vol*Vol) * T) / Vol;
                WT_delta = (WT / (S * Vol * T));
                WT_gamma = (1 / (Vol * T * S*S)) * (WT*WT / (Vol * T) - WT - 1 / Vol);
                price = exp(-R * T) * fmax(Juga - X, 0);
                delta = exp(-R * T) * fmax(Juga - X, 0) * WT_delta;
                gamma = exp(-R * T) * fmax(Juga - X, 0) * WT_gamma;
                sum.price = sum.price + price;
                sum.delta = sum.delta + delta;
                sum.gamma = sum.gamma + gamma;
            }
            r.price = sum.price / Simul_No;
            r.delta = sum.delta / Simul_No;
            r.gamma = sum.gamma / Simul_No;
        }
  • 49. Step2 Sketch (kernel part)
        __global__ void Malliavin_compute(double S, double X, double T, double R, double Vol, double Q, long No) {
            long Simul_No = No;   // per thread: Simul_No = Total Sim / (N threads * M blocks)
            double dt = 1;        // dt = 1 / 365
            for (long K = 0; K < Simul_No; K++) {
                Juga = S;
                for (int i = 0; i < MaxStep; i++) {
                    Juga = Juga * exp((R - Q - Vol*Vol / 2) * dt + Vol * sqrt(dt) * norm(K, i));
                }
                WT = (log(Juga) - log(S) - (R - Q - 0.5 * Vol*Vol) * T) / Vol;
                WT_delta = (WT / (S * Vol * T));
                WT_gamma = (1 / (Vol * T * S*S)) * (WT*WT / (Vol * T) - WT - 1 / Vol);
                price = exp(-R * T) * fmax(Juga - X, 0);
                delta = exp(-R * T) * fmax(Juga - X, 0) * WT_delta;
                gamma = exp(-R * T) * fmax(Juga - X, 0) * WT_gamma;
                sum.price = sum.price + price;
                sum.delta = sum.delta + delta;
                sum.gamma = sum.gamma + gamma;
            }
            r.price = sum.price / Simul_No;   // real price = Sum(r.price) / (N*M)
            r.delta = sum.delta / Simul_No;
            r.gamma = sum.gamma / Simul_No;
        }
  • 50. Step2 Parallel Memory Map
        Parallel cores: RNG_compute1(), RNG_compute2(), normal(), Malliavin_compute()
        Data in global memory: uniform, normal, stock, option
        __device__ float *normal(int k, int j, int size_j, float *normal) {
            int index = k*size_j + j;
            return &normal[index];
        }
  • 51. Step2 Sketch (host part)
        #include <stdio.h>
        __global__ void RNG_compute1(parameter);
        __global__ void RNG_compute2(parameter);
        __global__ void Malliavin_compute(parameter);
        int main() {
            malloc();                                   // CPU malloc
            cudaMalloc();                               // GPU malloc
            cudaMemcpy();                               // transfer
            RNG_compute1<<<N,M>>>(parameter);           // generate RNG (uniform)
            RNG_compute2<<<N,M>>>(parameter);           // generate RNG (BM, Moro)
            Malliavin_compute<<<N,M>>>(parameter);      // simulation
            cudaMemcpy();                               // get results
            return 0;
        }
  • 52. Step2 Malliavin Monte Carlo CUDA language sketch (rng part1)
        __global__ static void RNG_rand48_get_int(uint2 *state, int *res, int num_blocks, uint2 A, uint2 C) {
            const int nThreads = blockDim.x*gridDim.x;
            int nOutIdx = threadIdx.x + blockIdx.x*blockDim.x;
            uint2 lstate = state[nOutIdx];
            int i;
            for (i = 0; i < num_blocks; ++i) {
                res[nOutIdx] = (lstate.x >> 17) | (lstate.y << 7);
                nOutIdx += nThreads;
                lstate = RNG_rand48_iterate_single(lstate, A, C);
            }
            nOutIdx = threadIdx.x + blockIdx.x*blockDim.x;
            state[nOutIdx] = lstate;
        }
  • 53. Step2 Malliavin Monte Carlo CUDA language sketch (rng part2)
        __global__ void RNG_compute(int *uniform, float *normal, int length) {
            int index = blockDim.x*blockIdx.x + threadIdx.x;
            // BLOCK_SIZE: assumed compile-time constant equal to blockDim.x (the slide wrote s[i])
            __shared__ int s[BLOCK_SIZE];
            __shared__ float s_r[BLOCK_SIZE];
            if (threadIdx.x == 0) {
                for (int i = 0; i < blockDim.x; i++) {
                    s[i] = uniform[blockDim.x*blockIdx.x + i];        // load uniform
                }
            }
            __syncthreads();                                          // wait until s[] is loaded
            s_r[threadIdx.x] = (float) moro(s[threadIdx.x]);          // Moro inversion in parallel
            __syncthreads();                                          // wait until s_r[] is complete
            if (threadIdx.x == 0) {
                for (int i = 0; i < blockDim.x; i++) {
                    normal[blockDim.x*blockIdx.x + i] = s_r[i];       // save normal
                }
            }
        }
  • 54. Box-Muller vs Moro Inversion
        __device__ void BoxMuller(float &u1, float &u2) {
            float r = sqrtf(-2.0f * logf(u1));
            float phi = 2 * PI * u2;
            u1 = r * __cosf(phi);
            u2 = r * __sinf(phi);
        }
        __device__ float Moro(float u) {
            // constants a1..a4, b1..b4, c1..c9 are skipped
            float x = u - 0.5f;
            float r;
            if (fabsf(x) < 0.42f) {
                r = x*x;
                r = x * (((a4 * r + a3) * r + a2) * r + a1) / ((((b4 * r + b3) * r + b2) * r + b1) * r + 1);
            } else {
                if (x > 0) r = log(-log(1 - u));
                if (x <= 0) r = log(-log(u));
                r = c1 + r * (c2 + r * (c3 + r * (c4 + r * (c5 + r * (c6 + r * (c7 + r * (c8 + r * c9)))))));
                if (x <= 0) r = -r;
            }
            return r;
        }
  • 55. Step3 New approach for Direct LCG and Parallel Memory Map
        Parallel cores: Malliavin_compute() with rand_new(seed) for uniform and moro(u) for normal
        Data in registers: uniform, normal, stock, option
        double rand_new(int next) {
            next = (a * next + b) % M;
            return (double) next / M;   // (1/M) in integer arithmetic would be 0
        }
  • 56. Platform for finance: Excel Sheet, Excel VBA, C/C++ dll Link, Socket communication, Grid Manager, GPU cluster. 12-node system (1/2 for backup), 8 GPUs per node: 12 * 8 GPUs * 448 cores = about 43,000 cores
  • 57. How to Optimize?
  • 59. Multivariate Time-series: m; S(1), S(2), S(3), ..., S(n); Corr(1,2), Corr(1,3), Corr(2,3), ..., Corr(1,n)
  • 60. How to parallelize with CUDA: serialize S(1), S(2), S(3), ..., S(n). Inefficient approach for shared memory usage.
  • 61. How to parallelize with CUDA: S(1), S(2), S(3), ..., S(n), serialize. Efficient approach for shared memory usage. Needs a reduction technique for the mean etc.
  • 62. How to parallelize with CUDA: S(1), S(2), S(3), ..., S(n), serialize; parallelize with multi-node, multi-GPU. Good method for multi-GPU and large clusters.
  • 63. In pair(I,J), Pearson Correlation Coefficient   ( x  x )( y  y ) i i  (x  x )  ( y  y) i 2 i 2 (x i  x )( yi  y )   xi yi  x  yi  y  xi   xy   xi yi  nxy (x  x )   x i 2 i 2  nx 2 ( y  y)   y i 2 i 2  ny 2  x y i i  nxy ( xi 2  nx 2 )( yi 2  ny 2 ) We can parallelize the Sumation !! After sumation, find the mean. NVIDIA Confidential
  • 64. How to parallelize with CUDA : Flow Chart Method 1 Method 2 Input A, B Input A, B Find the mean of A, B Start to sum (Ai), (Bi), (Ai,Bi), (Ai^2), (Bi^2) start to sum (Ai), (Bi) Find the mean of A, B Find Cov(A,B), Cov(A,A), Cov(B,B) start to sum (Ai,Bi), (Ai^2), (Bi^2) with mean Find Cov(A,B), Cov(A,A), Cov(B,B) Benefit : Easy to implementation Benefit : oneshot sum (speed-up) with two function R+CUDA project NVIDIA Confidential
  • 65. For pair (i, j): Pearson correlation coefficient
        FOR each (i, j) pair: serial
            FOR k (time series): parallel
                compute $\sum x_i$, $\sum y_i$, $\sum x_i y_i$, $\sum x_i^2$, $\sum y_i^2$
            reduction for results
            compute mean(i), mean(j)
            compute cov(i,j), cov(i,i), cov(j,j)
            compute corr(i,j)
  • 66. For pair (i, j): Pearson correlation coefficient
        FOR each (i, j) pair: serial
            FOR k (time series): parallel
                FOR shared memory
                    compute $\sum x_i$, $\sum y_i$, $\sum x_i y_i$, $\sum x_i^2$, $\sum y_i^2$
            reduction for results
            compute mean(i), mean(j)
            compute cov(i,j), cov(i,i), cov(j,j)
            compute corr(i,j)
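The "reduction for results" step is commonly implemented as a pairwise (tree) reduction over per-thread partial sums held in shared memory. The host-side C sketch below mimics that access pattern serially. The name tree_reduce and the power-of-two length requirement are assumptions of this sketch, not details from the slides.

```c
#include <stddef.h>

/* Pairwise (tree) reduction in place. n must be a power of two, n >= 1.
   Each outer step halves the active range, as a CUDA block would do
   between __syncthreads() barriers; the inner loop stands in for the
   threads that would run in parallel. The total ends up in buf[0]. */
double tree_reduce(double *buf, size_t n) {
    for (size_t stride = n / 2; stride > 0; stride /= 2) {
        for (size_t t = 0; t < stride; ++t) {
            buf[t] += buf[t + stride];
        }
    }
    return buf[0];
}
```

With n partial sums the tree needs log2(n) steps instead of n - 1 serial additions, which is why each of the five sums above is usually reduced this way inside a block.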
  • 67. How to Optimize?
  • 69. System optimization tips before starting CUDA programming
  • 70. Remote use
    VNC protocol: the client receives a snapshot of the real screen on the remote host; the WDDM driver stays in place, so CUDA is enabled.
    RDP protocol: the RDP driver intercepts the CUDA (WDDM) driver on the remote host, so CUDA is disabled.
  • 71. TCC driver. Diagram: on the remote host the CUDA application reaches the GPU through the TCC driver while the RDP driver handles the session with the client, so CUDA stays enabled under RDP. Topics covered by the TCC driver:
        • Remote Desktop
        • Kernel launch overhead
        • GPU exclusive mode
        • Windows service for Session 0/1
        • Large single malloc
  • 72. GPU exclusive mode for multiple users. Diagram: remote users connect to a server whose GPU pool contains GPU 0, GPU 1, GPU 2, and GPU 3.
  • 73. GPUDirect for GPU clusters. Diagram: MPI communication with system pageable memory and two system pinned memory regions, compared with a single shared system pinned memory region.
  • 74. FBO on SDI with Quadro. Two data paths are compared. Components in both: third-party SDI input, CPU, memory controller, system memory, NVIDIA graphics card (GPU, video memory, DVI), NVIDIA SDI output.
    Path 1: write to host memory, then write to GPU memory.
    Path 2: direct write to an OpenGL Frame Buffer Object.
  • 75. Conceptual Tips for CUDA optimization
  • 77. Abstract overview of a CPU core. Diagram: CPU memory bank, instruction fetch, L1 instruction cache, register bank (RAX, RBX, RCX, RDX), FPU, ALU, LSU, L1 data cache.
  • 78. Threads on a multicore CPU. Three cases are drawn, each showing a CPU core, its registers (reg), and its threads (TH): general programming, Winthread/pthread, and Hyper-threading.
  • 79. Threads on a manycore GPU: H/W multiprocessor vs. S/W active block. Diagram: each MP holds SPs, registers (reg), and many threads (TH), with several active blocks resident per MP.
  • 80. Overview of warp scheduling. Threads <<<0,0>>>, <<<0,1>>>, ..., <<<0,31>>> form warp 0; threads <<<0,32>>>, <<<0,33>>>, <<<0,34>>>, ... follow in the next warp. The MP schedules warp threads onto its 8 SP cores, with registers allocated per thread.
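The grouping on this slide follows from integer arithmetic on the thread index, reading <<<b,t>>> as (block b, thread t), which is an interpretation of the diagram. A minimal C sketch for a 1-D block, with hypothetical helper names:

```c
/* Warp size on NVIDIA GPUs. */
#define WARP_SIZE 32

/* Index of the warp that holds a given thread of a 1-D block. */
int warp_of(int thread_idx) { return thread_idx / WARP_SIZE; }

/* Position of the thread inside its warp (its lane). */
int lane_of(int thread_idx) { return thread_idx % WARP_SIZE; }

/* Number of warps a block occupies; the last one may be partly filled. */
int warps_per_block(int block_size) {
    return (block_size + WARP_SIZE - 1) / WARP_SIZE;
}
```

A block size that is not a multiple of 32 still occupies a whole final warp, so a block of 100 threads costs as many warp slots as a block of 128.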
  • 81. Overview of instruction fetch (blockIdx.x=0, threadIdx.x=3). Diagram: threads 0 to 7 mapped onto the FP operation cores 0 to 7; instruction sequence: Load A, B; ADD; Store C.
  • 82. Overview of instruction fetch (blockIdx.x=0, threadIdx.x=4). Diagram: threads 0 to 7 mapped onto the FP operation cores 0 to 7; instruction sequence: Load A, B; ADD; Store C.
  • 83. Thread scheduling within an MP: warps. 1024 threads × 30 MPs = 30K threads, which look like concurrent threads. Diagram: blocks B1 and B2 divided into warps.
  • 84. Occupancy Calculator on CUDA SDK
  • 85. CUDA profiler on CUDA toolkit
  • 86. Parallel NSight 1.5 Professional
  • 88. Managing memory. Host: CPU, chipset, DRAM. Device: GPU DRAM (local memory, global memory), L2 cache, and multiprocessors (registers, shared memory, L1 cache). The diagram orders the levels by bandwidth and size.
  • 89. Matrix size for best cuBLAS 3.2 performance. [Figure: two charts of Gflops vs. matrix size. SGEMM: multiples of 96, from 96 x 96 to 8448 x 8448. DGEMM: multiples of 64, from 64 x 64 to 8384 x 8384. cuBLAS 3.2 on NVIDIA Tesla C1060 and Tesla C2050 (Fermi) vs. MKL on a quad-core Intel Xeon 5550, 2.67 GHz.]
  • 90. cuBLAS level III. [Figure: chart of Gflops vs. matrix size. DGEMM: multiples of 64, from 64 x 64 to 3392 x 3392. cuBLAS 3.2 on NVIDIA Tesla C1060 and Tesla C2050 (Fermi) vs. MKL on a quad-core Intel Xeon 5550, 2.67 GHz.]
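The charts report best cuBLAS 3.2 performance when the matrix dimension is a multiple of 96 (SGEMM) or 64 (DGEMM). One way to exploit that, assumed here rather than prescribed by the slides, is to pad each dimension up to the next multiple and zero-fill the extra rows and columns. The helper name round_up is hypothetical.

```c
#include <stddef.h>

/* Round n up to the next multiple of m (m > 0).
   Use m = 96 for SGEMM and m = 64 for DGEMM, per the charts. */
size_t round_up(size_t n, size_t m) {
    return ((n + m - 1) / m) * m;
}
```

For example, a 1000 x 1000 single-precision problem would be padded to 1056 x 1056, and the same problem in double precision to 1024 x 1024. Whether the extra work pays off depends on the size and should be measured.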
  • 91. Roofline analysis (arithmetic intensity). Samuel Williams, Andrew Waterman, and David Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures"
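  • 92. Performance of a CUDA application. [Figure: roofline plot of Gflops vs. F/B (flops per byte), with one ceiling set by the bandwidth of the PCI-e slot and another by the bandwidth of global memory; a further label reads "Ns".]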
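The roofline bound itself is a one-line formula: attainable performance is the smaller of peak compute and arithmetic intensity times memory bandwidth. A minimal C sketch follows. The function and parameter names are hypothetical, and the caller supplies the hardware figures, since none are given on the slide.

```c
/* Attainable performance (Gflop/s) under the roofline model:
   min(peak compute, arithmetic intensity * memory bandwidth).
   bandwidth_gbs may be the global-memory or the PCI-e bandwidth,
   depending on where the data lives. */
double roofline_gflops(double peak_gflops, double bandwidth_gbs,
                       double flops_per_byte) {
    double memory_bound = bandwidth_gbs * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

Single-precision saxpy does 2 flops per element while moving 12 bytes (load x, load y, store y), so its intensity is about 0.17 flops per byte and it sits on the bandwidth slope for any realistic peak. Evaluating the same kernel against the PCI-e bandwidth instead of the global-memory bandwidth shows how much lower the ceiling is when data must cross the bus.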
  • 93. Tips for optimization
        • Consider an algorithm suited to parallel execution (naïve algorithms can work well)
        • Consider occupancy (SIMT)
        • Consider the memory bottleneck
  • 94. More information for CUDA optimization
        • CUDA Zone
        • Developer Zone
        • GTC 2010 contents
        • 쿠다 카페 (CUDA café in Korea)
  • 95. Thanks Hyungon Ryu