Random Number Generation using OpenCL

신성원, 나정호, 배성호, 김종수
Contents

  • Introduction
  • Theory
  • Result
  • Conclusion
Data Traffic

[Figure: the NAS Parallel Benchmark kernels compared by data traffic — MultiGrid, Integer Sort, Conjugate Gradient, Embarrassingly Parallel, Fast Fourier Transform, and the Data Cube operator.]
Contents

  • Introduction
  • Theory
  • Result
  • Conclusion
Marsaglia Polar Method

Draw random numbers (u, v) uniformly from (-1, 1) until they fall inside the unit circle:

    s = u^2 + v^2 < 1

Then transform them into a pair of Gaussian numbers:

    x = u * sqrt(-2 * ln(s) / s),   y = v * sqrt(-2 * ln(s) / s)
Pseudo code

double spare = 0.0;          /* cached second Gaussian of each pair */
bool spareReady = false;

/* random() is assumed to return a uniform double in [0, 1). */
double getGaussian(double center, double stdDev) {
       if (spareReady) {
              spareReady = false;
              return spare * stdDev + center;
       }
       double u, v, s;
       do {
              u = random() * 2.0 - 1.0;
              v = random() * 2.0 - 1.0;
              s = u * u + v * v;
       } while (s >= 1 || s == 0);
       spare = v * sqrt(-2.0 * log(s) / s);
       spareReady = true;
       return center + stdDev * u * sqrt(-2.0 * log(s) / s);
}
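For reference, a self-contained version of the routine above that compiles and runs as plain C; the uniform01() helper built on rand() is an assumption standing in for the slide's random().

#include <math.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the slide's random(): a uniform double in [0, 1). */
static double uniform01(void) { return rand() / ((double)RAND_MAX + 1.0); }

static double spare;
static bool spareReady = false;

static double getGaussian(double center, double stdDev) {
    if (spareReady) {
        spareReady = false;
        return spare * stdDev + center;
    }
    double u, v, s;
    do {                                   /* rejection loop: stay inside the unit circle */
        u = uniform01() * 2.0 - 1.0;
        v = uniform01() * 2.0 - 1.0;
        s = u * u + v * v;
    } while (s >= 1.0 || s == 0.0);
    spare = v * sqrt(-2.0 * log(s) / s);   /* cache the second Gaussian of the pair */
    spareReady = true;
    return center + stdDev * u * sqrt(-2.0 * log(s) / s);
}

int main(void) {
    for (int i = 0; i < 10; i++)           /* ten samples from N(0, 1) */
        printf("%f\n", getGaussian(0.0, 1.0));
    return 0;
}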
Profiling Result

[Pie chart: random numbers 46%, Gaussian pairs 54%, serial portions 0.01%.]
Mapping Instance to Kernel
Optimization
Increasing memory bandwidth by using coalesced memory access

[Figure: a conceptual 3x4 matrix holding elements 0-B is laid out in memory as the linear sequence 0, 1, 2, ..., B — i.e., in row-wise order.]
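A small sketch of that mapping in C: element (r, c) of the conceptual matrix lives at linear index r * COLS + c. ROWS and COLS match the 3x4 example above.

#include <stdio.h>

#define ROWS 3
#define COLS 4

int main(void) {
    int m[ROWS * COLS];
    /* Fill the matrix so that element (r, c) holds its own linear index. */
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            m[r * COLS + c] = r * COLS + c;
    printf("element (2, 1) sits at linear index %d\n", m[2 * COLS + 1]);  /* prints 9 */
    return 0;
}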
Increasing memory bandwidth by using coalesced memory access

[Figure: Option 1 and Option 2 divide the 3x4 matrix among work items #1-#4 in two different ways; in the coalesced assignment, work items with consecutive IDs access consecutive memory addresses, so the hardware can merge their reads into wide transactions.]
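A sketch in OpenCL C of the two access patterns (not the authors' kernel; the 3x4 shape, kernel names, and summing workload are illustrative). In sum_strided each work item walks its own contiguous chunk, so at every loop step the four work items touch addresses three elements apart; in sum_coalesced consecutive work items read consecutive addresses at every step.

/* Strided: work item gid owns the contiguous chunk m[gid*3 .. gid*3+2].
 * At loop step i the four work items read addresses {i, 3+i, 6+i, 9+i}. */
__kernel void sum_strided(__global const float *m, __global float *out) {
    int gid = get_global_id(0);      /* 0..3 */
    float acc = 0.0f;
    for (int i = 0; i < 3; i++)
        acc += m[gid * 3 + i];
    out[gid] = acc;
}

/* Coalesced: work item gid reads column gid of the row-major 3x4 matrix.
 * At loop step i the four work items read addresses {4i, 4i+1, 4i+2, 4i+3}. */
__kernel void sum_coalesced(__global const float *m, __global float *out) {
    int gid = get_global_id(0);      /* 0..3 */
    float acc = 0.0f;
    for (int i = 0; i < 3; i++)
        acc += m[i * 4 + gid];
    out[gid] = acc;
}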
Lowering memory access latency by using local memory

Unoptimized:

__kernel void EP(...) {
    ...
    for (i = 0; i < NK; i++) {
        ...
        q[l] = q[l] + 1.0;    // hot spot: q[] lives in global memory
        ...
    }
}

Optimized (array q[] fits into local memory):

__kernel void local_EP(...) {
    ...
    lq[] = q[];               // copy q[] into local memory
    for (i = 0; i < NK; i++) {
        ...
        lq[l] = lq[l] + 1.0;  // hot spot now hits fast local memory
        ...
    }
    q[] = lq[];               // copy the results back to global memory
}
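The lq[] = q[] shorthand above elides how the copy is actually written. Below is a compilable sketch of the same pattern (not the authors' kernel: NQ, NK, the histogram-style workload, and the binning of x are assumptions), with barriers around the copy-in and copy-out. Here the local copy starts at zero and is merged back with atomic_add, which stays correct when many work-groups run.

#define NQ 10                 /* assumed size of q[]; small enough for local memory */
#define NK 1024               /* iterations per work item */

/* Each work-group tallies into a local copy lq[], then merges it back. */
__kernel void local_EP(__global const float *x, __global int *q) {
    __local int lq[NQ];
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int lsz = get_local_size(0);

    /* Copy-in: initialize the local copy, a few elements per work item. */
    for (int j = lid; j < NQ; j += lsz)
        lq[j] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Hot spot now updates local memory instead of global memory.
     * x[] is assumed to hold values in [0, 1). */
    for (int i = 0; i < NK; i++) {
        int l = (int)(x[gid * NK + i] * NQ) % NQ;
        atomic_inc(&lq[l]);
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Copy-out: merge the group's tallies back into global q[]. */
    for (int j = lid; j < NQ; j += lsz)
        atomic_add(&q[j], lq[j]);
}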
Exploiting GPU parallelism with an optimal NDRange size

[Figure: the 2^16 independent iterations fit exactly into work-groups of local_work_size = 64.]
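A host-side sketch of that launch, assuming 2^16 work items and that the kernel's arguments are already set; the queue and kernel handles are placeholders.

#include <CL/cl.h>

/* Launch the EP kernel over 2^16 independent work items in groups of 64. */
cl_int launch_ep(cl_command_queue queue, cl_kernel ep_kernel) {
    size_t global_work_size = 1 << 16;   /* 65536 iterations, all independent */
    size_t local_work_size  = 64;        /* divides the global size exactly */
    return clEnqueueNDRangeKernel(queue, ep_kernel,
                                  1,     /* one-dimensional NDRange */
                                  NULL,  /* no global offset */
                                  &global_work_size,
                                  &local_work_size,
                                  0, NULL, NULL);
}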
Contents

  • Introduction
  • Theory
  • Result
  • Conclusion
Machine Specification

                Host                   Compute Device
Processor       2 x Intel Xeon E5520   8 x NVIDIA Tesla C1060
Clock Freq.     2.27 GHz               1296 MHz
Cores per CPU   4                      (N/A)
Cores per GPU   (N/A)                  240
Memory Size     24 GB                  32 GB (8 x 4 GB)
OS              Red Hat 4.4            (N/A)
Result

[Bar chart, log scale: execution time in seconds for CPU, GPU #1, GPU #2, GPU #4, and GPU #8; y-axis runs from 1 to 1000 sec.]
Result

[Bar chart: speedup for GPU #1, GPU #2, GPU #4, and GPU #8; y-axis runs from 0 to 350.]