Random Number Generation using OpenCL

신성원, 나정호, 배성호, 김종수
Contents

  • Introduction
  • Theory
  • Result
  • Conclusion
Data Traffic

[Figure: the NAS Parallel Benchmark kernels compared by data traffic — MultiGrid, Integer Sort, Conjugate Gradient, Embarrassingly Parallel, Fast Fourier Transform, and the Data Cube operator.]
Contents

  • Introduction
  • Theory
  • Result
  • Conclusion
Marsaglia Polar Method

Draw random numbers (u, v) uniformly from (-1, 1) until they fall inside the unit circle:

    s = u^2 + v^2 < 1

Then transform them into a pair of Gaussian numbers:

    x = u * sqrt(-2 * ln(s) / s),   y = v * sqrt(-2 * ln(s) / s)
Pseudo code

double spare = 0.0;          /* cached second Gaussian of each pair */
bool spareReady = false;

/* random() is assumed to return a uniform double in [0, 1). */
double getGaussian(double center, double stdDev) {
       if (spareReady) {
              spareReady = false;
              return spare * stdDev + center;
       }
       double u, v, s;
       do {
              u = random() * 2.0 - 1.0;
              v = random() * 2.0 - 1.0;
              s = u * u + v * v;
       } while (s >= 1 || s == 0);
       spare = v * sqrt(-2.0 * log(s) / s);
       spareReady = true;
       return center + stdDev * u * sqrt(-2.0 * log(s) / s);
}
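For reference, a self-contained version of the routine above that compiles and runs as plain C; the uniform01() helper built on rand() is an assumption standing in for the slide's random().

#include <math.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the slide's random(): a uniform double in [0, 1). */
static double uniform01(void) { return rand() / ((double)RAND_MAX + 1.0); }

static double spare;
static bool spareReady = false;

static double getGaussian(double center, double stdDev) {
    if (spareReady) {
        spareReady = false;
        return spare * stdDev + center;
    }
    double u, v, s;
    do {                                   /* rejection loop: stay inside the unit circle */
        u = uniform01() * 2.0 - 1.0;
        v = uniform01() * 2.0 - 1.0;
        s = u * u + v * v;
    } while (s >= 1.0 || s == 0.0);
    spare = v * sqrt(-2.0 * log(s) / s);   /* cache the second Gaussian of the pair */
    spareReady = true;
    return center + stdDev * u * sqrt(-2.0 * log(s) / s);
}

int main(void) {
    for (int i = 0; i < 10; i++)           /* ten samples from N(0, 1) */
        printf("%f\n", getGaussian(0.0, 1.0));
    return 0;
}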
Profiling Result

[Pie chart: random numbers 46%, Gaussian pairs 54%, serial portions 0.01%.]
Mapping Instance to Kernel
Optimization
Increasing memory bandwidth by using coalesced memory access

[Figure: a conceptual 3x4 matrix holding elements 0-B is laid out in memory as the linear sequence 0, 1, 2, ..., B — i.e., in row-wise order.]
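A small sketch of that mapping in C: element (r, c) of the conceptual matrix lives at linear index r * COLS + c. ROWS and COLS match the 3x4 example above.

#include <stdio.h>

#define ROWS 3
#define COLS 4

int main(void) {
    int m[ROWS * COLS];
    /* Fill the matrix so that element (r, c) holds its own linear index. */
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            m[r * COLS + c] = r * COLS + c;
    printf("element (2, 1) sits at linear index %d\n", m[2 * COLS + 1]);  /* prints 9 */
    return 0;
}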
Increasing memory bandwidth by using coalesced memory access

[Figure: Option 1 and Option 2 divide the 3x4 matrix among work items #1-#4 in two different ways; in the coalesced assignment, work items with consecutive IDs access consecutive memory addresses, so the hardware can merge their reads into wide transactions.]
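A sketch in OpenCL C of the two access patterns (not the authors' kernel; the 3x4 shape, kernel names, and summing workload are illustrative). In sum_strided each work item walks its own contiguous chunk, so at every loop step the four work items touch addresses three elements apart; in sum_coalesced consecutive work items read consecutive addresses at every step.

/* Strided: work item gid owns the contiguous chunk m[gid*3 .. gid*3+2].
 * At loop step i the four work items read addresses {i, 3+i, 6+i, 9+i}. */
__kernel void sum_strided(__global const float *m, __global float *out) {
    int gid = get_global_id(0);      /* 0..3 */
    float acc = 0.0f;
    for (int i = 0; i < 3; i++)
        acc += m[gid * 3 + i];
    out[gid] = acc;
}

/* Coalesced: work item gid reads column gid of the row-major 3x4 matrix.
 * At loop step i the four work items read addresses {4i, 4i+1, 4i+2, 4i+3}. */
__kernel void sum_coalesced(__global const float *m, __global float *out) {
    int gid = get_global_id(0);      /* 0..3 */
    float acc = 0.0f;
    for (int i = 0; i < 3; i++)
        acc += m[i * 4 + gid];
    out[gid] = acc;
}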
Lowering memory access latency by using local memory

Unoptimized:

__kernel void EP(...) {
    ...
    for (i = 0; i < NK; i++) {
        ...
        q[l] = q[l] + 1.0;    // hot spot: q[] lives in global memory
        ...
    }
}

Optimized (array q[] fits into local memory):

__kernel void local_EP(...) {
    ...
    lq[] = q[];               // copy q[] into local memory
    for (i = 0; i < NK; i++) {
        ...
        lq[l] = lq[l] + 1.0;  // hot spot now hits fast local memory
        ...
    }
    q[] = lq[];               // copy the results back to global memory
}
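The lq[] = q[] shorthand above elides how the copy is actually written. Below is a compilable sketch of the same pattern (not the authors' kernel: NQ, NK, the histogram-style workload, and the binning of x are assumptions), with barriers around the copy-in and copy-out. Here the local copy starts at zero and is merged back with atomic_add, which stays correct when many work-groups run.

#define NQ 10                 /* assumed size of q[]; small enough for local memory */
#define NK 1024               /* iterations per work item */

/* Each work-group tallies into a local copy lq[], then merges it back. */
__kernel void local_EP(__global const float *x, __global int *q) {
    __local int lq[NQ];
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int lsz = get_local_size(0);

    /* Copy-in: initialize the local copy, a few elements per work item. */
    for (int j = lid; j < NQ; j += lsz)
        lq[j] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Hot spot now updates local memory instead of global memory.
     * x[] is assumed to hold values in [0, 1). */
    for (int i = 0; i < NK; i++) {
        int l = (int)(x[gid * NK + i] * NQ) % NQ;
        atomic_inc(&lq[l]);
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Copy-out: merge the group's tallies back into global q[]. */
    for (int j = lid; j < NQ; j += lsz)
        atomic_add(&q[j], lq[j]);
}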
Exploiting GPU parallelism with an optimal NDRange size

[Figure: the 2^16 independent iterations fit exactly into work-groups of local_work_size = 64.]
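A host-side sketch of that launch, assuming 2^16 work items and that the kernel's arguments are already set; the queue and kernel handles are placeholders.

#include <CL/cl.h>

/* Launch the EP kernel over 2^16 independent work items in groups of 64. */
cl_int launch_ep(cl_command_queue queue, cl_kernel ep_kernel) {
    size_t global_work_size = 1 << 16;   /* 65536 iterations, all independent */
    size_t local_work_size  = 64;        /* divides the global size exactly */
    return clEnqueueNDRangeKernel(queue, ep_kernel,
                                  1,     /* one-dimensional NDRange */
                                  NULL,  /* no global offset */
                                  &global_work_size,
                                  &local_work_size,
                                  0, NULL, NULL);
}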
Contents

  • Introduction
  • Theory
  • Result
  • Conclusion
Machine Specification

                Host                   Compute Device
Processor       2 x Intel Xeon E5520   8 x NVIDIA Tesla C1060
Clock Freq.     2.27 GHz               1296 MHz
Cores per CPU   4                      (N/A)
Cores per GPU   (N/A)                  240
Memory Size     24 GB                  32 GB (8 x 4 GB)
OS              Red Hat 4.4            (N/A)
Result

[Bar chart, log scale: execution time in seconds for CPU, GPU #1, GPU #2, GPU #4, and GPU #8; y-axis runs from 1 to 1000 sec.]
Result

[Bar chart: speedup for GPU #1, GPU #2, GPU #4, and GPU #8; y-axis runs from 0 to 350.]