2. Agenda
CUDA 101 (General Overview)
Toy of FDM example
Monte Carlo Simulation & Time Series Analysis
General CUDA Optimization Tips
NVIDIA Confidential
8. Saxpy Example : CPU serial
for (int i = 0; i < n; ++i) {
y[i] = a*x[i] + y[i];
}
i
x
y
NVIDIA Confidential
9. Example of Saxpy Parallel : OpenMP
# pragma omp parallel shared (n,a,x,y) private (i)
# pragma omp for
for (int i = 0; i < n; ++i) {
y[i] = a*x[i] + y[i];
}
i
x
y
NVIDIA Confidential
10. Example of Saxpy Parallel : MPI
for ( i = start ; i < end; i++)
{
y[i] = a * x[i] + y[i];
}
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&size);
void para_range(int lowest, int highest,int nprocs, int myrank,
int *start, int *end) {
int wk1, wk2;
wk1 = (highest - lowest + 1) / nprocs;
wk2 = (highest - lowest + 1) % nprocs;
*start = myrank * wk1 + lowest + ( (rank<wk2) ? myrank : wk2);
*end = *start + wk1 - 1;
if(wk2 > rank) *end = *end + 1;
i }
x
y
NVIDIA Confidential
11. Example of Saxpy Parallel : SSE
void saxpy_vector(short *z, short *x, short *y, short a, unsigned n) {
__m128i* x_ptr = (__m128i*) x;
__m128i* y_ptr = (__m128i*) y;
__m128i* z_ptr = (__m128i*) z;
__m128i a_vec = _mm_splat_epi16(a);
int i;
for (i = 0; i < n/8; ++i) {
__m128i x_vec = x_ptr[i];
__m128i y_vec = y_ptr[i];
__m128i z_vec = _mm_add_epi16( _mm_mullo_epi16 x_vec,a_vec),y_vec);
z_ptr[i] = z_vec;
}
}
i
x
y
NVIDIA Confidential
12. Saxpy Parallel : CUDA
{
x[i] = a * x[i] + t * y[i];
}
Saxpy <<<N ,M >>> (n, 2.0, x, y);
i
x
y
NVIDIA Confidential
13. CUDA C extension
Launch the kernel
Function <<< Grid, Block >>> ( parameter);
Additional C standard API for mem control
cudaXXX : cudaMalloc, cudaMemcpy,
cuXXX : cuMalloc, cuMemcpy
cutXXX : cutXXX
For Function
__global__, __device__, __host__, __device__ __host__
For memory
__shared__, __device__, reg/loc
pre-defined variables
blockDim, blockIdx, threadIdx, cudaMemcpyHostToDevice
Pre-defined function
__syncthreads(), __mul24(); etc
NVIDIA Confidential
13
14. Process of CUDA developing
Serial CUDA parallel
Algorithm
Algorithm
serial Programming
serial Programming
Compile
Compile
CUDA convert
Debugging
Profile
Release Parallelize
Compile
Debugging
Optimize/profile
Release
NVIDIA Confidential
14
15. CUDA is Parallel Computing !!!
Serial MPI parallel CUDA parallel
Algorithm Algorithm
Algorithm
serial Programming serial Programming
serial Programming
Compile Compile
Compile
parallel Programming CUDA convert
Debugging
Profile Profile
Release Parallelize Parallelize
Compile Compile
Debugging [totalview] Debugging
Optimize Optimize/profile
Release Release
MPIrun
NVIDIA Confidential
15
24. Conceptual Diagram
Initial Condition
Boundary Condition
space
13
12
11
10 j+1
r
9
8
7 j j
6 (1-2*r)
5
stencil
4
r
3 j-1
u[i+1][j] = r*u[i+1][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1];
2
1 Time 0 Update
0
NVIDIA Confidential Time : 0 600
25. Explicit Pseudo Code
Parameter and data Initialization
Stencil, boundary/initial condition,
FOR LOOP (time, i)
FOR LOOP (stencil, j)
Update the stencil relation
u[i+1][j] = r*u[i+1][j-1] + (1-2*r)*u[i][j] + r*u[i][j+1];
Results
NVIDIA Confidential
26. CPU code
u[i][j] vs. u[j]
u[i][j] easy to develop
Possible to visualize the process
u[j] efficient to use memory
Get the last result
NVIDIA Confidential
27. Main algorithm
for( i =0; i < N; i++) { Time Iteration
for( j =0; j < M; j++){ Space Iteration
u[i+1][j] = r * u[i+1][j-1]
+ (1-2*r) * u[i ][j]
+ r * u[i ][j+1];
}
}
Heat relation
NVIDIA Confidential
29. How to Parallelize
Parallelize the space stencil
Each threads(core) dedicate to each space stencil element
13
12
11 j+1
r
10
9
8 j j
7 (1-2*r)
6
stencil
5
r
4 j-1
3
2 Time 0 Update
1
0
0 600
Time :
NVIDIA Confidential
Not possible to parallelize the time stencil.
30. How to parallelize
Parameter and data Initialization
Stencil, boundary/initial condition,
FOR LOOP (time, i)
FOR LOOP (stencil, j)
Update the stencil relation
Results
NVIDIA Confidential
31. Space parallelization
Time
Sequence
Time 0
Comm
Time 1
Time 2
Time 3
NVIDIA Confidential
32. CPUcode
CPU
source
Heat Relation
solve
Boundary Condition
result
NVIDIA Confidential
43. Monte Carlo Simulation for Finance
Excel Sheet
Excel VBA
C/C++ dll Link
CUDA Acceleration
NVIDIA Confidential
44. Monte Carlo Code with VBA
function MC( S As Double, X As Double, T As Double, R As Double, _
Vol As Double, Q As Double, No As Double, rtype As String) As Double
Simul_No = No
dt = 1 'dt = 1 / 365
For K = 0 To Simul_No - 1
Juga = S
For i = 0 To MaxStep - 1
Juga = Juga * Exp( (R - Q - Vol ^ 2 / 2) * dt + Vol * Sqr(dt) * MakeNorsD() )
Next
price = Exp(-R * T) * Max(Juga - X, 0)
sum_price = sum_price + price
Next
MC = sum_price / Simul_No
End function
NVIDIA Confidential
45. Malliavin Greeks
Greek computation for Monte Carlo simulation
E[ f ( S0 S )] E[ f ( S0 )]
E[ f ( S )]
s S
Malliavin approach
E[ f ( S )] E[ f ( S ) W ( S )]
s
Malliavin weights
With Malliavin approach, we can save the computation time.
NVIDIA Confidential
46. Problem
To compare the accuracy, we compute the Price, Delta and Gamma of Vanilla
Call option.
Approach
1. Closed Form solution (VBA,C)
2. Monte (VBA, C)
3. Malliavin (VBA,C, CUDA v1, v2)
NVIDIA Confidential
47. Malliavin Monte Carlo Code with VBA
function Malliavin( S As Double, X As Double, T As Double, R As Double, _
Vol As Double, Q As Double, No As Double, rtype As String) As Double
Simul_No = No
dt = 1 'dt = 1 / 365
For K = 0 To Simul_No - 1
Juga = S
For i = 0 To MaxStep - 1
Juga = Juga * Exp( (R - Q - Vol ^ 2 / 2) * dt + Vol * Sqr(dt) * MakeNorsD() )
Next
WT = (Log(Juga) - Log(S) - (R - Q - 1 / 2 * Vol ^ 2) * T) / Vol
WT_delta = (WT / (S * Vol * T))
WT_gamma = (1 / (Vol * T * S ^ 2)) * (WT ^ 2 / (Vol * T) - WT - 1 / Vol)
price = Exp(-R * T) * Max(Juga - X, 0)
delta = Exp(-R * T) * Max(Juga - X, 0) * WT_delta
gamma = Exp(-R * T) * Max(Juga - X, 0) * WT_gamma
sum_price = sum_price + price
sum_delta = sum_delta + delta
sum_gamma = sum_gamma + Gamma
Next
Malliavin = sum_delta / Simul_No
NVIDIA Confidential
48. Step1 Malliavin Monte Carlo C language sketch
void Malliavin( double S , double X , double T , double R, double Vol, double Q, long No){
long Simul_No = No;
double dt = 1; // dt = 1 / 365
for ( int K = 0; K< Simul_No - 1 ; K++){
Juga = S ;
for (int i = 0; i< MaxStep - 1 ki++){
Juga = Juga * exp( (R - Q - Vol ^ 2 / 2) * dt + Vol * sqrt(dt) * norm() ); // rand with box muller
}
WT = (Log(Juga) - Log(S) - (R - Q - 1 / 2 * Vol ^ 2) * T) / Vol;
WT_delta = (WT / (S * Vol * T));
WT_gamma = (1 / (Vol * T * S ^ 2)) * (WT ^ 2 / (Vol * T) - WT - 1 / Vol);
price = Exp(-R * T) * max(Juga - X, 0);
delta = Exp(-R * T) * max(Juga - X, 0) * WT_delta;
gamma = Exp(-R * T) * max(Juga - X, 0) * WT_gamma;
sum.price = sum.price + price;
sum.delta = sum.delta + delta;
sum.gamma = sum.gamma + Gamma ;
}
r.price = sum.delta / Simul_No
r.delta = sum.delta / Simul_No
r.gamma = sum.delta / Simul_No
return 0;
NVIDIA Confidential
52. Step2 Malliavin Monte Carlo CUDA language sketch (rng part1)
__global__
static void RNG_rand48_get_int(uint2 *state, int *res, int num_blocks, uint2 A, uint2 C)
{
const int nThreads = blockDim.x*gridDim.x;
int nOutIdx = threadIdx.x + blockIdx.x*blockDim.x;
uint2 lstate = state[nOutIdx];
int i;
for (i = 0; i < num_blocks; ++i) {
res[nOutIdx] = ( lstate.x >> 17 ) | ( lstate.y << 7);
nOutIdx += nThreads;
lstate = RNG_rand48_iterate_single(lstate, A, C);
}
nOutIdx = threadIdx.x + blockIdx.x*blockDim.x;
state[nOutIdx] = lstate;
}
NVIDIA Confidential
53. Step2 Malliavin Monte Carlo CUDA language sketch (rng part2)
Void __global__ RNG_compute( int * uniform , float * normal, int length){
int index = blockDim.x*blockIdx.x+threadIdx.x;
__shared__ int s[i];
__shared__ float s_r[i];
if( threadIdx.x ==0){
for (int i = 0 ; i<blockDim.x ; i++){
s[i]=uniform[blockDim.x*blockIdx.x + i]; // load uniform
}
}
s_r[threadIdx.x] = (float) moro(s[threadIdx.x]); // moro inversion with parallel
if( threadIdx.x ==0){
for (int i = 0 ; i<blockDim.x ; i++){
s[blockDim.x*blockIdx.x+i]= s_r[i] ]; // save normal
}
}
}
NVIDIA Confidential
54. Box-Muller vs Moro Inversion
__device__ void BoxMuller(float& u1, float& u2){
float r = sqrtf(-2.0f * logf(u1));
float phi = 2 * PI * u2;
u1 = r * __cosf(phi);
u2 = r * __sinf(phi);
}
__device__ Moro( float u ){
// skip the const value
x = u - 0.5;
if (abs(x) < 0.42) {
r = x*x;
r = x * (((a4 * r + a3) * r + a2) * r + a1) / ((((b4 * r + b3) * r + b2) * r + b1) * r + 1);
} else{
if (x > 0) r = log(-log(1 - u));
if (x <= 0) r = log(-log(u));
r = c1 + r * (c2 + r * (c3 + r * (c4 + r * (c5 + r * (c6 + r * (c7 + r * (c8 + r * c9)))))));
if (x <= 0) r = -r;
}f
return r;
}
NVIDIA Confidential
55. Step3 New approach for Direct LCG and Parallel Memory Map
Parallel
cores
Malliavin_compute() rand_new(seed)
uniform
moro(u)
normal
stock
option
data in registers
double rand_new( int next){
next = (a * next + b) % M;
return (double) next * (1/M);
}
NVIDIA Confidential
56. Platform for finance
Excel Sheet
Excel VBA
C/C++ dll Link
Socket communication
Grid Manager
GPU cluster
12 nodes system (1/2 for backup)
12*8 GPU*448 core
= 43,000 core
NVIDIA Confidential
8 GPU per node
60. How to parallelize with CUDA
serialize
S(1)
S(2)
S(3)
S(n)
NVIDIA Confidential
Unefficient approach for Shared Memory Usage
61. How to parallelize with CUDA
S(1)
S(2)
serialize
S(3)
S(n)
efficient approach for Shared Memory Usage
NVIDIA Confidential
Need to reduction technique for mean etc.
62. How to parallelize with CUDA
S(1)
S(2)
serialize
Parallelize
S(3)
With
multiNode
multiGPU
S(n)
Good method for multiGPU & large Cluster
NVIDIA Confidential
63. In pair(I,J), Pearson Correlation Coefficient
( x x )( y y )
i i
(x x ) ( y y)
i
2
i
2
(x i x )( yi y ) xi yi x yi y xi xy
xi yi nxy
(x x ) x
i
2
i
2
nx 2
( y y) y
i
2
i
2
ny 2
x y i i nxy
( xi 2 nx 2 )( yi 2 ny 2 )
We can parallelize the Sumation !!
After sumation, find the mean.
NVIDIA Confidential
64. How to parallelize with CUDA : Flow Chart
Method 1 Method 2
Input A, B Input A, B
Find the mean of A, B Start to sum (Ai), (Bi), (Ai,Bi), (Ai^2), (Bi^2)
start to sum (Ai), (Bi)
Find the mean of A, B
Find Cov(A,B), Cov(A,A), Cov(B,B)
start to sum (Ai,Bi), (Ai^2), (Bi^2) with mean Find Cov(A,B), Cov(A,A), Cov(B,B)
Benefit : Easy to implementation Benefit : oneshot sum (speed-up)
with two function
R+CUDA project
http://brainarray.mbni.med.umich.edu/Brainarray/rgpgpu/
NVIDIA Confidential
65. In pair(i,j), Pearson Correlation Coefficient
FOR (i, j) – pair : serial
FOR k (time-series) : parallel
compute x y x y i i i i x
i
2
y
i
2
reduction for results
compute mean(i), mean(j),
compute cov(i,j), cov(i,i) ,cov(j,j)
compute corr(i,j)
NVIDIA Confidential
66. In pair(i,j), Pearson Correlation Coefficient
FOR (i, j) – pair : serial
FOR k (time-series) : parallel
FOR shared memory
compute
x y x y i i i i x
i
2
y
i
2
reduction for results
compute mean(i), mean(j),
compute cov(i,j), cov(i,i) ,cov(j,j)
compute corr(i,j)
NVIDIA Confidential
70. Remote Use
VNC protocol
VNC VNC
snapshot of
real ccreen
WDDM CUDA
enabled
Remote Host Client
RDP protocol
RDP driver RDP
intercept the
CUDA driver
WDDM
CUDA
disabled
Remote Host Client
NVIDIA Confidential
71. TCC driver
CUDA Application
result
GPU
RDP protocol
RDP driver RDP
WDM WDDM CUDA
enabled
Remote Host Client
• Remote Desktop
• Kernel Launch Over Head
• GPU exclusive Mode
• Windows Service for Session0/1
• large single Malloc
NVIDIA Confidential
72. GPU exclusive Mode for multiuser
GPU 3
GPU 2
GPU Pool
GPU 1
Remote users
GPU 0
Remote users
Server
Remote users
NVIDIA Confidential
73. GPUDirect for GPU cluster
MPI Communication
System Pinned Memory System Pinned Memory share
System Pageable Memory
NVIDIA Confidential
74. FBO on SDI with Quadro
Third Party Third Party
SDI Input SDI Input
CPU Memory CPU Memory
Controller NVIDIA Graphics Card Controller NVIDIA Graphics Card
Video DVI Video DVI
Memory Memory
DVI DVI
GPU GPU
System System
Memory Memory
NVIDIA SDI Output NVIDIA SDI Output
DVI DVI
Write to Host memory and to write GPU memory Direct write to OpenGL Frame Buffer Object
NVIDIA Confidential
77. Abstract overview of CPU Core
CPU Memory Bank
inst. fetch
Register Bank
RCX
RDX
RBX
RAX
L1
inst. cache
L1
FPU ALU LSU
data cache
NVIDIA Confidential
78. Threads on Multicore CPU
CPU
Core CPU General Programming
reg
TH
Core CPU
Winthread, pthread
reg
TH TH
Core CPU
Hyper-threading
reg
TH TH
NVIDIA Confidential TH TH
88. Managing Memory
Host Device
GPU
DRAM Multiprocessor
CPU
Local Multiprocessor
Memory
Multiprocessor
Global Registers
DRAM Chipset
Memory Shared Memory
L2 Cache L1 Cache
Bandwidth
NVIDIA Confidential
Size
89. 50
0
150
200
300
350
400
450
500
550
650
100
250
600
Gflops
96 x 96
384 x 384
672 x 672
960 x 960
NVIDIA Confidential
1248 x 1248
1536 x 1536
1824 x 1824
2112 x 2112
2400 x 2400
2688 x 2688
2976 x 2976
3264 x 3264
3552 x 3552
3840 x 3840
4128 x 4128
4416 x 4416
4704 x 4704
4992 x 4992
5280 x 5280
5568 x 5568
5856 x 5856
6144 x 6144
6432 x 6432
6720 x 6720
7008 x 7008
SGEMM: Multiples of 96
7296 x 7296
7584 x 7584
7872 x 7872
8160 x 8160
8448 x 8448
50
0
100
150
250
300
350
200
64 x 64
Gflops
384 x 384
704 x 704
1024 x 1024
1344 x 1344
1664 x 1664
1984 x 1984
2304 x 2304
2624 x 2624
cuBLAS 3.2: NVIDIA Tesla C1060, Tesla C2050 (Fermi)
MKL 10.2.4.32: Quad-Core Intel Xeon 5550, 2.67 GHz
2944 x 2944
3264 x 3264
3584 x 3584
3904 x 3904
Matrix Size for Best CUBLAS3.2 Performance
4224 x 4224
4544 x 4544
4864 x 4864
5184 x 5184
5504 x 5504
5824 x 5824
6144 x 6144
6464 x 6464
6784 x 6784
DGEMM: Multiples of 64
7104 x 7104
7424 x 7424
7744 x 7744
8064 x 8064
8384 x 8384
90. 0
100
150
200
250
300
350
50
64 x 64 Gflops
128 x 128
NVIDIA Confidential
192 x 192
256 x 256
320 x 320
384 x 384
448 x 448
512 x 512
576 x 576
640 x 640
704 x 704
768 x 768
832 x 832
896 x 896
cuBLAS level III
960 x 960
1024 x 1024
1088 x 1088
1152 x 1152
1216 x 1216
1280 x 1280
1344 x 1344
1408 x 1408
DGEMM: Multiples of 64
1472 x 1472
1536 x 1536
1600 x 1600
1664 x 1664
1728 x 1728
1792 x 1792
1856 x 1856
1920 x 1920
1984 x 1984
2048 x 2048
2112 x 2112
2176 x 2176
2240 x 2240
cuBLAS 3.2: NVIDIA Tesla C1060, Tesla C2050 (Fermi)
MKL 10.2.4.32: Quad-Core Intel Xeon 5550, 2.67 GHz
2304 x 2304
2368 x 2368
2432 x 2432
2496 x 2496
2560 x 2560
2624 x 2624
2688 x 2688
2752 x 2752
2816 x 2816
2880 x 2880
2944 x 2944
3008 x 3008
3072 x 3072
3136 x 3136
3200 x 3200
3264 x 3264
3328 x 3328
3392 x 3392
91. Roofline Analysis (Arithmetic Intensity)
Samuel Williams, Andrew Waterman, and Dav id Patterson,
Roofline: An Insightful Visual Performance Model for Multicore Architectures
NVIDIA Confidential
92. Perf. on CUDA Application
Gflops Bandwidth of PCI-e slot
Ns
Bandwidth of Global Memory
NVIDIA Confidential
F/B
93. Tips for Optimization
Consider Algorithm for parallel (naï algorithms will be good)
ve
Consider Occupancy ( SIMT)
Consider Memory Bottleneck
NVIDIA Confidential
94. More Information for CUDA Optimization
CUDA Zone
http://www.nvidia.com/CUDA
Developer Zone
http://developer.nvidia.com
GTC 2010 contents
http://www.nvidia.com/gtc2010-content
쿠다 카페 (CUDA café in Korea)
http://cafe.daum.net/KCUG
NVIDIA Confidential