Monte Carlo Simulation
and its Efficient Implementation

                                    Robert Tong
                                  28 January 2010




Experts in numerical algorithms
and HPC services
Outline



• Why use Monte Carlo simulation?
• Higher order methods and convergence
• GPU acceleration
• The need for numerical libraries




                                           2
Why use Monte Carlo methods?



• Essential for high dimensional problems – many degrees of freedom
• For applications with uncertainty in inputs
• In finance
  • Important in risk modelling
  • Pricing/hedging derivatives




                                                   3
The elements of Monte Carlo simulation



• Derivative pricing
  • Simulate a path of asset values
  • Compute payoff from path
  • Compute option value
• Numerical components
  • Pseudo-random number generator
  • Discretization scheme




                                         4
The demand for ever increasing performance

• In the past
  • Faster solution has been provided by increasing processor speeds
  • Want a quicker solution? Buy a new processor
• Present
  • Multi-core/many-core architectures, without increased processor clock speeds
  • A major challenge for existing numerical algorithms
• The escalator has stopped... or gone into reverse!
  • Existing codes may well run slower on multi-core


                                                                   5
Ways to improve performance in Monte Carlo
simulation
1. Use higher order discretization
2. Keep low order (Euler) discretization –
   • make use of multi-core potential
   • e.g. GPU (Graphics Processing Unit)
3. Use high order discretization on GPU
4. Use quasi-random sequence (Sobol’, …) and Brownian Bridge
5. Implement Sobol’ sequence and Brownian Bridge on GPU


                                                   6
Higher order methods – 1
(work by Kai Zhang, University of Warwick, UK)




                                                 7
Higher order methods – 2




                           8
Numerical example – 1




                        9
Numerical example – 1a




                         10
Numerical example – 1b




                         11
Numerical example – 2a




                         12
Numerical example – 2b




                         13
GPU acceleration




• Retain low order Euler discretization
• Use multi-core GPU architecture to achieve speed-up




                                                        14
The Emergence of GPGPU Computing
• Initially – computation carried out by CPU (scalar, serial execution)
• CPU
  • evolves to add cache, SSE instructions, ...
• GPU
  • added to speed graphics display – driven by gaming needs
  • multi-core, SIMT, limited flexibility
• CPU and GPU move closer
  • CPU becomes multi-core
  • GPU becomes General Purpose (GPGPU) – fully programmable

                                                                  15
Current GPU architecture – e.g. NVIDIA Tesla




                                               16
Tesla – processing power
• SM – Streaming Multiprocessor
  • 8 x SP – Streaming Processor core
  • 2 x Special Function Unit
  • MT – multithreaded instruction fetch and issue unit
  • Instruction cache
  • Constant cache (read only)
  • Shared memory (16 KB, read/write)
• C1060 – adds double precision
  • 30 double precision cores
  • 240 single precision cores



                                                            17
Tesla C1060 memory
(from: M A Martinez-del-Amor et al. (2008) based on E Lindholm et al. (2008))




                                                                                18
Programming GPUs – CUDA and OpenCL

• CUDA (Compute Unified Device Architecture, developed by NVIDIA)
  • Extension of C to enable programming of GPU devices
  • Allows easy management of parallel threads executing on GPU
  • Handles communication with ‘host’ CPU
• OpenCL
  • Standard language for multi-device programming
  • Not tied to a particular company
  • Will open up GPU computing
  • Incorporates elements of CUDA

                                                                19
First step – obtaining and installing CUDA
• FREE download from http://www.nvidia.com/object/cuda_learn.html
• See: Quickstart Guide
• Require:
  • CUDA-capable GPU – GeForce 8, 9, 200, Tesla, many Quadro
  • Recent version of NVIDIA driver
  • CUDA Toolkit – essential components to compile and build applications
  • CUDA SDK – example projects
• Update environment variables (Linux defaults shown)
  • PATH: add /usr/local/cuda/bin
  • LD_LIBRARY_PATH: add /usr/local/cuda/lib
• CUDA compiler nvcc works with gcc (Linux) and MS VC++ (Windows)
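The environment-variable updates above can be made, for example, by adding the following to a shell start-up file (a sketch assuming the default Linux install locations shown on this slide):

```shell
# CUDA Quickstart defaults on Linux (adjust if installed elsewhere)
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
```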



                                                                               20
Host (CPU) – Device (GPU) Relationship

• Application program initiated on Host (CPU)
• Device ‘kernels’ execute on GPU in SIMT (Single Instruction Multiple Thread) manner
• Host program
  • Transfers data from Host memory to Device (GPU) memory
  • Specifies number and organisation of threads on Device
  • Calls Device ‘kernel’ as a C function, passing parameters
  • Copies output from Device back to Host memory


                                                                   21
Organisation of threads on GPU



• SM (Streaming Multiprocessor) manages up to 1024 threads
• Each thread is identified by an index
• Threads execute as warps of 32 threads
• Threads are grouped in blocks (user specifies number of threads per block)
• Blocks make up a grid
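The thread organisation above determines how each thread finds its data. A small host-side sketch in plain C (hypothetical helper names, mirroring CUDA's built-in indices) of how a thread's global index and warp follow from its block and thread indices:

```c
/* Global index of a thread: each block contributes blockDim threads,
   so block blockIdx starts at offset blockIdx * blockDim. */
int global_thread_index(int blockIdx, int blockDim, int threadIdx)
{
    return blockIdx * blockDim + threadIdx;
}

/* Threads execute in warps of 32: warp id of a thread within its block. */
int warp_of(int threadIdx)
{
    return threadIdx / 32;
}
```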



                                                     22
Memory hierarchy


• On device can
  •   Read/write per-thread
      •   Registers
      •   Local memory
  •   Read/write per-block shared memory
  •   Read/write per-grid global memory
  •   Read only per-grid constant memory
• On host (CPU) can
  •   Read/write per-grid
      •   Global memory
      •   Constant memory

                                           23
CUDA terminology


• ‘kernel’ – C function executing on the GPU
  • __global__ declares a function as a kernel
    • Executed on the Device
    • Callable only from the Host
    • void return type
  • __device__ declares a function that is
    • Executed on the Device
    • Callable only from the Device




                                                  24
Application to Monte Carlo simulation
Monte Carlo paths lead to highly parallel algorithms
• Applications in finance, e.g. simulation based on an SDE (return on asset):

      dS_t / S_t = μ dt + σ dW_t     (drift + Brownian motion)

• Requires a fast pseudorandom or quasi-random number generator
• Additional techniques improve efficiency: Brownian Bridge, stratified sampling, …


                                                         25
Random Number Generators:
           choice of algorithm


• Must be highly parallel
• Implementation must satisfy statistical tests of randomness
• Some common generators do not guarantee randomness properties when split into parallel streams
• A suitable choice: MRG32k3a (L’Ecuyer)

                                           26
MRG32k3a: skip ahead

Generator combines 2 recurrences:

      x_{n,1} = (a_1 x_{n-2,1} + b_1 x_{n-3,1}) mod m_1
      x_{n,2} = (a_2 x_{n-1,2} + b_2 x_{n-3,2}) mod m_2

Each recurrence is of the form (M Giles, note on implementation)

      y_n = A y_{n-1},   where   y_n = (x_n, x_{n-1}, x_{n-2})^T

Precompute A^p in O(log p) operations on CPU; then skip ahead by

      y_{n+p} = (A^p y_n) mod m
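The O(log p) precomputation is binary exponentiation of the 3x3 companion matrix. A plain-C sketch (hypothetical helper names; assumes the modulus and matrix entries fit in 32 bits, so products fit in 64):

```c
#include <stdint.h>

#define DIM 3
typedef uint64_t mat[DIM][DIM];

/* c = (a * b) mod m; a temp copy makes aliasing (c == a or b) safe */
static void matmul_mod(mat a, mat b, mat c, uint64_t m)
{
    mat t;
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            uint64_t s = 0;
            for (int k = 0; k < DIM; k++)
                s = (s + a[i][k] * b[k][j] % m) % m;
            t[i][j] = s;
        }
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            c[i][j] = t[i][j];
}

/* r = A^p mod m in O(log p) multiplications (binary exponentiation) */
void mat_pow_mod(mat a, uint64_t p, mat r, uint64_t m)
{
    mat base;
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            r[i][j] = (i == j);        /* start from the identity */
            base[i][j] = a[i][j] % m;
        }
    while (p > 0) {
        if (p & 1) matmul_mod(r, base, r, m);  /* fold in this bit of p */
        matmul_mod(base, base, base, m);       /* square for the next bit */
        p >>= 1;
    }
}
```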
                                                           27
MRG32k3a: modulus

Combined and individual recurrences:

      z_n = (x_{n,1} - x_{n,2}) mod m_1

• Can compute using double precision divide – slow
• Use 64-bit integers (supported on GPU) – avoid divide
• Bit shift – faster (used in CPU implementations)
• Note: speed of the different possibilities is subject to change as NVIDIA updates floating point capability
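A plain-C sketch of the 64-bit integer approach (hypothetical helper names; the production code also uses the bit-shift tricks mentioned above):

```c
#include <stdint.h>

/* One recurrence step x_n = (a*x_{n-2} + b*x_{n-3}) mod m.
   With a, b, x < 2^32 the products fit in 64 bits, so no
   double-precision divide is needed. */
uint32_t step_mod(uint32_t a, uint32_t x2, uint32_t b, uint32_t x3,
                  uint32_t m)
{
    uint64_t s = (uint64_t)a * x2 + (uint64_t)b * x3;
    return (uint32_t)(s % m);
}

/* Combined output z = (x1 - x2) mod m1; both inputs are already
   reduced below m1, so one conditional add suffices. */
uint32_t combine_mod(uint32_t x1, uint32_t x2, uint32_t m1)
{
    int64_t z = (int64_t)x1 - (int64_t)x2;
    if (z < 0) z += m1;
    return (uint32_t)z;
}
```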


                                                     28
MRG32k3a: memory coalescence

• GPU performance limited by memory access
• Require memory coalescence for fast transfer of data
• Order RNG output to retain consecutive memory accesses:

      x_{n,t,b}, the n-th element generated by thread t in block b,
      is stored at index   t + N_t n + N_t N_p b
      (sequential ordering would be   n + N_p t + N_t N_p b)

      N_t = number of threads,  N_p = number of points per thread

(Implementation by M Giles)
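The two layouts written out in plain C (hypothetical helper names, following the indexing on this slide):

```c
/* Coalesced layout: element n from thread t in block b is stored at
   t + Nt*n + Nt*Np*b, so at each step the Nt threads of a block
   write Nt consecutive words. */
int coalesced_index(int n, int t, int b, int Nt, int Np)
{
    return t + Nt * n + Nt * Np * b;
}

/* Sequential layout: each thread owns a contiguous run of Np points. */
int sequential_index(int n, int t, int b, int Nt, int Np)
{
    return n + Np * t + Nt * Np * b;
}
```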


                                                                     29
MRG32k3a: single – double precision

• L’Ecuyer’s example implementation is in double precision floating point
• Double precision is available on high end GPUs – but arithmetic operations are much slower in execution than single precision
• GPU implementation works in integers – final output cast to double
• Note: output to single precision gives a sequence that does not pass randomness tests
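The loss under a single-precision cast is easy to see: float carries only about 7 significant digits, so distinct double-precision outputs collapse to the same value. A small plain-C illustration (not the NAG code):

```c
/* Two distinct doubles that agree to ~10 digits become equal floats. */
int same_as_float(double a, double b)
{
    return (float)a == (float)b;
}
```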


                                                   30
MRG32k3a: GPU benchmarking –
              double precision

GPU – NVIDIA Tesla C1060
CPU – serial version of integer implementation running on a single core of a quad-core Xeon
VSL – Intel Library MRG32k3a
ICC – Intel C/C++ compiler
VC++ – Microsoft Visual C++

                GPU        CPU-ICC    CPU-VC++   VSL-ICC    VSL-VC++
Samples/sec     3.00E+09   3.46E+07   4.77E+07   9.35E+07   9.32E+07

                                                              31
MRG32k3a: GPU benchmarking –
              single precision

Note: for double precision all sequences were identical
For single precision: GPU and CPU identical; GPU and VSL differ (max abs err 5.96E-008)
Which output is preferred? Use statistical tests of randomness

                GPU        CPU-ICC    CPU-VC++   VSL-ICC    VSL-VC++
Samples/sec     3.49E+09   3.58E+07   5.24E+07   1.02E+08   9.75E+07

                                                              32
LIBOR Market Model on GPU
An equally weighted portfolio of 15 swaptions, each with the same maturity but different lengths and different strikes




                                                       33
Numerical Libraries for GPUs


• The problem
  • The time-consuming work of writing basic numerical components should not be repeated
  • The general user should not need to spend many days writing each application
• The solution
  • Standard numerical components should be available as libraries for GPUs




                                                              34
NAG Routines for GPUs




                        35
nag_gpu_mrg32k3a_uniform




                           36
nag_gpu_mrg32k3a_uniform




                           37
nag_gpu_mrg32k3a_uniform




                           38
Example program: generate random numbers on GPU


...
// Allocate memory on Host
host_array = (double *)calloc(N, sizeof(double));

// Allocate memory on GPU
cudaMalloc((void **)&device_array, sizeof(double)*N);

// Call GPU functions
// Initialise random number generator
nag_gpu_mrg32k3a_init(V1, V2, offset);

// Generate random numbers
nag_gpu_mrg32k3a_uniform(nb, nt, np, device_array);

// Read back GPU results to host
cudaMemcpy(host_array, device_array, sizeof(double)*N,
           cudaMemcpyDeviceToHost);
...



                                                              39
NAG Routines for GPUs




                        40
nag_gpu_mrg32k3a_next_uniform




                                41
nag_gpu_mrg32k3a_next_uniform




                                42
Example program – kernel function
__global__ void mrg32k3a_kernel(int np, FP *d_P){
  unsigned int v1[3], v2[3];
  int n, i0;
  FP x;
  // initialisation for first point
  nag_gpu_mrg32k3a_stream_init(v1, v2, np);
  // now do points: coalesced layout, so the threads of a block
  // write consecutive elements at each step
  i0 = threadIdx.x + np*blockDim.x*blockIdx.x;
  for (n=0; n<np; n++) {
    nag_gpu_mrg32k3a_next_uniform(v1, v2, x);
    d_P[i0] = x;
    i0 += blockDim.x;
  }
}
                                                    43
Library issues: Auto-tuning


• Performance is affected by the mapping of the algorithm to the GPU via threads, blocks and warps
• Implement a code generator to produce variants using the relevant parameters
• Determine optimal performance
  Li, Dongarra & Tomov (2009)




                                                    44
Early Success with BNP Paribas
• Working with Fixed Income Research & Strategies Team (FIRST)
  • NAG mrg32k3a works well in BNP Paribas CUDA “Local Vol Monte-Carlo”
  • Passes rigorous statistical tests for randomness properties (Diehard, Dieharder, TestU01)
  • Performance good
  • Being able to match the GPU random numbers with the CPU version of mrg32k3a has been very valuable for establishing validity of output



                                                                    45
BNP Paribas Results – local vol example




                                          46
And with Bank of America Merrill Lynch




• “The NAG GPU libraries are helping us enormously by providing us with fast, good quality algorithms. This has let us concentrate on our models and deliver GPGPU based pricing much more quickly.”




                                                        47
“A N Other Tier 1” Risk Group

• “Thank you for the GPU code, we have achieved speed ups of x120”
• In a simple uncorrelated loss simulation:
  • Number of simulations      50,000
  • Time taken in seconds      2.373606
  • Simulations per second     21065
  • Simulated default rate     311.8472
  • Theoretical default rate   311.9125
• 24 trillion numbers in 6 hours

                                                  48
NAG routines for GPUs – 1

• Currently available
  • Random Number Generator (L’Ecuyer mrg32k3a)
    • Uniform distribution
    • Normal distribution
    • Exponential distribution
    • Gamma distribution
  • Sobol sequence for Quasi-Monte Carlo (to 19,000 dimensions)
  • Brownian Bridge



                                                         49
NAG routines for GPUs – 2



• Future plans
  • Random Number Generator – Mersenne Twister
  • Linear algebra components for PDE option pricing methods
  • Time series analysis – wavelets ...




                                                         50
Summary

• GPUs offer high performance computing for specific massively parallel algorithms such as Monte Carlo simulations
• GPUs are lower cost and require less power than corresponding CPU configurations
• Numerical libraries for GPUs will make these an important computing resource
• Higher order methods for GPUs are being considered



                                                       51
Acknowledgments
• Mike Giles (Mathematical Institute, University of Oxford) – algorithmic input
• Technology Strategy Board through Knowledge Transfer Partnership with Smith Institute
• NVIDIA for technical support and supply of Tesla C1060 and Quadro FX 5800
• See: www.nag.co.uk/numeric/GPUs/
• Contact: francois.cassier@nag.com


                                                     52
