Monte Carlo Simulation and its Efficient Implementation
Robert Tong, 28 January 2010
Outline Why use Monte Carlo simulation? Higher order methods and convergence GPU acceleration The need for numerical libraries
Why use Monte Carlo methods? Essential for high dimensional problems – many degrees of freedom For applications with uncertainty in inputs In finance: important in risk modelling and in pricing/hedging derivatives
The elements of Monte Carlo simulation Derivative pricing Simulate a path of asset values Compute payoff from path Compute option value Numerical components Pseudo-random number generator Discretization scheme
The demand for ever increasing performance In the past: faster solutions were provided by increasing processor speeds – want a quicker solution? Buy a new processor Present: multi-core/many-core architectures, without increased processor clock speeds – a major challenge for existing numerical algorithms The escalator has stopped... or gone into reverse! Existing codes may well run slower on multi-core
Ways to improve performance in Monte Carlo simulation Use higher order discretization  Keep low order (Euler) discretization –   make use of multi-core potential e.g. GPU (Graphics Processing Unit) Use high order discretization on GPU Use quasi-random sequence (Sobol’, …) and Brownian Bridge Implement Sobol’ sequence and Brownian Bridge on GPU
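The slides do not say which higher order discretization is used in the Warwick work; as one standard example, the Milstein scheme adds a single correction term to the Euler step for dS = rS dt + σS dW, raising strong convergence from order 1/2 to order 1. A sketch, not the method of the later slides:

```c
#include <math.h>

/* Euler step for dS = r*S dt + sigma*S dW  (strong order 1/2). */
double euler_step(double s, double r, double sigma, double dt, double dw)
{
    return s + r * s * dt + sigma * s * dw;
}

/* Milstein step: Euler plus the 0.5*sigma^2*S*(dW^2 - dt) correction
   (strong order 1).  For GBM the extra cost is negligible. */
double milstein_step(double s, double r, double sigma, double dt, double dw)
{
    return euler_step(s, r, sigma, dt, dw)
         + 0.5 * sigma * sigma * s * (dw * dw - dt);
}
```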
Higher order methods – 1 (work by Kai Zhang, University of Warwick, UK)
Higher order methods – 2
Numerical example – 1
Numerical example – 1a
Numerical example – 1b
Numerical example – 2a
Numerical example – 2b
GPU acceleration Retain low order Euler discretization Use multi-core GPU architecture to achieve speed-up
The Emergence of GPGPU Computing Initially – computation carried out by CPU (scalar, serial execution) CPU evolves to add cache, SSE instructions, ... GPU added to speed graphics display – driven by gaming needs: multi-core, SIMT, limited flexibility CPU and GPU move closer: CPU becomes multi-core, GPU becomes General Purpose (GPGPU) – fully programmable
Current GPU architecture – e.g. NVIDIA Tesla
Tesla – processing power SM – Streaming Multiprocessor: 8 × SP (Streaming Processor core), 2 × Special Function Unit, MT – multithreaded instruction fetch and issue unit, instruction cache, constant cache (read only), shared memory (16 KB, read/write) C1060 – adds double precision: 30 double precision cores, 240 single precision cores
Tesla C1060 memory  (from: M A Martinez-del-Amor et al. (2008) based on E Lindholm et al. (2008))
Programming GPUs – CUDA and OpenCL CUDA (Compute Unified Device Architecture, developed by NVIDIA) Extension of  C  to enable programming of GPU devices Allows easy management of parallel threads executing on GPU Handles communication with ‘host’ CPU OpenCL Standard language for multi-device programming Not tied to a particular company Will open up GPU computing Incorporates elements of CUDA
First step – obtaining and installing CUDA Free download from http://www.nvidia.com/object/cuda_learn.html See: Quickstart Guide Require: CUDA capable GPU – GeForce 8, 9, 200, Tesla, many Quadro; recent version of NVIDIA driver; CUDA Toolkit – essential components to compile and build applications; CUDA SDK – example projects Update environment variables (Linux defaults shown): PATH /usr/local/cuda/bin, LD_LIBRARY_PATH /usr/local/cuda/lib The CUDA compiler nvcc works with gcc (Linux) and MS VC++ (Windows)
Host (CPU) – Device (GPU) Relationship Application program initiated on Host (CPU) Device ‘kernels’ execute on GPU in SIMT (Single Instruction Multiple Thread) manner Host program  Transfers data from Host memory to Device (GPU) memory Specifies number and organisation of threads on Device Calls Device ‘kernel’ as a C function, passing parameters Copies output from Device back to Host memory
Organisation of threads on GPU SM (Streaming Multiprocessor) manages up to 1024  threads Each  thread   is identified by an index Threads execute as  Warps  of 32 threads Threads are grouped in  blocks  (user specifies number of threads per block) Blocks make up a  grid
Memory hierarchy On device can Read/write per-thread Registers Local memory Read/write per-block shared memory Read/write per-grid global memory Read only per-grid constant memory On host (CPU) can Read/write per-grid Global memory Constant memory
CUDA terminology ‘kernel’ – a C function executing on the GPU __global__ declares a function as a kernel: executed on the Device, callable only from the Host, void return type __device__ declares a function that is executed on the Device and callable only from the Device
Application to Monte Carlo simulation Monte Carlo paths lead to highly parallel algorithms Applications in finance, e.g. simulation based on an SDE: (return on asset) = drift + Brownian motion, dS/S = μ dt + σ dW Requires a fast pseudorandom or quasi-random number generator Additional techniques improve efficiency: Brownian Bridge, stratified sampling, …
Random Number Generators: choice of algorithm Must be highly parallel Implementation must satisfy statistical tests of randomness Some common generators do not guarantee randomness properties when split into parallel streams A suitable choice: MRG32k3a (L’Ecuyer)
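For reference, MRG32k3a (constants from L'Ecuyer, 1999) combines two order-3 recurrences modulo two primes just below 2^32; 64-bit integer arithmetic keeps every product exact. A serial C sketch (a GPU version would hold the six-word state per thread):

```c
#include <stdint.h>

#define M1 4294967087LL   /* 2^32 - 209   */
#define M2 4294944443LL   /* 2^32 - 22853 */

/* L'Ecuyer's example seed: any admissible nonzero state will do. */
static int64_t s1[3] = {12345, 12345, 12345};
static int64_t s2[3] = {12345, 12345, 12345};

double mrg32k3a_next(void)
{
    /* first component:  x_n = (1403580*x_{n-2} - 810728*x_{n-3}) mod M1 */
    int64_t p1 = (1403580 * s1[1] - 810728 * s1[0]) % M1;
    if (p1 < 0) p1 += M1;
    s1[0] = s1[1]; s1[1] = s1[2]; s1[2] = p1;

    /* second component: y_n = (527612*y_{n-1} - 1370589*y_{n-3}) mod M2 */
    int64_t p2 = (527612 * s2[2] - 1370589 * s2[0]) % M2;
    if (p2 < 0) p2 += M2;
    s2[0] = s2[1]; s2[1] = s2[2]; s2[2] = p2;

    /* combine the components and scale into (0,1) */
    int64_t z = p1 - p2;
    if (z <= 0) z += M1;
    return (double)z / (double)(M1 + 1);
}
```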
MRG32k3a: skip ahead The generator combines 2 recurrences, each a linear recurrence of order 3 modulo a prime – i.e. a state-vector update of the form y_{n+1} = A y_n mod m (M Giles, note on implementation) To give each parallel stream its own starting point, precompute A^p mod m in O(log p) operations on the CPU
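The skip-ahead works because each component is linear in its 3-word state: one step is multiplication by a companion matrix A (for the first component, top row (0, 1403580, m1 − 810728)), so a jump of p places is A^p mod m, computed by repeated squaring. A sketch with illustrative helper names:

```c
#include <stdint.h>

/* 3x3 matrix over Z_m.  Residues are below 2^32, so a 64-bit product
   of two residues of 2^32-209 just fits in a uint64_t. */
typedef struct { uint64_t e[3][3]; } mat3;

/* c = a*b mod m, reducing each product before accumulating. */
mat3 mat_mul(mat3 a, mat3 b, uint64_t m)
{
    mat3 c;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            uint64_t s = 0;
            for (int k = 0; k < 3; k++)
                s = (s + (a.e[i][k] * b.e[k][j]) % m) % m;
            c.e[i][j] = s;
        }
    return c;
}

/* A^p mod m by repeated squaring: O(log p) matrix products instead
   of p generator steps, so thread t can jump straight to offset t*p. */
mat3 mat_pow(mat3 a, uint64_t p, uint64_t m)
{
    mat3 r = {{{1,0,0},{0,1,0},{0,0,1}}};   /* identity */
    while (p > 0) {
        if (p & 1) r = mat_mul(r, a, m);
        a = mat_mul(a, a, m);
        p >>= 1;
    }
    return r;
}
```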
MRG32k3a: modulus The combined and individual recurrences all require reduction mod m Can compute using a double precision divide – slow Use 64 bit integers (supported on GPU) – avoids the divide Bit shift – faster still (used in CPU implementations) Note: the relative speed of these alternatives is subject to change as NVIDIA updates floating point capability
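The divide-free reduction exploits the special form of the modulus: for m1 = 2^32 − 209, writing x = q·2^32 + r gives x ≡ 209q + r (mod m1), so two shift-and-fold steps plus one conditional subtract replace the `%`. A sketch of the bit-shift approach mentioned above:

```c
#include <stdint.h>

#define M1 4294967087ULL   /* 2^32 - 209 */

/* Reduce any 64-bit x mod M1 without a divide.  After the first fold
   x < 2^40; after the second x < 2^32 + 2^16 < 2*M1, so a single
   conditional subtract finishes the reduction. */
uint64_t mod_m1(uint64_t x)
{
    x = 209 * (x >> 32) + (x & 0xffffffffULL);
    x = 209 * (x >> 32) + (x & 0xffffffffULL);
    if (x >= M1) x -= M1;
    return x;
}
```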
MRG32k3a: memory coalescence GPU performance is limited by memory access Require memory coalescence for fast transfer of data Order the RNG output so that consecutive threads access consecutive memory locations: each thread’s n-th value is stored with sequential (interleaved) ordering (Implementation by M Giles)
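The sequential ordering can be read off the kernel shown later in the deck: thread t of block b writes its n-th value at index t + n·blockDim + np·blockDim·b, so at each iteration n the 32 threads of a warp touch 32 consecutive words (one coalesced transaction), rather than each thread owning a contiguous chunk. As a host-side check of the index arithmetic:

```c
/* Interleaved (coalesced) output index, matching
   i0 = threadIdx.x + np*blockDim.x*blockIdx.x;  i0 += blockDim.x
   from the kernel: thread-major within each iteration. */
int coalesced_index(int thread, int block, int n, int blockDim, int np)
{
    return thread + n * blockDim + np * blockDim * block;
}
```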
MRG32k3a: single – double precision L’Ecuyer’s example implementation is in double precision floating point Double precision is available on high end GPUs – but arithmetic operations are much slower in execution than single precision The GPU implementation works in integers – final output cast to double Note: casting the output to single precision gives a sequence that does not pass randomness tests
MRG32k3a: GPU benchmarking – double precision GPU – NVIDIA Tesla C1060 CPU – serial version of integer implementation running on single core of quad-core Xeon VSL – Intel Library MRG32k3a ICC – Intel C/C++ compiler VC++ – Microsoft Visual C++
Samples/sec: GPU 3.00E+09, CPU-ICC 3.46E+07, CPU-VC++ 4.77E+07, VSL-ICC 9.35E+07, VSL-VC++ 9.32E+07
MRG32k3a: GPU benchmarking – single precision Note: for double precision all sequences were identical For single precision GPU and CPU are identical; GPU and VSL differ, max abs err 5.96E-008 Which output is preferred? Use statistical tests of randomness
Samples/sec: GPU 3.49E+09, CPU-ICC 3.58E+07, CPU-VC++ 5.24E+07, VSL-ICC 1.02E+08, VSL-VC++ 9.75E+07
LIBOR Market Model on GPU Equally weighted portfolio of 15 swaptions each with same maturity, but different lengths and different strikes
Numerical Libraries for GPUs The problem The time-consuming work of writing basic numerical components should not be repeated The general user should not need to spend many days writing each application The solution Standard numerical components should be available as libraries for GPUs
NAG Routines for GPUs
nag_gpu_mrg32k3a_uniform
Example program: generate random numbers on GPU

...
// Allocate memory on Host
host_array = (double *)calloc(N, sizeof(double));

// Allocate memory on GPU
cudaMalloc((void **)&device_array, sizeof(double)*N);

// Call GPU functions
// Initialise random number generator
nag_gpu_mrg32k3a_init(V1, V2, offset);

// Generate random numbers
nag_gpu_mrg32k3a_uniform(nb, nt, np, device_array);

// Read back GPU results to host
cudaMemcpy(host_array, device_array, sizeof(double)*N, cudaMemcpyDeviceToHost);
...
nag_gpu_mrg32k3a_next_uniform
Example program – kernel function

__global__ void mrg32k3a_kernel(int np, FP *d_P){
  unsigned int v1[3], v2[3];
  int n, i0;
  FP x, x2 = nanf("");

  // initialisation for first point
  nag_gpu_mrg32k3a_stream_init(v1, v2, np);

  // now do points
  i0 = threadIdx.x + np*blockDim.x*blockIdx.x;
  for (n=0; n<np; n++) {
    nag_gpu_mrg32k3a_next_uniform(v1, v2, x);
    d_P[i0] = x;
    i0 += blockDim.x;
  }
}
Library issues: Auto-tuning Performance is affected by the mapping of the algorithm to the GPU via threads, blocks and warps Implement a code generator to produce variants over the relevant parameters Determine optimal performance (Li, Dongarra & Tomov, 2009)
Early Success with BNP Paribas Working with Fixed Income Research & Strategies Team (FIRST) NAG mrg32k3a works well in BNP Paribas CUDA “Local Vol Monte-Carlo” Passes rigorous statistical tests for randomness properties (Diehard, Dieharder, TestU01) Performance is good Being able to match the GPU random numbers with the CPU version of mrg32k3a has been very valuable for establishing validity of output
BNP Paribas Results – local vol example
And with Bank of America Merrill Lynch “The NAG GPU libraries are helping us enormously by providing us with fast, good quality algorithms. This has let us concentrate on our models and deliver GPGPU based pricing much more quickly.”
“Thank you for the GPU code, we have achieved speed ups of x120” – “A N Other Tier 1” Risk Group In a simple uncorrelated loss simulation: number of simulations 50,000; time taken 2.373606 seconds; simulations per second 21,065; simulated default rate 311.8472; theoretical default rate 311.9125 24 trillion numbers in 6 hours
NAG routines for GPUs – 1 Currently available Random Number Generator (L’Ecuyer mrg32k3a): uniform distribution, Normal distribution, exponential distribution, gamma distribution Sobol sequence for Quasi-Monte Carlo (to 19,000 dimensions) Brownian Bridge
NAG routines for GPUs – 2  Future plans Random Number Generator – Mersenne Twister Linear algebra components for PDE option pricing methods Time series analysis – wavelets ...
Summary GPUs offer high performance computing for specific massively parallel algorithms such as Monte Carlo simulations GPUs are lower cost and require less power than corresponding CPU configurations Numerical libraries for GPUs will make these an important computing resource Higher order methods for GPUs being considered
Acknowledgments Mike Giles (Mathematical Institute, University of Oxford) – algorithmic input Technology Strategy Board through Knowledge Transfer Partnership with Smith Institute NVIDIA for technical support and supply of Tesla C1060 and Quadro FX 5800 See  www.nag.co.uk/numeric/GPUs/ Contact:  [email_address]
