Monte Carlo Simulation
and its Efficient Implementation

                                    Robert Tong
                                  28 January 2010




Experts in numerical algorithms
and HPC services
Outline



• Why use Monte Carlo simulation?
• Higher order methods and convergence
• GPU acceleration
• The need for numerical libraries




Why use Monte Carlo methods?



• Essential for high-dimensional problems – many
  degrees of freedom
• For applications with uncertainty in inputs
• In finance
  •   Important in risk modelling
  •   Pricing/hedging derivatives




The elements of Monte Carlo simulation



• Derivative pricing
  •   Simulate a path of asset values
  •   Compute payoff from path
  •   Compute option value
• Numerical components
  •   Pseudo-random number generator
  •   Discretization scheme




The demand for ever increasing performance

• In the past
  •   Faster solutions were provided by increasing processor
      clock speeds
  •   Want a quicker solution? Buy a new processor
• Present
  •   Multi-core/many-core architectures, without increased
      processor clock speeds
  •   A major challenge for existing numerical algorithms
• The escalator has stopped... or gone into reverse!
  •   Existing codes may well run slower on multi-core processors


Ways to improve performance in Monte Carlo
simulation
1. Use higher order discretization
2. Keep low order (Euler) discretization –
      make use of multi-core potential
      e.g. GPU (Graphics Processing Unit)
3. Use high order discretization on GPU
4. Use quasi-random sequence (Sobol’, …) and
   Brownian Bridge
5. Implement Sobol’ sequence and Brownian Bridge
   on GPU


Higher order methods – 1
(work by Kai Zhang, University of Warwick, UK)




Higher order methods – 2




Numerical example – 1




Numerical example – 1a




Numerical example – 1b




Numerical example – 2a




Numerical example – 2b




GPU acceleration




• Retain low order Euler discretization
• Use multi-core GPU architecture to achieve speed-up




The Emergence of GPGPU Computing
• Initially – computation carried out by CPU (scalar,
  serial execution)
• CPU
  •   evolves to add cache, SSE instructions, ...
• GPU
  •   added to speed graphics display – driven by gaming needs
  •   multi-core, SIMT, limited flexibility
• CPU and GPU move closer
  •   CPU becomes multi-core
  •   GPU becomes General Purpose (GPGPU) – fully
      programmable

Current GPU architecture – e.g. NVIDIA Tesla




Tesla – processing power
• SM – Streaming Multiprocessor
  •   8 × SP – Streaming Processor core
  •   2 × Special Function Unit
  •   MT – Multithreaded instruction fetch and issue unit
  •   Instruction cache
  •   Constant cache (read only)
  •   Shared memory (16 KB, read/write)
• C1060 – adds double precision
  •   30 double precision cores (one per SM)
  •   240 single precision cores (30 SMs × 8 SPs)



Tesla C1060 memory
(from: M A Martinez-del-Amor et al. (2008) based on E Lindholm et al. (2008))




Programming GPUs – CUDA and OpenCL

• CUDA (Compute Unified Device Architecture,
  developed by NVIDIA)
  •   Extension of C to enable programming of GPU devices
  •   Allows easy management of parallel threads executing on
      GPU
  •   Handles communication with ‘host’ CPU
• OpenCL
  •   Standard language for multi-device programming
  •   Not tied to a particular company
  •   Will open up GPU computing
  •   Incorporates elements of CUDA

First step – obtaining and installing CUDA
• FREE download from
  http://www.nvidia.com/object/cuda_learn.html
• See: Quickstart Guide
• Requirements:
  •   CUDA capable GPU – GeForce 8, 9, 200, Tesla, many Quadro
  •   Recent version of NVIDIA driver
  •   CUDA Toolkit – essential components to compile and build applications
  •   CUDA SDK – example projects
• Update environment variables (Linux default shown)
  •   PATH                /usr/local/cuda/bin
  •   LD_LIBRARY_PATH     /usr/local/cuda/lib
• CUDA compiler nvcc works with gcc (Linux) and MS VC++ (Windows)



Host (CPU) – Device (GPU) Relationship

• Application program initiated on Host (CPU)
• Device ‘kernels’ execute on GPU in SIMT (Single
  Instruction Multiple Thread) manner
• Host program
  •   Transfers data from Host memory to Device (GPU)
      memory
  •   Specifies number and organisation of threads on Device
  •   Calls Device ‘kernel’ as a C function, passing parameters
  •   Copies output from Device back to Host memory


Organisation of threads on GPU



• SM (Streaming Multiprocessor) manages up to 1024
  threads
• Each thread is identified by an index
• Threads execute as Warps of 32 threads
• Threads are grouped in blocks (user specifies
  number of threads per block)
• Blocks make up a grid
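
The arithmetic behind this organisation can be sketched in plain C on the host side. The helper names below are hypothetical and chosen for illustration; only the warp size of 32 and the block/grid structure come from the slide.

```c
#include <assert.h>

#define WARP_SIZE 32   /* threads per warp, as stated on the slide */

/* Number of blocks needed so that blocks * threads_per_block >= n_items. */
int blocks_needed(int n_items, int threads_per_block) {
    assert(n_items >= 0 && threads_per_block > 0);
    return (n_items + threads_per_block - 1) / threads_per_block;
}

/* Number of warps one block occupies (a partial warp still takes a whole one). */
int warps_per_block(int threads_per_block) {
    assert(threads_per_block > 0);
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

/* Global index of a thread in a 1-D grid; inside a CUDA kernel this is
   blockIdx.x * blockDim.x + threadIdx.x. */
int global_thread_index(int block_idx, int block_dim, int thread_idx) {
    assert(thread_idx >= 0 && thread_idx < block_dim);
    return block_idx * block_dim + thread_idx;
}
```

For example, 1000 work items with 256 threads per block need 4 blocks, and each such block runs as 8 warps. Choosing a block size that is a multiple of 32 avoids partially filled warps.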



Memory hierarchy


• Device code can
  •   Read/write per-thread
      •   Registers
      •   Local memory
  •   Read/write per-block shared memory
  •   Read/write per-grid global memory
  •   Read only per-grid constant memory
• Host (CPU) code can
  •   Read/write per-grid
      •   Global memory
      •   Constant memory

CUDA terminology


• ‘kernel’ – C function executing on the GPU
  •   __global__ declares function as a kernel
      •   Executed on the Device
      •   Callable only from the Host
      •   void return type
  •   __device__ declares function that is
      •   Executed on the Device
      •   Callable only from the Device




Application to Monte Carlo simulation
Monte Carlo paths lead to highly parallel
  algorithms
• Applications in finance, e.g. simulation based
  on an SDE (return on asset):
        $\frac{dS_t}{S_t} = \mu\,dt + \sigma\,dW_t$      (drift + Brownian motion)
• Requires a fast pseudo-random or
  quasi-random number generator
• Additional techniques improve efficiency:
  Brownian Bridge, stratified sampling, …
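
For reference, the low-order Euler (Euler-Maruyama) discretization of this SDE over a time step $\Delta t$ can be written as below. The notation is an assumption made here rather than taken from the slide: $\mu$ is the drift, $\sigma$ the volatility, and $Z_n$ a standard normal sample supplied by the random number generator.

```latex
S_{n+1} = S_n \left( 1 + \mu\,\Delta t + \sigma\,\sqrt{\Delta t}\,Z_n \right),
\qquad Z_n \sim N(0,1) \ \text{independent}
```

Every path needs one normal sample per time step, which is why the speed of the generator matters so much for the overall simulation.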


Random Number Generators:
           choice of algorithm


• Must be highly parallel
• Implementation must satisfy statistical
  tests of randomness
• Some common generators do not
  guarantee randomness properties when
  split into parallel streams
• A suitable choice: MRG32k3a (L’Ecuyer)

MRG32k3a: skip ahead

Generator combines 2 recurrences:
      $x_{n,1} = (a_1 x_{n-2,1} + b_1 x_{n-3,1}) \bmod m_1$

      $x_{n,2} = (a_2 x_{n-1,2} + b_2 x_{n-3,2}) \bmod m_2$

Each recurrence is of the form (M Giles, note on
 implementation)
      $y_n = A\,y_{n-1}$,   where   $y_n = (x_n,\ x_{n-1},\ x_{n-2})^T$

Precompute $A^p$ in $O(\log p)$ operations on CPU, then
      $y_{n+p} = A^p\,y_n \bmod m$
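
The skip-ahead idea can be sketched in plain C as follows. This is an illustration under stated assumptions, not the implementation referred to in the talk: the function names are hypothetical, and the matrix shown is for the first recurrence only, using L’Ecuyer’s published constants ($a_1 = 1403580$, $b_1 = -810728$, $m_1 = 2^{32} - 209$).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define M1 4294967087ULL   /* 2^32 - 209 */

/* c = a * b (mod m) for 3x3 matrices whose entries are already reduced mod m. */
static void matmul_mod(uint64_t a[3][3], uint64_t b[3][3],
                       uint64_t c[3][3], uint64_t m) {
    uint64_t t[3][3];
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            uint64_t s = 0;
            for (int k = 0; k < 3; k++)
                s = (s + (a[i][k] * b[k][j]) % m) % m;  /* product fits in 64 bits */
            t[i][j] = s;
        }
    memcpy(c, t, sizeof t);   /* temporary allows c to alias a or b */
}

/* result = a^p (mod m) by repeated squaring: O(log p) matrix products. */
void matpow_mod(uint64_t a[3][3], uint64_t p, uint64_t result[3][3], uint64_t m) {
    assert(m > 1 && m <= (1ULL << 32));   /* keeps every product below 2^64 */
    uint64_t base[3][3];
    memcpy(base, a, sizeof base);
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            result[i][j] = (i == j);      /* start from the identity matrix */
    while (p > 0) {
        if (p & 1) matmul_mod(result, base, result, m);
        matmul_mod(base, base, base, m);
        p >>= 1;
    }
}

/* y = a * y (mod m) for the state vector y = (x_n, x_{n-1}, x_{n-2}). */
void matvec_mod(uint64_t a[3][3], uint64_t y[3], uint64_t m) {
    uint64_t t[3];
    for (int i = 0; i < 3; i++) {
        uint64_t s = 0;
        for (int k = 0; k < 3; k++)
            s = (s + (a[i][k] * y[k]) % m) % m;
        t[i] = s;
    }
    memcpy(y, t, sizeof t);
}

/* One-step matrix of the first recurrence; the negative coefficient
   -810728 is stored as its residue M1 - 810728. */
void mrg32k3a_matrix1(uint64_t a[3][3]) {
    uint64_t a1[3][3] = { {0, 1403580, M1 - 810728}, {1, 0, 0}, {0, 1, 0} };
    memcpy(a, a1, sizeof a1);
}
```

With $A^p$ in hand, each thread can be given a starting state that is $p$ steps ahead of its neighbour's, so the threads draw from non-overlapping segments of one long sequence.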
MRG32k3a: modulus

Combined and individual recurrences
             $z_n = (x_{n,1} - x_{n,2}) \bmod m_1$
• Can compute using double precision divide – slow
• Use 64-bit integers (supported on GPU) – avoid
  divide
• Bit shift – faster (used in CPU implementations)
• Note: speed of different possibilities subject to
  change as NVIDIA updates floating point
  capability
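
A 64-bit integer version of one generator step might look like the sketch below. It is an assumption-laden illustration, not the NAG or Giles code: the type and function names are hypothetical, the constants are L’Ecuyer’s published ones, and the C `%` operator is used for clarity where an optimised version would substitute the shift-based reduction mentioned above.

```c
#include <assert.h>
#include <stdint.h>

#define M1 4294967087ULL   /* 2^32 - 209   */
#define M2 4294944443ULL   /* 2^32 - 22853 */

/* State: s1 = (x_{n-1,1}, x_{n-2,1}, x_{n-3,1}), and likewise s2. */
typedef struct { uint64_t s1[3], s2[3]; } mrg32k3a_state;

void mrg32k3a_seed(mrg32k3a_state *g, uint32_t seed) {
    /* Seed values must be below the moduli and not all zero. */
    assert(seed > 0 && seed < M2);
    for (int i = 0; i < 3; i++) g->s1[i] = g->s2[i] = seed;
}

/* Advance one step and return a double in (0,1). */
double mrg32k3a_next(mrg32k3a_state *g) {
    /* x_{n,1} = (1403580 x_{n-2,1} - 810728 x_{n-3,1}) mod m1.
       Writing -x as (M1 - x) keeps the intermediate value non-negative. */
    uint64_t p1 = (1403580ULL * g->s1[1] + 810728ULL * (M1 - g->s1[2])) % M1;
    g->s1[2] = g->s1[1]; g->s1[1] = g->s1[0]; g->s1[0] = p1;

    /* x_{n,2} = (527612 x_{n-1,2} - 1370589 x_{n-3,2}) mod m2. */
    uint64_t p2 = (527612ULL * g->s2[0] + 1370589ULL * (M2 - g->s2[2])) % M2;
    g->s2[2] = g->s2[1]; g->s2[1] = g->s2[0]; g->s2[0] = p2;

    /* z_n = (x_{n,1} - x_{n,2}) mod m1, with 0 mapped to m1 so the output is never 0. */
    uint64_t z = (p1 > p2) ? p1 - p2 : p1 + M1 - p2;
    return (double)z / (double)(M1 + 1);
}
```

All intermediate products stay well below $2^{64}$, so no floating point arithmetic is needed until the final conversion of $z_n$ to a double.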


MRG32k3a: memory coalescence

• GPU performance limited by memory access
• Require memory coalescence for fast transfer of data
• Order RNG output to retain consecutive memory
  accesses
        $x_{n,t,b}$, the $n$th element generated by thread $t$ in block $b$,
  is stored at
                    $t + N_t n + N_t N_p b$
  whereas sequential ordering would store it at   $n + N_p t + N_t N_p b$
                $N_t$ = number of threads, $N_p$ = number of points per thread
(Implementation by M Giles)
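
The two storage layouts can be compared with a pair of small C helpers. The function names are hypothetical; the index formulas are the ones given on the slide, with `nt` the number of threads per block and `np` the number of points per thread.

```c
#include <assert.h>
#include <stddef.h>

/* Interleaved layout from the slide: t + Nt*n + Nt*Np*b. */
size_t coalesced_index(size_t n, size_t t, size_t b, size_t nt, size_t np) {
    assert(t < nt && n < np);
    return t + nt * n + nt * np * b;
}

/* Sequential (per-thread contiguous) layout: n + Np*t + Nt*Np*b. */
size_t sequential_index(size_t n, size_t t, size_t b, size_t nt, size_t np) {
    assert(t < nt && n < np);
    return n + np * t + nt * np * b;
}
```

In the interleaved layout, neighbouring threads writing their $n$th point touch adjacent addresses, which is the access pattern that coalesces. In the sequential layout the same writes are `np` elements apart.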


MRG32k3a: single – double precision

• L’Ecuyer’s example implementation is in double
  precision floating point
• Double precision is available on high-end GPUs – but
  arithmetic operations are much slower in execution
  than single precision
• GPU implementation in integers – final output
  cast to double
• Note: output to single precision gives a sequence
  that does not pass randomness tests


MRG32k3a: GPU benchmarking –
              double precision

GPU – NVIDIA Tesla C1060
CPU – serial version of integer implementation running on
  single core of quad-core Xeon
VSL – Intel Library MRG32k3a
ICC – Intel C/C++ compiler
VC++ – Microsoft Visual C++
              GPU        CPU-ICC    CPU-VC++   VSL-ICC    VSL-VC++
Samples/sec   3.00E+09   3.46E+07   4.77E+07   9.35E+07   9.32E+07
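
As a rough reading of these figures, the speed-up ratios implied by the table can be worked out directly. The ratios below are derived here from the quoted sample rates and are not stated on the slide.

```latex
\frac{\text{GPU}}{\text{VSL-ICC}} = \frac{3.00 \times 10^{9}}{9.35 \times 10^{7}} \approx 32,
\qquad
\frac{\text{GPU}}{\text{CPU-ICC}} = \frac{3.00 \times 10^{9}}{3.46 \times 10^{7}} \approx 87
```

Both CPU figures are for a single core, so these ratios compare one GPU against one core of the quad-core Xeon.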

MRG32k3a: GPU benchmarking –
              single precision

Note: for double precision all sequences were identical
For single precision: GPU and CPU identical
                      GPU and VSL differ
                      max abs err 5.96E-008 (about $2^{-24}$, the single precision unit roundoff)
Which output is preferred?
       Use statistical tests of randomness

              GPU        CPU-ICC    CPU-VC++   VSL-ICC    VSL-VC++
Samples/sec   3.49E+09   3.58E+07   5.24E+07   1.02E+08   9.75E+07

LIBOR Market Model on GPU
Equally weighted portfolio of 15 swaptions each with
same maturity, but different lengths and
different strikes




Numerical Libraries for GPUs


• The problem
  •   The time-consuming work of writing basic numerical
      components should not be repeated
  •   The general user should not need to spend many days
      writing each application
• The solution
  •   Standard numerical components should be available as
      libraries for GPUs




NAG Routines for GPUs




nag_gpu_mrg32k3a_uniform




nag_gpu_mrg32k3a_uniform




nag_gpu_mrg32k3a_uniform




Example program: generate random numbers on GPU


   ...
   // Allocate memory on Host
   host_array = (double *)calloc(N, sizeof(double));

   // Allocate memory on GPU
   cudaMalloc((void **)&device_array, sizeof(double)*N);

   // Call GPU functions
   // Initialise random number generator
   nag_gpu_mrg32k3a_init(V1, V2, offset);
   // Generate random numbers
   nag_gpu_mrg32k3a_uniform(nb, nt, np, device_array);

   // Read back GPU results to host
   cudaMemcpy(host_array, device_array, sizeof(double)*N,
              cudaMemcpyDeviceToHost);
   ...



NAG Routines for GPUs




nag_gpu_mrg32k3a_next_uniform




nag_gpu_mrg32k3a_next_uniform




Example program – kernel function
__global__ void mrg32k3a_kernel(int np, FP *d_P){
  unsigned int v1[3], v2[3];
  int n, i0;
  FP x, x2 = nanf("");
  // initialisation for first point
  nag_gpu_mrg32k3a_stream_init(v1, v2, np);
  // now do points: each thread stores its n-th value at
  // threadIdx.x + n*blockDim.x + np*blockDim.x*blockIdx.x,
  // so neighbouring threads write to neighbouring addresses
  i0 = threadIdx.x + np*blockDim.x*blockIdx.x;
  for (n=0; n<np; n++) {
    nag_gpu_mrg32k3a_next_uniform(v1, v2, x);
    d_P[i0] = x;
    i0 += blockDim.x;
  }
}
Library issues: Auto-tuning


• Performance affected by mapping of algorithm to
  GPU via threads, blocks and warps
• Implement a code generator to produce variants
  using the relevant parameters
• Determine optimal performance
  Li, Dongarra & Tomov (2009)




Early Success with BNP Paribas
• Working with Fixed Income Research & Strategies
  Team (FIRST)
  •   NAG mrg32k3a works well in BNP Paribas CUDA “Local Vol
      Monte-Carlo”
  •   Passes rigorous statistical tests for randomness properties
      (Diehard, Dieharder, TestU01)
  •   Performance good
  •   Being able to match the GPU random numbers with the
      CPU version of mrg32k3a has been very valuable for
      establishing validity of output



BNP Paribas Results – local vol example




And with Bank of America Merrill Lynch




• “The NAG GPU libraries are helping us enormously
  by providing us with fast, good quality algorithms.
  This has let us concentrate on our models and
  deliver GPGPU based pricing much more quickly.”




“A N Other Tier 1” Risk Group

• “Thank you for the GPU code, we have achieved
  speed ups of x120”
• In a simple uncorrelated loss simulation:
  •   Number of simulations      50,000
  •   Time taken in seconds      2.373606
  •   Simulations per second     21065
  •   Simulated default rate     311.8472
  •   Theoretical default rate   311.9125
• 24 trillion numbers in 6 hours

NAG routines for GPUs – 1

• Currently available
  •   Random Number Generator (L’Ecuyer mrg32k3a)
      •   Uniform distribution
      •   Normal distribution
      •   Exponential distribution
      •   Gamma distribution
  •   Sobol’ sequence for Quasi-Monte Carlo (to 19,000
      dimensions)
  •   Brownian Bridge



NAG routines for GPUs – 2



• Future plans
  •   Random Number Generator – Mersenne Twister
  •   Linear algebra components for PDE option pricing
      methods
  •   Time series analysis – wavelets ...




Summary

• GPUs offer high performance computing for specific
  massively parallel algorithms such as Monte Carlo
  simulations
• GPUs are lower cost and require less power than
  corresponding CPU configurations
• Numerical libraries for GPUs will make these an
  important computing resource
• Higher order methods for GPUs being considered



Acknowledgments
Mike Giles (Mathematical Institute, University of
 Oxford) – algorithmic input
Technology Strategy Board through Knowledge
 Transfer Partnership with Smith Institute
NVIDIA for technical support and supply of Tesla
 C1060 and Quadro FX 5800
See
     www.nag.co.uk/numeric/GPUs/
Contact:
     francois.cassier@nag.com



More Related Content

What's hot

Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08
Angela Mendoza M.
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
A beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAA beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDA
Piyush Mittal
 

What's hot (17)

Gcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodesGcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodes
 
Cuda tutorial
Cuda tutorialCuda tutorial
Cuda tutorial
 
NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009NVidia CUDA Tutorial - June 15, 2009
NVidia CUDA Tutorial - June 15, 2009
 
CUDA
CUDACUDA
CUDA
 
Cuda intro
Cuda introCuda intro
Cuda intro
 
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
GPU Programming
GPU ProgrammingGPU Programming
GPU Programming
 
05 defense
05 defense05 defense
05 defense
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
 
Parallel Vision by GPGPU/CUDA
Parallel Vision by GPGPU/CUDAParallel Vision by GPGPU/CUDA
Parallel Vision by GPGPU/CUDA
 
Media SDK Webinar 2014
Media SDK Webinar 2014Media SDK Webinar 2014
Media SDK Webinar 2014
 
A beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDAA beginner’s guide to programming GPUs with CUDA
A beginner’s guide to programming GPUs with CUDA
 
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by...
 
Introduction to GPU Programming
Introduction to GPU ProgrammingIntroduction to GPU Programming
Introduction to GPU Programming
 
Gpu perf-presentation
Gpu perf-presentationGpu perf-presentation
Gpu perf-presentation
 
PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model ...
PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model ...PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model ...
PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model ...
 

Viewers also liked

Lecture1
Lecture1Lecture1
Lecture1
rjaeh
 
Database and Access Power Point
Database and Access Power PointDatabase and Access Power Point
Database and Access Power Point
Ayee_Its_Bailey
 
Access lesson 06 Integrating Access
Access lesson 06  Integrating AccessAccess lesson 06  Integrating Access
Access lesson 06 Integrating Access
Aram SE
 
Communication skills in english
Communication skills in englishCommunication skills in english
Communication skills in english
Aqib Memon
 
Access lesson 02 Creating a Database
Access lesson 02 Creating a DatabaseAccess lesson 02 Creating a Database
Access lesson 02 Creating a Database
Aram SE
 
01 computer%20 forensics%20in%20todays%20world
01 computer%20 forensics%20in%20todays%20world01 computer%20 forensics%20in%20todays%20world
01 computer%20 forensics%20in%20todays%20world
Aqib Memon
 
Computer Forensics &amp; Windows Registry
Computer Forensics &amp; Windows RegistryComputer Forensics &amp; Windows Registry
Computer Forensics &amp; Windows Registry
aradhanalaw
 
Access lesson 04 Creating and Modifying Forms
Access lesson 04 Creating and Modifying FormsAccess lesson 04 Creating and Modifying Forms
Access lesson 04 Creating and Modifying Forms
Aram SE
 
Access lesson05
Access lesson05Access lesson05
Access lesson05
Aram SE
 
European pricing with monte carlo simulation
European pricing with monte carlo simulationEuropean pricing with monte carlo simulation
European pricing with monte carlo simulation
Giovanni Della Lunga
 

Viewers also liked (20)

Lecture1
Lecture1Lecture1
Lecture1
 
Monte Carlo Statistical Methods
Monte Carlo Statistical MethodsMonte Carlo Statistical Methods
Monte Carlo Statistical Methods
 
Database and Access Power Point
Database and Access Power PointDatabase and Access Power Point
Database and Access Power Point
 
Monte Carlo Simulation
Monte Carlo SimulationMonte Carlo Simulation
Monte Carlo Simulation
 
OWASP Khartoum Cyber Security Session
OWASP Khartoum Cyber Security SessionOWASP Khartoum Cyber Security Session
OWASP Khartoum Cyber Security Session
 
Access lesson 06 Integrating Access
Access lesson 06  Integrating AccessAccess lesson 06  Integrating Access
Access lesson 06 Integrating Access
 
Communication skills in english
Communication skills in englishCommunication skills in english
Communication skills in english
 
Access lesson 02 Creating a Database
Access lesson 02 Creating a DatabaseAccess lesson 02 Creating a Database
Access lesson 02 Creating a Database
 
01 computer%20 forensics%20in%20todays%20world
01 computer%20 forensics%20in%20todays%20world01 computer%20 forensics%20in%20todays%20world
01 computer%20 forensics%20in%20todays%20world
 
Chapter 4 microsoft access 2010
Chapter 4 microsoft access 2010Chapter 4 microsoft access 2010
Chapter 4 microsoft access 2010
 
Computer Forensics &amp; Windows Registry
Computer Forensics &amp; Windows RegistryComputer Forensics &amp; Windows Registry
Computer Forensics &amp; Windows Registry
 
Model inquiri
Model inquiriModel inquiri
Model inquiri
 
Computer Forensics
Computer ForensicsComputer Forensics
Computer Forensics
 
Super Efficient Monte Carlo Simulation
Super Efficient Monte Carlo SimulationSuper Efficient Monte Carlo Simulation
Super Efficient Monte Carlo Simulation
 
Access lesson 04 Creating and Modifying Forms
Access lesson 04 Creating and Modifying FormsAccess lesson 04 Creating and Modifying Forms
Access lesson 04 Creating and Modifying Forms
 
Access lesson05
Access lesson05Access lesson05
Access lesson05
 
Agape explains the importance Of Computer Forensics.
Agape explains the importance Of Computer Forensics.Agape explains the importance Of Computer Forensics.
Agape explains the importance Of Computer Forensics.
 
Unit 5 general principles, simulation software
Unit 5 general principles, simulation softwareUnit 5 general principles, simulation software
Unit 5 general principles, simulation software
 
Hemolytic anaemia
Hemolytic anaemiaHemolytic anaemia
Hemolytic anaemia
 
European pricing with monte carlo simulation
European pricing with monte carlo simulationEuropean pricing with monte carlo simulation
European pricing with monte carlo simulation
 

Similar to Monte Carlo G P U Jan2010

Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiecture
Haris456
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Akihiro Hayashi
 

Similar to Monte Carlo G P U Jan2010 (20)

Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUs
 
gpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsngpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsn
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universe
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiecture
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdf
 
Exploring Gpgpu Workloads
Exploring Gpgpu WorkloadsExploring Gpgpu Workloads
Exploring Gpgpu Workloads
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
QGATE 0.3: QUANTUM CIRCUIT SIMULATOR
QGATE 0.3: QUANTUM CIRCUIT SIMULATORQGATE 0.3: QUANTUM CIRCUIT SIMULATOR
QGATE 0.3: QUANTUM CIRCUIT SIMULATOR
 
Lrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rLrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with r
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computation
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Boyang gao gpu k-means_gmm_final_v1
Boyang gao gpu k-means_gmm_final_v1Boyang gao gpu k-means_gmm_final_v1
Boyang gao gpu k-means_gmm_final_v1
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
 
Nvidia (History, GPU Architecture and New Pascal Architecture)
Nvidia (History, GPU Architecture and New Pascal Architecture)Nvidia (History, GPU Architecture and New Pascal Architecture)
Nvidia (History, GPU Architecture and New Pascal Architecture)
 
Can FPGAs Compete with GPUs?
Can FPGAs Compete with GPUs?Can FPGAs Compete with GPUs?
Can FPGAs Compete with GPUs?
 
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to Accelerators
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learning
 

More from John Holden

More from John Holden (6)

Cloud Task Execution at Scale with example from quant finance
Cloud Task Execution at Scale with example from quant financeCloud Task Execution at Scale with example from quant finance
Cloud Task Execution at Scale with example from quant finance
 
ISC Frankfurt 2015: Good, bad and ugly of accelerators and a complementary path
ISC Frankfurt 2015: Good, bad and ugly of accelerators and a complementary pathISC Frankfurt 2015: Good, bad and ugly of accelerators and a complementary path
ISC Frankfurt 2015: Good, bad and ugly of accelerators and a complementary path
 
NAG software for the Actuarial Community (Sep. 2012)
NAG software for the Actuarial Community (Sep. 2012)NAG software for the Actuarial Community (Sep. 2012)
NAG software for the Actuarial Community (Sep. 2012)
 
Wilmott Nyc Jul2012 Nag Talk John Holden
Wilmott Nyc Jul2012 Nag Talk John HoldenWilmott Nyc Jul2012 Nag Talk John Holden
Wilmott Nyc Jul2012 Nag Talk John Holden
 
Numerical Excellence In Finance N A G Jan2010
Numerical Excellence In Finance N A G Jan2010Numerical Excellence In Finance N A G Jan2010
Numerical Excellence In Finance N A G Jan2010
 
N A G P A R I S280101
N A G P A R I S280101N A G P A R I S280101
N A G P A R I S280101
 

Recently uploaded

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Recently uploaded (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024

Monte Carlo G P U Jan2010

  • 1. Monte Carlo Simulation and its Efficient Implementation. Robert Tong, 28 January 2010. Experts in numerical algorithms and HPC services
  • 2. Outline
      • Why use Monte Carlo simulation?
      • Higher order methods and convergence
      • GPU acceleration
      • The need for numerical libraries
  • 3. Why use Monte Carlo methods?
      • Essential for high dimensional problems – many degrees of freedom
      • For applications with uncertainty in inputs
      • In finance
          • Important in risk modelling
          • Pricing/hedging derivatives
  • 4. The elements of Monte Carlo simulation
      • Derivative pricing
          • Simulate a path of asset values
          • Compute payoff from path
          • Compute option value
      • Numerical components
          • Pseudo-random number generator
          • Discretization scheme
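The last two pricing steps on slide 4 (payoff from path, then option value) can be sketched in plain C. This is a minimal illustration rather than code from the talk: it assumes a European call payoff, and the names `call_payoff` and `mc_call_value` and the `discount` parameter are hypothetical.

```c
#include <stddef.h>

/* Payoff of a European call with the given strike at terminal value s_T.
 * (The call payoff is an assumption; the slides do not fix a product.) */
double call_payoff(double s_T, double strike)
{
    return s_T > strike ? s_T - strike : 0.0;
}

/* Monte Carlo option value: discounted mean payoff over n simulated
 * terminal asset values. `discount` is the discount factor to maturity,
 * supplied by the caller. */
double mc_call_value(const double *s_T, size_t n, double strike,
                     double discount)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += call_payoff(s_T[i], strike);
    return discount * sum / (double)n;
}
```

Each path contributes one independent term to the sum, which is why the method parallelises well: every thread can simulate and evaluate its own paths before a final reduction.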
  • 5. The demand for ever increasing performance
      • In the past
          • Faster solution has been provided by increasing processor speeds
          • Want a quicker solution? Buy a new processor
      • Present
          • Multi-core/Many-core architectures, without increased processor clock speeds
          • A major challenge for existing numerical algorithms
      • The escalator has stopped... or gone into reverse!
          • Existing codes may well run slower on multi-core
  • 6. Ways to improve performance in Monte Carlo simulation
      1. Use higher order discretization
      2. Keep low order (Euler) discretization – make use of multi-core potential, e.g. GPU (Graphics Processing Unit)
      3. Use high order discretization on GPU
      4. Use quasi-random sequence (Sobol’, …) and Brownian Bridge
      5. Implement Sobol’ sequence and Brownian Bridge on GPU
  • 7. Higher order methods – 1 (work by Kai Zhang, University of Warwick, UK)
  • 14. GPU acceleration
      • Retain low order Euler discretization
      • Use multi-core GPU architecture to achieve speed-up
  • 15. The Emergence of GPGPU Computing
      • Initially – computation carried out by CPU (scalar, serial execution)
      • CPU
          • evolves to add cache, SSE instructions, ...
      • GPU
          • added to speed graphics display – driven by gaming needs
          • multi-core, SIMT, limited flexibility
      • CPU and GPU move closer
          • CPU becomes multi-core
          • GPU becomes General Purpose (GPGPU) – fully programmable
  • 16. Current GPU architecture – e.g. NVIDIA Tesla
  • 17. Tesla – processing power
      • SM – Streaming Multiprocessor
          • 8 × SP – Streaming Processor core
          • 2 × Special Function Unit
          • MT – Multithreaded instruction fetch and issue unit
          • Instruction cache
          • Constant cache (read only)
          • Shared memory (16 KB, read/write)
      • C1060 – adds double precision
          • 30 double precision cores
          • 240 single precision cores
  • 18. Tesla C1060 memory (from: M A Martinez-del-Amor et al. (2008) based on E Lindholm et al. (2008))
  • 19. Programming GPUs – CUDA and OpenCL
      • CUDA (Compute Unified Device Architecture, developed by NVIDIA)
          • Extension of C to enable programming of GPU devices
          • Allows easy management of parallel threads executing on GPU
          • Handles communication with ‘host’ CPU
      • OpenCL
          • Standard language for multi-device programming
          • Not tied to a particular company
          • Will open up GPU computing
          • Incorporates elements of CUDA
  • 20. First step – obtaining and installing CUDA
      • FREE download from http://www.nvidia.com/object/cuda_learn.html
      • See: Quickstart Guide
      • Require:
          • CUDA capable GPU – GeForce 8, 9, 200, Tesla, many Quadro
          • Recent version of NVIDIA driver
          • CUDA Toolkit – essential components to compile and build applications
          • CUDA SDK – example projects
      • Update environment variables (Linux default shown)
          • PATH /usr/local/cuda/bin
          • LD_LIBRARY_PATH /usr/local/cuda/lib
      • CUDA compiler nvcc works with gcc (Linux), MS VC++ (Windows)
  • 21. Host (CPU) – Device (GPU) Relationship
      • Application program initiated on Host (CPU)
      • Device ‘kernels’ execute on GPU in SIMT (Single Instruction Multiple Thread) manner
      • Host program
          • Transfers data from Host memory to Device (GPU) memory
          • Specifies number and organisation of threads on Device
          • Calls Device ‘kernel’ as a C function, passing parameters
          • Copies output from Device back to Host memory
  • 22. Organisation of threads on GPU
      • SM (Streaming Multiprocessor) manages up to 1024 threads
      • Each thread is identified by an index
      • Threads execute as Warps of 32 threads
      • Threads are grouped in blocks (user specifies number of threads per block)
      • Blocks make up a grid
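The thread organisation on slide 22 comes down to a little integer arithmetic, shown here as host-side C. The function names are hypothetical. The index formula mirrors the usual one-dimensional CUDA expression `blockIdx.x * blockDim.x + threadIdx.x`, which the slide itself does not spell out.

```c
#define WARP_SIZE 32u  /* threads per warp, as stated on the slide */

/* Number of blocks needed so that blocks * threads_per_block >= n. */
unsigned int blocks_needed(unsigned int n, unsigned int threads_per_block)
{
    return (n + threads_per_block - 1u) / threads_per_block;
}

/* Warps occupied by one block; a partly filled warp still counts. */
unsigned int warps_per_block(unsigned int threads_per_block)
{
    return (threads_per_block + WARP_SIZE - 1u) / WARP_SIZE;
}

/* Global index of a thread in a one-dimensional grid of
 * one-dimensional blocks. */
unsigned int global_thread_index(unsigned int block_idx,
                                 unsigned int block_dim,
                                 unsigned int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}
```

Choosing a block size that is a multiple of 32 avoids partly filled warps; for example, a block of 100 threads occupies four warps but leaves 28 lanes idle.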
  • 23. Memory hierarchy
      • On device can
          • Read/write per-thread
              • Registers
              • Local memory
          • Read/write per-block shared memory
          • Read/write per-grid global memory
          • Read only per-grid constant memory
      • On host (CPU) can
          • Read/write per-grid
              • Global memory
              • Constant memory
  • 24. CUDA terminology
      • ‘kernel’ – C function executing on the GPU
      • __global__ declares function as a kernel
          • Executed on the Device
          • Callable only from the Host
          • void return type
      • __device__ declares function that is
          • Executed on the Device
          • Callable only from the Device
  • 25. Application to Monte Carlo simulation
      • Monte Carlo paths lead to highly parallel algorithms
      • Applications in finance, e.g. simulation based on SDE (return on asset = drift + Brownian motion): $\frac{dS_t}{S_t} = \mu\,dt + \sigma\,dW_t$
      • Requires fast pseudorandom or quasi-random number generator
      • Additional techniques improve efficiency: Brownian Bridge, stratified sampling, …
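A low order (Euler) discretization of the SDE on slide 25 might look like the following C sketch. The function name is hypothetical, and the Brownian increments are passed in by the caller so that the example does not depend on any particular random number generator.

```c
#include <stddef.h>

/* Euler scheme for dS = mu S dt + sigma S dW:
 *     S_{k+1} = S_k + mu S_k dt + sigma S_k dW_k
 * dW holds n_steps Brownian increments, each distributed N(0, dt),
 * generated elsewhere. Returns the terminal value of the path. */
double euler_terminal_value(double s0, double mu, double sigma, double dt,
                            const double *dW, size_t n_steps)
{
    double s = s0;
    for (size_t k = 0; k < n_steps; k++)
        s += mu * s * dt + sigma * s * dW[k];
    return s;
}
```

On a GPU each thread would typically run this loop for its own path, drawing its increments from its own random number stream.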
  • 26. Random Number Generators: choice of algorithm
      • Must be highly parallel
      • Implementation must satisfy statistical tests of randomness
      • Some common generators do not guarantee randomness properties when split into parallel streams
      • A suitable choice: MRG32k3a (L’Ecuyer)
  • 27. MRG32k3a: skip ahead
      • Generator combines 2 recurrences:
          $x_{n,1} = (a_1 x_{n-2,1} - b_1 x_{n-3,1}) \bmod m_1$
          $x_{n,2} = (a_2 x_{n-1,2} - b_2 x_{n-3,2}) \bmod m_2$
      • Each recurrence of form (M Giles, note on implementation):
          $y_n = (x_n,\ x_{n-1},\ x_{n-2})^T$, $y_n = A\,y_{n-1}$
      • Precompute $A^p$ in $O(\log p)$ operations on CPU, $y_{n+p} = A^p y_n \bmod m$
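The skip-ahead idea on slide 27 can be sketched for the first recurrence in host-side C. This is not the NAG or Giles implementation. The constants are the published MRG32k3a values ($m_1 = 2^{32} - 209$, $a_1 = 1403580$, $b_1 = 810728$), the function names are hypothetical, and the negative coefficient is stored as $m_1 - b_1$ so that all arithmetic stays unsigned.

```c
#include <stdint.h>
#include <string.h>

#define M1 4294967087ULL  /* m_1 = 2^32 - 209 */
#define A1 1403580ULL     /* a_1 */
#define B1 810728ULL      /* b_1 */

/* c = a * b (mod M1); c must not alias a or b. Entries are below 2^32,
 * so each product fits in 64 bits before it is reduced. */
void mat_mul_mod(uint64_t a[3][3], uint64_t b[3][3], uint64_t c[3][3])
{
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            uint64_t s = 0;
            for (int k = 0; k < 3; k++)
                s = (s + (a[i][k] * b[k][j]) % M1) % M1;
            c[i][j] = s;
        }
}

/* out = A^p (mod M1) by repeated squaring: O(log p) matrix products. */
void mat_pow_mod(uint64_t A[3][3], uint64_t p, uint64_t out[3][3])
{
    uint64_t base[3][3], tmp[3][3];
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            base[i][j] = A[i][j];
            out[i][j] = (i == j) ? 1 : 0;
        }
    while (p > 0) {
        if (p & 1) {
            mat_mul_mod(out, base, tmp);
            memcpy(out, tmp, sizeof tmp);
        }
        mat_mul_mod(base, base, tmp);
        memcpy(base, tmp, sizeof tmp);
        p >>= 1;
    }
}

/* y <- A^p y (mod M1), with y = (x_n, x_{n-1}, x_{n-2}). */
void skip_ahead(uint64_t y[3], uint64_t p)
{
    uint64_t A[3][3] = { { 0, A1, M1 - B1 }, { 1, 0, 0 }, { 0, 1, 0 } };
    uint64_t Ap[3][3], r[3];
    mat_pow_mod(A, p, Ap);
    for (int i = 0; i < 3; i++) {
        uint64_t s = 0;
        for (int k = 0; k < 3; k++)
            s = (s + (Ap[i][k] * y[k]) % M1) % M1;
        r[i] = s;
    }
    memcpy(y, r, sizeof r);
}

/* One ordinary step of the recurrence, used as a reference. */
void step(uint64_t y[3])
{
    uint64_t next = ((A1 * y[1]) % M1 + ((M1 - B1) * y[2]) % M1) % M1;
    y[2] = y[1];
    y[1] = y[0];
    y[0] = next;
}

/* Returns 1 if skipping p steps at once matches p single steps. */
int skip_matches_stepping(uint64_t p)
{
    uint64_t a[3] = { 12345, 12345, 12345 };
    uint64_t b[3] = { 12345, 12345, 12345 };
    skip_ahead(a, p);
    for (uint64_t i = 0; i < p; i++)
        step(b);
    return a[0] == b[0] && a[1] == b[1] && a[2] == b[2];
}
```

With this in place, thread $t$ can start from the seed state advanced by $t \cdot N_p$ steps, so that the threads together reproduce one long serial sequence.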
  • 28. MRG32k3a: modulus
      • Combined and individual recurrences: $z_n = (x_{n,1} - x_{n,2}) \bmod m_1$
      • Can compute using double precision divide – slow
      • Use 64 bit integers (supported on GPU) – avoid divide
      • Bit shift – faster (used in CPU implementations)
      • Note: speed of different possibilities subject to change as NVIDIA updates floating point capability
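One way to take the modulus with 64 bit integers and shifts instead of a divide uses the fact that both moduli have the form $2^{32} - c$, so $2^{32} \equiv c \pmod m$. The C sketch below is an illustration under that assumption and is not necessarily how the NAG or Giles code does it. `mod_fold` and the other names are hypothetical, and the constants are the published MRG32k3a values.

```c
#include <stdint.h>

#define M1 4294967087ULL  /* 2^32 - 209   */
#define M2 4294944443ULL  /* 2^32 - 22853 */

/* x mod (2^32 - c) without a divide, valid for c <= 22853.
 * Since 2^32 = c (mod m), the high word can be folded into the low
 * word; two folds leave a value below 2m, so one conditional
 * subtraction finishes the reduction. */
uint64_t mod_fold(uint64_t x, uint64_t c)
{
    uint64_t m = 4294967296ULL - c;
    x = (x & 0xffffffffULL) + (x >> 32) * c;
    x = (x & 0xffffffffULL) + (x >> 32) * c;
    return x >= m ? x - m : x;
}

typedef struct {
    uint64_t x1[3];  /* x_{n-3,1}, x_{n-2,1}, x_{n-1,1} */
    uint64_t x2[3];  /* x_{n-3,2}, x_{n-2,2}, x_{n-1,2} */
} mrg32k3a_state;

/* Advance both recurrences and return z_n in [1, m1]. Negative
 * coefficients are applied as (m - b) so all arithmetic is unsigned. */
uint64_t mrg32k3a_next_int(mrg32k3a_state *s)
{
    uint64_t p1 = mod_fold(1403580ULL * s->x1[1], 209)
                + mod_fold((M1 - 810728ULL) * s->x1[0], 209);
    if (p1 >= M1) p1 -= M1;
    s->x1[0] = s->x1[1]; s->x1[1] = s->x1[2]; s->x1[2] = p1;

    uint64_t p2 = mod_fold(527612ULL * s->x2[2], 22853)
                + mod_fold((M2 - 1370589ULL) * s->x2[0], 22853);
    if (p2 >= M2) p2 -= M2;
    s->x2[0] = s->x2[1]; s->x2[1] = s->x2[2]; s->x2[2] = p2;

    return p1 > p2 ? p1 - p2 : p1 + M1 - p2;
}

/* Map z_n to a double in (0, 1). */
double mrg32k3a_to_double(uint64_t z)
{
    return (double)z * 2.328306549295727688e-10;  /* 1 / (m1 + 1) */
}

/* Convenience for checks: first integer output when all six state
 * values are set to `seed`. */
uint64_t mrg32k3a_first_int(uint64_t seed)
{
    mrg32k3a_state s = { { seed, seed, seed }, { seed, seed, seed } };
    return mrg32k3a_next_int(&s);
}
```

Working the first step by hand with all six state values set to 12345 gives $x_{n,1} = 3023790853$ and $x_{n,2} = 2478282264$, hence $z_n = 545508589$. A bit-exact check of this kind against a CPU reference is easy to automate.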
  • 29. MRG32k3a: memory coalescence
      • GPU performance limited by memory access
      • Require memory coalescence for fast transfer of data
      • Order RNG output to retain consecutive memory accesses
      • $x_{n,t,b}$, the $n$th element generated by thread $t$ in block $b$, is stored at $t + N_t n + N_t N_p b$
      • Sequential ordering: $n + N_p t + N_t N_p b$
      • $N_t$ = num threads, $N_p$ = num points per thread
      • (Implementation by M Giles)
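The two storage orderings on slide 29 can be compared directly in C. The function names are hypothetical; the formulas are the ones given on the slide.

```c
#include <stddef.h>

/* Coalesced layout: element n of thread t in block b. For a fixed n,
 * neighbouring threads t and t + 1 write to neighbouring addresses. */
size_t coalesced_index(size_t n, size_t t, size_t b, size_t Nt, size_t Np)
{
    return t + Nt * n + Nt * Np * b;
}

/* Sequential layout: each thread owns a contiguous run of Np elements,
 * so neighbouring threads write Np elements apart. */
size_t sequential_index(size_t n, size_t t, size_t b, size_t Nt, size_t Np)
{
    return n + Np * t + Nt * Np * b;
}
```

With $N_t = 64$ and $N_p = 10$, threads 0 and 1 of a block write their first elements to offsets 0 and 1 in the coalesced layout, but to offsets 0 and 10 in the sequential layout.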
  • 30. MRG32k3a: single – double precision
      • L’Ecuyer’s example implementation in double precision floating point
      • Double precision on high end GPUs – but arithmetic operations much slower in execution than single precision
      • GPU implementation in integers – final output cast to double
      • Note: output to single precision gives sequence that does not pass randomness tests
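One hazard of single precision output can be shown in a few lines of C. A `float` carries only 24 significant bits, so a value that is strictly below 1 in double precision can round to exactly `1.0f`. This sketch shows that effect only; the slides do not say whether it is the reason the single precision sequence fails the tests. The names are hypothetical, and the normalisation constant is the published $1/(m_1 + 1)$.

```c
#include <stdint.h>

#define MRG_NORM 2.328306549295727688e-10  /* 1 / (m1 + 1), m1 = 2^32 - 209 */

/* Integer generator output z in [1, m1] mapped to (0, 1) in double. */
double to_double(uint64_t z)
{
    return (double)z * MRG_NORM;
}

/* The same value rounded to single precision. */
float to_single(uint64_t z)
{
    return (float)to_double(z);
}

/* Returns 1 if the double value is strictly below 1 but the rounded
 * single precision value is exactly 1. */
int single_hits_one(uint64_t z)
{
    return to_double(z) < 1.0 && to_single(z) == 1.0f;
}
```

For the largest integer output, `z = 4294967087`, the double result is about $1 - 2.3 \times 10^{-10}$ while the float result is exactly 1, which falls outside the open interval (0, 1).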
  • 31. MRG32k3a: GPU benchmarking – double precision
      • GPU – NVIDIA Tesla C1060
      • CPU – serial version of integer implementation running on single core of quad-core Xeon
      • VSL – Intel Library MRG32k3a
      • ICC – Intel C/C++ compiler
      • VC++ – Microsoft Visual C++

                       GPU        CPU-ICC    CPU-VC++   VSL-ICC    VSL-VC++
        Samples/sec    3.00E+09   3.46E+07   4.77E+07   9.35E+07   9.32E+07

  • 32. MRG32k3a: GPU benchmarking – single precision
      • Note: for double precision all sequences were identical
      • For single precision
          • GPU and CPU identical
          • GPU and VSL differ, max abs err 5.96E-008
          • Which output preferred? Use statistical tests of randomness

                       GPU        CPU-ICC    CPU-VC++   VSL-ICC    VSL-VC++
        Samples/sec    3.49E+09   3.58E+07   5.24E+07   1.02E+08   9.75E+07

  • 33. LIBOR Market Model on GPU
      • Equally weighted portfolio of 15 swaptions, each with same maturity but different lengths and different strikes
  • 34. Numerical Libraries for GPUs
      • The problem
          • The time-consuming work of writing basic numerical components should not be repeated
          • The general user should not need to spend many days writing each application
      • The solution
          • Standard numerical components should be available as libraries for GPUs
  • 35. NAG Routines for GPUs
  • 39. Example program: generate random numbers on GPU
        ...
        // Allocate memory on Host
        host_array = (double *)calloc(N, sizeof(double));
        // Allocate memory on GPU
        cudaMalloc((void **)&device_array, sizeof(double)*N);
        // Call GPU functions
        // Initialise random number generator
        nag_gpu_mrg32k3a_init(V1, V2, offset);
        // Generate random numbers
        nag_gpu_mrg32k3a_uniform(nb, nt, np, device_array);
        // Read back GPU results to host
        cudaMemcpy(host_array, device_array, sizeof(double)*N, cudaMemcpyDeviceToHost);
        ...
  • 40. NAG Routines for GPUs
  • 43. Example program – kernel function
        __global__ void mrg32k3a_kernel(int np, FP *d_P){
          unsigned int v1[3], v2[3];
          int n, i0;
          FP x, x2 = nanf("");
          // initialisation for first point
          nag_gpu_mrg32k3a_stream_init(v1, v2, np);
          // now do points
          i0 = threadIdx.x + np*blockDim.x*blockIdx.x;
          for (n=0; n<np; n++) {
            nag_gpu_mrg32k3a_next_uniform(v1, v2, x);
            // store every point; the stride of blockDim.x keeps the
            // writes of neighbouring threads on consecutive addresses
            d_P[i0] = x;
            i0 += blockDim.x;
          }
        }
  • 44. Library issues: Auto-tuning
      • Performance affected by mapping of algorithm to GPU via threads, blocks and warps
      • Implement a code generator to produce variants using the relevant parameters
      • Determine optimal performance
      • Li, Dongarra & Tomov (2009)
  • 45. Early Success with BNP Paribas
      • Working with Fixed Income Research & Strategies Team (FIRST)
      • NAG mrg32k3a works well in BNP Paribas CUDA “Local Vol Monte-Carlo”
          • Passes rigorous statistical tests for randomness properties (Diehard, Dieharder, TestU01)
          • Performance good
          • Being able to match the GPU random numbers with the CPU version of mrg32k3a has been very valuable for establishing validity of output
  • 46. BNP Paribas Results – local vol example
  • 47. And with Bank of America Merrill Lynch
      • “The NAG GPU libraries are helping us enormously by providing us with fast, good quality algorithms. This has let us concentrate on our models and deliver GPGPU based pricing much more quickly.”
  • 48. “A N Other Tier 1” Risk Group
      • “Thank you for the GPU code, we have achieved speed ups of x120”
      • In a simple uncorrelated loss simulation:
          • Number of simulations: 50,000
          • Time taken in seconds: 2.373606
          • Simulations per second: 21065
          • Simulated default rate: 311.8472
          • Theoretical default rate: 311.9125
      • 24 trillion numbers in 6 hours
  • 49. NAG routines for GPUs – 1
      • Currently available
          • Random Number Generator (L’Ecuyer mrg32k3a)
              • Uniform distribution
              • Normal distribution
              • Exponential distribution
              • Gamma distribution
          • Sobol sequence for Quasi-Monte Carlo (to 19,000 dimensions)
          • Brownian Bridge
  • 50. NAG routines for GPUs – 2
      • Future plans
          • Random Number Generator – Mersenne Twister
          • Linear algebra components for PDE option pricing methods
          • Time series analysis – wavelets ...
  • 51. Summary
      • GPUs offer high performance computing for specific massively parallel algorithms such as Monte Carlo simulations
      • GPUs are lower cost and require less power than corresponding CPU configurations
      • Numerical libraries for GPUs will make these an important computing resource
      • Higher order methods for GPUs being considered
  • 52. Acknowledgments
      • Mike Giles (Mathematical Institute, University of Oxford) – algorithmic input
      • Technology Strategy Board through Knowledge Transfer Partnership with Smith Institute
      • NVIDIA for technical support and supply of Tesla C1060 and Quadro FX 5800
      • See www.nag.co.uk/numeric/GPUs/
      • Contact: francois.cassier@nag.com