Parallel Random
Generator
Manny Ko
Principal Engineer
Activision
Outline
●Serial RNG
●Background
●LCG, LFG, crypto-hash
●Parallel RNG
●Leapfrog, splitting, crypto-hash
RNG - desiderata
● White noise like
● Repeatable for any # of cores
● Fast
● Small storage
RNG Quality
● DIEHARD
● Spectral test
● SmallCrush
● BigCrush
GPUBBS
Power Spectrum
(figures: power spectrum density, radial mean, radial variance)
Serial RNG: LCG
● Linear-congruential (LCG)
● X_i = (a * X_{i-1} + c) mod M
● a, c, and M must be chosen carefully!
● Never choose M = 2^31! M should be a prime
● Park & Miller: a = 16807, m = 2147483647 = 2^31 - 1. m is a Mersenne prime!
● Most likely the one in your C runtime
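As a concrete reference, the Park & Miller "minimal standard" generator fits in a few lines of C (function and variable names are mine, not from the talk):

```c
#include <stdint.h>

/* Park-Miller LCG: X_{i+1} = a * X_i mod m, with a = 16807,
   m = 2^31 - 1 (a Mersenne prime), and c = 0.
   The 64-bit intermediate avoids overflow; seed must be in [1, m-1]. */
static uint32_t pm_state = 1;

uint32_t pm_next(void) {
    const uint64_t a = 16807u;
    const uint64_t m = 2147483647u;   /* 2^31 - 1 */
    pm_state = (uint32_t)((a * pm_state) % m);
    return pm_state;
}
```

From seed 1 the first two outputs are 16807 and 16807^2 = 282475249.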
LCG: the good and bad
● Good:
● Simple and efficient even if we use mod
● Single word of state
● Bad:
● Short period – at most m
● Low bits are correlated, especially if m = 2^n
● Pure serial
LCG - bad
● 𝑋 π‘˜_+1 = (3 βˆ— 𝑋 π‘˜+4) π‘šπ‘œπ‘‘ 8
● {1,7,1,7, … }
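The collapse into a 2-cycle is easy to verify in one line of C (helper name is mine):

```c
/* The degenerate LCG X_{k+1} = (3*X_k + 4) mod 8 from the slide:
   starting at X_0 = 1 it falls into the 2-cycle {1, 7, 1, 7, ...}. */
unsigned bad_lcg_next(unsigned x) {
    return (3u * x + 4u) % 8u;
}
```

3*1 + 4 = 7, then 3*7 + 4 = 25 ≡ 1 (mod 8), and so on forever.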
Mersenne Prime modulo
● IDIV can be 40~80 cycles for 32b/32b
● π‘˜ π‘šπ‘œπ‘‘ 𝑝 where 𝑝 = 2 𝑠 βˆ’ 1:
● 𝑖 = π‘˜ & 𝑝 + π‘˜ ≫ 𝑠 ;
● π‘Ÿπ‘’π‘‘ 𝑖 β‰₯ 𝑝 ? 𝑖 βˆ’ 𝑝 ∢ 𝑖;
Lagged-Fibonacci Generator
● 𝑋𝑖 = π‘‹π‘–βˆ’π‘ βˆ— π‘‹π‘–βˆ’π‘ž; p and q are the lags
● βˆ— is =-* mod M (or XOR);
● ALFG: 𝑋 𝑛 = 𝑋 π‘›βˆ’π‘— + 𝑋 π‘›βˆ’π‘˜(π‘šπ‘œπ‘‘ 2 π‘š)
● * give best quality
● Period = 2 𝑝 βˆ’ 1 2 π‘βˆ’3; 𝑀 = 2 𝑏
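A minimal ALFG sketch in C, using the common lags (j, k) = (5, 17) as an illustrative choice; the seeding scheme and names are mine, not from the talk:

```c
#include <stdint.h>

/* Additive lagged-Fibonacci generator: X_n = X_{n-j} + X_{n-k} (mod 2^32).
   The lag table holds the last k values in a ring; at least one entry
   should be odd for a full-period additive LFG. */
#define LAG_J 5
#define LAG_K 17

typedef struct { uint32_t lag[LAG_K]; int pos; } alfg_t;

void alfg_seed(alfg_t *g, uint32_t seed) {
    g->pos = 0;
    for (int i = 0; i < LAG_K; i++)        /* fill lags with a helper LCG */
        g->lag[i] = seed = seed * 1664525u + 1013904223u;
    g->lag[0] |= 1u;                       /* ensure at least one odd entry */
}

uint32_t alfg_next(alfg_t *g) {
    int i = g->pos;                        /* holds X_{n-k}, the oldest */
    int j = (i + LAG_K - LAG_J) % LAG_K;   /* holds X_{n-j} */
    uint32_t x = g->lag[i] + g->lag[j];    /* mod 2^32 is free on uint32_t */
    g->lag[i] = x;
    g->pos = (i + 1) % LAG_K;
    return x;
}
```

Note the power-of-two modulus costs nothing: unsigned 32-bit wraparound is the mod.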
LFG
● The good:
●Very efficient: 2 ops + power-of-2 mod
●Much longer period than LCG
●Directly works in floats
●Higher quality than LCG
●ALFG can skip ahead
LFG – the bad
● Need to store max(p,q) floats
● Purely sequential
● Multiplicative LFG can't jump ahead
Mersenne Twister
● Gold standard?
● Large state (624 ints)
● Lots of flops
● Hard to leapfrog
● Limited parallelism
(figure: power spectrum)
● End of Basic RNG Overview
Parallel RNG
● Maintain the RNG’s quality
● Same result regardless of the # of cores
● Minimal state, especially for the GPU
● Minimal correlation among the streams.
Random Tree
• 2 LCGs with different a
• L used to generate a seed for R
• No need to know how many generators or how many values per thread
Leapfrog with 3 cores
• Each thread leaps ahead by N using L
• Each thread uses its own R to generate its own sequence
• N = cores * seq_per_core
Leapfrog
● basic LCG without c:
● 𝐿 π‘˜+1 = π‘ŽπΏ π‘˜ π‘šπ‘œπ‘‘ π‘š
● 𝑅 π‘˜+1 = π‘Ž 𝑛 𝑅 π‘˜ π‘šπ‘œπ‘‘ π‘š
● LCG: 𝐴 = π‘Ž 𝑛and 𝐢 = 𝑐(π‘Ž 𝑛 βˆ’ 1)/(π‘Ž βˆ’ 1) –
each core jumps ahead by n (# of cores)
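The jump-ahead constants A and C can be built by simply composing the affine step map n times, which sidesteps the modular division in C = c(a^n - 1)/(a - 1). A sketch in C with a power-of-two modulus so uint32_t wraparound supplies the mod; the Numerical Recipes constants are used purely as an example:

```c
#include <stdint.h>

/* Jump-ahead for the full LCG X_{i+1} = a*X_i + c (mod 2^32):
   n applications of x -> a*x + c compose into x -> A*x + C,
   with A = a^n and C = c*(a^{n-1} + ... + a + 1). */
typedef struct { uint32_t A, C; } affine_t;

affine_t lcg_jump(uint32_t a, uint32_t c, unsigned n) {
    affine_t f = { 1u, 0u };               /* identity map */
    for (unsigned i = 0; i < n; i++) {     /* f := step o f */
        f.A = a * f.A;                     /* mod 2^32 is implicit */
        f.C = a * f.C + c;
    }
    return f;                              /* use square-and-multiply for big n */
}

uint32_t lcg_apply(affine_t f, uint32_t x) { return f.A * x + f.C; }
```

Applying the composed map once must equal n serial steps, which is easy to check.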
Leapfrog with 3 cores
• Each sequence will not overlap
• Final sequence is the same as the serial code
Leapfrog – the good
● Same sequence as serial code
● Limited choice of RNG (e.g. no MLFG)
● No need to fix the # of random values used per core (but need to fix 'n')
Leapfrog – the bad
● π‘Ž 𝑝no longer have the good qualities of π‘Ž
● power-of-2 N produce correlated sub-
sequences
● Need to fix β€˜n’ - # of generators/sequences
● the period of the original RNG is shorten by a
factor of β€˜n’. 32 bit LCG has a short period to
start with.
Sequence Splitting
• If we know the # of values per thread n
• L_{k+1} = a^n * L_k mod m
• R_{k+1} = a * R_k mod m
• The sequence is a subset of the serial code
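For splitting, thread t of T threads starts from X_{t*n} = a^(t*n) * X_0 and then steps with plain a. A sketch of computing each thread's starting seed with modular square-and-multiply, for a multiplicative LCG mod 2^31 - 1 (helper names are mine):

```c
#include <stdint.h>

static const uint64_t M31 = 2147483647u;   /* 2^31 - 1 */

/* a^e mod m by square-and-multiply: O(log e) instead of O(e) steps. */
uint64_t pow_mod(uint64_t a, uint64_t e, uint64_t m) {
    uint64_t r = 1;
    a %= m;
    while (e) {
        if (e & 1) r = r * a % m;
        a = a * a % m;
        e >>= 1;
    }
    return r;
}

/* Starting seed for thread t when each thread owns a block of n values:
   X_{t*n} = a^(t*n) * X_0 mod (2^31 - 1). */
uint64_t split_seed(uint64_t x0, uint64_t a, unsigned t, unsigned n) {
    return x0 * pow_mod(a, (uint64_t)t * n, M31) % M31;
}
```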
Leapfrog and Splitting
● Only guarantees the sequences are non-overlapping; says nothing about their quality
● Not invariant to the degree of parallelism
● Results change when the # of cores changes
● Serial and parallel code do not match
Lagged-Fibonacci Leapfrog
● LFG has very long period
● Period = (2^p - 1) * 2^(b-3); M = 2^b
● M can be a power of two!
● Much better quality than LCG
● No leapfrog for the best variant – β€˜*’
● Luckily the ALFG supports leapfrogging
Issues with Leapfrog & Splitting
● LCG's period gets even shorter
● Questionable quality
● ALFG is much better but has to store more state, for the 'lag'
Crypto Hash
● MD5
● TEA: tiny encryption algorithm
Core Idea
1. input trivially prepared in parallel, e.g. a linear ramp
2. feed each input value into the hash, independently and in parallel
3. output is white noise
(diagram: input → hash → output)
TEA
● A Feistel cipher
● Input is split into L and R halves
● 128-bit key
● F: shifts plus XORs or adds
TEA
Magic β€˜delta’
● π‘‘π‘’π‘™π‘‘π‘Ž = 5 βˆ’ 1 231
● Avalanche in 6 cycles (often in 4)
● * mixes better than ^ but makes TEA
twice as slow
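For reference, the Wheeler & Needham encipher loop, plus a sketch of the white-noise usage from the Core Idea slide, where each thread enciphers its own counter; `tea_noise` and the key values are my naming and illustration, not from the talk:

```c
#include <stdint.h>

/* Reference TEA encipher: 64-bit block (v0, v1), 128-bit key k[4],
   32 Feistel cycles, delta = floor((sqrt(5)-1) * 2^31) = 0x9E3779B9. */
void tea_encrypt(uint32_t v[2], const uint32_t k[4]) {
    uint32_t v0 = v[0], v1 = v[1], sum = 0;
    const uint32_t delta = 0x9E3779B9u;
    for (int i = 0; i < 32; i++) {
        sum += delta;
        v0 += ((v1 << 4) + k[0]) ^ (v1 + sum) ^ ((v1 >> 5) + k[1]);
        v1 += ((v0 << 4) + k[2]) ^ (v0 + sum) ^ ((v0 >> 5) + k[3]);
    }
    v[0] = v0;
    v[1] = v1;
}

/* White-noise use: thread i enciphers the counter (i, 0) under a fixed
   key; the inputs are a linear ramp, the outputs look like noise.
   Since avalanche completes in ~6 cycles, fewer than 32 can suffice. */
void tea_noise(uint32_t i, const uint32_t k[4], uint32_t out[2]) {
    out[0] = i;
    out[1] = 0;
    tea_encrypt(out, k);
}
```

Because TEA is a bijection on 64-bit blocks for a fixed key, distinct counters are guaranteed to produce distinct outputs, so streams never collide.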
Applications
Fractal terrain (vertex shader)
Texture tiling (fragment shader)
SPRNG
● Good package by Michael Mascagni
● http://www.sprng.org/
References
● [Mascagni 99] Some Methods for Parallel Pseudorandom Number Generation, 1999.
● [Park & Miller 88] Random Number Generators: Good Ones Are Hard to Find, CACM, 1988.
● [Pryor 94] Implementation of a Portable and Reproducible Parallel Pseudorandom Number Generator, SC, 1994.
● [Tzeng & Li 08] Parallel White Noise Generation on a GPU via Cryptographic Hash, I3D, 2008.
● [Wheeler 95] TEA, a Tiny Encryption Algorithm, 1995.
Takeaways
● Look beyond LCG
● ALFG is worth a closer look
● Crypto-based hash is most promising –
especially TEA.