This document discusses parallel random number generation techniques. It reviews serial random number generators like linear congruential generators and lagged Fibonacci generators. For parallel generation, it describes methods like leapfrogging where each thread independently generates a subset of the sequence, and sequence splitting where the serial sequence is divided among threads. Cryptographic hashing of incremental inputs is also proposed as a parallel-friendly approach that generates independent and high-quality random streams for each thread.
Generating random numbers in a highly parallel program is surprisingly non-trivial. Many good generators carry a lot of state and are purely serial. Simple generators such as LCGs can leapfrog ahead, but they are of limited quality and the result depends on the number of cores. We want our code to be independent of the degree of parallelism.
6. Serial RNG: LCG
– Linear congruential generator (LCG)
– X_n = a · X_{n−1} + c mod M
– a, c and M must be chosen carefully!
– Never choose M = 2^31; it should be a prime
– Park & Miller: a = 16807, M = 2147483647 = 2^31 − 1. M is a Mersenne prime!
– Most likely the one in your C runtime
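As a sketch, the Park & Miller "minimal standard" LCG above can be written directly. The constants a = 16807 and M = 2^31 − 1 are from the slide; the seed and stream length are illustrative choices.

```python
# Park & Miller "minimal standard" LCG: X_n = a * X_{n-1} mod M,
# with a = 16807 and M = 2^31 - 1 (a Mersenne prime), c = 0.
A = 16807
M = 2**31 - 1  # 2147483647

def lcg_stream(seed, count):
    """Yield `count` values of the multiplicative LCG starting from `seed`."""
    x = seed
    for _ in range(count):
        x = (A * x) % M
        yield x

# Park & Miller's own sanity check: starting from seed 1,
# the 10000th generated value is 1043618065.
x = 1
for _ in range(10000):
    x = (A * x) % M
```

The sanity check comes from the Park & Miller paper cited in the references and is a quick way to verify any reimplementation.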
7. LCG: the good and bad
– Good:
  – Simple and efficient, even if we use mod
  – Single word of state
– Bad:
  – Short period: at most M
  – Low bits are correlated, especially if M = 2^k
  – Purely serial
9. Mersenne prime modulo
– IDIV can be 40–80 cycles for 32b/32b
– x mod M where M = 2^s − 1:
  – x = (x & M) + (x >> s);
  – ret x ≥ M ? x − M : x;
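The division-free reduction above can be checked directly against the `%` operator. This is a small sketch of the slide's two-step trick; it is valid whenever the input is at most M², which covers the product of two reduced LCG values.

```python
def mod_mersenne(x, s):
    """Compute x mod (2^s - 1) without division.

    Folding works because 2^s = 1 (mod 2^s - 1), so
    x = hi * 2^s + lo = hi + lo (mod 2^s - 1).
    One fold plus one conditional subtract suffices for x <= (2^s - 1)^2.
    """
    m = (1 << s) - 1
    x = (x & m) + (x >> s)       # fold the high bits onto the low bits
    return x - m if x >= m else x
```

After one fold the value is below 2M, so a single conditional subtract finishes the reduction; no IDIV is ever issued.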
10. Lagged-Fibonacci Generator
– X_n = X_{n−p} ⊙ X_{n−q}; p and q are the lags
– ⊙ is +, −, or * mod M (or XOR)
– ALFG: x_n = x_{n−p} + x_{n−q} (mod 2^k)
– * gives the best quality
– Period = (2^p − 1) · 2^{k−3}; M = 2^k
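A minimal additive lagged-Fibonacci (ALFG) sketch of the recurrence above. The lags (p, q) = (55, 24), the word size k = 32, and the seeding scheme are illustrative assumptions, not from the slides.

```python
from collections import deque

P, Q, K = 55, 24, 32          # lags and word size (illustrative choices)
MASK = (1 << K) - 1           # power-of-2 modulus: mod 2^k is a cheap mask

def alfg(seed_state, count):
    """ALFG: x_n = x_{n-P} + x_{n-Q} (mod 2^K).

    `seed_state` must hold the last P values, oldest first.
    """
    assert len(seed_state) == P
    state = deque(seed_state, maxlen=P)   # ring buffer of the last P values
    out = []
    for _ in range(count):
        x = (state[-P] + state[-Q]) & MASK   # 2 ops + power-of-2 mod
        state.append(x)                      # maxlen drops the oldest entry
        out.append(x)
    return out

# Seed with any P not-all-even values; a real implementation would
# seed this state from another generator.
seed = [(i * 2654435761) & MASK | 1 for i in range(P)]
stream = alfg(seed, 10)
```

Note the cost per value: one add and one mask, but the state is max(p, q) words, which is the drawback the next slides point out.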
11. LFG
– The good:
  – Very efficient: 2 ops + power-of-2 mod
  – Much longer period than LCG
  – Works directly in floats
  – Higher quality than LCG
  – ALFG can skip ahead
12. LFG – the bad
– Need to store max(p, q) floats
– Purely sequential
– Multiplicative LFG can't jump ahead
13. Mersenne Twister
– Gold standard?
– Large state (624 ints)
– Lots of flops
– Hard to leapfrog
– Limited parallelism
[figure: power spectrum]
15. Parallel RNG
– Maintain the RNG's quality
– Same result regardless of the # of cores
– Minimal state, especially for GPUs
– Minimal correlation among the streams
16. Random Tree
• 2 LCGs with different a
• L is used to generate a seed for R
• No need to know how many generators or # of values per thread
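A sketch of the random-tree idea: a left generator L hands out one seed per thread, and each thread runs its own right generator R from that seed. The two multipliers below are illustrative constants (both are well-known LCG multipliers), not taken from the slides.

```python
M = 2**31 - 1
A_LEFT, A_RIGHT = 48271, 16807   # two different multipliers (illustrative)

def lcg(a, x):
    """One multiplicative LCG step."""
    return (a * x) % M

def random_tree(root_seed, num_threads, per_thread):
    """L seeds R: thread t's stream does not depend on how many threads exist."""
    streams = []
    s = root_seed
    for _ in range(num_threads):
        s = lcg(A_LEFT, s)        # left generator: one seed per thread
        x, stream = s, []
        for _ in range(per_thread):
            x = lcg(A_RIGHT, x)   # right generator: the thread's own sequence
            stream.append(x)
        streams.append(stream)
    return streams
```

Growing the thread count only draws more seeds from L; the existing streams are unchanged, which is exactly the "no need to know how many generators" property.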
17. Leapfrog with 3 cores
• Each thread leaps ahead by n using L
• Each thread uses its own R to generate its own sequence
• n = cores × sequences
18. Leapfrog
– Basic LCG without c:
  – L_{n+1} = a · L_n mod M
  – X_{i+n} = a^n · X_i mod M
– Full LCG: A = a^n and C = c(a^n − 1)/(a − 1);
  each core jumps ahead by n (# of cores)
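The leapfrog constants can be verified numerically. The small LCG parameters below are arbitrary illustrative choices; C is computed as c·(1 + a + … + a^{n−1}) mod M, which equals c(a^n − 1)/(a − 1) by the geometric series.

```python
a, c, M = 1103515245, 12345, 2**31   # illustrative LCG constants
n = 3                                 # number of cores

# Leapfrog constants: A = a^n, C = c * (a^(n-1) + ... + a + 1) mod M
A = pow(a, n, M)
C = c * sum(pow(a, i, M) for i in range(n)) % M

def serial(x0, count):
    """Reference serial LCG: X_{i+1} = a * X_i + c mod M."""
    out, x = [], x0
    for _ in range(count):
        x = (a * x + c) % M
        out.append(x)
    return out

ref = serial(1, 12)

# Core t starts from the t-th serial value and strides by n using (A, C).
leapfrogged = []
for t in range(n):
    x = ref[t]
    stream = [x]
    for _ in range(3):
        x = (A * x + C) % M
        stream.append(x)
    leapfrogged.append(stream)
```

Interleaving the three per-core streams reproduces the serial sequence exactly, which is the whole point of leapfrogging.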
19. Leapfrog with 3 cores
• The sequences will not overlap
• The final sequence is the same as the serial code
20. Leapfrog – the good
– Same sequence as the serial code
– Limited choice of RNG (e.g. no MLFG)
– No need to fix the # of random values used per core (but need to fix "n")
21. Leapfrog – the bad
– a^n no longer has the good qualities of a
– Power-of-2 n produces correlated subsequences
– Need to fix "n", the # of generators/sequences
– The period of the original RNG is shortened by a factor of "n"; a 32-bit LCG has a short period to start with
22. Sequence Splitting
• If we know the # of values per thread, m:
  • L_{n+1} = a^m · L_n mod M
  • X_{n+1} = a · X_n mod M
• The sequence is a subset of the serial code
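Sequence splitting can be sketched the same way, assuming the per-thread count m is known up front. The splitter L jumps by a^m to hand each thread a seed, and each thread then runs the plain LCG; the constants are the Park & Miller values from the earlier slide.

```python
a, M = 16807, 2**31 - 1    # Park & Miller constants
m = 4                       # values per thread: must be known up front
threads = 3

def serial(x0, count):
    """Plain multiplicative LCG, as the serial code would run it."""
    out, x = [], x0
    for _ in range(count):
        x = (a * x) % M
        out.append(x)
    return out

ref = serial(7, threads * m)       # serial reference sequence

Am = pow(a, m, M)                  # the splitter jumps m steps at once
blocks, seed = [], 7
for t in range(threads):
    blocks.append(serial(seed, m)) # each thread runs the plain LCG
    seed = (Am * seed) % M         # next thread's seed: m steps further
```

Each thread produces one contiguous block of the serial stream, so the concatenated blocks match the serial output, but only as long as no thread exceeds its m values.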
23. Leapfrog and Splitting
– Only guarantee that the sequences are non-overlapping; nothing about their quality
– Not invariant to the degree of parallelism
  – Results change when the # of cores changes
  – Serial and parallel code do not match
24. Lagged-Fibonacci Leapfrog
– LFG has a very long period
  – Period = (2^p − 1) · 2^{k−3}; M = 2^k
– n can be a power of two!
– Much better quality than LCG
– No leapfrog for the best variant, "*"
– Luckily the ALFG supports leapfrogging
25. Issues with Leapfrog & Splitting
– LCG's period gets even shorter
– Questionable quality
– ALFG is much better, but has to store more state for the "lag"
27. Core Idea
1. Input is trivially prepared in parallel, e.g. a linear ramp
2. Each input value is fed into a hash, independently and in parallel
3. The output is white noise
[figure: input → hash → output]
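The ramp-into-hash idea can be sketched with any good integer hash; the splitmix64 finalizer below stands in for the talk's TEA-based hash (the hash choice is mine, not the slides').

```python
MASK64 = (1 << 64) - 1

def splitmix64_hash(x):
    """Integer finalizer from splitmix64: maps a counter to white noise."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)

# Step 1: the input is a linear ramp, trivially built in parallel.
# Steps 2-3: each value is hashed independently; any thread can compute
# element i without touching any other element.
noise = [splitmix64_hash(i) for i in range(8)]
```

Because output i depends only on i, the stream is identical no matter how the work is split across threads, which is exactly the invariance the earlier "Parallel RNG" slide asks for.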
28. TEA
– A Feistel cipher
– Input is split into L and R
– 128-bit key
– F: shifts and XORs or adds
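A compact sketch of TEA itself, using the standard published round structure and delta constant; the key, plaintext, and round-trip check are illustrative. Hashing a counter pair through `tea_encrypt` yields one independent 64-bit sample per input, as on the previous slide.

```python
DELTA = 0x9E3779B9          # TEA's published constant, floor(2^32 / golden ratio)
MASK32 = (1 << 32) - 1

def tea_encrypt(v0, v1, key, rounds=32):
    """TEA: the 64-bit input is split into halves (v0, v1); the 128-bit key
    is four 32-bit words. Each round mixes shifts, adds and XORs."""
    k0, k1, k2, k3 = key
    s = 0
    for _ in range(rounds):
        s = (s + DELTA) & MASK32
        v0 = (v0 + (((v1 << 4) + k0) ^ (v1 + s) ^ ((v1 >> 5) + k1))) & MASK32
        v1 = (v1 + (((v0 << 4) + k2) ^ (v0 + s) ^ ((v0 >> 5) + k3))) & MASK32
    return v0, v1

def tea_decrypt(v0, v1, key, rounds=32):
    """Inverse of tea_encrypt: runs the Feistel rounds backwards."""
    k0, k1, k2, k3 = key
    s = (DELTA * rounds) & MASK32
    for _ in range(rounds):
        v1 = (v1 - (((v0 << 4) + k2) ^ (v0 + s) ^ ((v0 >> 5) + k3))) & MASK32
        v0 = (v0 - (((v1 << 4) + k0) ^ (v1 + s) ^ ((v1 >> 5) + k1))) & MASK32
        s = (s - DELTA) & MASK32
    return v0, v1
```

For random number generation only the forward direction is needed: feed (thread_id, counter) as (v0, v1) and take the output as two 32-bit random words.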
33. References
– [Mascagni 99] Some Methods for Parallel Pseudorandom Number Generation, 1999.
– [Park & Miller 88] Random Number Generators: Good Ones Are Hard to Find, CACM, 1988.
– [Pryor 94] Implementation of a Portable and Reproducible Parallel Pseudorandom Number Generator, SC, 1994.
– [Tzeng & Li 08] Parallel White Noise Generation on a GPU via Cryptographic Hash, I3D, 2008.
– [Wheeler 95] TEA, a Tiny Encryption Algorithm, 1995.
34. Take-aways
– Look beyond LCG
– ALFG is worth a closer look
– Crypto-based hashing is the most promising, especially TEA