POLITECHNIKA WROCŁAWSKA
WYDZIAŁ INFORMATYKI I ZARZĄDZANIA
GPGPU driven simulations of zero-temperature 1D
Ising model with Glauber dynamics
Daniel Kosalla
FINAL THESIS
under the supervision of
Dr inż. Dariusz Konieczny
Wrocław 2013
8.3. Thread communication . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
8.4. Race conditions with shared memory . . . . . . . . . . . . . . . . . . . . 27
8.5. Thread per spin approach - reduction . . . . . . . . . . . . . . . . . . . . 27
8.6. Thread per spin approach - flags . . . . . . . . . . . . . . . . . . . . . . 29
8.7. Thread-per-spin performance . . . . . . . . . . . . . . . . . . . . . . . . 30
8.8. Thread-per-spin vs thread-per-simulation performance . . . . . . . . . . . 31
9. Bond density for some W0 values 34
10.Conclusions 36
11.Future work 36
Appendix 38
A. Sequential algorithm - CPU 39
B. Thread per simulation - no optimizations 43
C. Thread per simulation - static memory 48
D. Thread per spin - no optimizations 53
E. Thread per spin - parallel reduction 58
F. Thread per spin - update flag 63
1. Motivation
In the presence of recent developments in SCMs (Single Chain Magnets) [1–4], the
issue of criticality in 1D Ising-like magnet chains has turned out to be a promising field
of study [5–8]. Some practical applications have already been suggested [2]. Unfortunately,
the details of the general mechanism driving these changes in the real world are yet to be discovered.
Traditionally, Monte Carlo simulations of the Ising model were conducted on CPUs¹.
However, with the advent of powerful GPGPUs², a new trend in scientific computing
has emerged, enabling more detailed and even faster calculations.
2. Target
The following document describes the developed GPGPU applications capable of
producing insights into the underlying physical problem, examines different approaches
to conducting Monte Carlo simulations on GPGPUs, and compares the developed
parallel GPGPU algorithms with a sequential CPU-based approach.
3. Scope of work
The scope of this work includes the development of five parallel GPGPU algorithms,
namely:
• Thread-per-simulation algorithm
• Thread-per-simulation algorithm with static memory
• Thread-per-spin algorithm
• Thread-per-spin algorithm with flags
• Thread-per-spin algorithm with reduction
¹ CPU - Central Processing Unit
² GPGPU - General Purpose Graphics Processing Unit
4. Theoretical background and proposed model
4.1. Ising model
Although initially proposed by Wilhelm Lenz, it was Ernst Ising [10] who developed
a mathematical model for ferromagnetic phenomena. The Ising model is usually represented
by means of a lattice of spins: discrete variables {−1, 1} representing the magnetic dipole
moments of molecules in the material. The spins interact with their neighbours,
which may cause a phase transition of the whole lattice.
4.2. Historic methods
A Monte Carlo (MC) simulation of the Ising model consists of a sequence of lattice updates.
Traditionally, all spins (synchronous updating) or a single spin (sequential updating) are updated in each iteration,
producing the lattice state for future iterations. The update methods are based on
so-called dynamics that describe the spin interactions.
4.3. Updating
The idea of a partially synchronous updating scheme has been suggested [5–7]. This
c-synchronous mode updates a fixed fraction of spins in each time-step.
However, one can imagine that the number of updated spins/molecules (often referred
to as cL, where L denotes the size of the chain and c ∈ (0, 1]) changes as the simulation
progresses. If so, then it is either linked to some characteristics of the system or may
be expressed with some probability distribution (described in subsection 4.5). This
approach of changing the c parameter can be applied while choosing spins randomly as well
as in clusters (subsection 4.6), but only the latter will be considered in this document.
4.4. Simulations
In the proposed model, cL sequential updating is used, with c drawn from a given
distribution. The considered environment consists of a one-dimensional array of L spins
s_i = ±1. The index of each spin is denoted by i = 1, 2, . . . , L. Periodic boundary conditions
are assumed, i.e. s_{L+1} = s_1.
It has been shown in [8] that the system under synchronous Glauber dynamics
reaches one of two absorbing states: ferromagnetic or antiferromagnetic. Therefore,
let us introduce the density of bonds (ρ) as an order parameter:
ρ = Σ_{i=1}^{L} (1 − s_i s_{i+1}) / (2L)    (4.1)
As stated in [8], phase transitions in synchronous updating modes and in the c-sequential
mode [7] ought to be rather continuous (in cases different than c = 1 for the latter). A smooth
phase transition can be observed in Figure 4.1.
Figure 4.1. The average density of active bonds in the stationary state ⟨ρ_st⟩ as a function
of W0 for c = 0.9 and several lattice sizes L. Source: [7] B. Skorupa, K. Sznajd-Weron,
and R. Topolnicki, Phase diagram for a zero-temperature Glauber dynamics under
partially synchronous updates, Phys. Rev. E 86, 051113 (2012).
The system is considered at low temperature T, and therefore T = 0 can be
assumed. The Metropolis algorithm can be considered a special case of zero-temperature
Glauber dynamics for 1/2 spins. Each spin is flipped (s_i → −s_i) with rate W(δE) per
unit time. While T = 0:
          ⎧ 1   if δE < 0,
W(δE) =   ⎨ W0  if δE = 0,    (4.3)
          ⎩ 0   if δE > 0
In the case of T = 0, the ordering parameter W0 ∈ [0, 1] (e.g. the Glauber rate
W0 = 1/2 or the Metropolis rate W0 = 1) is assumed to be constant. One can imagine that
even the W0 parameter could in fact be changed during the simulation process, but that is out of
the scope of the proposed model.
The system starts in the fully ferromagnetic state (ρ = ρ_f = 0). After each time-step,
changes are applied to the system and the next time-step is evaluated. After a
predetermined number of time-steps, the state of the system is investigated. If the chain has
reached the antiferromagnetic state (ρ = ρ_af = 1), or a sufficiently large number of time-steps
has been inconclusive, the whole simulation is shut down.
4.5. Distributions
During the simulation, c will not be fixed in time but will rather vary within [0, 1]
according to a triangular continuous probability distribution [9], presented in Figure 4.2.
While studying different initial conditions for the simulations, the distributions are to be
adjusted in order to place the peak value anywhere in [0, 1]. This is due to the fact that
Figure 4.2. c can take any value in the interval [0, 1] but is most likely to be around
c = 1/2. Other values are possible, but their probabilities decrease with their
distance from c = 1/2.
the value of 0.5 (as presented in the plot) would mean that in each time-step half of
the spins get updated.
4.6. Updating
The following algorithms make use of the triangular probability distribution to assign
an appropriate c value before each time-step. One Monte Carlo step (MCS) corresponds
to (on average) L updated spins.
4.7. Algorithm
Transforming the above-mentioned rules into a set of instructions yields the
following description and pseudocode (below):
Update cL consecutive spins starting from a randomly chosen one. Each
change is saved to a new array rather than the old one. After each step, the
updated spins are saved and a new time-step can be started.
1. Assign a c value with the given distribution
2. Choose a random value of i ∈ [0, L − 1]
3. max = i + cL
4. s_i is the i-th spin
• if s_{i−1} = s_{i+1}:
– s′_i = s_{i+1} = s_{i−1}
• otherwise:
– Flip s_i with probability W0
5. if i < max
• i = i + 1
• Go to step 4
6. Stop
5. General Purpose Graphics Processing Units
5.1. History of General Purpose GPUs
Traditionally, in a desktop computer the GPU is a highly specialized electronic circuit
designed to robustly handle 2D and 3D graphics. In 1992 Silicon Graphics released the
OpenGL library, which was meant as a standardised, platform-independent interface
for writing 3D graphics. By the mid-1990s an increasing demand for 3D applications
had appeared in the consumer market.
It was NVIDIA who developed the GeForce 256 and branded it as "the world's first
GPU"³. The GeForce 256, although one of many graphics accelerators, was one that marked
a very rapid advance in the field, incorporating features such as transform and
lighting computations directly on the graphics processor. The release of GPUs capable of
handling programmable pipelines attracted researchers to explore the possibility of
using graphics processors outside their original use scheme. Although the early GPUs of the
early 2000s were programmable only in a way that enabled pixel manipulation, researchers
noticed that these manipulations could actually represent any kind of operations,
and the pixels could virtually represent any kind of data.
In late 2006 NVIDIA revealed the GeForce 8800 GTX, the first GPU built with the
CUDA architecture. The CUDA architecture enables the programmer to use every arithmetic
logic unit⁴ on the GPU (as opposed to the early days of GPGPU, when access to the ALUs
was granted only via the restricted and complicated interfaces of OpenGL and DirectX).
The new family of GPUs started with the 8800 GTX was built with IEEE-compliant
ALUs capable of single-precision floating-point arithmetic. Moreover, the new ALUs were
not only equipped with an extended set of instructions that could be used in general-purpose
computing, but also enabled arbitrary read and write operations to device memory.
A few months after the launch of the 8800 GTX, NVIDIA published a compiler that took
standard C extended with some additional keywords and transformed it into
fully featured GPU code capable of general-purpose processing. It is important to stress
that currently used CUDA C is by far easier to use than OpenGL/DirectX. Programmers
do not have to disguise their data as graphics and can use industry-standard C or even
other languages such as C#, Java or Python (via appropriate bindings).
CUDA is now used in various fields of science ranging from medical imaging and fluid
dynamics to environmental science, offering enormous, several-orders-of-magnitude
speedups⁵. GPUs are not only faster than CPUs in terms of computed data
³ http://www.nvidia.com/page/geforce256.html
⁴ ALU - Arithmetic Logic Unit
⁵ http://www.nvidia.com/object/cuda-apps-flash-new-changed.html
per unit time (e.g. FLOPS⁶), but also in terms of power and cost efficiency.
5.2. CUDA Architecture
The underlying architecture of CUDA is driven by design decisions connected with the
GPU's primary purpose, that is, graphics processing, which is usually a
highly parallel process. Therefore, the GPU also works in a parallel fashion. An important
distinction can be made between the logical and physical layers of the GPU architecture.
The programmer decomposes a computational problem into atomic processes (threads)
that can be executed simultaneously. This partition usually results in the creation of
hundreds, thousands or even millions of threads. For the programmer's convenience, threads
are organized into blocks, which in turn are part of grids. Both blocks and grids
are 3-dimensional structures. These spatial dimensions are introduced for easier problem
decomposition; as mentioned before, the GPU is meant for graphics processing, which is
usually related to processing 2D or 3D sets of data.
This grouping is associated not only with the logical decomposition of problems, but
also with the physical structure of the GPU. The basic unit of execution on the GPU is the
warp, which consists of 32 threads, each belonging to the same block. If the block
is bigger than the warp size, its threads are divided between several warps. Warps
are executed on execution units called Streaming Multiprocessors (SMs). Each
SM executes several warps (not necessarily from the same block). Physically, each SM
consists of 8 streaming processors (SPs, CUDA cores) and 32 "basic" ALUs. The 8 SPs spend
4 clock cycles executing the same processor instruction, enabling the 32 threads in a warp to
execute in parallel. Each of the threads in a warp can (and usually does) have different
data supplied to it, forming what is known as a SIMD⁷ architecture.
⁶ FLOPS - Floating Point Operations Per Second
⁷ SIMD - Single Instruction, Multiple Data
Figure 5.1. Grid of thread blocks. Source: http://docs.nvidia.com/cuda/cuda-c-programming-guide/
CUDA also provides a rich memory hierarchy available to every thread, and each of
the memory spaces has its own characteristics. The fastest and smallest is the
per-thread local (register-based) memory; unfortunately, it is out of reach of the CUDA
programmer and is used automatically. Each thread in a block can make use of shared memory,
which can be accessed by the different threads in the block and is usually the main medium
of inter-thread communication. The slowest memory spaces (but available to every thread)
are called global, constant and texture memory respectively; each of them has a different
size and purpose, but they are all persistent across kernel launches by the same application.
Figure 5.2. CUDA memory hierarchy. Source: http://docs.nvidia.com/cuda/cuda-c-programming-guide/
6. CPU Simulations
6.1. Sequential algorithm
The baseline for the presented algorithms is the sequential, CPU-based code. The
simulation itself is executed by the algorithm presented in Listing 1.
Listing 1. Sequential algorithm for CPU
while (monte_carlo_steps < MAX_MCS) {
    if (is_lattice_updated == FALSE && BOND_DENSITY(LATTICE) == 0.0) {
        // If lattice is in ferromagnetic state, simulation can stop
        break;
    }
    float C = TRIANGLE_DISTRIBUTION(MIU, SIGMA);
    first_i = (int)(LATTICE_SIZE * randomUniform());
    last_i = (int)(first_i + (C * LATTICE_SIZE));
    is_lattice_updated = FALSE;  // reset the update flag for this time-step
    for (int i = 0; i < LATTICE_SIZE; i++) {
        NEXT_STEP_LATTICE[i] = LATTICE[i];
        if ((first_i <= i && i <= last_i)
            || (last_i >= LATTICE_SIZE
                && (last_i % LATTICE_SIZE >= i || i >= first_i))
        ) {
            int left = MOD((i-1), LATTICE_SIZE);
            int right = MOD((i+1), LATTICE_SIZE);
            // Neighbours are the same and different than the current spin
            if (LATTICE[left] == LATTICE[right]) {
                NEXT_STEP_LATTICE[i] = LATTICE[left];
            }
            // Otherwise randomly flip the spin
            else if (W0 > randomUniform()) {
                NEXT_STEP_LATTICE[i] = FLIP_SPIN(LATTICE[i]);
            }
            lattice_update_counter++;
        }
        if (LATTICE[i] != NEXT_STEP_LATTICE[i]) {
            is_lattice_updated = TRUE;
        }
    }
    monte_carlo_steps = (int)(lattice_update_counter / LATTICE_SIZE);
    SWAP(LATTICE, NEXT_STEP_LATTICE, SWAP);
}
The code runs the simulation with the initial conditions MAX_MCS, LATTICE_SIZE and
LATTICE. LATTICE is an array initialized in the antiferromagnetic state (which can
be represented by a sequence of alternating ones and zeroes). To explore the solution
space (the combinations of W0, MIU and SIGMA), each simulation is run one after another.
C/C++'s % operator is in fact the remainder of the division and not the modulo operator
in the mathematical sense. The most prominent difference is that -1 % LATTICE_SIZE
== -1, whereas MOD(-1, LATTICE_SIZE) == LATTICE_SIZE-1. Therefore, while accessing
the current spin's neighbours, the MOD(x, N) macro is used (Listing 2).
Listing 2. Modulo function-like macro
#define MOD(x, N) ((((x) < 0) ? (((x) % (N)) + (N)) : (x)) % (N))
6.2. Random number generation on CPU
The CPU code uses the GSL⁸-based Mersenne Twister⁹. Usage of the GSL-supplied MT is
shown in Listing 3.
Listing 3. GSL’s Mersenne Twister setup
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <time.h>
#include <unistd.h>
// ...
const gsl_rng_type * T;
gsl_rng * r;
// ...
double randomUniform() {
    return gsl_rng_uniform(r);
}
// ...
int main(int argc, char *argv[]) {
    gsl_rng_env_setup();
    T = gsl_rng_mt19937;
    r = gsl_rng_alloc(T);
    long seed = time(NULL) * getpid();
    gsl_rng_set(r, seed);
    // simulation
    // randomUniform() calls
}
6.3. CPU performance
The tests of the CPU code were conducted on a quad-core AMD Phenom(tm) II X4 945
processor with 4 GB of RAM. The simulations occupied only one core at a time. The
results presented in Figure 6.1 will be used as a baseline for further comparisons (with
the respective MAX_MCS values).
⁸ GSL - GNU Scientific Library, http://www.gnu.org/software/gsl/
⁹ http://www.gnu.org/software/gsl/manual/html_node/Random-number-generator-algorithms.html
Figure 6.1. Execution times of the CPU simulations with MAX_MCS equal to 1 000 and 10 000.
Markers denote the arithmetic mean of the 5 averages conducted. The fitted
curves are 4th-degree polynomials.
7. GPU Simulations - thread per simulation
7.1. Thread per simulation
CUDA provides a C/C++-like language for executing code on the GPU (CUDA C).
The code is compiled by the CUDA compiler, which, via specific language extensions
(e.g. __device__, __host__), can distinguish the parts to be executed by the CPU (host),
the GPU (device), or both (__global__).
Listing 4. Thread per simulation algorithm
while (monte_carlo_steps < MAX_MCS) {
    if (is_lattice_updated == FALSE && BOND_DENSITY(LATTICE) == 0.0) {
        // stop when lattice is in ferromagnetic state
        break;
    }
    float C = TRIANGLE_DISTRIBUTION((X / (float)MAX_X), (Y / (float)MAX_Y));
    float W0 = Z / (float)MAX_Z;
    first_i = (int)(LATTICE_SIZE * RANDOM(&state[BLOCK_ID])) + THREAD_LATTICE_INDEX;
    last_i = (int)(first_i + (C * LATTICE_SIZE)) + THREAD_LATTICE_INDEX;
    is_lattice_updated = FALSE;  // reset the update flag for this time-step
    for (int i = THREAD_LATTICE_INDEX; i < LATTICE_SIZE + THREAD_LATTICE_INDEX; i++) {
        NEXT_STEP_LATTICE[i] = LATTICE[i];
        if ((first_i <= i && i <= last_i)
            || (last_i >= LATTICE_SIZE + THREAD_LATTICE_INDEX
                && (last_i % (LATTICE_SIZE + THREAD_LATTICE_INDEX) >= i || i >= first_i))
        ) {
            int left = MOD((i-1), LATTICE_SIZE) + THREAD_LATTICE_INDEX;
            int right = MOD((i+1), LATTICE_SIZE) + THREAD_LATTICE_INDEX;
            // If neighbours are the same
            if (LATTICE[left] == LATTICE[right]) {
                NEXT_STEP_LATTICE[i] = LATTICE[left];
            }
            // ... otherwise randomly flip the spin
            else if (W0 > RANDOM(&state[BLOCK_ID])) {
                NEXT_STEP_LATTICE[i] = FLIP_SPIN(LATTICE[i]);
            }
            lattice_update_counter++;
        }
        if (LATTICE[i] != NEXT_STEP_LATTICE[i]) {
            is_lattice_updated = TRUE;
        }
    }
    monte_carlo_steps = (int)(lattice_update_counter / LATTICE_SIZE);
    SWAP(LATTICE, NEXT_STEP_LATTICE, SWAP);
}
7.2. Running the simulation
In order for the CUDA compiler and then the GPU to execute the code correctly,
the programmer has to follow some conventions regarding program structure. For instance,
functions to be executed on the GPU have to be prefixed with the __global__ or
__device__ keyword. Moreover, a call to a GPU function has to be made with the
<<<gridDim, blockDim>>> syntax. The framework for executing code on the GPU is shown in Listing 5.
Listing 5. Exemplary foundation of GPU-executed code
// Imports
// Helper definitions etc.
__global__ void generate_kernel(
    curandStateMtgp32 *state,
    short * LATTICE,
    short * NEXT_STEP_LATTICE,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    // Code to be executed by GPU
    while (monte_carlo_steps < MAX_MCS) {
        if (is_lattice_updated == FALSE && BOND_DENSITY(LATTICE) == 0.0) {
            // stop when lattice is in ferromagnetic state
            break;
        }
        // Rest of the simulation code
    }
}
// ...
int main(int argc, char *argv[]) {
    // Initializations ...
    generate_kernel<<<gridDim, blockDim>>>(
        devMTGPStates,
        DEV_LATTICES,
        DEV_NEXT_STEP_LATTICES,
        DEV_MCS_NEEDED,
        DEV_BOND_DENSITY
    );
    // Obtaining results
    // Cleanup
}
7.3. Solution space
The important difference from the CPU version is the use of (X, Y, Z), which denote the
position of the thread in the logical structure provided by the CUDA architecture. Threads
are organized inside 3D structures called blocks and indexed using a "Cartesian" combination
of {x, y, z}; inside a kernel this index is referenced with threadIdx.{x,y,z}.
The grid is also a 3D structure, and, similarly, a block's position within it can be referenced
inside a kernel with blockIdx.{x,y,z}. This structuring is provided for the programmer's
convenience and is related to GPUs being devices meant for 2D and 3D graphics processing,
where such "Cartesian" decomposition is quite natural. Although blocks and grids are logical
structures, they are associated with the physical properties of GPUs. This fact can (and
should, whenever possible) be used for problem decomposition in order to optimize
runtime performance.
Here, (X, Y, Z) correspond to (MIU, SIGMA, W0), distributed over
(blockIdx.x, blockIdx.y, threadIdx.x). This was done in order to keep a relatively
small number of threads in each block (see subsection 7.4). With this convention, each
thread can calculate its own set of (MIU, SIGMA, W0) values. Listing 6 shows how
a thread can map its coordinates to the initial parameters of a simulation. For instance,
threads with blockIdx == (100,100,0) will be executing simulations for MIU=1.0
and SIGMA=0.5 if MIU_SIZE=100 and SIGMA_SIZE=200.
Listing 6. Simulation parameters computation for each thread
#define MIU_START 0.0
#define MIU_END 1.0
#define MIU_SIZE 10
#define SIGMA_START 0.0
#define SIGMA_END 1.0
#define SIGMA_SIZE 10
// ...
#define X blockIdx.x
#define Y blockIdx.y
#define Z threadIdx.x
#define MAX_X MIU_SIZE
#define MAX_Y SIGMA_SIZE
// ...
__global__ void generate_kernel(
    curandStateMtgp32 *state,
    short * LATTICE,
    short * NEXT_STEP_LATTICE,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    // ...
    float C = TRIANGLE_DISTRIBUTION(X / (float)MAX_X, Y / (float)MAX_Y);
    // ...
}
int main(int argc, char *argv[]) {
    dim3 blockDim(W0_SIZE, 1, 1);
    dim3 gridDim(MIU_SIZE, SIGMA_SIZE, 1);
    // ...
    generate_kernel<<<gridDim, blockDim>>>(
        // ...
    );
    // ...
}
7.4. Random Number Generators
An important part of every Monte Carlo simulation is randomness. For the
simulation to converge to the actual result, the quality of the Random Number Generator
(RNG) must be high. The de facto standard for scientific MC simulations is the Mersenne
Twister¹⁰ [13]. There is a version of MT19937 optimized for GPGPU usage¹¹ that was
included in CUDA as the cuRAND library¹². There are, however, some limitations of the
built-in MT19937:
• 1 MTGP state per block
• Up to 256 threads per state
• Up to 200 states using the included, pre-generated sequences
The MT is called with curand_uniform(state) and returns a floating-point number in
the range (0, 1]. The values are uniformly distributed over this range. To transform this
sequence of uniformly distributed numbers into a triangular distribution, a special function
(function-like macro) can be used (Listing 7).
Listing 7. Transformation of uniform- into triangle distribution
#define TRIANGLE_DISTRIBUTION(miu, sigma) ({ \
    float start = max(miu - sigma, 0.0); \
    float end = min(miu + sigma, 1.0); \
    float rand = ( \
        curand_uniform(&state[BLOCK_ID]) \
        + curand_uniform(&state[BLOCK_ID]) \
    ) / 2.0; \
    ((end - start) * rand) + start; \
})
7.5. Thread per simulation - static memory
In the algorithm presented in Listing 4 the memory usage is not optimized at all.
Memory is not only allocated in the global memory space, but, each time the program is run,
the host's memory also has to be allocated and copied to the device. Listing 8 shows the inefficient
memory allocations that occur in the thread-per-simulation algorithm from subsection 7.1.
Listing 8. Dynamic allocation of memory
__global__ void generate_kernel(
    curandStateMtgp32 *state,
    short * LATTICE,
    short * NEXT_STEP_LATTICE,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
)
// ...
short * DEV_LATTICES;
short * DEV_NEXT_STEP_LATTICES;
CUDA_CALL(cudaMalloc(
    &DEV_LATTICES,
    THREADS_NEEDED * sizeof(short) * LATTICE_SIZE
));
CUDA_CALL(cudaMalloc(
    &DEV_NEXT_STEP_LATTICES,
    THREADS_NEEDED * sizeof(short) * LATTICE_SIZE
));
// ...
generate_kernel<<<grid_size, block_size>>>(
    devMTGPStates,
    DEV_LATTICES,
    DEV_NEXT_STEP_LATTICES,
    DEV_MCS_NEEDED,
    DEV_BOND_DENSITY
);

¹⁰ MT19937, MT
¹¹ MTGP, http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MTGP/index.html
¹² http://docs.nvidia.com/cuda/curand/device-api-overview.html
If the memory is allocated inside the kernel code, the need for time-consuming copying
between host and device disappears. It is possible to allocate memory statically in the
device code (Listing 9).
Listing 9. Static memory allocation inside kernel
__global__ void generate_kernel(
    curandStateMtgp32 *state,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    short LATTICE_1[LATTICE_SIZE];
    short LATTICE_2[LATTICE_SIZE];
    short * LATTICE = LATTICE_1;
    short * NEXT_STEP_LATTICE = LATTICE_2;
    // ...
}
7.6. Comparison of static and dynamic memory use
Although quite simple, this optimization does in fact improve the performance
of the simulations. The results of static vs dynamic memory allocation are
illustrated in Figure 7.1.
All of the empirical tests of the GPU code were done on a GeForce GTX 570 GPU with
an Intel i7 CPU.
Figure 7.1. Speedup of static and dynamic (memory-pool) memory allocations with MAX_MCS
equal to 1 000 and 10 000. Markers denote the arithmetic mean of the 5 averages
conducted. The fitted curves are 4th-degree polynomials.
Figure 7.2 shows the results conducted for a range of 1 up to 6 000 concurrent
simulations. The static memory approach is faster than the dynamic one in every trial
conducted. Moreover, as seen in Figure 7.2, static memory tends to maintain its speedup
rather than lose its "velocity", as is the case for the dynamic memory approach (compare
the fitted curves above 40 000 concurrent simulations).
Figure 7.2. Speedup of static and dynamic (memory-pool) memory allocations with MAX_MCS
equal to 1 000 and 10 000. Markers denote the arithmetic mean of the 5 averages
conducted. The fitted curves are 4th-degree polynomials.
8. GPU Simulations - thread per spin
8.1. Thread per spin approach
The CUDA C Best Practices Guide¹³ encourages the use of multiple threads for optimal
utilization of the GPU cores. In this spirit, one can apply an approach in which each spin
is represented by a single thread and each simulation takes up an entire block.
This idea is presented in Listing 10.
Listing 10. Thread per spin algorithm
while (monte_carlo_steps < MAX_MCS) {
    __syncthreads();
    if (threadIdx.x == 0) {
        SWAP(LATTICE, NEXT_STEP_LATTICE, SWAP);
        if (BOND_DENSITY(LATTICE) == 0.0) {
            // If lattice is ferromagnetic, simulation can stop
            monte_carlo_steps = MAX_MCS;
            break;
        }
        float C = TRIANGLE_DISTRIBUTION((X / (float)MAX_X), (Y / (float)MAX_Y));
        first_i = (int)(LATTICE_SIZE * curand_uniform(&state[BLOCK_ID]));
        last_i = (int)(first_i + (C * LATTICE_SIZE));
        monte_carlo_steps = (int)(lattice_update_counter / LATTICE_SIZE);
    }
    __syncthreads();
    NEXT_STEP_LATTICE[threadIdx.x] = LATTICE[threadIdx.x];
    if ((first_i <= threadIdx.x && threadIdx.x <= last_i)
        || (last_i >= LATTICE_SIZE
            && (last_i % LATTICE_SIZE >= threadIdx.x || threadIdx.x >= first_i))
    ) {
        short left = MOD((threadIdx.x-1), LATTICE_SIZE);
        short right = MOD((threadIdx.x+1), LATTICE_SIZE);
        // Neighbours are the same
        if (LATTICE[left] == LATTICE[right]) {
            NEXT_STEP_LATTICE[threadIdx.x] = LATTICE[left];
        }
        // Otherwise randomly flip the spin
        else if (W0 > curand_uniform(&state[BLOCK_ID])) {
            NEXT_STEP_LATTICE[threadIdx.x] = FLIP_SPIN(LATTICE[threadIdx.x]);
        }
        atomicAdd(&lattice_update_counter, 1);
    }
}
¹³ http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
8.2. Concurrent execution
The approach presented in Listing 10 shows some new features of CUDA, namely
__syncthreads(), which can be used to synchronize the execution of threads: it
ensures that all threads in a block will be executing the same instruction after passing the
__syncthreads() call. A launch of exactly LATTICE_SIZE*MIU_SIZE*SIGMA_SIZE*W0_SIZE
threads is initialized, and each block is exactly LATTICE_SIZE threads long (Listing 11).
Listing 11. Grid and block sizes
dim3 blockDim(LATTICE_SIZE, 1, 1);
dim3 gridDim(MIU_SIZE, SIGMA_SIZE, W0_SIZE);
All threads in a block take part in a single, block-wide simulation instance and execute
the same code. This introduces a problem: every thread will execute the initialization
code, e.g. setting up W0, C, MIU, etc. Some of these values (like C) are random; therefore,
running this code multiple times will produce different results. A situation where even
one spin of the simulation is evaluated according to a different W0 value is unacceptable.
A correct initial setup can be obtained by having only one thread evaluate the initialization
(Listing 12).
Listing 12. Initialization performed by a single thread
__global__ void generate_kernel(
    curandStateMtgp32 *state,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    // ...
    if (threadIdx.x == 0) {
        LATTICE = LATTICE_1;
        NEXT_STEP_LATTICE = LATTICE_2;
        SWAP = NULL;
        lattice_update_counter = 0;
        monte_carlo_steps = 0;
        W0 = Z / (float)MAX_Z;
    }
    __syncthreads();
    // ...
    while (monte_carlo_steps < MAX_MCS) {
        // ...
    }
}
Concurrent execution by multiple threads makes the initialization of LATTICE easier and
faster: all of the threads update their own values. The block's threads access
memory in bulk and without conflicts, which is a potential source of speedup
(Listing 13).
Listing 13. LATTICE initialization
__global__ void generate_kernel(
    curandStateMtgp32 *state,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    // ...
    if (threadIdx.x == 0) {
        // Initialization
    }
    __syncthreads();
    // Initialize as antiferromagnetic
    NEXT_STEP_LATTICE[threadIdx.x] = threadIdx.x & 1;
    while (monte_carlo_steps < MAX_MCS) {
        // ...
    }
}
8.3. Thread communication
To ensure thread cooperation inside a simulation, block-level communication is needed.
It can be obtained by means of shared memory, a type of memory residing on-chip
that is about 100x faster¹⁴ than uncached global memory. Shared memory is accessible
to every thread in a block.
Listing 14 illustrates the definition of __shared__ resources inside a kernel. The CUDA
compiler allocates the on-chip memory for __shared__ variables only once (even
though the kernel is executed by every thread): all of the threads in a block access the
same place in on-chip memory when accessing __shared__ data.
Listing 14. Shared memory definitions
__global__ void generate_kernel(
    curandStateMtgp32 *state,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    __shared__ unsigned short LATTICE_1[LATTICE_SIZE];
    __shared__ unsigned short LATTICE_2[LATTICE_SIZE];
    __shared__ unsigned short first_i, last_i;
    __shared__ unsigned long long int lattice_update_counter;
    __shared__ unsigned long monte_carlo_steps;
    __shared__ float W0;

    __shared__ unsigned short * LATTICE;
    __shared__ unsigned short * NEXT_STEP_LATTICE;
    __shared__ unsigned short * SWAP;

    // Initialization of LATTICE pointers, lattice_update_counter etc.

    while (monte_carlo_steps < MAX_MCS) {
        // ...
    }
}

¹⁴ http://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/
8.4. Race conditions with shared memory
The issue of race conditions arises when multiple threads try to write to shared
memory. Writing on the GPU is usually not an atomic operation; it actually consists of 3
different operations¹⁵. For example, the incrementation of a number consists of:
1. Reading the value
2. Incrementing the value
3. Writing the new value
During the time required to perform these steps, other threads can interrupt the
execution. Fortunately, CUDA provides the programmer with a set of atomic*()
functions, which ensure that any number of threads requesting a read or write of
the same memory location will be served properly.
The code presented in Listing 15 shows how to perform the lattice_update_counter
incrementation so as to ensure the correctness of the results.
Listing 15. Atomic add
while (monte_carlo_steps < MAX_MCS) {
    // ...
    NEXT_STEP_LATTICE[threadIdx.x] = LATTICE[threadIdx.x];
    if ((first_i <= threadIdx.x && threadIdx.x <= last_i)
        || (last_i >= LATTICE_SIZE
            && (last_i % LATTICE_SIZE >= threadIdx.x || threadIdx.x >= first_i))
    ) {
        // ...
        atomicAdd(&lattice_update_counter, 1);
    }
}
8.5. Thread per spin approach - reduction
Reduction is the process of combining multiple elements into fewer ones. This
"definition", although vague, captures the idea: given a number of input elements of
some sort, an operation is applied repeatedly to reduce their count.
¹⁵ http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions
The code presented in Listing 16 computes the bond density of LATTICE in each iteration.
Moreover, it does so sequentially, in a single thread, which can be inefficient.
Listing 16. Unoptimized iteration initialization
while ( monte_carlo_steps < MAX_MCS ) {
    __syncthreads ();
    if (threadIdx.x == 0) {
        SWAP(LATTICE , NEXT_STEP_LATTICE , SWAP);
        if ( BOND_DENSITY(LATTICE) == 0.0 ) {
            // If lattice is in ferromagnetic state , simulation can stop
            monte_carlo_steps = MAX_MCS;
            break;
        }
        float C = TRIANGLE_DISTRIBUTION ((X / (float)MAX_X), (Y / (float)MAX_Y));
        first_i = (int)(LATTICE_SIZE * curand_uniform (& state[BLOCK_ID ]));
        last_i = (int)(first_i + (C * LATTICE_SIZE));
        monte_carlo_steps = (int)( lattice_update_counter / LATTICE_SIZE);
    }
    __syncthreads ();
    // ...
}
Let us recall that the bond density ρ = 0 iff¹⁶ all spins are either in the SPIN_UP or the
SPIN_DOWN position. From this simple observation we can conclude that, if LATTICE is
in the ferromagnetic state, the sum of all LATTICE elements (represented in code as 0s
and 1s) is equal to either 0 (all elements are zeroes) or LATTICE_SIZE (all elements are
ones). Moreover, we can make use of multiple GPU threads and reuse the existing
NEXT_STEP_LATTICE array, as it is not needed between iterations. In the algorithm
presented in Listing 17 the sum of the LATTICE elements is calculated in log2(L) steps.
For L = 64 it takes 6 iterations, after which the summation result is stored in
NEXT_STEP_LATTICE[0].
Listing 17. Parallel reduction
for (int i = LATTICE_SIZE / 2; i != 0; i /= 2) {
    if (threadIdx.x < i) {
        NEXT_STEP_LATTICE[threadIdx.x] += NEXT_STEP_LATTICE[threadIdx.x + i];
    }
    __syncthreads ();
}
The approach from Listing 17 can even be extended to calculating the actual
BOND_DENSITY(LATTICE). This method (again, using NEXT_STEP_LATTICE as an
auxiliary array) is presented in Listing 18.
Listing 18. Parallel reduction to calculate BOND_DENSITY(LATTICE)
__syncthreads ();
NEXT_STEP_LATTICE[threadIdx.x] = 2 * abs(LATTICE[threadIdx.x] - LATTICE[(threadIdx.x + 1) % LATTICE_SIZE]);
__syncthreads ();
for (int i = LATTICE_SIZE / 2; i > 0; i /= 2) {
    if (threadIdx.x < i) {
        // Use NEXT_STEP_LATTICE as cache array
        NEXT_STEP_LATTICE[threadIdx.x] += NEXT_STEP_LATTICE[threadIdx.x + i];
    }
    __syncthreads ();
}
¹⁶ iff: if and only if
8.6. Thread per spin approach - ﬂags
A possible way of optimizing the performance is to avoid calculating the
bond density during each execution of the while loop altogether. If, during an update
iteration, none of the spins were updated, then LATTICE was simply copied unchanged
into NEXT_STEP_LATTICE. One can suspect that this behavior is caused by the lattice
being in a stationary state. If the stationary state in question is one of the ferromagnetic
states, the simulation can be stopped.
Listing 19 introduces a new variable: lattice_update_counter_iter. This variable
holds the information about how many spins were actually changed during a simulation
iteration. If a change did occur, then BOND_DENSITY(LATTICE) will not be executed
at all: the condition lattice_update_counter_iter == 0 is evaluated first and, if it is
not satisfied, the part after && is never reached (since && is short-circuit evaluated
in C). If, however, no change occurred (lattice_update_counter_iter == 0) and the
lattice is in a ferromagnetic state (BOND_DENSITY(LATTICE) == 0.0), the simulation
should stop. Unfortunately, break; applies only to the thread with threadIdx.x == 0.
In order to have the other threads stop their work, we can reuse the check already
performed by each thread before starting the actual work, that is: set
monte_carlo_steps = MAX_MCS. In this way we prevent the other threads from
further execution (and from potentially interfering with the results).
Listing 19. Thread per spin with ﬂags
__global__ void generate_kernel (
    curandStateMtgp32 *state ,
    int * DEV_MCS_NEEDED ,
    float * DEV_BOND_DENSITY
) {
    // Shared memory definitions
    __shared__ unsigned int lattice_update_counter_iter;
    // Simulation initialization
    if (threadIdx.x == 0) {
        // ...
        lattice_update_counter_iter = 0;
    }
    __syncthreads ();
    // Initialize as antiferromagnetic
    NEXT_STEP_LATTICE[threadIdx.x] = threadIdx.x & 1;
    while ( monte_carlo_steps < MAX_MCS ) {
        __syncthreads ();
        if (threadIdx.x == 0) {
            // Iteration initialization
            if ( lattice_update_counter_iter == 0
                && BOND_DENSITY(LATTICE) == 0.0 ) {
                // If ferromagnetic , simulation can stop
                monte_carlo_steps = MAX_MCS;
                break;
            }
            lattice_update_counter_iter = 0;
        }
        __syncthreads ();
        NEXT_STEP_LATTICE[threadIdx.x] = LATTICE[threadIdx.x];
        if (( first_i <= threadIdx.x && threadIdx.x <= last_i )
            || ( last_i >= LATTICE_SIZE && ( last_i % LATTICE_SIZE >= threadIdx.x || threadIdx.x >= first_i ) )
        ) {
            // Iteration update
        }
        if ( NEXT_STEP_LATTICE[threadIdx.x] != LATTICE[threadIdx.x] ) {
            atomicAdd (& lattice_update_counter_iter , 1);
        }
    }
}
8.7. Thread-per-spin performance
As seen in Figure 8.1, each improvement over the basic thread-per-spin method provides
some speedup. Noteworthy is the performance gap between the reduction and flag
versions. Apparently, using a flag to avoid the per-iteration calculation of
BOND_DENSITY(LATTICE) is significantly faster than the version equipped with the
highly optimized BOND_DENSITY(LATTICE) algorithm.
Figure 8.1. Speedup of static and dynamic (memory pool) memory allocations with MAX_MCS
equal to 1 000 and 10 000. Markers denote the arithmetic mean of the 5 averages
conducted. The fitted curves are 4th-degree polynomials.
8.8. Thread-per-spin vs thread-per-simulation performance
The tests presented in Figure 8.2 and Figure 8.3 show the comparison between the suggested
approaches. The unoptimized thread-per-spin approach turns out to be faster than thread-
per-simulation in every test under 20 000 concurrent simulations. Threads on the GPU
do not have as powerful a processor at their disposal as those run on the CPU. This leads to
the conclusion that most of the tasks conducted on the GPU should be split onto separate
threads to parallelize the execution, even at the expense of increased communication
time. However, above the 20 000-simulation threshold, the overhead introduced by the huge
number of threads and RNG instances causes thread-per-spin to perform worse than
the thread-per-simulation approach.
Figure 8.2. Execution times of thread-per-spin and thread-per-simulation simulations with
MAX_MCS equal to 1 000. Markers denote the arithmetic mean of the 5 averages
conducted. The fitted curves are 4th-degree polynomials.
Figure 8.3. Execution times of thread-per-spin and thread-per-simulation simulations with
MAX_MCS equal to 10 000. Markers denote the arithmetic mean of the 5 averages
conducted. The fitted curves are 4th-degree polynomials.
As in the case of execution times, the speedups of thread-per-simulation runs are
eventually greater than those of the thread-per-spin approach (Figure 8.4 and Figure 8.5).
For massive numbers of concurrent threads, thread-per-spin simulations perform relatively
well, gaining speedups of about 8-9x. Thread-per-simulation, on the other hand, shows an
impressive speedup of up to 28x. For low numbers of threads, below the 20 000 threshold,
the thread-per-spin approach shows the better speedup. However, for bigger simulations
(25 000 and more) thread-per-simulation shows the more promising results.
Figure 8.4. Speedups of thread-per-spin and thread-per-simulation simulations with MAX_MCS
equal to 1 000. Markers denote the arithmetic mean of the 5 averages conducted.
The fitted curves are 4th-degree polynomials.
Figure 8.5. Speedups of thread-per-spin and thread-per-simulation simulations with MAX_MCS
equal to 10 000. Markers denote the arithmetic mean of the 5 averages conducted.
The fitted curves are 4th-degree polynomials.
9. Bond density for some W0 values
The calculations made during this project helped in developing some insight
into how the triangular distribution could affect the phase transition. Some exemplary bond
density (ρ) plots are presented in Figure 9.1 and Figure 9.2.
Figure 9.1. Bond density after 10⁶ MCS, W0 = 0.9, MIU = [0, 0.25, 0.5, . . . , 1] and
SIGMA = [0, 0.25, 0.5, . . . , 1]
Figure 9.2. Bond density after 10⁶ MCS, W0 = 0.6, MIU = [0, 0.25, 0.5, . . . , 1] and
SIGMA = [0, 0.25, 0.5, . . . , 1]
10. Conclusions
CUDA does in fact expose an easy-to-use environment for harnessing the power of
present-day GPGPUs. The realization of the project helped in speeding up complex
and time-consuming calculations that take days on high-end CPUs. It also helped in
getting to know the CUDA compiler and its most useful libraries.
Another important (although not previously mentioned) element of this study was the
usage of scripting languages. Technologies such as Python¹⁷ enable easy work
distribution across GPGPU workstations, harvesting the results, processing the data¹⁸
and plotting¹⁹ the results for easy pattern recognition and presentation. Unfortunately,
the GPU architecture requires the programmer to really know the underlying hardware
and various programming techniques in order to obtain optimal performance.
11. Future work
In the future, the developed CUDA program could be used to drive a fully featured
study of the physical phenomena described in section 4. In order to do that, more
detailed data has to be gathered, including improved data resolution and a higher
number of averages.
¹⁷ http://www.python.org/
¹⁸ http://www.numpy.org/
¹⁹ http://matplotlib.org/
References
[1] C. Coulon et al., Glauber dynamics in a single-chain magnet: From theory to real
systems, Phys. Rev. B 69 (2004)
[2] L. Bogani et al., Single chain magnets: where to from here?, J. Mater. Chem. 18
(2008)
[3] H. Miyasaka et al., Slow Dynamics of the Magnetization in One-Dimensional
Coordination Polymers: Single-Chain Magnets, Inorg. Chem. 48 (2009)
[4] R.O. Kuzian et al., Ca2Y2Cu5O10: the first frustrated quasi-1D ferromagnet close
to criticality, Phys. Rev. Lett. 109 (2012)
[5] K. Sznajd-Weron and S. Krupa, Inflow versus outflow zero-temperature dynamics
in one dimension, Phys. Rev. E 74, 031109 (2006)
[6] F. Radicchi, D. Vilone, and H. Meyer-Ortmanns, Phase Transition between Syn-
chronous and Asynchronous Updating Algorithms, J. Stat. Phys. 129, 593 (2007)
[7] B. Skorupa, K. Sznajd-Weron, and R. Topolnicki, Phase diagram for a zero-
temperature Glauber dynamics under partially synchronous updates, Phys. Rev. E
86, 051113 (2012)
[8] I. G. Yi and B. J. Kim, Phase transition in a one-dimensional Ising ferromagnet
at zero temperature using Glauber dynamics with a synchronous updating mode,
Phys. Rev. E 83, 033101 (2011)
[9] M. Evans, N. Hastings, B. Peacock, Statistical Distributions, 3rd ed., New York:
Wiley, pp. 187-188 (2000)
[10] E. Ising, Beitrag zur Theorie des Ferromagnetismus, Z. Phys. 31, 253-258 (1925)
[11] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller, Equation
of State Calculations by Fast Computing Machines, Journal of Chemical Physics
21 (6), 1087-1092 (1953)
[12] W. Lenz, Beiträge zum Verständnis der magnetischen Eigenschaften in festen
Körpern, Physikalische Zeitschrift 21, 613-615 (1920)
[13] M. Matsumoto and T. Nishimura, Mersenne Twister: A 623-dimensionally equidis-
tributed uniform pseudorandom number generator, ACM Trans. on Modeling and
Computer Simulation 8 (1), 3-30 (1998)