8.2. Concurrent execution
In the approach presented in Listing 10, some new features of CUDA are shown.
Namely, __syncthreads...
__global__ void generate_kernel (
    curandStateMtgp32 *state,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {...
// Initialization of LATTICE pointers, lattice_update_counter etc.

while ( monte_carlo_steps < MAX_MCS ) {
    // ...
The code presented in Listing 16 computes the bond density of LATTICE in each iteration.
Moreover, it is done sequentially by...
NEXT_STEP_LATTICE[threadIdx.x] = 2 * abs(LATTICE[threadIdx.x] - LATTICE[(threadIdx.x + 1) % LATTICE_SIZE]);
__syncthread...
__syncthreads();
// Initialize as antiferromagnetic
NEXT_STEP_LATTICE[threadIdx.x] = threadIdx.x & 1;
while ( ...
Figure 8.1. Speedup of static and dynamic (memorypool) memory allocations with MAX_MCS
equal to 1 000 and 10 000. Markers ...
Figure 8.2. Execution times of thread-per-spin and thread-per-simulation simulations with
MAX_MCS equal to 1 000. Markers ...
As in the case of execution times, the speedups of thread-per-simulation runs
are greater than those using thread-per-spi...
Figure 8.5. Speedups of thread-per-spin and thread-per-simulation simulations with MAX_MCS
equal to 10 000. Markers denote...
Figure 9.1. Bond density after 10^6 MCS, W0 = 0.9, MIU = [0, 0.25, 0.5, . . . , 1] and
SIGMA = [0, 0.25, 0.5, . . . , 1]
Fi...
10. Conclusions
CUDA does in fact expose an easy-to-use environment for harnessing the power of
present-day GPGPUs. The re...
References
[1] C. Coulon, et al., Glauber dynamics in a single-chain magnet: From theory to real
systems, Phys. Rev. B 69 (2...
POLITECHNIKA WROCŁAWSKA
WYDZIAŁ INFORMATYKI I ZARZĄDZANIA

GPGPU driven simulations of zero-temperature 1D Ising model with Glauber dynamics

Daniel Kosalla

FINAL THESIS
under supervision of Dr inż. Dariusz Konieczny

Wrocław 2013
Acknowledgments: Dr inż. Dariusz Konieczny
Contents

1. Motivation
2. Target
3. Scope of work
4. Theoretical background and proposed model
   4.1. Ising model
   4.2. Historic methods
   4.3. Updating
   4.4. Simulations
   4.5. Distributions
   4.6. Updating
   4.7. Algorithm
5. General Purpose Graphic Processing Units
   5.1. History of General Purpose GPUs
   5.2. CUDA Architecture
6. CPU Simulations
   6.1. Sequential algorithm
   6.2. Random number generation on CPU
   6.3. CPU performance
7. GPU Simulations - thread per simulation
   7.1. Thread per simulation
   7.2. Running the simulation
   7.3. Solution space
   7.4. Random Number Generators
   7.5. Thread per simulation - static memory
   7.6. Comparison of static and dynamic memory use
8. GPU Simulations - thread per spin
   8.1. Thread per spin approach
   8.2. Concurrent execution
   8.3. Thread communication
   8.4. Race conditions with shared memory
   8.5. Thread per spin approach - reduction
   8.6. Thread per spin approach - flags
   8.7. Thread-per-spin performance
   8.8. Thread-per-spin vs thread-per-simulation performance
9. Bond density for some W0 values
10. Conclusions
11. Future work
Appendix
A. Sequential algorithm - CPU
B. Thread per simulation - no optimizations
C. Thread per simulation - static memory
D. Thread per spin - no optimizations
E. Thread per spin - parallel reduction
F. Thread per spin - update flag
1. Motivation

In the presence of recent developments of SCMs (Single Chain Magnets) [1–4], the issue of criticality in 1D Ising-like magnet chains has turned out to be a promising field of study [5–8]. Some practical applications have already been suggested [2]. Unfortunately, the details of the general mechanism driving these changes in real materials are yet to be discovered. Traditionally, Monte Carlo simulations of the Ising model were conducted on CPUs¹. However, with the advent of powerful GPGPUs² a new trend in scientific computation has started, enabling more detailed and faster calculations.

2. Target

The following document describes the developed GPGPU applications capable of producing insights into the underlying physical problem, an examination of different approaches to conducting Monte Carlo simulations on GPGPUs, and a comparison between the developed parallel GPGPU algorithms and the sequential CPU-based approach.

3. Scope of work

The scope of this document includes the development of 5 parallel GPGPU algorithms, namely:
• Thread-per-simulation algorithm
• Thread-per-simulation algorithm with static memory
• Thread-per-spin algorithm
• Thread-per-spin algorithm with flags
• Thread-per-spin algorithm with reduction

1 CPU - Central Processing Unit
2 GPGPU - General Purpose Graphics Processing Unit
4. Theoretical background and proposed model

4.1. Ising model

Although initially proposed by Wilhelm Lenz, it was Ernst Ising [10] who developed a mathematical model for ferromagnetic phenomena. The Ising model is usually represented by means of a lattice of spins - discrete variables {−1, 1} representing the magnetic dipole moments of molecules in the material. The spins interact with their neighbours, which may cause a phase transition of the whole lattice.

4.2. Historic methods

A Monte Carlo simulation (MC) of the Ising model consists of a sequence of lattice updates. Traditionally, all spins (synchronous updating) or a single spin (sequential updating) are updated in each iteration, producing the lattice state for future iterations. The update methods are based on so-called dynamics that describe the spin interactions.

4.3. Updating

The idea of a partially synchronous updating scheme has been suggested [5–7]. This c-synchronous mode has a fixed fraction of spins being updated in one time-step. However, one can imagine that the number of updated spins/molecules (often referred to as cL, where L denotes the size of the chain and c ∈ (0, 1]) changes as the simulation progresses. If so, then it is either linked to some characteristics of the system or may be expressed with some probability distribution (described in subsection 4.5). This approach of changing the c parameter can be applied while choosing spins randomly as well as in clusters (subsection 4.6), but only the latter will be considered in this document.

4.4. Simulations

In the proposed model, cL sequential updating is used, with c drawn from the provided distribution. The considered environment consists of a one-dimensional array of L spins s_i = ±1. The index of each spin is denoted by i = 1, 2, . . . , L. Periodic boundary conditions are assumed, i.e. s_{L+1} = s_1. It has been shown in [8] that the system under synchronous Glauber dynamics reaches one of two absorbing states - ferromagnetic or antiferromagnetic.
Therefore, let us introduce the density of bonds (ρ) as an order parameter:

    ρ = Σ_{i=1}^{L} (1 − s_i s_{i+1}) / (2L)    (4.1)
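Equation (4.1) can be checked with a short host-side helper. This is a sketch with an illustrative name (bond_density); the thesis code hides this computation behind the BOND_DENSITY macro.

```c
/* Density of active bonds, eq. (4.1): for a chain of L spins s_i = +1/-1
 * with periodic boundary s_{L+1} = s_1, each broken bond contributes
 * (1 - s_i * s_{i+1}) = 2, so the sum is normalized by 2L. */
float bond_density(const short *s, int L) {
    int sum = 0;
    for (int i = 0; i < L; i++)
        sum += 1 - s[i] * s[(i + 1) % L];   /* 0 for aligned, 2 for broken */
    return sum / (2.0f * L);
}
```

A fully ferromagnetic chain gives ρ = 0 and a fully antiferromagnetic one gives ρ = 1, matching the two absorbing states described above.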
As stated in [8], phase transitions in synchronous and c-sequential [7] updating modes ought to be rather continuous (in cases different than c = 1 for the latter). A smooth phase transition can be observed in Figure 4.1.

Figure 4.1. The average density of active bonds in the stationary state ⟨ρ_st⟩ as a function of W0 for c = 0.9 and several lattice sizes L. [7] B. Skorupa, K. Sznajd-Weron, and R. Topolnicki, Phase diagram for a zero-temperature Glauber dynamics under partially synchronous updates, Phys. Rev. E 86, 051113 (2012)

The system is considered at low temperatures (T), and therefore T = 0 can be assumed. The Metropolis algorithm can be considered a special case of zero-temperature Glauber dynamics for 1/2 spins. Each spin is flipped (s_i → −s_i) with rate W(δE) per unit time. While T = 0:

    W(δE) = { 1   if δE < 0,
            { W0  if δE = 0,
            { 0   if δE > 0    (4.3)

In the case of T = 0, the ordering parameter W0 ∈ [0, 1] (e.g. the Glauber rate W0 = 1/2 or the Metropolis rate W0 = 1) is assumed to be constant. One can imagine that even the W0 parameter could in fact be changed during the simulation process, but that is out of the scope of the proposed model. The system starts in the fully ferromagnetic state (ρ = ρ_f = 0). After each time-step, changes are applied to the system and the next time-step is evaluated. After a predetermined number of time-steps, the state of the system is investigated. If the chain has reached the antiferromagnetic state (ρ = ρ_af = 1), or a sufficiently large number of time-steps has been inconclusive, then the whole simulation is shut down.

4.5. Distributions

During the simulation, c will not be fixed in time but will rather vary within [0, 1] according to the triangular continuous probability distribution [9] presented in Figure 4.2. While studying different initial conditions for the simulations, the distributions are to be adjusted in order to provide peak values in the range {0, 1}. This is due to the fact that
Figure 4.2. c can be any value in the interval [0, 1] but is most likely to be around the value c = 1/2. Other values are possible, but their probabilities are inversely proportional to their distance from c = 1/2.

the value of 0.5 (as presented in the plot) would mean that in each time-step half of the spins get to be updated.

4.6. Updating

The following algorithms make use of the triangular probability distribution to assign an appropriate c value before each time-step. After (on average) L updated spins, one Monte Carlo Step (MCS) can be distinguished.

4.7. Algorithm

Transformation of the above-mentioned rules into a set of instructions could yield the following description or pseudocode (below). Update cL consecutive spins starting from a randomly chosen one. Each change is saved to a new array rather than the old one. After each Stop, the updated spins are saved and a new time-step can be started.

1. Assign a c value from the given distribution
2. Choose a random value of i ∈ [0, L]
3. max = i + cL
4. s_i is the i-th spin
   • if s_{i+1} = s_{i−1}:
     - s′_i = s_{i+1} = s_{i−1}
   • otherwise:
     - Flip s_i with probability W0
5. if i < max:
   • i = i + 1
   • Go to step 4
6. Stop
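The steps above can be sketched as a host-side C function. This is an illustrative model of one time-step, not the thesis code; the names (time_step, next) are introduced here, and rand() stands in for the simulation's RNG.

```c
#include <stdlib.h>

/* One time-step of the cL sequential update: cL consecutive spins, starting
 * at a random position, are updated against the OLD configuration s[] into
 * next[], so every read sees the previous state (double buffering). */
void time_step(const short *s, short *next, int L, double c, double w0) {
    for (int i = 0; i < L; i++)
        next[i] = s[i];                          /* copy the old state */
    int first = rand() % L;                      /* step 2: random start */
    int count = (int)(c * L);                    /* step 3: cL spins */
    for (int k = 0; k < count; k++) {
        int i = (first + k) % L;                 /* periodic boundary */
        int left = (i - 1 + L) % L, right = (i + 1) % L;
        if (s[left] == s[right])
            next[i] = s[left];                   /* equal neighbours: align */
        else if ((double)rand() / RAND_MAX < w0)
            next[i] = -s[i];                     /* flip with probability W0 */
    }
}
```

With W0 = 0, a ferromagnetic chain is a fixed point, and an antiferromagnetic chain maps onto its mirror image (every spin flips), which is again antiferromagnetic: both are absorbing states of the dynamics.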
5. General Purpose Graphic Processing Units

5.1. History of General Purpose GPUs

Traditionally, in a desktop computer the GPU is a highly specialized electronic circuit designed to robustly handle 2D and 3D graphics. In 1992 Silicon Graphics released the OpenGL library. OpenGL was meant as a standardised, platform-independent interface for writing 3D graphics. By the mid 1990s an increasing demand for 3D applications appeared in the consumer market. It was NVIDIA who developed the GeForce 256 and branded it as "the world's first GPU"³. The GeForce 256, although one of many graphical accelerators, was one that marked a very rapid advance in the field, incorporating features such as transform and lighting computations directly on the graphics processor. The release of GPUs capable of handling programmable pipelines attracted researchers to explore the possibility of using graphics processors outside their original use scheme. Although the GPUs of the early 2000s were programmable only in a way that enabled pixel manipulation, researchers noticed that these manipulations could actually represent any kind of operations, and pixels could virtually represent any kind of data.

In late 2006 NVIDIA revealed the GeForce 8800 GTX, the first GPU built with the CUDA Architecture. The CUDA Architecture enables the programmer to use every arithmetic logic unit⁴ on the GPU (as opposed to the early days of GPGPU, when access to ALUs was granted only via the restricted and complicated interfaces of OpenGL and DirectX). The new family of GPUs, started with the 8800 GTX, was built with IEEE-compliant ALUs capable of single-precision floating-point arithmetic. Moreover, the new ALUs were equipped not only with an extended set of instructions usable in general-purpose computing but also enabled arbitrary read and write operations to device memory.
A few months after the launch of the 8800 GTX, NVIDIA published a compiler that took standard C, extended with some additional keywords, and transformed it into fully featured GPU code capable of general-purpose processing. It is important to stress that the CUDA C in use today is by far easier to use than OpenGL/DirectX. Programmers do not have to disguise their data as graphics and can use industry-standard C or even other languages like C#, Java or Python (via appropriate bindings). CUDA is now used in various fields of science ranging from medical imaging and fluid dynamics to environmental science, offering enormous, several-orders-of-magnitude speed-ups⁵. GPUs are not only faster than CPUs in terms of computed data

3 http://www.nvidia.com/page/geforce256.html
4 ALU - Arithmetic Logic Unit
5 http://www.nvidia.com/object/cuda-apps-flash-new-changed.html
per unit time (e.g. FLOPS⁶) but also in terms of power and cost efficiency.

5.2. CUDA Architecture

The underlying architecture of CUDA is driven by design decisions connected with the GPU's primary purpose, that is, graphics processing. Graphics processing is usually a highly parallel process; therefore, the GPU also works in a parallel fashion. An important distinction can be made between the logical and physical layers of the GPU architecture.

The programmer decomposes a computational problem into atomic processes (threads) that can be executed simultaneously. This partition usually results in the creation of hundreds, thousands or even millions of threads. For the programmer's convenience, threads can be organized inside blocks, which in turn are part of grids. Both blocks and grids are 3-dimensional structures. These spatial dimensions are introduced for easier problem decomposition. As mentioned before, the GPU is meant for graphics processing, which is usually related to processing 2D or 3D sets of data. This grouping is associated not only with the logical decomposition of problems but also with the physical structure of the GPU.

A basic unit of execution on the GPU is the warp. A warp consists of 32 threads, each belonging to the same block. If the block is bigger than the warp size, its threads are divided between several warps. Warps are executed on execution units called Streaming Multiprocessors (SMs). Each SM executes several warps (not necessarily from the same block). Physically, each SM consists of 8 streaming processors (SPs, CUDA cores) and 32 "basic" ALUs. The 8 SPs spend 4 clock cycles executing the same processor instruction, enabling the 32 threads of a warp to execute in parallel. Each of the threads in a warp can (and usually does) have different data supplied to it, forming what is known as the SIMD⁷ architecture.

6 FLOPS - Floating Point Operations Per Second
7 Single Instruction, Multiple Data
Figure 5.1. Grid of thread blocks
http://docs.nvidia.com/cuda/cuda-c-programming-guide/

CUDA also provides a rich memory hierarchy available to every thread. Each of the memory spaces has its own characteristics. The fastest and smallest memory is the per-thread local memory. This local, register-based memory is out of the CUDA programmer's direct reach and is used automatically. Each thread in a block can make use of shared memory. This memory can be accessed by different threads in the block and is usually the main medium of inter-thread communication. The slowest memory spaces (but available to every thread) are called global, constant and texture memory respectively; each of them has a different size and purpose, but they are all persistent across kernel launches by the same application.
Figure 5.2. CUDA memory hierarchy
http://docs.nvidia.com/cuda/cuda-c-programming-guide/
6. CPU Simulations

6.1. Sequential algorithm

The baseline for the presented algorithms will be the sequential, CPU-based code. The simulation itself is executed by the algorithm presented in Listing 1.

Listing 1. Sequential algorithm for CPU

while ( monte_carlo_steps < MAX_MCS ) {
    if ( is_lattice_updated == FALSE && BOND_DENSITY(LATTICE) == 0.0 ) {
        // If lattice is in ferromagnetic state, simulation can stop
        break;
    }
    float C = TRIANGLE_DISTRIBUTION(MIU, SIGMA);
    first_i = (int)(LATTICE_SIZE * randomUniform());
    last_i = (int)(first_i + (C * LATTICE_SIZE));
    is_lattice_updated = FALSE;
    for (int i = 0; i < LATTICE_SIZE; i++) {
        NEXT_STEP_LATTICE[i] = LATTICE[i];
        if (( first_i <= i && i <= last_i )
            || ( last_i >= LATTICE_SIZE && ( last_i % LATTICE_SIZE >= i || i >= first_i ) )
        ) {
            int left = MOD((i-1), LATTICE_SIZE);
            int right = MOD((i+1), LATTICE_SIZE);
            // Neighbours are the same and different than the current spin
            if ( LATTICE[left] == LATTICE[right] ) {
                NEXT_STEP_LATTICE[i] = LATTICE[left];
            }
            // Otherwise randomly flip the spin
            else if ( W0 > randomUniform() ) {
                NEXT_STEP_LATTICE[i] = FLIP_SPIN(LATTICE[i]);
            }
            lattice_update_counter++;
        }
        if (LATTICE[i] != NEXT_STEP_LATTICE[i]) {
            is_lattice_updated = TRUE;
        }
    }
    monte_carlo_steps = (int)(lattice_update_counter / LATTICE_SIZE);
    SWAP(LATTICE, NEXT_STEP_LATTICE, SWAP);
}

The code runs the simulation with the initial conditions MAX_MCS, LATTICE_SIZE and LATTICE. LATTICE is set to be an array initialized in the antiferromagnetic state (which can be represented by a sequence of alternating ones and zeroes). To explore the solution space (the combinations of W0, MIU and SIGMA), each simulation is run one after another.

C/C++'s % operator is in fact the remainder of division and not the modulo operator in the mathematical sense. The most prominent difference is that -1 % LATTICE_SIZE
== -1, whereas MOD((-1), LATTICE_SIZE) == LATTICE_SIZE-1. Therefore, while accessing the current spin's neighbours, the MOD(x,N) macro is used (Listing 2).

Listing 2. Modulo function-like macro

#define MOD(x, N) (((x < 0) ? ((x % N) + N) : x) % N)

6.2. Random number generation on CPU

The CPU code uses a GSL⁸-based Mersenne Twister⁹. Usage of the GSL-supplied MT is shown in Listing 3.

Listing 3. GSL's Mersenne Twister setup

#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
// ...
const gsl_rng_type * T;
gsl_rng * r;
// ...
double randomUniform () {
    return gsl_rng_uniform(r);
}
// ...
int main(int argc, char *argv[]) {
    gsl_rng_env_setup();
    T = gsl_rng_mt19937;
    r = gsl_rng_alloc(T);
    long seed = time(NULL) * getpid();
    gsl_rng_set(r, seed);
    // simulation
    // randomUniform () calls
}

6.3. CPU performance

The CPU tests were conducted on a quad-core AMD Phenom(tm) II X4 945 processor with 4 GB of RAM. Simulations occupied only one core at a time. The results presented in Figure 6.1 will be used as a baseline for further comparisons (with the respective MAX_MCS values).

8 GSL - GNU Scientific Library, http://www.gnu.org/software/gsl/
9 http://www.gnu.org/software/gsl/manual/html_node/Random-number-generator-algorithms.html
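Returning to Listing 2, the difference between C's % and the MOD macro can be verified with a few assertions. The neighbour helpers below are illustrative names introduced here, not part of the thesis code.

```c
/* Listing 2's MOD macro next to C's built-in remainder: C99 defines % to
 * truncate toward zero, so -1 % N == -1, while MOD wraps negative
 * arguments back into [0, N), which is what periodic boundaries need. */
#define MOD(x, N) (((x < 0) ? ((x % N) + N) : x) % N)

int left_neighbour(int i, int L)  { return MOD(i - 1, L); }
int right_neighbour(int i, int L) { return MOD(i + 1, L); }
```

With this wrapping, spin 0's left neighbour is spin L-1 and spin L-1's right neighbour is spin 0, closing the chain into a ring.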
Figure 6.1. Execution times of CPU simulations with MAX_MCS equal to 1 000 and 10 000. Markers denote the arithmetic mean of the 5 averages conducted. The fitted curves are 4th-degree polynomials.
7. GPU Simulations - thread per simulation

7.1. Thread per simulation

CUDA provides a C/C++-like language for executing code on the GPU (CUDA C). The code is compiled, and the CUDA compiler, via specific language extensions (e.g. __device__, __host__), can distinguish the parts to be executed by the CPU (host), the GPU (device) or both (global).

Listing 4. Thread per simulation algorithm

while ( monte_carlo_steps < MAX_MCS ) {
    if ( is_lattice_updated == FALSE && BOND_DENSITY(LATTICE) == 0.0 ) {
        // stop when lattice is in ferromagnetic state
        break;
    }
    float C = TRIANGLE_DISTRIBUTION((X / (float)MAX_X), (Y / (float)MAX_Y));
    float W0 = Z / (float)MAX_Z;
    first_i = (int)(LATTICE_SIZE * RANDOM(&state[BLOCK_ID])) + THREAD_LATTICE_INDEX;
    last_i = (int)(first_i + (C * LATTICE_SIZE)) + THREAD_LATTICE_INDEX;
    is_lattice_updated = FALSE;
    for ( int i = THREAD_LATTICE_INDEX; i < LATTICE_SIZE + THREAD_LATTICE_INDEX; i++ ) {
        NEXT_STEP_LATTICE[i] = LATTICE[i];
        if (( first_i <= i && i <= last_i )
            || ( last_i >= LATTICE_SIZE + THREAD_LATTICE_INDEX && ( last_i % (LATTICE_SIZE + THREAD_LATTICE_INDEX) >= i || i >= first_i ) )
        ) {
            int left = MOD((i-1), LATTICE_SIZE) + THREAD_LATTICE_INDEX;
            int right = MOD((i+1), LATTICE_SIZE) + THREAD_LATTICE_INDEX;
            // If neighbours are the same
            if ( LATTICE[left] == LATTICE[right] ) {
                NEXT_STEP_LATTICE[i] = LATTICE[left];
            }
            // ... otherwise randomly flip the spin
            else if ( W0 > RANDOM(&state[BLOCK_ID]) ) {
                NEXT_STEP_LATTICE[i] = FLIP_SPIN(LATTICE[i]);
            }
            lattice_update_counter++;
        }
        if ( LATTICE[i] != NEXT_STEP_LATTICE[i] ) {
            is_lattice_updated = TRUE;
        }
    }
    monte_carlo_steps = (int)(lattice_update_counter / LATTICE_SIZE);
    SWAP(LATTICE, NEXT_STEP_LATTICE, SWAP);
}
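Listing 4 relies on THREAD_LATTICE_INDEX to keep each simulation inside its own slice of the shared lattice array. Its definition is not shown in this excerpt; the sketch below is one plausible, host-checkable way to derive it from a thread's position in a 2D grid of 1D blocks (all names here are illustrative, not the thesis code).

```c
/* Hypothetical offset computation: flatten the thread's position into a
 * global thread id, then give each thread a private LATTICE_SIZE-long
 * slice of one large array starting at id * LATTICE_SIZE. */
#define LATTICE_SIZE 100

int thread_lattice_index(int block_x, int block_y, int grid_w,
                         int thread_x, int block_w) {
    int global_thread_id = (block_y * grid_w + block_x) * block_w + thread_x;
    return global_thread_id * LATTICE_SIZE;  /* first cell of this slice */
}
```

Because every offset is a multiple of LATTICE_SIZE, the MOD(..., LATTICE_SIZE) neighbour arithmetic in Listing 4 stays within the owning thread's slice.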
7.2. Running the simulation

In order for the CUDA compiler, and then the GPU, to execute the code correctly, the programmer has to follow some conventions for the program structure. For instance, functions to be executed on the GPU have to be prefixed with the __global__ or __device__ keyword. Moreover, a call to a GPU function has to be made with <<<gridDim, blockDim>>>. The framework for executing code on the GPU is shown in Listing 5.

Listing 5. Exemplary foundation of GPU-executed code

// Imports
// Helper definitions etc.
__global__ void generate_kernel (
    curandStateMtgp32 *state,
    short * LATTICE,
    short * NEXT_STEP_LATTICE,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    // Code to be executed by GPU
    while ( monte_carlo_steps < MAX_MCS ) {
        if ( is_lattice_updated == FALSE && BOND_DENSITY(LATTICE) == 0.0 ) {
            // stop when lattice is in ferromagnetic state
            break;
        }
        // Rest of the simulation code
    }
}
// ...
int main(int argc, char *argv[]) {
    // Initializations ...
    generate_kernel<<<gridDim, blockDim>>>(
        devMTGPStates,
        DEV_LATTICES,
        DEV_NEXT_STEP_LATTICES,
        DEV_MCS_NEEDED,
        DEV_BOND_DENSITY
    );
    // Obtaining results
    // Cleanup
}

7.3. Solution space

The important difference from the CPU version is the use of (X, Y, Z), which denote the position of the thread in the logical structure provided by the CUDA architecture. Threads can be organized inside 3D structures called blocks and indexed using a "Cartesian" combination of {x, y, z}; inside a kernel, the block's position is referenced with blockIdx.{x,y,z}. The grid is also a 3D structure and, similarly to blocks, its dimensions can be referenced inside a kernel
with gridDim.{x,y,z}. This structuring is provided for the programmer's convenience and is related to GPUs being devices meant for 2D and 3D graphics processing, where such "Cartesian" decomposition is quite natural. Although blocks and grids are logical structures, they are associated with the physical properties of GPUs. This fact can (and should, whenever possible) be used for problem decomposition in order to optimize runtime performance.

Here, (X, Y, Z) correspond to (MIU, SIGMA, W0), distributed over (blockIdx.x, blockIdx.y, threadIdx.x). This was done in order to keep a relatively small number of threads in each block (see subsection 7.4). By this convention, each thread can calculate its own set of values of (MIU, SIGMA, W0). Listing 6 shows how a thread can map its coordinates into the initial parameters of the simulation. For instance, a thread with blockIdx == (100,100,0) will be executing simulations for MIU=1.0 and SIGMA=0.5 if MIU_SIZE=100 and SIGMA_SIZE=200.

Listing 6. Simulation parameters computation for each thread

#define MIU_START 0.0
#define MIU_END 1.0
#define MIU_SIZE 10
#define SIGMA_START 0.0
#define SIGMA_END 1.0
#define SIGMA_SIZE 10
// ...
#define X blockIdx.x
#define Y blockIdx.y
#define Z threadIdx.x
#define MAX_X MIU_SIZE
#define MAX_Y SIGMA_SIZE
// ...
__global__ void generate_kernel (
    curandStateMtgp32 *state,
    short * LATTICE,
    short * NEXT_STEP_LATTICE,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    // ...
    float C = TRIANGLE_DISTRIBUTION(X / (float)MAX_X, Y / (float)MAX_Y);
    // ...
}
int main(int argc, char *argv[]) {
    dim3 blockDim(W0_SIZE, 1, 1);
    dim3 gridDim(MIU_SIZE, SIGMA_SIZE, 1);
    // ...
    generate_kernel<<<gridDim, blockDim>>>(
        // ...
    );
    // ...
}
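The worked example from the text (blockIdx == (100,100,0) with MIU_SIZE=100 and SIGMA_SIZE=200) can be reproduced with a host-only sketch of the same mapping. The helper names are illustrative; the kernel does this inline via the X, Y, MAX_X and MAX_Y macros.

```c
/* Host-side check of the coordinate-to-parameter mapping of Listing 6:
 * MIU = blockIdx.x / MAX_X and SIGMA = blockIdx.y / MAX_Y, computed in
 * floating point so the ratio is not truncated to an integer. */
float miu_of(int block_x, int max_x)   { return block_x / (float)max_x; }
float sigma_of(int block_y, int max_y) { return block_y / (float)max_y; }
```

The cast to float matters: without it, X / MAX_X would be integer division and every block with X < MAX_X would collapse to MIU = 0.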
7.4. Random Number Generators

An important part of every Monte Carlo simulation is randomness. In order for the simulation to converge to the actual result, the quality of the Random Number Generator (RNG) must be high. The de facto standard for scientific MC simulations is the Mersenne Twister¹⁰ [13]. There is a version of MT19937 optimized for GPGPU usage¹¹ that was included in CUDA as the cuRAND library¹². There are, however, some limitations of the built-in MT19937:

• 1 MTGP state per block
• Up to 256 threads per state
• Up to 200 states using the included, pre-generated sequences

MT is called with curand_uniform(state) and returns a floating-point number in the range (0, 1]. The values are uniformly distributed over this range. To transform this sequence of uniformly distributed numbers, a special function (function-like macro) can be used (Listing 7).

Listing 7. Transformation of the uniform into the triangular distribution

#define TRIANGLE_DISTRIBUTION(miu, sigma) ({ \
    float start = max(miu - sigma, 0.0); \
    float end = min(miu + sigma, 1.0); \
    float rand = ( \
        curand_uniform(&state[BLOCK_ID]) \
        + curand_uniform(&state[BLOCK_ID]) \
    ) / 2.0; \
    ((end - start) * rand) + start; \
})

10 MT19937, MT
11 MTGP, http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MTGP/index.html
12 http://docs.nvidia.com/cuda/curand/device-api-overview.html

7.5. Thread per simulation - static memory

In the algorithm presented in Listing 4, memory usage is not optimized at all. Memory is not only allocated in the global memory space, but each time the program is run the host's memory has to be allocated and copied to the device. Listing 8 shows the inefficient memory allocations that occur in the thread-per-simulation algorithm from subsection 7.1.

Listing 8. Dynamic allocation of memory

__global__ void generate_kernel (
    curandStateMtgp32 *state,
    short * LATTICE,
    short * NEXT_STEP_LATTICE,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
)
// ...
short * DEV_LATTICES;
short * DEV_NEXT_STEP_LATTICES;
CUDA_CALL(cudaMalloc(
    &DEV_LATTICES,
    THREADS_NEEDED * sizeof(short) * LATTICE_SIZE
));
CUDA_CALL(cudaMalloc(
    &DEV_NEXT_STEP_LATTICES,
    THREADS_NEEDED * sizeof(short) * LATTICE_SIZE
));
// ...
generate_kernel<<<grid_size, block_size>>>(
    devMTGPStates,
    DEV_LATTICES,
    DEV_NEXT_STEP_LATTICES,
    DEV_MCS_NEEDED,
    DEV_BOND_DENSITY
);

If the memory is allocated inside the kernel code, the need for time-consuming copying between host and device disappears. It is possible to statically allocate memory in the device code (Listing 9).

Listing 9. Static memory allocation inside kernel

__global__ void generate_kernel (
    curandStateMtgp32 *state,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    short LATTICE_1[LATTICE_SIZE];
    short LATTICE_2[LATTICE_SIZE];
    short * LATTICE = LATTICE_1;
    short * NEXT_STEP_LATTICE = LATTICE_2;
    // ...
}

7.6. Comparison of static and dynamic memory use

Although quite simple, this optimization does in fact improve the performance of the simulations. The results of the static vs dynamic memory allocation are illustrated in Figure 7.1. All of the empirical tests of GPU code were done on a GeForce GTX 570 GPU with an Intel i7 CPU.
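As an aside, the uniform-to-triangular transformation of Listing 7 can be sanity-checked entirely on the host. In this sketch, rand() stands in for curand_uniform and the function name triangle is introduced here; it is not the thesis code.

```c
#include <stdlib.h>

/* Host-only analogue of TRIANGLE_DISTRIBUTION: the mean of two independent
 * uniforms is triangularly distributed with its peak at 1/2, and is then
 * rescaled to [miu - sigma, miu + sigma] clipped to [0, 1]. */
float triangle(float miu, float sigma) {
    float start = miu - sigma > 0.0f ? miu - sigma : 0.0f;
    float end   = miu + sigma < 1.0f ? miu + sigma : 1.0f;
    float u = ((float)rand() / RAND_MAX + (float)rand() / RAND_MAX) / 2.0f;
    return (end - start) * u + start;
}
```

Every sample is guaranteed to fall inside the clipped interval, so with miu = 0.5 and sigma = 0.25 all returned values of c lie in [0.25, 0.75].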
Figure 7.1. Speedup of static and dynamic (memory-pool) memory allocations with MAX_MCS equal to 1 000 and 10 000. Markers denote the arithmetic mean of the 5 averages conducted. The fitted curves are 4th-degree polynomials.

Figure 7.2 shows the results for a range of 1 up to 6 000 concurrent simulations. The static memory approach is faster than the dynamic one in every trial conducted. Moreover, as seen in Figure 7.2, static memory tends to maintain its speedup rather than lose its "velocity", as is the case with the dynamic memory approach (compare the fitted curves above 40 000 concurrent simulations).
Figure 7.2. Speedup of static and dynamic (memory-pool) memory allocations with MAX_MCS equal to 1 000 and 10 000. Markers denote the arithmetic mean of the 5 averages conducted. The fitted curves are 4th-degree polynomials.
8. GPU Simulations - thread per spin

8.1. Thread per spin approach

The CUDA C Best Practices Guide (http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/) encourages the use of multiple threads for optimal utilization of GPU cores. In this spirit, one can apply an approach where each spin is represented by a single thread and each simulation takes up an entire block. This idea is presented in Listing 10.

Listing 10. Thread per spin algorithm

while (monte_carlo_steps < MAX_MCS) {
    __syncthreads();
    if (threadIdx.x == 0) {
        SWAP(LATTICE, NEXT_STEP_LATTICE, SWAP);
        if (BOND_DENSITY(LATTICE) == 0.0) {
            // If lattice is ferromagnetic, simulation can stop
            monte_carlo_steps = MAX_MCS;
            break;
        }
        float C = TRIANGLE_DISTRIBUTION((X / (float)MAX_X), (Y / (float)MAX_Y));
        first_i = (int)(LATTICE_SIZE * curand_uniform(&state[BLOCK_ID]));
        last_i = (int)(first_i + (C * LATTICE_SIZE));
        monte_carlo_steps = (int)(lattice_update_counter / LATTICE_SIZE);
    }
    __syncthreads();
    NEXT_STEP_LATTICE[threadIdx.x] = LATTICE[threadIdx.x];
    if ((first_i <= threadIdx.x && threadIdx.x <= last_i)
        || (last_i >= LATTICE_SIZE && (last_i % LATTICE_SIZE >= threadIdx.x || threadIdx.x >= first_i))
    ) {
        short left = MOD((threadIdx.x - 1), LATTICE_SIZE);
        short right = MOD((threadIdx.x + 1), LATTICE_SIZE);
        // Neighbours are the same
        if (LATTICE[left] == LATTICE[right]) {
            NEXT_STEP_LATTICE[threadIdx.x] = LATTICE[left];
        }
        // Otherwise randomly flip the spin
        else if (W0 > curand_uniform(&state[BLOCK_ID])) {
            NEXT_STEP_LATTICE[threadIdx.x] = FLIP_SPIN(LATTICE[threadIdx.x]);
        }
        atomicAdd(&lattice_update_counter, 1);
    }
}
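The per-spin update rule used in Listing 10 is easier to reason about in a sequential CPU sketch. The version below is an illustrative rewrite, not the thesis code: it applies one synchronous sweep over the whole ring with the random-flip branch left out (i.e. as if W0 = 0), so a spin whose two neighbours agree copies their value, and otherwise keeps its own.

```c
/* One synchronous zero-temperature update sweep on a ring of n 0/1 spins,
 * deterministic variant (the random W0 flip branch is omitted). */
void sweep(const short *lattice, short *next, int n)
{
    for (int i = 0; i < n; i++) {
        short left  = lattice[(i - 1 + n) % n];
        short right = lattice[(i + 1) % n];
        if (left == right)
            next[i] = left;        /* neighbours agree: align with them */
        else
            next[i] = lattice[i];  /* neighbours disagree: keep the spin
                                      (a flip here would be random) */
    }
}
```

On the antiferromagnetic state 0,1,0,1 both neighbours of every spin agree with each other (and disagree with the spin itself), so the whole lattice inverts each sweep; a pair of domain walls as in 0,0,1,1 is frozen under this deterministic rule, which is exactly why the probabilistic W0 branch is needed to reach the ferromagnetic state.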
8.2. Concurrent execution

The approach presented in Listing 10 uses some new features of CUDA, most notably __syncthreads(), which synchronizes the execution of threads: it ensures that all threads in a block will be executing the same instruction after passing the __syncthreads() call. Exactly LATTICE_SIZE*MIU_SIZE*SIGMA_SIZE*W0_SIZE threads are launched, and each block is exactly LATTICE_SIZE threads long (Listing 11).

Listing 11. Grid and block sizes

dim3 blockDim(LATTICE_SIZE, 1, 1);
dim3 gridDim(MIU_SIZE, SIGMA_SIZE, W0_SIZE);

Each thread in a block takes part in a single, block-wide simulation instance and executes the same code. This introduces a problem: every thread would execute initialization code, such as setting up W0, C, MIU, etc. Some of these values (like C) are random, so running this code multiple times would produce different results. A situation where even one spin of the simulation is evaluated according to a different W0 value is unacceptable. A correct initial setup can be obtained by having only one thread perform the initialization (Listing 12).

Listing 12. Initialization by a single thread

__global__ void generate_kernel(
    curandStateMtgp32 *state,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    // ...
    if (threadIdx.x == 0) {
        LATTICE = LATTICE_1;
        NEXT_STEP_LATTICE = LATTICE_2;
        SWAP = NULL;
        lattice_update_counter = 0;
        monte_carlo_steps = 0;
        W0 = Z / (float)MAX_Z;
    }
    __syncthreads();
    // ...
    while (monte_carlo_steps < MAX_MCS) {
        // ...
    }
}

Concurrent execution by multiple threads makes the initialization of LATTICE easier and faster: all of the threads update their own values. The block's threads access memory in bulk and without conflicts, which is a potential source of speedup (Listing 13).

Listing 13. LATTICE initialization
__global__ void generate_kernel(
    curandStateMtgp32 *state,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    // ...
    if (threadIdx.x == 0) {
        // Initialization
    }
    __syncthreads();
    // Initialize as antiferromagnetic
    NEXT_STEP_LATTICE[threadIdx.x] = threadIdx.x & 1;
    while (monte_carlo_steps < MAX_MCS) {
        // ...
    }
}

8.3. Thread communication

To ensure thread cooperation inside a simulation, block-level communication is needed. It can be obtained by means of shared memory. Shared memory is a type of memory residing on-chip; it is about 100x faster (http://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/) than uncached global memory and is accessible to every thread in a block. Listing 14 illustrates the definition of shared resources inside a kernel. The CUDA compiler allocates the on-chip memory for shared variables only once (even though the kernel is executed by every thread), and all of the threads in a block access the same place in on-chip memory when accessing shared data.

Listing 14. Shared memory definitions

__global__ void generate_kernel(
    curandStateMtgp32 *state,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    __shared__ unsigned short LATTICE_1[LATTICE_SIZE];
    __shared__ unsigned short LATTICE_2[LATTICE_SIZE];
    __shared__ unsigned short first_i, last_i;
    __shared__ unsigned long long int lattice_update_counter;
    __shared__ unsigned long monte_carlo_steps;
    __shared__ float W0;

    __shared__ unsigned short * LATTICE;
    __shared__ unsigned short * NEXT_STEP_LATTICE;
    __shared__ unsigned short * SWAP;
    // Initialization of LATTICE pointers, lattice_update_counter etc.

    while (monte_carlo_steps < MAX_MCS) {
        // ...
    }
}

8.4. Race conditions with shared memory

The issue of race conditions arises when multiple threads try to write to shared memory. Writing on a GPU is usually not an atomic operation; it actually consists of three different operations (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions). For example, incrementing a number consists of:

1. Reading the value
2. Incrementing the value
3. Writing the new value

During the time required to perform these steps, other threads can interrupt the execution. Fortunately, CUDA provides the programmer with a set of atomic*() functions, which ensure that any number of threads requesting a read or write of the same memory location will be served properly. The code presented in Listing 15 shows how to perform the lattice_update_counter incrementation so that the results remain correct.

Listing 15. Atomic add

while (monte_carlo_steps < MAX_MCS) {
    // ...
    NEXT_STEP_LATTICE[threadIdx.x] = LATTICE[threadIdx.x];
    if ((first_i <= threadIdx.x && threadIdx.x <= last_i)
        || (last_i >= LATTICE_SIZE && (last_i % LATTICE_SIZE >= threadIdx.x || threadIdx.x >= first_i))
    ) {
        // ...
        atomicAdd(&lattice_update_counter, 1);
    }
}

8.5. Thread per spin approach - reduction

Reduction is the process of decreasing the number of elements: given multiple elements of some sort, we apply an operation that repeatedly combines them until fewer elements (here, a single value) remain.
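Before looking at the CUDA listings, the reduction idea can be sketched sequentially in C: in every round the first half of the active elements absorbs the second half, so the element count halves until a single value - here, the number of unsatisfied bonds on the ring - remains in cell 0. The helper below is illustrative only; the thesis code reuses NEXT_STEP_LATTICE as the cache array and runs the inner loop as parallel threads.

```c
#include <stdlib.h>

/* Count broken bonds on a ring of n 0/1 spins (n a power of two, n <= 64)
 * using the halving pattern of a parallel tree reduction. */
int broken_bonds(const short *lattice, int n)
{
    int cache[64];  /* scratch array, plays the role of NEXT_STEP_LATTICE */
    for (int t = 0; t < n; t++)
        cache[t] = abs(lattice[t] - lattice[(t + 1) % n]);  /* bond indicator */
    for (int i = n / 2; i > 0; i /= 2)   /* log2(n) rounds */
        for (int t = 0; t < i; t++)      /* these iterations run as parallel
                                            threads on the GPU */
            cache[t] += cache[t + i];
    return cache[0];  /* 0 iff the lattice is ferromagnetic */
}
```

For n = 64 the outer loop makes exactly 6 rounds, matching the log2(L) step count quoted for Listing 17.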
The code presented in Listing 16 computes the bond density of LATTICE in each iteration. Moreover, it does so sequentially, using only one thread, which can be inefficient.

Listing 16. Unoptimized iteration initialization

while (monte_carlo_steps < MAX_MCS) {
    __syncthreads();
    if (threadIdx.x == 0) {
        SWAP(LATTICE, NEXT_STEP_LATTICE, SWAP);
        if (BOND_DENSITY(LATTICE) == 0.0) {
            // If lattice is in ferromagnetic state, simulation can stop
            monte_carlo_steps = MAX_MCS;
            break;
        }
        float C = TRIANGLE_DISTRIBUTION((X / (float)MAX_X), (Y / (float)MAX_Y));
        first_i = (int)(LATTICE_SIZE * curand_uniform(&state[BLOCK_ID]));
        last_i = (int)(first_i + (C * LATTICE_SIZE));
        monte_carlo_steps = (int)(lattice_update_counter / LATTICE_SIZE);
    }
    __syncthreads();
    // ...
}

Let us recall that the bond density ρ = 0 iff (if and only if) all spins are in either the SPIN_UP or the SPIN_DOWN position. From this simple observation we can conclude that, if LATTICE is in the ferromagnetic state, the sum of all LATTICE elements (represented in code by 0s and 1s) is equal to either 0 (all elements are zeroes) or LATTICE_SIZE (all elements are ones). Moreover, we can make use of multiple GPU threads and reuse the existing NEXT_STEP_LATTICE array, as it is not needed between iterations. In the algorithm presented in Listing 17 the sum of the LATTICE elements is calculated in log2(L) steps. For L = 64 it takes 6 iterations, after which the summation result is stored in NEXT_STEP_LATTICE[0].

Listing 17. Parallel reduction

for (int i = LATTICE_SIZE / 2; i != 0; i /= 2) {
    if (threadIdx.x < i) {
        NEXT_STEP_LATTICE[threadIdx.x] += NEXT_STEP_LATTICE[(threadIdx.x + i)];
    }
    __syncthreads();
}

The approach from Listing 17 can be extended to calculate the actual BOND_DENSITY(LATTICE). This method (again, using NEXT_STEP_LATTICE as an auxiliary array) is presented in Listing 18.

Listing 18.
Parallel reduction to calculate BOND_DENSITY(LATTICE)

__syncthreads();
NEXT_STEP_LATTICE[threadIdx.x] = 2 * abs(LATTICE[threadIdx.x] - LATTICE[(threadIdx.x + 1) % LATTICE_SIZE]);
__syncthreads();
for (int i = LATTICE_SIZE / 2; i > 0; i /= 2) {
    if (threadIdx.x < i) {
        // Use NEXT_STEP_LATTICE as cache array
        NEXT_STEP_LATTICE[threadIdx.x] += NEXT_STEP_LATTICE[threadIdx.x + i];
    }
    __syncthreads();
}

8.6. Thread per spin approach - flags

A possible way of optimizing the performance is to avoid calculating the bond density during each execution of the while loop altogether. If none of the spins changed during an update iteration, then LATTICE was simply copied unchanged into NEXT_STEP_LATTICE. One can suspect that such behavior is caused by the lattice being in a stationary state; if the stationary state in question is one of the ferromagnetic states, the simulation can be stopped.

Listing 19 introduces a new variable, lattice_update_counter_iter, which holds the number of spins that were actually changed during the last simulation iteration. If a change did occur, BOND_DENSITY(LATTICE) is not executed at all: the condition lattice_update_counter_iter == 0 is evaluated first and, since C short-circuits the && operator, the part after && is only reached when that condition is satisfied. If, however, no change occurred (lattice_update_counter_iter == 0) and the lattice is in a ferromagnetic state (BOND_DENSITY(LATTICE) == 0.0), the simulation should stop. Unfortunately, break; applies only to the thread with threadIdx.x == 0. To make the other threads stop their work as well, we can reuse the check already performed by each thread before starting the actual work - the while condition - by setting monte_carlo_steps = MAX_MCS. In this way we prevent the other threads from executing further (and potentially interfering with the results).

Listing 19.
Thread per spin with flags

__global__ void generate_kernel(
    curandStateMtgp32 *state,
    int * DEV_MCS_NEEDED,
    float * DEV_BOND_DENSITY
) {
    // Shared memory definitions
    __shared__ unsigned long long int lattice_update_counter_iter;
    // Simulation initialization
    if (threadIdx.x == 0) {
        // ...
        lattice_update_counter = 0;
        lattice_update_counter_iter = 0;
    }
    __syncthreads();
    // Initialize as antiferromagnetic
    NEXT_STEP_LATTICE[threadIdx.x] = threadIdx.x & 1;
    while (monte_carlo_steps < MAX_MCS) {
        __syncthreads();
        if (threadIdx.x == 0) {
            // Iteration initialization
            if (lattice_update_counter_iter == 0
                && BOND_DENSITY(LATTICE) == 0.0) {
                // If ferromagnetic, simulation can stop
                monte_carlo_steps = MAX_MCS;
                break;
            }
            lattice_update_counter_iter = 0;
        }
        __syncthreads();
        NEXT_STEP_LATTICE[threadIdx.x] = LATTICE[threadIdx.x];
        if ((first_i <= threadIdx.x && threadIdx.x <= last_i)
            || (last_i >= LATTICE_SIZE && (last_i % LATTICE_SIZE >= threadIdx.x || threadIdx.x >= first_i))
        ) {
            // Iteration update
        }
        if (NEXT_STEP_LATTICE[threadIdx.x] != LATTICE[threadIdx.x]) {
            atomicAdd(&lattice_update_counter_iter, 1);
        }
    }
}

8.7. Thread-per-spin performance

As seen in Figure 8.1, each improvement over the basic thread-per-spin method introduces some speedup. Noteworthy is the performance gap between the reduction and flags versions: using a flag to avoid per-iteration calculations of BOND_DENSITY(LATTICE) is significantly faster than the version equipped with the highly optimized BOND_DENSITY(LATTICE) algorithm.
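The control flow of the flag optimization can be traced with a sequential CPU sketch. The driver below is illustrative (it uses the deterministic neighbour rule with no random flips, and names like run_with_flag are not from the thesis code): it counts changed spins per sweep and evaluates the bond density only on quiet sweeps. On a lattice with a single minority spin it converges in two sweeps while paying for exactly one bond-density evaluation.

```c
#include <stdlib.h>
#include <string.h>

#define N 8

static int bond_density_calls;  /* counts how often the expensive check runs */

/* The "expensive" check the flag is meant to avoid. */
static int bond_count(const short *lat)
{
    bond_density_calls++;
    int bonds = 0;
    for (int i = 0; i < N; i++)
        bonds += abs(lat[i] - lat[(i + 1) % N]);
    return bonds;
}

/* Run deterministic sweeps until a sweep changes nothing AND the lattice
 * is ferromagnetic; returns the number of sweeps performed.
 * (With random flips disabled this only terminates for inputs that
 * actually reach a ferromagnetic state, which is fine for the sketch.) */
int run_with_flag(short *lat)
{
    short next[N];
    int sweeps = 0;
    for (;;) {
        int changed = 0;  /* plays the role of lattice_update_counter_iter */
        for (int i = 0; i < N; i++) {
            short left = lat[(i - 1 + N) % N], right = lat[(i + 1) % N];
            next[i] = (left == right) ? left : lat[i];
            if (next[i] != lat[i])
                changed++;
        }
        memcpy(lat, next, sizeof(next));
        sweeps++;
        /* short-circuit: bond density is only evaluated on quiet sweeps */
        if (changed == 0 && bond_count(lat) == 0)
            break;
    }
    return sweeps;
}
```

Starting from 0,0,0,1,0,0,0,0 the first sweep removes the minority spin (one change, so no bond-density call), and the second sweep changes nothing, triggering the single bond-density evaluation that confirms the ferromagnetic state.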
Figure 8.1. Speedups of the thread-per-spin variants with MAX_MCS equal to 1 000 and 10 000. Markers denote the arithmetic mean of the 5 averages conducted. The fitted curves are 4th-degree polynomials.

8.8. Thread-per-spin vs thread-per-simulation performance

The tests presented in Figure 8.2 and Figure 8.3 compare the suggested approaches. The unoptimized thread-per-spin approach turns out to be faster than thread-per-simulation in every test under 20 000 concurrent simulations. Threads on the GPU do not have as powerful a processor at their disposal as those run on the CPU, which leads to the conclusion that most tasks conducted on the GPU should be split onto separate threads to parallelize the execution, even at the expense of increased communication time. Above the 20 000-simulation threshold, however, the overhead introduced by the huge number of threads and RNG instances causes thread-per-spin to perform worse than the thread-per-simulation approach.
Figure 8.2. Execution times of thread-per-spin and thread-per-simulation simulations with MAX_MCS equal to 1 000. Markers denote the arithmetic mean of the 5 averages conducted. The fitted curves are 4th-degree polynomials.

Figure 8.3. Execution times of thread-per-spin and thread-per-simulation simulations with MAX_MCS equal to 10 000. Markers denote the arithmetic mean of the 5 averages conducted. The fitted curves are 4th-degree polynomials.
As with the execution times, the speedups of thread-per-simulation are greater than those of the thread-per-spin approach (Figure 8.4 and Figure 8.5). For massive numbers of concurrent simulations, thread-per-spin performs relatively well, gaining speedups of about 8-9x, while thread-per-simulation shows an impressive speedup of up to 28x. For low numbers of threads, the thread-per-spin approach shows the better speedup below the 20 000 threshold; for bigger runs (25 000 simulations and more), however, thread-per-simulation shows the more promising results.

Figure 8.4. Speedups of thread-per-spin and thread-per-simulation simulations with MAX_MCS equal to 1 000. Markers denote the arithmetic mean of the 5 averages conducted. The fitted curves are 4th-degree polynomials.
Figure 8.5. Speedups of thread-per-spin and thread-per-simulation simulations with MAX_MCS equal to 10 000. Markers denote the arithmetic mean of the 5 averages conducted. The fitted curves are 4th-degree polynomials.

9. Bond density for some W0 values

The calculations made during this project helped develop some insight into how the triangular distribution can affect the phase transition. Some exemplary bond density (ρ) plots are presented in Figure 9.1 and Figure 9.2.
Figure 9.1. Bond density after 10^6 MCS, W0 = 0.9, MIU = [0, 0.25, 0.5, . . . , 1] and SIGMA = [0, 0.25, 0.5, . . . , 1]

Figure 9.2. Bond density after 10^6 MCS, W0 = 0.6, MIU = [0, 0.25, 0.5, . . . , 1] and SIGMA = [0, 0.25, 0.5, . . . , 1]
10. Conclusions

CUDA does in fact expose an easy-to-use environment for harnessing the power of present-day GPGPUs. The realization of this project sped up complex and time-consuming calculations that take days on high-end CPUs, and helped in getting to know the CUDA compiler and its most useful libraries. Another important (although not previously mentioned) element of this study was the use of scripting languages. Technologies such as Python (http://www.python.org/) enable easy work distribution across GPGPU workstations, harvesting the results, processing the data (http://www.numpy.org/), and plotting (http://matplotlib.org/) the results for easy pattern recognition and presentation. Unfortunately, the GPU architecture requires the programmer to really know the underlying hardware and various programming techniques in order to obtain optimal performance.

11. Future work

In the future, the developed CUDA program could be used to drive a fully featured study of the physical phenomena described in section 4. In order to do that, more detailed data has to be gathered, including improved data resolution and a higher number of averages.
References

[1] C. Coulon, et al. Glauber dynamics in a single-chain magnet: From theory to real systems, Phys. Rev. B 69 (2004)
[2] L. Bogani, et al. Single chain magnets: where to from here?, J. Mater. Chem., 18 (2008)
[3] H. Miyasaka, et al. Slow Dynamics of the Magnetization in One-Dimensional Coordination Polymers: Single-Chain Magnets, Inorg. Chem., 48 (2009)
[4] R.O. Kuzian, et al. Ca2Y2Cu5O10: the first frustrated quasi-1D ferromagnet close to criticality, Phys. Rev. Lett., 109 (2012)
[5] K. Sznajd-Weron and S. Krupa. Inflow versus outflow zero-temperature dynamics in one dimension, Phys. Rev. E 74, 031109 (2006)
[6] F. Radicchi, D. Vilone, and H. Meyer-Ortmanns. Phase Transition between Synchronous and Asynchronous Updating Algorithms, J. Stat. Phys. 129, 593 (2007)
[7] B. Skorupa, K. Sznajd-Weron, and R. Topolnicki. Phase diagram for a zero-temperature Glauber dynamics under partially synchronous updates, Phys. Rev. E 86, 051113 (2012)
[8] I. G. Yi and B. J. Kim. Phase transition in a one-dimensional Ising ferromagnet at zero temperature using Glauber dynamics with a synchronous updating mode, Phys. Rev. E 83, 033101 (2011)
[9] M. Evans, N. Hastings, B. Peacock. Statistical Distributions, 3rd ed., New York: Wiley, pp. 187-188 (2000)
[10] E. Ising. Beitrag zur Theorie des Ferromagnetismus, Z. Phys. 31: 253-258 (1925)
[11] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller. Equation of State Calculations by Fast Computing Machines, Journal of Chemical Physics 21 (6): 1087-1092 (1953)
[12] W. Lenz. Beitrage zum Verstandnis der magnetischen Eigenschaften in festen Korpern, Physikalische Zeitschrift 21: 613-615 (1920)
[13] M. Matsumoto and T. Nishimura. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator, ACM Trans. on Modeling and Computer Simulation, Vol. 8, No. 1, pp. 3-30 (1998)
