Parallel Simulation of Urban Dynamics on the GPU
Ivan Blečić, Arnaldo Cecchini, Giuseppe A. Trunfio
Department of Architecture, Planning and Design, University of Sassari, Alghero
Seventh International Conference on Geographical Analysis, Urban Modeling, Spatial Statistics, GEOG-AN-MOD 2012
Introduction
• A number of geosimulation models have been developed to better understand and predict urban growth, land-use and landscape changes.
• Some trends can be recognized in the literature:
  – increasing size of the areas under study, which often goes beyond the traditional scale of a city, covering wider regional and national territories or even an entire continent;
  – models tend to become more and more sophisticated, also because they can take advantage of the increased availability of high-resolution remote sensing data;
  – automatic and computationally expensive calibration processes are often required, involving large search spaces and many parameters.
• As a result, real-world applications of such models often require long computing times.
Introduction
• Geosimulation models are often computationally intensive.
• In spite of this, few studies exist in the literature on the application of parallel computing to geosimulation models
  – e.g. the recent work by Guan and Clarke, where a general-purpose parallel library was developed and applied to speed up the well-known CA model SLEUTH.
• We apply GPGPU (General-Purpose computing on Graphics Processing Units) to a widely used CA approach for land-use simulation, based on the concept of transition potentials.
GPGPU
• GPGPU (General-Purpose computing on Graphics Processing Units): using Graphics Processing Units for standard computation.
• Why compute on Graphics Processing Units?
  – the computational power of devices enabling GPGPU has exceeded that of standard CPUs by more than one order of magnitude;
  – the price of a typical high-end GPU is comparable to the price of a standard CPU.
[Chart: computational power of CPUs vs GPUs over time]
GPGPU
• Why compute on Graphics Processing Units?
  – there has been a rapid increase in the programmability of GPU devices, which has facilitated the porting of many scientific applications, leading to relevant parallel speedups.
• Main alternatives from the programming point of view:
  – nVidia CUDA: the C-language Compute Unified Device Architecture, a popular programming model introduced in 2006 by nVidia Corporation for their GPUs;
  – OpenCL: an open standard maintained by the Khronos Group, with the backing of major graphics hardware vendors as well as large computer industry vendors.
GPUs
• Modern GPUs are multiprocessors with highly efficient hardware multi-threading support.
• The key capability of a GPU is thus to execute thousands of threads running the same function concurrently on different data.
• Hence, the computational power of such an architecture can be fully exploited through a fine-grained data-parallel approach, whenever the same computation can be carried out independently on different elements of a dataset.
GPUs
• We use the GPGPU platform provided by nVidia:
  – it consists of a group of Streaming Multiprocessors (SMs);
  – each SM can support a number of co-resident concurrent threads;
  – each SM consists of multiple Scalar Processor (SP) cores.
[Diagram: Streaming Multiprocessor (SM) architecture]
CUDA: C-language Compute Unified Device Architecture
• In a typical CUDA program, sequential host instructions are combined with parallel GPU code.
• In CUDA, the GPU is activated by writing device functions in C language, called kernels:
  – when a kernel is invoked by the CPU, a number of threads (typically several thousands) execute the kernel code in parallel on different data;
  – threads are organized in blocks, and blocks in a grid.
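The block/thread organization above can be illustrated with a small CPU emulation (our sketch, not from the presentation; the names `launch_kernel` and `threads_per_block` are hypothetical):

```python
# Emulate, sequentially on the CPU, how a 1-D CUDA kernel launch maps
# blocks and threads onto the cells of the automaton.

def launch_kernel(kernel, n_cells, threads_per_block):
    """Run `kernel` once per cell, mimicking a CUDA grid of thread blocks."""
    n_blocks = (n_cells + threads_per_block - 1) // threads_per_block
    for block_idx in range(n_blocks):
        for thread_idx in range(threads_per_block):
            # Same index computation a CUDA kernel performs:
            # cell = blockIdx.x * blockDim.x + threadIdx.x
            cell = block_idx * threads_per_block + thread_idx
            if cell < n_cells:  # guard: the last block may have spare threads
                kernel(cell)

visited = []
launch_kernel(lambda cell: visited.append(cell), n_cells=10, threads_per_block=4)
# visited now contains every cell index exactly once
```

On a real GPU the two loops are replaced by hardware scheduling, so all cells are processed concurrently.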
CUDA
• The GPU can access different types of memory.
• The device global memory can deliver significantly (e.g. one order of magnitude) higher memory bandwidth than the main computer memory.
• Unfortunately, the GPU global memory is linked to the host through a relatively slow bus, which makes CPU–GPU transfers expensive.
Two GPGPU-accelerated models for simulating land-use dynamics
• Two versions of a typical Cellular Automata (CA) model for land-use dynamics have been parallelized for the GPU:
  – a constrained cellular automata model (CCA);
  – the corresponding unconstrained version (UCA).
• Both models are based on the well-known concept of transition potential:
  – in the CCA, the aggregate level of demand for every land use is fixed by an exogenous constraint at each time step;
  – in the UCA, the number of cells in a given state at each time step depends only on the internal model parameters and model structure.
CCA and UCA simulation of land use change
[Diagram: planning regulation, suitability, accessibility, etc. and the neighbourhood effect (interactions between urban functions) feed the transition potentials; together with the land-use requests in the area, these transform the land use at time t into the land use at time t+1]
CCA and UCA simulation of land use change
• Step 1, for both UCA and CCA:
  – transition potential computation (on a local basis).
• Step 2 for UCA (on a local basis):
  – of all the possible land uses, a cell is transformed into the one having the highest transition potential.
• Step 2 for CCA (on a non-local basis):
  – each cell is transformed into the state with the highest potential, given the constraint on the overall number of cells in each state imposed by the exogenous trend for that step.
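Step 2 of the UCA can be sketched as a per-cell argmax over the transition potentials (an illustration assuming the potentials of Step 1 are already available as a mapping; not the paper's actual code):

```python
# Step 2 of the UCA: each cell independently adopts the land use with
# the highest transition potential. Because no cell depends on any
# other, this step maps directly onto one GPU thread per cell.

def uca_step(potentials):
    """potentials: {cell: {land_use: potential}} -> {cell: land_use}."""
    return {cell: max(pots, key=pots.get) for cell, pots in potentials.items()}

potentials = {
    0: {"residential": 0.9, "industrial": 0.2},
    1: {"residential": 0.1, "industrial": 0.7},
}
new_state = uca_step(potentials)
# cell 0 becomes residential, cell 1 industrial
```

The CCA's Step 2 cannot be written this way, because the exogenous constraint couples all the cells together.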
GPGPU Parallelization with CUDA: design choices
• One or more CUDA threads are assigned to each cell of the automaton:
  – to define the kernels, a key step consists of identifying all the sets of instructions that can be executed independently of each other on the different cells of the automaton.
• Most of the automaton data is stored in the GPU global memory. This involves:
  – a CPU-to-GPU memory copy before the beginning of the simulation and a GPU-to-CPU memory copy at the end of the simulation;
  – at the end of each CA step, a device-to-device memory copy to re-initialise the current values of the CA state with the next values.
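The two-buffer scheme described above can be sketched as follows (a CPU stand-in with hypothetical names; on the GPU the inner loop is one thread per cell and the swap is a device-to-device copy):

```python
# Double buffering for a CA: a "current" and a "next" state buffer; at
# the end of each step the next values re-initialise the current ones.

def run_simulation(initial_state, step_fn, n_steps):
    current = list(initial_state)      # stands in for the GPU "current" buffer
    nxt = [None] * len(current)        # stands in for the GPU "next" buffer
    for _ in range(n_steps):
        for i in range(len(current)):  # on the GPU: one thread per cell
            nxt[i] = step_fn(current, i)
        # Stands in for the device-to-device copy (a pointer swap
        # achieves the same effect without moving data):
        current, nxt = nxt, current
    return current

# Toy rule: each cell copies the value of its left neighbour (wrapping).
state = run_simulation([1, 2, 3, 4], lambda s, i: s[i - 1], n_steps=1)
```

Writing into a separate "next" buffer is what keeps every cell's update independent of the order in which the threads happen to run.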
GPGPU Parallelization of UCA
• In the UCA model, the computation performed at each step by each cell consists of two phases:
  1. the computation of the transition potentials;
  2. the assignment of a new land use.
• Since both can be carried out independently for each cell, they were included in a single kernel, thus avoiding the overhead of invoking an additional kernel.
GPGPU Parallelization of CCA
• Also in the CCA, each cell computes its transition potential. However, the downwards scanning of the list of cells, ranked according to their highest potential (lines 4-5), must be carried out in list order, one cell at a time (inherently sequential).
• As a land-use demand is satisfied, a new ranking of the cells must be performed before any further cell transition.
• The constraints on the total number of cells represent a strong dependency between the cells.
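The sequential allocation described above can be sketched roughly as follows (our simplified illustration, not the paper's listing: instead of re-ranking after each satisfied demand, exhausted land uses are simply skipped during the scan):

```python
# Sequential constrained allocation: rank all (cell, land_use) pairs by
# decreasing potential, then scan downwards assigning cells until each
# land-use demand is satisfied. The scan is inherently order-dependent.

def constrained_allocation(potentials, demand):
    """potentials: {cell: {land_use: value}}; demand: {land_use: n_cells}."""
    remaining = dict(demand)
    assignment = {}
    ranked = sorted(
        ((p, cell, use) for cell, pots in potentials.items()
         for use, p in pots.items()),
        reverse=True,
    )
    for p, cell, use in ranked:  # downwards scan, one entry at a time
        if cell not in assignment and remaining.get(use, 0) > 0:
            assignment[cell] = use
            remaining[use] -= 1
    return assignment

demand = {"residential": 1, "industrial": 1}
potentials = {
    0: {"residential": 0.9, "industrial": 0.8},
    1: {"residential": 0.7, "industrial": 0.3},
}
alloc = constrained_allocation(potentials, demand)
# cell 0 takes the single residential slot, so cell 1 falls back to industrial
```

Note how cell 1's outcome depends on cell 0 having been processed first: this is the dependency that prevents a naive one-thread-per-cell parallelization.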
GPGPU Parallelization of CCA
• A different constrained allocation procedure has been devised, which better exploits the GPU while maintaining the essential characteristics of the original constrained approach.
• The proposed parallel constrained allocation tries to process in parallel blocks of cells that have their highest potential for the same land use.
• More details of the algorithm are given in the paper.
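The grouping step mentioned above might look roughly like this (our hedged sketch only; the paper's actual algorithm differs in its details, which are not reproduced here):

```python
# Partition the cells by the land use of their highest potential, so
# that each group can be ranked and truncated to its exogenous demand
# independently of the other groups, i.e. in parallel.

from collections import defaultdict

def group_by_best_use(potentials):
    """potentials: {cell: {land_use: value}} -> {land_use: [cells]}."""
    groups = defaultdict(list)
    for cell, pots in potentials.items():
        groups[max(pots, key=pots.get)].append(cell)
    return dict(groups)

groups = group_by_best_use({
    0: {"residential": 0.9, "industrial": 0.2},
    1: {"residential": 0.8, "industrial": 0.1},
    2: {"residential": 0.1, "industrial": 0.6},
})
# two groups: residential {0, 1} and industrial {2}
```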
Computational results: hardware
• The sequential UCA and CCA reference versions were run on a desktop computer equipped with a 2.66 GHz Intel Core 2 Quad CPU.
• The parallel versions were run on the following GPUs:
Computational results: test cases
• Two different datasets:
  – the first concerns the area of the city of Florence and is composed of 242 × 151 cells of 100 m size;
  – the second represents the urban area of Athens and is composed of 321 × 391 cells of 100 m size;
  – 30 simulation steps (i.e. 30 years of future land-use projection);
  – for the CCA, a constant 3% increment, referred to the initial number of cells, was adopted as the constraint for each active land use.
• In both the CCA and UCA, the effort involved in the computation of transition potentials is almost proportional to the number of neighbouring cells:
  – for this reason, three different neighbourhood radii were considered, namely r = 10, r = 15 and r = 20 cells.
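Since the per-cell cost grows with the number of neighbours, it scales roughly with r². A quick check (our illustration, assuming a circular neighbourhood of Euclidean radius r on the square grid):

```python
# Count grid cells within Euclidean distance r of a centre cell: this
# is the number of neighbours each cell's potential computation visits.

def neighbourhood_size(r):
    return sum(
        1
        for dx in range(-r, r + 1)
        for dy in range(-r, r + 1)
        if dx * dx + dy * dy <= r * r
    )

sizes = {r: neighbourhood_size(r) for r in (10, 15, 20)}
# r = 20 involves roughly four times as many cells as r = 10,
# so doubling the radius roughly quadruples the per-cell work.
```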
Computational results: conclusions
• The gain in terms of computing time is impressive.
• As expected, the speedup of the UCA model was always superior to that achieved with the CCA model.
• Improvements are still possible, since not all typical GPGPU optimization strategies have been implemented and more powerful GPUs are available.
• The main advantage lies in enabling an accurate calibration, which might otherwise not be possible in some cases involving models operating at regional or continental scale.