FAST MAP PROJECTION ON CUDA. Yanwei Zhao, Institute of Computing Technology, Chinese Academy of Sciences. July 29, 2011
Outline Institute  of Computing Technology, Chinese Academy of  Sciences
Map Projection: Establishes the relationship between two different coordinate systems: geographical coordinates -> planar Cartesian map coordinates. The arithmetic operations are complicated and time-consuming. A fast answer with the desired accuracy is preferred over a slow exact answer. Map projection needs to be accelerated for interactive GIS scenarios.
GPGPU (general-purpose computing on graphics processing units): GPGPU is a young area of research. Advantages of the GPU: flexibility, processing power, low cost. GPGPU uses the GPU in applications other than 3D graphics; the GPU accelerates the critical path of the application.
CUDA (Compute Unified Device Architecture): NVIDIA's parallel computing architecture, with a C-based programming language and development toolkit. Advantages: the programmer can focus on the important issues rather than on an unfamiliar language, and can write efficient parallel code without graphics APIs.
Characteristics of Map Projection: a huge number of coordinates to handle; complex arithmetic operations; the requirement of a real-time response.
Our proposal: Use the new CUDA technology on the GPU, taking the Universal Transverse Mercator (UTM) projection as an example. Performance: 6x to 8x improvement when data transfer time is included; 70x to 90x speedup when it is not.
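To give a sense of the per-coordinate arithmetic that UTM involves, here is a minimal Python sketch of the forward projection for a single point. It uses the standard series form of the transverse Mercator equations. The WGS84 ellipsoid constants, the restriction to the northern hemisphere, and the function name utm_forward are assumptions made for illustration; the slides do not state which datum or formulation the CUDA kernel uses.

```python
import math

# WGS84 ellipsoid constants (an assumption; the slides do not name a datum).
A_AXIS = 6378137.0                 # semi-major axis in metres
FLATTENING = 1.0 / 298.257223563
K0 = 0.9996                        # UTM scale factor on the central meridian
FALSE_EASTING = 500000.0           # metres


def utm_forward(lat_deg, lon_deg, central_meridian_deg):
    """Project one geographic coordinate to UTM (easting, northing), northern hemisphere."""
    e2 = FLATTENING * (2.0 - FLATTENING)      # first eccentricity squared
    ep2 = e2 / (1.0 - e2)                     # second eccentricity squared
    phi = math.radians(lat_deg)
    dlam = math.radians(lon_deg - central_meridian_deg)

    sin_phi, cos_phi, tan_phi = math.sin(phi), math.cos(phi), math.tan(phi)
    n = A_AXIS / math.sqrt(1.0 - e2 * sin_phi ** 2)   # prime-vertical radius of curvature
    t = tan_phi ** 2
    c = ep2 * cos_phi ** 2
    a = dlam * cos_phi

    # Meridian arc length from the equator to latitude phi.
    m = A_AXIS * (
        (1 - e2 / 4 - 3 * e2 ** 2 / 64 - 5 * e2 ** 3 / 256) * phi
        - (3 * e2 / 8 + 3 * e2 ** 2 / 32 + 45 * e2 ** 3 / 1024) * math.sin(2 * phi)
        + (15 * e2 ** 2 / 256 + 45 * e2 ** 3 / 1024) * math.sin(4 * phi)
        - (35 * e2 ** 3 / 3072) * math.sin(6 * phi)
    )

    easting = FALSE_EASTING + K0 * n * (
        a
        + (1 - t + c) * a ** 3 / 6
        + (5 - 18 * t + t ** 2 + 72 * c - 58 * ep2) * a ** 5 / 120
    )
    northing = K0 * (
        m
        + n * tan_phi * (
            a ** 2 / 2
            + (5 - t + 9 * c + 4 * c ** 2) * a ** 4 / 24
            + (61 - 58 * t + t ** 2 + 600 * c - 330 * ep2) * a ** 6 / 720
        )
    )
    return easting, northing
```

Every input point goes through the same sequence of trigonometric and polynomial evaluations with no dependence on any other point, which is what makes the projection a natural fit for one-thread-per-coordinate execution.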
Algorithm framework: two task assignment schemes, striped partitioning and matrix distribution.
Striped partitioning: Define the number of blocks and threads as Block_num and Thread_num. CUDA built-in variables: gridDim, blockDim. Number of geographic features: fn. Number of features each block processes: fn/gridDim.x.
Striped partitioning, outer loop (blocks and features): a block starts at Feature[i], where i = blockIdx.x*(fn/gridDim.x) (1), and moves on to the next Feature[k], where k = i + fn/gridDim.x (2).
Striped partitioning, inner loop (threads and coordinates): a thread starts at coord[j], where j = threadIdx.x (1), and moves on to the next coord[k], where k = j + Thread_num (2).
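The index arithmetic of striped partitioning can be checked on the CPU before writing a kernel. The Python sketch below emulates which (feature, coordinate) pairs each (block, thread) pair would visit. It is not the author's kernel. It reads equation (2) of the outer loop as the end of the block's contiguous stripe of features, and it lets the last block absorb any remainder when fn is not divisible by the grid size; both readings are assumptions. The function name striped_assignment and the coord_counts input are hypothetical.

```python
def striped_assignment(coord_counts, grid_dim, thread_num):
    """Return {(block, thread): [(feature, coord), ...]} for a striped launch.

    coord_counts[f] is the number of coordinates in feature f (hypothetical input).
    """
    fn = len(coord_counts)
    per_block = fn // grid_dim                 # fn / gridDim.x from the slide
    work = {}
    for block in range(grid_dim):              # emulates blockIdx.x
        first = block * per_block              # outer loop, equation (1)
        last = first + per_block               # outer loop, equation (2), read as stripe end
        if block == grid_dim - 1:
            last = fn                          # assumption: last block takes the remainder
        for thread in range(thread_num):       # emulates threadIdx.x
            items = []
            for feature in range(first, last):
                # Inner loop: start at j = threadIdx.x, then step by Thread_num.
                for j in range(thread, coord_counts[feature], thread_num):
                    items.append((feature, j))
            work[(block, thread)] = items
    return work
```

Because neighbouring threads in a block start at neighbouring values of j and advance by the same stride, at every step of the inner loop the threads of a block touch adjacent coordinates of the same feature.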
Matrix distribution: Define the number of blocks and threads as grid(br,bc) and block(tr,tc). Each block runs k features, where k is given by equation (1) and Feature[i] by equations (2) and (3) [equations missing from the extracted text].
Matrix distribution: Each block runs s coordinates, where s is given by equation (1), followed by the expression for coord[j] [equations missing from the extracted text].
Experiment Environment: Hardware: CPU: Intel Core2 Duo E8500 at 3.18GHz with 2GB of internal memory; GPU: NVIDIA GeForce 9800 GTX+ graphics card with 512MB of memory, 128 CUDA cores and 16 multiprocessors. Software: Microsoft Windows XP Pro SP2; Microsoft Visual Studio 2005; NVIDIA driver 2.2, CUDA SDK 2.2 and CUDA toolkit 2.2.
The degree of data parallelism: The total CPU time consists of the initialization and file reading time plus the serial projection time. Map projection can achieve a degree of parallelism of more than 90 percent.
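If the 90 percent figure is read as the share of total CPU time spent in the projection step, Amdahl's law gives an upper bound on the whole-program speedup. Amdahl's law is not mentioned in the slides; it is brought in here only to illustrate the arithmetic, and the function name amdahl_speedup is hypothetical.

```python
def amdahl_speedup(parallel_fraction, parallel_speedup):
    """Overall speedup when only parallel_fraction of the run time is accelerated."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)


# With 90 percent of the time parallelizable, even an infinitely fast
# projection step leaves the serial 10 percent, so the bound is about 10x.
upper_bound = amdahl_speedup(0.9, float("inf"))
```

Under this reading, initialization and file reading limit the whole-program gain no matter how fast the GPU kernel is.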
Comparison with CPU: Block_num = 64, Thread_num = 512.
Comparison with CPU: Total time = map projection time + data transfer time.
Comparison with CPU: When the total time is considered, the GPU version is 6x to 8x faster.
Comparison with CPU: When only the map projection time is compared, the speedup is 70x to 90x.
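The gap between the two figures follows from the total-time definition: once the kernel itself is very fast, the data transfer dominates. The Python sketch below works through that arithmetic. The timings in the example are hypothetical and chosen only to land in the reported ranges; they are not measurements from the slides, and the function name end_to_end_speedup is an assumption.

```python
def end_to_end_speedup(cpu_projection_time, kernel_speedup, transfer_time):
    """Speedup of (GPU projection + transfer) relative to CPU projection alone."""
    gpu_projection_time = cpu_projection_time / kernel_speedup
    total_gpu_time = gpu_projection_time + transfer_time   # total time as defined above
    return cpu_projection_time / total_gpu_time


# Hypothetical numbers: 1.0 s on the CPU, an 80x faster kernel, 0.13 s of transfer.
example = end_to_end_speedup(1.0, 80.0, 0.13)   # roughly 7x
```

In this example the kernel takes 0.0125 s while the transfer takes 0.13 s, so more than 90 percent of the GPU-side total is spent moving data.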
The performance of different task assignments: Striped partitioning: Block_num = 64, Thread_num = 512. Matrix distribution: dim_grid(32,32) = 32*32 blocks, dim_block(256,256) = 256*256 threads. Speedup: striped 6x to 8x, matrix 4x to 6x.
The performance of different task assignments: Striped: all threads in the block access consecutive memory. Matrix: it can only ensure that each row of threads in the block handles consecutive data.
Conclusion and Future work: We implemented a fast map projection method on CUDA-enabled GPUs and obtained a high speedup compared with the CPU-based method. The power of modern GPUs can considerably speed up computation in the geosciences, for example DEM-based spatial interpolation and raster-based spatial analysis. Future work: GPU implementations of other GIS applications.
Thank you! Q & A. Yanwei Zhao, Institute of Computing Technology. Contact: zhaoyanwei@ict.ac.cn
