Solving Assorted Little Problems on the GPU
                 Terumi YAMADA
Self-introduction

•   Terumi Yamada (in training)

•   Big fan of SIMD

    •   Twitter: telmin_orca
Contents
•   Self-introduction

•   Lead-in

•   Solving the Traveling Salesman Problem

•   Running Aobench

•   Summary
Lead-in
OpenCL
What is OpenCL?

  Stealth marketing
What is OpenCL?

Heterogeneous
Heterogeneous?
NVIDIA

•   GeForce GTX 580

    •   Fermi

    •   512 CUDA cores

    •   3GB RAM

    •   PCIe 2.0
AMD

•   Radeon HD 7970

    •   GCN

    •   2048 stream processors

    •   3GB RAM

    •   PCIe 3.0
HOST

•   Intel Core i7-2600K

    •   Sandy Bridge

    •   8GB RAM
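Solving the Traveling Salesman Problem
The Traveling Salesman Problem?
Solution methods

•   Genetic algorithms

•   Ant colony optimization

•   μ-opt methods

    •   LK (Lin-Kernighan) method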
2-opt method

[Figure: two tour diagrams with nodes i, k, l, j, before and after a 2-opt exchange]
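A 2-opt move removes two edges of the tour and reconnects their four endpoints the other way, which is the same as reversing the segment of the tour between them. The following is a minimal Python sketch of that idea, assuming Euclidean city coordinates and a tour stored as a list of city indices. The function names and the data layout are illustrative assumptions, not taken from the talk.

```python
import math

def dist(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def tour_length(cities, tour):
    """Length of the closed tour that visits the cities in the given order."""
    n = len(tour)
    return sum(dist(cities[tour[m]], cities[tour[(m + 1) % n]]) for m in range(n))

def two_opt_delta(cities, tour, a, b):
    """Change in tour length when edges (tour[a], tour[a+1]) and
    (tour[b], tour[b+1]) are replaced by (tour[a], tour[b]) and
    (tour[a+1], tour[b+1]). Requires 0 <= a < b < len(tour)."""
    n = len(tour)
    p, q = cities[tour[a]], cities[tour[a + 1]]
    r, s = cities[tour[b]], cities[tour[(b + 1) % n]]
    return dist(p, r) + dist(q, s) - dist(p, q) - dist(r, s)

def two_opt_apply(tour, a, b):
    """Apply the exchange by reversing the segment tour[a+1 .. b]."""
    return tour[:a + 1] + tour[a + 1:b + 1][::-1] + tour[b + 1:]

def two_opt(cities, tour):
    """Repeat improving exchanges until no exchange shortens the tour."""
    improved = True
    while improved:
        improved = False
        for a in range(len(tour) - 1):
            for b in range(a + 1, len(tour)):
                # Small tolerance so rounding noise is not treated as a gain.
                if two_opt_delta(cities, tour, a, b) < -1e-12:
                    tour = two_opt_apply(tour, a, b)
                    improved = True
    return tour
```

For example, on the four corners of a unit square, a tour that crosses itself along both diagonals is untangled by a single exchange.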
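Parallel 2-opt

•   "Adaptation of the SIMD 2-opt Method to GPGPU and Its Evaluation"

    •   74th National Convention of the Information Processing Society of Japan (IPSJ), GPU session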
What's the heavy part?

[Figure: the 2-opt exchange diagram again, with nodes i, k, l, j]
CPU -> GPU

•   Tour length calculation

•   Shortest route selection

•   Shortest route exchange
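The three stages named above fit a common data-parallel pattern: evaluate every candidate exchange independently, reduce the results to the single best candidate, then apply that one exchange. The sketch below runs the stages sequentially in Python to show the data flow. The slide does not say how the work is divided between CPU and GPU, so the mapping of stages to functions, the function names, and the "gain" formulation are all assumptions made for illustration.

```python
import math

def dist(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def exchange_gains(cities, tour):
    """Stage 1 (tour length calculation): for every candidate pair (a, b),
    compute how much shorter the tour gets if that 2-opt exchange is applied.
    Each entry is independent of the others, so on a GPU each one could be
    computed by its own work-item."""
    n = len(tour)
    gains = []
    for a in range(n - 1):
        for b in range(a + 1, n):
            p, q = cities[tour[a]], cities[tour[a + 1]]
            r, s = cities[tour[b]], cities[tour[(b + 1) % n]]
            gain = dist(p, q) + dist(r, s) - dist(p, r) - dist(q, s)
            gains.append((gain, a, b))
    return gains

def select_best(gains):
    """Stage 2 (shortest route selection): a max-reduction over all gains."""
    return max(gains)

def exchange(tour, a, b):
    """Stage 3 (shortest route exchange): reverse the segment tour[a+1 .. b]."""
    return tour[:a + 1] + tour[a + 1:b + 1][::-1] + tour[b + 1:]

def improve_once(cities, tour):
    """One full pass: evaluate all candidates, pick the best, apply it."""
    gain, a, b = select_best(exchange_gains(cities, tour))
    # Leave the tour unchanged when no exchange gives a real improvement.
    return exchange(tour, a, b) if gain > 1e-12 else tour
```

Stage 1 grows quadratically with the number of cities while stages 2 and 3 touch far less data, which is one plausible reason to move the evaluation to a GPU first.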
Result
            CPU       NVIDIA     AMD

100,000     152.241   114.02     2472.06

120,000     235.05    168.58     3487.41

140,000     296.395   266.211    -

160,000     427.161   328.547    -
…?
                            \(^o^)/
Running Aobench
Aobench?

•   Ambient Occlusion benchmark

    •   Created by @syoyo

    •   A floating-point arithmetic benchmark
Ambient Occlusion

•   Global Illumination

    •   Indirect light

    •   Fairly heavy
What's the heavy part?

•   Intersection

    •   Sphere * 3 + Plane = 4

        •   AO sample 64 * 64 = 256
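The intersection work comes from testing every ray against each of the four objects. Below is a minimal Python sketch of the two tests involved, a ray against a sphere and a ray against a plane, using the standard closed-form solutions. The function names and the tuple-based vectors are assumptions for illustration, and this is not Aobench's actual source.

```python
import math

def dot(a, b):
    """Dot product of two 3-component vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def sub(a, b):
    """Component-wise difference a - b."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def ray_sphere(org, direction, center, radius):
    """Distance along the ray to the nearest sphere hit, or None if it misses.
    `direction` is assumed to be normalized."""
    rs = sub(org, center)
    b = dot(rs, direction)
    c = dot(rs, rs) - radius * radius
    disc = b * b - c  # discriminant of the quadratic in t
    if disc <= 0.0:
        return None
    t = -b - math.sqrt(disc)
    return t if t > 0.0 else None

def ray_plane(org, direction, point, normal):
    """Distance along the ray to the plane through `point`, or None."""
    denom = dot(direction, normal)
    if abs(denom) < 1e-17:
        return None  # ray is parallel to the plane
    t = dot(sub(point, org), normal) / denom
    return t if t > 0.0 else None

def nearest_hit(org, direction, spheres, plane):
    """Test one ray against every object and keep the closest hit."""
    hits = [ray_sphere(org, direction, c, r) for c, r in spheres]
    hits.append(ray_plane(org, direction, plane[0], plane[1]))
    hits = [t for t in hits if t is not None]
    return min(hits) if hits else None
```

Because every ambient occlusion sample ray repeats these tests for all objects, and the rays do not depend on one another, the workload maps naturally onto one GPU work-item per pixel or per sample.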
CPU -> GPU
Result

                          CPU      NVIDIA   AMD

 256 * 256,   64 * 64     6.30     0.057    0.061

 512 * 512,   64 * 64     24.58    0.213    0.131

1024 * 1024,  64 * 64     96.735   0.831    0.4462
[ASCII art captioned "I have no idea what is going on"]
Device type: Unknown
                                         ???
Max resource 2D width/height: 16384/16384
Total GPU memory size: 3072 MB
Total CPU cached space size: 508 MB
Total CPU uncached space size: 1788 MB
GPU engine clock: 925 MHz
GPU memory clock: 1375 MHz
Number of timing loops: 100
[      16 bytes] CPU->GPU= 800.000 KB/sec, GPU->CPU 533.333 KB/sec
[      32 bytes] CPU->GPU= 1.600 MB/sec, GPU->CPU 1.067 MB/sec
[      64 bytes] CPU->GPU= 2.133 MB/sec, GPU->CPU 2.133 MB/sec
[     128 bytes] CPU->GPU= 2.560 MB/sec, GPU->CPU 4.267 MB/sec
[     256 bytes] CPU->GPU= 8.533 MB/sec, GPU->CPU 8.533 MB/sec
[     512 bytes] CPU->GPU= 17.067 MB/sec, GPU->CPU 25.600 MB/sec
[    1024 bytes] CPU->GPU= 51.200 MB/sec, GPU->CPU 34.133 MB/sec
[    2048 bytes] CPU->GPU= 102.400 MB/sec, GPU->CPU 68.267 MB/sec
[    4096 bytes] CPU->GPU= 204.800 MB/sec, GPU->CPU 204.800 MB/sec
[    8192 bytes] CPU->GPU= 409.600 MB/sec, GPU->CPU 409.600 MB/sec
[ 16384 bytes] CPU->GPU= 409.600 MB/sec, GPU->CPU 819.200 MB/sec
[ 32768 bytes] CPU->GPU= 1.638 GB/sec, GPU->CPU 1.638 GB/sec
[ 65536 bytes] CPU->GPU= 2.185 GB/sec, GPU->CPU 3.277 GB/sec
...
[ 4194304 bytes] CPU->GPU= 6.658 GB/sec, GPU->CPU 4.033 GB/sec
[ 8388608 bytes] CPU->GPU= 6.658 GB/sec, GPU->CPU 3.884 GB/sec
[ 16777216 bytes] CPU->GPU= 6.684 GB/sec, GPU->CPU 3.233 GB/sec
[ 33554432 bytes] CPU->GPU= 6.697 GB/sec, GPU->CPU 2.993 GB/sec
[ 67108864 bytes] CPU->GPU= 6.697 GB/sec, GPU->CPU 2.870 GB/sec
[ 134217728 bytes] CPU->GPU= 6.704 GB/sec, GPU->CPU 2.789 GB/sec
[ 268435456 bytes] CPU->GPU= 6.699 GB/sec, GPU->CPU 2.767 GB/sec
[ 536870912 bytes] CPU->GPU= 6.705 GB/sec, GPU->CPU 2.797 GB/sec
[1073741824 bytes] CPU->GPU= 6.705 GB/sec, GPU->CPU 2.771 GB/sec
calResAllocRemote2D() returned an error when trying to allocate 1874853888 bytes (uncached)!
Peak CPU->GPU Bandwidth = 6.705 GB/sec [data size = 536870912 bytes]
Peak GPU->CPU Bandwidth = 4.369 GB/sec [data size = 131072 bytes]
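One way to read the log above is to convert each bandwidth figure back into the average time of a single transfer. The sketch below does that for the smallest and the largest entries. It assumes the tool reports decimal units (1 KB = 1000 bytes, 1 GB = 10^9 bytes), which the log itself does not state, and the function name is made up for this example.

```python
def transfer_time_us(size_bytes, bytes_per_sec):
    """Average duration of one transfer, in microseconds."""
    return size_bytes / bytes_per_sec * 1e6

# Assumption: decimal units, i.e. 800.000 KB/sec = 800e3 bytes/sec.
small = transfer_time_us(16, 800.0e3)          # 16-byte CPU->GPU entry
large = transfer_time_us(1073741824, 6.705e9)  # 1 GiB CPU->GPU entry

print(f"16 bytes: {small:.1f} us per transfer")
print(f"1 GiB:    {large / 1e6:.3f} s per transfer")
```

Under that assumption a 16-byte copy costs roughly 20 microseconds, about as much as copying over 100 KB at the peak rate, so small transfers are dominated by fixed per-transfer overhead rather than by link bandwidth.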
????
GeForce GTX 580

Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory, direct access
 Transfer Size (Bytes)    Bandwidth(MB/s)
 33554432               5561.7

Device to Host Bandwidth, 1 Device(s), Paged memory, direct access
 Transfer Size (Bytes)    Bandwidth(MB/s)
 33554432               5466.2

Device to Device Bandwidth, 1 Device(s)
 Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               138261.9
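?????
__kernel void map_test(__global int* src,__global int* dst,const int limit)
{
    // One work-item per element.
    int id = get_global_id(0);

    // Skip work-items past the end of the data; id == limit is already out of range.
    if(id >= limit) return;

    // Copy src into dst in reverse order.
    dst[id] = src[limit - 1 - id];
}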
1000 ~
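To make the kernel's behavior concrete, here is a host-side Python emulation that runs the kernel body once per work-item id. It assumes a one-dimensional launch whose global size may be rounded up past `limit`, as is common when the global size must be a multiple of the work-group size. The `global_size` parameter and the loop are illustrative stand-ins for the OpenCL runtime, not part of the original code.

```python
def map_test(src, dst, limit, global_size):
    """Emulate the kernel: each work-item id writes one element of dst."""
    for gid in range(global_size):  # stands in for get_global_id(0)
        # Work-items at or past `limit` must do nothing. With id == limit the
        # kernel body would read src[-1] and write dst[limit], both outside
        # a buffer of `limit` elements.
        if gid >= limit:
            continue
        dst[gid] = src[limit - 1 - gid]

src = [1, 2, 3, 4, 5]
dst = [0] * len(src)
map_test(src, dst, limit=len(src), global_size=8)
print(dst)
```

The result is `src` copied in reverse order. Each work-item does one load and one store, so the kernel is very small and its measured time consists largely of launch overhead.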
!

           NVIDIA      AMD

1000       0.355824    1.70634
           0.16186     0.7224

10000      3.54601     14.1305
           1.697       6.1982

100000     35.4747     128.583
           16.213      58.0289
Summary
•   For GPGPU, go with the GeForce GTX 580

•   As for the Radeon HD 7970…

    •   A slow starter with a bomb in its leg

        •   If the kernel gets bigger…
