Performing SQL with SSD-to-GPU P2P Transfer
かぴばらの旦那 / Herr.Wasserschwein
<kaigai@kaigai.gr.jp>
The PG-Strom Project
Feedback from the PG-Strom v1.0 development
[Diagram: PostgreSQL internals (SQL Parser, Query Optimizer, Query Executor, Storage Manager) between Application and Storage, with the PG-Strom extension offloading work from the query executor to the GPU]
Compute-intensive workloads
• statistics, scientific computing, marketing, etc.
 handled by PL/CUDA + Matrix-Array
I/O-intensive workloads
• DWH, ETL, reporting, etc.
 handled by SSD-to-GPU P2P DMA
Architecture of an x86 server
[Diagram: CPU with RAM; NVMe-SSD (PCIe x4~x8) and GPU (PCIe x16) attached to the PCI bus; other slow devices behind the PCH]
Architecture of an x86 server
[Diagram: on I/O READ, disk blocks are copied from the NVMe-SSD into the disk buffer in RAM; the SSD's catalog spec is 2GB~6GB/s]
Architecture of an x86 server
[Diagram: the same architecture, with data staged in the disk buffer in RAM before it can reach the GPU]
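For contrast, here is a minimal sketch of that conventional path, assuming the CUDA driver API and a plain buffered read(2); the file path and chunk size are illustrative. Every block passes through the page cache and a host buffer before a second copy moves it over PCIe to the GPU.

#include <cuda.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    const size_t chunk = 32UL << 20;        /* read 32MB at a time          */
    CUdevice dev; CUcontext ctx; CUdeviceptr dbuf;
    void *hbuf;
    ssize_t nread;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&dbuf, chunk);               /* GPU device memory            */
    cuMemAllocHost(&hbuf, chunk);           /* pinned host-side disk buffer */

    int fd = open("/path/to/heap_file", O_RDONLY);   /* illustrative path   */

    /* Hop 1: SSD -> page cache -> host buffer (consumes RAM bandwidth)     */
    while ((nread = read(fd, hbuf, chunk)) > 0) {
        /* Hop 2: host buffer -> GPU over PCIe x16 (a second copy)          */
        cuMemcpyHtoD(dbuf, hbuf, (size_t)nread);
        /* ... process the chunk on the GPU ...                             */
    }
    close(fd);
    return 0;
}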
What I want to do
[Diagram: what I want instead: disk blocks are transferred directly from the NVMe-SSD to the GPU over the PCI bus, and only the result buffer is written back to RAM]
What I want to do
[Diagram: large PostgreSQL tables are read from the NVMe-SSD and small inner tables come from RAM; WHERE-clause filtering, JOIN and GROUP BY run on the GPU, making the data size much smaller]
What I want to do
[Diagram: the same picture; the work up to this point (WHERE / JOIN / GROUP BY on the GPU) is already completed]
Element technology: GPUDirect RDMA by NVIDIA
 An API to map GPU device memory into the physical address space of the host system
 GPU device memory can thus be specified as the destination address of a DMA transfer from the storage
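A rough kernel-side sketch of what that mapping step looks like with the nv-p2p.h interface shipped with the NVIDIA driver sources. The helper below and its logging are illustrative only; check the exact prototypes and struct layout against the driver version you build against.

#include <linux/kernel.h>
#include "nv-p2p.h"   /* shipped with the NVIDIA driver sources */

static struct nvidia_p2p_page_table *pt = NULL;

static void gpu_mem_free_callback(void *data)
{
    /* Called by the NVIDIA driver if the mapping is torn down behind us. */
    nvidia_p2p_free_page_table(pt);
    pt = NULL;
}

/* Illustrative helper: pin a CUDA device memory region (allocated in user
 * space with cuMemAlloc) so its physical pages can be handed to the NVMe
 * driver as DMA destinations. */
static int map_gpu_memory(uint64_t gpu_vaddr, uint64_t length)
{
    int i, rc;

    rc = nvidia_p2p_get_pages(0, 0, gpu_vaddr, length, &pt,
                              gpu_mem_free_callback, NULL);
    if (rc)
        return rc;

    for (i = 0; i < pt->entries; i++) {
        /* pt->pages[i]->physical_address is what gets programmed into the
         * SSD's DMA engine as the destination address. */
        pr_info("GPU page %d at phys 0x%llx\n", i,
                (unsigned long long)pt->pages[i]->physical_address);
    }
    return 0;
}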
NVMe-Strom driver
[Diagram series: the software stack, built up step by step: PostgreSQL with pg-strom in user space; VFS, page cache, the NVMe SSD driver, the nvidia driver and the NVMe-Strom module (exposed as /proc/nvme-strom) in kernel space; GPU device memory below. The ordinary read(2) path through the VFS and page cache is shown for comparison.]
 pg-strom allocates GPU device memory with cuMemAlloc().
 The GPU device memory region is registered with NVMe-Strom through ioctl(2) on /proc/nvme-strom, which maps it via the nvidia driver.
 pg-strom hands over the file offset of the blocks to read, and it is translated into the underlying block numbers.
 NVMe-Strom issues a DMA request to the NVMe SSD driver for those blocks.
 The blocks land directly in GPU device memory by SSD-to-GPU peer-to-peer DMA.
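A compressed user-space sketch of that sequence. The ioctl command numbers and argument structs below are placeholders made up for illustration (the real definitions live in the nvme-strom module's headers), the file path is illustrative, and error handling is omitted.

#include <cuda.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <stdint.h>

/* Placeholder ioctl interface -- NOT the module's real definitions. */
#define NVME_STROM_IOCTL_MAP_GPU_MEMORY  _IOWR('S', 1, struct strom_map_arg)
#define NVME_STROM_IOCTL_MEMCPY_SSD2GPU  _IOWR('S', 2, struct strom_dma_arg)

struct strom_map_arg { uint64_t gpu_vaddr; uint64_t length; uint64_t handle; };
struct strom_dma_arg { uint64_t handle; int file_desc; uint64_t file_offset;
                       uint64_t nbytes; uint64_t gpu_offset; };

int main(void)
{
    CUdevice    dev;
    CUcontext   ctx;
    CUdeviceptr gbuf;
    size_t      chunk_sz = 32UL << 20;                /* one 32MB chunk     */

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&gbuf, chunk_sz);                      /* GPU device memory  */

    int strom_fd = open("/proc/nvme-strom", O_RDONLY);
    int data_fd  = open("/path/to/heap_file", O_RDONLY);  /* illustrative   */

    /* Register the GPU buffer with the NVMe-Strom module. */
    struct strom_map_arg map = { .gpu_vaddr = gbuf, .length = chunk_sz };
    ioctl(strom_fd, NVME_STROM_IOCTL_MAP_GPU_MEMORY, &map);

    /* Ask for SSD-to-GPU P2P DMA of the blocks at this file offset. */
    struct strom_dma_arg dma = {
        .handle = map.handle, .file_desc = data_fd,
        .file_offset = 0, .nbytes = chunk_sz, .gpu_offset = 0,
    };
    ioctl(strom_fd, NVME_STROM_IOCTL_MEMCPY_SSD2GPU, &dma);

    /* Blocks now sit in GPU device memory; launch the SQL kernels on them. */
    return 0;
}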
Raw I/O performance
 Six 32MB buffers were used; an asynchronous DMA was kicked each time a buffer became empty (see the sketch below)
 Environment
  CPU: Xeon E5-2670 v3, RAM: 64GB
  Intel SSD 750 (400GB; PCI-E x4)
  NVIDIA Tesla K20c (2,496 cores; 706MHz, 5GB GDDR5; 208GB/s)
  OS: CentOS 7 (3.10.0-327.18.2.el7.x86_64), Filesystem: Ext4
[Chart: measured raw I/O throughput; the SSD-to-GPU result reaches the catalog spec]
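To make the buffering scheme concrete, here is a minimal sketch of the round-robin pattern: six 32MB device buffers, each refilled asynchronously on its own stream as soon as its previous work has drained. cuMemcpyHtoDAsync stands in for the SSD-to-GPU DMA kick, which in the real code goes through the NVMe-Strom ioctl instead; chunk count and buffer filling are illustrative.

#include <cuda.h>

#define NBUFS   6
#define BUF_SZ  (32UL << 20)            /* 32MB per buffer */

int main(void)
{
    CUdevice dev; CUcontext ctx;
    CUdeviceptr dbuf[NBUFS];
    CUstream    strm[NBUFS];
    void       *hbuf[NBUFS];
    int i, chunk, nchunks = 256;        /* e.g. 8GB of source data */

    cuInit(0); cuDeviceGet(&dev, 0); cuCtxCreate(&ctx, 0, dev);
    for (i = 0; i < NBUFS; i++) {
        cuMemAlloc(&dbuf[i], BUF_SZ);
        cuStreamCreate(&strm[i], CU_STREAM_DEFAULT);
        cuMemAllocHost(&hbuf[i], BUF_SZ);   /* pinned staging buffer */
    }

    for (chunk = 0; chunk < nchunks; chunk++) {
        i = chunk % NBUFS;
        /* Wait until this buffer's previous transfer and kernels have
         * drained, i.e. the buffer is "empty" again, then immediately
         * kick the next asynchronous transfer into it. */
        cuStreamSynchronize(strm[i]);
        /* ... fill hbuf[i] with the next 32MB of blocks here ... */
        cuMemcpyHtoDAsync(dbuf[i], hbuf[i], BUF_SZ, strm[i]);
        /* ... launch the SQL kernels for this chunk on strm[i] ... */
    }
    for (i = 0; i < NBUFS; i++)
        cuStreamSynchronize(strm[i]);
    return 0;
}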
NVMe-SSD used for this measurement

 Capacity | Seq. Read (128KB) | Seq. Write (128KB) | Random Read (4KB) | Random Write (4KB) | Interface
 ---------+-------------------+--------------------+-------------------+--------------------+------------
 400GB    | 2,200MB/s         | 900MB/s            | 430,000 IOPS      | 230,000 IOPS       | PCIe 3.0 x4
 800GB    | 2,100MB/s         | 800MB/s            | 420,000 IOPS      | 210,000 IOPS       | PCIe 3.0 x4
 1.2TB    | 2,500MB/s         | 1,200MB/s          | 460,000 IOPS      | 290,000 IOPS       | PCIe 3.0 x4
 Besides the above, operation on a Samsung PM1725 NVMe SSD (1.6TB, 6GB/s) has also been reported.
  Raw-I/O SSD-to-GPU transfer was reported to reach 5,634MB/s on that device.
  https://github.com/kaigai/nvme-kmod/issues/1
SQL scan performance
▌What this measurement tells us
 Performance limit of the existing storage layer
   64GB / 140sec = 468MB/s  roughly a 20% overhead compared with the raw-I/O throughput (587MB/s)
 Improvement by NVMe-Strom
   Throughput: 64GB / 43sec = 1,524MB/s
[Chart: measured results, with the existing storage layer marked as the existing limit]
Queries used for the measurement

CREATE TABLE t_64g (id   int not null,
                    x    float not null,
                    y    float not null,
                    z    float not null,
                    memo text);
INSERT INTO t_64g (SELECT x, random()*1000, random()*1000,
                          random()*1000, md5(x::text)
                     FROM generate_series(1,700000000) x);

postgres=# \d+
                    List of relations
 Schema |  Name  | Type  | Owner  |  Size  | Description
--------+--------+-------+--------+--------+-------------
 public | t      | table | kaigai | 965 MB |
 public | t_64g  | table | kaigai | 66 GB  |

Query-1) Scan query with a simple WHERE-clause
  SELECT * FROM t WHERE x BETWEEN y-2 AND y+2;
Query-2) Scan query with a complicated WHERE-clause
  SELECT * FROM t_64g WHERE sqrt((x-200)^2 + (y-300)^2 + (z-400)^2) < 10;
Query-3) Scan query with text matching
  SELECT * FROM t WHERE memo LIKE '%abcd%';
Development roadmap
① NVMe-Strom driver: the basic functionality
  • Host mapping of GPU device memory, and issuing of P2P DMA requests for SSD-to-GPU transfer
② PG-Strom: integration of GpuScan + NVMe-Strom
  • Support of CPU+GPU hybrid parallelism on PostgreSQL v9.6, and peer-to-peer data loading by NVMe-Strom
③ PG-Strom: JOIN/GROUP BY support
  • Support of the new optimizer in PostgreSQL v9.6, and integration with GpuScan for simple scans
④ NVMe-Strom driver: RAID-0/1 support
  • Striping READ on RAID-0/1 volumes
⑤ Quality improvement and stabilization
  • Test, test, test, debug
⑥ PG-Strom v2.0!! (2017/2Q~3Q)
(We are here!)
Target for PG-Strom v2.0
[Diagram: two GPUs, each on a PCI-E x16 link (~10GB/s), each fed by dual NVMe-SSDs over PCI-E x8 links (5.0GB/s)]
Aiming at up to 20GB/s data processing capability on a single node
 Dual NVMe-SSDs + RAID-0/1 support
 Loading SSD blocks onto the GPU at 10GB/s throughput
 GPU parallel processing by thousands of cores
Stay tuned, don't miss it!
