2. about HeteroDB
Corporate overview
 Name: HeteroDB,Inc
 Established: 4th-Jul-2017
 Location: Shinagawa, Tokyo, Japan
 Businesses: Sales of accelerated database product
             Technical consulting on GPU & DB region
With heterogeneous-computing technology in the database area, we provide
a useful, fast and cost-effective data analytics platform for all the
people who need the power of analytics.
CEO Profile
KaiGai Kohei – He has contributed to PostgreSQL and Linux kernel
development in the OSS community for more than ten years, especially to
the security and database federation features of PostgreSQL.
Awarded “Genius Programmer” by the IPA MITOH program (2007)
Top-5 poster finalist at GPU Technology Conference 2017
3. Features of RDBMS
✓ High-availability / Clustering
✓ DB administration and backup
✓ Transaction control
✓ BI and visualization
➔ We can use the products that
support PostgreSQL as-is.
Core technology – PG-Strom
PG-Strom: An extension module for PostgreSQL that accelerates SQL
workloads using the thousands of cores and wide-band memory of a GPU.
[Diagram] PG-Strom connects the GPU with big-data analytics, machine-learning &
statistics, and rapid mass data loading from the storage device.
4. SSD-to-GPU Direct SQL Execution and Visibility-map
5. A usual composition of x86_64 server
[Diagram] CPU, RAM, GPU, SSD, HDD and N/W attached to a single server.
6. Data flow to process a massive amount of data
[Diagram] Normal data flow: all data blocks move from the SSD over the PCIe bus
into RAM before PostgreSQL (CPU) or the GPU can process them.
All the records, including junk ones, must be loaded onto RAM once, because
software cannot check whether a row is needed prior to the data loading.
So, the amount of I/O traffic over the PCIe bus tends to be large.
Unless records are loaded onto CPU/RAM once, over the PCIe bus, software
cannot check their necessity, even if they are “junk”.
7. Core Feature: SSD-to-GPU Direct SQL
[Diagram] SSD-to-GPU direct data flow: data blocks move from the SSD straight to
the GPU over the PCIe bus, bypassing CPU/RAM.
NVIDIA GPUDirect RDMA allows the data blocks on NVME-SSD to be loaded into the
GPU by peer-to-peer DMA over the PCIe bus, bypassing CPU/RAM.
The GPU runs the WHERE-clause, JOIN and GROUP BY of the SQL to reduce the data
size, so only a small amount of data is handed back to CPU/RAM.
(Available since PG-Strom v2.0)
8. Element technology - NVIDIA GPUDirect RDMA
[Diagram] In the host physical address space, the PCIe BAR1 area (around
0xe0000000 in this example) exposes GPU device memory alongside RAM; NVMe-SSD
and Infiniband HBA sit on the same PCIe bus as peer devices.
GPUDirect RDMA enables mapping of GPU device memory onto the physical address
space of the host system.
Once the “physical address of GPU device memory” appears, we can use it as the
source or destination address of DMA with other PCIe devices, e.g.:
DMA Request
  SRC: 1200th sector
  LEN: 40 sectors
  DST: 0xe0200000 (inside the GPU BAR1 window)
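As an illustration of this element technology, below is a minimal kernel-side sketch of how a module such as nvme_strom might pin GPU device memory with the GPUDirect RDMA API (nv-p2p.h) and walk its physical pages. Error handling, the free callback and the actual NVMe command submission are omitted, and the real nvme_strom code certainly differs from this sketch.

#include <linux/types.h>
#include "nv-p2p.h"     /* NVIDIA GPUDirect RDMA kernel API */

/*
 * Pin a range of GPU device memory and obtain its physical addresses.
 * 'vaddr' is a GPU virtual address allocated in userspace (cuMemAlloc)
 * and handed to the kernel module, e.g. through ioctl(2).
 */
static int sketch_map_gpu_memory(uint64_t vaddr, uint64_t length)
{
	struct nvidia_p2p_page_table *page_table = NULL;
	int		rc, i;

	/* expose the GPU device memory on the host physical address space (BAR1) */
	rc = nvidia_p2p_get_pages(0, 0, vaddr, length, &page_table,
							  NULL, NULL);	/* free_callback omitted in this sketch */
	if (rc)
		return rc;

	for (i = 0; i < page_table->entries; i++)
	{
		uint64_t	gpu_paddr = page_table->pages[i]->physical_address;

		/*
		 * 'gpu_paddr' is now a valid DMA destination for a peer PCIe device;
		 * e.g. an NVMe READ (SRC: 1200th sector, LEN: 40 sectors) can target
		 * DST: gpu_paddr without any staging buffer in host RAM.
		 */
		(void) gpu_paddr;
	}
	/* nvidia_p2p_put_pages() must be called once the mapping is no longer needed */
	return 0;
}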
9. Benchmark Results – single-node version
[Chart] Star Schema Benchmark on NVMe-SSD + md-raid0 – query processing throughput [MB/sec]
Queries (left to right): Q1_1 Q1_2 Q1_3 Q2_1 Q2_2 Q2_3 Q3_1 Q3_2 Q3_3 Q3_4 Q4_1 Q4_2 Q4_3
PgSQL9.6 (SSDx3):    2172.3 2159.6 2158.9 2086.0 2127.2 2104.3 1920.3 2023.4 2101.1 2126.9 1900.0 1960.3 2072.1
PGStrom2.0 (SSDx3):  6149.4 6279.3 6282.5 5985.6 6055.3 6152.5 5479.3 6051.2 6061.5 6074.2 5813.7 5871.8 5800.1
The H/W spec of the 3xSSD configuration is shown as the reference line.
SSD-to-GPU Direct SQL pulls out performance close to the H/W spec (3x SeqRead 2.2GB/s = 6.6GB/s).
Measured with the Star Schema Benchmark, a set of typical batch / reporting workloads.
CPU: Intel Xeon E5-2650v4, RAM: 128GB, GPU: NVIDIA Tesla P40, SSD: Intel 750 (400GB; SeqRead 2.2GB/s)x3
Dataset size is 353GB (sf: 401), to ensure an I/O-bound workload.
10. SSD-to-GPU Direct SQL - Software Stack
[Diagram] Software stack of SSD-to-GPU Direct SQL:
  Database Software Layer: PostgreSQL + pg_strom extension, which issues
      read(2) and ioctl(2) toward the layers below.
  Operating System Software Layer: filesystem (ext4, xfs); nvme_strom kernel
      module, which performs file-offset to NVME-SSD sector number translation
      and builds the NVMe Request; blk-mq; nvme pcie and nvme rdma drivers.
  Hardware Layer: NVMe SSD, NVIDIA Tesla GPU, Network HBA, NVMEoF Target (JBOF).
  (pg_strom and nvme_strom are our developed software; the rest are existing components.)
NVME over Fabric: nvme_strom v2.0 supports NVME-over-Fabric (RDMA), so the NVMe
request can also be sent to a remote NVMEoF target (JBOF) through the network HBA.
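To make the read(2) / ioctl(2) split above concrete, here is a simplified userspace sketch. STROM_IOCTL_SSD2GPU and struct strom_ssd2gpu_arg are hypothetical names invented for illustration only; they are not the real nvme_strom interface.

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* hypothetical ioctl command and argument of the nvme_strom kernel module */
struct strom_ssd2gpu_arg {
	uint64_t	gpu_mem_handle;		/* GPU memory mapped for GPUDirect RDMA  */
	uint64_t	file_offset;		/* offset of the blocks in the data file */
	uint64_t	length;				/* transfer size, e.g. BLCKSZ * NChunks  */
};
#define STROM_IOCTL_SSD2GPU		_IOW('S', 1, struct strom_ssd2gpu_arg)

static void
read_blocks(int file_fd, int strom_fd, char *host_buffer,
			struct strom_ssd2gpu_arg *arg, int use_direct)
{
	if (!use_direct)
	{
		/* normal path: buffered pread(2) into host RAM via the filesystem */
		ssize_t		nread = pread(file_fd, host_buffer,
								  arg->length, arg->file_offset);
		(void) nread;
	}
	else
	{
		/* direct path: nvme_strom translates the file offset to NVME-SSD
		 * sector numbers and kicks SSD-to-GPU peer-to-peer DMA, bypassing
		 * CPU/RAM */
		ioctl(strom_fd, STROM_IOCTL_SSD2GPU, arg);
	}
}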
11. Consistency with disk buffers of PostgreSQL / Linux kernel
[Diagram] The cache state of the file blocks to be read (BLK-100 .. BLK-109, each of
BLCKSZ = 8KB) is initially unknown: a block may be uncached, cached by PostgreSQL's
shared buffers, or cached by the OS page cache. Cached blocks are copied into a
userspace DMA buffer in RAM and sent to GPU device memory with cuMemcpyHtoDAsync()
via the CUDA API; the remaining uncached blocks are read by SSD-to-GPU P2P DMA.
The transfer size per request is BLCKSZ * NChunks.
① PG-Strom checks PostgreSQL’s shared buffers for the blocks to read:
 ✓ whether the block is already cached in the shared buffers of PostgreSQL;
 ✓ whether the block may not be all-visible, according to the visibility map.
② NVME-Strom checks the OS page cache for the blocks to read:
 ✓ whether the block is already cached in the page cache of the Linux kernel;
 ✓ but, in fast-SSD mode, the page cache is not copied unless it is dirty.
③ Then it kicks SSD-to-GPU P2P DMA for the remaining uncached blocks.
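The per-block routing in ①–③ can be summarized with the following sketch; the Block* helper functions are hypothetical placeholders for the real shared-buffer, page-cache and visibility-map checks, not actual PG-Strom or PostgreSQL APIs.

#include "postgres.h"
#include "storage/block.h"

/* where a given block's contents come from, per steps (1)-(3) above */
typedef enum
{
	SOURCE_PG_SHARED_BUFFER,	/* copy from PostgreSQL shared buffers into the userspace DMA buffer */
	SOURCE_OS_PAGE_CACHE,		/* copy from the Linux page cache into the userspace DMA buffer      */
	SOURCE_SSD2GPU_P2P_DMA		/* read directly from NVME-SSD into GPU device memory                */
} BlockSource;

static BlockSource
choose_block_source(BlockNumber blkno, bool fast_ssd_mode)
{
	/* (1) PG-Strom checks PostgreSQL's shared buffers and the visibility map */
	if (BlockIsCachedByPostgres(blkno) ||		/* hypothetical helper */
		!BlockIsAllVisible(blkno))				/* hypothetical helper */
		return SOURCE_PG_SHARED_BUFFER;

	/* (2) NVME-Strom checks the OS page cache; in fast-SSD mode a clean page
	 *     is not copied, because re-reading it from the SSD is cheap enough  */
	if (BlockIsCachedByOS(blkno) &&				/* hypothetical helper */
		(!fast_ssd_mode || BlockPageCacheIsDirty(blkno)))
		return SOURCE_OS_PAGE_CACHE;

	/* (3) otherwise kick SSD-to-GPU P2P DMA for this block */
	return SOURCE_SSD2GPU_P2P_DMA;
}

Blocks resolved to the first two sources are then sent to the GPU with cuMemcpyHtoDAsync(), as in the diagram above.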
12. To be discussed (1)
▌In which cases can we read the blocks using SSD-to-GPU Direct SQL?
 The block is ALL-VISIBLE.
 The block is not ALL-VISIBLE, but we can determine the MVCC visibility from
 t_infomask, without looking up the commit log.
▌zheap will eliminate the necessity of the visibility map
▌Solution: a lightweight hook on SetHintBits() (see the sketch below)
 An extension can maintain its own structure to manage the state of blocks
 asynchronously.
 The hook allows it to invalidate a block once it becomes unsafe for direct read.
 Other use scenario:
 A hybrid data store (row & column) can use the hook to invalidate a range that
 has already been transformed to columnar format.
Can the visibility map be a stable infrastructure for this purpose?
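A minimal sketch of what such a hook could look like follows. SetHintBits_hook, its callback signature and pgstrom_invalidate_block_state() are hypothetical; this is a proposal, not an existing PostgreSQL API.

#include "postgres.h"
#include "access/htup.h"
#include "storage/block.h"
#include "utils/rel.h"

/* hypothetical hook, assumed to be invoked from SetHintBits() */
typedef void (*SetHintBits_hook_type) (Relation relation,
										BlockNumber blkno,
										HeapTuple tuple,
										uint16 infomask);
extern SetHintBits_hook_type SetHintBits_hook;

/* extension side: keep a private per-block state map up to date asynchronously */
static void
pgstrom_sethintbits_callback(Relation relation, BlockNumber blkno,
							 HeapTuple tuple, uint16 infomask)
{
	/*
	 * Hint bits are about to change on this block, so it may no longer be
	 * safe for SSD-to-GPU direct read; invalidate it in the extension's own
	 * block-state structure (hypothetical helper).
	 */
	pgstrom_invalidate_block_state(RelationGetRelid(relation), blkno);
}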
14. Consideration for the hardware configuration (1/2)
A PCIe switch can make the CPU more relaxed: SSD-to-GPU P2P traffic stays within the switch.
[Diagram] Each PLX PCIe switch pairs one SSD with one GPU; SCAN, JOIN and GROUP BY
run within each SSD/GPU pair, and only a very small amount of data is gathered by
the two CPUs.
15. Consideration for the hardware configuration (2/2)
▌HPC Server: Supermicro SYS-4029TRT2
 CPU1 and CPU2 (connected by QPI) each attach a x96-lane PCIe switch over Gen3 x16;
 each switch provides PCIe Gen3 x16 for each slot.
▌I/O Expansion Box: NEC ExpEther 40G (4 slots)
 4 slots of PCIe Gen3 x8 behind an internal PCIe switch, connected to the host CPU's
 NIC over 40Gb Ethernet through a network switch; extra I/O boxes can be attached.
16. Hardware optimized table partition layout
[Diagram] The lineorder table (raw data: 1053GB) is hash-partitioned on its key:
hash = f(key), and an inserted row goes to the leaf whose remainder equals hash % 3.
 lineorder_p0 (remainder=0) – tablespace nvme0 – partial data 351GB
 lineorder_p1 (remainder=1) – tablespace nvme1 – partial data 351GB
 lineorder_p2 (remainder=2) – tablespace nvme2 – partial data 351GB
Dimension tables: customer, date, supplier, parts.
A partition leaf is assigned to each I/O expansion box by means of the tablespaces.
17. Multi GPU/SSD and Partition Configuration (1/2)
Host: NEC Express5800/R120g-2m
 CPU: Intel Xeon E5-2603 v4 (6C, 1.7GHz)
 RAM: 64GB
 OS: Red Hat Enterprise Linux 7 (kernel: 3.10.0-862.9.1.el7.x86_64)
     CUDA-9.2.148 + driver 396.44
 DB: PostgreSQL 11beta3 + PG-Strom v2.1devel
I/O Expansion Box: NEC ExpEther (40Gb; 4-slot model)
 I/F: PCIe 3.0 x8 (x16 width) x 4 slots + internal PCIe switch
 N/W: 40Gb-ethernet
GPU: NVIDIA Tesla P40
 # of cores: 3840 (1.3GHz)
 Device RAM: 24GB (347GB/s, GDDR5)
 CC: 6.1 (Pascal, GP102)
 I/F: PCIe 3.0 x16
SSD: Intel DC P4600 (2.0TB; HHHL)
 SeqRead: 3200MB/s, SeqWrite: 1575MB/s
 RandRead: 610k IOPS, RandWrite: 196k IOPS
 I/F: PCIe 3.0 x4
Partitioned tables: lineorder_a (351GB), lineorder_b (351GB), lineorder_c (351GB);
dimension tables: customer, date, supplier, parts.
18. Multi GPU/SSD and Partition Configuration (2/2) - Benchmark
13 SSBM queries against a 1055GB database in total (i.e. 351GB per I/O expansion box).
Raw I/O data transfer without SQL execution reached up to 9GB/s.
In other words, SQL execution was faster than a simple storage read with raw I/O.
[Chart] Star Schema Benchmark for PgSQL v11beta3 / PG-Strom v2.1devel on NEC ExpEther x3
– query execution throughput [MB/s]
Queries (left to right): Q1_1 Q1_2 Q1_3 Q2_1 Q2_2 Q2_3 Q3_1 Q3_2 Q3_3 Q3_4 Q4_1 Q4_2 Q4_3
PostgreSQL v11beta3:  2,388 2,477 2,493 2,502 2,739 2,831 1,865 2,268 2,442 2,418 1,789 1,848 2,202
PG-Strom v2.1devel:   13,401 13,534 13,536 13,330 12,696 12,965 12,533 11,498 12,312 12,419 12,414 12,622 12,594
The Raw I/O Limitation is shown as the reference line.
max 13.5GB/s for query execution performance with 3x I/O expansion boxes!!
19. Issue of Partition-wise Join (1/4)
[Diagram] Same hash-partition layout as before: lineorder_p0 / lineorder_p1 /
lineorder_p2 (remainder = 0 / 1 / 2) on tablespaces nvme0 / nvme1 / nvme2, plus the
dimension tables customer, date, supplier, parts.
New in PostgreSQL v11: parallel scan of the partition leaves.
Plan shape: Scan on each partition leaf → Gather → Join → Agg → Query Results.
Problem: the massive number of records makes the Gather step expensive.
20. Issue of Partition-wise Join (2/4)
[Diagram] Same hash-partition layout (lineorder_p0 / p1 / p2 on nvme0 / nvme1 / nvme2).
Preferable: gather the partition leaves after JOIN / GROUP BY.
Plan shape: per-leaf Scan → Join → PreAgg, then Gather → Agg → Query Results, so only
the pre-aggregated rows need to be gathered.
21. Issue of Partition-wise Join (3/4)
▌INNER JOIN / RIGHT OUTER JOIN
It works if the planner distributes customer to each individual leaf of lineorder.
No enhancement on the executor side is needed.
customer × (lineorder_p0 + lineorder_p1 + lineorder_p2)
= (customer × lineorder_p0) + (customer×lineorder_p1) + (customer×lineorder_p2)
▌LEFT OUTER JOIN / FULL OUTER JOIN
The left side of the join needs to track unreferenced tuples in a shared structure.
[Diagram] customer is joined against lineorder_p0, lineorder_p1 and lineorder_p2 individually.
22. Issue of Partition-wise Join (4/4)
ssbm =# EXPLAIN SELECT sum(lo_extendedprice*lo_discount) as revenue
FROM lineorder,date1
WHERE lo_orderdate = d_datekey
AND d_year = 1993
AND lo_discount between 1 and 3
AND lo_quantity < 25;
QUERY PLAN
------------------------------------------------------------------------------
Aggregate
-> Gather
Workers Planned: 9
-> Parallel Append
-> Parallel Custom Scan (GpuPreAgg)
Reduction: NoGroup
Combined GpuJoin: enabled
GPU Preference: GPU2 (Tesla P40)
-> Parallel Custom Scan (GpuJoin) on lineorder_p2
Outer Scan: lineorder_p2
Outer Scan Filter: ((lo_discount >= '1'::numeric) AND (lo_discount <= '3'::numeric)
AND (lo_quantity < '25'::numeric))
Depth 1: GpuHashJoin (nrows 102760469...45490403)
HashKeys: lineorder_p2.lo_orderdate
JoinQuals: (lineorder_p2.lo_orderdate = date1.d_datekey)
KDS-Hash (size: 66.03KB)
GPU Preference: GPU2 (Tesla P40)
NVMe-Strom: enabled
-> Seq Scan on date1
Filter: (d_year = 1993)
-> Parallel Custom Scan (GpuPreAgg)
Reduction: NoGroup
Combined GpuJoin: enabled
GPU Preference: GPU1 (Tesla P40)
:
The subtree marked “GPU Preference: GPU2 (Tesla P40)” is the portion to be executed
on the 3rd I/O expansion box.
23. To be discussed (2)
① Built-in planner support to construct partition-wise JOINs
② A shared data structure for the distributed INNER tables (see the sketch below)
[Diagram] customer is hashed once into a shared structure; the distributed Hash nodes
attached to lineorder_p0, lineorder_p1 and lineorder_p2 can all reference the same
hash of customer instead of building their own copies.
can reference