Copyright©2018 NTT Corp. All Rights Reserved.
Introducing PMDK into PostgreSQL
Challenges and implementations towards PMEM-generation
elephant
Takashi Menjo†, Yoshimi Ichiyanagi†
†NTT Software Innovation Center
PGCon 2018, Day 1 (May 31, 2018)
My background
• Worked on system software
  • Distributed block storage (Sheepdog)
  • Operating system (Linux)
• First time diving into PostgreSQL
  • Trying to refine open-source software with a new storage device and a new library
  • Chose PostgreSQL because the NTT group promotes it

Any discussions and comments are welcome :-)
Overview of my talk
1. Introduction
   • Persistent Memory (PMEM), DAX for files, and PMDK
2. Hacks and evaluation
   i. XLOG segment file (Write-Ahead Logging)
   ii. Relation segment file (Table, Index, …)
3. Tips related to PMEM
   • Programming, benchmark, and operation
4. Conclusion
Persistent Memory (PMEM)
• Emerging memory-like storage device
  • Non-volatile, byte-addressable, and as fast as DRAM
• Several types released or announced
  • NVDIMM-N (Micron, HPE, Dell, ...): based on DRAM and NAND flash. We use this one.
  • HPE Scalable Persistent Memory
  • Intel Persistent Memory (in 2018?)
How PMEM is fast
Source: J. Arulraj and A. Pavlo. How to Build a Non-Volatile Memory Database Management System (Table 1). Proc. SIGMOD '17.
Databases on the way to PMEM
• Microsoft SQL Server 2016 utilizes NVDIMM-N for Tail-of-Log [1]
• SAP HANA is leveraging PMEM for Main Store [2]
• MariaDB Labs is collaborating with Intel [3]
• “Non-volatile Memory Logging” in PGCon 2016 [4]

[1] https://www.snia.org/sites/default/files/PM-Summit/2017/presentations/Tom_Talpey_Persistent_Memory_in_Windows_Server_2016.pdf
[2] https://www.snia.org/sites/default/files/PM-Summit/2017/presentations/Zora_Caklovic_Bringing_Persistent_Memory_Technology_to_SAP_HANA.pdf
[3] https://mariadb.com/about-us/newsroom/press-releases/mariadb-launches-innovation-labs-explore-and-conquer-new-frontiers
[4] https://www.pgcon.org/2016/schedule/attachments/430_Non-volatile_Memory_Logging.pdf
What we need to use PMEM
• Hardware support (for NVDIMM, at least)
  • BIOS detecting and configuring PMEM
    • ACPI 6.0 or later: NFIT
  • Asynchronous DRAM Refresh (ADR)
  • ...
• Software support
  • Operating system (device drivers)
    • Direct Access for files (DAX)
    • Linux (ext4 and xfs) and Windows (NTFS)
  • Persistent Memory Development Kit (PMDK)
    • Linux and Windows, on x64
DAX, PMDK, and what we did
[Diagram: three I/O paths compared. (1) Traditional filesystem: PostgreSQL uses file I/O APIs through the kernel page cache, with context switches. (2) DAX-enabled filesystem: PostgreSQL uses file I/O APIs directly against PMEM, still with context switches. (3) PMDK: PostgreSQL accesses a memory-mapped file on a DAX-enabled filesystem with load/store instructions and no context switch. This third path is what we did.]
Benefits of DAX and PMDK
• With DAX only
  • Use PMEM faster without changing the application
• With DAX and PMDK
  • Improve the performance of I/O-intensive workloads
  • By reducing context switches and the overhead of API calls
• Micro-benchmark
  • DAX+PMDK is 2.5x as fast as DAX-only
    (HPE NVDIMM, Linux kernel 4.16, ext4, PMDK 1.4)
• So we try to introduce PMDK into PostgreSQL
Approach
• How to hack
  ✓ Replace read/write calls with memory copy
    • An easier way, reasonable for our first step
  • (Alternative: have data structures on DRAM persist on PMEM directly, similar to an in-memory database)
• What we hack
  ✓ i) XLOG segment files
    • Critical for transaction performance
  ✓ ii) Relation segment files
    • Many writes occur during checkpoint
  • (Other files: CLOG, pg_control, ...)
How to hack: replace read/write calls with memory copy

| Step  | read/write (POSIX)              | libpmem (PMDK)                            |
|-------|---------------------------------|-------------------------------------------|
| Open  | fd = open(path, ...);           | pmem = pmem_map_file(path, len, ...);     |
| Write | nbytes = write(fd, buf, count); | pmem_memcpy_nodrain(pmem, buf, count);    |
| Sync  | ret = fdatasync(fd);            | pmem_drain();                             |
| Read  | nbytes = read(fd, buf, count);  | memcpy(buf, pmem, count); // <string.h>   |
| Close | ret = close(fd);                | ret = pmem_unmap(pmem, len);              |
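The mapping above can be sketched in plain C. This is a hedged sketch, not the actual patch: it uses mmap/msync as a portable stand-in so it runs anywhere, with comments marking where the libpmem calls (pmem_map_file, pmem_memcpy_nodrain, pmem_drain, pmem_unmap) would go on a DAX filesystem.

```c
/* Sketch of the table above: POSIX stand-ins on a memory-mapped file,
 * with the libpmem replacement noted in comments.  SEG_LEN is a tiny
 * demo size; a real XLOG segment is 16 MiB. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SEG_LEN 4096

/* Open: pmem = pmem_map_file(path, len, PMEM_FILE_CREATE, 0600, ...); */
static char *seg_open(const char *path)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, SEG_LEN) < 0) { close(fd); return NULL; }
    char *p = mmap(NULL, SEG_LEN, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid without the fd */
    return p == MAP_FAILED ? NULL : p;
}

/* Write: pmem_memcpy_nodrain(pmem + off, buf, count); */
static void seg_write(char *seg, size_t off, const void *buf, size_t count)
{
    memcpy(seg + off, buf, count);
}

/* Sync: pmem_drain(); */
static int seg_sync(char *seg)
{
    return msync(seg, SEG_LEN, MS_SYNC);
}

/* Close: ret = pmem_unmap(pmem, len); */
static int seg_close(char *seg)
{
    return munmap(seg, SEG_LEN);
}
```

Reads need no special call in either column: once mapped, memcpy from the mapping is an ordinary load.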
i) XLOG segment file
• Contains Write-Ahead Log records
• Guarantees durability of updates
  • By having the records persist before committing a transaction
• Fixed length (16 MiB per file)
  • Each file has a monotonically increasing “segment number”
• Critical for transaction performance
  • A backend cannot commit a transaction before the commit log record persists on storage
  • A transaction takes less time if the record persists sooner
i) How we hack XLOG
• Memory-map every segment file
  • The fixed-length (16 MiB) file is highly compatible with memory-mapping
• Memory-copy to it from the XLOG buffer
• Patch <backend/access/xlog.c> and so on
  • 15 files changed, 847 insertions, 174 deletions
  • Available on the pgsql-hackers mailing list (search “PMDK”)
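A rough sketch of the idea (not the actual patch): map one fixed-length segment once, then let each commit memcpy its record into the mapping and persist only that range. Here mmap/msync stand in for pmem_map_file/pmem_drain, and the offset bookkeeping of the real XLOG insert position is simplified away.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define XLOG_SEG_SIZE (16 * 1024 * 1024)   /* fixed 16 MiB, as in PostgreSQL */

/* Map one whole segment; the fixed length means we never have to remap. */
static char *map_xlog_segment(const char *path)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, XLOG_SEG_SIZE) < 0) { close(fd); return NULL; }
    char *seg = mmap(NULL, XLOG_SEG_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    close(fd);
    return seg == MAP_FAILED ? NULL : seg;
}

/* Commit path: copy the record from the XLOG buffer and persist it.
 * With libpmem this would be pmem_memcpy_nodrain() + pmem_drain();
 * msync() wants a page-aligned start, so align the offset down. */
static int append_record(char *seg, size_t off, const void *rec, size_t len)
{
    memcpy(seg + off, rec, len);
    size_t page  = (size_t)sysconf(_SC_PAGESIZE);
    size_t start = off & ~(page - 1);
    return msync(seg + start, off + len - start, MS_SYNC);
}
```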
i) Evaluation setup

Hardware:
  CPU: E5-2667 v4 x 2 (8 cores per node)
  DRAM: [Node0/1] 32 GiB each
  PMEM (NVDIMM-N): [Node0] 64 GiB (HPE 8GB NVDIMM x 8)
  NVMe SSD: [Node0] Intel SSD DC P3600 400GB
  Optane SSD: [Node0] Intel Optane SSD DC P4800X 750GB

Software:
  Distro: Ubuntu 17.10
  Linux kernel: 4.16
  PMDK: 1.4
  Filesystem: ext4 (DAX available)
  PostgreSQL base: a467832 (master @ Mar 18, 2018)

postgresql.conf:
  {max,min}_wal_size = 20GB
  shared_buffers = 16085MB
  checkpoint_timeout = 12min
  checkpoint_completion_target = 0.7

[Diagram: one server with two NUMA nodes. Node0 has 32 GiB DRAM, 64 GiB PMEM, the NVMe SSD, and the Optane SSD, and runs the server; Node1 has 32 GiB DRAM and runs the client.]
i) Evaluation of XLOG hacks
• Compare transaction throughput by using pgbench
  • pgbench -i -s 200
  • pgbench -M prepared -h /tmp -p 5432 -c 32 -j 32 -T 1800
  • The checkpoint runs twice during the benchmark
  • Run 3 times and take the median value as the result
• Conditions:

|                 | (a)           | (b)           | (c)        | (d)           | (e)              |
| wal_sync_method | fdatasync     | open_datasync | fdatasync  | open_datasync | libpmem (new)    |
| PostgreSQL      | no hack       | no hack       | no hack    | no hack       | hacked with PMDK |
| FS for XLOG     | ext4 (no DAX) | ext4 (no DAX) | ext4 (DAX) | ext4 (DAX)    | ext4 (DAX)       |
| Device for XLOG | Optane SSD    | Optane SSD    | PMEM       | PMEM          | PMEM             |
| PGDATA          | NVMe SSD / ext4 (no DAX) in all conditions                                 |
i) Results
[Chart: transaction throughput for conditions (a)–(e); higher is better. The PMDK-hacked condition (e) shows a +3% improvement.]
i) Discussion
• Improved transaction throughput by 3%
  • Roughly the same improvement as Yoshimi reported on pgsql-hackers
  • Seems small as a percentage, but not so small in absolute value (+1,200 TPS)
• Future work
  • Performance profiling
  • Searching for query patterns for which our hack is more effective
ii) Relation segment file
• The so-called data file (or checkpoint file)
  • Table, Index, TOAST, Materialized View, ...
• Variable length, up to 1 GiB
  • A huge table and so forth consists of multiple segment files
• Critical for checkpoint duration
  • Dirty pages in the shared buffer are written back to the segment files
  • A checkpoint takes less time if the pages are written back sooner
ii) How we hack Relation
• Memory-map only full 1-GiB segment files
  • A memory-mapped file cannot extend or shrink
  • Remapping the file on each extension seemed too difficult to implement
• Memory-copy to it from the shared buffer
• Patch <backend/storage/smgr/md.c> and so on
  • 2 files changed, 152 insertions
  • Under test, not published yet
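The constraint above (a mapping cannot grow) is why only full-size segments are mapped. A hedged sketch of that decision, with fstat standing in for however the actual patch checks the size, and a small demo length instead of 1 GiB:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the segment only if it has already reached its full size;
 * otherwise signal the caller to fall back to the ordinary
 * read()/write() path, since a mapping could not follow a later
 * extension of the file. */
static void *map_if_full(const char *path, off_t full_size)
{
    struct stat st;
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0 || st.st_size < full_size) {
        close(fd);              /* not full yet: keep using write() */
        return NULL;
    }
    void *p = mmap(NULL, (size_t)full_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}
```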
ii) Evaluation setup
Same hardware and software as in (i), except for postgresql.conf:

postgresql.conf:
  {max,min}_wal_size = 20GB
  shared_buffers = 16085MB
  checkpoint_timeout = 1d            <= not to kick a checkpoint automatically
  checkpoint_completion_target = 0.0 <= to complete the checkpoint ASAP
ii) Evaluation of Relation hacks
• Compare checkpoint duration time as follows:
[Diagram: pgbench -i -s 800 loads the relation segments; the server is restarted; after a few minutes of pgbench, 7.37 GiB of dirty pages sit in the shared buffer; psql issues CHECKPOINT to flush them to the relation segments; the total duration is read from the server log (log_filename).]
• Conditions:

|                         | (a)           | (b)                | (c)                |
| PostgreSQL              | no hack       | no hack            | hacked with PMDK   |
| FS for PGDATA           | ext4 (no DAX) | ext4 (DAX enabled) | ext4 (DAX enabled) |
| Device for PGDATA       | Optane SSD    | PMEM               | PMEM               |
| Profiled by Linux perf? | No            | Yes                | Yes                |
ii) Results
[Chart: checkpoint duration for conditions (a)–(c); lower is better. The PMDK-hacked condition (c) shortens the duration by 30%.]
ii) Profiling
[Chart: CPU profile breakdown; compared with DAX-only:]
• -20 points: overhead of write() (sky blue)
• -7 points: context switches (green)
• -3 points: memory copy (deep blue and brown)
• +5 points: LockBufHdr (orange)
ii) Discussion
• Shortened checkpoint duration by 30%
  • The server can give its computing resources to other purposes
• Reduced the overhead of system calls and context switches
  • The benefit of using memory-mapped files!
• The time spent in LockBufHdr became rather longer
  • An open issue…
Conclusion of evaluation
• (i) Improved transaction throughput by 3%
  • With a 1,000-line hack for WAL
• (ii) Shortened checkpoint duration by 30%
  • With a 150-line hack for Relation
• We must bring out more of PMEM's potential
  • Not so bad for an easier approach, but far from the “2.5x” of the micro-benchmark
  • I think another way is to have data structures on DRAM persist on PMEM directly
CPU cache flush and cache-bypassing store
• The data should reach nothing but PMEM
  • Don't let it stop halfway in a volatile middle layer such as the CPU caches
  • Or it will be lost when the program or system crashes
• x64 offers two instruction families
  • CLFLUSH: flush data out of the CPU caches to memory
  • MOVNT: store data to memory, bypassing the CPU caches
• PMDK supports both
  • pmem_flush
  • pmem_memcpy_nodrain
[Figure: CLFLUSH pushes data from the CPU caches down to PMEM, while MOVNT stores to PMEM directly, bypassing the caches.]
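The CLFLUSH family can be seen in plain intrinsics. A minimal sketch on x64 (matching the loop on the backup "API walkthrough" slide): store normally, push each cache line out with CLFLUSH, then order the flushes with SFENCE. Note that pmem_flush() may pick CLWB or CLFLUSHOPT instead when the CPU has them, and that MOVNT stores (as in pmem_memcpy_nodrain) would skip the caches entirely rather than flushing them.

```c
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence (SSE2) */
#include <string.h>

#define CACHE_LINE 64

/* Flush [buf, buf+len) out of the CPU caches, then fence so the
 * flushes are globally ordered before anything that follows. */
static void flush_range(const void *buf, size_t len)
{
    const char *p = (const char *)buf;
    for (size_t i = 0; i < len; i += CACHE_LINE)
        _mm_clflush(p + i);
    _mm_sfence();
}

/* A store that does not stop halfway in the caches: copy through the
 * caches, then explicitly push the data down toward (P)MEM. */
static void persistent_store(char *dst, const char *src, size_t len)
{
    memcpy(dst, src, len);
    flush_range(dst, len);
}
```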
Memory-mapped file and Relation extension
• The two are not compatible
  • A memory-mapped file cannot be extended while being mapped
• Neither naive workaround is perfect
  • Remapping a segment file on each extension is time-consuming
  • Pre-allocating the maximum size (1 GiB per segment) wastes PMEM free space
• We must rethink traditional data structures for the PMEM era
Page table and hugepage
• Hugepages will improve the performance of PMEM
  • By reducing page walks and Translation Lookaside Buffer (TLB) misses
• PMDK on x64 takes hugepages into account
  • By aligning the mapping address on a hugepage boundary (2 MiB or 1 GiB) when the file is large enough
• Pre-warming the page table for PMEM will also improve performance
  • By reducing page faults at main runtime
Controlling NUMA effect
• Critical for stable performance on a multiprocessing system
  • Accessing local memory (DRAM and/or PMEM) is fast while remote access is slow
  • This applies to PCIe SSDs too, but PMEM is more sensitive
• Bind processes to a certain node:
  • numactl --cpunodebind=X --membind=X -- pg_ctl ...
New common sense of hotspot
• Something other than storage access can become the hotspot of a transaction when we use PMEM
• Such as...
  • Concurrency control such as locking
  • Redundant internal memory copies
  • Pre-processing such as SQL parsing
    • We fell into this trap and avoided it by using prepared statements
[Conceptual figure: with disk, disk access dominates transaction time; with PMEM, PMEM access shrinks and the “others” above dominate instead.]
Conclusion
• Applied PMDK to PostgreSQL
  • In an easier way, using memory-mapped files and memory copy
• Achieved not-so-bad results
  • +3% transaction throughput
  • -30% checkpoint duration
• Showed tips related to PMEM
• PMEM will change software design drastically
  • We should change our software and our mindset to bring out much more of PMEM's potential
• Let's try PMEM and PMDK :)
Backup slides
How to hack
[Diagram: three configurations. Current PostgreSQL: write() from the XLOG buffer (on commit) and the shared buffer (on checkpoint) to XLOG and Relation files on disk. On a DAX-enabled filesystem: the same write() path, but to files on PMEM. Our hacks: memory copy to memory-mapped XLOG and Relation files on PMEM.]
SNIA NVM Programming Model
• Defines behavior between user space and the OS
• Here we focus on NVM.PM.FILE mode
https://www.snia.org/sites/default/files/technical_work/final/NVMProgrammingModel_v1.2.pdf
API walkthrough
(Legend: _mm_* are Intel intrinsics; pmem_* are PMDK libpmem.)

| Step     | read/write                      | memory-mapped file                                     | PMDK (libpmem)                                     |
|----------|---------------------------------|--------------------------------------------------------|----------------------------------------------------|
| Open     | fd = open(path, flags, mode);   | fd = open(path, flags, mode);                          | pmem = pmem_map_file(path, len, flags, mode, ...); |
| Allocate | -                               | err = posix_fallocate(fd, 0, len); // map cannot be extended, so pre-allocate the file | - |
| Map      | -                               | pmem = mmap(NULL, len, ..., fd, 0);                    | -                                                  |
| (Close)  | -                               | ret = close(fd); // accessing file via mapped address, not fd any more | -                                  |
| Write    | nbytes = write(fd, buf, count); | memcpy(pmem, buf, count);                              | pmem_memcpy_nodrain(pmem, buf, count); // bypasses the cache if able, instead of flushing it |
| Flush    | -                               | for (i = 0; i < count; i += 64) _mm_clflush(&pmem[i]); | -                                                  |
| Sync     | ret = fdatasync(fd);            | _mm_sfence();                                          | pmem_drain();                                      |
| Read     | nbytes = read(fd, buf, count);  | memcpy(buf, pmem, count);                              | memcpy(buf, pmem, count);                          |
| Unmap    | -                               | ret = munmap(pmem, len);                               | ret = pmem_unmap(pmem, len);                       |
| Close    | ret = close(fd);                | -                                                      | -                                                  |
Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...
NTT Software Innovation Center
 
2-in-1 Cluster Integration: Batch and Interactive GPU Computing
2-in-1 Cluster Integration: Batch and Interactive GPU Computing2-in-1 Cluster Integration: Batch and Interactive GPU Computing
2-in-1 Cluster Integration: Batch and Interactive GPU Computing
NTT Software Innovation Center
 
Hybrid Sourcing for Overcoming “Digital Cliff 2025”
Hybrid Sourcing for Overcoming “Digital Cliff 2025”Hybrid Sourcing for Overcoming “Digital Cliff 2025”
Hybrid Sourcing for Overcoming “Digital Cliff 2025”
NTT Software Innovation Center
 
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
NTT Software Innovation Center
 
Network Implosion: Effective Model Compression for ResNets via Static Layer P...
Network Implosion: Effective Model Compression for ResNets via Static Layer P...Network Implosion: Effective Model Compression for ResNets via Static Layer P...
Network Implosion: Effective Model Compression for ResNets via Static Layer P...
NTT Software Innovation Center
 
Why and how Edge Computing matters enterprise IT strategy
Why and how Edge Computing matters enterprise IT strategyWhy and how Edge Computing matters enterprise IT strategy
Why and how Edge Computing matters enterprise IT strategy
NTT Software Innovation Center
 
外部キー制約を考慮した特徴量削減手法
外部キー制約を考慮した特徴量削減手法外部キー制約を考慮した特徴量削減手法
外部キー制約を考慮した特徴量削減手法
NTT Software Innovation Center
 
デジタルサービスプラットフォーム実現に向けた技術課題
デジタルサービスプラットフォーム実現に向けた技術課題デジタルサービスプラットフォーム実現に向けた技術課題
デジタルサービスプラットフォーム実現に向けた技術課題
NTT Software Innovation Center
 
Building images efficiently and securely on Kubernetes with BuildKit
Building images efficiently and securely on Kubernetes with BuildKitBuilding images efficiently and securely on Kubernetes with BuildKit
Building images efficiently and securely on Kubernetes with BuildKit
NTT Software Innovation Center
 
Real-time spatiotemporal data utilization for future mobility services
Real-time spatiotemporal data utilization for future mobility servicesReal-time spatiotemporal data utilization for future mobility services
Real-time spatiotemporal data utilization for future mobility services
NTT Software Innovation Center
 
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
NTT Software Innovation Center
 
統合ログ分析技術Lognosisと運用ログ分析の取組
統合ログ分析技術Lognosisと運用ログ分析の取組統合ログ分析技術Lognosisと運用ログ分析の取組
統合ログ分析技術Lognosisと運用ログ分析の取組
NTT Software Innovation Center
 
MVSR Schedulerを作るための指針
MVSR Schedulerを作るための指針MVSR Schedulerを作るための指針
MVSR Schedulerを作るための指針
NTT Software Innovation Center
 
OpenStack Swiftとそのエコシステムの最新動向
OpenStack Swiftとそのエコシステムの最新動向OpenStack Swiftとそのエコシステムの最新動向
OpenStack Swiftとそのエコシステムの最新動向
NTT Software Innovation Center
 

More from NTT Software Innovation Center (20)

A Global Data Infrastructure for Data Sharing Between Businesses
A Global Data Infrastructure for Data Sharing Between BusinessesA Global Data Infrastructure for Data Sharing Between Businesses
A Global Data Infrastructure for Data Sharing Between Businesses
 
企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤
 
企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤企業間データ流通のための国際データ基盤
企業間データ流通のための国際データ基盤
 
不揮発WALバッファ
不揮発WALバッファ不揮発WALバッファ
不揮発WALバッファ
 
企業間データ流通のための国際基盤
企業間データ流通のための国際基盤企業間データ流通のための国際基盤
企業間データ流通のための国際基盤
 
企業間データ流通のための国際基盤
企業間データ流通のための国際基盤企業間データ流通のための国際基盤
企業間データ流通のための国際基盤
 
Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...
Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...
Hybrid Computing Platform for Combinatorial Optimization with the Coherent Is...
 
2-in-1 Cluster Integration: Batch and Interactive GPU Computing
2-in-1 Cluster Integration: Batch and Interactive GPU Computing2-in-1 Cluster Integration: Batch and Interactive GPU Computing
2-in-1 Cluster Integration: Batch and Interactive GPU Computing
 
Hybrid Sourcing for Overcoming “Digital Cliff 2025”
Hybrid Sourcing for Overcoming “Digital Cliff 2025”Hybrid Sourcing for Overcoming “Digital Cliff 2025”
Hybrid Sourcing for Overcoming “Digital Cliff 2025”
 
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
データ分析をビジネスに活かす!データ創出・活用から、分析、課題解決までのDX時代のデータ活用事例のご紹介 ~不揃いのデータとの格闘~
 
Network Implosion: Effective Model Compression for ResNets via Static Layer P...
Network Implosion: Effective Model Compression for ResNets via Static Layer P...Network Implosion: Effective Model Compression for ResNets via Static Layer P...
Network Implosion: Effective Model Compression for ResNets via Static Layer P...
 
Why and how Edge Computing matters enterprise IT strategy
Why and how Edge Computing matters enterprise IT strategyWhy and how Edge Computing matters enterprise IT strategy
Why and how Edge Computing matters enterprise IT strategy
 
外部キー制約を考慮した特徴量削減手法
外部キー制約を考慮した特徴量削減手法外部キー制約を考慮した特徴量削減手法
外部キー制約を考慮した特徴量削減手法
 
デジタルサービスプラットフォーム実現に向けた技術課題
デジタルサービスプラットフォーム実現に向けた技術課題デジタルサービスプラットフォーム実現に向けた技術課題
デジタルサービスプラットフォーム実現に向けた技術課題
 
Building images efficiently and securely on Kubernetes with BuildKit
Building images efficiently and securely on Kubernetes with BuildKitBuilding images efficiently and securely on Kubernetes with BuildKit
Building images efficiently and securely on Kubernetes with BuildKit
 
Real-time spatiotemporal data utilization for future mobility services
Real-time spatiotemporal data utilization for future mobility servicesReal-time spatiotemporal data utilization for future mobility services
Real-time spatiotemporal data utilization for future mobility services
 
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
【招待講演】ICM研究会 - 統合ログ分析技術Lognosisと運用ログ分析の取組
 
統合ログ分析技術Lognosisと運用ログ分析の取組
統合ログ分析技術Lognosisと運用ログ分析の取組統合ログ分析技術Lognosisと運用ログ分析の取組
統合ログ分析技術Lognosisと運用ログ分析の取組
 
MVSR Schedulerを作るための指針
MVSR Schedulerを作るための指針MVSR Schedulerを作るための指針
MVSR Schedulerを作るための指針
 
OpenStack Swiftとそのエコシステムの最新動向
OpenStack Swiftとそのエコシステムの最新動向OpenStack Swiftとそのエコシステムの最新動向
OpenStack Swiftとそのエコシステムの最新動向
 

Recently uploaded

Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI AppAI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
Google
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
Google
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
Juraj Vysvader
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
timtebeek1
 
Pro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp BookPro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp Book
abdulrafaychaudhry
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
Globus
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Globus
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
Boni García
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Shahin Sheidaei
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Neo4j
 

Recently uploaded (20)

Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI AppAI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
 
Pro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp BookPro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp Book
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
 

Introducing PMDK into PostgreSQL

  • 1. Copyright©2018 NTT Corp. All Rights Reserved.
     Introducing PMDK into PostgreSQL
     Challenges and implementations towards PMEM-generation elephant
     Takashi Menjo†, Yoshimi Ichiyanagi† (†NTT Software Innovation Center)
     PGCon 2018, Day 1 (May 31, 2018)
  • 2. My background
     • Worked on system software
       • Distributed block storage (Sheepdog)
       • Operating system (Linux)
     • First time to dive into PostgreSQL
     • Trying to refine open-source software with a new storage device and a new library
       • Chose PostgreSQL because the NTT group promotes it
     Any discussions and comments are welcome :-)
  • 3. Overview of my talk
     1. Introduction
       • Persistent Memory (PMEM), DAX for files, and PMDK
     2. Hacks and evaluation
       i. XLOG segment file (Write-Ahead Logging)
       ii. Relation segment file (Table, Index, ...)
     3. Tips related to PMEM
       • Programming, benchmark, and operation
     4. Conclusion
  • 4. Overview of my talk (agenda repeated; next: 1. Introduction)
  • 5. Persistent Memory (PMEM)
     • Emerging memory-like storage device
       • Non-volatile, byte-addressable, and as fast as DRAM
     • Several types released or announced
       • NVDIMM-N (Micron, HPE, Dell, ...)  <= We use this.
         • Based on DRAM and NAND flash
       • HPE Scalable Persistent Memory
       • Intel Persistent Memory (in 2018?)
  • 6. How PMEM is fast
     Source: J. Arulraj and A. Pavlo. How to Build a Non-Volatile Memory Database Management System (Table 1). Proc. SIGMOD '17.
  • 7. Databases on the way to PMEM
     • Microsoft SQL Server 2016 utilizes NVDIMM-N for Tail-of-Log [1]
     • SAP HANA is leveraging PMEM for Main Store [2]
     • MariaDB Labs is collaborating with Intel [3]
     • “Non-volatile Memory Logging” in PGCon 2016 [4]
     [1] https://www.snia.org/sites/default/files/PM-Summit/2017/presentations/Tom_Talpey_Persistent_Memory_in_Windows_Server_2016.pdf
     [2] https://www.snia.org/sites/default/files/PM-Summit/2017/presentations/Zora_Caklovic_Bringing_Persistent_Memory_Technology_to_SAP_HANA.pdf
     [3] https://mariadb.com/about-us/newsroom/press-releases/mariadb-launches-innovation-labs-explore-and-conquer-new-frontiers
     [4] https://www.pgcon.org/2016/schedule/attachments/430_Non-volatile_Memory_Logging.pdf
  • 8. What we need to use PMEM
     • Hardware support (for NVDIMM, at least)
       • BIOS detecting and configuring PMEM
         • ACPI 6.0 or later: NFIT
       • Asynchronous DRAM Refresh (ADR)
     • Software support
       • Operating system (device drivers)
       • Direct Access for files (DAX)
         • Linux (ext4 and xfs) and Windows (NTFS)
       • Persistent Memory Development Kit (PMDK)
         • Linux and Windows, on x64
  • 9-10. DAX, PMDK, and what we did
     (Diagram: three I/O paths compared)
     • Traditional filesystem: PostgreSQL → file I/O APIs → page cache (context switch)
     • DAX-enabled filesystem: PostgreSQL → file I/O APIs → PMEM directly (context switch)
     • PMEM library (PMDK): PostgreSQL → load/store instructions → memory-mapped file on PMEM (no context switch)  <= What we did
  • 11. Benefits of DAX and PMDK
     • With DAX only
       • Use PMEM faster without changing the application
     • With DAX and PMDK
       • Improve the performance of I/O-intensive workloads
         • By reducing context switches and the overhead of API calls
     • Micro-benchmark: DAX+PMDK is 2.5X as fast as DAX only
       (HPE NVDIMM, Linux kernel 4.16, ext4, PMDK 1.4)
     • So we try to introduce PMDK into PostgreSQL
  • 12. Overview of my talk (agenda repeated; next: 2. Hacks and evaluation)
  • 13. Approach
     • How to hack: replace read/write calls with memory copy
       • An easier way, reasonable for our first step
       • (Alternative: have data structures on DRAM persist on PMEM directly, similar to an in-memory database)
     • What we hack:
       i) XLOG segment files
         • Critical for transaction performance
       ii) Relation segment files
         • Many writes occur during checkpoint
       • Other files (CLOG, pg_control, ...)
  • 14. How to hack: replace read/write calls with memory copy
           | read/write (POSIX)              | libpmem (PMDK)
     Open  | fd = open(path, ...);           | pmem = pmem_map_file(path, len, ...);
     Write | nbytes = write(fd, buf, count); | pmem_memcpy_nodrain(pmem, buf, count);
     Sync  | ret = fdatasync(fd);            | pmem_drain();
     Read  | nbytes = read(fd, buf, count);  | memcpy(buf, pmem, count); // from <string.h>
     Close | ret = close(fd);                | ret = pmem_unmap(pmem, len);
  • 15. i) XLOG segment file
     • Contains Write-Ahead Log records
     • Guarantees durability of updates
       • By having the records persist before committing the transaction
     • Fixed length (16 MiB per file)
       • Each file has a monotonically increasing “segment number”
     • Critical for transaction performance
       • A backend cannot commit a transaction before the commit log record persists on storage
       • A transaction takes less time if the record persists sooner
  • 16. i) How we hack XLOG
     • Memory-map every segment file
       • A fixed-length (16-MiB) file is highly compatible with memory-mapping
     • Memory-copy to it from the XLOG buffer
     • Patch <backend/access/xlog.c> and so on
       • 15 files changed, 847 insertions, 174 deletions
       • Available on the pgsql-hackers mailing list (search “PMDK”)
  • 17. i) Evaluation setup
     Hardware:
     • CPU: E5-2667 v4 x 2 (8 cores per node)
     • DRAM: 32 GiB on each of Node0/Node1
     • PMEM (NVDIMM-N): 64 GiB on Node0 (HPE 8GB NVDIMM x 8)
     • NVMe SSD: Intel SSD DC P3600 400GB (Node0)
     • Optane SSD: Intel Optane SSD DC P4800X 750GB (Node0)
     Software:
     • Distro: Ubuntu 17.10 (Linux kernel 4.16), PMDK 1.4
     • Filesystem: ext4 (DAX available)
     • PostgreSQL base: a467832 (master @ Mar 18, 2018)
     postgresql.conf:
     • {max,min}_wal_size = 20GB
     • shared_buffers = 16085MB
     • checkpoint_timeout = 12min
     • checkpoint_completion_target = 0.7
     (Figure: the server runs on NUMA Node0, which holds the PMEM, NVMe SSD, and Optane SSD; the benchmark client runs on Node1)
  • 18. i) Evaluation of XLOG hacks
     • Compare transaction throughput using pgbench
       • pgbench -i -s 200
       • pgbench -M prepared -h /tmp -p 5432 -c 32 -j 32 -T 1800
       • The checkpoint runs twice during the benchmark
       • Run 3 times and take the median as the result
     • Conditions (PGDATA on NVMe SSD / ext4 (no DAX) in all cases):
       (a) No hack, wal_sync_method=fdatasync, XLOG on Optane SSD / ext4 (no DAX)
       (b) No hack, wal_sync_method=open_datasync, XLOG on Optane SSD / ext4 (no DAX)
       (c) No hack, wal_sync_method=fdatasync, XLOG on PMEM / ext4 (DAX enabled)
       (d) No hack, wal_sync_method=open_datasync, XLOG on PMEM / ext4 (DAX enabled)
       (e) Hacked with PMDK, wal_sync_method=libpmem (new), XLOG on PMEM / ext4 (DAX enabled)
  • 19. i) Results
     (Chart: transaction throughput; higher is better. The PMDK-hacked condition (e) is +3%.)
  • 20. i) Discussion
     • Improved transaction throughput by 3%
       • Roughly the same improvement as Yoshimi reported on pgsql-hackers
       • Seems small as a percentage, but not so small in absolute terms (+1,200 TPS)
     • Future work
       • Performance profiling
       • Searching for a query pattern for which our hack is more effective
  • 21. Overview of my talk (agenda repeated; next: 2.ii Relation segment file)
  • 22. ii) Relation segment file
     • So-called data file (or checkpoint file)
       • Table, Index, TOAST, Materialized View, ...
     • Variable length, up to 1 GiB
       • A huge table and so forth consists of multiple segment files
     • Critical for checkpoint duration
       • Dirty pages in the shared buffer are written back to the segment files
       • A checkpoint takes less time if the pages are written back sooner
  • 23. ii) How we hack Relation
     • Memory-map every segment file at its full 1-GiB size
       • A memory-mapped file cannot extend or shrink
       • Remapping the file seems difficult for me to implement
     • Memory-copy to it from the shared buffer
     • Patch <backend/storage/smgr/md.c> and so on
       • 2 files changed, 152 insertions
       • Under test, not published yet
  • 24. ii) Evaluation setup
     Same hardware and software as in (i), except postgresql.conf:
     • checkpoint_timeout = 1d  <= Not to kick a checkpoint automatically
     • checkpoint_completion_target = 0.0  <= To complete the checkpoint ASAP
  • 25. ii) Evaluation of Relation hacks
     • Compare checkpoint duration as follows:
       1. Load relation segments with pgbench -i -s 800, then flush
       2. Restart the server and run pgbench for a few minutes (7.37 GiB of dirty pages in the shared buffer)
       3. Issue CHECKPOINT via psql and read the total duration from the server log
     • Conditions:
       (a) No hack, PGDATA on Optane SSD / ext4 (no DAX)
       (b) No hack, PGDATA on PMEM / ext4 (DAX enabled), profiled by Linux perf
       (c) Hacked with PMDK, PGDATA on PMEM / ext4 (DAX enabled), profiled by Linux perf
  • 26. ii) Results
     (Chart: checkpoint duration; lower is better. The PMDK-hacked condition is -30%.)
  • 27. ii) Profiling
     (Chart: CPU profile of the checkpoint, hacked vs. unhacked)
     • -20 points: overhead of write() (sky blue)
     • -7 points: context switches (green)
     • -3 points: memory copy (deep blue and brown)
     • +5 points: LockBufHdr (orange)
  • 28. ii) Discussion
     • Shortened checkpoint duration by 30%
       • The server can give its computing resources to other purposes
     • Reduced the overhead of system calls and context switches
       • Benefits of using memory-mapped files!
     • The time spent in LockBufHdr became rather longer
       • Open issue...
  • 29. Conclusion of evaluation
     • (i) Improved transaction throughput by 3%
       • With a 1,000-line hack for WAL
     • (ii) Shortened checkpoint duration by 30%
       • With a 150-line hack for Relation
     • We must bring out more of PMEM's potential
       • Not so bad for an easier way, but far from the “2.5X” of the micro-benchmark
       • I think another way is to have data structures on DRAM persist on PMEM directly
  • 30. Overview of my talk (agenda repeated; next: 3. Tips related to PMEM)
  • 31. CPU cache flush and cache-bypassing store
     • The data should reach nothing but PMEM
       • Don't stop halfway in a volatile middle layer such as the CPU caches
       • Or it will be lost when the program or system crashes
     • x64 offers two instruction families
       • CLFLUSH: flush data out of the CPU caches to memory
       • MOVNT: store data to memory, bypassing the CPU caches
     • PMDK supports both
       • pmem_flush (CLFLUSH)
       • pmem_memcpy_nodrain (MOVNT)
  • 32. Memory-mapped file and Relation extension
     • The two are not compatible
       • A memory-mapped file cannot be extended while being mapped
     • Neither naive way is perfect
       • Remapping a segment file on extend is time-consuming
       • Pre-allocating the maximum size (1 GiB per segment) wastes PMEM free space
     • We must rethink traditional data structures towards the PMEM era
  • 33. Page table and hugepage
     • Hugepages will improve the performance of PMEM
       • By reducing page walks and Translation Lookaside Buffer (TLB) misses
     • PMDK on x64 considers hugepages
       • By aligning the mapping address on a hugepage boundary (2MiB or 1GiB) when the file is large enough
     • Pre-warming the page table for PMEM will also improve performance
       • By reducing page faults at main runtime
  • 34. Controlling NUMA effect
     • Critical for stable performance on a multiprocessing system
       • Accessing local memory (DRAM and/or PMEM) is fast, while accessing remote memory is slow
       • This also applies to PCIe SSDs, but PMEM is more sensitive
     • Bind processes to a certain node:
       • numactl --cpunodebind=X --membind=X -- pg_ctl ...
  • 35. • Something other than storage access could be the hotspot of a transaction when we use PMEM • Such as... • Concurrency control such as locking • Redundant internal memory copy • Pre-processing such as SQL parsing • We fell into this trap and avoided it by prepared statements New common sense of hotspot (Conceptual figure: with disk, disk access dominates the transaction time; with PMEM, the other portions become the hotspot.)
  • 36. 1. Introduction • Persistent Memory (PMEM), DAX for files, and PMDK 2. Hacks and evaluation i. XLOG segment file (Write-Ahead Logging) ii. Relation segment file (Table, Index, …) 3. Tips related to PMEM • Programming, benchmark, and operation 4. Conclusion Overview of my talk
  • 37. • Applied PMDK into PostgreSQL • In an easier way to use memory-mapped files and memory copy • Achieved not-so-bad results • +3% transaction throughput • -30% checkpoint duration • Showed tips related to PMEM • PMEM will change software design drastically • We should change software and our mind to bring out PMEM’s potential much more • Let’s try PMEM and PMDK :) Conclusion
  • 38. Backup slides
  • 39. How to hack (Figure: in current PostgreSQL, the XLOG buffer and shared buffer are written to XLOG and relation files on disk with write() at commit and checkpoint; in our hacks, those files live on a DAX-enabled filesystem on PMEM and are memory-mapped, so the buffers are memory-copied to them directly.)
  • 40. • Defines behavior between user space and OS • Here we focus on NVM.PM.FILE mode SNIA NVM Programming Model https://www.snia.org/sites/default/files/technical_work/final/NVMProgrammingModel_v1.2.pdf
  • 41. API walkthrough (Blue: Intel Intrinsics; Red: PMDK (libpmem))
    Open:
      read/write:          fd = open(path, flags, mode);
      memory-mapped file:  fd = open(path, flags, mode);
      PMDK (libpmem):      pmem = pmem_map_file(path, len, flags, mode, ...);
    Allocate:
      read/write:          -
      memory-mapped file:  err = posix_fallocate(fd, 0, len);  /* map cannot be extended, so pre-allocate the file */
      PMDK (libpmem):      -
    Map:
      read/write:          -
      memory-mapped file:  pmem = mmap(NULL, len, ..., fd, 0);
      PMDK (libpmem):      -
    (Close):
      read/write:          -
      memory-mapped file:  ret = close(fd);  /* accessing file via mapped address; not fd any more */
      PMDK (libpmem):      -
    Write:
      read/write:          nbytes = write(fd, buf, count);
      memory-mapped file:  memcpy(pmem, buf, count);
      PMDK (libpmem):      pmem_memcpy_nodrain(pmem, buf, count);  /* bypassing cache if able, instead of flushing it */
    Flush:
      read/write:          -
      memory-mapped file:  for (i = 0; i < count; i += 64) _mm_clflush(&pmem[i]);
      PMDK (libpmem):      -
    Sync:
      read/write:          ret = fdatasync(fd);
      memory-mapped file:  _mm_sfence();
      PMDK (libpmem):      pmem_drain();
    Read:
      read/write:          nbytes = read(fd, buf, count);
      memory-mapped file:  memcpy(buf, pmem, count);
      PMDK (libpmem):      memcpy(buf, pmem, count);
    Unmap:
      read/write:          -
      memory-mapped file:  ret = munmap(pmem, len);
      PMDK (libpmem):      ret = pmem_unmap(pmem, len);
    Close:
      read/write:          ret = close(fd);
      memory-mapped file:  -
      PMDK (libpmem):      -

Editor's Notes

  1. Good afternoon, everyone. I'm very happy to speak to you today. My presentation today is about introducing PMDK into PostgreSQL.
  2. First, let me introduce myself. I have worked on system software, such as distributed block storage and operating systems. This is my first time diving into PostgreSQL. I am trying to refine open-source software with a new storage device and a new library, and I chose PostgreSQL. Today, I will show you our first ideas, and I want to have better ideas. So any discussions and comments are welcome.
  3. This is an overview of my talk.
  4. Let me begin with introduction.
  5. Persistent Memory, PMEM, is an emerging memory-like storage device. It’s non-volatile, byte-addressable, and as fast as DRAM. There are several types of PMEM released or announced. We use NVDIMM-N in this presentation.
  6. This chart shows how fast PMEM is. It shows the read and write latency of each device, so lower is better. The leftmost is DRAM and the middle three are PMEM. Solid-state disk and spindle disk are far from DRAM and PMEM. Zoom into the left four: every type of PMEM is close to DRAM. So it’s fast enough for storage use.
  7. Popular databases are already on the way to PMEM. Microsoft SQL Server, SAP HANA and MariaDB are getting to PMEM. And, of course, so is PostgreSQL. Early work was reported at PGCon two years ago. So I think it’s time for us to get into PMEM.
  8. What do we need to use PMEM? We need hardware support and software support such as those written here. Today, I’d like to focus on performance improvement and talk about DAX and PMDK. Both of the two are already supported on Linux and Windows.
  9. I’d like to explain DAX, PMDK, and then what we did. The current I/O stack is as shown on the left. PostgreSQL uses file I/O APIs such as the read and write system calls in POSIX. Then, go to the middle. PostgreSQL can run on a DAX-enabled filesystem without change because the provided APIs are the same as before. It can also run faster than before because DAX reads and writes file data directly on PMEM. It bypasses the page cache to remove unnecessary copies. Note that there are still context switches between user and kernel spaces. Next, go to the right. A DAX-enabled filesystem also provides the memory-mapped file feature. It maps the file on PMEM directly into user space. The application can use CPU instructions to access the file data without context switches, so it can run much faster. For this path, PMDK provides primitive memory functions such as CPU cache flush and memory barrier to make sure that the data reaches PMEM, so we don’t need to implement such functions by ourselves.
  10. Finally, what we did is here. We hacked PostgreSQL to use memory-mapped files and the PMDK library. Then we compared the performance of the middle and the right to evaluate our hacks. For details, I’ll show you in later slides.
  11. The memory-mapped file and PMDK can improve the performance of I/O-intensive workloads because they reduce context switches and the overhead of API calls. To confirm this, we ran a micro-benchmark that performs 8-KiB synchronous writes. This emulates the XLOG I/O inside PostgreSQL. The results show that ... We thought it would be good to apply PMDK, so we tried to introduce it into PostgreSQL.
  12. Next, I’ll talk about our hacks and evaluation.
  13. Before going into detail, I’ll show you our approach. There are two important points: how to hack and what to hack. First, how to hack. We had two ideas and chose the first one. We think it’s an easier way, so it’s reasonable for our first step. Second, what to hack. We looked into PostgreSQL to find out what kinds of files are used in it. We chose the XLOG and Relation segment files because the former is critical for transaction performance and the latter receives many writes during checkpoint.
  14. Roughly speaking, we replace five system calls, including read and write, by using PMDK. The four blue functions are provided by libpmem, the base library of PMDK. Note that the memory-mapping and unmapping functions take the length of the mapped file. Please be careful that a memory-mapped file cannot extend or shrink while being mapped.
  15. Okay, let’s go to XLOG. An XLOG segment file contains write-ahead log records. It’s critical for transaction performance. I know that you are all PostgreSQL experts, so the details are skipped.
  16. I’ll show you how we hack XLOG. We have PostgreSQL memory-map every segment file and memory-copy to it from the XLOG buffer. Note that the fixed-length file is highly compatible with memory-mapping because it doesn’t need to either extend or shrink. My colleague Yoshimi created a patchset. It has about 1,000 lines of changes and it’s available on the pgsql-hackers mailing list, so please search for PMDK.
  17. Evaluation setup. I use one x64 computer with two NUMA nodes. I dedicate Node 0 to PostgreSQL server and Node 1 to client such as pgbench. The postgresql.conf is like this.
  18. To evaluate the XLOG hacks, I compare transaction throughput by using pgbench under the five conditions below. The parameters are here. The scale factor is 200 [two-hundred] and the runtime is 30 minutes. Condition (e) corresponds to our hacks. Under conditions (c), (d), and (e), the server uses PMEM for XLOG segment files, but each of them uses a different wal_sync_method. So we should compare these to evaluate our hacks. (a) and (b) are for your information. They use Optane SSD for XLOG segment files, instead of PMEM.
  19. These are the results. Each bar shows the throughput in kilo TPS, so higher is better. Condition (e) achieves the best result. Comparing (c) and (e), the throughput is improved by 3 percent. This is because of our hacks.
  20. So we achieved an improvement of three percent. It is roughly the same as Yoshimi reported on the hackers mailing list. It seems small as a percentage, but I think it’s not so small as an absolute value. For future work, we will do performance profiling to find the hotspot. Also, we will search for a query pattern for which our hack is more effective. In the abstract of our talk, we said we achieved up to 1.8 [one point eight] times as much throughput by using an INSERT-intensive benchmark of our own. That is quite artificial, but we think our hack is effective for such workloads, so we will search for a more real-life pattern.
  21. That’s all of the XLOG hacks. Then, let me explain the Relation hacks.
  22. The relation segment file is so-called data file. I think it’s critical for checkpoint duration. The details are skipped again because I believe you are experts.
  23. I have PostgreSQL memory-map only the full 1-GiB segment files and memory-copy to them from the shared buffer. The reason for “only 1 GiB” is that it’s difficult for me to handle extending and shrinking in the proper PostgreSQL manner. A relation segment file can extend or shrink up to 1 GiB per file, but a memory-mapped file cannot do so while being mapped. Remapping the file on every extend or shrink could be a solution, but it seemed difficult to implement. I decided to map only the 1-GiB segment files that cannot extend any more, and I ignored shrinking. So my work is still in progress. I created a patchset. It has 150 lines of changes. My patchset is not published yet because it’s still under test.
  24. The setup is the same as before except postgresql.conf. I set checkpoint_timeout to 1 day. It’s large enough not to kick a checkpoint automatically. I also set checkpoint_completion_target to zero to complete every checkpoint as soon as possible.
  25. I compare the checkpoint duration time as follows. It’s a little tricky, but the important point is that, under every condition, the server flushes out the same amount of pages from the shared buffer during checkpoint. I set the amount to 7.37 (seven point three seven) gibibytes. Then I initialize the database, restart the server, and have the shared buffer dirtied by pgbench. Next I send the CHECKPOINT query to the server to kick a forced immediate checkpoint. After the checkpoint is done, I check the server log to make sure the amount of flushed pages is 7.37 GiB. Finally I get the total checkpoint time for the result. I compare the three conditions <here>. Condition (c) corresponds to our hacks. Under conditions (b) and (c), the server uses PMEM for PGDATA [pee-gee-data]. Condition (a) is for your information. It uses Optane SSD for PGDATA, instead of PMEM. Moreover, I profile the performance of (b) and (c) with Linux perf.
  26. These are the results. Each bar shows the checkpoint duration, so lower is better. Comparing (b) and (c), the time is improved by 30 percent.
  27. Next, the profiles of (b) and (c). This shows relative CPU cycles, regarding the total cycles of (b) as 100 [one hundred]. (c) indicates 71 [seventy-one], which almost matches the improvement of 30 [thirty] percent. First, look at the sky-blue portion. This stands for the overhead of sys_write, the write system call. Note that the data copy in the write system call is separated out as the deep-blue portion. The overhead falls by 20 percentage points. This is the main reason for our improvement. Second, look at the green portion. It falls by 7 percentage points. The green stands for the context switches between user and kernel spaces. Third, the deep-blue and brown portions. Both of them stand for memory copy, but the blue is that of kernel space while the brown is user space. The sum of the two falls a little, by 3 percentage points. The user-space function is provided by PMDK. The function uses instruction set extensions such as SSE2 or AVX if possible, so it can take less time than the general function in kernel space. Finally, the orange portion, LockBufHdr [lock-buf-header]. This becomes rather longer than before, but I don’t know the reason why.
  28. We shortened checkpoint duration by 30%. I know that a regular checkpoint is not performed at full speed as in our evaluation. However, I think the server can give its computing resources to other purposes, so it may serve more transactions than before during checkpoint. We reduced the overhead of system calls and the context switches between user and kernel spaces. These are the benefits of using memory-mapped files. However, there is an open issue. The time of LockBufHdr became rather longer. I’ll find the reason in future work.
  29. This is the conclusion of our evaluation. We improved transaction throughput and shortened checkpoint duration. I think that’s not so bad for an easier way, but we must bring out more potential from PMEM, because our improvement is far from the 2.5 times seen in the microbenchmark. I think another way is to have data structures on DRAM persist on PMEM directly. So I will continue my work and dive deep into PostgreSQL.
  30. Then I’ll show some tips related to PMEM.
  31. First, CPU cache flush and cache-bypassing store. These methods are important because the data should reach nothing but PMEM; that is, it must not stop halfway in a volatile middle layer such as CPU caches, or the data will be lost when the program or system crashes. For that purpose, x64 offers two instruction families, CLFLUSH [see-el-flush] and MOVNT [move-en-tee]. CLFLUSH flushes data out of CPU caches to memory. MOVNT stores data to memory, bypassing CPU caches. PMDK supports both instructions through the functions <here>, so you don’t need to write assembly code.
  32. Second, the memory-mapped file and Relation extension, extension in file size. The two are not compatible, because a memory-mapped file cannot be extended while it is being mapped. I found two naive workarounds, but neither is perfect. The first is remapping a segment file on extend, that is, unmapping, extending, then mapping again. It’s compatible, but also time-consuming, so it’s not good to remap in a small unit such as 8 KiB. The second is pre-allocating the possible maximum size. It’s not time-consuming, but it wastes free space on PMEM. There will be a middle way, of course, but what I want to say is that we must rethink traditional data structures, especially for DBMSs, which are based on fixed-length pages on disk. PMEM is memory, so you can use a pointer, for example. Such properties will drastically change persistent data structures for DBMSs.
  33. We should take care of the page table for PMEM. For x64, the regular page size is 4 KiB, but that can be so small that many page faults occur while storing large data. This overhead has a significant impact on performance. Modern processors and operating systems support hugepages, such as 2-MiB or 1-GiB pages on x64. Hugepages will improve the performance of PMEM by reducing page walks and TLB [tee-el-bee] misses. The good news is that PMDK on x64 considers hugepages by aligning the mapping address on a hugepage boundary when the file is large enough, so you can enjoy it with ease. Pre-warming the page table for PMEM will also make the performance better by reducing page faults during the main runtime.
  34. We also take care of the NUMA effect. It’s critical for stable performance on a multiprocessing system. Consider <this> figure, our evaluation setup. For the CPU on node 0, accessing local memory on node 0 is fast, while remote node 1 is slow. This also applies to PCIe SSD, but PMEM is more sensitive. The operating system schedules which processes run on which CPU cores, so the server process can run on either node 0 or 1 if you do not control it explicitly, and it runs slowly because of memory access across NUMA nodes. We can control the effect by binding the processes to a certain node. For that purpose, we can use the numactl [numa-control] command like <this>. If you assign 0 to X here, your process runs only on node 0, and never runs on node 1.
  35. And this is the last tip. I think we should have a new common sense of hotspots. PMEM is much faster than traditional disk storage, so something other than storage access could be the hotspot of a transaction. When we use disk storage, the time of disk access is usually the hotspot. But when we use PMEM, the access time decreases and other portions can become the hotspot, such as those described <here>: locking, memory copy, or pre-processing. Actually, we fell into the SQL parsing trap when we ran pgbench, and we avoided it with prepared statements.
  36. Finally, conclusion.
  37. We applied PMDK to PostgreSQL in an easier way that uses memory-mapped files and memory copy. And we achieved not-so-bad results. I also showed some tips related to PMEM. I believe that PMEM will change software design drastically, and we should change software and our mind to bring out PMEM’s potential much more. I will continue my work, and I hope that you will get interested in it. So let’s try PMEM and PMDK.