Good afternoon, everyone. Thank you for joining this presentation.
My name is Yoshimi Ichiyanagi. I'm an open-source software engineer at NTT labs in Japan.
I came from Japan a few days ago. So I have jet lag.
I'm very happy to speak to you today.
I'd like to talk about the challenges of implementing PMEM-aware applications with PMDK.
The purpose of my presentation is to share the know-how we gained by rewriting applications to be PMEM-aware.
This is an overview of my talk.
Firstly, a brief self-introduction; secondly, our background and motivation; thirdly, the challenges of implementing PMEM-aware applications; and finally, the challenges of performance evaluation to get valid results.
First of all, let me introduce myself.
I have worked at the Software Innovation Center of NTT labs for twelve years.
NTT is a Japanese telecommunications company.
The NTT Software Innovation Center promotes open innovation centered on the development of open-source platforms.
NTT labs has many committers and developers of open-source software such as OpenStack, Swift, and Apache projects.
I have worked on system software, namely distributed filesystems such as HDFS and operating systems such as the Linux kernel.
Now, I work on storage applications.
Today I'd like to talk about rewriting open-source software to use a new kind of storage and a new library.
Next, I’d like to talk about our research background and motivation.
Persistent memory (PMEM), such as NVDIMM-N and Intel Optane DC Persistent Memory, is beginning to become available.
PMEM has several features. Its memory-like features are low latency and byte addressability, and its storage-like features are large capacity and non-volatility.
We are trying to rewrite storage applications so that these PMEM features are fully utilized. By storage applications I mean RDBMSs such as PostgreSQL, message queue systems such as Apache Kafka, and so on.
Let me share the challenges I faced in rewriting these applications to be PMEM-aware and in evaluating their performance.
Next, I’d like to talk about how to use PMEM.
We used the hardware and software shown here.
There are three components: NVDIMM-N, Direct Access for files (DAX), and the Persistent Memory Development Kit (PMDK).
DAX-enabled filesystems and PMDK are already supported on Linux and Windows.
The previous speakers have talked about them, so I think you already know them.
But I’m going to explain DAX FS and PMDK just in case someone doesn’t know them.
I will show three patterns of I/O stacks.
First, the left figure shows the current I/O stack.
Many applications use file I/O APIs such as the read and write system calls.
Second, the center figure shows DAX FS.
A DAX filesystem allows applications to access persistent memory directly, without going through the page cache. So it can run faster than before, since unnecessary copies are removed.
But a context switch between user space and kernel space is still needed for each system call, so the context switches become overhead.
[Move the pointer across the slide]
Finally, the right figure shows how to access PMEM with DAX FS and PMDK.
A DAX filesystem also provides a memory-mapped file feature.
DAX FS maps the file on PMEM directly into the virtual address space of the application.
The application can use ordinary CPU instructions to access the file data without context switches, so it can run much faster than before.
For this access model, PMDK provides primitive memory functions, such as CPU cache flushes and memory barriers, to make sure that the data reaches PMEM.
So, by using PMDK, we don’t need to write such functions by ourselves. [end]
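To make this concrete, here is a minimal sketch of storing data through libpmem instead of write()/fsync(); it is not taken from our patches, and the path "/mnt/pmem/example" and the 4 KB size are just example values.

#include <libpmem.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;

    /* Create (if needed) and map the file into our address space. */
    char *addr = pmem_map_file("/mnt/pmem/example", 4096,
                               PMEM_FILE_CREATE, 0600,
                               &mapped_len, &is_pmem);
    if (addr == NULL) {
        perror("pmem_map_file");
        return 1;
    }

    /* Store data with ordinary CPU instructions, no syscall involved. */
    strcpy(addr, "hello, persistent memory");

    /* PMDK issues the cache flushes and the memory barrier for us;
     * fall back to msync() if the mapping is not real PMEM. */
    if (is_pmem)
        pmem_persist(addr, mapped_len);
    else
        pmem_msync(addr, mapped_len);

    pmem_unmap(addr, mapped_len);
    return 0;
}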
The biggest benefit of a DAX-enabled FS is that it is not necessary to rewrite applications.
Memory-mapped files and PMDK can improve the performance of I/O-intensive workloads, because they reduce context switches and the overhead of API calls.
But PMDK cannot be used without changing the application. So I rewrote PostgreSQL and Apache Kafka in order to improve their performance with PMDK.
I think PMDK has three notable features.
First, as I said, an application accesses PMEM as memory-mapped I/O.
Second, application developers can select a fine-grained sync granularity. I'll show you the details in the next slides.
Finally, CPU instructions suited to the size of the copied data are selected.
1. For example, when the server's CPUs support SSE2, data is copied with 16-byte (sixteen-byte) registers as much as possible.
2. When the server's CPUs support AVX, data is copied with 32-byte (thirty-two-byte) registers as much as possible.
3. And if the server has more recent CPUs, data is copied with 64-byte registers as much as possible.
Furthermore, the MOVNT (move non-temporal) instructions together with SFENCE store data to memory while bypassing the CPU caches.
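As a rough illustration of what such a copy can look like in code (this is not PMDK's actual implementation), a 16-byte non-temporal store with SSE2 intrinsics could be written like this:

#include <emmintrin.h>   /* SSE2 intrinsics: MOVNTDQ, SFENCE */

/* Copy 16 bytes with a non-temporal (MOVNT) store and then fence, so the
 * data goes toward memory without being kept in the CPU caches.
 * dst must be 16-byte aligned. */
static void store16_nontemporal(void *dst, const void *src)
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_stream_si128((__m128i *)dst, v);   /* MOVNTDQ */
    _mm_sfence();                          /* SFENCE: order the store */
}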
Next, I'll explain the second feature of PMDK.
The second feature is that PMDK provides some sync APIs. In particular, I’ll talk about pmem_msync() and pmem_drain(). Both are PMDK APIs.
The key difference between pmem_msync() and pmem_drain() is what kind of data is flushed.
The most general sync function for a memory-mapped file is the msync syscall, and pmem_msync() calls msync internally.
By calling msync, both the file metadata and the written file data are flushed.
On the other hand, by calling pmem_drain(), only the written file data is flushed.
So, pmem_drain() is faster than pmem_msync().
We tried to rewrite PostgreSQL and Apache Kafka as our storage applications in order to improve their I/O performance with PMDK.
Let me share know-how gained by rewriting PostgreSQL.
We looked into PostgreSQL to find out what kinds of files are used in it.
We chose two targets: checkpoint files and WAL files.
Checkpoint files were chosen because many writes occur during a checkpoint.
WAL files were chosen because they are critical for transaction performance.
I'll talk about the challenges of implementing a PMEM-aware PostgreSQL.
But before that, I’d like to talk about how to rewrite PostgreSQL.
We simply replaced standard file I/O (system calls) with memory-mapped I/O (PMDK APIs).
As I said, we chose two targets: checkpoint files and WAL files.
First, I’ll show you how to hack checkpoint files.
In PostgreSQL, a huge table and similar objects consist of multiple checkpoint files.
The size of a checkpoint file is up to 1 GB, so it is necessary to resize checkpoint files.
PMDK provides APIs for memory-mapped files, and it is difficult to resize a memory-mapped file without overhead. I think the best practice is to access only fixed-size files. So we changed the code so that only full 1 GB checkpoint files are accessed with PMDK.
Then, what’s this overhead?
It is remapping the file. I'd like to talk about how to resize a memory-mapped file.
When the memory-mapped file is enlarged, only part of the file can be accessed with PMDK.
This other range [point to the figure] can't be accessed with PMDK, so remapping the file is needed.
The file is remapped so that the whole file can be accessed with PMDK.
Next, I’ll show you how to implement resizing the memory-mapped file.
Here is how to enlarge a file.
On a DAX-enabled FS, the open and close syscalls are called.
Then the largest written data offset becomes the file size, so application developers don't need to call the ftruncate syscall in their source code.
On the other hand, on a DAX-enabled FS with PMDK, pmem_map_file(), pmem_unmap(), the truncate syscall, pmem_map_file(), and pmem_unmap() are called.
Three function calls are added to use PMDK, and these three calls become overhead.
Here is how to shrink a file.
On a DAX-enabled FS, the open, ftruncate, and close syscalls are called.
On the other hand, on a DAX-enabled FS with PMDK, pmem_map_file(), pmem_unmap(), the truncate syscall, pmem_map_file(), and pmem_unmap() are called. Two function calls are added to use PMDK, and these two calls become overhead.
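To make the PMDK sequence above concrete, here is a hedged sketch of a remap helper; it is not the code from our patches, and error handling and locking are omitted.

#include <libpmem.h>
#include <unistd.h>

/* Unmap the current mapping, resize the file with truncate(), and map it
 * again so that the whole (new) file range is accessible through PMDK.
 * Returns the new mapping, or NULL on failure. */
static void *remap_pmem_file(const char *path, void *old_addr,
                             size_t old_len, size_t new_len)
{
    size_t mapped_len;
    int is_pmem;

    pmem_unmap(old_addr, old_len);

    if (truncate(path, (off_t)new_len) != 0)
        return NULL;

    /* len = 0 and no PMEM_FILE_CREATE: map the existing file as-is. */
    return pmem_map_file(path, 0, 0, 0, &mapped_len, &is_pmem);
}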
Then, if data is written outside the mapped range, a segmentation fault happens.
I think that when the file is remapped, it is necessary to manage the mapped address. Of course, while the file is unmapped, it can't be accessed with PMDK. I think locking is needed, and that reduces application performance.
In these cases, it would be better to use only DAX FS. [end]
So it's difficult to use PMDK unless the file size is fixed. With PMDK, the munmap(), close(), open(), and mmap() syscalls are called again every time the file is remapped.
Repeating the remapping many times degrades performance, because remapping a file has a large overhead. Alternatively, mapping large files in advance may fill up the filesystem.
I think the best practice is to use only fixed-size files.
Next, I’ll show you how to hack WAL files.
The size of WAL files is fixed. Such fixed-size files are highly suitable for memory-mapped I/O, because it is not necessary to enlarge or shrink them.
For initialization of a WAL file, PostgreSQL creates the file and fills it with zeros, then flushes the file metadata at the end of initialization. This metadata includes the file size, the inode's indirect blocks, and so on.
On the other hand, synchronous logging consists of sequential synchronous writes, and it is only necessary to flush the written data.
PMDK provides some sync APIs, for example pmem_msync() and pmem_drain().
Which sync function is better for initialization of the WAL file? And which sync function is better for "synchronous logging"?
It would be better to use pmem_msync() for initialization of the WAL file, because the file metadata should be flushed.
On the other hand, either is fine for synchronous logging, but pmem_drain() is faster than pmem_msync(), so we selected pmem_drain() for synchronous logging.
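As a hypothetical sketch of these two cases (the names wal, wal_len, rec, and insert_off are placeholders, not PostgreSQL symbols):

#include <libpmem.h>
#include <string.h>

/* Initialization: zero-fill the newly created WAL segment, then use
 * pmem_msync() so that the file metadata (size, block allocation) is
 * flushed together with the data. */
static void wal_init(char *wal, size_t wal_len)
{
    memset(wal, 0, wal_len);
    pmem_msync(wal, wal_len);
}

/* Synchronous logging: append one record, then drain.  Only the written
 * data has to be flushed, so the faster pmem_drain() is enough here. */
static void wal_append(char *wal, size_t insert_off,
                       const void *rec, size_t rec_len)
{
    pmem_memcpy_nodrain(wal + insert_off, rec, rec_len);
    pmem_drain();
}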
Next, I’d like to talk about comparing the performance between pmem_msync() and pmem_drain().
I ran a micro-benchmark to compare the performance of pmem_msync() and pmem_drain(). It emulates the WAL file I/O inside PostgreSQL, and I measured only synchronous logging.
Here are the details of the micro-benchmark. I replaced the write syscall with the PMDK API pmem_memcpy_nodrain(), and I replaced the fdatasync syscall with a PMDK API, either pmem_msync() or pmem_drain().
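A rough sketch of the benchmark's inner loop under this substitution (dst, buf, and the helper name are placeholders, not the actual benchmark code):

#include <libpmem.h>
#include <stddef.h>

#define BLOCK_SIZE 8192   /* 8 KB blocks, as in the benchmark */

/* dst: mapping obtained from pmem_map_file(); buf: an 8 KB source buffer.
 * total: total amount of data to write (10 GB in the benchmark). */
static void bench_drain(char *dst, const char *buf, size_t total)
{
    for (size_t off = 0; off + BLOCK_SIZE <= total; off += BLOCK_SIZE) {
        /* was: write(fd, buf, BLOCK_SIZE) */
        pmem_memcpy_nodrain(dst + off, buf, BLOCK_SIZE);
        /* was: fdatasync(fd); the pmem_msync() variant calls
         * pmem_msync(dst + off, BLOCK_SIZE) here instead */
        pmem_drain();
    }
}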
This is the evaluation setup.
I used one HPE computer server with two NUMA nodes.
I ran the micro-benchmark on Node 1, because there was PMEM on Node 1.
The total size of the written data is 10GB and the block size is 8KB.
The micro-benchmark I/O throughput on DAX-enabled FS is 4.3 GB/sec, the throughput with pmem_msync() is 0.0023 GB/sec, and the throughput with pmem_drain() is 15GB/sec.
pmem_drain() is the fastest of the three patterns, as expected. [end]
pmem_drain() greatly improves the performance of I/O-intensive workloads.
But you should use pmem_drain() with caution.
pmem_drain() can't flush file metadata, so pmem_msync() should be called by applications that rely on file metadata such as the last modification time, the last access time, and so on.
In addition, pmem_drain() doesn’t work without pmem_memcpy_nodrain().
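An illustrative caution (the function names are mine, not from PMDK's documentation): pmem_drain() only waits for stores that were already streamed or flushed toward PMEM; it does not flush the CPU caches by itself.

#include <libpmem.h>
#include <string.h>

/* Not persistent: the data may still sit in the CPU caches, and
 * pmem_drain() alone does nothing to push it out. */
static void copy_not_persistent(char *dst, const char *src, size_t len)
{
    memcpy(dst, src, len);
    pmem_drain();
}

/* Persistent: pmem_memcpy_nodrain() streams or flushes the data, so only
 * the final drain step is left to do. */
static void copy_persistent(char *dst, const char *src, size_t len)
{
    pmem_memcpy_nodrain(dst, src, len);
    pmem_drain();
    /* With a plain memcpy() you would instead need
     * pmem_flush(dst, len); pmem_drain();  (i.e. pmem_persist()). */
}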
Now, I'd like to move on to the next topic, which is challenges for performance evaluation to get valid results.
It's difficult to get valid results in PostgreSQL performance evaluation.
Results such as application performance depend on NUMA and on the CPUs. In order to get valid results, it is better to avoid NUMA effects and to avoid the CPUs becoming a hotspot. It is also necessary to tune the application for PMEM.
I'd like to talk about how to evaluate the performance to get valid results.
Now, I’d like to talk about NUMA effect.
Would you please look at this figure [point], which is our evaluation setup. For the CPU on node 1, accessing local memory on node 1 is fast, while accessing remote memory on node 0 is slow.
This also applies to PCIe SSDs, but persistent memory is more sensitive.
I ran the micro-benchmark on the local NUMA node 1, and I also ran it on the remote node 0.
This benchmark is the same as before.
The I/O throughput on the local node is 15 GB/sec, and that on the remote node is 11 GB/sec.
Synchronous writes are about 1.4 times faster on the local NUMA node than on the remote NUMA node.
Finally, I’d like to talk about tuning an application for persistent memory.
It's important to prevent computation from becoming the hotspot; in PostgreSQL, for example, that is SQL processing.
Stored procedures improve PostgreSQL performance, since user-defined functions are pre-compiled and stored in the PostgreSQL server.
For that purpose, we used the pgbench command with the prepared option, which is this [point] command line.
I ran the pgbench client with the prepared option, and I also ran it without the option, and compared the two.
pgbench is a popular PostgreSQL benchmarking tool used by many developers to run quick performance tests.
I ran this PostgreSQL server on Node 1 and I ran the client on Node 0.
I wrote patches for a PMEM-aware PostgreSQL; they are available on the pgsql-hackers mailing list.
Performance using PMEM improved by 12% compared with using the SSD.
On the Intel Optane SSD, with the original PostgreSQL, the pgbench result without stored procedures is 18,396 tps, and the one with stored procedures is 29,125 tps.
The result with stored procedures is 1.58 times faster than without.
On persistent memory, with the PMEM-aware PostgreSQL, the pgbench result without stored procedures is 21,406 tps, and the one with stored procedures is 36,449 tps.
The result with stored procedures is 1.70 times faster than without.
The improvement ratio using PMEM is higher than that using the Intel Optane SSD, as expected.
I’d like to give you a quick summary of what we’ve seen today.
The topics covered today were how to implement a PMEM-aware application and how to evaluate it to get valid results.
By applying PMDK to PostgreSQL, I learned that it is difficult to implement PMEM-aware applications and to get valid results in performance evaluation.
I hope that my presentation answered your questions.
Thank you so much for your kind attention. Now, are there any questions?[end]
Q&A
That’s a good question. Or Thank you for your question.
Could you speak a little slower, please? Or Could you repeat the question?
Does that make sense? Or: Do you have any questions so far?
Most people can be satisfied with an Optane SSD. Systems that require low latency need PMEM.
What is the difference between A and B?
If there are no questions, thank you very much.
The DAX FS tries to map a 2 MB linear region of PMEM to each file when possible.
Linux does not know the exact virtual addresses where the application has written data, and Linux manages memory on a per-page basis.
Linux marks a page dirty when a write page fault occurs, and the DAX FS uses huge pages when possible. So when the msync syscall is called, the whole 2 MB dirty page and the file metadata are flushed, which has a huge overhead.
I rewrote PostgreSQL this way, and it had a huge overhead.
So my coworker Menjo is developing this. I guess it will be published next year, maybe.