Persistent Memory
Dr. Benoit Hudzia
@blopeur
benoit@stratoscale.com
Agenda
NVM Evolution
Persistent Memory Linux Software Stack
Using and Emulating PMEM on Linux
Remote PMEM
Micro Storage Architecture
NVM Evolution
Persistent Memory
Yesterday: battery-backed RAM
Today: NVDIMM with RAM + flash
On power down, copy RAM to flash; on power up, copy back to RAM
Emerging NVDIMM: PCM, 3D XPoint, memristor, etc.
Offer up to 1000x the speed of NAND, closer to RAM
Characteristics as seen by software: synchronous model
Load / store memory instructions
New Generation HW: NVM is no longer the bottleneck
But still limited by block stack latency + asynchronous model
Asynchronous Model: NVMe
“When Poll is Better than Interrupt”, Yang et al., USENIX FAST 2012 https://www.usenix.org/legacy/events/fast12/tech/full_papers/Yang.pdf
● Active polling (SYNC) gives lower latency (at the expense of CPU) than MSI-X interrupts (ASYNC)
● Used in Intel SPDK
Enter Persistent Memory
[Figure: 4KB read vs 64B read. Source: Intel]
Moving away from Block I/O
[Figure labels: LATENCY, ACCESS]
Leads to a new tiered software stack
Challenge: Durability
PMEM Linux Software Stack
Linux kernel (>= 4.2) subsystem
NVDIMM Software Architecture
http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
BTT vs DAX
BTT: Block Translation Table
Provides atomic sector update semantics for persistent memory devices, so applications that rely on sector writes not being torn can continue to do so.
For legacy applications
DAX: Direct Access
Allows mapping a pmem range directly into userspace via mmap
If the application is aware of persistent, byte-addressable memory and can use it to its advantage, DAX is the best path for it
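To make "atomic sector update" concrete, here is a toy Python model of the idea behind a block translation table: new data is written out of place into a spare block, and the write only becomes visible when a single map entry is swapped. The class name `ToyBTT` and the `crash_before_commit` flag are made up for illustration. The real BTT layout (arenas, flog, lanes) is described in the NVDIMM Namespace Specification and is not modelled here.

```python
class ToyBTT:
    """Toy model of a block translation table (not the on-media format)."""

    def __init__(self, n_sectors, sector_size=512):
        self.sector_size = sector_size
        # One spare physical block beyond the logical capacity.
        self.blocks = [bytes(sector_size) for _ in range(n_sectors + 1)]
        self.map = list(range(n_sectors))  # logical sector -> physical block
        self.free = n_sectors              # index of the current spare block

    def write(self, lba, data, crash_before_commit=False):
        if len(data) != self.sector_size:
            raise ValueError("data must be exactly one sector")
        # Step 1: write the new contents out of place, into the spare block.
        self.blocks[self.free] = bytes(data)
        if crash_before_commit:
            return  # simulated power loss: the map still points at old data
        # Step 2: commit by swapping one map entry; the old block becomes spare.
        self.map[lba], self.free = self.free, self.map[lba]

    def read(self, lba):
        return self.blocks[self.map[lba]]
```

A reader sees either the old sector or the new one, never a mix: an interrupted write leaves the map pointing at the old block.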
Using and Emulating PMEM on Linux
Kernel Config (>= 4.2)
Enable NVDIMM dynamic debug before you start playing with NVDIMMs
Add to the kernel cmd line:
libnvdimm.dyndbg nfit.dyndbg nd_pmem.dyndbg nd_blk.dyndbg ignore_loglevel
Pick your PMEM
Use ACPI 6.0 compatible NVDIMM hardware or legacy NVDIMMs
Use virtual NVDIMMs provided by hypervisor
RAM as persistent memory
PCMSIM: NVM-disk Emulation
Emulation: RAM as PMEM
Bare metal:
Adding 'memmap=16G!16G' to the kernel boot parameters reserves 16G of memory, starting at offset 16G.
cat /proc/cmdline:
BOOT_IMAGE=/boot/vmlinuz-4.3.0-1-default root=UUID=39635fd6-64ee-4538-9964-7de6bb181181 resume=/dev/sda1 splash=silent quiet showopts memmap=1G!5G memmap=1G!7G
BTT works
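The `memmap=<size>!<start>` syntax is easy to read backwards, so here is a small Python sketch that extracts the reserved ranges from a kernel command line. The helper name `parse_memmap_pmem` is hypothetical, and it only handles decimal values with K/M/G suffixes, which is narrower than what the kernel itself accepts.

```python
import re

_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_memmap_pmem(cmdline):
    """Return a list of (start, size) in bytes for each memmap=<size>!<start>."""
    regions = []
    for size, size_unit, start, start_unit in re.findall(
            r"\bmemmap=(\d+)([KMG]?)!(\d+)([KMG]?)", cmdline):
        regions.append((int(start) * _UNITS[start_unit],
                        int(size) * _UNITS[size_unit]))
    return regions
```

On a live system it could be fed with `open("/proc/cmdline").read()`. For the command line above it reports two 1G regions, one starting at 5G and one at 7G.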
QEMU NVDIMM
QEMU:
qemu-system-x86_64 -object memory-backend-file,share,id=mem1,mem-path=/dax/D1 -device nvdimm,memdev=mem1,reserve-label-data,id=nv1 -m 2048,maxmem=100G,slots=10 ….
Not yet in upstream QEMU:
https://github.com/xiaogr/qemu/tree/nvdimm-v9
SeaBIOS integration:
http://www.seabios.org/pipermail/seabios/2015-September/009770.html
Playing with DAX
Only ext2, ext4 and xfs currently support DAX
Note that block size should match page size
mkfs.ext4 -b 4096 /dev/pmem1
mount -t ext4 -o dax /dev/pmem1 /tmp/dax/
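After mounting, it is worth checking that the dax option actually took effect. A minimal Python sketch, assuming the usual /proc/mounts line layout (device, mount point, fs type, options, ...) and assuming the kernel lists the option as `dax` or `dax=...`, which may differ between kernel versions. The helper name `dax_mounts` is made up for illustration.

```python
def dax_mounts(mounts_text):
    """Return the mount points whose option list enables DAX."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # not a well-formed mount entry
        for opt in fields[3].split(","):
            if opt == "dax" or (opt.startswith("dax=") and opt != "dax=never"):
                found.append(fields[1])
                break
    return found
```

Typical use would be `dax_mounts(open("/proc/mounts").read())`, which should include /tmp/dax after the mount above.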
Playing with DAX - Cont
Then you just have to mmap it!
But remember: CLFLUSH, etc. for durability
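A minimal Python sketch of the "mmap it, then flush" pattern. It uses `mmap.flush()` (msync) as the portable way to ask the kernel to make the stores durable; issuing CLFLUSH-style cache flushes directly from userspace is what libraries such as libpmem are for and is not shown here. The function name `persist_bytes` and its parameters are assumptions for illustration, and the path would be a file on the DAX mount, e.g. somewhere under /tmp/dax/.

```python
import mmap
import os


def persist_bytes(path, offset, data, length=mmap.PAGESIZE):
    """Store data into a memory-mapped file, then flush the mapping."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Make sure the file is large enough to back the mapping.
        if os.fstat(fd).st_size < length:
            os.ftruncate(fd, length)
        with mmap.mmap(fd, length) as m:          # shared mapping by default
            m[offset:offset + len(data)] = data   # plain memory stores
            m.flush()                             # msync: make them durable
    finally:
        os.close(fd)
```

Without the flush, the data may still sit in CPU caches or the page cache when power is lost, which is exactly the durability challenge this slide warns about.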
NVML: Let somebody else do the heavy lifting
http://pmem.io/
libpmem - Basic persistency handling
libvmmalloc - Transparently converts all dynamic memory allocations into persistent memory allocations
libpmemblk - Block access to pmem
libpmemlog - Log file on pmem (append-mostly)
libpmemobj - Transactional object store on pmem
Many more: pynvm, C++ bindings, etc.
Remote PMEM
Remote NVMe : using RDMA to transfer NVMe commands & data
http://blog.pmcs.com/flash-memory-summit-2015-special-nvm-express-rdma-awesome/
Transitioning from Indirect to Direct Flow
● Project Donard ( PMC - Microsemi)
● struct page backed PMEM patch (I/O memory is normally accessed via PFN only)
Comes with a challenge: durability vs visibility
http://www.snia.org/sites/default/files/SDC15_presentations/persistant_mem/ChetDouglas_RDMA_with_PM.pdf
RDMA + DDIO
RDMA + non-allocating write
Peer 2 Peer: Bypassing CPU + SW bottleneck
● NVM HW: expose BAR address
● March 16: RFC patchset for DAX allowing DMA to I/O mem
● CCIX fabric
● Use case:
○ Pre-process in data path
○ Avoid RAM buffer (HMM style)
○ SW only fetches what is necessary
Future Hyperscale Architecture
NVMe gravy train for 3-5 years
Transition to Pmem optimised apps and
Natural evolution of Ethernet-connected drives => fabric-connected PMEM
Durable Array of Wimpy Nodes
Direct PMEM
Low power High perf K/V storage
Use pluggable front end
Links
Drivers specs: http://pmem.io/documents/
NVDIMM Namespace Specification: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
NVDIMM Drivers Writers Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
NVDIMM DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
Linux docs: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/nvdimm/nvdimm.txt
Qemu : https://github.com/xiaogr/qemu/tree/nvdimm-v9
Seabios : http://www.seabios.org/pipermail/seabios/2015-September/009770.html
Libraries:
https://github.com/pmem/nvml/
https://github.com/perone/pynvm
http://opennvm.github.io/index.html
https://github.com/spdk/spdk
Project :
PMFS : https://github.com/linux-pmfs/pmfs
NOVA: NOn-Volatile memory Accelerated log-structured file system https://github.com/NVSL/NOVA
PCMSIM : https://code.google.com/p/pcmsim/
Patch :
Donard: A PCIe Peer-2-Peer kernel patch https://github.com/sbates130272/donard
Adds struct page backing for I/O memory, which allows I/O memory to be used as a DMA target: http://www.spinics.net/lists/linux-mm/msg103990.html
Thank You!
Questions ?
NVDIMM block I/O path
