NVDIMM is a standard for allowing non-volatile memory to be exposed as normal RAM, which can be directly mapped into guests. This simple concept has the potential to dramatically change the way software is written, but it also has a number of surprising problems to solve. Furthermore, this area is plagued with incomplete specifications and confusing terminology.
This talk will attempt to give an overview of NVDIMMs from an operating system perspective: what the terminology means, how they are discovered and partitioned, issues relating to filesystems, a brief description of the functionality available in Linux, and so on. It will then describe the various issues and design choices a Xen system has to make in order to use NVDIMMs effectively.
2. What are NVDIMMs?
• A standard for allowing NVRAM to be exposed as normal memory
• Potential to dramatically change the way software is written
3. But…
• They have a number of surprising problems to solve
• Particularly with respect to Xen
• Incomplete specifications and confusing terminology
4. Goals
• Give an overview of NVDIMM concepts, terminology, and architecture
• Introduce some of the issues they raise with respect to operating systems
• Introduce some of the issues they raise with respect to Xen
• Propose a “sketch” of a design solution
5. Current Technology
• Currently available: NVDIMM-N
• DRAM with flash + capacitor backup
• Strictly more expensive than a normal DIMM with the same storage capacity
8. PMEM: RAM-like access
[Diagram: an SPA range mapped 1-1 onto DPA 0 … DPA $N across the DIMMs]
• DRAM-like access
• 1-1 mapping between DPA and SPA
• Interleaved across DIMMs
• Similar to RAID-0
• Better performance
• Worse reliability
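The RAID-0 analogy can be made concrete with a toy model of a two-way interleave set. The 256-byte interleave granularity here is an assumption for illustration; real interleave geometries come from the NFIT.

```python
# Toy model of a 2-way PMEM interleave set. The 256-byte interleave
# line size is a hypothetical parameter, chosen only for illustration.
LINE = 256  # bytes per interleave line (assumed)

def spa_to_dpa(spa_offset, ways=2, line=LINE):
    """Map an offset into the interleaved SPA range to (dimm, dpa)."""
    line_no, byte = divmod(spa_offset, line)
    dimm = line_no % ways              # stripes rotate across DIMMs
    dpa = (line_no // ways) * line + byte
    return dimm, dpa

# Consecutive interleave lines land on alternating DIMMs, like RAID-0:
assert spa_to_dpa(0) == (0, 0)
assert spa_to_dpa(256) == (1, 0)
assert spa_to_dpa(512) == (0, 256)
```

Because consecutive lines hit different DIMMs, sequential bandwidth improves, but a single failed DIMM corrupts every namespace in the set, hence "better performance, worse reliability".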
9. PBLK: Disk-like access
• Control region
• 8k data "window"
• One window per NVDIMM device
• Never interleaved
• Useful for software RAID
• Also useful when SPA space < NVRAM size
• 48 address bits (256TiB) vs 46 physical bits (64TiB)
[Diagram: control region and data window in SPA, reaching DPA 0 … DPA $N]
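The control-region / data-window mechanism can be sketched as a simulation. The layout below is purely illustrative (real PBLK programming goes through MMIO registers described by the NFIT), but it shows how a small fixed window reaches an arbitrarily large DPA space.

```python
# Toy simulation of PBLK access: to reach an arbitrary DPA, software
# programs a control register with the target address, then moves data
# through a small fixed window. Layout here is hypothetical.
WINDOW = 8192  # the 8k data "window" from the slide

class PBlkDevice:
    def __init__(self, size):
        self.nvram = bytearray(size)   # the DIMM's DPA space
        self.ctrl = 0                  # control region: current window base

    def set_window(self, dpa):
        self.ctrl = dpa

    def read_window(self):
        return bytes(self.nvram[self.ctrl:self.ctrl + WINDOW])

    def write_window(self, data):
        self.nvram[self.ctrl:self.ctrl + len(data)] = data

dev = PBlkDevice(64 * 1024)
dev.set_window(16384)           # point the window at DPA 16384
dev.write_window(b"hello")
dev.set_window(0)
assert dev.read_window()[:5] != b"hello"   # window now shows DPA 0
dev.set_window(16384)
assert dev.read_window()[:5] == b"hello"
```

This is why PBLK works even when SPA space is smaller than the NVRAM: only the window and control region consume SPA, not the whole device.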
10. How is this mapping set up?
• Firmware sets up the mapping at boot
• May be modifiable using BIOS / vendor-specific tool
• Exposes information via ACPI
11. ACPI: Devices
• NVDIMM Root Device
• Expect one per system
• NVDIMM Device
• One per NVDIMM
• Information about size, manufacturer, &c
12. ACPI: NFIT Table
• NVDIMM Firmware Interface Table
• PMEM information:
• SPA ranges
• Interleave sets
• PBLK information
• Control regions
• Data window regions
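As a sketch of what consuming the NFIT looks like, the code below unpacks one "System Physical Address Range" structure (type 0). The field layout follows my reading of the ACPI 6.0 NFIT; check the spec before relying on it.

```python
import struct

# Minimal sketch: unpack one NFIT SPA Range structure (type 0).
# Fields: type, length, index, flags, reserved, proximity domain,
# address-range-type GUID, SPA base, length, memory-mapping attribute.
SPA_FMT = "<HHHHII16sQQQ"   # little-endian, 56 bytes total

def parse_spa_range(buf):
    (stype, length, index, flags, _rsvd, proximity,
     type_guid, base, size, attr) = struct.unpack(SPA_FMT, buf[:56])
    assert stype == 0, "not a SPA Range structure"
    return {"index": index, "base": base, "size": size}

# Build a fake 56-byte structure describing 1 GiB of PMEM at SPA 4 GiB:
fake = struct.pack(SPA_FMT, 0, 56, 1, 0, 0, 0, b"\0" * 16,
                   4 << 30, 1 << 30, 0)
r = parse_spa_range(fake)
assert r["base"] == 4 << 30 and r["size"] == 1 << 30
```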
13. Practical issues for using NVDIMM
• How to partition up the NVRAM and share it between operating systems
• Knowing the correct way to access each area (PMEM / PBLK)
• Detecting when interleaving / layout has changed
14. Dividing things up: Namespaces
• “Namespace”: Think partition
• PMEM namespace and interleave sets
• PBLK namespaces
• “Type UUID” to define how it’s used
• Think DOS partition types: "Linux ext2/3/4", "Linux Swap", "NTFS", &c
15. How do we store namespace information?
• Reading via PMEM and PBLK will give you different results
• PMEM Interleave sets may change across reboots
• So we can’t store it inside the visible NVRAM
16. Label area: Per-NVDIMM storage
• One "label area" per NVDIMM device (i.e., per physical DIMM)
• A label describes a single contiguous DPA region
• Namespaces are made out of labels
• Accessed via ACPI AML methods
• Pure read / write
[Diagram: a label area within each DIMM's NVRAM]
17. How an OS determines namespaces
• Read the ACPI NFIT to determine:
• How many NVDIMMs you have
• Where PMEM is mapped
• Read the label area of each NVDIMM
• Piece together the namespaces described
• Double-check the label-described interleave sets against the interleave sets from the NFIT table
• Access PMEM regions by offsets into SPA ranges (from the NFIT table)
• Access PBLK regions by programming control / data windows (from the NFIT table)
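The "piece together" step can be sketched as grouping per-DIMM labels by namespace UUID. The tuple fields below are illustrative, not the on-media label format defined by the namespace specification.

```python
# Toy sketch of assembling namespaces from per-DIMM labels.
# Field names and values are illustrative only.
from collections import defaultdict

labels = [  # (namespace_uuid, dimm, dpa, size)
    ("ns-A", 0, 0x0000, 0x1000),
    ("ns-A", 1, 0x0000, 0x1000),   # ns-A interleaved across DIMMs 0 and 1
    ("ns-B", 0, 0x1000, 0x2000),   # ns-B lives on DIMM 0 only
]

namespaces = defaultdict(list)
for uuid, dimm, dpa, size in labels:
    namespaces[uuid].append((dimm, dpa, size))

total = {u: sum(sz for _, _, sz in parts) for u, parts in namespaces.items()}
assert total == {"ns-A": 0x2000, "ns-B": 0x2000}
```

A real implementation would then cross-check the DIMM membership of "ns-A" against the interleave sets the NFIT advertises.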
19. Key points
• “Namespace”: Partition
• “Label area”: Partition table
• NVDIMM devices / SPA ranges / etc defined in ACPI static NFIT table
• Label area accessed via ACPI AML methods
21. ndctl
• Linux's userspace tool for managing NVDIMM namespaces
• Create / destroy namespaces
• Four modes:
• raw
• sector
• fsdax
• devdax
22. The ideal interface
• Guest processes map a normal file, and magically get permanent storage
23. The obvious solution
• Make a namespace into a block device
• Put a normal partition inside the namespace
• Put a normal filesystem inside a partition
• Have mmap() map to the system memory directly
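The programming model this enables can be shown with a short sketch. On a DAX filesystem backed by PMEM the stores below would go straight to NVRAM; here we merely demonstrate the interface against an ordinary temporary file.

```python
import mmap, os, tempfile

# Sketch of the "ideal interface": map a file, then store into it with
# plain memory writes instead of write() system calls.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 4096)
with mmap.mmap(fd, 4096) as m:
    m[0:5] = b"hello"   # a plain store; no write() syscall involved
    m.flush()           # on DAX: ensure the stores reach persistence
os.close(fd)
contents = open(path, "rb").read(5)
os.unlink(path)
assert contents == b"hello"
```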
24. Issue: Sector write atomicity
• Disk sector writes are atomic: all-or-nothing
• memcpy() can be interrupted in the middle
• Block Translation Table (BTT): an abstraction that guarantees write atomicity
• ‘sector mode’
• But this means mmap() needs a separate buffer
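The BTT idea can be sketched in a few lines: each write goes to a free block, and the logical-to-physical map entry is switched only after the data is fully written, so an interrupted memcpy() never tears an old sector. This is a toy, not the real BTT format.

```python
# Toy sketch of the BTT concept: indirect all sector access through a
# logical-to-physical map, and repoint the map only after a full write.
class ToyBTT:
    def __init__(self, nsectors, sector=512):
        self.blocks = [bytes(sector) for _ in range(nsectors + 1)]
        self.map = list(range(nsectors))   # logical -> physical
        self.free = nsectors               # one spare block

    def write(self, lba, data):
        self.blocks[self.free] = data      # 1. fill the free block
        # 2. atomically repoint the map entry; the old block becomes free
        self.map[lba], self.free = self.free, self.map[lba]

    def read(self, lba):
        return self.blocks[self.map[lba]]

btt = ToyBTT(4)
btt.write(0, b"x" * 512)
assert btt.read(0) == b"x" * 512
```

The indirection is also why mmap() can no longer hand out the storage directly: the physical location of a sector changes on every write.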
25. Issue: Page struct
• To keep track of userspace mappings, Linux needs a 'page struct' for each page
• 64 bytes per 4k page
• 1 TiB of PMEM requires 16 GiB of page array
• Solution: Use PMEM itself to store the 'page structs'
• Use a superblock to designate areas of the namespace to be used for this purpose (allocated at namespace creation)
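The arithmetic is worth spelling out: 64 bytes of metadata per 4 KiB page means the page array is 1/64th of the PMEM size.

```python
# Worked example: 64 bytes of 'page struct' per 4 KiB page of PMEM.
TiB, GiB, KiB = 2**40, 2**30, 2**10

pages = (1 * TiB) // (4 * KiB)   # pages in 1 TiB of PMEM
overhead = pages * 64            # bytes of page array needed
assert overhead == 16 * GiB      # 1/64 of the PMEM size
```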
26. Issue: Filesystems and block location
• Filesystems want to be able to move blocks around
• Difficult interactions with the write() system call and DMA
28. Mode summary: Raw
• Block mode access to full SPA range
• No support for namespaces
• Therefore, no UUIDs / superblocks; page structs must be stored in main memory
• Supports DAX
29. Mode summary: Sector
• Block mode with BTT for sector atomicity
• Supports namespaces
• No DAX / direct mmap() support
30. Mode summary: fsdax
• Block mode access to a namespace
• Supports page structs in main memory, or in the namespace
• Must be chosen at time of namespace creation
• Supports filesystems with DAX
• But there’s some question about the safety of this
31. Mode summary: devdax
• Character device access to namespace
• Does not support filesystems
• page structs must be contained within the namespace
• Supports mmap()
• “No interaction with kernel page cache”
• Character devices don’t have a “size”, so you have to remember the namespace size yourself
32. Summary
• Four ways of mapping with different advantages and disadvantages
• Seems clear that Linux is still figuring out how best to use PMEM
34. Issue: Xen and AML
• Reading the label areas can only be done via AML
• Xen cannot do AML
• The ACPI spec requires that only a single entity execute AML
• That must be domain 0
35. Issue: struct page
• In order to track mappings to guests, the hypervisor needs a struct page for each page
• The “frametable”
• 32 or 40 bytes per page
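The same sizing arithmetic as on the Linux side shows why this is a problem for Xen too. The 32-byte entry size below is an assumption; the actual frametable entry size varies by Xen build and architecture.

```python
# Sketch of frametable cost: one entry per 4 KiB page of PMEM.
# Entry size is assumed to be 32 bytes here (it varies by build).
def frametable_bytes(pmem_bytes, entry=32, page=4096):
    return (pmem_bytes // page) * entry

assert frametable_bytes(2**40) == 8 * 2**30   # 1 TiB PMEM -> 8 GiB
```

Gigabytes of hypervisor memory per terabyte of PMEM is what motivates the later proposal to store the frametable in PMEM itself.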
36. Issue: RAM vs MMIO
• Dom0 is free to map any SPA
• RAM: Page reference counts taken
• MMIO: No reference counts taken
• If we “promote” NVDIMM SPA ranges from MMIO to RAM, existing dom0 mappings won’t have a reference count
37. PMEM promotion, continued
• Three basic options:
• 1a: Trust that dom0 has unmapped before promotion
• 1b: Check to make sure that dom0 unmapped before promotion
• 2: Automatically take appropriate reference counts on promotion
• Checking / converting seem about equivalent:
• A: Keep track of all unpromoted PMEM regions in dom0 pagetables
• B: Brute force search of dom0 pagetables (PV) or p2m table (PVH)
38. Solution: Where to get PMEM for frame table
• “Manually” set-aside namespaces
• Namespace with custom Xen UUID
• Allocated from within a namespace (like Linux)
39. Solution: Promoting NVRAM
• Hypercall to allow dom0 to “promote” specific PMEM regions (probably full namespaces)
• Allow dom0 to specify two types of “scratch” PMEM for frame tables:
• PMEM tied to specific other PMEM regions
• PMEM generally available for any PMEM region
40. Proposal: Promoting PMEM mappings
• Keep a list of all unpromoted PMEM mappings, automatically increase reference counts on promotion
• Alternately:
• For PV: Require unmapping without checking
• For PVH: Require dom0 to map PMEM into a single contiguous p2m range
43. PMEM only, no interleaving
• Don’t bother with virtual PBLK
• Each vNVDIMM will be exposed as a single non-interleaved chunk with its own label area
44. Issue: Guest label area
• Each guest NVDIMM needs two distinct chunks: the namespace data and its label area
• To begin with, specify two files / devices
• Eventually: Both contained in a single file (w/ metadata?)
45. Issue: How to read label areas
• Expose via ACPI?
• Map label areas into “secret” section of p2m
• Implement AML which reads and writes this secret section
• Expose via PV interface?