NVDIMM Overview
Technology, Linux, and Xen
What are NVDIMMs?
• A standard for allowing NVRAM to be exposed as normal memory

• Potential to dramatically change the way software is written
But...
• They have a number of surprising problems to solve

• Particularly with respect to Xen

• Incomplete specifications and confusing terminology
Goals
• Give an overview of NVDIMM concepts, terminology, and architecture

• Introduce some of the issues they bring up for operating systems

• Introduce some of the issues faced by Xen

• Propose a “sketch” of a design solution
Current Technology
• Currently available: NVDIMM-N

• DRAM with flash + capacitor backup

• Strictly more expensive than a normal DIMM with the same storage capacity
Coming soon: Terabytes?
Terminology
• Physical DIMM

• DPA: DIMM Physical Address

• SPA: System Physical Address

[Diagram: page-table-like translation mapping a System Physical Address (SPA) range onto each DIMM's physical addresses, DPA 0 through DPA $N]
PMEM: RAM-like access
• DRAM-like access

• 1-1 mapping between DPA and SPA

• Interleaved across DIMMs

• Similar to RAID-0

• Better performance

• Worse reliability
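As a rough illustration of how interleaving works (the real mapping is described per set by the NFIT interleave structures and may be more involved), a fixed-line-size interleave sends consecutive SPA lines to alternating DIMMs, RAID-0 style:

    #include <stdint.h>

    /* Illustrative only: a simple fixed-stride interleave, assuming
     * 'ways' DIMMs and a 'line' granularity (e.g. 4 KiB).  Real
     * interleave sets come from the NFIT and may be more complex. */
    struct ileave { uint64_t spa_base; uint64_t line; unsigned ways; };

    void spa_to_dpa(const struct ileave *set, uint64_t spa,
                    unsigned *dimm, uint64_t *dpa)
    {
        uint64_t off     = spa - set->spa_base;   /* offset into the set   */
        uint64_t line_no = off / set->line;       /* which interleave line */
        uint64_t in_line = off % set->line;       /* offset within a line  */

        *dimm = line_no % set->ways;              /* round-robin across DIMMs */
        *dpa  = (line_no / set->ways) * set->line + in_line;
    }

With ways = 2 and a 4 KiB line, SPA offsets 0-4K land on DIMM 0 at DPA 0, 4-8K on DIMM 1 at DPA 0, 8-12K back on DIMM 0 at DPA 4K, and so on.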
PBLK: Disk-like access
• Disk-like access

• Control region

• 8 KiB data “window”

• One window per NVDIMM device

• Never interleaved

• Useful for software RAID

• Also useful when SPA space < NVRAM size

• 48 address bits (256 TiB)

• 46 physical bits (64 TiB)
[Diagram: an 8 KiB PBLK data window in the SPA range used to reach arbitrary addresses on a DIMM, DPA 0 through DPA $N]
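A PBLK transfer moves data through the window rather than a direct mapping. The sketch below shows the general flow only; the register layout and bit definitions are hypothetical, with the real formats coming from the NFIT control-region entries and the vendor:

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical PBLK region descriptor: a mapped control region plus
     * an 8 KiB data window, both located via the NFIT.  Field names and
     * bit layouts are made up for illustration. */
    struct pblk_region {
        volatile uint64_t *cmd;      /* control register: target DPA + flags */
        volatile uint64_t *status;   /* completion / error status            */
        void              *window;   /* 8 KiB data aperture in SPA space     */
    };

    #define PBLK_WINDOW_SIZE 8192
    #define PBLK_CMD_GO      (1ULL << 63)   /* hypothetical "start" bit */

    int pblk_read(struct pblk_region *r, uint64_t dpa, void *buf, size_t len)
    {
        if (len > PBLK_WINDOW_SIZE)
            return -1;

        /* Point the window at the desired DPA and start the transfer. */
        *r->cmd = dpa | PBLK_CMD_GO;

        /* Poll for completion (a real driver would also check error
         * bits and fence appropriately). */
        while (*r->status & PBLK_CMD_GO)
            ;

        /* Copy the data out of the aperture. */
        memcpy(buf, r->window, len);
        return 0;
    }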
How is this mapping set up?
• Firmware sets up the mapping at boot

• May be modifiable using BIOS / vendor-specific tool

• Exposes information via ACPI
ACPI: Devices
• NVDIMM Root Device

• Expect one per system

• NVDIMM Device

• One per NVDIMM

• Information about size, manufacturer, &c
ACPI: NFIT Table
• NVDIMM Firmware Interface Table

• PMEM information:

• SPA ranges

• Interleave sets

• PBLK information

• Control regions

• Data window regions
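Each NFIT entry is a sub-structure beginning with a 16-bit type and length. As an example, the SPA Range Structure (type 0) looks roughly like this, abridged from the ACPI 6.x definition (see the spec, or Linux's drivers/acpi/nfit/nfit.h, for the authoritative layout):

    #include <stdint.h>

    /* Common header on every NFIT sub-structure. */
    struct nfit_header {
        uint16_t type;       /* 0 = SPA range, 1 = memdev-to-SPA map,
                                2 = interleave, 4 = control region,
                                5 = block data window, ...            */
        uint16_t length;
    };

    /* Type 0: System Physical Address Range Structure (abridged). */
    struct nfit_spa_range {
        struct nfit_header header;
        uint16_t range_index;         /* referenced by memdev structures    */
        uint16_t flags;
        uint32_t reserved;
        uint32_t proximity_domain;    /* NUMA node                          */
        uint8_t  range_type_guid[16]; /* PMEM vs control/data-window region */
        uint64_t base;                /* SPA range base                     */
        uint64_t length;              /* SPA range length in bytes          */
        uint64_t memory_mapping_attr; /* cacheability attributes            */
    } __attribute__((packed));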
Practical issues for using NVDIMM
• How to partition up the NVRAM and share it between operating systems

• Knowing the correct way to access each area (PMEM / PBLK)

• Detecting when interleaving / layout has changed
Dividing things up: Namespaces
• “Namespace”: Think partition

• PMEM namespace and interleave sets

• PBLK namespaces 

• “Type UUID” to define how it’s used

• think DOS partitions: “Linux ext2/3/4”, “Linux Swap”, “NTFS”, &c
How do we store namespace information?
• Reading via PMEM and PBLK will give you different results

• PMEM Interleave sets may change across reboots

• So we can’t store it inside the visible NVRAM
Label area: Per-NVDIMM storage
• One “label area” per NVDIMM device (aka physical DIMM)

• A label describes a single contiguous DPA region

• Namespaces made out of labels

• Accessed via ACPI AML methods

• Pure read / write
[Diagram: each NVDIMM holds a small label area alongside its NVRAM]
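For reference, a single label describing one DPA region looks roughly like the sketch below, abridged from the v1.1 layout in the NVDIMM Namespace Specification (later revisions add fields such as a type GUID, and the real format also has checksums and slot bookkeeping):

    #include <stdint.h>

    /* One label slot, describing a single contiguous DPA range on this
     * DIMM (abridged).  A namespace is reassembled from one label per
     * member DIMM. */
    struct namespace_label {
        uint8_t  uuid[16];     /* which namespace this label belongs to   */
        uint8_t  name[64];     /* optional human-readable name            */
        uint32_t flags;
        uint16_t nlabel;       /* number of labels in the namespace       */
        uint16_t position;     /* this DIMM's position in the interleave  */
        uint64_t isetcookie;   /* interleave-set cookie, cross-checked
                                  against the NFIT                        */
        uint64_t lbasize;      /* logical block size (PBLK namespaces)    */
        uint64_t dpa;          /* start of the region on this DIMM        */
        uint64_t rawsize;      /* length of the region on this DIMM       */
        uint32_t slot;         /* index of this label slot                */
        uint32_t unused;
    } __attribute__((packed));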
How an OS Determines Namespaces
• Read the ACPI NFIT to determine:

• How many NVDIMMs you have

• Where PMEM is mapped

• Read the label area of each NVDIMM

• Piece together the namespaces described

• Cross-check the interleave sets described by the labels against the interleave sets from the NFIT table

• Access PMEM regions by offsets into SPA ranges (from the NFIT table)

• Access PBLK regions by programming control / data windows (from the NFIT table)

[Diagram: namespaces assembled from per-DIMM label areas and NFIT SPA ranges]
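Schematically, the whole enumeration looks like the sketch below; the types and helper functions are placeholders standing in for NFIT parsing and the ACPI label-read plumbing, not real kernel APIs:

    #include <stdbool.h>
    #include <stdint.h>

    /* Schematic only: all of these stand in for real plumbing. */
    struct nfit_info   { int num_dimms; /* ...SPA ranges, interleave sets... */ };
    struct label_area  { int nslots;    /* ...raw label storage...           */ };
    struct namespace_label;             /* as sketched above */

    struct nfit_info        nfit_parse(void);            /* static ACPI table    */
    struct label_area       read_label_area(int dimm);   /* via ACPI AML methods */
    struct namespace_label *label_at(struct label_area *la, int slot);
    bool                    label_valid(const struct namespace_label *l);
    void                    assemble_namespace(const struct nfit_info *nfit,
                                               int dimm,
                                               const struct namespace_label *l);

    void enumerate_namespaces(void)
    {
        struct nfit_info nfit = nfit_parse();              /* 1. static NFIT      */

        for (int d = 0; d < nfit.num_dimms; d++) {
            struct label_area la = read_label_area(d);     /* 2. per-DIMM labels  */

            for (int s = 0; s < la.nslots; s++) {
                struct namespace_label *l = label_at(&la, s);
                if (!label_valid(l))
                    continue;
                /* 3. Group labels with the same UUID across DIMMs,
                 *    checking the interleave-set cookie against the
                 *    NFIT interleave sets. */
                assemble_namespace(&nfit, d, l);
            }
        }
        /* 4. PMEM namespaces are then accessed at offsets into their SPA
         *    range; PBLK namespaces through control / data windows. */
    }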
Key points
• “Namespace”: Partition

• “Label area”: Partition table

• NVDIMM devices / SPA ranges / etc defined in ACPI static NFIT table

• Label area accessed via ACPI AML methods
NVDIMMs in Linux
ndctl
• Create / destroy namespaces

• Four modes 

• raw

• sector

• fsdax

• devdax
The ideal interface
• Guest processes map a normal file, and magically get permanent storage
The obvious solution
• Make a namespace into a block device

• Put a normal partition inside the namespace

• Put a normal filesystem inside a partition

• Have mmap() map to the system memory directly
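A minimal userspace sketch of that model, assuming a file on a filesystem mounted with -o dax on top of a PMEM namespace (the path is an example, not a convention):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define LEN 4096

    int main(void)
    {
        /* Assumes /mnt/pmem is a DAX-mounted filesystem on a PMEM
         * namespace; adjust the path for your system. */
        int fd = open("/mnt/pmem/data", O_CREAT | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, LEN) < 0) { perror("open/ftruncate"); return 1; }

        char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Stores go straight to the NVDIMM's SPA range... */
        strcpy(p, "hello, persistent world");

        /* ...but still need flushing out of the CPU caches. */
        if (msync(p, LEN, MS_SYNC) < 0) { perror("msync"); return 1; }

        munmap(p, LEN);
        close(fd);
        return 0;
    }

msync() is the portable way to make the stores durable; with MAP_SYNC mappings (where supported), userspace cache-flush instructions, e.g. via libpmem, can be used instead.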
Issue: Sector write atomicity
• Disk sector writes are atomic: all-or-nothing

• memcpy() can be interrupted in the middle

• Block Translation Table (BTT): an abstraction that guarantees write atomicity

• ‘sector mode’

• But this means mmap() needs a separate buffer
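The BTT avoids torn sectors with an indirection table: write the new data into a free block first, then atomically re-point the logical block at it, so a crash observes either the old or the new sector in full. A schematic sketch of the idea (not the on-media BTT format, which also has free lists, a flog, and the necessary flushes and fences):

    #include <stdint.h>
    #include <string.h>

    #define SECTOR 4096

    /* Schematic BTT-like indirection: map[lba] names the physical block
     * currently holding logical block 'lba'. */
    struct btt {
        uint32_t *map;        /* logical block -> physical block        */
        uint8_t  *media;      /* the underlying PMEM namespace           */
        uint32_t  free_block; /* next free physical block (simplified)   */
    };

    void btt_write(struct btt *b, uint32_t lba, const void *buf)
    {
        uint32_t newblk = b->free_block;

        /* 1. Copy the whole sector into a block nothing points at yet.
         *    A crash here leaves the old data fully intact. */
        memcpy(b->media + (size_t)newblk * SECTOR, buf, SECTOR);
        /* (a real BTT flushes and fences the data before step 2) */

        /* 2. Atomically re-point the logical block at the new copy. */
        uint32_t oldblk = __atomic_exchange_n(&b->map[lba], newblk,
                                              __ATOMIC_SEQ_CST);

        /* 3. The old physical block becomes the new free block. */
        b->free_block = oldblk;
    }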
Issue: Page struct
• To keep track of userspace mappings, Linux needs a ‘page struct’

• 64 bytes per 4k page

• 1 TiB of PMEM requires 16 GiB of page array (1 TiB / 4 KiB pages × 64 bytes)

• Solution: Use PMEM to store a ‘page struct’

• Use a superblock to designate areas of the namespace to be used for this purpose (allocated on namespace creation)
Issue: Filesystems and block location
• Filesystems want to be able to move blocks around

• Difficult interactions with the write() system call and DMA
Issue: Interactions with the page cache
Mode summary: Raw
• Block mode access to full SPA range

• No support for namespaces

• Therefore, no UUIDs / superblocks; page structs must be stored in main memory

• Supports DAX
Mode summary: Sector
• Block mode with BTT for sector atomicity

• Supports namespaces

• No DAX / direct mmap() support
Mode summary: fsdax
• Block mode access to a namespace

• Supports page structs in main memory, or in the namespace

• Must be chosen at time of namespace creation

• Supports filesystems with DAX

• But there’s some question about the safety of this
Mode summary: devdax
• Character device access to namespace

• Does not support filesystems

• page structs must be contained within the namespace

• Supports mmap()

• “No interaction with kernel page cache”

• Character devices don’t have a “size”, so you have to keep track of it yourself
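In practice the size has to be looked up out of band, e.g. from sysfs. The path below is an assumption (it has moved between /sys/class/dax and /sys/bus/dax across kernel versions), so check your kernel:

    #include <stdio.h>

    /* Read the size of a device-DAX instance from sysfs.  The path is
     * an assumption; adjust for your kernel. */
    long long devdax_size(const char *dev)   /* e.g. "dax0.0" */
    {
        char path[256];
        long long size = -1;

        snprintf(path, sizeof(path), "/sys/class/dax/%s/size", dev);
        FILE *f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%lld", &size) != 1)
                size = -1;
            fclose(f);
        }
        return size;
    }

    int main(void)
    {
        printf("dax0.0 size: %lld bytes\n", devdax_size("dax0.0"));
        return 0;
    }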
Summary
• Four ways of mapping with different advantages and disadvantages

• Seems clear that Linux is still figuring out how best to use PMEM
NVDIMM, Xen, and dom0
Issue: Xen and AML
• Reading the label areas can only be done via AML

• Xen cannot do AML

• The ACPI spec assumes a single entity (the OSPM) executes AML

• That entity must be domain 0
Issue: struct page
• In order to track mappings to guests, the hypervisor needs a struct page_info for each page

• The “frame table”

• 32 or 40 bytes per page
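The cost is easy to estimate: at the per-page sizes quoted above, a 4 KiB-granularity frame table costs roughly 8-10 GiB per TiB of NVRAM, which is why the later slides propose carving it out of PMEM itself:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Per-page tracking cost for a frame table; 32 and 40 bytes are
         * the figures from the slide, not derived from Xen headers. */
        const uint64_t page  = 4096;
        const uint64_t nvram = 1ULL << 40;              /* 1 TiB of PMEM */

        for (uint64_t entry = 32; entry <= 40; entry += 8) {
            uint64_t bytes = (nvram / page) * entry;
            printf("%lu-byte entries: %.1f GiB of frame table per TiB\n",
                   (unsigned long)entry, bytes / (double)(1ULL << 30));
        }
        return 0;
    }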
Issue: RAM vs MMIO
• Dom0 is free to map any SPA

• RAM: Page reference counts taken

• MMIO: No reference counts taken

• If we “promote” NVDIMM SPA ranges from MMIO to RAM, existing dom0 mappings won’t have a reference count
PMEM promotion, continued
• Three basic options:

• 1a: Trust that dom0 has unmapped before promotion

• 1b: Check to make sure that dom0 unmapped before promotion

• 2: Automatically take appropriate reference counts on promotion

• Checking / converting seem about equivalent:

• A: Keep track of all unpromoted PMEM regions in dom0 pagetables

• B: Brute force search of dom0 pagetables (PV) or p2m table (PVH)
Solution: Where to get PMEM for frame table
• “Manually” set-aside namespaces

• Namespace with custom Xen UUID

• Allocated from within a namespace (like Linux)
Solution: Promoting NVRAM
• Hypercall to allow dom0 to “promote” specific PMEM regions (probably full namespaces)

• Allow dom0 to specify two types of “scratch” PMEM for frame tables

• PMEM tied to specific other PMEM regions

• PMEM generally available for any PMEM region
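A sketch of what that interface might look like; this hypercall does not exist in Xen today and every name and field below is illustrative only:

    #include <stdint.h>

    /* Purely hypothetical sketch of the proposed interface. */
    #define PMEM_SCRATCH_DEDICATED 0  /* frame-table space tied to one region   */
    #define PMEM_SCRATCH_GENERAL   1  /* frame-table space usable for any region */

    struct xen_pmem_promote {
        uint64_t spa_base;      /* start of the PMEM region (namespace)   */
        uint64_t nr_pages;      /* length in 4 KiB pages                  */
        uint64_t scratch_base;  /* PMEM set aside to hold the frame table */
        uint64_t scratch_pages;
        uint32_t scratch_type;  /* PMEM_SCRATCH_DEDICATED or _GENERAL     */
    };

    /* dom0 would parse the NFIT and labels, pick the namespaces to hand
     * over, and issue something like:
     *
     *     struct xen_pmem_promote arg = { ... };
     *     rc = xen_memory_op(XENMEM_pmem_promote, &arg);  // hypothetical sub-op
     *
     * after which Xen would build frame-table entries for the region in
     * the scratch PMEM and start reference-counting mappings of it. */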
Proposal: Promoting PMEM mappings
• Keep a list of all unpromoted PMEM mappings, and automatically increase reference counts on promotion

• Alternately:

• For PV: Require unmapping without checking

• For PVH: Require dom0 to map PMEM into a single contiguous p2m range
NVDIMM, Xen, and domUs
(Virtual NVDIMMs)
Issue: Guest compatibility
• PVH is our “next generation” guest type

• Requiring QEMU is a last resort
PMEM only, no interleaving
• Don’t bother with virtual PBLK

• Each vNVDIMM will be exposed as a single non-interleaved chunk with its own label area
Issue: Guest label area
• Each vNVDIMM needs two distinct chunks: the data and its label area

• To begin with, specify two files / devices

• Eventually: Both contained in a single file (w/ metadata?)
Issue: How to read label areas
• Expose via ACPI?

• Map label areas into “secret” section of p2m

• Implement AML which reads and writes this secret section

• Expose via PV interface?
Questions?

XPDDS18: NVDIMM Overview - George Dunlap, Citrix
