
XPDDS18: NVDIMM Overview - George Dunlap, Citrix


NVDIMM is a standard for allowing non-volatile memory to be exposed as normal RAM, which can be directly mapped to guests. This simple concept has the potential to dramatically change the way software is written, but it also has a number of surprising problems to solve. Furthermore, this area is plagued with incomplete specifications and confusing terminology.

This talk will attempt to give an overview of NVDIMMs from an operating system perspective: what the terminology means, how they are discovered and partitioned, issues relating to filesystems, a brief description of the functionality available in Linux, and so on. It will then describe the various issues and design choices that must be addressed in order to allow Xen systems to use NVDIMMs effectively.

Published in: Technology

  1. 1. NVDIMM Overview Technology, Linux, and Xen
  2. 2. What are NVDIMMs? • A standard for allowing NVRAM to be exposed as normal memory • Potential to dramatically change the way software is written
  3. 3. But.. • They have a number of surprising problems to solve • Particularly with respect to Xen • Incomplete specifications and confusing terminology
  4. 4. Goals • Give an overview of NVDIMM concepts, terminology, and architecture • Introduce some of the issues they bring up with respect to operating systems • Introduce some of the issues faced with respect to Xen • Propose a “sketch” of a design solution
  5. 5. Current Technology • Currently available: NVDIMM-N • DRAM with flash + capacitor backup • Strictly more expensive than a normal DIMM with the same storage capacity
  6. 6. Coming soon: Terabytes?
  7. 7. Terminology • Physical DIMM • DPA: DIMM Physical Address • SPA: System Physical Address (Diagram labels: Page Tables, SPA, DPA 0 to DPA $N)
  8. 8. PMEM: RAM-like access • DRAM-like access • 1-1 Mapping between DPA and SPA • Interleaved across DIMMs • Similar to RAID-0 • Better performance • Worse reliability (Diagram labels: SPA, DPA 0 to DPA $N)
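
The RAID-0 comparison on the PMEM slide can be made concrete with a small sketch. It assumes a plain round-robin interleave that starts at DPA 0 on every DIMM. The function name and the 256-byte line size are illustrative assumptions, not values from the talk, and real interleave layouts are described by the firmware and may be more involved.

```python
def spa_offset_to_dpa(spa_offset: int, n_dimms: int, line_size: int = 256):
    """Map an offset inside an interleaved SPA range to (dimm index, DPA).

    Assumption: simple round-robin striping, like RAID-0; line_size is
    an illustrative value only.
    """
    line = spa_offset // line_size       # index of the interleave line
    dimm = line % n_dimms                # lines rotate across the DIMMs
    # Each DIMM holds every n_dimms-th line, packed back to back.
    dpa = (line // n_dimms) * line_size + spa_offset % line_size
    return dimm, dpa


for offset in (0, 256, 1024, 1030):
    print(offset, spa_offset_to_dpa(offset, n_dimms=4))
```

Consecutive lines land on different DIMMs, which is where the bandwidth gain comes from, and also why losing one DIMM damages the whole range.
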
  9. 9. PBLK: Disk-like access • Disk-like access • Control region • 8k Data “window” • One window per NVDIMM device • Never interleaved • Useful for software RAID • Also useful when SPA space < NVRAM size • 48 address bits (256TiB) • 46 physical bits (64TiB) (Diagram labels: SPA, DPA 0 to DPA $N)
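
To illustrate the control-region-plus-window idea from the PBLK slide, here is a toy model in which a read selects a window position and then copies out of the window. Everything here is hypothetical (the class `ToyPblkDimm`, its `set_window` "register", the alignment rule); it only mimics the shape of the mechanism and does not reflect the actual register layout.

```python
WINDOW_SIZE = 8 * 1024  # the 8k data window mentioned on the slide


class ToyPblkDimm:
    """Toy NVDIMM reachable only through a control value and a data window."""

    def __init__(self, size: int):
        self.media = bytearray(size)  # stands in for the DIMM's DPA space
        self.window_base = 0          # DPA currently exposed by the window

    def set_window(self, dpa: int) -> None:
        # "Control region" write: expose the window-aligned chunk holding dpa.
        self.window_base = dpa - dpa % WINDOW_SIZE

    def window(self) -> memoryview:
        # "Data window": a view of the currently selected chunk of media.
        end = self.window_base + WINDOW_SIZE
        return memoryview(self.media)[self.window_base:end]


def pblk_read(dimm: ToyPblkDimm, dpa: int, length: int) -> bytes:
    """Read length bytes starting at dpa, moving the window as needed."""
    out = bytearray()
    while length > 0:
        dimm.set_window(dpa)
        start = dpa - dimm.window_base
        chunk = bytes(dimm.window()[start:start + length])
        if not chunk:
            raise ValueError("read past end of media")
        out += chunk
        dpa += len(chunk)
        length -= len(chunk)
    return bytes(out)
```

Because only one window's worth of address space is needed per device, this style of access still works when the SPA space is smaller than the NVRAM, at the cost of an extra step per access.
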
  10. 10. How is this mapping set up? • Firmware sets up the mapping at boot • May be modifiable using BIOS / vendor-specific tool • Exposes information via ACPI
  11. 11. ACPI: Devices • NVDIMM Root Device • Expect one per system • NVDIMM Device • One per NVDIMM • Information about size, manufacture, &c
  12. 12. ACPI: NFIT Table • NVDIMM Firmware Interface Table • PMEM information: • SPA ranges • Interleave sets • PBLK information • Control regions • Data window regions
  13. 13. Practical issues for using NVDIMM • How to partition up the NVRAM and share it between operating systems • Knowing the correct way to access each area (PMEM / PBLK) • Detecting when interleaving / layout has changed
  14. 14. Dividing things up: Namespaces • “Namespace”: Think partition • PMEM namespace and interleave sets • PBLK namespaces • “Type UUID” to define how it’s used • think DOS partitions: “Linux ext2/3/4”, “Linux Swap”, “NTFS”, &c
  15. 15. How do we store namespace information? • Reading via PMEM and PBLK will give you different results • PMEM Interleave sets may change across reboots • So we can’t store it inside the visible NVRAM
  16. 16. Label area: Per-NVDIMM storage • One “label area” per NVDIMM device (aka physical DIMM) • Label describes a single contiguous DPA region • Namespaces made out of labels • Accessed via ACPI AML methods • Pure read / write (Diagram labels: NVRAM, Label Areas)
  17. 17. How an OS Determines Namespaces • Read ACPI NFIT to determine • How many NVDIMMs you have • Where PMEM is mapped • Read label area for each NVDIMM • Piece together the namespaces described • Double-check the interleave sets described by the labels against the interleave sets from the NFIT table • Access PMEM regions by offsets in SPA sets (from NFIT table) • Access PBLK regions by programming control / data windows (from NFIT table) (Diagram labels: SPA, Label Areas)
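
The "piece together the namespace" step can be sketched as grouping labels by namespace and ordering the pieces. The `Label` fields below are hypothetical stand-ins chosen for illustration; real labels carry more metadata, and the cross-check against the NFIT interleave sets is not shown.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Label:
    # Hypothetical field names, for illustration only.
    namespace_uuid: str
    dimm: int        # NVDIMM whose label area this label was read from
    position: int    # order of this piece within the namespace
    dpa_start: int   # start of the contiguous DPA region
    size: int        # length of the region in bytes


def assemble_namespaces(labels):
    """Group labels by namespace UUID and order the pieces by position."""
    grouped = defaultdict(list)
    for label in labels:
        grouped[label.namespace_uuid].append(label)
    namespaces = {}
    for uuid, pieces in grouped.items():
        pieces.sort(key=lambda piece: piece.position)
        namespaces[uuid] = {
            "size": sum(piece.size for piece in pieces),
            "dimms": [piece.dimm for piece in pieces],
        }
    return namespaces
```

An OS would then compare the DIMM order found here with what the firmware reports, and refuse to use a namespace whose layout no longer matches.
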
  18. 18. Key points • “Namespace”: Partition • “Label area”: Partition table • NVDIMM devices / SPA ranges / etc defined in ACPI static NFIT table • Label area accessed via ACPI AML methods
  19. 19. NVDIMMs in Linux
  20. 20. ndctl • Create / destroy namespaces • Four modes • raw • sector • fsdax • devdax
  21. 21. The ideal interface • Guest processes map a normal file, and magically get permanent storage
  22. 22. The obvious solution • Make a namespace into a block device • Put a normal partition inside the namespace • Put a normal filesystem inside a partition • Have mmap() map to the system memory directly
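
From the application's side, the "map a normal file" interface looks like any other memory mapping. The sketch below uses a temporary file on whatever filesystem is at hand, so it goes through the page cache; the assumption is that on a DAX-capable filesystem backed by PMEM the same calls would reach the persistent memory directly. The helper names are made up for this example.

```python
import mmap
import os
import tempfile


def write_through_mapping(path: str, payload: bytes) -> None:
    """Store payload at the start of the file via a shared memory mapping."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mapping:
            mapping[:len(payload)] = payload   # plain memory stores
            mapping.flush()                    # ask for the data to be made durable


def demo() -> bytes:
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, mmap.PAGESIZE)        # the file needs a size before mapping
        os.close(fd)
        write_through_mapping(path, b"hello pmem")
        with open(path, "rb") as f:
            return f.read(10)
    finally:
        os.remove(path)
```

The following slides are about why making this simple picture safe turns out to be hard.
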
  23. 23. Issue: Sector write atomicity • Disk sector writes are atomic: all-or-nothing • memcpy() can be interrupted in the middle • Block Translation Table (BTT): an abstraction that guarantees write atomicity • ‘sector mode’ • But this means mmap() needs a separate buffer
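
One way to picture how a translation table gives all-or-nothing sector writes: new data goes to a spare block, and a single map update makes it visible. This is a toy model under that assumption, with a hypothetical `ToyBtt` class and a simulated crash flag; the real BTT has considerably more on-media structure than shown here.

```python
class ToyBtt:
    """Toy block translation table: logical sectors map to physical blocks."""

    def __init__(self, n_sectors: int, sector_size: int = 512):
        self.sector_size = sector_size
        # One spare physical block beyond the logical sectors.
        self.blocks = [bytes(sector_size) for _ in range(n_sectors + 1)]
        self.map = list(range(n_sectors))  # logical sector -> physical block
        self.free = n_sectors              # the currently unused block

    def write(self, sector: int, data: bytes, crash_before_commit: bool = False):
        if len(data) != self.sector_size:
            raise ValueError("data must be exactly one sector")
        # Step 1: fill the free block. The old contents stay reachable
        # through the map for the whole duration of the copy.
        self.blocks[self.free] = data
        if crash_before_commit:
            return  # simulated power loss: the map was never touched
        # Step 2: commit by switching one map entry; the old block
        # becomes the new free block.
        self.map[sector], self.free = self.free, self.map[sector]

    def read(self, sector: int) -> bytes:
        return self.blocks[self.map[sector]]
```

The indirection is also why a direct mmap() of such a device is a problem: the location of a sector's data changes on every write.
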
  24. 24. Issue: Page struct • To keep track of userspace mappings, Linux needs a ‘page struct’ • 64 bytes per 4k page • 1 TiB of PMEM requires 16GiB of page array • Solution: Use PMEM to store a ‘page struct’ • Use a superblock to designate areas of the namespace to be used for this purpose (allocated on namespace creation)
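
Taking the slide's inputs at face value (64 bytes of page struct per 4k page), the page-array overhead for a given amount of PMEM can be checked with a few lines. By this arithmetic, 1 TiB of PMEM needs 16 GiB of page array, roughly 1.6% of the capacity. The function name is just for this example.

```python
PAGE_SIZE = 4 * 1024     # 4k pages, as on the slide
STRUCT_PAGE_SIZE = 64    # bytes of page struct per page, as on the slide

TIB = 1 << 40
GIB = 1 << 30


def page_array_bytes(pmem_bytes: int) -> int:
    """Bytes of page structs needed to describe pmem_bytes of memory."""
    return (pmem_bytes // PAGE_SIZE) * STRUCT_PAGE_SIZE


print(page_array_bytes(TIB) / GIB)  # 16.0
```

At multi-terabyte capacities this no longer fits comfortably in ordinary RAM, which motivates storing the page structs inside the namespace itself.
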
  25. 25. Issue: Filesystems and block location • Filesystems want to be able to move blocks around • Difficult interactions with the write() system call (DMA)
  26. 26. Issue: Interactions with the page cache
  27. 27. Mode summary: Raw • Block mode access to full SPA range • No support for namespaces • Therefore, no UUIDs / superblocks; page structs must be stored in main memory • Supports DAX
  28. 28. Mode summary: Sector • Block mode with BTT for sector atomicity • Supports namespaces • No DAX / direct mmap() support
  29. 29. Mode summary: fsdax • Block mode access to a namespace • Supports page structs in main memory, or in the namespace • Must be chosen at time of namespace creation • Supports filesystems with DAX • But there’s some question about the safety of this
  30. 30. Mode summary: devdax • Character device access to namespace • Does not support filesystems • page structs must be contained within the namespace • Supports mmap() • “No interaction with kernel page cache” • Character devices don’t have a “size”, so you have to remember the size yourself
  31. 31. Summary • Four ways of mapping with different advantages and disadvantages • Seems clear that Linux is still figuring out how best to use PMEM
  32. 32. NVDIMM, Xen, and dom0
  33. 33. Issue: Xen and AML • Reading the label areas can only be done via AML • Xen cannot do AML • ACPI spec requires that only a single entity execute AML • That must be domain 0
  34. 34. Issue: struct page • In order to track mapping to guests, the hypervisor needs a struct page • “frametable” • 32 or 40 bits
  35. 35. Issue: RAM vs MMIO • Dom0 is free to map any SPA • RAM: Page reference counts taken • MMIO: No reference counts taken • If we “promote” NVDIMM SPA ranges from MMIO to RAM, existing dom0 mappings won’t have a reference count
  36. 36. PMEM promotion, continued • Three basic options: • 1a: Trust that dom0 has unmapped before promotion • 1b: Check to make sure that dom0 unmapped before promotion • 2: Automatically take appropriate reference counts on promotion • Checking / converting seem about equivalent: • A: Keep track of all unpromoted PMEM regions in dom0 pagetables • B: Brute force search of dom0 pagetables (PV) or p2m table (PVH)
  37. 37. Solution: Where to get PMEM for frame table • “Manually” set-aside namespaces • Namespace with custom Xen UUID • Allocated from within a namespace (like Linux)
  38. 38. Solution: Promoting NVRAM • Hypercall to allow dom0 to “promote” specific PMEM regions (probably full namespaces) • Allow dom0 to specify two types of “scratch” PMEM for frame tables • PMEM tied to specific other PMEM regions • PMEM generally available for any PMEM region
  39. 39. Proposal: Promoting PMEM mappings • Keep a list of all unpromoted PMEM mappings, automatically increase reference counts on promotion • Alternately: • For PV: Require unmapping without checking • For PVH: Require dom0 to map PMEM into a single contiguous p2m range
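
The first option in the proposal (track unpromoted mappings, then take reference counts at promotion time) can be sketched with a toy model. The class and method names are invented for illustration and bear no relation to actual Xen interfaces; frames are plain integers and dom0 is the only domain.

```python
from collections import Counter


class ToyHypervisor:
    """Toy model of promoting PMEM frames from MMIO-like to RAM-like."""

    def __init__(self):
        self.promoted = set()             # frames that have frame-table entries
        self.refcount = Counter()         # reference counts for promoted frames
        self.unpromoted_maps = Counter()  # dom0 mappings of unpromoted frames

    def dom0_map(self, frame: int) -> None:
        if frame in self.promoted:
            self.refcount[frame] += 1         # RAM: reference count taken
        else:
            self.unpromoted_maps[frame] += 1  # MMIO: only remembered

    def dom0_unmap(self, frame: int) -> None:
        if frame in self.promoted:
            self.refcount[frame] -= 1
        else:
            self.unpromoted_maps[frame] -= 1

    def promote(self, frames) -> None:
        for frame in frames:
            self.promoted.add(frame)
            # Convert remembered mappings into real reference counts.
            self.refcount[frame] += self.unpromoted_maps.pop(frame, 0)
```

The cost of this approach is the bookkeeping on every dom0 mapping of PMEM before promotion, which is what the alternatives on the slide try to avoid.
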
  40. 40. NVDIMM, Xen, and domUs (Virtual NVDIMMs)
  41. 41. Issue: Guest compatibility • PVH is our “next generation” guest type • Requiring QEMU is a last resort
  42. 42. PMEM only, no interleaving • Don’t bother with virtual PBLK • Each vNVDIMM will be exposed as a single non-interleaved chunk with its own label area
  43. 43. Issue: Guest label area • Each NVDIMM needs two distinct chunks • To begin with, specify two files / devices • Eventually: Both contained in a single file (w/ metadata?)
  44. 44. Issue: How to read label areas • Expose via ACPI? • Map label areas into “secret” section of p2m • Implement AML which reads and writes this secret section • Expose via PV interface?
  45. 45. Questions?
