NANDFS: A RAM-Constrained Flash File system



  1. NANDFS: A Flexible Flash File System for RAM-Constrained Systems
     Aviad Zuck, Ohad Barzilay and Sivan Toledo
  2. Overview
     - Introduction and motivation
     - Flash properties
     - Big ideas
     - Going into details
     - Software engineering, tests and experiments
     - General flash issues
  3. Flash is Everywhere
  4. - Resilient to vibrations and extreme conditions
     - Up to 100 times faster than rotating disks for random access
  5. What's missing?
  6. - Sequential-access performance
     - And price:
       "Today, consumer-grade SSD costs from $2 to $3.45 per gigabyte, hard drives about $0.38 per gigabyte..." (27.8.2008)
  7. NOR Flash vs. NAND Flash
     - NOR: looser constraints; mostly reads; few MB
     - NAND: more constrained; bulk storage; many MB/GB
  8. Two Ways of Flash Management
     - Conventional file system over a translation layer: NTFS, FAT, ext3, ...
     - Native flash file system: JFFS, YAFFS, NANDFS, ...
  9. So Why NANDFS?
  10. (figure)
  11. NANDFS Also Has:
     - File locking
     - Transactions
     - Competitive performance and graceful degradation
  12. How is it Done, in a Nutshell?
     - The explanation does not fit in a nutshell:
       - Complex data structures
       - A new garbage-collection mechanism
       - And much more... let's elaborate
  13. Flash Properties
  14. - Flash memory is divided into pages: 0.5 KB, 2 KB or 4 KB
     - A page consists of data and metadata areas: 16 B of metadata for every 512 B of data
     - Pages are arranged in units: 32/64/128 pages per unit
     - The metadata holds a unit-validity indicator, an ECC code and file-system metadata
  15. (figure)
  16. Erasures & Programming
     - Page bits are initialized to 1s
     - Writing (programming) clears bits (1 to 0)
     - Bits are set back only by erasing an entire unit (an "erase unit")
     - An erase unit has limited endurance
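These program/erase semantics can be modeled in a few lines of C. This is a toy model, not the NANDFS driver: the page size is illustrative, and programming is modeled as AND-ing data into the page, which captures exactly the "bits only go from 1 to 0" rule.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 16   /* toy page size; real NAND pages are 512 B - 4 KB */

/* Erasing a page sets every bit to 1 (all bytes become 0xFF). */
static void erase_page(uint8_t page[PAGE_SIZE]) {
    memset(page, 0xFF, PAGE_SIZE);
}

/* Programming can only clear bits (1 -> 0), never set them back,
 * which is what AND-ing the new data into the page models. */
static void program_page(uint8_t page[PAGE_SIZE],
                         const uint8_t data[PAGE_SIZE]) {
    for (int i = 0; i < PAGE_SIZE; i++)
        page[i] &= data[i];
}
```

Programming an already-programmed page can therefore only clear more bits; restoring a 1 requires erasing the whole unit, which is why log-structured designs avoid overwrite-in-place.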
  17. The Design of NANDFS: The "Big" Ideas
  18. Log-structured design
     - Overwrite-in-place is not permitted on flash
     - Caching avoids the rippling effect of pointer updates
  19. Modular Flash File System
     - Modularity is good, but we need a block-device API designed for flash
     - We call our "block device" the sequencing layer
     - Traditional block device: READ, WRITE, (TRIM)
     - NANDFS "block device": READ, ALLOCATE-AND-WRITE, TRIM
  20. High-level Design
     - A two-layer structure:
       - File system layer: a transactional file system with a Unix-like file structure
       - Sequencing layer: manages the allocation of immutable page-sized chunks of data; assists in crash recovery and atomicity
  21. The Sequencing Layer
  22. - Divides the flash into fixed-size physical units called slots
     - Slots are assigned to segments, logical units of the same size
     - Each segment maps to one matching physical slot, except a single "active segment", which is mapped to two slots
  23. Block access
     - The segment-to-slot mapping table is kept in RAM
     - A block is referenced by a logical handle: <segment_id, offset_in_segment>
     - Address translation example: logical address <0,2> maps to physical address 8
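A minimal sketch of this address translation, assuming a toy geometry (4 pages per slot) and a made-up segment-to-slot table; the real sizes and table contents are configuration-dependent and the map is rebuilt from slot headers at mount time.

```c
#include <assert.h>
#include <stdint.h>

/* Toy geometry and a hypothetical mapping, for illustration only. */
#define NUM_SEGMENTS   4
#define PAGES_PER_SLOT 4

static const uint32_t seg_to_slot[NUM_SEGMENTS] = { 2, 0, 3, 1 };

/* Translate a logical handle <segment_id, offset_in_segment> into a
 * physical page address: look the slot up in RAM, then add the offset. */
static uint32_t logical_to_physical(uint32_t segment_id, uint32_t offset) {
    return seg_to_slot[segment_id] * PAGES_PER_SLOT + offset;
}
```

With this table, logical address <0,2> lands in slot 2 and translates to physical page 10; only the small per-segment table lives in RAM, not a full per-page map.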
  24. Where's the Innovation?
     - Logical address mapping is not a new idea: Logical Disk (1993), YAFFS, JFFS and more
     - Many FTLs use some form of logical address mapping:
       - Full mapping: expensive in RAM
       - Coarse-grained mapping: fragmentation, performance degradation, costly merges
  25. * DFTL: A Flash Translation Layer Employing Demand-based Selective Caching of Page-level Address Mappings (2009)
  26. The difference in NANDFS:
     - NANDFS uses coarse-grained mapping, not full mapping
     - Less RAM is spent on page mapping (more RAM flexibility)
     - Garbage is collected while preserving the validity of pointers to non-obsolete blocks
     - This design is appropriate for flash, not for magnetic disks
  27. Block allocation
     - NANDFS is log-structured
     - New blocks are allocated sequentially from the active segment
     - In a log-structured system, blocks are never rewritten in place
     - File pointer structures must be updated to reflect the new location of the data
  28. Garbage collection
     - TRIM: pages holding obsolete data are marked with a special "obsolete flag"
     - The sequencing layer maintains a counter of obsolete pages for every segment
     - Problem: erase units contain a mixture of valid and obsolete pages, so we cannot simply erase entire EUs
     - Solution: garbage collection is performed together with allocation
  29. - Reclamation unit = segment: the sequencing layer chooses a segment to reclaim and allocates it a second (fresh) slot
     - Obsolete pages are reclaimed while non-obsolete pages are copied over
     - Note: logical addresses are preserved, although the physical translation changes
  30. - Finally, when the new slot is full, the old slot is erased and can be used to reclaim another segment
     - We choose the segment with the highest obsolete counter as the new "active segment"
     - This approach would not go down well on rotating disks: too many seek operations
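The copy step of reclamation can be sketched as follows. This is a simplification with invented names: valid pages are copied to the same offsets in the fresh slot, so <segment, offset> handles stay valid without touching the translation table; the real sequencing layer also interleaves new allocations into the fresh slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGES_PER_SLOT 4   /* toy value */

/* A toy slot: one byte of payload per page plus a validity flag
 * (programmed and not yet marked obsolete). */
struct slot {
    uint8_t data[PAGES_PER_SLOT];
    bool    valid[PAGES_PER_SLOT];
};

/* Copy the non-obsolete pages of the old slot into the fresh slot at
 * the same offsets, then erase the old slot.  Returns the number of
 * pages copied; the freed difference is the reclaimed space. */
static int reclaim_segment(struct slot *old_slot, struct slot *fresh) {
    int copied = 0;
    for (int i = 0; i < PAGES_PER_SLOT; i++) {
        if (old_slot->valid[i]) {
            fresh->data[i]  = old_slot->data[i];
            fresh->valid[i] = true;
            copied++;
        }
    }
    memset(old_slot->data, 0xFF, sizeof old_slot->data);  /* erase */
    memset(old_slot->valid, 0, sizeof old_slot->valid);
    return copied;
}
```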
  31. Sequencing Layer Recovery
     - When a new slot is allocated to a segment, a segment header is written in the slot's first page
     - The header contains:
       - An incremented segment sequence number
       - The segment number
       - The segment type
       - A checkpoint (further details later)
  32. - On mounting, the header of every slot is read
     - The segment-to-slot map can be reconstructed using only the data from the headers
     - Other systems (with complete mappings) need to scan the entire flash
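A sketch of this mount-time reconstruction, with a hypothetical header layout (the field names and sizes are assumptions, not the actual NANDFS on-flash format): one header is read per slot, and if two slots claim the same segment, as can happen after a crash during reclamation, the slot with the higher sequence number wins.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SLOTS    4
#define NUM_SEGMENTS 4
#define NO_SLOT      0xFFFFFFFFu

/* Hypothetical per-slot header, read from the slot's first page. */
struct slot_header {
    uint32_t segment;    /* which segment this slot holds */
    uint32_t sequence;   /* bumped each time the segment gets a new slot */
};

/* Rebuild the segment-to-slot map from one header per slot. */
static void rebuild_map(const struct slot_header hdrs[NUM_SLOTS],
                        uint32_t seg_to_slot[NUM_SEGMENTS]) {
    uint32_t best_seq[NUM_SEGMENTS];
    for (int s = 0; s < NUM_SEGMENTS; s++) {
        seg_to_slot[s] = NO_SLOT;
        best_seq[s]    = 0;
    }
    for (uint32_t i = 0; i < NUM_SLOTS; i++) {
        uint32_t seg = hdrs[i].segment;
        if (seg_to_slot[seg] == NO_SLOT || hdrs[i].sequence > best_seq[seg]) {
            seg_to_slot[seg] = i;        /* newer slot wins */
            best_seq[seg]    = hdrs[i].sequence;
        }
    }
}
```

Mount cost is one header read per slot, independent of how many data pages the flash holds, which is why no full-flash scan is needed.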
  33. Bad EU Management
     - Every flash memory chip contains some bad EUs
     - As a result, some slots contain more valid EUs than others
     - Solution: some slots are set aside as a bank of reserve EUs
  34. Brief Summary
  35. The Design of NANDFS: More Ideas
  36. Wear Leveling
     - Writes and erases should be spread evenly over all EUs
     - Problem: some slots may be reclaimed rarely
     - Solution: perform a periodic randomized wear-leveling process:
       - Choose a random slot and copy it to a fresh slot
       - Incurs only a low overhead
       - Guarantees near-optimal expected endurance (Ben-Aroya and Toledo, 2006)
     - The technique is widely used (YAFFS, JFFS)
  37. Transactions
     - File system operations are atomic and transactional
     - Marking pages as obsolete is not straightforward
     - Simple transaction example: a block rewrite
       - After rewriting, the old data block should be marked obsolete
       - But if we mark it and the transaction aborts before completing, the old data should remain valid
       - Once marked obsolete, the mark cannot be undone
  38. - Solution: perform the valid-to-obsolete transition (VOT) only AFTER the transaction commits
     - VOT records are written to flash in dedicated pages
     - On commit, the VOT records are used to mark pages as obsolete
     - A linked list of all pages written by each transaction is maintained on flash
     - A pointer to the last page written by each transaction is kept in RAM
     - On abort, all pages written by the transaction are marked obsolete
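The deferred valid-to-obsolete transition can be illustrated with a toy transaction object. All names here are hypothetical, the records live in a RAM array instead of dedicated flash pages, and the abort-side cleanup of newly written pages is omitted; the point shown is only that nothing is marked obsolete before commit, so an abort leaves the old data valid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_VOTS  8
#define NUM_PAGES 16

static bool obsolete[NUM_PAGES];   /* per-page obsolete flags */

/* Toy transaction: VOT records are buffered, not applied immediately. */
struct txn {
    uint32_t vot[MAX_VOTS];  /* pages to obsolete, but only on commit */
    int      nvot;
};

static void txn_record_vot(struct txn *t, uint32_t page) {
    t->vot[t->nvot++] = page;        /* defer: do NOT mark anything yet */
}

/* Only at commit do the deferred records take effect. */
static void txn_commit(struct txn *t) {
    for (int i = 0; i < t->nvot; i++)
        obsolete[t->vot[i]] = true;
    t->nvot = 0;
}

/* On abort the records are dropped, so the old pages stay valid. */
static void txn_abort(struct txn *t) {
    t->nvot = 0;
}
```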
  39. (figure)
  40. Checkpoints
     - A snapshot of the system state
     - Ensures a return to a stable state following a crash
     - A checkpoint is written:
       - As part of a segment header
       - Whenever a transaction commits
     - Structure:
       - The obsolete-counters array
       - A pointer to the last-written block of the committed transaction
       - Pointers to the last-written blocks of all ongoing transactions
       - A pointer to the root inode
  41. Simple Example
  42. Finding the Last Checkpoint
     - At any given time there is only one valid checkpoint on flash
     - On mounting:
       - Locate the last allocated slot (using its sequence number)
       - Perform a binary search to see whether a later checkpoint exists in the slot
       - Abort all other transactions
       - Truncate all pages written after the checkpoint
       - Finish the transaction that was committed
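The binary search works because pages within a slot are written strictly in order, so the slot is a run of written pages followed by blank (all-1s) ones. A sketch, with a boolean `written` array standing in for reading a page and checking whether it is blank:

```c
#include <assert.h>
#include <stdbool.h>

#define PAGES_PER_SLOT 16   /* toy value */

/* The slot looks like [written ... written, blank ... blank].
 * Binary-search for the last written page; returns -1 if the slot
 * is entirely blank.  Costs O(log n) page reads instead of n. */
static int last_written_page(const bool written[PAGES_PER_SLOT]) {
    int lo = 0, hi = PAGES_PER_SLOT - 1, last = -1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (written[mid]) {
            last = mid;       /* written: boundary is at or after mid */
            lo = mid + 1;
        } else {
            hi = mid - 1;     /* blank: boundary is before mid */
        }
    }
    return last;
}
```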
  43. File System Layer
  44. - Files are represented by inode trees:
       - File metadata
       - Direct pointers to data pages
       - Indirect pointers, etc.
     - All pointers are logical pointers
     - Regular files are not permitted to be sparse
  45. - The root file and directory inodes may be sparse
     - A hole is indicated by a special flag
  46. The Root File
     - An array of inodes
  47. - When a file is deleted, a page-sized hole is created
     - When creating a file, a hole can easily be located
     - If no hole exists, a new inode is allocated by extending the root file
  48. Directory Structure
     - A directory is an array of directory entries:
       - Inode number
       - Length
       - UTF-8 file name
     - A directory entry is at most 256 bytes long
     - Direntries are packed into chunks without gaps
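A hypothetical packing routine for such a directory entry. The slide only specifies the three fields and the 256-byte limit; the field order, the 2-byte inode number and the 1-byte length are assumptions for illustration, not the NANDFS on-flash layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Pack one directory entry into a buffer and return its total length.
 * Assumed layout: 2-byte inode number, 1-byte total length, then the
 * UTF-8 name with no NUL terminator and no padding, so consecutive
 * entries can be packed into a chunk without gaps. */
static size_t pack_direntry(uint8_t *buf, uint16_t ino, const char *name) {
    size_t  nlen = strlen(name);
    uint8_t len  = (uint8_t)(3 + nlen);   /* sketch: total must stay <= 256 */
    memcpy(buf, &ino, sizeof ino);        /* inode number */
    buf[2] = len;                         /* total entry length */
    memcpy(buf + 3, name, nlen);          /* file name, packed right after */
    return len;
}
```

Because each entry records its own length, a sequential lookup can hop from entry to entry inside a chunk without any per-entry padding or index.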
  49. - If a chunk is smaller than (page size - direntry size), the directory contains a "hole"
     - Allocating a new direntry requires finding such a hole
     - Direntry lookup is sequential
  50. System Calls
     - Most system calls (creat, unlink, mkdir, ...) are atomic transactions
     - A transaction that handles a write() commits only on close():
       - System calls that modify a single file can be bundled into a single transaction
       - For example, 5 consecutive calls to write() followed by close() on a single file are treated as one transaction
     - Commit overhead = (actual physical page writes) / (minimum possible page writes) ~ 1
  51. Running Out of Space
     - A log-structured file system writes even when the user deletes files
     - When the flash is full, the system may have too few free pages to delete a file
     - Solution: maintain the number of free + obsolete pages
     - If the next write would lower this number below a threshold, abort transactions until there are enough free pages
     - The threshold is a function of:
       - c = the number of blocks written when a direntry is deleted
       - the maximum number of pages in a file
       - the number of re-do records per page
  52. Software Engineering
  53. Coding
     - Code was written with the intention of being "humanly readable":
         (&(transactions[tid]))->f_type = 0x02
       vs.
         TRANSACTION_SET_FTYPE(tid, FTYPE_FILE)
     - Embedded development:
       - External libraries are not an option (math, string)
       - More macros, fewer functions (to conserve stack)
       - No on-target debugging, so a good simulator is needed
       - Various gcc toolchains: cygwin, debian, arm-gcc
  54. Incremental Development
     - High-level and low-level design preceded development (3 weeks)
     - Code was written bottom-up:
       - Flash driver -> sequencing layer -> file system layer
       - The caching layer was added later; challenging...
       - 1 year (comparable to commercial code)
     - Test-driven development, "by hand" (no test libraries)
  55. My Own Boss: Lessons
     - Time frames
     - Outsider notes: feedback, "pairing"
  56. Experiments & Tests
  57. Testing
     - An extensive test suite:
       - Integration and performance tests
       - Extensive crash tests
       - A large set of unit tests for every function
     - Integrated into eCos
     - Tests and integration were verified on an actual 32 MB flash
  58. Experiments
     - Simulated 1 GB flash
     - Configuration: 512 slots, 8 reserved for bad-block replacement
     - 6 open files and 8 file descriptors
     - 3 concurrent transactions
  59. Workload
  60. Slot Partitioning
  61. Mounting
     - YAFFS mounting time: 2.7 s (at 80% utilization)
  62. Endurance
     - Repeatedly rewrite a small file while the file system contains a static 205 MB file
  63. (Some) Challenges in Flash
  64. Single-Level vs. Multi-Level Cell
     - Flash is classified by the number of bits stored in a single cell
     - SLC (1 bit): smaller capacity, faster
     - MLC (2-4 bits): cheaper, write-constrained, errors from partial writes, more error-prone, less endurance
  65. Parallelism
     - * Picture from N. Agrawal, V. Prabhakaran, T. Wobber (2008)
  66. A simple example of utilizing parallelism
     - * J. Seol, H. Shim, J. Kim and S. Maeng (2009)
  67. Enterprise Storage
     - * S.W. Lee, B. Moon, C. Park, J.M. Kim, S.W. Kim (2008)
     - Sequential disk bandwidth is still 2-3 times higher than flash
     - Flash read/write latency is lower than disk latency by more than an order of magnitude
     - This improves the throughput of transaction processing: useful for database servers
  68. The End
     - Thank you!