NANDFS: A RAM-Constrained Flash File System

Transcript

  • 1. NANDFS: A Flexible Flash File System for RAM-Constrained Systems Aviad Zuck, Ohad Barzilay and Sivan Toledo
  • 2. Overview
    • Introduction + motivation
    • Flash properties
    • Big Ideas
    • Going into details
    • Software engineering, tests and experiments
    • General flash issues
  • 3. Flash is Everywhere
  • 4.
    • Resilient to vibrations and extreme conditions
    • Up to 100 times faster (random access) than rotating disks
  • 5. What’s missing?
  • 6.
    • Sequential access performance
    • And price:
      • “ Today, consumer-grade SSD costs from $2 to $3.45 per gigabyte, hard drives about $0.38 per gigabyte…”
      • Computerworld.com, 27.8.2008*
    • *http://www.computerworld.com/s/article/print/9112065/Solid_state_disk_lackluster_for_laptops_PCs
  • 7. NOR Flash vs. NAND Flash
    • NOR flash: looser constraints, mostly reads, a few MB
    • NAND flash: more constrained, used for storage, many MB/GB
  • 8. Two Ways of Flash Management
    • Conventional file systems on top of an FTL: NTFS, FAT, ext3, …
    • Native flash file systems: JFFS, YAFFS, NANDFS, …
  • 9. So Why NANDFS?
  • 10.
  • 11. NANDFS Also Has:
    • File locking
    • Transactions
    • Competitive performance and graceful degradation
  • 12. How is it Done, in a Nutshell?
    • Explanation does not fit in a nutshell
    • Complex data structures
    • New garbage collection mechanism
    • And much more…
    • Let’s elaborate
  • 13. Flash Properties
  • 14.
    • Flash memory is divided into pages – 0.5KB, 2KB or 4KB
    • A page consists of data and metadata areas – 16B of metadata for every 512B of data
    • Pages are arranged in units – 32/64/128 pages per unit
    • The metadata contains a unit validity indicator, an ECC code and file system metadata (sketched below)
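    (As a concrete illustration, a minimal C sketch of such a page layout, assuming a 2KB page with a 64B spare area, i.e. 16B of metadata per 512B of data; the field names and sizes are illustrative, not NANDFS's actual definitions:)

        #include <stdint.h>

        #define SUBPAGES_PER_PAGE 4            /* 2KB page = 4 x 512B sub-pages */

        /* 16B of metadata accompanying each 512B of data (illustrative split) */
        typedef struct {
            uint8_t bad_unit_marker;           /* unit validity indicator */
            uint8_t ecc[6];                    /* error-correcting code */
            uint8_t fs_metadata[9];            /* file system metadata */
        } spare_t;

        typedef struct {
            uint8_t data[SUBPAGES_PER_PAGE][512];  /* data area */
            spare_t spare[SUBPAGES_PER_PAGE];      /* metadata ("spare") area */
        } nand_page_t;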
  • 15.
  • 16. Erasures & Programming
    • Page bits are initialized to 1’s
    • Writing clears bits (1 to 0)
    • Bits are set back to 1 only by erasing an entire unit (“erase unit”)
    • Each erase unit has limited endurance (see the sketch below)
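    (These rules can be captured in a few lines of C; a sketch, assuming byte-granular access: a program operation may only clear bits, and only an erase sets them back:)

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>
        #include <string.h>

        #define ERASED_BYTE 0xFF   /* a freshly erased byte is all 1's */

        /* Programming can only turn 1-bits into 0-bits: new_val can be
         * written over old_val iff it sets no bit that is already 0. */
        static bool can_program(uint8_t old_val, uint8_t new_val) {
            return (new_val & (uint8_t)~old_val) == 0;
        }

        /* Setting bits back to 1 requires erasing the whole erase unit. */
        static void erase_unit(uint8_t *unit, size_t unit_bytes) {
            memset(unit, ERASED_BYTE, unit_bytes);
        }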
  • 17. The Design of NANDFS - The “Big” Ideas
  • 18. Log-structured design
    • Overwrite-in-place is not permitted in flash
    • Caching avoids the rippling effect (rewriting a page would otherwise force rewrites of every page that points to it)
  • 19. Modular Flash File System
    • Modularity is good. But…
    • We need a block device API designed for flash
    • We call our “block device” the sequencing layer (its API is sketched below)
    Traditional block device: READ, WRITE, (TRIM)
    NANDFS “block device”: READ, ALLOCATE-AND-WRITE, TRIM
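    (A hedged C sketch of this flash-oriented “block device” interface; the slide specifies only the three operations, so the names and signatures below are illustrative:)

        #include <stdint.h>

        /* A block is referenced by a logical handle (see the next slides). */
        typedef struct { uint16_t segment_id; uint16_t offset; } logical_addr_t;

        /* Read the page at a logical address into buf. */
        int seq_read(logical_addr_t addr, void *buf);

        /* There is no WRITE to an existing address: instead, allocate the
         * next free page, write buf to it, and return its logical address. */
        int seq_allocate_and_write(const void *buf, logical_addr_t *addr);

        /* Mark the page at a logical address obsolete (a GC candidate). */
        int seq_trim(logical_addr_t addr);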
  • 20. High-level Design
    • A 2-layer structure:
      • File System Layer - a transactional file system with a Unix-like file structure
      • Sequencing Layer – manages the allocation of immutable page-sized chunks of data. Assists in crash recovery and atomicity
  • 21. The Sequencing Layer
  • 22.
    • Divides the flash into fixed-size physical units called slots
    • Slots are assigned to segments - logical units of the same size
    • Each segment maps to one matching physical slot, except a single “active segment”, which is mapped to two slots
  • 23. Block access
    • Segment ~> Slot mapping table in RAM
    • Block is referenced by a logical handle
        • < segment_id , offset_in_segment >
    • Address translation
      • Example: logical address <0,2> ~> physical address 8 (see the sketch below)
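    (The translation itself is a one-liner over the in-RAM table; a sketch, with an illustrative pages-per-slot constant - the slide’s <0,2> ~> 8 example corresponds to one particular geometry:)

        #include <stdint.h>

        #define NUM_SEGMENTS   512
        #define PAGES_PER_SLOT 64     /* illustrative; fixed by flash geometry */

        /* Segment ~> slot mapping table, kept in RAM. */
        static uint16_t slot_of_segment[NUM_SEGMENTS];

        /* Translate a logical <segment_id, offset> handle to a physical page. */
        static uint32_t to_physical(uint16_t segment_id, uint16_t offset) {
            return (uint32_t)slot_of_segment[segment_id] * PAGES_PER_SLOT + offset;
        }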
  • 24. Where’s the innovation?
    • Logical address mapping not a new idea:
      • Logical Disk (1993), YAFFS, JFFS, and more
    • Many FTLs use some logical address mapping
      • Full mapping ~> expensive
      • Coarse-grained mapping
        • Fragmentation, performance degradation
        • Costly merges
  • 25. * DFTL: A Flash Translation Layer Employing Demand-based Selective Caching of Page-level Address Mappings (2009)
  • 26.
    • The difference in NANDFS
      • NANDFS uses coarse-grained mapping, not full mapping
      • Less RAM for page mapping (more RAM flexibility)
      • Collect garbage while preserving validity of pointers to non-obsolete blocks
    • Appropriate for flash, not for magnetic disks
  • 27. Block allocation
    • NANDFS is log-structured
    • New blocks are allocated sequentially from the active segment
    • In a log-structured system, blocks are never re-written in place
    • File pointer structures need to be updated to reflect the new location of the data (see the allocator sketch below)
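    (A sketch of the allocator’s core, assuming the active segment keeps a next-free-page counter in RAM; error handling and the segment-full path are elided:)

        #include <stdint.h>

        #define PAGES_PER_SLOT 64        /* illustrative geometry */

        static uint16_t active_segment;  /* current head of the log */
        static uint16_t next_free_offset;

        /* Append-only allocation: new blocks are taken sequentially from
         * the active segment; a written block is never rewritten in place. */
        static int allocate_block(uint16_t *seg, uint16_t *off) {
            if (next_free_offset == PAGES_PER_SLOT)
                return -1;               /* segment full: reclaim a new one */
            *seg = active_segment;
            *off = next_free_offset++;
            return 0;
        }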
  • 28. Garbage collection
    • TRIM - pages with obsolete data are marked with a special “obsolete flag”
    • The sequencing layer maintains counters of obsolete pages in every segment
    • Problem - EUs contain a mixture of valid and obsolete pages, so we can’t simply collect entire EUs
    • Solution: garbage collection is performed together with allocation
  • 29.
    • Reclamation unit = segment
      • The sequencing layer chooses a segment to reclaim and allocates it a second, fresh slot
    • Obsolete pages are reclaimed while non-obsolete pages are copied over
    • NOTICE – logical addresses are preserved, although the physical translation changes
  • 30.
    • Finally, when the new slot is full, the old slot is erased
    • It can then be used to reclaim another segment
    • We choose the segment with the highest obsolete counter as the new “active segment”
    • This would not go down well on rotating disks – too many seek operations (a sketch of the reclaim loop follows)
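    (A sketch of the reclaim loop, under one plausible reading of the scheme: valid pages are copied to the same offset in the fresh slot, so <segment, offset> handles survive, while obsolete offsets are reused for new data. The helpers and state are hypothetical:)

        #include <stdint.h>

        #define PAGES_PER_SLOT 64

        /* Hypothetical state and helpers. */
        extern uint16_t old_slot, new_slot, next_free_offset;
        int  page_is_valid(uint16_t slot, uint16_t off);
        void copy_page(uint16_t from, uint16_t to, uint16_t off);
        void write_page(uint16_t slot, uint16_t off, const void *data);
        void erase_slot(uint16_t slot);

        static int alloc_in_reclaimed_segment(const void *new_data,
                                              uint16_t *out_off) {
            while (next_free_offset < PAGES_PER_SLOT) {
                uint16_t off = next_free_offset++;
                if (page_is_valid(old_slot, off)) {
                    copy_page(old_slot, new_slot, off);  /* address preserved */
                } else {
                    write_page(new_slot, off, new_data); /* offset reclaimed */
                    *out_off = off;
                    return 0;
                }
            }
            erase_slot(old_slot);  /* new slot full: old slot is freed */
            return -1;             /* segment exhausted: pick a new victim */
        }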
  • 31. Sequencing Layer Recovery
    • When a new slot is allocated to a segment, a segment header is written in the slot’s first page
    • Header contains:
      • Incremented segment sequence number
      • Segment number
      • Segment type
      • Checkpoint (further details later)
  • 32.
    • On mounting, the header of every slot is read
    • The segment-to-slot map can be reconstructed using only the data from the headers (sketched below)
    • Other systems (with complete mappings) need to scan the entire flash
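    (A sketch of the header and the mount-time reconstruction; the exact layout is NANDFS-internal, and read_segment_header() is a hypothetical helper:)

        #include <stdint.h>

        #define NUM_SEGMENTS 512

        typedef struct {
            uint32_t sequence_number;  /* incremented per slot allocation */
            uint16_t segment_number;
            uint8_t  segment_type;
            /* checkpoint data follows (see the Checkpoints slides) */
        } segment_header_t;

        static uint32_t seq_of_segment[NUM_SEGMENTS];   /* 0 = unseen */
        static uint16_t slot_of_segment[NUM_SEGMENTS];

        int read_segment_header(uint16_t slot, segment_header_t *h);

        /* One header read per slot is enough: for every segment, keep the
         * slot whose header carries the highest sequence number. */
        void rebuild_map(uint16_t num_slots) {
            for (uint16_t slot = 0; slot < num_slots; slot++) {
                segment_header_t h;
                if (read_segment_header(slot, &h) != 0)
                    continue;                       /* free or bad slot */
                if (h.sequence_number > seq_of_segment[h.segment_number]) {
                    seq_of_segment[h.segment_number]  = h.sequence_number;
                    slot_of_segment[h.segment_number] = slot;
                }
            }
        }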
  • 33. Bad EU Management
    • Each flash memory chip contains some bad EUs
    • Some slots contain more valid EUs than others
    • Solution – some slots are set aside as a bank of reserve EUs
  • 34. Brief Summary
  • 35. The Design of NANDFS - More Ideas
  • 36. Wear Leveling
    • Writes and erases should be spread evenly over all EUs
    • Problem : some slots may be reclaimed rarely
    • Solution: perform a periodic random wear-leveling process
      • Choose a random slot and copy it to a fresh slot
      • Incurs only a low overhead
      • Guarantees near-optimal expected endurance
      • (Ben-Aroya and Toledo, 2006)
    • Technique widely used (YAFFS, JFFS); a sketch follows
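    (A sketch of the periodic step; the period and the relocate_slot() helper are illustrative assumptions:)

        #include <stdint.h>
        #include <stdlib.h>

        #define WEAR_LEVEL_PERIOD 1000   /* e.g. one move per 1000 writes */

        void relocate_slot(uint16_t slot);  /* copy to fresh slot, erase old */

        /* Every so often, copy a uniformly random slot to a fresh one, so
         * that even rarely-reclaimed ("cold") slots get erased eventually. */
        static void maybe_wear_level(unsigned long write_count,
                                     uint16_t num_slots) {
            if (write_count % WEAR_LEVEL_PERIOD != 0)
                return;
            relocate_slot((uint16_t)(rand() % num_slots));
        }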
  • 37. Transactions
    • File system operations are atomic and transactional
    • Marking pages as obsolete is not straightforward
    • Simple transaction – block re-write
      • After rewriting, old data block should be marked obsolete
      • If we mark it, and the transaction aborts before completing, old data should remain valid
      • If already marked as obsolete – cannot undo
  • 38.
    • Solution: perform the valid-to-obsolete transition (VOT) AFTER the transaction commits
    • Write VOT records to flash in dedicated pages
    • On commit, use the VOT records to mark pages as obsolete
    • Maintain on flash a linked list of all pages written by a specific transaction
    • Keep in RAM a pointer to the last page written by each transaction
    • On abort, mark all pages written by the transaction as obsolete (see the sketch below)
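    (A sketch of a VOT record and its replay on commit, reusing the logical handle and seq_trim() from the sequencing-layer sketch; the record layout is illustrative:)

        #include <stdint.h>

        typedef struct { uint16_t segment_id; uint16_t offset; } logical_addr_t;
        int seq_trim(logical_addr_t addr);  /* from the sequencing layer */

        /* One record per page that must become obsolete after commit. */
        typedef struct {
            uint16_t segment_id;
            uint16_t offset;
        } vot_record_t;

        /* VOT records are batched into dedicated flash pages; replaying
         * them AFTER commit performs the valid-to-obsolete transitions. */
        static void replay_vots(const vot_record_t *recs, int n) {
            for (int i = 0; i < n; i++) {
                logical_addr_t a = { recs[i].segment_id, recs[i].offset };
                seq_trim(a);
            }
        }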
  • 39.
  • 40. Checkpoints
    • Snapshot of system state
    • Ensures a return to a stable state following a crash
    • Checkpoint is written:
      • As part of a segment header.
      • Whenever a transaction commits.
    • Structure:
      • Obsolete counters array
      • Pointer to the last-written block of the committed transaction
      • Pointers to the last-written blocks of all on-going transactions
      • Pointer to the root inode (an illustrative rendering follows)
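    (An illustrative C rendering of that structure; the counts are assumptions - e.g. MAX_TRANSACTIONS mirrors the 3 concurrent transactions used in the experiments later:)

        #include <stdint.h>

        #define NUM_SEGMENTS     512
        #define MAX_TRANSACTIONS 3

        typedef struct { uint16_t segment_id; uint16_t offset; } logical_addr_t;

        typedef struct {
            uint16_t       obsolete_count[NUM_SEGMENTS];   /* per segment */
            logical_addr_t last_committed;      /* committed transaction  */
            logical_addr_t last_of_txn[MAX_TRANSACTIONS];  /* on-going    */
            logical_addr_t root_inode;
        } checkpoint_t;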
  • 41. Simple Example
  • 42. Finding the Last Checkpoint
    • At any given time there is only one valid checkpoint on flash
    • On mounting:
      • Locate the last allocated slot (using its sequence #)
      • Binary-search the slot to see if a later checkpoint exists in it (sketched below)
      • Abort all other transactions
      • Truncate all pages written after the checkpoint
      • Finish the transaction that was committed
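    (The binary search works because pages in a slot are written front to back, so the written pages form a prefix; a sketch, with a hypothetical page_is_written() helper:)

        #include <stdint.h>

        #define PAGES_PER_SLOT 64

        int page_is_written(uint16_t slot, int page);

        /* Find the last written page in a slot in O(log n) page reads. */
        static int last_written_page(uint16_t slot) {
            int lo = 0, hi = PAGES_PER_SLOT - 1, last = -1;
            while (lo <= hi) {
                int mid = (lo + hi) / 2;
                if (page_is_written(slot, mid)) { last = mid; lo = mid + 1; }
                else                            { hi = mid - 1; }
            }
            return last;   /* -1 if the slot is empty */
        }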
  • 43. File System Layer
  • 44.
    • Files are represented by inode trees
      • File metadata
      • Direct pointers to data pages
      • Indirect pointers, etc.
    • All pointers are logical pointers
    • Regular files are not permitted to be sparse (an illustrative inode follows)
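    (An illustrative inode; the metadata fields and pointer counts are assumptions, but note that every pointer is a logical <segment, offset> handle, never a physical page number:)

        #include <stdint.h>

        #define DIRECT_PTRS 8   /* illustrative count */

        typedef struct { uint16_t segment_id; uint16_t offset; } logical_addr_t;

        typedef struct {
            uint32_t       size;                 /* file metadata */
            uint32_t       mtime;
            logical_addr_t direct[DIRECT_PTRS];  /* -> data pages */
            logical_addr_t indirect;             /* -> page of pointers */
            logical_addr_t double_indirect;      /* -> pointers to pointers */
        } inode_t;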
  • 45.
    • Root file and directory inodes may be sparse.
    • Hole indicated by special flag
  • 46. The Root File
    • Array of inodes
  • 47.
    • When a file is deleted, a page-sized hole is created
    • When creating a file, a hole can easily be located
    • If no hole exists, allocate a new inode by extending the root file
  • 48. Directory Structure
    • Directory = array of directory entries
      • inode number
      • Length
      • UTF-8 file name.
    • Direntry length <= 256 bytes.
    • Direntries packed into chunks without gaps
  • 49.
    • chunk size < (page size - max direntry size) ~> the directory contains a “hole”
    • Allocating a new direntry requires finding such a hole
    • Direntry lookup is sequential (see the sketch below)
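    (A sketch of an entry and the sequential lookup; the field layout is illustrative, and for simplicity it assumes aligned entries, whereas the real packing is byte-granular:)

        #include <stddef.h>
        #include <stdint.h>
        #include <string.h>

        /* A packed directory entry: at most 256 bytes in total. */
        typedef struct {
            uint32_t inode_number;
            uint16_t length;       /* total entry length in bytes */
            char     name[];       /* UTF-8 name, not NUL-terminated */
        } direntry_t;

        /* Scan the entries packed into one chunk, in order. */
        static int lookup(const uint8_t *chunk, size_t used, const char *name) {
            size_t pos = 0, nlen = strlen(name);
            while (pos < used) {
                const direntry_t *e = (const direntry_t *)(chunk + pos);
                if (e->length <= offsetof(direntry_t, name))
                    break;         /* corrupt or unwritten space */
                size_t name_bytes = e->length - offsetof(direntry_t, name);
                if (name_bytes == nlen && memcmp(e->name, name, nlen) == 0)
                    return (int)e->inode_number;
                pos += e->length;  /* entries are packed without gaps */
            }
            return -1;             /* not found in this chunk */
        }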
  • 50. System Calls
    • Most system calls ( creat , unlink , mkdir …) are atomic transactions
    • The transaction that handles a write() commits only on close()
      • System calls that modify a single file can be bundled into a single transaction
      • 5 consecutive calls to write() + close() on a single file are treated as a single transaction
    • Overhead of transaction commit: (actual physical page writes) / (minimum possible page writes) ~ 1
  • 51. Running Out of Space
    • A log-structured file system writes even when the user deletes files
    • When the flash is full, the system may have too few free pages left to delete a file
    • Solution – maintain a count of free + obsolete pages
    • If the next write lowers this count below a threshold, abort transactions until we have enough free pages
    • The threshold depends on:
      • c = # of blocks written on a direntry delete
      • the maximum number of pages in a file
      • the number of re-do records per page
  • 52. Software Engineering
  • 53. Coding
    • Code written with the intention of being “humanly readable”:
        • (&(transactions[tid]))->f_type = 0x02
        • vs.
        • TRANSACTION_SET_FTYPE(tid, FTYPE_FILE) (a sketch of such a macro follows)
    • Embedded development:
      • External libraries are not an option (math, string)
      • More macros, fewer functions (small stack)
      • No debugging – need a good simulator!
      • Various gcc toolchains – cygwin, debian, arm-gcc
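    (For instance, the readable form above might be backed by a macro like this; a sketch, since the real definitions are NANDFS-internal:)

        #include <stdint.h>

        struct transaction { uint8_t f_type; /* ... */ };
        extern struct transaction transactions[];

        #define FTYPE_FILE 0x02

        /* A macro instead of a small function: keeps the tiny embedded
         * stack shallow while keeping call sites readable. */
        #define TRANSACTION_SET_FTYPE(TID, FTYPE) \
            ((transactions[(TID)]).f_type = (FTYPE))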
  • 54. Incremental development
    • High-level and low-level design preceded development
      • 3 weeks
    • Code written bottom-up
      • Flash driver –> sequencing layer –> file system layer
      • Caching layer added later. Challenging…
      • 1 year (~commercial code)
    • Test driven development
      • “ By hand” (no libraries)
  • 55. My own boss - lessons
    • Time frames
    • Outsider notes
      • Feedback
      • “ pairing”
  • 56. Experiments & Tests
  • 57. Testing
    • Extensive test-suite:
      • Integration and performance tests
      • Extensive crash tests
      • Large set of unit tests for every function
    • Integrated into eCos
    • Tests and integration verified on an actual 32MB flash
  • 58. Experiments
    • Simulated 1GB flash
    • Configuration - 512 slots, 8 reserved for bad-block replacement
    • 6 open files and 8 file descriptors
    • 3 concurrent transactions
  • 59. Workload
  • 60. Slot Partitioning
  • 61. Mounting
    • YAFFS mounting time - 2.7s
      • 80% utilization
  • 62. Endurance
    • Repeatedly re-write a small file when the file system contains a static 205MB file.
  • 63. (Some) Challenges in flash
  • 64. Single vs. Multi-Level Cell
    • Flash is classified by the number of bits stored in a single cell
    • SLC (1 bit per cell): smaller capacity, faster
    • MLC (2-4 bits per cell): cheaper, errors from partial writes, write-constrained, more error-prone, less endurance
  • 65. Parallelism
    • *Picture from N Agrawal, V Prabhakaran, T Wobber (2008)
  • 66.
    • Simple example for utilizing parallelism
    • * J Seol, H Shim, J Kim, and S Maeng (2009)
  • 67. Enterprise storage
    • * SW Lee, B Moon, C Park, JM Kim, SW Kim (2008)
    • Disk bandwidth (sequential) is still 2-3 times higher than flash’s
    • Flash read/write latency is smaller than disk’s by more than an order of magnitude
    • This improves the throughput of transaction processing – useful for database servers
  • 68. The End
      • Thank you!