NANDFS: A RAM-Constrained Flash File System
    Presentation Transcript

    • NANDFS: A Flexible Flash File System for RAM-Constrained Systems Aviad Zuck, Ohad Barzilay, and Sivan Toledo
    • Overview
      • Introduction + motivation
      • Flash properties
      • Big Ideas
      • Going into details
      • Software engineering, tests and experiments
      • General flash issues
    • Flash is Everywhere
      • Resilient to vibrations and extreme conditions
      • Up to 100 times faster than rotating disks at random access
    • What’s missing?
      • Sequential access
      • Cost
        • “Today, consumer-grade SSD costs from $2 to $3.45 per gigabyte, hard drives about $0.38 per gigabyte…”
        • Computerworld.com, 27.8.2008*
      • *http://www.computerworld.com/s/article/print/9112065/Solid_state_disk_lackluster_for_laptops_PCs
    • NOR Flash vs. NAND Flash
      • NOR: looser constraints, mostly reads, a few MB
      • NAND: more constrained, used for storage, many MB/GB
    • Two Ways of Flash Management
      • Conventional file system over a translation layer: NTFS, FAT, ext3, …
      • Native flash file system: JFFS, YAFFS, NANDFS, …
    • So Why NANDFS?
    • NANDFS Also Has:
      • File locking
      • Transactions
      • Competitive performance and graceful degradation
    • How is it Done, in a Nutshell?
      • Explanation does not fit in a nutshell
      • Complex data structures
      • New garbage collection mechanism
      • And much more…
      • Let’s elaborate
    • Flash Properties
      • Flash memory is divided into pages – 0.5KB, 2KB, 4KB
      • A page consists of data and metadata areas – 16B of metadata for every 512B of data
      • Pages are arranged in units – 32/64/128 pages per unit
      • Metadata contains a unit validity indicator, an ECC code and file system metadata
    • Erasures & Programming
      • Page bits are initialized to 1s
      • Writing clears bits (1 to 0)
      • Bits are set back to 1 only by erasing an entire unit (“erase unit”)
      • Erase unit has limited endurance
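The program/erase asymmetry above can be sketched in a few lines of C. This is a toy model, not NANDFS code: `erase_page`, `program_page`, and the page size are illustrative names chosen here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8   /* toy page size, for illustration only */

/* Erasing a unit sets every bit of its pages to 1. */
static void erase_page(uint8_t *page) {
    memset(page, 0xFF, PAGE_SIZE);
}

/* Programming can only clear bits (1 -> 0): new = old AND data.
 * A bit once cleared cannot be set again without an erase. */
static void program_page(uint8_t *page, const uint8_t *data) {
    for (int i = 0; i < PAGE_SIZE; i++)
        page[i] &= data[i];
}
```

Programming the same page twice can therefore only clear more bits, which is exactly why overwrite-in-place is impossible and erases must cover a whole unit.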
    • The Design of NANDFS - The “Big” Ideas
    • Log-structured design
      • Overwrite-in-place is not permitted in flash
      • Caching avoids rippling effect
    • Modular Flash File System
      • Modularity is good. But…
      • We need a block device API designated for flash
      • We call our “block device” the sequencing layer
      • Traditional block device: READ, WRITE, (TRIM)
      • NANDFS “block device”: READ, ALLOCATE-AND-WRITE, TRIM
    • High-level Design
      • A 2-layer structure:
        • File System Layer - transactional file system with unix-like file structure
        • Sequencing Layer – manages the allocation of immutable page-sized chunks of data. Assists in crash recovery and atomicity
    • The Sequencing Layer
      • Divides flash into fixed-size physical units called slots
      • Slots are assigned to segments – logical units of the same size
      • Each segment maps to one matching physical slot, except one “active segment”, which is mapped to two slots
    • Block access
      • Segment ~> Slot mapping table in RAM
      • Block is referenced by a logical handle
          • < segment_id , offset_in_segment >
      • Address translation
        • Example: Logical address <0,2> ~> Physical address 8
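The translation step can be sketched as a lookup in the RAM-resident segment-to-slot table followed by simple arithmetic. The geometry and map values below are hypothetical, chosen so that the slide’s example <0,2> ~> 8 works out:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SEGMENTS   4   /* toy geometry, for illustration */
#define PAGES_PER_SLOT 3

/* RAM-resident segment -> slot map (values here are made up). */
static uint32_t slot_of[NUM_SEGMENTS] = {2, 0, 1, 3};

/* Translate a logical handle <segment_id, offset_in_segment>
 * to a physical page number. */
static uint32_t logical_to_physical(uint32_t segment_id, uint32_t offset) {
    return slot_of[segment_id] * PAGES_PER_SLOT + offset;
}
```

With this toy map, segment 0 lives in slot 2, so the logical address <0,2> translates to physical page 2 × 3 + 2 = 8.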
    • Where’s the innovation?
      • Logical address mapping not a new idea:
        • Logical Disk (1993), YAFFS, JFFS, And more
      • Many FTLs use some logical address mapping
        • Full mapping ~> expensive
        • Coarse-grained mapping
          • Fragmentation, performance degradation
          • Costly merges
    • * DFTL: A Flash Translation Layer Employing Demand-based Selective Caching of Page-level Address Mappings (2009)
      • The difference in NANDFS
        • NANDFS uses coarse-grained mapping, not full mapping
        • Less RAM for page mapping (more RAM flexibility)
        • Collect garbage while preserving validity of pointers to non-obsolete blocks
      • Appropriate for flash, not for magnetic disks
    • Block allocation
      • NANDFS is log-structured
      • New blocks allocated sequentially from the active segment.
      • In a log-structured system blocks are never re-written
      • File pointer structures need to be updated to reflect the new location of the data.
    • Garbage collection
      • TRIM – pages with obsolete data are marked with a special “obsolete flag”
      • The sequencing layer manages counters of obsolete pages in every segment
      • Problem – EUs contain a mixture of valid and obsolete pages, so we can’t simply collect entire EUs
      • Solution : Garbage collection is performed together with allocation
      • Reclamation unit = Segment
        • The sequencing layer chooses a segment to reclaim and allocates it a second, fresh slot
      • Obsolete pages are reclaimed while non-obsolete pages are copied
      • NOTICE – logical addresses are preserved, although the physical translation changes
      • Finally, when the new slot is full, the old slot is erased
      • It can then be used to reclaim another segment
      • The segment with the highest obsolete counter is chosen as the new “active segment”
      • This would not go down well on rotating disks – too many seek operations
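The copy-while-reclaiming step can be sketched as follows. This is a simplification of the mechanism described above (in NANDFS, copying is interleaved with new allocations in the active segment); `struct slot` and `reclaim_segment` are illustrative names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGES_PER_SLOT 4   /* toy geometry */

struct slot {
    uint32_t data[PAGES_PER_SLOT];
    bool     obsolete[PAGES_PER_SLOT];
};

/* Copy every valid page of `old` into `fresh` at the SAME offset, so
 * logical handles <segment, offset> stay valid.  Obsolete pages are
 * simply not copied; their space is reclaimed.  Returns the number of
 * pages reclaimed. */
static int reclaim_segment(const struct slot *old, struct slot *fresh) {
    int reclaimed = 0;
    for (int i = 0; i < PAGES_PER_SLOT; i++) {
        if (old->obsolete[i])
            reclaimed++;                    /* freed in the fresh slot */
        else
            fresh->data[i] = old->data[i];  /* same offset, same handle */
    }
    return reclaimed;
}
```

Because valid pages keep their offsets inside the segment, no pointer anywhere in the file system needs to be rewritten when a segment moves to a new slot.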
    • Sequencing Layer Recovery
      • When a new slot is allocated to a segment, a segment header is written in the slot’s first page
      • Header contains:
        • Incremented segment sequencing number
        • Segment number
        • Segment type
        • Checkpoint (further details later)
      • On mounting, the header of every slot is read
      • The segment-to-slot map can be reconstructed using only the data from the headers
      • Other systems (with complete mapping) need to scan entire flash
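The mount-time reconstruction can be sketched like this. It is a simplification (the active segment’s two slots are reduced to “higher sequence number wins”), and `struct seg_header` and `rebuild_map` are illustrative names:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SLOTS 4   /* toy geometry: one potential segment per slot */

/* The segment header written in a slot's first page (simplified). */
struct seg_header {
    uint32_t sequence;   /* incremented each time the segment gets a slot */
    uint32_t segment;    /* logical segment number */
};

/* Rebuild the segment -> slot map from the slot headers alone.  When
 * two slots claim the same segment, the one with the higher sequence
 * number is the newer mapping. */
static void rebuild_map(const struct seg_header hdrs[NUM_SLOTS],
                        int slot_of[NUM_SLOTS],
                        uint32_t best_seq[NUM_SLOTS]) {
    for (int s = 0; s < NUM_SLOTS; s++) {
        uint32_t seg = hdrs[s].segment;
        if (slot_of[seg] < 0 || hdrs[s].sequence > best_seq[seg]) {
            slot_of[seg] = s;
            best_seq[seg] = hdrs[s].sequence;
        }
    }
}
```

Only one header page per slot is read, which is why mounting scans a handful of pages instead of the entire flash.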
    • Bad EU Management
      • Each flash memory chip contains some bad EUs
      • Some slots contain more valid EUs than others
      • Solution – some slots are set aside as a bank of reserve EUs
    • Brief Summary
    • The Design of NANDFS - More Ideas
    • Wear Leveling
      • Writes and erases should be spread evenly over all EUs
      • Problem : some slots may be reclaimed rarely
      • Solution: perform a periodic randomized wear-leveling process
        • Choose random slot and copy it to a fresh slot
        • Incurs only a low overhead
        • Guarantees near-optimal expected endurance
        • (Ben-Aroya and Toledo, 2006)
      • Technique widely used (YAFFS, JFFS)
    • Transactions
      • File system operations are atomic and transactional
      • Marking pages as obsolete is not straightforward
      • Simple transaction – block re-write
        • After rewriting, old data block should be marked obsolete
        • If we mark it, and the transaction aborts before completing, old data should remain valid
        • If already marked as obsolete – cannot undo
      • Solution: perform the valid-to-obsolete transition (or VOT) AFTER the transaction commits
      • Write VOT records to flash in dedicated pages
      • On commit use VOT records to mark pages as obsolete
      • Maintain linked list of all pages written in a specific transaction on flash
      • Keep in RAM a pointer to the last page written in a transaction
      • On abort mark all pages written by the transaction as obsolete
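The commit/abort asymmetry above can be sketched as follows. This is a toy in-RAM model (real VOT records and the per-transaction page list live on flash); `struct txn` and the helper names are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_VOTS  8
#define NUM_PAGES 16

static bool obsolete[NUM_PAGES];   /* per-page obsolete flags */

struct txn {
    uint32_t vot[MAX_VOTS];        /* pages to obsolete AFTER commit */
    int      nvot;
    uint32_t written[MAX_VOTS];    /* pages this transaction wrote */
    int      nwritten;
};

/* Record a valid-to-obsolete transition; nothing is marked yet,
 * so old data stays valid if the transaction later aborts. */
static void txn_add_vot(struct txn *t, uint32_t page) {
    t->vot[t->nvot++] = page;
}

/* On commit, replay the VOT records: old versions become obsolete. */
static void txn_commit(struct txn *t) {
    for (int i = 0; i < t->nvot; i++)
        obsolete[t->vot[i]] = true;
}

/* On abort, the transaction's own writes become obsolete instead,
 * and the recorded VOTs are discarded. */
static void txn_abort(struct txn *t) {
    for (int i = 0; i < t->nwritten; i++)
        obsolete[t->written[i]] = true;
    t->nvot = 0;
}
```

Deferring the marking is what makes abort undo-free: until commit, the only thing the transaction has changed is its own (discardable) pages.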
    • Checkpoints
      • Snapshot of system state
      • Ensures returning to stable state following a crash
      • Checkpoint is written:
        • As part of a segment header.
        • Whenever a transaction commits.
      • Structure:
        • Obsolete counters array
        • Pointer to last-written block address of committed transaction
        • Pointers to the last-written blocks of all on-going transactions
        • Pointer to root inode
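A checkpoint with the four components listed above might look like the following struct. Field names and sizes are guesses for illustration (the constants echo the experimental configuration mentioned later: 512 slots, 3 concurrent transactions), not the on-flash layout:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SEGMENTS 512   /* illustrative, after the 512-slot config */
#define MAX_TXNS     3     /* illustrative concurrent-transaction cap */

/* Snapshot of system state, written into segment headers and on
 * every transaction commit. */
struct checkpoint {
    uint32_t obsolete_count[NUM_SEGMENTS]; /* per-segment obsolete counters  */
    uint32_t last_committed_block;         /* last block of committed txn    */
    uint32_t last_txn_block[MAX_TXNS];     /* last block of each ongoing txn */
    uint32_t root_inode;                   /* logical address of root inode  */
};
```

Everything needed to resume after a crash hangs off these few fields: the counters restore garbage-collection state, the per-transaction pointers let recovery walk each transaction’s page list, and the root inode pointer anchors the file system tree.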
    • Simple Example
    • Finding the Last Checkpoint
      • At any given time there is only one valid checkpoint in flash
      • On mounting
        • Locate the last allocated slot (using its sequence #)
        • Perform a binary search to see if a later checkpoint exists in the slot
        • Abort all other transactions
        • Truncate all pages written after the checkpoint
        • Finish the transaction that was committed
    • File System Layer
      • Files represented by inode trees
        • File metadata
        • Direct pointers to data pages
        • Indirect pointers etc.
      • All pointers are logical pointers
      • Regular files not permitted to be sparse
      • Root file and directory inodes may be sparse.
      • Hole indicated by special flag
    • The Root File
      • Array of inodes
      • When a file is deleted a page-size hole is created
      • When creating a file a hole can easily be located
      • If no hole exists, allocate a new inode by extending the root file
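The hole-reuse policy for the root file can be sketched like this. `HOLE`, `root_file`, and `alloc_inode` are illustrative names for this toy in-RAM model:

```c
#include <assert.h>

#define MAX_INODES 8
#define HOLE       (-1)   /* stands in for the special hole flag */

/* Inode slots of the root file; HOLE marks a deleted file's slot. */
static int root_file[MAX_INODES];
static int root_len;      /* current number of inode slots */

/* Reuse a hole if one exists; otherwise extend the root file by one
 * inode and return the new slot's index. */
static int alloc_inode(void) {
    for (int i = 0; i < root_len; i++)
        if (root_file[i] == HOLE)
            return i;
    return root_len++;
}
```

Deleting a file punches a hole; creating one fills the first hole found, so the root file only grows when it is completely packed.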
    • Directory Structure
      • Directory = array of directory entries
        • inode number
        • Length
        • UTF-8 file name.
      • Direntry length <= 256 bytes.
      • Direntries packed into chunks without gaps
      • chunk size < (page - direntry size) ~> directory contains “hole”
      • Allocating new direntry requires finding a hole
      • Direntry Lookup is sequential
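The sequential lookup over packed direntries can be sketched as follows. The fixed-size `struct direntry` is a simplification of the variable-length on-flash entries, and the names are illustrative:

```c
#include <assert.h>
#include <string.h>

/* A simplified directory entry: <inode number, length, UTF-8 name>.
 * Real direntries are variable-length (<= 256 bytes) and packed into
 * page-sized chunks without gaps. */
struct direntry {
    int  inode;
    int  len;
    char name[32];
};

/* Sequential lookup: scan entries in order until the name matches.
 * Returns the inode number, or -1 if the name is not present. */
static int dir_lookup(const struct direntry *dir, int n, const char *name) {
    for (int i = 0; i < n; i++)
        if (strcmp(dir[i].name, name) == 0)
            return dir[i].inode;
    return -1;
}
```

Linear scanning keeps the RAM footprint minimal, at the cost of O(n) lookups in large directories.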
    • System Calls
      • Most system calls ( creat , unlink , mkdir …) are atomic transactions
      • The transaction that handles a write() commits only on close()
        • System calls that modify a single file can be bundled into a single transaction
        • 5 consecutive calls to write() + close() on a single file are treated as a single transaction
      • Overhead of transaction commit: (actual physical page writes) / (minimum possible page writes) ≈ 1
    • Running Out of Space
      • Log-structured file system writes even when user deletes files
      • When flash is full, the system may have too few free pages to delete a file
      • Solution – maintain number of free+obsolete pages.
      • If next write lowers this number below threshold - abort transactions until we have enough free pages
      • The threshold depends on:
        • c = # of blocks written on direntry delete
        • the maximum number of file pages
        • the number of re-do records per page
    • Software Engineering
    • Coding
      • Code written with intention to be “humanly readable”
          • (&(transactions[tid]))->f_type = 0x02
          • vs.
      • Embedded development
        • External libraries not an option (math, string)
        • More macros, less functions (stack)
        • No debugging – need good simulator!
        • Various gcc toolchains – Cygwin, Debian, arm-gcc
    • Incremental development
      • High level and Low level design preceded development
        • 3 weeks
      • Code written bottom up
        • Flash driver –> sequencing layer –> file system layer
        • Caching layer added later. Challenging…
        • 1 year (~commercial code)
      • Test driven development
        • “By hand” (no libraries)
    • My own boss - lessons
      • Time frames
      • Outsider notes
        • Feedback
        • “pairing”
    • Experiments & Tests
    • Testing
      • Extensive test-suite:
        • Integration and performance tests
        • Extensive crash tests
        • Large set of unit tests for every function
      • Integrated to eCos
      • Tests and integration verified on actual 32 MB flash
    • Experiments
      • Simulated 1GB flash
      • Configuration - 512 slots, 8 reserved for bad-block replacement
      • 6 open files and 8 file descriptors
      • 3 concurrent transactions
    • Workload
    • Slot Partitioning
    • Mounting
      • YAFFS mounting time - 2.7s
        • 80% utilization
    • Endurance
      • Repeatedly re-write a small file when the file system contains a static 205MB file.
    • (Some) Challenges in flash
    • Single vs. Multi level cell
      • Flash classified by number of bits stored in a single cell
      • SLC (1 bit)
        • Smaller capacity
        • Faster
      • MLC (2-4 bits)
        • Cheaper
        • Errors from partial writes
        • Write-constrained
        • More error-prone
        • Less endurance
    • Parallelism
      • *Picture from N Agrawal, V Prabhakaran, T Wobber (2008)
      • Simple example for utilizing parallelism
      • * J Seol, H Shim, J Kim, and S Maeng (2009)
    • Enterprise storage
      • * SW Lee, B Moon, C Park, JM Kim, SW Kim (2008)
      • Disk bandwidth (sequential) is still 2-3 times higher than flash
      • Flash read/write latency is smaller than disk’s by more than an order of magnitude
      • This improves transaction-processing throughput – useful for database servers
    • The End
        • Thank you!