Flash File Systems

Hackers Hut / Linux Kernel

Guido R. Kok
Faculty of Electrical Engineering, Mathematics and Computer Science
University of Twente
August 2008

Abstract

Using flash memory in mass storage devices is an upcoming trend. With high performance, low energy consumption and shock-proof properties, flash memory is an appealing alternative to large magnetic disk drives. However, flash memory has other properties that warrant special treatment in comparison to magnetic disk drives. Instead of the slow seek times of magnetic hard drives, flash memory has other disadvantages that need to be coped with, such as slow erasure times, big erasure blocks and blocks wearing out. Several flash file systems have been developed to deal with these shortcomings. An overview and comparison of flash file systems is presented.

1 Introduction

Flash memory is growing to be one of the main means of mass data storage. Compared to the traditional and very common magnetic hard disks, flash memory offers faster access times and better kinetic shock resistance. These two characteristics explain the popularity of flash memory in portable devices, such as PDAs, laptops, mobile phones, digital cameras and digital audio players.

Two types of flash will be considered in this paper: NOR (Negative OR) and NAND (Negative AND) flash. These two types of memory differ in details, but they share the same principles as any flash memory does.

Flash differentiates between write and erase. An erased flash is completely filled with ones, while a flash write will flip bits from 1 to 0. The only way to flip bits back from 0 to 1 is to erase. A big difference between flash memory and magnetic hard disks is that erase operations on flash only operate on large blocks. Erases on flash happen in coarse granularities of powers of two, with blocks ranging from 32 kB to 256 kB. Writes can occur in much smaller granularities, from individual bits for NOR flash to 256, 512 or 2048 bytes in the case of NAND flash. More details on NOR and NAND flash will be given in Section 2.

Hardware manufacturers try to use flash memory as a replacement for magnetic hard disks. Before flash memory can threaten the dominant market position of magnetic hard disks as mass storage media, several limitations of flash need to be coped with by the filesystems running on it.

1.1 Flash limitations

1. The lifetime of flash memory is finite. It is measured in the number of write-erase cycles on an erase block before the block begins to fail. Most hardware manufacturers guarantee 100,000 cycles on each erase block in their chips. Magnetic hard disks do not have such a limitation.

2. Flash requires out-of-place updates of data, meaning that a new version of data has to be written to a new block instead of overwriting the old data. Before being able to write to a specific location, the target erase block must be erased. If an unclean unmount occurs at this time, data loss will occur, as neither the old nor the new data can be retrieved.

3. Erase blocks are far larger than magnetic hard disk sectors or blocks. One erase block on flash is therefore shared by multiple filesystem blocks. Because of the out-of-place updates, erase blocks become partially obsolete. Once free space runs low, a technique called Garbage Collection starts to collect valid filesystem blocks in order to free space. For more details, see Section 3.3.
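The write and erase behaviour described in the introduction can be modelled in a few lines. This is only an illustrative sketch: the class name `EraseBlock` and its methods are hypothetical and do not correspond to any real flash driver interface. It shows that programming can only clear bits (a bitwise AND with the old contents) and that only a whole-block erase restores them.

```python
class EraseBlock:
    """Toy model of one erase block (hypothetical, for illustration only)."""

    def __init__(self, size: int) -> None:
        self.data = bytearray([0xFF] * size)  # erased flash reads as all ones
        self.erase_count = 0                  # wear: number of erase cycles so far

    def program(self, offset: int, payload: bytes) -> None:
        # A write can only flip bits from 1 to 0, so the stored value is the
        # bitwise AND of the old contents and the new payload.
        for i, byte in enumerate(payload):
            self.data[offset + i] &= byte

    def erase(self) -> None:
        # Erasing is the only way to get bits back to 1, and it always
        # affects the whole block.
        self.data = bytearray([0xFF] * len(self.data))
        self.erase_count += 1


block = EraseBlock(4)
block.program(0, b"\xF0")  # 0xFF & 0xF0 gives 0xF0
block.program(0, b"\x3C")  # 0xF0 & 0x3C gives 0x30, not 0x3C
```

Overwriting in place therefore does not yield the new value; this is why flash file systems write new versions elsewhere and reclaim old copies by erasing.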
It can be seen that traditional file systems, which were designed for magnetic hard disks, are not suitable to cope with the flash memory limitations mentioned above. This paper will give an overview of techniques that cope with the above problems and of flash file systems that implement these techniques in their own way.

1.2 Paper layout

Section 2 gives a more detailed explanation of NOR and the newer NAND flash, pointing out their differences and limitations. Section 3 presents the general block mapping techniques and garbage collection in flash filesystems. Building on this general approach to garbage collection and mapping, Section 4 surveys several flash file systems, including FFS, JFFS, YAFFS, LogFS and UBIFS. Section 5 compares the flash filesystems and concludes the paper.

2 Background

NOR and NAND flash are non-volatile memories that can be electrically erased and reprogrammed. They are successors of EEPROM (Electrically Erasable Programmable Read-Only Memory), which was considered too slow to write to. NOR flash was the first to appear in 1984, while the first NAND principles were presented in 1989.

2.1 NOR Flash

Reading from NOR flash works the same way as reading from RAM, as NOR flash has external address buses that allow NOR memory to be read bit by bit. This direct-access read ability allows NOR flash to be used as eXecute-In-Place (XIP) memory, meaning that programs stored in NOR flash can be executed directly without the need to copy them into RAM first. XIP reduces the need for RAM, but as a disadvantage compression cannot be used.

Unlocking, erasing and writing NOR memory operate on a block-by-block basis, with blocks that are typically 64, 128 or 256 kB in size.

Due to the integration of external address buses that allow direct access, the density of NOR memory is low compared to NAND flash. In 2001, a 16 MB NOR array would be considered large, while a 128 MB NAND was already available as a single chip.

2.2 NAND Flash

All operations on NAND flash (read, write, unlock, erase) operate on a block-by-block basis. For NAND flash there are two kinds of blocks. Write blocks, also known as pages, are typically 512 to 4096 bytes in size. Associated with each page are some Out-Of-Band (OOB) bytes, which are used to store ECC codes and other header information. The other type of NAND block is the erase block.

Typical erase block sizes vary between 32 pages of 512 bytes each, for an erase block size of 16 kB, and 128 pages of 4,096 bytes each, for a total erase block size of 512 kB. Programming and reading can be performed on a page basis, while erasing can only be performed on erase blocks. Pages are buffered before a write operation is executed, because each erase block can be written only up to four times before information starts to leak and the block must be erased.

NAND memory has more capacity and lower costs for two reasons:

• The external buses of NOR flash have been removed, placing the memory cells in series rather than parallel to the bit lines. This saves space, enhancing the density of memory cells, but at the cost of losing direct access.

• The memory cells inside NAND are not guaranteed to be error-free when shipped. Bad block management needs to be implemented in the file systems in order to handle bad blocks. Because the manufacturers dropped the requirement that each cell must be error-free (except for the first physical block), far higher yields can be achieved, lowering the manufacturing costs.
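The erase block sizes quoted in Section 2.2 follow directly from the number of pages per erase block times the page size. The short sketch below only reproduces that arithmetic; the function name `erase_block_size` is an assumption made for illustration.

```python
def erase_block_size(pages_per_block: int, page_size: int) -> int:
    """Size of one erase block in bytes (OOB bytes are not counted)."""
    return pages_per_block * page_size


small = erase_block_size(32, 512)     # smallest configuration named in the text
large = erase_block_size(128, 4096)   # largest configuration named in the text

print(small // 1024, "kB")
print(large // 1024, "kB")
```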
3 Block mapping techniques

The earliest approach to using flash memory is to treat it the same way older filesystems such as FAT treat a disk. FAT treats flash memory as a block device that allows data blocks to be read and written. This is the approach typically used on magnetic hard disks. However, this linear mapping of blocks onto flash addresses gives rise to several problems.

First of all, because new data is rewritten to the old location, frequently updated data translates to a block that is used frequently. This is no problem on magnetic hard disks, but as mentioned in Section 1.1, flash memory blocks wear out, causing failure of the memory block if it has been written to too often.

Secondly, data blocks can be a lot smaller than the physical erase blocks. Writing a small data block to the old location means that a big erase block has to be read into RAM, the appropriate data block is overwritten, the erase block in the flash is erased and the whole erase block in RAM is written back into flash. Clearly this approach is very inefficient.

Finally, if an unclean unmount, such as a power outage, occurs during the above operation, not only the small data block is lost: the entire contents of the erase block can be lost as well.

The above problems are solved with more sophisticated block-to-flash mappings and wear leveling.

3.1 Block mapping in flash

Instead of a direct mapping of blocks onto physical addresses, the idea of dynamic mapping was developed. Blocks presented by the higher-layer functions, such as an operating system, are identified by a virtual block number. This virtual block number is mapped to a physical flash address called a sector (some authors use a different terminology [GT05]). This mapping is stored in RAM and can be updated, giving a virtual block a new physical location. This idea is commonly used in wear leveling techniques as follows. When a virtual block is written to flash, it is not written to the old location, but to a new location, and the virtual-block-to-sector mapping is updated with the new physical location of the block. This dynamic mapping approach serves multiple purposes:

1. A frequently updated block is written to a different sector each time it is modified, evening out the wear of different erase blocks.

2. An updated block is written to a new location without the need to erase and rewrite an entire erase block.

3. When power is lost during a write operation, the dynamic mapping makes sure that the write is an atomic operation, meaning that no data is lost.

The atomicity mentioned in item 3 is achieved by information stored in the header associated with each data sector. When a block needs to be written to flash, the software searches for a free and erased sector. In this sector and its associated header, all bits are set to 1. To achieve atomicity, three bits in the header are used. Before the block is written, the used bit is cleared (made 0) to indicate that the sector is no longer free. Then the virtual block number is written to the header and the new data is written to the sector. The used bit can be combined with the virtual block number field, provided that a virtual block number consisting of all ones is not a valid block number.

Once the data has been written, the so-called valid bit is cleared to indicate that the data in the sector is ready to be read. The last bit, called the obsolete bit, is cleared in the header of the old sector once that sector no longer contains the latest version of the virtual block.

When power is lost during a write operation, the system can be left in one of two inconsistent states. The first occurs when the power is lost before the valid bit is cleared. In this case, the sector contains data that is not valid, and when the flash is used again, this sector is marked obsolete to ready it for erasure.

The second inconsistent state occurs after the sector is marked valid, but before the old sector is marked obsolete. In this case there are two valid data sectors for the same virtual block and the system can choose which one to use. If it is important to pick the most recent version, a two-bit version number can be added to the sector header. This version number has 4 different values and wraps around, so that version number 0 is more recent than 3.

For more information on these header fields the reader is referred to [GT05] and [Man02].
3.2 Block mapping data structures

In flash memory, a block is not kept in the same sector because of wear leveling, so the system needs a way to find the sector that currently holds a given block. Two kinds of map have been developed for this. Direct maps contain the current location of a given block, while inverse maps store the identity of a block given a sector. In other words, direct maps allow efficient mapping of blocks to sectors, while inverse maps do the inverse by mapping sectors to blocks.

Inverse maps are stored in the flash itself. The virtual block number in the header of a sector indicates which block the sector belongs to. The virtual block numbers can also be stored in a sector full of block numbers, as long as they are stored within the same erase unit, so that on erase all data and associated block numbers are erased together. The main purpose of these inverse maps is to reconstruct the direct map upon device initialization, such as a mount.

Direct maps are at least partially, if not totally, stored in RAM, which supports fast lookups. When a block is updated and written to a new location, its mapping is updated in RAM. Updating the mapping in flash would not be possible because flash does not support in-place modification.

To summarize, inverse maps ensure that sectors can be linked to the blocks they contain, while direct maps, being stored in RAM, allow the system to make fast lookups of the physical whereabouts of blocks. Direct and inverse maps, as well as the atomicity property, are illustrated in Figure 1.

3.3 Garbage Collection

Garbage collection for flash filesystems has its basis in the principle of "segment cleaning" designed by Rosenblum et al. [RO92]. Data that is no longer needed is not deleted but made obsolete. Obsolete data is still in the flash and occupies space. Obsolete data cannot be deleted at once, as there may be valid data remaining in the same erase block. The implementation, efficiency and activation of garbage collection depend on which file system is used, but the process can generally be described in four stages:

1. One or more erase blocks are selected for garbage collection.

2. The valid sectors in each erase block are copied to free sectors in newly allocated erased blocks.

3. The data structures used in the mapping process are updated to reflect the new location of the valid sectors.

4. The erase block is erased and its sectors are added to a free-sector pool. This step might include writing an erase-block header to specify details such as an erase counter.

The choice of which erase blocks to reclaim and where to move the valid sectors affects the file system in three ways. The first is the efficiency of the garbage collection, measured by how many obsolete sectors are reclaimed in each erased block. The more obsolete sectors in comparison to valid sectors, the higher the efficiency. The second effect is wear leveling: the choice of the free block to which the valid sectors are copied is influenced by how many times that block has already been used. The last effect is the way these choices affect the mapping data structures, as some garbage collectors only require one simple update of a direct map while other systems may require complex updates.

Wear leveling and garbage collection efficiency often pull in opposite directions. The best example is an erase block filled with static data, that is, data that is never or rarely updated. In terms of efficiency, it would be a terrible mistake to select this block for garbage collection, as there are no obsolete sectors that would be freed. However, garbage collecting this block has its use in terms of wear leveling, as such a static block has very low wear. Moving the static data to a block that already has high wear reduces further wear on that block, while making the low-wear erase block available for dynamic data.

4 File Systems

This section will give an overview of file systems used on flash memory. The first approach is to use a Flash Translation Layer and run a normal filesystem, such as FAT, on top of that. The combination of FTL and FAT is used a lot in
removable flash devices, as portability is a main requirement. After the FTL, background will be given on the principles on which dedicated flash file systems work: journaling and logging. Then several flash filesystems will be discussed: old file systems such as Microsoft's FFS, currently used filesystems such as JFFS and YAFFS, and promising filesystems still in development, such as LogFS.

Figure 1: Block mapping in a flash device. The gray array on the right is the direct map, which resides in RAM. Each sector contains a header and data. The header contains the virtual block number, an erase counter, a valid and an obsolete bit, as well as an ECC code for error checking and a version number. The virtual block numbers in used sectors constitute the inverse map, from which the direct map can be constructed. The erase counter is used in wear leveling, while the valid and obsolete bits and the version number support the atomicity and consistency of write operations. The ECC code helps to detect errors in failing blocks. Courtesy of Gal et al. [GT05]

4.1 Flash Translation Layer

The first approach for file systems to be used on flash memory is to emulate a virtual block device, which can be used by a regular filesystem such as FAT. A Flash Translation Layer (FTL) provides this functionality and takes care of the drawbacks mentioned in Section 1.1.

A write of a block to the virtual block device handled by the FTL causes the FTL to do the following:

• The content of the data block is written to flash.

• The location of the old data is marked obsolete.

• Garbage collection may be activated to free up space for later use.

• Optionally, depending on the concrete FTL used, more writes to flash may be necessary to update block-to-sector mappings, increase erase and free block counters, and so on.

An FTL keeps track of the current location of each sector in the emulated block device, which makes it a sort of journaling file system. Journaling file systems will be discussed in the next section. The idea of the FTL stems from a patent owned by A. Ban [Ban95], which was adopted as a PCMCIA standard in 1998 [Cor98b]. FTL was created for NOR memory only, but in 1999 a NAND version of FTL was developed, called NFTL [Ban99]. NFTL is incorporated in the DiskOnChip device.

Because the use of an extra layer between a regular file system and the flash device is inefficient, researchers started to work on filesystems specifically made for flash memory. The fact that the use of FTL and NFTL was heavily restricted by the patents involved further fueled the need for flash-specific filesystems.
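The FTL write steps listed in Section 4.1, together with the four garbage collection stages from Section 3.3, can be sketched as a toy translation layer. This is a minimal sketch and not the patented FTL or NFTL: the class `ToyFTL`, its method names and its trigger threshold are assumptions made for illustration, victim selection looks only at the number of obsolete sectors (it ignores wear leveling), and all state lives in Python lists rather than in sector headers.

```python
from typing import Dict, List, Optional, Tuple

FREE, VALID, OBSOLETE = "free", "valid", "obsolete"


class ToyFTL:
    """Hypothetical block-device emulation on top of erase blocks."""

    def __init__(self, num_blocks: int, sectors_per_block: int) -> None:
        self.spb = sectors_per_block
        self.state = [[FREE] * self.spb for _ in range(num_blocks)]
        # Inverse map: which virtual block each sector holds.
        self.owner: List[List[Optional[int]]] = [
            [None] * self.spb for _ in range(num_blocks)]
        self.data = [[b""] * self.spb for _ in range(num_blocks)]
        self.erase_count = [0] * num_blocks
        # Direct map, kept in RAM: virtual block -> (erase block, sector).
        self.direct_map: Dict[int, Tuple[int, int]] = {}

    def _free_sectors(self, skip: Optional[int] = None) -> List[Tuple[int, int]]:
        return [(b, s) for b in range(len(self.state)) if b != skip
                for s in range(self.spb) if self.state[b][s] == FREE]

    def _place(self, vbn: int, payload: bytes, skip: Optional[int] = None) -> None:
        b, s = self._free_sectors(skip)[0]      # write out of place
        self.state[b][s], self.owner[b][s], self.data[b][s] = VALID, vbn, payload
        old = self.direct_map.get(vbn)
        if old is not None:
            self.state[old[0]][old[1]] = OBSOLETE   # old copy becomes obsolete
        self.direct_map[vbn] = (b, s)               # remap the virtual block

    def write(self, vbn: int, payload: bytes) -> None:
        self._place(vbn, payload)
        if len(self._free_sectors()) <= self.spb:   # free space is running low
            self.collect_garbage()

    def read(self, vbn: int) -> bytes:
        b, s = self.direct_map[vbn]
        return self.data[b][s]

    def collect_garbage(self) -> None:
        # Stage 1: select the erase block with the most obsolete sectors.
        victim = max(range(len(self.state)),
                     key=lambda b: self.state[b].count(OBSOLETE))
        if self.state[victim].count(OBSOLETE) == 0:
            return                                  # nothing to reclaim
        # Stages 2 and 3: copy valid sectors elsewhere; _place updates the map.
        for s in range(self.spb):
            if self.state[victim][s] == VALID:
                self._place(self.owner[victim][s], self.data[victim][s],
                            skip=victim)
        # Stage 4: erase the block and return its sectors to the free pool.
        self.state[victim] = [FREE] * self.spb
        self.owner[victim] = [None] * self.spb
        self.data[victim] = [b""] * self.spb
        self.erase_count[victim] += 1


ftl = ToyFTL(num_blocks=3, sectors_per_block=2)
for vbn, payload in [(0, b"a"), (0, b"b"), (1, b"c"), (0, b"d"),
                     (1, b"e"), (2, b"f")]:
    ftl.write(vbn, payload)
```

The filesystem on top only ever sees virtual block numbers; the sketch also makes visible why an FTL costs extra flash writes, since valid sectors are copied during garbage collection.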
4.2 Background: Journaling and Log-structured file systems

Most of the flash-specific file systems are based on the principle of log-structured file systems. This principle is the successor of journaling file systems, which we will explain first.

In journaling file systems, modifications of metadata are stored in a journal before modifications are made to the data block itself. When a crash has occurred, the fixing-up process examines the tail of the journal and rolls back or completes each metadata operation, depending on the moment at which the crash occurred. Journaling is used in many current file systems for magnetic hard disks, such as ext3 [Rob] and ReiserFS [Mas] on Linux systems.

The principle of log-structured filesystems was designed for magnetic disks, on which it is currently not used much. However, the idea is very useful in the context of flash memory. Log-structured filesystems are a rather extreme version of journaling filesystems, in the sense that the journal/log is the filesystem. The disk is organized as one long log, which consists of fixed-size segments of contiguous disk fragments, chained together as a linked list.

When data and metadata are written, they are appended to the tail of the log, never rewritten in place. When (meta)data is appended to the end of the log, two problems arise: how to find new data, and how garbage collection can work properly. In order to find new data, pointers to that data must be updated, and these new pointers are normally also appended to the end of the log. This recursive updating of data and pointers can lead to a snowball effect, so Rosenblum et al. [RO92] came up with the idea of implementing inodes in log-structured filesystems.

Inodes are data structures containing file attributes, such as type, owner and permissions, as well as the physical addresses of the first ten blocks of the file. If the file data consists of more than 10 blocks, the inode will point to indirect blocks, which in turn point to data blocks or to lower-level indirect blocks; see Figure 2. The mapping from inodes to their physical locations is stored in a table kept in RAM. This table is periodically flushed to the log. When a crash occurs the table needs to be reconstructed, so the filesystem searches for the latest entry of this table in the log and scans the remaining part of the log to find files whose position changed after the table was flushed.

The main advantage of log-structured file systems is the favorable write speed: writes occur at the end of the log, so they rarely incur seek and rotational delays. The downside of log-structured filesystems is the read performance. Reading can be very slow as blocks of files may be scattered around, especially if blocks were modified at different times. In the case of magnetic hard disks this results in a lot of seek and rotational delays. This downside is the main reason log-structured filesystems are rarely used on magnetic hard disks.

However, log-structured filesystems are an excellent choice for flash devices, as old data cannot be overwritten, so a new version must be written to a new location anyway. Furthermore, read performance does not suffer on flash devices, as flash has uniformly low random-access times. Kawaguchi, Nishioka and Motoda were the first to point out that log-structured filesystems would be very suitable for flash memory [KNM95].

4.3 Microsoft's Flash File System

In the mid 1990s Microsoft developed a filesystem for removable flash memories, called FFS2. Documentation of the supposedly earlier version FFS1 cannot be found. The first patent used in the development of FFS2 [BL95] describes a system for NOR flash that consists of one large erase unit. This results in a write-once device, with the exception that bits that have not been cleared yet can be cleared later. FFS2 uses linked lists to keep track of the files and their attributes and data. When a file is extended, a new record is appended to the end of the linked list, and the next field of the last record of the current list (which was all ones before) is then programmed to point to the new record.

As can be seen in Figure 3, each record consists of 4 fields: a raw data pointer which points to the start of a data block, a data size field stating the length of the data used, a replacement pointer to be used in updates of data and a next pointer to be used when appending data to the file.

Updating data within a file is a more difficult problem. Because records point to raw data and data is written once, the replacement pointer is used to indicate that the
data pointer and next pointer are not valid anymore. The replacement pointer points to a new record that uses a part of the old data, while a second record points to the updated data. Figure 4 shows this lengthy and cumbersome approach.

Figure 2: An inode in the top of the figure can point directly to data at level 0, or via indirect blocks which support fragmented and/or big data files. Courtesy of Engel et al. [EBM07]

Figure 3: The data structure of the Microsoft Flash File System. The data structure shows a linked-list element pointing to a block of 20 raw bytes in a file. Courtesy of Gal et al. [GT05]

As can be seen, a big drawback of FFS2 is that for dynamic data that changes frequently, a long linked list has to be traversed in order to access all the data. Suppose we have a file in which the first 5 bytes of data have been updated 10 times. In this case, when we try to access the file, a chain of 10 invalid records needs to be traversed before the record is reached that points to the most recent data. This drawback is caused by the design decision to keep objects at static addresses, meaning that each file starts at the same physical address, no matter which version it is. This design makes it easy to find things in the filesystem, but requires long and inefficient traversal of invalid data chains to find current data.

The log-structured approach makes it more difficult to locate objects, as they are moved around on updates, but once an object is found, records pointing to invalid data do not need to be traversed.

Douglis et al. report very poor write performance
for FFS2 in 1994 [DCK+94], which is presumed to be the main reason why it failed and was reported as obsolete by Intel in 1998 [Cor98a].

Figure 4: Updating in FFS2. The data structure is modified to accommodate an update of 5 of the 20 bytes. The data and next-in-list pointers of the original node are invalidated. The replacement pointer, which was originally free (all 1s; marked in gray in the figure), is set to point to a chain of 3 new nodes, two of which point to still-valid data within the existing block, and one of which points to a new block of raw data. The last node in the new chain points back to the tail of the original list. Courtesy of Gal et al. [GT05]

4.4 JFFS

4.4.1 JFFS1

The Journaling Flash File System (JFFS1) was developed by Axis Communications AB [AC04] for use in embedded Linux systems. JFFS1 was designed to be used with NOR flash memory only. JFFS1 is a purely log-structured filesystem. Nodes containing metadata and possibly data are stored on the NOR flash in a circular log fashion. In JFFS1 there is only one type of node, the struct jffs_raw_inode, which is associated with a single inode by an inode number in its header. Next to the inode number, the header contains a version number of the node and filesystem metadata. The node may also carry a variable amount of data.

When the flash device is mounted, a scan is made of the entire medium. With the information found in the nodes, a complete direct map is reconstructed and stored in RAM. When a node is superseded by a newer node, that node is marked obsolete. When storage space runs low, garbage collection kicks in. Garbage collection examines the head of the circular log, copies valid nodes to the tail of the log and marks the originals at the head of the log obsolete. Once a complete erase block has been rendered obsolete, it is erased and made available for reuse by the tail of the log.

JFFS1 has several drawbacks:

• At mount time, the entire device must be scanned to construct the direct map. This scanning process can be very slow, and the space occupied in RAM by the direct map can be quite large, proportional to the number of files in the file system.

• With the circular log design, all data at the head of the log is cleaned out, even if the head of the log consists only of valid nodes. This is not only inefficient, it is also unfavorable in terms of wear leveling.

• Compression is not supported.

• Hard links are not supported.

• JFFS1 does not support NAND flash.
  9. 9. 4.4.2 JFFS2 David Woodhouse of Red Hat enhanced JFFS1 into JFFS2 [Woo01]. Compression using zlib, rubin or rtime is available. Hard linking and NAND flash memory are now supported. Instead of one type of node, JFFS2 uses three types of nodes: • inodes: just like the struct jffs raw inode in JFFS1, but without file name nor parent inode number. An inode is removed once the last directory entry referring to is has been unlinked. • dirent nodes: directory entries, holding a name and an inode number. Hard links are maintained with dif- ferent names but the same inode number. A link is removed by writing a dirent node with a higher ver- sion number, having the same name but with target inode number 0. • cleanmarker node: this node is written in an erased block to inform that the block has been properly erased, in case of a scan at mount time. Like in JFFS1, nodes with a lower version than the most recent one are considered obsolete. Instead of the circular log in JFFS1, the filesystem deals in blocks, which correspond to physical erase blocks in the flash device. A block containing only valid nodes is called clean, blocks having at least one obsolete node are called dirty and a free block only contains the cleanmarker node. When a JFF2 system is mounted, the system scans all nodes in the flash device and constructs two data structures, called struct jffs2 inode cache and struct jffs2 raw node ref. The first is a direct map from each inode number to the start of a linked list Figure 5: Two data structures in JFFS2. Each inode is rep- of the physical nodes which belong to that inode. The resented by the struct jffs2 inode cache, which second structure represents each valid node on the flash, points to the start of the chain of nodes representing the containing two linked lists, one pointing to the next node file. To indicate the end of the chain, the last node points in the same physical block, and the other list that points back to the inode. 
Courtesy of Gal et al. [GT05] to the next node belonging to the same inode. Figure 5 shows how these two data structures interconnect. When a JFF2 system is mounted, the system scans all nodes in the flash device and constructs two data structures, called struct jffs2 inode cache and struct jffs2 raw node ref. The first is a direct map from each inode number to the start of a linked list 9
of the physical nodes which belong to that inode. The second structure represents each valid node on the flash and contains two linked lists: one points to the next node in the same physical block, the other to the next node belonging to the same inode. Figure 5 shows how these two data structures interconnect.

Garbage collection frees up dirty blocks, turning them into free blocks. To provide wear leveling on semi-static data, JFFS2 picks a clean block instead of a dirty block once every 100 selections.

The big drawback of JFFS2 remains the mounting time. As JFFS2 also supports NAND and thus bigger flash devices, the time to scan the whole device becomes a serious problem.

4.5 YAFFS

Yet Another Flash File System was developed by Charles Manning of Aleph One [Man02]. YAFFS is the first NAND-only flash file system. YAFFS was made for NAND flash with 512-byte chunks and 16-byte headers (see section 2.2), while YAFFS2 supports bigger NAND chips with 1 KB or 2 KB pages and headers of 30 and 42 bytes respectively.

Because the earliest NAND flash memory with 512-byte chunks allowed up to three writes to the same area before an erasure was needed, YAFFS1 marked chunks as obsolete by rewriting a field in the header of each chunk. YAFFS2 required a more complex arrangement to obsolete chunks, because newer flash only supports a single write before an erasure is needed. In YAFFS2 every header contains not only a file ID and the position within the file, but also a sequence number. When multiple chunks with the same file ID and position within the file are encountered, the chunk with the highest sequence number counts and the others are considered obsolete.

When the system boots, a scan is performed to create a direct map that maps files to chunks using a tree-like structure (see Figure 6). To speed up the scan, YAFFS incorporates checkpointing, saving the RAM map in flash before a clean unmount. When the system boots, it reads the flash device from the end to the beginning, encountering the checkpoint fairly fast. Any write to the filesystem after the creation of a checkpoint renders the checkpoint invalid.

Figure 6: YAFFS Tnode tree of data chunks in a file. If the file grows in size, the levels increase. Each Tnode is 32 bytes big. Level 0 (i.e. lowest level) has 16 2-byte pointers to data chunks. Higher level Tnodes comprise 8 4-byte pointers to other Tnodes lower in the tree.

Garbage collection comes in a deterministic mode and an aggressive mode. The former is the normal mode, activated when a write occurs. When a write has been completed and there is a block that is completely filled with discarded chunks, it is garbage collected. The aggressive mode is activated once free space is running low, collecting blocks that contain valid chunks, copying the valid chunks to a free block and erasing the old erase block.

Wear leveling is not a high priority, as the authors argue that NAND devices are already shipped with bad blocks, so the filesystem needs to take care of bad blocks anyway. Uneven wear will only lead to loss of storage capacity, not to errors, as bad blocks are handled by the filesystem. YAFFS does not support compression.

4.6 LogFS

LogFS is a creation of Engel et al. [EM] [EBM07] in response to user comments that JFFS2 and YAFFS have high RAM usage and long mount times.

The flash medium is split into segments; each segment consists of multiple erase blocks. LogFS structures the device in three storage areas:

• Superblock (1 segment)
• Journal (2-8 segments)
• Object store
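As an aside to the YAFFS2 mount scan of section 4.5, the rule that the chunk with the highest sequence number wins can be illustrated with a minimal Python sketch. The ChunkHeader type, its field names and the function build_chunk_map are hypothetical names chosen for illustration; they are not the YAFFS2 on-flash format or API. The sketch only assumes, as the text does, that every header carries a file ID, a position within the file and a sequence number, and it does not handle ties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkHeader:
    """Hypothetical, simplified view of a YAFFS2 chunk header."""
    file_id: int      # file the chunk belongs to
    chunk_index: int  # position of the chunk within the file
    sequence: int     # larger value means written more recently
    location: int     # physical chunk address (illustrative only)


def build_chunk_map(scanned):
    """Return {(file_id, chunk_index): header}, keeping only live chunks.

    For every (file ID, position) pair the header with the highest
    sequence number is kept; all others are treated as obsolete.
    """
    live = {}
    for header in scanned:
        key = (header.file_id, header.chunk_index)
        current = live.get(key)
        if current is None or header.sequence > current.sequence:
            live[key] = header
    return live


# Chunk 0 of file 7 was rewritten once, so two copies are found on flash.
headers = [
    ChunkHeader(file_id=7, chunk_index=0, sequence=1, location=100),
    ChunkHeader(file_id=7, chunk_index=1, sequence=1, location=101),
    ChunkHeader(file_id=7, chunk_index=0, sequence=2, location=230),
]
live = build_chunk_map(headers)
```

In this toy scan the copy at location 100 is superseded by the copy at location 230, without any chunk header having been rewritten in place.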
The superblock contains global information such as the file system type. The journals will be discussed later. Each object store consists of one segment, in which all but the last erase block are normal data blocks. The last block of each segment contains a summary of the object store, containing for each data block its inode number, its logical position in the file and the physical offset of the block. Next to these block-specific fields, the summary maintains segment-global information such as erase count, write time, etc.

When an update of data occurs, the data block is rewritten out-of-place, so the pointer referring to the data must be updated (also out-of-place), which in turn requires an update of each parent node of that node. So basically each change at the bottom of the tree propagates upward all the way to the root. This method of updating the tree bottom-up is known as the wandering tree algorithm. A crash before the root node has been rewritten only causes the loss of the last operation, as the root node still points to the previous data and structure.

Because inodes do not have reserved areas in flash devices, LogFS stores the inodes in an inode file (ifile). The root inode of this ifile is stored in the journal. This design of ifile and normal files not only simplifies the code (file writes and inode writes are identical now), it also makes it possible to use hardlinks. Figure 7 shows the setup of the ifile and normal files. All data, inodes, indirect blocks and the ifile inode are stored in the flash, although the ifile inode is stored in the journal.

The journal is a circular log but much smaller than the log used in JFFS2, in which the log is the filesystem. In LogFS the small journal is filled with ifile inodes, being tuples of a version number and an offset. This offset points to the tree root node of the ifile.

Upon mounting, the system does not perform a full scan but only scans the superblock and the journal to find the most recent version of the root node. This approach improves the mount time by a big factor (Jörn Engel states that an OLPC system mount goes from 3.3 seconds under JFFS2 to 60 ms under LogFS [Cor07]).

As each updated and new data block would indirectly lead to a new version of the root node, the erase blocks containing the journals would wear at a rapid pace. Two solutions counter this aggressive wearing: write buffering and journal replacing. Write buffering stores updates in a buffer before applying them to the flash device. This not only increases write speed but also decreases the number of inode updates, as some data updates may have the same direct or indirect parent inode.

Journal replacing is activated when the erase blocks of the journal have been erased too often. A clean segment is designated as the new journal and the first entry in the first journal points to this new journal.

As LogFS is still under development, wear leveling is not optimized yet. As of January 2007, the segment picked for writing is the first empty segment encountered when scanning some segments ahead. Future development would optimize the wear leveling on the basis of age and/or erase count.

A free space counter is maintained in the journal and when space is running out, garbage collection comes into play. Due to the wandering tree algorithm and the fact that the tree is stored in flash itself, garbage collection in LogFS is complex. When garbage collection is needed on a segment containing valid nodes, one free segment is needed for each level of the tree, because blocks on different levels should be written to different segments and blocks on the same level should be written to the same segment. Because of this, LogFS becomes slow when the device is getting full.

The author states that LogFS is designed for big flash devices, ranging from gigabytes upwards. For smaller flash devices, the author recommends using JFFS2. LogFS was planned to be included in the 2.6.25 Linux kernel. LogFS has a code size of around 8 KLOC.

4.7 UBI and UBIFS

4.7.1 Unsorted Block Images - UBI

UBI is a flash management layer with almost the same functionality as the Logical Volume Manager (LVM) on hard drives, but with additional functions. It was designed by IBM [TG06]. UBI runs on top of a flash device, and UBIFS runs on top of UBI (see Figure 8). UBI has the following relevant functionalities:

• Bad block management
• Wear leveling across all physical flash
• Logical to physical block mapping
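The three functions in the list above can be illustrated together with a toy Python model. It is only a sketch under stated assumptions: the class TinyVolume, its method names and its policy (every write goes to the least-erased free physical erase block, and the previously mapped block is erased immediately) are invented for illustration and are not UBI's actual algorithm or interface, for which the reader is referred to [TG06].

```python
class TinyVolume:
    """Toy logical-to-physical erase block map (not UBI's real algorithm)."""

    def __init__(self, num_pebs, bad_pebs=()):
        bad = set(bad_pebs)
        # Bad block management: bad physical blocks are never handed out.
        self.erase_count = {p: 0 for p in range(num_pebs) if p not in bad}
        self.free = set(self.erase_count)
        self.leb_to_peb = {}   # logical erase block -> physical erase block
        self.data = {}         # physical erase block -> stored payload

    def write_leb(self, leb, payload):
        """Rewrite a logical erase block out-of-place."""
        if not self.free:
            raise RuntimeError("no free physical erase block left")
        # Wear leveling (assumed policy): pick the least-erased free block.
        target = min(self.free, key=lambda p: (self.erase_count[p], p))
        self.free.remove(target)
        self.data[target] = payload
        old = self.leb_to_peb.get(leb)
        self.leb_to_peb[leb] = target   # remap transparently
        if old is not None:
            del self.data[old]          # erase the superseded block
            self.erase_count[old] += 1
            self.free.add(old)

    def read_leb(self, leb):
        return self.data[self.leb_to_peb[leb]]


# Four physical blocks, block 1 is marked bad; logical block 0 is
# rewritten six times by the "file system" above.
vol = TinyVolume(num_pebs=4, bad_pebs=[1])
for i in range(6):
    vol.write_leb(0, f"v{i}")
```

Although the upper layer keeps rewriting the same logical erase block, the erasures in this model are spread over the three good physical blocks and the bad block is never used.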
Figure 7: LogFS. Combination of inode file and normal file tree structure. Directory entries are inodes with no pointer to data. Courtesy of Engel et al. [EBM07]

As we see, UBI hides these functions from higher layered filesystems. UBI provides a UBI volume to higher layers, consisting of logical erase blocks. Higher layer filesystems may rewrite a logical erase block over and over without danger of wearing, because UBI transparently changes the mapping to another physical erase block when it is time.

UBI is not an FTL, as it was designed for bare flash and not for flash devices such as MMC/SD cards, USB sticks, CompactFlash and so on. Consequently, neither ext2 nor other "traditional" file systems can be run on top of a UBI device. UBI weighs around 11 KLOC. For more information on UBI, the reader is referred to [TG06].

4.7.2 UBI Filesystem - UBIFS

UBIFS is developed by Nokia engineers with help of the University of Szeged [Hun08]. UBIFS is designed to work on top of UBI volumes; it cannot operate directly on top of MTD devices or FTLs [Hun08].

Figure 8: Layered structure of UBI, UBIFS and the flash device.

Basically the whole setup of UBI, UBIFS and MTD is as follows (see also Figure 8):
• MTD subsystem, providing a uniform interface to access raw flash
• UBI subsystem, the volume manager providing wear leveling, bad block management and logical to physical erase block mapping
• UBIFS filesystem, providing all other functionality filesystems should provide

In contrast, FFS, JFFS2, YAFFS and LogFS work directly on top of raw MTD devices.

As UBIFS runs on top of a UBI volume, it is not provided with physical erase blocks but with logical erase blocks (LEBs). As such, UBIFS does not need to take care of wear leveling, as that is handled by the UBI layer. Just like LogFS, UBIFS uses a wandering tree, in a tree just like the one pictured in Figure 7.

There are 6 areas in UBIFS whose positions are fixed at filesystem creation. The first area is the superblock, using one LEB. The second area consists of two LEBs filled with master nodes, which store the positions of all on-flash data structures that do not have fixed logical positions. To prevent data corruption, two LEBs are used instead of one.

The third fixed area is the log of UBIFS, designed to reduce the frequency of updates to the on-flash tree, as updated nodes can share the same parent. The log is part of the journal. Nodes that are updated are placed in the journal and the index tree in memory (called the TNC) is updated. Once the journal is full it is committed. The commit process consists of writing the new version of the on-flash tree and the corresponding master node. This process is based on two special types of nodes stored in the log: the commit start node, which records that the commit has begun, and the reference nodes, which record the LEB numbers of the LEBs in the rest of the journal. Those LEBs are called buds, so the journal consists of the log and the buds. The start of a commit is recorded by the commit start node, while the end of a commit is defined as the moment the master node has been written. After that the reference nodes are obsolete and can be deleted.

The fourth area is the LEB properties tree (LPT), which is a tree in which each leaf node represents information on each known LEB in the main area. The main area is discussed later. Each leaf node contains three values about each LEB in the main area: free space, dirty space and whether the LEB is an index eraseblock or not. Index nodes (being part of the on-flash tree) and non-index nodes are kept separate in different blocks, meaning eraseblocks contain either only index nodes or only non-index nodes. The free space can be used in new writes and the dirty space counter is used in garbage collection. The LPT is updated only during a commit.

The on-flash tree and LPT represent the filesystem just after the last commit. The difference between these two and the actual state of the filesystem is represented by the nodes in the journal.

The fifth area is called the orphan area, consisting of inode numbers whose inodes have a link count of zero. After a commit these inodes appear in the tree as leaves with no parent. This is possible when an unclean unmount occurs after an open file has been unlinked and committed. To delete these orphan nodes after an unclean unmount, either the entire on-flash tree must be scanned for unlinked leaf nodes, or a list of orphans must be kept somewhere. UBIFS incorporates the latter with the orphan area. When the link count of an inode drops to zero, the inode number is added to the orphan area as a leaf of the orphan tree. These inode numbers are deleted when the corresponding inode is deleted.

The sixth and last area is the main area, containing the data nodes and the on-flash tree (also called index). As described earlier, main area LEBs are either filled with index nodes or non-index nodes.

When a UBIFS is mounted, the LPT and on-flash tree are scanned, after which the journal is replayed to obtain the correct state of the filesystem.

The UBIFS code size is around 30 KLOC.

5 Comparison and Conclusion

Flash memory is growing rapidly in speed, capacity and popularity. Newer flash devices with bigger storage and higher speeds appear constantly, often at the expense of ease of use. This trend requires constant development of
software techniques for these newer flash devices.

Several approaches have been discussed, from the inefficient and potentially dangerous Flash Translation Layer and the first, abandoned Microsoft Flash FS, to more advanced dedicated flash filesystems like JFFS, YAFFS, LogFS and UBIFS. The first filesystems were designed for NOR flash, while nowadays NAND flash is commonly used in flash devices.

FTL is commonly used on removable devices like USB sticks, because so far the only filesystem that is supported by every system is FAT. This approach of FTL with a traditional filesystem works, but it is inefficient and potentially dangerous, as the FTL does not treat flash memory properties properly, even at the cost of potential data loss in case of a crash.

Microsoft's FFS had very poor performance and was abandoned early, but it gave other developers ideas for further development. JFFS was the first dedicated flash file system that brought good performance. JFFS1 was focused on NOR memory; JFFS2 was released later with several serious improvements, including NAND and hardlink support. JFFS scans the whole device at mount time, and with the introduction of NAND flash and its enormous size potential, the mount time became the major disadvantage of JFFS. The full structure of JFFS remains in memory, laying a heavy burden on RAM capacity.

YAFFS was developed for NAND flash only and to cope with the long scan time and high RAM usage of JFFS2. YAFFS maintains a smaller tree structure in RAM and supports checkpointing, decreasing the mount scan time dramatically if and only if the device is unmounted properly. In case of a crash the whole system needs to be scanned again. The author of YAFFS states that it is better to use JFFS2 on devices smaller than 64 MB and YAFFS on bigger devices.

LogFS solves the mounting time problem and high RAM usage by maintaining the tree structure in the flash itself, rather than reconstructing it with a scan and keeping it in RAM only. LogFS is created for large NAND devices of 1 GB and bigger, and performance drops when the device is almost full with valid data. The LogFS author states it is better to use JFFS2 for smaller devices.

UBI and UBIFS introduce a new approach for flash filesystems, using a layered approach to provide transparency and simplicity to the higher layered file system. This approach seems very promising, as other filesystems can be adapted to work on top of UBI, as a patched JFFS2 is already capable of. UBIFS also maintains on-flash trees to minimize mount times.

UBIFS and LogFS were developed around the same time and changes are still being implemented at the moment of writing. The codebase for the UBI/UBIFS combination is quite large in comparison to LogFS, respectively 11/30 and 8 KLOC.

References

[AC04] Axis Communications, Lund, Sweden. JFFS homepage. http://developer.axis.com/software/jffs/, 2004.
[Ban95] A. Ban. Flash file system. US patent 5,404,485. Filed March 8, 1993; issued April 4, 1995; assigned to M-Systems. 1995.
[Ban99] A. Ban. Flash file system optimized for page-mode flash technologies. US patent 5,937,425. Filed October 16, 1997; issued August 10, 1999; assigned to M-Systems. 1999.
[BL95] S. D. Barrett, P. L. Quinn and R. A. Lipe. System for updating data stored on a flash-erasable, programmable, read-only memory (FEPROM) based upon predetermined bit value of indicating pointers. US patent 5,392,427. Filed May 18, 1993; issued February 21, 1995; assigned to Microsoft. 1995.
[Cor98a] Intel Corporation. Flash file system selection guide. Application note 686. 1998.
[Cor98b] Intel Corporation. Understanding the flash translation layer (FTL) specification. Application note 648. 1998.
[Cor07] Corbet. LogFS. http://lwn.net/Articles/234441/, 2007.
[DCK+94] F. Douglis, R. Caceres, M.F. Kaashoek, K. Li, B. Marsh, and J.A. Tauber. Storage
alternatives for mobile computers. In Proceedings of the First USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 25–37, Monterey, California, 1994. ACM.
[EBM07] Jörn Engel, Dirk Bolte, and Robert Mertens. Garbage collection in LogFS. http://www.logfs.org/logfs/, 2007.
[EM] Jörn Engel and Robert Mertens. LogFS - finally a scalable flash file system. http://lazybastard.org/~joern/logfs1.pdf.
[GT05] Eran Gal and Sivan Toledo. Algorithms and data structures for flash memories. ACM Comput. Surv., 37(2):138–163, 2005.
[Hun08] A. Hunter. A brief introduction to the design of UBIFS. http://www.linux-mtd.infradead.org/doc/ubifs_whitepaper.pdf, 2008.
[KNM95] Atsuo Kawaguchi, Shingo Nishioka, and Hiroshi Motoda. A flash-memory based file system. In USENIX Winter, pages 155–164, 1995.
[Man02] Charles Manning. YAFFS: Yet another flash filing system, available at http://www.yaffs.net. Sep 2002.
[Mas] Chris Mason. Journaling with ReiserFS. http://www.linuxjournal.com/article/4466.
[RO92] Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1):26–52, 1992.
[Rob] Daniel Robbins. Introducing ext3. http://www-128.ibm.com/developerworks/linux/library/l-fs7.html.
[TG06] T. Gleixner, F. Haverkamp, and A. Bityutskiy. UBI - unsorted block images. http://www.linux-mtd.infradead.org/doc/ubidesign/ubidesign.pdf, 2006.
[Woo01] David Woodhouse. JFFS: The journaling flash file system. Presented in the Ottawa Linux Symposium, July 2001 (no proceedings); a 12-page article is available online at http://sources.redhat.com/jffs2/jffs2.pdf, 2001.