ext3/ext4 filesystems



This presentation describes the internals and on-disk layout of the ext3 and ext4 filesystems. It was given to a technical audience and therefore consists mostly of low-level technical details.

1 Comment
  • I liked the presentation... I understand that ext4 works very well with large files and large partitions. How does it fare on smaller partitions? Say I have a 10 MB flash partition with an ext4 filesystem. What kind of overhead can I expect? I tried a 5 MB partition and got only 2.5 MB of free space. Is there a way to break down the overhead?



  1. ext3/ext4 Filesystems. Kalpak Shah, Clogeny Technologies Pvt. Ltd.
  2. Agenda
     • Layout of EXT3/4
     • Essential on-disk data structures
     • New features in ext4
       • Extents, uninit_bg, nanosecond timestamps, 48-bit support, preallocation, mballoc, flex_bg, journal checksums
       • Their effects on the on-disk layout
     • Crash recovery
     • Latest filesystem design layouts
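  3. Basic layout of an EXT2/3/4 partition
     • All block groups are the same size (the last one may be smaller) and are stored sequentially.
     • The superblock and group descriptors are duplicated in multiple block groups, as determined by the SPARSE_SUPER feature.
     • Block sizes are powers of two starting at 1 KB. mke2fs accepts values up to 64 KB, but the kernel has traditionally mounted only filesystems whose block size does not exceed the system page size (4 KB on x86).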
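To make the SPARSE_SUPER rule concrete, here is a minimal Python sketch. It assumes the commonly documented placement rule (copies in groups 0 and 1 and in groups whose number is a power of 3, 5, or 7); the function names are illustrative and are not taken from e2fsprogs.

```python
def has_superblock_copy(group: int) -> bool:
    """Return True if this block group holds a superblock copy under sparse_super.

    Assumed rule: groups 0 and 1, plus every group whose number is a
    power of 3, 5, or 7. Group 0 holds the primary superblock.
    """
    if group in (0, 1):
        return True
    for base in (3, 5, 7):
        power = base
        while power < group:
            power *= base
        if power == group:
            return True
    return False


def groups_with_copies(group_count: int) -> list:
    """List every block group that carries a superblock copy."""
    return [g for g in range(group_count) if has_superblock_copy(g)]


if __name__ == "__main__":
    # A filesystem with 64 block groups, as in the 8 GB example on the next slide.
    print(groups_with_copies(64))
```

Without SPARSE_SUPER every block group would carry a copy, so the feature mainly saves space on filesystems with many groups.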
  4. Creating an ext3 FS
     • mkfs.ext3 -b 4096 -I 512 -i 8192 -J size=256 /dev/sda1
       • Block size consideration (-b)
       • Number of inodes and inode size (-i is bytes per inode, -I is the inode size)
       • Journal size (-J size, in MB)
     • For example, consider an 8 GB ext3 filesystem with a 4 KB block size. Each 4 KB block bitmap holds 32K bits and therefore describes 32K data blocks, that is, 128 MB. The filesystem therefore contains 64 block groups.
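The arithmetic on this slide can be checked with a short Python sketch. The function name and the returned keys are illustrative; the real mke2fs also rounds values and reserves space for metadata, so treat the results as approximations rather than what the tool will report.

```python
GIB = 1024 ** 3


def estimate_layout(fs_bytes: int, block_size: int,
                    bytes_per_inode: int, inode_size: int) -> dict:
    """Rough layout estimate for the mkfs.ext3 options -b, -i and -I."""
    blocks_per_group = block_size * 8           # one bitmap bit per block
    group_bytes = blocks_per_group * block_size
    group_count = fs_bytes // group_bytes
    inode_count = fs_bytes // bytes_per_inode   # one inode per bytes_per_inode
    return {
        "blocks_per_group": blocks_per_group,
        "group_bytes": group_bytes,
        "group_count": group_count,
        "inode_count": inode_count,
        "inodes_per_group": inode_count // group_count,
        "inode_table_bytes": inode_count * inode_size,
    }


if __name__ == "__main__":
    layout = estimate_layout(8 * GIB, block_size=4096,
                             bytes_per_inode=8192, inode_size=512)
    for key, value in layout.items():
        print(f"{key}: {value}")
```

With the options shown on the slide, the inode tables alone come to about 512 MB of the 8 GB, which is why the -i and -I choices matter.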
  5. Ext3 superblock
     • The ext3 superblock is stored in an ext3_super_block structure. Some important fields are listed here:
       • s_inodes_count, s_blocks_count, s_free_blocks_count, s_free_inodes_count, s_inode_size
       • s_blocks_per_group, s_inodes_per_group
       • s_mnt_count, s_max_mnt_count
       • s_feature_{compat, incompat, ro_compat}
       • s_uuid, s_volume_name
       • s_journal_inum, s_journal_dev
       • s_state, s_errors
  6. Group descriptors
     • Each block group has its own group descriptor, represented by the ext3_group_desc structure, which has these fields:
       • bg_block_bitmap
       • bg_inode_bitmap
       • bg_inode_table
       • bg_free_{blocks,inodes}_count
       • bg_used_dirs_count
     • Most of these fields are used by the inode and block allocators.
  7. ext3/ext4 inode
     • The on-disk ext3/4 inode structure has these fields:
       • i_mode, i_uid, i_gid
       • i_size, i_blocks
       • i_atime, i_mtime, i_dtime, i_ctime
       • i_links_count
       • i_block[EXT2_N_BLOCKS], where EXT2_N_BLOCKS is 15
       • i_generation (file version, for NFS)
       • i_file_acl
       • i_dir_acl (reused as i_size_high for regular files)
     • New in ext4:
       • i_extra_isize
       • i_size_high, l_i_blocks_high, l_i_file_acl_high
       • i_{ctime,mtime,atime,crtime}_extra
       • i_version_hi
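The 15 entries of i_block bound how many blocks the traditional indirect scheme can address. The Python sketch below works this out, assuming the usual split of 12 direct pointers plus one single, one double, and one triple indirect pointer, each 32 bits wide. Other fields, such as the 32-bit i_blocks counter, can impose a lower file-size limit in practice.

```python
def indirect_capacity(block_size: int = 4096) -> dict:
    """Blocks addressable through i_block[15] with indirect mapping."""
    ptrs_per_block = block_size // 4        # 32-bit block numbers
    capacity = {
        "direct": 12,                       # i_block[0..11]
        "single_indirect": ptrs_per_block,  # i_block[12]
        "double_indirect": ptrs_per_block ** 2,   # i_block[13]
        "triple_indirect": ptrs_per_block ** 3,   # i_block[14]
    }
    capacity["total_blocks"] = sum(capacity.values())
    capacity["total_bytes"] = capacity["total_blocks"] * block_size
    return capacity


if __name__ == "__main__":
    for name, value in indirect_capacity(4096).items():
        print(f"{name}: {value}")
```

Note that reaching a block behind the triple indirect pointer costs three extra metadata reads, which is part of the overhead that extents are meant to remove.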
  8. Directory layout
     • EXT3/4 implements a directory as a special kind of file whose data blocks store filenames along with the corresponding inode numbers. These data blocks contain structures of type ext3_dir_entry_2, which has the following fields:
       • Inode number
       • Directory entry length
       • Name length
       • File type
       • Name
     • Directory entries are indexed with a hashed tree of up to two levels for fast lookup.
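A small Python sketch of the linear entry format may help. It assumes the usual ext3_dir_entry_2 encoding (32-bit inode, 16-bit record length, 8-bit name length, 8-bit file type, then the name, with records padded to a multiple of 4 bytes). The helper names are illustrative, and the hashed index is not modeled.

```python
import struct

_HEADER = struct.Struct("<IHBB")    # inode, rec_len, name_len, file_type
FT_REG_FILE, FT_DIR = 1, 2          # file type codes for regular file, directory


def pack_entry(inode: int, name: bytes, file_type: int, rec_len=None) -> bytes:
    """Build one directory entry, padded with zeros up to rec_len."""
    min_len = (_HEADER.size + len(name) + 3) & ~3   # round up to 4 bytes
    if rec_len is None:
        rec_len = min_len
    if rec_len < min_len:
        raise ValueError("rec_len too small for this name")
    entry = _HEADER.pack(inode, rec_len, len(name), file_type) + name
    return entry.ljust(rec_len, b"\0")


def iter_entries(block: bytes):
    """Yield (inode, name, file_type) for each live entry in a directory block."""
    offset = 0
    while offset < len(block):
        inode, rec_len, name_len, file_type = _HEADER.unpack_from(block, offset)
        if rec_len < _HEADER.size:
            raise ValueError("corrupt directory entry")
        if inode != 0:                               # inode 0 marks an unused slot
            start = offset + _HEADER.size
            yield inode, block[start:start + name_len].decode(), file_type
        offset += rec_len                            # rec_len links to the next entry


if __name__ == "__main__":
    # The last entry's rec_len stretches to the end of the 1 KB block.
    block = (pack_entry(2, b".", FT_DIR) + pack_entry(2, b"..", FT_DIR)
             + pack_entry(12, b"hello.txt", FT_REG_FILE, rec_len=1024 - 24))
    for entry in iter_entries(block):
        print(entry)
```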
  9. Ext4 Features – Extents
     • Extents replace the traditional indirect block mapping scheme, which causes high metadata overhead and poor performance with large files.
     • An extent is a single descriptor that represents a range of contiguous blocks:

         struct ext4_extent {
                 __le32 ee_block;    /* first logical block */
                 __le16 ee_len;      /* number of blocks */
                 __le16 ee_start_hi; /* high 16 bits of physical block */
                 __le32 ee_start_lo; /* low 32 bits of physical block */
         };

     • The extent tree leads to efficient lookups and improves performance on sequential I/O as well as on mail server workloads.
     • Ext4 supports both the extent and the indirect mapping schemes, and files can be converted between the two formats.
  10. Extents
  11. Ext4 features
      • Large FS support
        • Ext3 uses 32-bit block numbers, so with a 4 KB block size the filesystem is limited to a maximum size of 16 TB.
        • Ext4 uses 48-bit block numbers. All on-disk structures that hold block numbers had to change to support them.
      • Persistent preallocation (fallocate support)
        • Applications such as large databases often write zeros to a file to guarantee a contiguous space reservation.
        • Ext4 improves this scenario by skipping the zero-out and marking the extents as uninitialized instead.
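The size limits follow directly from the width of the block number, as this short Python sketch shows. The function name is illustrative, and the result is only the addressing limit; tools and kernels of a given era may support less.

```python
TIB = 1 << 40
EIB = 1 << 60


def max_fs_bytes(block_number_bits: int, block_size: int = 4096) -> int:
    """Largest filesystem addressable with the given block-number width."""
    return (1 << block_number_bits) * block_size


if __name__ == "__main__":
    print("ext3 (32-bit):", max_fs_bytes(32) // TIB, "TiB")
    print("ext4 (48-bit):", max_fs_bytes(48) // EIB, "EiB")
```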
  12. Ext4 features
      • UNINIT_BG
        • For very large filesystems, e2fsck times are becoming unacceptable.
        • The uninitialized block groups feature uses flags in the group descriptor to indicate whether the block group is initialized. e2fsck can simply skip block groups that are marked as uninitialized.
        • The flags that mark a block group as uninitialized, together with the high watermark, are checksummed so that corruption can be detected.
        • We have seen a 2-10x speedup for e2fsck in many cases.
      • Nanosecond timestamp support
        • Uses the i_{atime, ctime, mtime, crtime}_extra fields.
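The slide does not say how the _extra fields are laid out. The Python sketch below assumes the encoding used by current kernels as I understand it: the low 2 bits extend the 32-bit seconds field and the upper 30 bits hold nanoseconds. The function names are illustrative, not kernel APIs.

```python
EPOCH_BITS = 2
EPOCH_MASK = (1 << EPOCH_BITS) - 1


def _as_signed32(value: int) -> int:
    """Interpret a 32-bit field as a signed integer."""
    return value - (1 << 32) if value & 0x80000000 else value


def encode_timestamp(seconds: int, nanoseconds: int) -> tuple:
    """Return (i_xtime, i_xtime_extra) for a timestamp (assumed layout)."""
    low = seconds & 0xFFFFFFFF
    epoch = ((seconds - _as_signed32(low)) >> 32) & EPOCH_MASK
    return low, (nanoseconds << EPOCH_BITS) | epoch


def decode_timestamp(xtime: int, extra: int) -> tuple:
    """Return (seconds, nanoseconds) from the two on-disk fields."""
    seconds = _as_signed32(xtime) + ((extra & EPOCH_MASK) << 32)
    return seconds, extra >> EPOCH_BITS


if __name__ == "__main__":
    fields = encode_timestamp(1_200_000_000, 123_456_789)
    print(fields, decode_timestamp(*fields))
```

Because the extra fields live beyond the original 128-byte inode, they are only available when the inode size is large enough to hold them.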
  13. Ext4 features
      • Multi-block allocator (mballoc)
        • Allocates multiple blocks at once using a buddy data structure.
        • Includes inode and group preallocation.
        • Includes special allocation modes for small files and GOAL blocks.
      • flex_bg
        • This feature groups the metadata (inode bitmap, block bitmap, and inode table) from a series of block groups at the beginning of a "flex" group in order to improve performance during heavy metadata operations.
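The buddy idea can be sketched in a few lines of Python: free blocks are merged pairwise into aligned power-of-two chunks so that a multi-block request can be satisfied in one step. This is a toy model with hypothetical function names, not the mballoc implementation, and it ignores preallocation and goal blocks.

```python
def buddy_chunks(free: list, max_order: int) -> dict:
    """Group free blocks into maximal aligned chunks of 2**order blocks.

    free[i] is True when block i is free. Returns {order: [start, ...]}.
    """
    chunks = {0: {i for i, is_free in enumerate(free) if is_free}}
    for order in range(1, max_order + 1):
        size, half = 1 << order, 1 << (order - 1)
        lower = chunks[order - 1]
        merged = set()
        for start in range(0, len(free) - size + 1, size):
            # Two free buddies of the lower order merge into one chunk.
            if start in lower and start + half in lower:
                lower.discard(start)
                lower.discard(start + half)
                merged.add(start)
        chunks[order] = merged
    return {order: sorted(starts) for order, starts in chunks.items() if starts}


def find_chunk(chunks: dict, nblocks: int):
    """Return (start, order) of the smallest chunk holding nblocks, or None."""
    for order in sorted(chunks):
        if (1 << order) >= nblocks:
            return chunks[order][0], order
    return None


if __name__ == "__main__":
    bitmap = [True] * 16
    bitmap[4] = False                     # a single used block
    chunks = buddy_chunks(bitmap, max_order=4)
    print(chunks)
    print(find_chunk(chunks, 3))
```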
  14. Crash recovery - JBD/2
      • First, a copy of the blocks to be written is stored in the journal. Once the I/O transfer to the journal has completed (the commit block is written), the blocks are written to their final locations in the filesystem. After a crash, committed transactions are replayed from the journal.
      • Journaling modes:
        • Journal: all data and metadata are journaled.
        • Ordered: only metadata changes are journaled. Data blocks are written to disk before the related metadata is committed, to avoid data corruption.
        • Writeback: only metadata is journaled, with no ordering between data and metadata writes. This is the fastest mode.
      • Journal checksums
        • All blocks in a transaction are checksummed and the checksum is stored in the commit block.
        • While replaying a transaction (either by e2fsck or by ext4), this checksum ensures that corrupt or partial transactions are not written to disk.
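The commit-then-replay idea with checksums can be modeled in Python as below. This is a toy in-memory model under stated assumptions: the Transaction class, the use of zlib.crc32, and the choice to stop replay at the first bad transaction are all illustrative and do not reflect the JBD2 on-disk format.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Transaction:
    tid: int
    blocks: Dict[int, bytes]                 # final block number -> new contents
    commit_checksum: Optional[int] = None    # None: commit block never written


def checksum(blocks: Dict[int, bytes]) -> int:
    """CRC32 over all journaled blocks, in block-number order."""
    crc = 0
    for block_no in sorted(blocks):
        crc = zlib.crc32(blocks[block_no], crc)
    return crc


def commit(txn: Transaction) -> None:
    """Model writing the commit block: record the checksum of the blocks."""
    txn.commit_checksum = checksum(txn.blocks)


def replay(journal: List[Transaction], disk: Dict[int, bytes]) -> List[int]:
    """Apply committed, intact transactions to disk; return their ids."""
    replayed = []
    for txn in journal:
        if txn.commit_checksum is None or txn.commit_checksum != checksum(txn.blocks):
            break                            # partial or corrupt: stop replaying
        disk.update(txn.blocks)
        replayed.append(txn.tid)
    return replayed
```

Without the checksum, a commit block that reached disk before all of its journaled blocks did would cause stale journal contents to be replayed over good data.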
  15. Latest filesystem design layouts
      • Trees
        • Recent filesystems such as ZFS, Btrfs, and Tux3 use indexed trees to manage directories, blocks, objects (inodes, EAs), and snapshots efficiently. With 64-bit or 128-bit pointers, the practical limits imposed on filesystems effectively disappear: number of inodes, EA sizes, number of files within a directory.
      • Checksumming
        • All data and metadata are checksummed for early detection and possible correction of corruption.
      • Built-in volume manager
        • The volume manager and the filesystem are tightly coupled to take advantage of mirroring and RAID-like functionality.
      • Built-in encryption and compression