Quo vadis Linux File Systems
Ext4 or BTRFS
Udo Seidel
OSDC 2011 2
Agenda
● Introduction/motivation
● ext4 – the new member of the extfs family
● Facts, specs
● Migration
● BTRFS – the newbie .. the hope
● Facts, specs
● Migration
● Summary
Linux file systems
● More than 50 file systems shipped with the Linux kernel
● Local
● Remote
● Cluster
● ...
● A few as standard for root directory
● ext2, ext3
● XFS
Linux file systems – challenges
● ReiserFS sun-setted
● Limitations of ext3
● Changes in recent Enterprise distributions
Linux file systems – new players
● New version of the ext family -> ext4
● Marked as stable
● Shipped with Enterprise distributions
● New approach with BTRFS
● Still experimental
● Default by some projects, e.g. MeeGo
4th extended file system
● Shipped since 2.6.19
● Stable since 2.6.28
● To overcome limits of ext3
● Size
● Performance
Ext4 - history
● Successor of ext3
● Started as set of patches for ext3
● Later forked
● First called ext3dev (sometimes ext4dev)
● No impact on ext3 stability
● Fewer dependencies on ext3 code
● Easier to maintain source code
Ext4 - facts
● Max volume size: 1 EByte = 1024 PByte
● Max file size: 16 TByte
● Max length of file name: 255 Bytes
● Support of extended attributes
● No encryption
● Not really compression
● Partially 64bit
Ext4 – starting from known
● Known tools
● mkfs
● fsck
● tune2fs
● e2label
Ext4 – global structure I
● Entry point -> superblock
● Block size
● Number of blocks and inodes
● Number of free blocks and inodes
● Disk divided in block groups
● backup of superblock
● Block group description (inode/block bitmaps)
Ext4 – global structure II
● Similar to ext3
● Inherits some ext3 limitations
● Number of inodes per block group
● 2nd type of block groups => flexible
● Flexible placement of bitmaps
● Bigger inodes to store additional information
● 256 Bytes
● Nanosecond time stamps
Ext4 – from blocks to extents
● Common addressing for modern file systems
● Contiguous area of blocks
● Less management information needed
● Fewer meta data operations
● Less “fragmentation”
● Requires change of on-disk format
Ext4 – extent I
● 15 bit for extent size
● Block size of 4 KByte => 128 MByte
● 1 bit for extent initialization information
struct ext4_extent {
  __le32  ee_block; /* first logical block extent covers */
  __le16  ee_len;  /* number of blocks covered by extent */
  __le16  ee_start_hi; /* high 16 bits of physical block */
  __le32  ee_start_lo; /* low 32 bits of physical block */
};
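The 15 bit length plus 1 bit initialization flag in ee_len can be sketched as follows. This is a minimal illustration, assuming the convention that raw values above 2^15 mark an uninitialized (preallocated) extent; the helper names are made up for this example and are not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: raw ee_len values up to 1 << 15 describe initialized extents. */
#define DEMO_EXTENT_INIT_MAX_LEN (1U << 15)

/* True if the raw ee_len value marks an uninitialized (preallocated) extent. */
bool demo_extent_is_uninitialized(uint16_t ee_len)
{
    return ee_len > DEMO_EXTENT_INIT_MAX_LEN;
}

/* Number of blocks the extent covers, with the flag information stripped. */
uint32_t demo_extent_block_count(uint16_t ee_len)
{
    return demo_extent_is_uninitialized(ee_len)
               ? (uint32_t)ee_len - DEMO_EXTENT_INIT_MAX_LEN
               : ee_len;
}

/* Largest initialized extent in bytes for a given block size in bytes. */
uint64_t demo_extent_max_bytes(uint32_t block_size)
{
    assert(block_size > 0);
    return (uint64_t)DEMO_EXTENT_INIT_MAX_LEN * block_size;
}
```

With a block size of 4096 bytes, demo_extent_max_bytes() yields 134217728 bytes, which is the 128 MByte from the slide.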
Ext4 – extent II
● 32 bit for block addresses inside file
● Block size of 4 KByte => 16 TByte
● 48 (!) bit for block addresses of file system
● Block size of 4 KByte => 1 EByte
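The two limits follow directly from the field widths of struct ext4_extent. The sketch below joins ee_start_hi and ee_start_lo into one 48 bit block number and recomputes both sizes; the function names are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Join the split physical block number: ee_start_hi holds bits 32..47. */
uint64_t demo_extent_physical_block(uint16_t ee_start_hi, uint32_t ee_start_lo)
{
    return ((uint64_t)ee_start_hi << 32) | ee_start_lo;
}

/* Bytes reachable with address_bits wide block numbers; block_size in bytes. */
uint64_t demo_addressable_bytes(unsigned address_bits, uint32_t block_size)
{
    assert(address_bits < 64 && block_size > 0);
    return ((uint64_t)1 << address_bits) * block_size;
}
```

For 4 KByte blocks, 32 bit give 2^44 bytes (16 TByte per file) and 48 bit give 2^60 bytes (1 EByte per file system).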
Ext4 – extent III
● 60 Byte for extent information
● 12 Byte for extent header
● 12 Byte for extent structure
– Up to 4 extents per inode
– max. 512 MByte direct addressable (ext3: 48 KByte)
– Different schema for bigger files
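The numbers on this slide can be checked with a few lines of arithmetic. This is a sketch under the slide's own figures (60 Byte area, 12 Byte header, 12 Byte per extent, 32768 blocks per extent) plus the 12 direct block pointers of an ext3 inode; all identifiers are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

enum {
    DEMO_INODE_EXTENT_AREA  = 60, /* bytes in the inode for extent data */
    DEMO_EXTENT_HEADER_SIZE = 12, /* bytes used by the extent header */
    DEMO_EXTENT_ENTRY_SIZE  = 12, /* bytes per extent structure */
    DEMO_EXT3_DIRECT_BLOCKS = 12  /* direct block pointers in an ext3 inode */
};

/* Number of extent structures that fit into the inode itself. */
unsigned demo_inline_extent_slots(void)
{
    return (DEMO_INODE_EXTENT_AREA - DEMO_EXTENT_HEADER_SIZE) / DEMO_EXTENT_ENTRY_SIZE;
}

/* Bytes directly addressable from an ext4 inode (32768 blocks per extent). */
uint64_t demo_ext4_inline_reach(uint32_t block_size)
{
    return (uint64_t)demo_inline_extent_slots() * 32768u * block_size;
}

/* Bytes directly addressable from an ext3 inode via its direct pointers. */
uint64_t demo_ext3_direct_reach(uint32_t block_size)
{
    return (uint64_t)DEMO_EXT3_DIRECT_BLOCKS * block_size;
}
```

With 4 KByte blocks this gives 4 extents, 512 MByte for ext4 and 48 KByte for ext3.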
Ext4 – extent tree I
● For files > 512 MByte
● B+ tree
● Extent structure only at leaf nodes
● New element: extent index
● Same header structure as data extent
● Points to data block
● Data block contains either extent index or extent structure
Ext4 – extent tree II
Ext4 – from extents to blocks
● In the end: block allocation
● New features
● Multi-block allocation
● Delayed allocation
● Persistent allocation
Ext4 – multi-block allocation
● Ext3: only one block
● 12800 calls for 50 MByte file
● Ext4: multiple blocks per call
● Less overhead
● Contiguous physical location of data
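The call count for ext3 is plain division, and the same formula shows the effect of handing out several blocks per call. The batch size below is a purely hypothetical parameter for illustration; the real ext4 allocator does not work with one fixed batch size.

```c
#include <assert.h>
#include <stdint.h>

/* Allocator calls needed for a file when each call hands out
 * blocks_per_call blocks (1 models the ext3 behaviour on the slide). */
uint64_t demo_allocator_calls(uint64_t file_bytes, uint32_t block_size,
                              uint32_t blocks_per_call)
{
    assert(block_size > 0 && blocks_per_call > 0);
    /* Round up to whole blocks, then to whole calls. */
    uint64_t blocks = (file_bytes + block_size - 1) / block_size;
    return (blocks + blocks_per_call - 1) / blocks_per_call;
}
```

A 50 MByte file with 4 KByte blocks needs 12800 calls at one block per call; with an assumed 2048 blocks per call it would need 7.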
Ext4 – delayed allocation
● Ext3
● Instant block allocation
● Fragmentation due to buffers and caches
● Ext4
● Delayed block allocation
● Use cache information for placement
● Risk of data loss in early versions => improved since 2.6.30
Ext4 – “clever” allocation
● Support of system call fallocate()
● Application reserves blocks ahead
● File system ensures disk space availability
● Allocation information in extent structure
● Remember the 16th bit of ee_len (extent initialization information)
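From the application side, reserving blocks ahead might look like the sketch below. It uses posix_fallocate(), the portable wrapper that glibc maps to fallocate() where the file system supports it; the function name demo_reserve_file_space and the file mode are assumptions made for this example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Reserve length bytes for path, creating the file if needed.
 * Returns 0 on success or an errno value on failure. */
int demo_reserve_file_space(const char *path, off_t length)
{
    int fd = open(path, O_CREAT | O_WRONLY, S_IRUSR | S_IWUSR);
    if (fd < 0)
        return errno;

    /* posix_fallocate() returns the error number directly, not via errno. */
    int rc = posix_fallocate(fd, 0, length);
    close(fd);
    return rc;
}
```

After a successful call the file system has guaranteed the space, so later writes into that range should not fail for lack of disk space.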
Ext4 – consistent status
● New journaling => JBD2
● Transactions have checksums
● 64 bit ready
● Deactivation possible
Ext4 – repair
● Improved fsck()
● No check of unused blocks
– information stored in block group header
– Information secured via checksums
– (de)activation possible at any time
● First run as slow as in ext3
Ext4 – other news
● Nanosecond precision time stamps
● Unix millennium bug shifted to 2514
● More subdirectories
● Up to 65000
● More than 65000 ... with limitation
Ext4 – general migration paths
● mkfs() and backup/restore
● Clean new file system structure
● Only way for file systems other than ext2/3
● Extended outage
● Conversion via tune2fs
● Partial only
● Only possible for ext family
● Faster/easier
Ext4 – background for migration
● 2 kinds of changes compared to ext3
● Change of on-disk format:
– Extents
– Only enabled for new files via tune2fs
– Additional tasks needed
● No change of on-disk format:
– Block allocation
– Immediately enabled via tune2fs
Ext4 – migration via tune2fs
● Results in mix of ext3 and ext4 structure
● Access via ext3 driver impossible
● fsck() needed
Parameter    Description
extent       Extent-based block allocation
flex_bg      Flexible placement of meta data
uninit_bg    Flag uninitialized blocks for faster fsck
dir_nlink    Unlimited number of subdirectories
extra_isize  Timestamps with nanoseconds
Ext4 – migration hints
● fsck() recommended
● /boot – booting from ext4 possible?
● Rescue media enabled for ext4?
Ext4 – summary
● Good successor of ext3
● Manages larger amounts of data
● Faster
● Performance
● recovery
● Safer
● Sufficient migration options from ext2/3
Better/b-tree file system
● Shipped since 2.6.29
● Still experimental
● Replace ext3/4
● New storage management approach
BTRFS - history
● Basic idea
● Shown 2007
● Usage of B trees for standard structures
● Not new ... see XFS, ReiserFS
● Chris Mason
● Worked on ReiserFS for SUSE
● Moved to Oracle -> started BTRFS development
BTRFS - facts
● Max file/volume size: 16 EByte
● Max length of file name: 255 Bytes
● Support of
● Extended attributes
● Encryption (planned, not yet implemented)
● Compression
● Snapshot
● Copy-on-Write
BTRFS – global structure
● Entry point -> superblock
● More than one file system per volume
● Extents
● Put together in block groups
● No mix of data and meta data
BTRFS – internals: the trees
● Consists of B+ trees
● Root tree
● File system tree
● Extent allocation tree
● Checksum tree
● Log tree
● Chunk & device tree
● Data relocation tree
BTRFS – internals: structures
● 3 structures
● Key
– index of the tree structure
● Block header
– ID of file system
– Reference of insert time
– Level position
● Item
– Different types: inodes, extents, directories
BTRFS – internals: the key
● Index of the tree structure
● Size: 136 bit
● First 64 bit: unique object ID
● Next 8 bit: type/item
● Last 64 bit: item dependent
● e.g. Hash of directory name
● e.g. Number of elements in directory
● e.g. object ID of upper layer directory
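The 136 bit key layout maps naturally onto a packed C structure. The sketch below is an illustration with made-up names (demo_btrfs_key, demo_key_compare); it assumes the GCC/Clang packed attribute and a sort order of object ID first, then type, then the item dependent part.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: GCC/Clang packed attribute, so the struct is exactly
 * 17 bytes = 136 bit without padding. */
struct demo_btrfs_key {
    uint64_t objectid; /* first 64 bit: unique object ID */
    uint8_t  type;     /* next 8 bit: item type */
    uint64_t offset;   /* last 64 bit: meaning depends on the item type */
} __attribute__((packed));

/* Order keys by objectid, then type, then offset; returns -1, 0 or 1. */
int demo_key_compare(const struct demo_btrfs_key *a,
                     const struct demo_btrfs_key *b)
{
    if (a->objectid != b->objectid)
        return a->objectid < b->objectid ? -1 : 1;
    if (a->type != b->type)
        return a->type < b->type ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}
```

Because the object ID is compared first, all items belonging to one object end up next to each other in the tree.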
BTRFS – internals: the item
● More than one item per object ID possible
Item         Value
INODE_ITEM   1
XATTR_ITEM   24
DIR_ITEM     84
DIR_INDEX    96
EXTENT_DATA  108
EXTENT_CSUM  128
ROOT_ITEM    132
EXTENT_ITEM  168
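The item values go into the 8 bit type field of the key. A small lookup table, sketched here with hypothetical helper names and only the values listed on the slide, shows how a type byte could be translated in both directions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct demo_item_entry {
    const char *name;
    uint8_t value;
};

/* Only the item types listed on the slide. */
static const struct demo_item_entry demo_items[] = {
    {"INODE_ITEM", 1},    {"XATTR_ITEM", 24},   {"DIR_ITEM", 84},
    {"DIR_INDEX", 96},    {"EXTENT_DATA", 108}, {"EXTENT_CSUM", 128},
    {"ROOT_ITEM", 132},   {"EXTENT_ITEM", 168},
};

#define DEMO_ITEM_COUNT (sizeof(demo_items) / sizeof(demo_items[0]))

/* Name for a type value, or NULL if the value is not in the table. */
const char *demo_item_name(uint8_t value)
{
    for (size_t i = 0; i < DEMO_ITEM_COUNT; i++)
        if (demo_items[i].value == value)
            return demo_items[i].name;
    return NULL;
}

/* Type value for a name, or -1 if the name is not in the table. */
int demo_item_value(const char *name)
{
    for (size_t i = 0; i < DEMO_ITEM_COUNT; i++)
        if (strcmp(demo_items[i].name, name) == 0)
            return demo_items[i].value;
    return -1;
}
```

The gaps between the values leave room for further item types without renumbering.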
BTRFS – more about trees
● Highest layer
● Root tree
● Referenced in superblock
● Other trees => object ID in root tree
● Some trees unique
● Extent allocation
● Data relocation
● Possibly multiple trees
● File system
BTRFS – file system tree
● Visible part
● Contains:
● Inode items
● Reference items
● No data of files
● See extents
● Exception: small files
BTRFS – extent allocation tree
● Space management
● Backward reference
● file system object
● Possibly multiple per extent
● Maybe move to extent data reference object
BTRFS – other trees
● Log tree
● Collects fsync() calls
● Journal of this kind of COW calls
● Checksum tree
● CRC32 checksums of data and meta data
● Chunk tree
● Manage devices: device item and chunk map item
● Device tree
● Counterpart of chunk tree
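To give an idea of what the checksum tree stores per block, here is a bitwise checksum routine. It is a sketch that assumes the CRC-32C (Castagnoli) polynomial with the common initial value and final inversion; the name demo_crc32c is invented, and the seeding used on disk may differ from this plain variant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (reflected polynomial 0x82F63B78).
 * Assumption: initial value and final XOR are both 0xFFFFFFFF. */
uint32_t demo_crc32c(const void *data, size_t length)
{
    const uint8_t *bytes = data;
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < length; i++) {
        crc ^= bytes[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Mask is all ones when the lowest bit is set, else zero. */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0x82F63B78u & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

A production implementation would use a table driven or hardware accelerated variant; the bit by bit loop is only meant to show the principle.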
BTRFS – device management
● Included volume manager
● pool concept
● RAID-0 and RAID-1
● For data and meta data
● Not necessarily identical
● Chunk tree
● abstract from disk block
BTRFS – extents, chunks, blocks
BTRFS – what else
● Transparent compression via zlib
● Support of POSIX ACLs
● Online grow/shrink
● Online add/removal of disks
● No fsck() tool (yet)
● Management tool evolution (btrfsctl -> btrfs)
BTRFS – migration I
● Via tool btrfs-convert
● du/df not fully BTRFS-aware
● In place from ext3/4
● Via libext2fs
● BTRFS meta data location flexible
● Old ext3/4 organized in snapshot
● Roll-back possible to date/time of conversion
BTRFS – migration II
BTRFS summary
● Still experimental
● Meets standard file system requirements
● Bridges existing gaps
● e.g. snapshots
● easy migration from ext3/4 possible
● New approach to storage management
● e.g. included volume manager
Summary
● Improvement moving to ext4
● Safe switching to ext4
● In place migration from ext3 possible
● Future is BTRFS
● In place migration from ext3/4 to BTRFS possible
References
● http://ext4.wiki.kernel.org
● http://btrfs.wiki.kernel.org
Thank you!
