Ext4 filesystem(1)

Filesystem (1)
2014. 6. 28
Yoshihiro YUNOMAE

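Goals of this presentation
• Get an overview of the ext4 filesystem as a whole
• Fill the gaps with the existing literature (the "red book" and the "black book") by reviewing ext4-specific features
– Basic processing is the same as in ext2/ext3
– Processing shared with ext2/ext3 is also touched on here and there
– This session focuses on static features (mainly the disk layout)
• The VFS layer and journaling are left for later sessions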

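Basic glossary
• inode
– Management structure for a file's data, in one-to-one correspondence with the file (inode structure)
• Depending on context, may also refer to the inode number (which is also unique)
• super block
– Structure that manages the configuration of the filesystem (super_block structure)
• Block
– Smallest unit managed by the filesystem
• 1KB/2KB/4KB/8KB; the maximum size depends on the page size
• 4KB is standard on Intel (1 block = 4KB is assumed unless otherwise noted)
– Depending on context, may also mean a cluster, i.e. a collection of blocks
• Metadata
– Management information for file data, stored on disk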

Agenda
• Rough feature comparison of ext2/ext3/ext4
• ext4 disk layout
• ext4-specific features
– Includes some features shared with ext2/ext3

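What is a filesystem?
• A mechanism that presents data on disk to users in units called "files"
– The disk itself is managed in sectors
– The filesystem manages data in blocks (groups of several sectors) and shows users the part they request
– Besides on-disk data, there are also interfaces for making requests directly to the kernel (procfs, sysfs, debugfs, etc., which live in memory)
• This presentation covers only the mechanisms that manage on-disk data
• Filesystems have a hierarchical structure
• The filesystems the running kernel currently provides can be listed by reading /proc/filesystems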

ext2/ext3/ext4 (1)
• Probably the most widely used filesystems
– The default in many distributions
– In RHEL7, xfs became the default
• History
– ext2: introduced in 1993 (kernel-0.99)
– ext3: introduced in 2001 (kernel-2.4.15)
– ext4: introduced in 2006 (kernel-2.6.19); stable release in 2008 (kernel-2.6.28)

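ext2/ext3/ext4 (2)
• Basically mutually compatible
Example: mounting ext2 as ext3
– Depending on the options, forward compatibility is not preserved
– Example: the extent feature
• Rough differences
– ext3: ext2 + journal
– ext4: ext3 + larger capacity + better performance + improved reliability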

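ext2/ext3/ext4 (3)
Feature | Description | ext2 | ext3 | ext4
Max file/filesystem size (4KB/block) | | 2TB/16TB | 2TB/16TB | 16TB/1EB
Max subdirectories | | 32,000 | 32,000 | 65,000/no limit
Journaling | Writes metadata to a journal area (prevents corruption of real data, speeds up fsck) | - | ○ (JBD) | ○ (JBD2)
Extent | Manages data in units of contiguous blocks (reduces the management area) | - | - | ○
Delayed allocation | Allocates blocks only when data is actually written to disk (improves write performance) | - | - | ○
Multiblock allocation | Allocates contiguous blocks in one go (improves performance, avoids fragmentation) | - | - | ○
Persistent preallocation | Preallocates blocks per file (done with fallocate(2); for media streaming and databases) | - | - | ○
* Compiled with reference to Wikipedia and the ext4 wiki.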

ext2/ext3/ext4 (4)
• Other differences in ext4
– Inode reservation when a directory is created
– Nanosecond timestamps
– Larger inodes (default 128 bytes -> 256 bytes)
– Barriers on by default
– No-journal mode
– Faster fsck
* Compiled with reference to Wikipedia and the ext4 wiki.

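COMPAT/RO_COMPAT/INCOMPAT
• Names of (some of) the flags used to maintain compatibility
• EXTx_FEATURE_(COMPAT|RO_COMPAT|INCOMPAT)_SUPP
– The set of flags that each filesystem supports
– Checked at mount time to decide whether the filesystem can be mounted
– Managed in members of the extx_super_block structure: s_feature_compat/s_feature_ro_compat/s_feature_incompat
• COMPAT: compatible. Unknown flags cause no problem.
• RO_COMPAT: the filesystem can still be mounted read-only even if unknown flags are present.
• INCOMPAT: the filesystem cannot be mounted if unknown flags are present.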
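The three rules above can be sketched as a small decision function. This is only an illustration of the idea, not the kernel's mount path: `check_features`, `enum mount_result`, and the mask parameters are hypothetical names, with the `*_supp` arguments standing in for the EXTx_FEATURE_*_SUPP masks.

```c
#include <stdint.h>

/* Hypothetical result codes for this sketch (not kernel names). */
enum mount_result { MOUNT_RW_OK, MOUNT_RO_ONLY, MOUNT_REFUSED };

/*
 * incompat / ro_compat: feature words read from the superblock.
 * incompat_supp / ro_compat_supp: features this driver understands.
 * Unknown COMPAT bits never block a mount, so they are not passed in.
 */
enum mount_result check_features(uint32_t incompat, uint32_t ro_compat,
                                 uint32_t incompat_supp,
                                 uint32_t ro_compat_supp)
{
    if (incompat & ~incompat_supp)
        return MOUNT_REFUSED;   /* unknown INCOMPAT bit: cannot mount */
    if (ro_compat & ~ro_compat_supp)
        return MOUNT_RO_ONLY;   /* unknown RO_COMPAT bit: read-only only */
    return MOUNT_RW_OK;
}
```

For example, a driver that knows only INCOMPAT bit 0x2 would refuse a filesystem whose s_feature_incompat has bit 0x40 set, regardless of the other two words.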

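mkfs feature options (1)
Features | EXTX_FEATURE_COMPAT_ | ext2 | ext3 | ext4
has_journal | HAS_JOURNAL | - | ○ | ○
ext_attr | EXT_ATTR | ○ | ○ | ○
dir_index | DIR_INDEX | - | ○ | ○
resize_inode | RESIZE_INODE | - | ○ | ○
sparse_super2 | SPARSE_SUPER2 | - | - | ○
Can be mounted even if unknown flags are present
○/-: feature supported/not supported
(highlighted in the original slides): features covered in this presentation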

Features | EXTX_FEATURE_RO_COMPAT_ | ext2 | ext3 | ext4
sparse_super | SPARSE_SUPER | ○ | ○ | ○
large_file | LARGE_FILE | ○ | ○ | ○
huge_file | HUGE_FILE | - | - | ○
uninit_bg | GDT_CSUM | - | - | ○
uninit_groups | GDT_CSUM | - | - | ○
dir_nlink | DIR_NLINK | - | - | ○
extra_isize | EXTRA_ISIZE | - | - | ○
quota | QUOTA | - | - | ○
bigalloc | BIGALLOC | - | - | ○
metadata_csum | METADATA_CSUM | - | - | ○
Can be mounted read-only even if unknown flags are present

Features | EXTX_FEATURE_INCOMPAT_ | ext2 | ext3 | ext4
filetype | FILETYPE | ○ | ○ | ○
needs_recovery | RECOVER | - | ○ | ○
journal_dev | JOURNAL_DEV | - | ○ | ○
extent | EXTENTS | - | - | ○
extents | EXTENTS | - | - | ○
meta_bg | META_BG | ○ | ○ | ○
64bit | 64BIT | - | - | ○
mmp | MMP | - | - | ○
flex_bg | FLEX_BG | - | - | ○
inline_data | INLINE_DATA | - | - | ○
Cannot be mounted if unknown flags are present

/etc/mke2fs.conf
• Configuration file that manages options for the mkfs command
[defaults]
base_features = sparse_super,filetype,resize_inode,dir_index,ext_attr
default_mntopts = acl,user_xattr
enable_periodic_fsck = 0
blocksize = 4096
inode_size = 256
inode_ratio = 16384
[fs_types]
ext3 = {
features = has_journal
}
ext4 = {
features = has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize
auto_64-bit_support = 1
inode_size = 256
}

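ext4 disk layout
• Basic layout
– Basically the same as ext2/ext3 (the metadata contents differ)
– The layout changes if certain options are specified at mkfs time (see the following slides)
[Figure: the disk is divided into block group 0, block group 1, ..., block group N. A block group is laid out as: padding (1024 bytes, at the start of the disk) | Super Block (1 block) | Group Descriptors (n blocks) | Reserved GDT Blocks (n blocks) | data block bitmap (1 block) | inode bitmap (1 block) | inode table (n blocks) | data blocks (n blocks). The original figure color-codes padding and metadata.]
padding: reserved for the boot sector
Super Block: stores the ext4_super_block structure. It manages the whole filesystem, so corruption here is serious.
Group Descriptors: stores the group descriptors (ext4_group_desc structure) of all block groups.
Reserved GDT Blocks: reserved for future use
data block bitmap: tracks free data blocks within the block group
inode bitmap: tracks free inodes within the block group
inode table: stores inode structures (256 bytes by default since ext4, so 16 per block)
data blocks: the actual data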
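As a worked example of the "16 inodes per block" arithmetic, the following sketch locates an inode inside the inode tables. It uses the usual ext2-family arithmetic (inode numbers start at 1); `struct inode_loc` and `locate_inode` are hypothetical names for this illustration, and the inodes-per-group value used below is only an assumed example.

```c
#include <stdint.h>

/* Hypothetical helper type for this sketch. */
struct inode_loc {
    uint32_t group;        /* block group that holds the inode */
    uint32_t block_in_tbl; /* block index inside that group's inode table */
    uint32_t offset;       /* byte offset inside that block */
};

struct inode_loc locate_inode(uint32_t ino, uint32_t inodes_per_group,
                              uint32_t inode_size, uint32_t block_size)
{
    struct inode_loc loc;
    uint32_t index = (ino - 1) % inodes_per_group; /* inode numbers start at 1 */
    uint32_t per_block = block_size / inode_size;  /* e.g. 4096 / 256 = 16 */

    loc.group = (ino - 1) / inodes_per_group;
    loc.block_in_tbl = index / per_block;
    loc.offset = (index % per_block) * inode_size;
    return loc;
}
```

With an assumed 8192 inodes per group, 256-byte inodes, and 4KB blocks, inode 8210 falls in group 1, in the second block of that group's inode table, at byte offset 256.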

ext4_super_block structure
struct ext4_super_block {
/*00*/ __le32 s_inodes_count; /* Inodes count */
__le32 s_blocks_count_lo; /* Blocks count */
__le32 s_r_blocks_count_lo; /* Reserved blocks count */
__le32 s_free_blocks_count_lo; /* Free blocks count */
/*10*/ __le32 s_free_inodes_count; /* Free inodes count */
__le32 s_first_data_block; /* First Data Block */
__le32 s_log_block_size; /* Block size */
__le32 s_log_cluster_size; /* Allocation cluster size */
/*20*/ __le32 s_blocks_per_group; /* # Blocks per group */
__le32 s_clusters_per_group; /* # Clusters per group */
__le32 s_inodes_per_group; /* # Inodes per group */
__le32 s_mtime; /* Mount time */
/*30*/ __le32 s_wtime; /* Write time */
__le16 s_mnt_count; /* Mount count */
__le16 s_max_mnt_count; /* Maximal mount count */
__le16 s_magic; /* Magic signature */
__le16 s_state; /* File system state */
__le16 s_errors; /* Behaviour when detecting errors */
__le16 s_minor_rev_level; /* minor revision level */
/*40*/ __le32 s_lastcheck; /* time of last check */
__le32 s_checkinterval; /* max. time between checks */
__le32 s_creator_os; /* OS */
__le32 s_rev_level; /* Revision level */

/*50*/ __le16 s_def_resuid; /* Default uid for reserved blocks */
__le16 s_def_resgid; /* Default gid for reserved blocks */
__le32 s_first_ino; /* First non-reserved inode */
__le16 s_inode_size; /* size of inode structure */
__le16 s_block_group_nr; /* block group # of this superblock */
__le32 s_feature_compat; /* compatible feature set */
/*60*/ __le32 s_feature_incompat; /* incompatible feature set */
__le32 s_feature_ro_compat; /* readonly-compatible feature set */
/*68*/ __u8 s_uuid[16]; /* 128-bit uuid for volume */
/*78*/ char s_volume_name[16]; /* volume name */
/*88*/ char s_last_mounted[64]; /* directory where last mounted */
/*C8*/ __le32 s_algorithm_usage_bitmap; /* For compression */
__u8 s_prealloc_blocks; /* Nr of blocks to try to preallocate*/
__u8 s_prealloc_dir_blocks; /* Nr to preallocate for dirs */
__le16 s_reserved_gdt_blocks; /* Per group desc for online growth */
/*D0*/ __u8 s_journal_uuid[16]; /* uuid of journal superblock */
/*E0*/ __le32 s_journal_inum; /* inode number of journal file */
__le32 s_journal_dev; /* device number of journal file */
__le32 s_last_orphan; /* start of list of inodes to delete */
__le32 s_hash_seed[4]; /* HTREE hash seed */
__u8 s_def_hash_version; /* Default hash version to use */
__u8 s_jnl_backup_type;
__le16 s_desc_size; /* size of group descriptor */

/*100*/ __le32 s_default_mount_opts;
__le32 s_first_meta_bg; /* First metablock block group */
__le32 s_mkfs_time; /* When the filesystem was created */
__le32 s_jnl_blocks[17]; /* Backup of the journal inode */
/* 64bit support valid if EXT4_FEATURE_COMPAT_64BIT */
/*150*/ __le32 s_blocks_count_hi; /* Blocks count */
__le32 s_r_blocks_count_hi; /* Reserved blocks count */
__le32 s_free_blocks_count_hi; /* Free blocks count */
__le16 s_min_extra_isize; /* All inodes have at least # bytes */
__le16 s_want_extra_isize; /* New inodes should reserve # bytes */
__le32 s_flags; /* Miscellaneous flags */
__le16 s_raid_stride; /* RAID stride */
__le16 s_mmp_update_interval; /* # seconds to wait in MMP checking */
__le64 s_mmp_block; /* Block for multi-mount protection */
__le32 s_raid_stripe_width; /* blocks on all data disks (N*stride)*/
__u8 s_log_groups_per_flex; /* FLEX_BG group size */
__u8 s_checksum_type; /* metadata checksum algorithm used */
__le16 s_reserved_pad;
__le64 s_kbytes_written; /* nr of lifetime kilobytes written */
__le32 s_snapshot_inum; /* Inode number of active snapshot */
__le32 s_snapshot_id; /* sequential ID of active snapshot */
__le64 s_snapshot_r_blocks_count; /* reserved blocks for active
snapshot's future use */

__le32 s_snapshot_list; /* inode number of the head of the
on-disk snapshot list */
#define EXT4_S_ERR_START offsetof(struct ext4_super_block, s_error_count)
__le32 s_error_count; /* number of fs errors */
__le32 s_first_error_time; /* first time an error happened */
__le32 s_first_error_ino; /* inode involved in first error */
__le64 s_first_error_block; /* block involved of first error */
__u8 s_first_error_func[32]; /* function where the error happened */
__le32 s_first_error_line; /* line number where error happened */
__le32 s_last_error_time; /* most recent time of an error */
__le32 s_last_error_ino; /* inode involved in last error */
__le32 s_last_error_line; /* line number where error happened */
__le64 s_last_error_block; /* block involved of last error */
__u8 s_last_error_func[32]; /* function where the error happened */
#define EXT4_S_ERR_END offsetof(struct ext4_super_block, s_mount_opts)
__u8 s_mount_opts[64];
__le32 s_usr_quota_inum; /* inode for tracking user quota */
__le32 s_grp_quota_inum; /* inode for tracking group quota */
__le32 s_overhead_clusters; /* overhead blocks/clusters in fs */
__le32 s_backup_bgs[2]; /* groups with sparse_super2 SBs */
__le32 s_reserved[106]; /* Padding to the end of the block */
__le32 s_checksum; /* crc32c(superblock) */
};


ext4_inode structure
struct ext4_inode {
__le16 i_mode; /* File mode */
__le16 i_uid; /* Low 16 bits of Owner Uid */
__le32 i_size_lo; /* Size in bytes */
__le32 i_atime; /* Access time */
__le32 i_ctime; /* Inode Change time */
__le32 i_mtime; /* Modification time */
__le32 i_dtime; /* Deletion Time */
__le16 i_gid; /* Low 16 bits of Group Id */
__le16 i_links_count; /* Links count */
__le32 i_blocks_lo; /* Blocks count */
__le32 i_flags; /* File flags */
union {
struct {
__le32 l_i_version;
} linux1;
…
} osd1; /* OS dependent 1 */
__le32 i_block[EXT4_N_BLOCKS];/* Pointers to blocks */
__le32 i_generation; /* File version (for NFS) */
__le32 i_file_acl_lo; /* File ACL */
__le32 i_size_high;
__le32 i_obso_faddr; /* Obsoleted fragment address */

union {
struct {
__le16 l_i_blocks_high; /* were l_i_reserved1 */
__le16 l_i_file_acl_high;
__le16 l_i_uid_high; /* these 2 fields */
__le16 l_i_gid_high; /* were reserved2[0] */
__le16 l_i_checksum_lo;/* crc32c(uuid+inum+inode) LE */
__le16 l_i_reserved;
} linux2;
…
} osd2; /* OS dependent 2 */
__le16 i_extra_isize;
__le16 i_checksum_hi; /* crc32c(uuid+inum+inode) BE */
__le32 i_ctime_extra; /* extra Change time (nsec << 2 | epoch) */
__le32 i_mtime_extra; /* extra Modification time(nsec << 2 | epoch) */
__le32 i_atime_extra; /* extra Access time (nsec << 2 | epoch) */
__le32 i_crtime; /* File Creation time */
__le32 i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
__le32 i_version_hi; /* high 32 bits for 64-bit version */
};
The fields from i_extra_isize onward lie beyond the first 128 bytes.
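The "(nsec << 2 | epoch)" comments on the *_extra fields can be read as follows: the low 2 bits extend the 32-bit seconds value, and the upper 30 bits hold nanoseconds. A minimal sketch of decoding one timestamp pair, assuming both values were already converted from little-endian to host order; `struct ts64` and `decode_time` are hypothetical names, not kernel helpers.

```c
#include <stdint.h>

/* Hypothetical decoded timestamp for this sketch. */
struct ts64 {
    int64_t  sec;   /* seconds since the Unix epoch */
    uint32_t nsec;  /* nanoseconds */
};

/* xtime: e.g. i_mtime; xtime_extra: e.g. i_mtime_extra (host byte order). */
struct ts64 decode_time(uint32_t xtime, uint32_t xtime_extra)
{
    struct ts64 t;

    /* low 2 bits of the extra field extend the seconds beyond 32 bits */
    t.sec = (int64_t)(int32_t)xtime + ((int64_t)(xtime_extra & 3) << 32);
    /* remaining 30 bits carry the nanoseconds */
    t.nsec = xtime_extra >> 2;
    return t;
}
```

This follows the layout stated in the field comments only; how a real implementation treats corner cases such as pre-1970 dates is outside this sketch.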

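Flexible Block Groups
• The data block bitmap, inode bitmap, and inode table each span a number of blocks that matches the flex_bg size.
– For example, with a size of 4, the metadata of each kind for groups 0-3 is laid out contiguously
– Keeping metadata close together reportedly speeds up loading
– Large files can be placed contiguously on disk
• mkfs option: flex_bg
• ext4-specific feature
• The flex_bg group size is stored (as a base-2 logarithm) in s_log_groups_per_flex of the ext4_super_block structure
• There may be gaps between the metadata regions *1 (details need further investigation)
*1 http://kernhack.hatenablog.com/entry/2014/01/28/230808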
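Because s_log_groups_per_flex holds a logarithm, the flex group that a block group belongs to is a simple shift. A minimal sketch under that assumption; the function names are hypothetical and chosen only for this illustration.

```c
#include <stdint.h>

/* Number of block groups packed into one flex group. */
uint32_t flex_size(uint8_t log_groups_per_flex)
{
    return 1u << log_groups_per_flex;
}

/* Index of the flex group that contains the given block group. */
uint32_t flex_group_of(uint32_t group, uint8_t log_groups_per_flex)
{
    return group >> log_groups_per_flex;
}

/* First block group of that flex group (where its metadata is gathered). */
uint32_t flex_first_group(uint32_t group, uint8_t log_groups_per_flex)
{
    return (group >> log_groups_per_flex) << log_groups_per_flex;
}
```

With a logarithm of 2 (size 4, as in the example above), block group 5 belongs to flex group 1, whose first member is block group 4.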

flex_bg layout
[Figure: compared with the standard per-group layout, group 0 holds its Super Block, Group Descriptors, and Reserved GDT Blocks, followed by the data block bitmaps for g0..gn, the inode bitmaps for g0..gn, and the inode tables for g0..gn, then data blocks. Groups 1, 2, ..., n hold only Super Block, Group Descriptors, Reserved GDT Blocks, and data blocks.]

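Sparse super block / Sparse super 2
• sparse super block: a mechanism that reduces the number of backups of the super block and GDT
– Intended to improve disk utilization and make contiguous block allocation easier
– Backups are kept in block groups 1/3/5/7/3^2/5^2/7^2/…/7^n (group 1 and the powers of 3, 5, and 7)
– Supported since ext2
– mkfs option: sparse_super
– Flag: EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER
• sparse super2: limits the backup block groups to group 1 and the last group
– The two groups are stored in s_backup_bgs[2] of the ext4_super_block structure
• s_backup_bgs[0]: block group 1
• s_backup_bgs[1]: the last block group
– mkfs option: sparse_super2
– Flag: EXT4_FEATURE_COMPAT_SPARSE_SUPER2
(Based on the current mkfs implementation)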

Sparse super block layout
[Figure: groups 0, 1, and 3 each hold Super Block, Group Descriptors, Reserved GDT Blocks, data block bitmap, inode bitmap, inode table, and data blocks. Group 2 has no Super Block, Group Descriptors, or Reserved GDT Blocks; it holds only data block bitmap, inode bitmap, inode table, and data blocks.]

Sparse super2 layout
[Figure: group 0, group 1, and group n (the last group) each hold Super Block, Group Descriptors, Reserved GDT Blocks, data block bitmap, inode bitmap, inode table, and data blocks. All groups in between hold only data block bitmap, inode bitmap, inode table, and data blocks.]

Checking whether a super block exists
int ext4_bg_has_super(struct super_block *sb, ext4_group_t group)
{
    struct ext4_super_block *es = EXT4_SB(sb)->s_es;

    if (group == 0) /* group 0 always has a super block */
        return 1;
    if (EXT4_HAS_COMPAT_FEATURE(sb, EXT4_FEATURE_COMPAT_SPARSE_SUPER2)) {
        if (group == le32_to_cpu(es->s_backup_bgs[0]) || /* group 1 */
            group == le32_to_cpu(es->s_backup_bgs[1]))   /* group n (last) */
            return 1;
        return 0;
    }
    if ((group <= 1) || !EXT4_HAS_RO_COMPAT_FEATURE(sb,
                EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER))
        return 1; /* group <= 1, or SPARSE_SUPER is off: always has a super block */
    if (!(group & 1)) /* even-numbered groups never have one */
        return 0;
    if (test_root(group, 3) || (test_root(group, 5)) ||
        test_root(group, 7)) /* present if the group is a power of 3, 5, or 7 */
        return 1;
    return 0;
}
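The sparse_super branch of this check can be tried out in user space. The sketch below is not the kernel code: `is_power_of` is a hypothetical stand-in for the role test_root() plays above, and `has_backup_sparse_super` assumes sparse_super is enabled and sparse_super2 is not.

```c
#include <stdint.h>

/* Return 1 if a equals b^k for some k >= 1 (stand-in for test_root). */
int is_power_of(uint32_t a, uint32_t b)
{
    uint64_t p = b;  /* 64-bit so repeated multiplication cannot overflow */

    while (p < a)
        p *= b;
    return p == a;
}

/* Does this block group carry a super block copy under sparse_super? */
int has_backup_sparse_super(uint32_t group)
{
    if (group <= 1)     /* groups 0 and 1 always have one */
        return 1;
    if (!(group & 1))   /* even groups never have one */
        return 0;
    return is_power_of(group, 3) || is_power_of(group, 5) ||
           is_power_of(group, 7);
}
```

Walking groups 0 through 50 with this function selects 0, 1, 3, 5, 7, 9, 25, 27, and 49, which matches the 1/3/5/7/3^2/5^2/7^2 pattern described earlier.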

Meta Block Groups
• Group descriptors store the ext4_group_desc structure (64 bytes)
struct ext4_group_desc
{
__le32 bg_block_bitmap_lo; /* Blocks bitmap block */
__le32 bg_inode_bitmap_lo; /* Inodes bitmap block */
__le32 bg_inode_table_lo; /* Inodes table block */
__le16 bg_free_blocks_count_lo; /* Free blocks count */
__le16 bg_free_inodes_count_lo; /* Free inodes count */
__le16 bg_used_dirs_count_lo; /* Directories count */
__le16 bg_flags; /* EXT4_BG_flags (INODE_UNINIT, etc) */
__le32 bg_exclude_bitmap_lo; /* Exclude bitmap for snapshots */
__le16 bg_block_bitmap_csum_lo;/* crc32c(s_uuid+grp_num+bbitmap) LE */
__le16 bg_inode_bitmap_csum_lo;/* crc32c(s_uuid+grp_num+ibitmap) LE */
__le16 bg_itable_unused_lo; /* Unused inodes count */
__le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
…
};
(64 bytes in total)

Meta Block Groups
• Group descriptors store the ext4_group_desc structure (64 bytes)
• A block group is at most 128MB (= 2^27 bytes)
– If one whole block group is used for group descriptors, 2^27 / 64 = 2^21 block groups can be managed
– 2^27 * 2^21 = 256TB is then the maximum filesystem size
• When meta_bg is enabled (mkfs option), block groups are managed in units called metagroups
– One block can describe 64 block groups, so 64 block groups form 1 metagroup (the reserved area for GDT blocks is removed)
– 1 metagroup = 128MB * 64 = 8GB
– Block groups are indexed with ext4_group_t (32 bits)
– 2^32 * 2^27 = 2^59 = 512PB becomes the maximum filesystem size
• Group descriptors are stored in groups 0, 1, and 63 of each metagroup
• Can also be read by ext2/ext3
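The size limits quoted above follow from a few multiplications. The sketch below reproduces them for an arbitrary block size and descriptor size; the function names are hypothetical, and the model (one bitmap block per group, descriptors limited to one block group without meta_bg) is the simplified one used on this slide.

```c
#include <stdint.h>

/* One bitmap block has block_size * 8 bits, each covering one block. */
uint64_t group_bytes(uint64_t block_size)
{
    return block_size * 8 * block_size;
}

/* Without meta_bg: descriptors fill at most one whole block group. */
uint64_t max_fs_bytes_plain(uint64_t block_size, uint64_t desc_size)
{
    uint64_t gb = group_bytes(block_size);

    return (gb / desc_size) * gb;
}

/* One metagroup = the block groups described by one descriptor block. */
uint64_t metagroup_bytes(uint64_t block_size, uint64_t desc_size)
{
    return (block_size / desc_size) * group_bytes(block_size);
}

/* With meta_bg: limited only by the 32-bit group number. */
uint64_t max_fs_bytes_meta_bg(uint64_t block_size)
{
    return (1ULL << 32) * group_bytes(block_size);
}
```

For 4KB blocks and 64-byte descriptors these give 2^27 bytes (128MB) per group, 2^48 bytes (256TB) without meta_bg, 2^33 bytes (8GB) per metagroup, and 2^59 bytes (512PB) with meta_bg.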

Meta Block Groups
[Figure: the disk is divided into metagroup 0, metagroup 1, metagroup 2, ..., each spanning block groups bg0 .. bg63. Within a metagroup, bg0, bg1, and bg63 carry Group Descriptors (with no Reserved GDT Blocks) ahead of their data block bitmap, inode bitmap, inode table, and data blocks; other groups such as bg2 hold only data block bitmap, inode bitmap, inode table, and data blocks.]

Checking the number of GDT blocks with meta_bg
static unsigned long ext4_bg_num_gdb_meta(struct super_block *sb, ext4_group_t group)
{
    unsigned long metagroup = group / EXT4_DESC_PER_BLOCK(sb);
    ext4_group_t first = metagroup * EXT4_DESC_PER_BLOCK(sb);
    ext4_group_t last = first + EXT4_DESC_PER_BLOCK(sb) - 1;

    if (group == first || group == first + 1 || group == last)
        return 1; /* only the first, second, and last groups of a metagroup hold one GD block */
    return 0;
}

Big allocation
• A feature that lets allocation be managed in units of clusters (for example, 1MB) instead of blocks
– Reduces the size of the data block bitmap
– The maximum size per group is clustersize * blocksize * 8
– When big allocation is not enabled, code without cluster-specific handling may say "cluster" where "block" should be read (or vice versa)
• Flag: EXT4_FEATURE_RO_COMPAT_BIGALLOC
– Specified at mkfs time
– extents must be enabled
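To make the clustersize * blocksize * 8 limit and the block/cluster conversion concrete, here is a small sketch. The shift-based sizes follow the s_log_block_size / s_log_cluster_size convention (both relative to 1024 bytes); the function names are hypothetical and only loosely modeled on the idea of converting blocks to clusters.

```c
#include <stdint.h>

#define BASE_BLOCK_SIZE 1024u  /* both log fields are shifts of 1024 bytes */

/* Largest group size: one bitmap block of block_size * 8 bits, one bit per cluster. */
uint64_t max_group_bytes(uint32_t log_block_size, uint32_t log_cluster_size)
{
    uint64_t block_size = (uint64_t)BASE_BLOCK_SIZE << log_block_size;
    uint64_t cluster_size = (uint64_t)BASE_BLOCK_SIZE << log_cluster_size;

    return cluster_size * block_size * 8;
}

/* Cluster that contains a given block (cluster_bits = log_cluster - log_block). */
uint64_t block_to_cluster(uint64_t block, uint32_t cluster_bits)
{
    return block >> cluster_bits;
}

/* Number of clusters needed to hold a given number of blocks (rounds up). */
uint64_t num_blocks_to_clusters(uint64_t blocks, uint32_t cluster_bits)
{
    return (blocks + (1ULL << cluster_bits) - 1) >> cluster_bits;
}
```

With 4KB blocks and 1MB clusters (shifts 2 and 10, so cluster_bits = 8), one group can cover up to 2^35 bytes (32GB) instead of 128MB, and 513 blocks occupy 3 clusters.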

Big allocation
static int ext4_fill_super() /* called at mount time */
{
    …
    /* BLOCK_SIZE = 1024; s_log_cluster_size is the shift count relative to 1024 bytes */
    clustersize = BLOCK_SIZE << le32_to_cpu(es->s_log_cluster_size);
    has_bigalloc = EXT4_HAS_RO_COMPAT_FEATURE(sb,
                EXT4_FEATURE_RO_COMPAT_BIGALLOC);
    if (has_bigalloc) {
        …
        sbi->s_cluster_bits = le32_to_cpu(es->s_log_cluster_size) -
            le32_to_cpu(es->s_log_block_size);
        sbi->s_clusters_per_group =
            le32_to_cpu(es->s_clusters_per_group);
        /* at most blocksize * 8 clusters per group (2^15 with 4KB blocks) */
        if (sbi->s_clusters_per_group > blocksize * 8) {
            ext4_msg(sb, KERN_ERR,
                 "#clusters per group too big: %lu",
                 sbi->s_clusters_per_group);
            goto failed_mount;
        }
        …

Counting all metadata blocks
int ext4_calculate_overhead() /* called at mount time and on resize */
{
    …
    /* treat each bit of an empty page as one cluster (block) */
    char *buf = (char *) get_zeroed_page(GFP_KERNEL);
    …
    overhead = EXT4_B2C(sbi, le32_to_cpu(es->s_first_data_block));
    for (i = 0; i < ngroups; i++) {
        …
        /* count the metadata of each group */
        blks = count_overhead(sb, i, buf);
        overhead += blks;
        if (blks)
            memset(buf, 0, PAGE_SIZE);
        cond_resched();
    }
    /* Add the journal blocks as well */
    if (sbi->s_journal)
        overhead += EXT4_NUM_B2C(sbi, sbi->s_journal->j_maxlen);
    sbi->s_overhead = overhead;
    …
}

Counting the metadata blocks of each group
static int count_overhead(struct super_block *sb, ext4_group_t grp, char *buf)
{
    /* the 2 is for the data block bitmap and inode bitmap, fixed at 1 block each */
    if (!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_BIGALLOC))
        return (ext4_bg_has_super(sb, grp) + ext4_bg_num_gdb(sb, grp) +
            sbi->s_itb_per_group + 2);
    …
    for (i = 0; i < ngroups; i++) { /* check every group, to account for flex_bg */
        gdp = ext4_get_group_desc(sb, i, NULL);
        b = ext4_block_bitmap(sb, gdp);
        /* if it lies inside the group of interest, set the matching bit in buf (the page) */
        if (b >= first_block && b <= last_block) {
            ext4_set_bit(EXT4_B2C(sbi, b - first_block), buf);
            count++;
        }
        … /* the inode bitmap and inode table are handled the same way */
        if (i != grp)
            continue;
        … /* for the group itself, set the bits for the super block and GDT */
    }
    …
    /* count the data blocks from the bits (ext4_count_free) and derive the number of metadata blocks */
    return EXT4_CLUSTERS_PER_GROUP(sb) -
        ext4_count_free(buf, EXT4_CLUSTERS_PER_GROUP(sb) / 8);
}

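Extent
• Mapping logical blocks to physical blocks one-to-one, as ext2/3 do, makes the management area a burden for large files
– Files that use more than 12 blocks need indirect references, which lowers I/O performance
• The extent feature stores a "number of blocks used", implementing a kind of variable-length block
– Once the extent feature is in use, the filesystem can no longer be mounted as ext2/3
struct ext4_extent {
    __le32 ee_block;    /* first logical block covered */
    __le16 ee_len;      /* number of blocks used */
    __le16 ee_start_hi; /* upper 16 bits of the physical block */
    __le32 ee_start_lo; /* lower 32 bits of the physical block */
};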

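Conventional logical block management
[Figure: i_data[15] (__le32, 4 bytes per entry) in ext3_inode maps logical blocks to physical blocks. Entries 0-11 point directly to physical blocks. Entry 12 points to an indirect block holding 4096/4 = 1024 block numbers, which covers logical blocks 12-1035. Entry 13 points to a double indirect block, which covers logical blocks from 1036 onward. Entry 14 points to a triple indirect block.]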
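The ranges in this scheme (12 direct blocks, then 1024 block numbers per indirect block) can be checked with a short calculation. This is a sketch of the arithmetic only, assuming 4-byte block numbers; `indirection_level` is a hypothetical name.

```c
#include <stdint.h>

#define NDIR_BLOCKS 12u  /* i_data[0..11] are direct pointers */

/*
 * Returns 0 for a direct block, 1/2/3 for single/double/triple indirect,
 * or -1 if the logical block is beyond what the scheme can map.
 */
int indirection_level(uint64_t lblock, uint32_t block_size)
{
    uint64_t ptrs = block_size / 4;  /* 4-byte block numbers: 4096 / 4 = 1024 */

    if (lblock < NDIR_BLOCKS)
        return 0;
    lblock -= NDIR_BLOCKS;
    if (lblock < ptrs)
        return 1;
    lblock -= ptrs;
    if (lblock < ptrs * ptrs)
        return 2;
    lblock -= ptrs * ptrs;
    if (lblock < ptrs * ptrs * ptrs)
        return 3;
    return -1;
}
```

With 4KB blocks, logical block 1035 is the last one reached through the single indirect block, and logical block 1036 is the first one that needs the double indirect block.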

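Logical block management with extents
[Figure: i_data[15] (__le32) in ext4_inode holds an ext4_extent_header (12 bytes) followed by ext4_extent_idx entries. Each ext4_extent_idx points to a block one level down (eh_depth = n for tree nodes) that again holds an ext4_extent_header followed by ext4_extent_idx entries, and so on down to the leaves (eh_depth = 0), which hold an ext4_extent_header followed by ext4_extent entries (physical block location + number of blocks used).]
• Every tree/leaf node stores the same structures.
• The tree depth is constant within a file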

ext4_extent_header
Stored at the start of i_data[15] in ext4_inode. 12 bytes in total:
eh_magic (2 bytes): magic number (0xf30a)
eh_entries (2 bytes): number of extents
eh_max (2 bytes): maximum number of extents
eh_depth (2 bytes): depth of the current tree node
eh_generation (4 bytes): generation of the tree (currently unused)
• When eh_entries in i_data[] would reach 5 or more, the entries become ext4_extent_idx
• eh_max for a block is (4096 - 12 (header)) / 12 = 340 (the remaining 4 bytes are used as a checksum area)
• When eh_depth = 0, ext4_extent is used and the node is a "leaf"
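Both capacity figures (at most 4 entries inside the inode, 340 per 4KB block) come from the same division by the 12-byte entry size. A minimal sketch of that arithmetic; the macro and function names are hypothetical.

```c
#include <stdint.h>

#define EXT_HEADER_SIZE 12u  /* ext4_extent_header */
#define EXT_ENTRY_SIZE  12u  /* ext4_extent and ext4_extent_idx are both 12 bytes */
#define I_DATA_BYTES    60u  /* i_data[15] * 4 bytes */

/* Entries that fit in the inode itself, after the header. */
uint32_t entries_in_inode(void)
{
    return (I_DATA_BYTES - EXT_HEADER_SIZE) / EXT_ENTRY_SIZE;
}

/* Entries that fit in one extent tree block, after the header. */
uint32_t entries_in_block(uint32_t block_size)
{
    return (block_size - EXT_HEADER_SIZE) / EXT_ENTRY_SIZE;
}

/* Bytes left over at the end of an extent tree block. */
uint32_t leftover_in_block(uint32_t block_size)
{
    return (block_size - EXT_HEADER_SIZE) % EXT_ENTRY_SIZE;
}
```

For a 4KB block this yields 340 entries with 4 bytes left over, and for the inode it yields 4 entries, which is why a fifth extent forces the tree to grow.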

ext4_extent
Stored after the header in i_data[15] of ext4_inode. 12 bytes in total:
ee_block (4 bytes): first logical block
ee_len (2 bytes): number of blocks used (lower 15 bits)
ee_start_hi (2 bytes): upper 16 bits of the physical block
ee_start_lo (4 bytes): lower 32 bits of the physical block
• ee_block uses 32 bits: 2^32 * 4096 (bytes/block) = 16TB is the per-file limit
• ee_len uses its lower 15 bits: 2^15 * 4096 (bytes/block) = 128MB is the per-extent limit
• MSB of ee_len: unwritten flag (used for preallocation)
• ee_start_* uses 48 bits: 2^48 * 4096 (bytes/block) = 1EB is the filesystem limit
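The field descriptions above translate into a few lines of decoding. The sketch below assumes host-order values and uses hypothetical function names. One detail is my understanding of the kernel convention rather than something stated on the slide: a raw ee_len of exactly 0x8000 is treated as a written extent of 32768 blocks, and only values above it are unwritten.

```c
#include <stdint.h>

#define EXT_INIT_MAX_LEN (1u << 15)  /* 32768 blocks = 128MB with 4KB blocks */

/* Combine the 16-bit high part and 32-bit low part into a 48-bit block number. */
uint64_t extent_pblock(uint16_t start_hi, uint32_t start_lo)
{
    return ((uint64_t)start_hi << 32) | start_lo;
}

/* Unwritten (preallocated) extents have a raw length above 2^15. */
int extent_is_unwritten(uint16_t ee_len)
{
    return ee_len > EXT_INIT_MAX_LEN;
}

/* Actual number of blocks covered, with the unwritten marker removed. */
uint32_t extent_length(uint16_t ee_len)
{
    return ee_len <= EXT_INIT_MAX_LEN ? ee_len : ee_len - EXT_INIT_MAX_LEN;
}
```

For instance, a raw ee_len of 0x8005 describes an unwritten extent of 5 blocks, and start_hi = 1 with start_lo = 2 refers to physical block 0x100000002.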

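ext4_extent_idx
Stored after the header in i_data[15] of ext4_inode. 12 bytes in total:
ei_block (4 bytes): logical block covered by this index
ei_leaf_lo (4 bytes): lower 32 bits of the physical block that stores the extents of the next depth level
ei_leaf_hi (2 bytes): upper 16 bits of the physical block that stores the extents of the next depth level
ei_unused (2 bytes)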

How an extent is looked up
struct ext4_ext_path * ext4_ext_find_extent()
{
    …
    eh = ext_inode_hdr(inode); /* get the extent_header of the topmost tree node */
    …
    while (i) { /* loop until a leaf is reached */
        …
        /* binary search for the extent_idx */
        ext4_ext_binsearch_idx(inode, path + ppos, block);
        …
        /* read the block the extent_idx refers to (one level down) */
        bh = read_extent_tree_block(inode, path[ppos].p_block, --i, flags);
        if (IS_ERR(bh)) {
            … /* error handling */
        }
        …
        /* switch to the extent_header one level down */
        eh = ext_block_hdr(bh);
        …
    }
    …
    /* binary search for the target extent */
    ext4_ext_binsearch(inode, path + ppos, block);
    …
}
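The leaf-level step can be illustrated with a self-contained binary search over a sorted array of extents. This is a simplified sketch, not the kernel's ext4_ext_binsearch: `struct simple_extent`, `find_extent`, and `map_block` are hypothetical, the extents are held in host-order fields, and a return value of 0 is used to mean "hole".

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified, host-order view of one extent (hypothetical type). */
struct simple_extent {
    uint32_t first_lblock; /* first logical block covered */
    uint16_t len;          /* number of blocks covered */
    uint64_t pblock;       /* first physical block */
};

/* Find the last extent whose first_lblock is <= lblock (array sorted ascending). */
const struct simple_extent *find_extent(const struct simple_extent *ex,
                                        size_t n, uint32_t lblock)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (ex[mid].first_lblock <= lblock)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo == 0 ? NULL : &ex[lo - 1];
}

/* Map a logical block to a physical block; 0 means no extent covers it. */
uint64_t map_block(const struct simple_extent *ex, size_t n, uint32_t lblock)
{
    const struct simple_extent *e = find_extent(ex, n, lblock);

    if (e == NULL || lblock - e->first_lblock >= e->len)
        return 0;
    return e->pblock + (lblock - e->first_lblock);
}
```

Index nodes can be searched the same way on ei_block, descending one level per step until the leaf is reached.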

Topics to investigate next
• Mount options
• Journaling (including the differences between jbd and jbd2)
• delayed / multiblock / persistent allocation
• inline data/xattr
• vfs

References
• https://ext4.wiki.kernel.org/index.php/Main_Page
• 詳解Linuxカーネル 第3版 (Japanese edition of Understanding the Linux Kernel, 3rd Edition)
• Linuxカーネル解読室

Linux is a registered trademark of Linus Torvalds.
All other trademarks and copyrights are the
property of their respective owners.
