Introduction to File System & OCFS2
Gang He <ghe@suse.com>
Oct 14th, 2016
Basic Concepts
How to store your data
• Block device
No metadata; data is usually read/written sequentially.
• Database
Key/value-style data, accessed through the database's own CLI/API.
• File system
Tree-structured directories; user-chosen directory/file names and file sizes; complies with the POSIX standard interfaces.
File system interfaces
• open
• read/pread/readv, write/pwrite/writev
• mmap/munmap
• splice/sendfile (Linux-specific)
• io_submit/io_getevents (Linux-specific)
• reflink (in the future)
• fsync/msync
• close
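A minimal sketch of the basic interfaces above in C (the file name and sizes are illustrative, and error handling is abbreviated):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "hello, file system\n";
    char buf[sizeof(msg)];

    /* Create (or truncate) a file and obtain a file descriptor. */
    int fd = open("demo.txt", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    pwrite(fd, msg, sizeof(msg), 0);  /* write at offset 0 */
    fsync(fd);                        /* flush data and metadata to disk */
    pread(fd, buf, sizeof(buf), 0);   /* read it back from offset 0 */
    printf("%s", buf);

    close(fd);
    return 0;
}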
File access IO models
• Buffered/Direct IO
• Blocking/Nonblocking IO
• Synchronous/Asynchronous IO
• Mmap (on-demand read/write)
• IO multiplexing – select/poll/epoll
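As a sketch of the buffered vs. direct distinction, the C fragment below writes one file through the page cache and another with O_DIRECT, which bypasses the cache and requires an aligned buffer (512-byte alignment is assumed here; the real requirement depends on the device's logical block size):

#define _GNU_SOURCE   /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Buffered IO: data lands in the page cache first. */
    int bfd = open("buffered.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (bfd < 0) { perror("open"); return 1; }
    write(bfd, "cached by the kernel\n", 21);
    close(bfd);

    /* Direct IO: buffer, offset, and length must be aligned. */
    int dfd = open("direct.dat", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (dfd < 0) { perror("open(O_DIRECT)"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 512, 512)) { close(dfd); return 1; }
    memset(buf, 'x', 512);
    if (write(dfd, buf, 512) < 0)
        perror("direct write");

    free(buf);
    close(dfd);
    return 0;
}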
File system classification
• Pseudo file system
proc sysfs debugfs tmpfs devtmpfs
• Local file system
ext4 reiserfs xfs btrfs
• Cluster file system
OCFS2 GFS2 GPFS VxFS
• Distributed file system
GoogleFS HDFS GlusterFS Ceph
Understanding VFS
Why introduce VFS
• A virtual file system (VFS), or virtual filesystem switch, is an abstraction layer on top of a more concrete file system. The purpose of a VFS is to allow client applications to access different types of concrete file systems in a uniform way.
Data structures (1) in VFS
• struct super_block
one per mounted file system; holds global file system state/settings, the super block operations (inode alloc/free/dirty/write/evict), the root dentry, and the inode list.
• struct inode
one per file object, identified by a unique ID (ino) within the file system. It holds the file's metadata (mode, owner, blocks/size, a/m/c times, block pointers, etc.), various list_heads, the inode operations (lookup, create, unlink, setattr/getattr, etc.), and an address_space.
• struct dentry
one per file name (hard link); it holds the file name string, a parent dentry pointer, an inode pointer, various list_heads, and the dentry operations (e.g. d_hash, d_compare, d_revalidate).
• struct file
one per open() system call; it holds the open mode, the read/write position, a dentry pointer, and the file operations (e.g. llseek, read, write, fsync, flock).
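The dentry/inode split is visible from user space: two hard links are two dentries that point to the same inode. A small sketch (file names are illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    struct stat st1, st2;

    int fd = open("a.txt", O_CREAT | O_WRONLY, 0644);  /* one inode, one dentry */
    if (fd < 0) { perror("open"); return 1; }
    close(fd);

    link("a.txt", "b.txt");   /* a second dentry for the same inode */

    stat("a.txt", &st1);
    stat("b.txt", &st2);

    /* Both names report the same st_ino and a link count of 2. */
    printf("a.txt ino=%lu  b.txt ino=%lu  nlink=%lu\n",
           (unsigned long)st1.st_ino, (unsigned long)st2.st_ino,
           (unsigned long)st2.st_nlink);
    return 0;
}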
Data structures (1) relationships
Data structures (2) in VFS
• struct address_space
one per inode; it manages the inode's page cache, in a manner similar to the VM subsystem.
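The address_space is also what backs mmap: faulting on a mapped page fills it from the inode's page cache rather than copying through a user buffer. A minimal sketch (the file name is illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("demo.txt", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    /* Page faults on p are served from the inode's address_space. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    fwrite(p, 1, st.st_size, stdout);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}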
Data structures (3) in VFS
• struct page/buffer_head/bio
File access workflow
• Open
SYSCALL_DEFINE3(open) → do_sys_open → do_filp_open → path_openat →
walk_component → do_last → fd_install
• Read
SYSCALL_DEFINE3(read) → vfs_read → do_sync_read → (filp→f_op→aio_read) →
generic_file_aio_read → do_generic_file_read → (mapping→a_ops→readpage) →
block_read_full_page(page, xxx_get_block) → submit_bh → submit_bio
• Write
SYSCALL_DEFINE3(write) → vfs_write → do_sync_write → (filp→f_op→aio_write) →
generic_file_buffered_write → generic_perform_write → (mapping→a_ops→write_begin)
→ iov_iter_copy_from_user_atomic → (mapping→a_ops→write_end)
• Close
SYSCALL_DEFINE1(close) → __close_fd → filp_close → (filp→f_op→flush) → fput
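These chains reflect kernels of roughly the 3.x era; later kernels replaced the aio_read/aio_write file operations with read_iter/write_iter, but the overall shape is the same. Running a trivial program under strace shows the system calls that enter each chain:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    ssize_t n;

    int fd = open("demo.txt", O_RDONLY);          /* enters the open chain */
    if (fd < 0)
        return 1;

    while ((n = read(fd, buf, sizeof(buf))) > 0)  /* enters the read chain */
        write(STDOUT_FILENO, buf, n);             /* enters the write chain */

    close(fd);                                    /* enters the close chain */
    return 0;
}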
EXT3 File System
EXT3 file system layout
EXT3 inode block layout
EXT3 dir entry block layout
EXT3 dir index block layout (HashTree)
Journal(JBD2)
• Block is marked dirty for the journal.
• Block is written to the journal.
• Transaction is committed.
• Block is marked dirty for the file system.
• Block is written to the file system.
• The transaction is check-pointed.
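Schematically, a file system drives these steps through the JBD2 API; the kernel-side sketch below (not a standalone program, error handling omitted) follows the pattern used by ext3/ext4/OCFS2:

#include <linux/jbd2.h>

static int update_metadata_block(journal_t *journal, struct buffer_head *bh)
{
    /* Reserve journal space and open a handle for one block. */
    handle_t *handle = jbd2_journal_start(journal, 1);
    if (IS_ERR(handle))
        return PTR_ERR(handle);

    /* Declare intent to modify this buffer under the transaction. */
    jbd2_journal_get_write_access(handle, bh);

    /* ... modify bh->b_data here ... */

    /* Mark the block dirty for the journal; JBD2 writes it to the
     * journal, commits the transaction, and checkpoints it to its
     * final location later. */
    jbd2_journal_dirty_metadata(handle, bh);

    return jbd2_journal_stop(handle);
}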
File system mounting
• Try to find a file system already using the block device;
if one matches, reuse the existing super block.
• For each possible file system, attempt to read the
super block.
• Set up file system-specific structures.
• Replay journal, if applicable.
• Locate root directory inode block.
• Instantiate dentry for root directory and pin it to super
block.
• File system returns success to VFS.
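From user space the whole sequence is triggered by one mount(2) call; a minimal sketch (device, mount point, and fs type are illustrative, and root privileges are required):

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* The kernel then performs the steps above: read the super block,
     * replay the journal if needed, and instantiate the root dentry. */
    if (mount("/dev/sdb1", "/mnt", "ext3", 0, NULL) < 0) {
        perror("mount");
        return 1;
    }
    return 0;
}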
OCFS2 File System
OCFS2 overview
• Shared storage (SAN/iSCSI).
• Multiple file system nodes connect to the shared storage (single-node operation is also supported).
• Nodes can be added/removed online.
• Complies with traditional file system semantics.
OCFS2 design
• Uses system files to manage meta information, instead of traditional fixed bitmap blocks.
• Uses extents + B-trees to manage file data blocks, instead of direct/indirect block pointers.
• Each inode occupies an entire block; a data-in-inode (inline data) feature is supported.
• Uses JBD2 to handle the file system journal.
• Uses a DLM to manage locks across the nodes.
• Each node has its own system files, avoiding competition for global resources.
• Uses blocks to manage metadata and clusters to manage data.
OCFS2 system files
OCFS2 space management
global_bitmap: one large flat bitmap, in cluster units, recording the allocated space for the whole file system.
local_alloc: allocates data clusters for the local node.
inode_alloc: allocates inode blocks for the local node.
extent_alloc: allocates metadata blocks for the local node.
truncate_log: records deallocated clusters before they are returned to the global_bitmap.
OCFS2 inode block layout
OCFS2 inode block mapping
OCFS2 dir entry/index block layout
OCFS2 file clone (reflink)
• dd if=/dev/zero of=./test1 bs=1M count=8192
• reflink test1 test2
• dd conv=notrunc if=/dev/random of=test2 bs=4096 count=100
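The reflink command above comes from ocfs2-tools and uses an OCFS2-specific ioctl; as a hedged sketch, the same kind of clone can be requested with the generic FICLONE ioctl on file systems that implement it:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* FICLONE */

int main(void)
{
    int src = open("test1", O_RDONLY);
    int dst = open("test2", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    /* Share test1's extents with test2; blocks are copied only
     * when one of the files is later written (copy-on-write). */
    if (ioctl(dst, FICLONE, src) < 0) {
        perror("ioctl(FICLONE)");
        return 1;
    }

    close(src);
    close(dst);
    return 0;
}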
OCFS2 node cooperation