3. Introduction
“A file system is a means to organize data expected
to be retained after a program terminates by
providing procedures to store, retrieve and update
data, as well as manage the available space on the
device(s) which contain it.” – from Wikipedia
Store data
Organize data
Access data
Manage storage resources (e.g. hard drive)
5. Relationship to Architecture Course
The file system sits between memory and
secondary storage (or remote servers)
One of the most complex parts of an operating system
Main R&D focuses:
Performance: throughput, latency, scalability
Reliability and availability
Management: snapshots, etc.
Acknowledgement: slides from the 830 course
6. Different types of file systems
Local file systems
Store data on local hard drives, SSDs, floppy drives,
optical disks, etc.
Examples: NTFS, EXT4, HFS+, ZFS
Network/distributed file systems
Store data on remote file server(s)
Examples: NFS, CIFS/Samba, AFP, Hadoop DFS, Ceph
Pseudo file systems
Examples: procfs, devfs, tmpfs
“List of file systems”
http://en.wikipedia.org/wiki/List_of_file_systems
8. Overall Architecture of Linux file system components
Acknowledgement: “Anatomy of the Linux file system”, IBM
developerWorks.
9. Virtual File System (VFS)
VFS is an essential concept in UNIX-like file systems
Specifies an interface between the kernel and a concrete file
system
Introduced by Sun in 1985
Passes system calls to the underlying file systems
E.g. pass sys_write() to Ext4 (i.e. ext4_write())
Three major kinds of metadata in VFS
Metadata: data about data (Wikipedia)
Super block, dentry and inode
OO design
Each component defines a set of data members and the functions
to access them
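The dispatch idea above can be sketched in a few lines. This is only an illustration, not kernel code: the class names (FileOps, Ext4Like, TmpfsLike, VFS) and the longest-prefix mount lookup are assumptions made for the example.

```python
class FileOps:
    """Hypothetical interface that every concrete file system implements."""

    def write(self, path, data):
        raise NotImplementedError


class Ext4Like(FileOps):
    """Stand-in for a disk file system; keeps file contents in a dict."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = self.files.get(path, b"") + data
        return len(data)


class TmpfsLike(Ext4Like):
    """Stand-in for a memory-only file system with the same interface."""


class VFS:
    """Routes a generic call to whichever file system owns the path."""

    def __init__(self):
        self.mounts = {}  # mount point -> FileOps object

    def mount(self, mount_point, fs):
        self.mounts[mount_point] = fs

    def sys_write(self, path, data):
        # Pick the longest mount point that is a prefix of the path.
        matches = [m for m in self.mounts
                   if path == m or path.startswith(m.rstrip("/") + "/")]
        fs = self.mounts[max(matches, key=len)]
        return fs.write(path, data)  # e.g. ends up in an ext4-style write


vfs = VFS()
root_fs, tmp_fs = Ext4Like(), TmpfsLike()
vfs.mount("/", root_fs)
vfs.mount("/tmp", tmp_fs)
vfs.sys_write("/home/a.txt", b"hi")  # handled by root_fs
vfs.sys_write("/tmp/x", b"abc")      # handled by tmp_fs
```

The caller only ever sees sys_write(); which implementation runs depends on where the path is mounted.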
10. Super block
A segment of metadata that describes a file system
Constructed when a file system is mounted
Usually, a persistent copy of the super block is stored at the
beginning of the storage device
Describes:
File system type, size, status (e.g. dirty bit, read only bit)
Block size, maximum file size, device size, etc.
How to find other metadata and data
How to manipulate this data (i.e. sb_ops)
11. Inode
“Index node” in Unix-style file systems
All information about one file (or directory)
Except its name
In UNIX-like systems, file names are stored in the directory file:
its content is an “array” of file names
E.g. owner, access rights, mode, size, timestamps, etc.
Pointers to data
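A quick way to see that the name is not part of the inode is to rename a file and compare what the OS reports before and after. The sketch below assumes a POSIX system; describe() is a helper invented for this example, and it uses only os.stat() from the Python standard library.

```python
import os
import stat
import tempfile


def describe(path):
    """Return a few of the inode fields that os.stat() exposes."""
    st = os.stat(path)
    return {
        "inode": st.st_ino,
        "mode": stat.filemode(st.st_mode),
        "owner_uid": st.st_uid,
        "size": st.st_size,
    }


with tempfile.TemporaryDirectory() as d:
    old_name = os.path.join(d, "a.txt")
    new_name = os.path.join(d, "b.txt")
    with open(old_name, "wb") as f:
        f.write(b"hello")

    before = describe(old_name)
    os.rename(old_name, new_name)  # only the directory entry changes
    after = describe(new_name)

same_inode = before["inode"] == after["inode"]
print(before)
print("same inode after rename:", same_inode)
```

None of the fields returned by describe() contains the file name; the rename only touches the directory file.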
12. Directory Entry (dentry)
A dentry conceptually maps a file name to its
corresponding inode
Each file/directory has a dentry representing it
File systems use dentries to look up a file in the
hierarchical namespace
Each dentry has a pointer to the dentry of its parent
directory
Each directory's dentry has a list of the dentries of its
subdirectories and files
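The parent pointer and the child list can be sketched as a small tree. This is a simplified model, not the kernel's struct dentry: the Dentry class, the lookup() helper and the inode numbers are made up for illustration.

```python
class Dentry:
    """Simplified dentry: a name, an inode number, a parent and children."""

    def __init__(self, name, inode, parent=None):
        self.name = name
        self.inode = inode
        self.parent = parent
        self.children = {}  # child name -> Dentry
        if parent is not None:
            parent.children[name] = self

    def path(self):
        """Rebuild the full path by following the parent pointers."""
        parts = []
        node = self
        while node.parent is not None:
            parts.append(node.name)
            node = node.parent
        return "/" + "/".join(reversed(parts))


def lookup(root, path):
    """Walk from the root, one path component at a time."""
    node = root
    for name in path.strip("/").split("/"):
        if not name:
            continue
        node = node.children.get(name)
        if node is None:
            return None  # no such file or directory
    return node


root = Dentry("/", inode=2)
users = Dentry("Users", inode=10, parent=root)
john = Dentry("john", inode=11, parent=users)
docs = Dentry("Documents", inode=12, parent=john)
pdf = Dentry("course.pdf", inode=13, parent=docs)

found = lookup(root, "/Users/john/Documents/course.pdf")
print(found.inode, found.path())
```

Each component of the path costs one step down the tree, which is why caching these lookups pays off.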
14. Optimizations
Most file system optimizations are designed
based on the characteristics of the memory
hierarchy and storage devices.
Recall:
RAM: 50-100 ns
Disks: 5-10 ms
About 5 orders of magnitude difference
Almost all widely used local file systems are designed for
hard disk drives, which have their unique characteristics
15. Hard Disk Drive (HDD)
Stores data on one or
more rotating disks,
coated with magnetic
material
Introduced by IBM in
1956
Uses magnetic heads to
read and write data
17. HDD (Cont’d)
The essential structure of
the HDD has not changed
much…
Consists of several
disks (platters)
Each disk is divided into
tracks, each of which
is then divided into sectors
The single most
significant factor:
Seek time
18. Why seek time matters
To access data (a sector), the HDD head must
first move to the track (seek time), then wait for the
disk to rotate to the sector (rotational time)
Seek time: 3 ms on high-end server disks, 12 ms on
desktop-level disks [1]
Average rotational latency: 5.56 ms on 5400 RPM HDDs, 4.17 ms on
7200 RPM HDDs [1]
As a result, sequential I/O is much faster than
random I/O, because it avoids most of the seek and
rotational delay
[1], http://en.wikipedia.org/wiki/Disk-drive_performance_characteristics
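The rotational figures above follow from the spindle speed: on average the head waits half a revolution. The sketch below reproduces them and adds a rough estimate of random I/O rate; it assumes the transfer time itself is negligible, and the function names are chosen for this example.

```python
def avg_rotational_latency_ms(rpm):
    """Average wait is half a revolution; one revolution takes 60000/rpm ms."""
    return 0.5 * 60_000 / rpm


def random_iops(seek_ms, rpm):
    """Rough random I/Os per second: each one pays a seek plus a rotation.

    Assumption: transfer time is ignored.
    """
    return 1000 / (seek_ms + avg_rotational_latency_ms(rpm))


print(round(avg_rotational_latency_ms(5400), 2))  # 5.56
print(round(avg_rotational_latency_ms(7200), 2))  # 4.17
print(int(random_iops(12, 7200)))                 # 61
```

Under these assumptions a desktop disk with a 12 ms seek time completes only about 60 random accesses per second, which is why file systems work hard to keep I/O sequential.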
19. General Optimizations
Based on two principles:
RAM access is much faster than disk access
Sequential I/O is much faster than random I/O on disk
So we design file systems that
Make heavy use of CPU/RAM to reduce I/O to disks (various
caches/write buffers)
Prefer sequential I/O
Compute the disk layout so that related data is located
sequentially on disk
20. Dcache
Dentry cache (dcache)
Directories are stored as files on disks.
For each file lookup, we want to obtain the inode from the
given full file path
The OS looks up the dentries from the root through all parent directories in the
path.
E.g. to look up the file “/Users/john/Documents/course.pdf”, the OS
needs to traverse the dentries that represent “/”, “Users”, “john”,
“Documents”, and “course.pdf”
To accelerate this:
We use a global hash table (dcache) to map “file path” -> dentry
A two-list solution: one list for active dentries, and one for
recently unused dentries (LRU).
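A minimal sketch of the hash table plus LRU list could look like this. It is keyed by full path, as on the slide, which is a simplification; the DentryCache class, its method names and the max_unused limit are assumptions made for the example.

```python
from collections import OrderedDict


class DentryCache:
    """Hash table of dentries plus an LRU list of the unused ones."""

    def __init__(self, max_unused=2):
        self.table = {}              # path -> dentry (the hash table)
        self.in_use = {}             # path -> reference count (active list)
        self.unused = OrderedDict()  # paths with no users, oldest first
        self.max_unused = max_unused

    def get(self, path, load):
        """Return the cached dentry, calling load(path) only on a miss."""
        dentry = self.table.get(path)
        if dentry is None:
            dentry = load(path)      # slow path: read directories from disk
            self.table[path] = dentry
        self.unused.pop(path, None)  # it is active again
        self.in_use[path] = self.in_use.get(path, 0) + 1
        return dentry

    def put(self, path):
        """Drop one reference; unused dentries become eviction candidates."""
        self.in_use[path] -= 1
        if self.in_use[path] == 0:
            del self.in_use[path]
            self.unused[path] = True
            while len(self.unused) > self.max_unused:
                victim, _ = self.unused.popitem(last=False)  # least recent
                del self.table[victim]


loads = []


def load_from_disk(path):
    """Stand-in for the slow directory walk; records every miss."""
    loads.append(path)
    return {"path": path}


cache = DentryCache(max_unused=2)
for p in ["/a", "/b", "/c"]:
    cache.get(p, load_from_disk)
    cache.put(p)

cache.get("/b", load_from_disk)  # hit: still cached
cache.get("/a", load_from_disk)  # miss: "/a" was evicted as least recent
print(loads)
```

Active dentries are never evicted in this sketch; only entries on the unused list are reclaimed, oldest first.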
21. Inode cache
Similar to the dcache,
OS maintains a cache
for inode objects.
Each cached inode object is
referenced by a dentry (several,
if the file has hard links)
If its dentry objects are
evicted, the inode can be
evicted too
[Figure: processes (P1, P2, P10) hold file objects (f0, f1, f2, f3) that go through the VFS to the dentry cache (hash table; Dentry 0, 10, 20), the inode cache (Inode 0, 10, 20), and the per-inode page cache (radix tree; Page Cache 0, 10, 20).]
22. Page Cache
…a “transparent” buffer for disk-backed pages kept in
RAM for fast access… [wikipedia]
A write-back cache
Main purpose: reducing the # of IOs to disks
Access is page-based (usually 4 KB pages).
The page cache is kept per file.
A radix tree in the inode object.
Prefetches pages to serve future reads
Absorbs writes to reduce the # of IOs
Dirty (modified) pages are flushed to disk 1) periodically
(every 30 s or 5 s), or 2) when the OS wants to reclaim RAM
A flush can also be forced by calling the “fsync()” system call
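The write-absorbing behaviour can be illustrated with a toy write-back cache. This is a sketch under simplifying assumptions: the PageCache class is invented for the example, a plain dict stands in for both the radix tree and the disk, and only an explicit fsync() triggers a flush.

```python
PAGE_SIZE = 4096


class PageCache:
    """Toy per-file write-back cache; a dict stands in for the disk."""

    def __init__(self):
        self.pages = {}       # page index -> bytearray held in RAM
        self.dirty = set()    # indexes of modified pages
        self.disk = {}        # page index -> bytes (the "device")
        self.disk_writes = 0  # number of page writes issued to the device

    def write(self, offset, data):
        """Modify cached pages only; nothing reaches the disk yet."""
        for i, byte in enumerate(data):
            index, pos = divmod(offset + i, PAGE_SIZE)
            page = self.pages.get(index)
            if page is None:
                page = bytearray(self.disk.get(index, bytes(PAGE_SIZE)))
                self.pages[index] = page
            page[pos] = byte
            self.dirty.add(index)

    def fsync(self):
        """Flush every dirty page, in page order, then mark them clean."""
        for index in sorted(self.dirty):
            self.disk[index] = bytes(self.pages[index])
            self.disk_writes += 1
        self.dirty.clear()


cache = PageCache()
for i in range(100):                 # 100 small writes, all inside page 0
    cache.write(i * 10, b"x" * 10)
writes_before_flush = cache.disk_writes
cache.fsync()
writes_after_flush = cache.disk_writes
print(writes_before_flush, writes_after_flush)  # 0 1
```

One hundred small writes cost a single page write once flushed. In a real program the equivalent request is os.fsync(fd) on an open file descriptor.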
24. Examples
Several concrete file system designs
Ext4, classic UNIX-like file system concepts
NTFS, advanced Windows file system
ZFS, “the last word in file systems”
NFS, a standard network file system
Google File System, a special distributed file system for
special requirements
25. Ext4
The latest version of
the “extended file
system” (Ext2/3/4)
The standard Linux file
system for a long time
Inspired by UFS from
BSD/Solaris
Groups files into block
groups
Keeps file data near its
inodes
Ack: http://bit.ly/tjipWY
26. NTFS
“New Technology File
System” (NTFS)
The standard file
system in the Windows
world.
A Master File Table
(MFT) contains all
metadata.
A directory is also a file
27. ZFS
ZFS: “the last word in file systems”
The most advanced local file system in production
128-bit address space (2^128 bytes in theory)
More than the number of grains of sand on Earth…
A lot of advanced features:
E.g. transactional commits, end-to-end integrity, snapshots,
volume management and much more…
Designed to never lose data and to always be consistent.
Every OS community wants to clone or copy its features…
Btrfs on Linux, ReFS on Windows, ZFS on FreeBSD
28. NFS
“Network File System
(NFS)”
A protocol developed
by Sun in 1984
A set of RPC calls
IETF standard
Supported by all major
OSs
Simple and efficient
29. Google File System (GFS)
A large distributed file
system specially
designed for the
MapReduce framework
High throughput
High availability
Specially designed: not
compatible with the
VFS/POSIX API.
Requires clients to be linked
against the GFS library.
Hadoop DFS clones the
concepts of GFS
30. More File Systems
Interesting file systems that are worth exploring
Btrfs (B-tree FS) from Oracle, expected to be the next
standard Linux file system. Many concepts are shared
with ZFS.
ReFS: The file system for Windows 8 (from Microsoft).
Many concepts are shared with ZFS (too!).
WAFL (Write Anywhere File Layout) file system from
NetApp.
FUSE (Filesystem in Userspace): a cross-platform library
that allows developers to write file systems that run in
user mode