Linux VFS

COMS W4118
Prof. Kaustubh R. Joshi
krj@cs.columbia.edu
http://www.cs.columbia.edu/~krj/os

4/22/13 COMS W4118. Spring 2013, Columbia University. Instructor: Dr. Kaustubh Joshi, AT&T Labs.

References: Operating System Concepts (9e), Linux Kernel Development, Understanding the Linux Kernel 3rd edition (Bovet and Cesati), previous W4118s

Copyright notice: care has been taken to use only those web images deemed by the instructor to be in the public domain. If you see a copyrighted image on any slide and are the copyright owner, please contact the instructor. It will be removed.
  
File Systems
• old days – "the" filesystem!
• now – many filesystem types, many instances
  – need to copy file from NTFS to Ext3
• original motivation – NFS support (Sun)
• idea – filesystem op abstraction layer (VFS)
  – Virtual File System (aka Virtual Filesystem Switch)
  – File-related ops determine filesystem type
  – Dispatch (via function pointers) filesystem-specific op
  
  
File System Types
• lots and lots of filesystem types!
  – 2.6 has nearly 100 in the standard kernel tree
• examples
  – standard: Ext2, ufs (Solaris), svfs (SysV), ffs (BSD)
  – network: RFS, NFS, Andrew, Coda, Samba, Novell
  – journaling: Ext3, Veritas, ReiserFS, XFS, JFS
  – media-specific: jffs, ISO9660 (cd), UDF (dvd)
  – special: /proc, tmpfs, sockfs, etc.
• proprietary
  – MSDOS, VFAT, NTFS, Mac, Amiga, etc.
• new generation for Linux
  – Ext3, ReiserFS, XFS, JFS
  
  
(VFS) Virtual File System
• Object-oriented way of implementing FSs
• Same API for different types of file systems
  – Separates file-system generic operations from implementation details
  – Implementation can be one of many file system types, or a network file system
  – Then dispatches operation to appropriate file system implementation routines
• Syscalls program to VFS API rather than specific FS interface
  
  
Linux Virtual File System (VFS)
• Very flexible use cases:
  – User files remote and system files local? No problem.
  – Boot from USB? Network? RAM? No problem.
  – Boot from another file? No problem.
  – Interesting FSes: sshfs, gmailfs, FUSE (user space FS)
  	
  
  
VFS Stakeholders
• VFS Objects
  – inode, file, superblock, dentry
  – VFS defines which ops exist on each object
  – Each object has a pointer to a function table
    • Addresses of routines to implement that function on that object
• VFS Users
  – System calls that provide file related services
  – Use VFS function pointers and objects only
• VFS Implementers
  – File systems that translate VFS ops into native operations
  – Store on disk, send over network, etc.
  – Provide the functions pointed to by function pointers
  
  
Linux File System Model
• basically UNIX file semantics
  – File systems are mounted at various points
  – Files identified by device and inode numbers
• VFS layer just dispatches to fs-specific functions
  – libc read() -> sys_read()
    • what type of filesystem does this file belong to?
    • call filesystem (fs) specific read function
    • maintained in open file object (file)
  – example: file->f_op->read(…)
• similar to device abstraction model in UNIX
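
The file->f_op->read(…) dispatch can be imitated in user space. This is a minimal sketch, not kernel code: toy_file, toy_file_operations, ramfs_read, upperfs_read and toy_vfs_read are hypothetical names invented for illustration, and the real struct file and struct file_operations carry many more fields.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Hypothetical stand-in for the kernel's file_operations table. */
struct toy_file_operations {
    size_t (*read)(struct toy_file *f, char *buf, size_t len);
};

/* Hypothetical stand-in for an open file object. */
struct toy_file {
    const struct toy_file_operations *f_op; /* chosen when the file is opened */
    const char *data;                       /* backing bytes */
    size_t size;
    size_t f_pos;                           /* current position */
};

/* "Filesystem A": copies bytes verbatim and advances the position. */
static size_t ramfs_read(struct toy_file *f, char *buf, size_t len)
{
    size_t left = f->size - f->f_pos;
    size_t n = len < left ? len : left;

    memcpy(buf, f->data + f->f_pos, n);
    f->f_pos += n;
    return n;
}

/* "Filesystem B": same data, but upper-cased on the way out. */
static size_t upperfs_read(struct toy_file *f, char *buf, size_t len)
{
    size_t n = ramfs_read(f, buf, len);

    for (size_t i = 0; i < n; i++)
        buf[i] = (char)toupper((unsigned char)buf[i]);
    return n;
}

static const struct toy_file_operations ramfs_fops = { .read = ramfs_read };
static const struct toy_file_operations upperfs_fops = { .read = upperfs_read };

/* Generic layer: knows nothing about the filesystem, only the table. */
static size_t toy_vfs_read(struct toy_file *f, char *buf, size_t len)
{
    return f->f_op->read(f, buf, len);
}
```

The caller only ever uses toy_vfs_read(); which routine runs is decided by the table stored in the file object, which is the point of the slide.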
  
  
VFS Users
• fundamental UNIX abstractions
  – files (everything is a file)
    • ex: /dev/ttyS0 – device as a file
    • ex: /proc/123 – process as a file
  – processes
  – users
• lots of syscalls related to files! (~100)
  – most dispatch to filesystem-specific calls
  – some require no filesystem action
    • example: lseek(pos) – change position in file
  – others have default VFS implementations
  
  
VFS System Calls
• filesystem ops – mounting, info, flushing, chroot, pivot_root
• directory ops – chdir, getcwd, link, unlink, rename, symlink
• file ops – open/close, (p)read(v)/(p)write(v), seek, truncate, dup, fcntl, creat
• inode ops – stat, permissions, chmod, chown
• memory mapping files – mmap, munmap, madvise, mlock
• wait for input – poll, select
• flushing – sync, fsync, msync, fdatasync
• file locking – flock
  
  
VFS-related Task Fields
• task_struct fields
  – fs – includes root, pwd
    • pointers to dentries
  – files – includes file descriptor array fd[]
    • pointers to open file objects
  
  
VFS Objects: The Big Four
• struct file
  – information about an open file
  – includes current position (file pointer)
• struct dentry
  – information about a directory entry
  – includes name + inode#
• struct inode
  – unique descriptor of a file or directory
  – contains permissions, timestamps, block map (data)
  – inode#: integer (unique per mounted filesystem)
• struct super_block
  – descriptor of a mounted filesystem
  
  
Two More Data Structures
• struct file_system_type
  – name of file system
  – pointer to implementing module
  – including how to read a superblock
  – On module load, you call register_filesystem and pass a pointer to this structure
• struct vfsmount
  – Represents a mounted instance of a particular file system
  – One super block can be mounted in two places, with different covering sub mounts
  – Thus lookup requires parent dentry and a vfsmount
  
  
Data Structure Relationships
[Figure: task_struct and its fds; open file objects; dentries (dentry cache); inodes (inode cache); superblock; disk]
  
  
Data Structure Relationships
[Figure: open file objects, dentries, inodes and superblock, with links labeled f_dentry, d_inode, d_parent, d_subdirs, d_sb, i_sb, i_dentries]
  
  
Sharing Data Structures
• calling dup() –
  – shares open file objects
  – example: 2>&1
• opening the same file twice –
  – shares dentries
• opening same file via different hard links –
  – shares inodes
• mounting same filesystem on different dirs –
  – shares superblocks
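
The first two cases can be observed from user space, because the file position lives in the open file object. The following is a minimal sketch using standard POSIX calls; the helper name check_offset_sharing and the temporary file path are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Returns 0 if the observed offsets match the sharing rules, -1 on any
 * error or mismatch. Uses a temporary file created by mkstemp().
 */
static int check_offset_sharing(void)
{
    char path[] = "/tmp/vfs_dup_XXXXXX";
    int fd = mkstemp(path);          /* first open file object */
    if (fd < 0)
        return -1;

    int ok = -1;
    int dup_fd = dup(fd);            /* same open file object as fd */
    int fd2 = open(path, O_RDONLY);  /* second, independent open file object */

    if (dup_fd >= 0 && fd2 >= 0 && write(fd, "abcdef", 6) == 6) {
        /* Writing through fd moved the shared position to 6. */
        off_t shared = lseek(dup_fd, 0, SEEK_CUR);
        /* The separately opened descriptor still sits at 0. */
        off_t separate = lseek(fd2, 0, SEEK_CUR);

        if (shared == 6 && separate == 0)
            ok = 0;
    }

    if (fd2 >= 0)
        close(fd2);
    if (dup_fd >= 0)
        close(dup_fd);
    close(fd);
    unlink(path);
    return ok;
}
```

This is also why 2>&1 interleaves sensibly: both descriptors advance one shared position.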
  
  
Superblock
• mounted filesystem descriptor
  – usually first block on disk (after boot block)
  – copied into (similar) memory structure on mount
    • distinction: disk superblock vs memory superblock
    • dirty bit (s_dirt), copied to disk frequently
• important fields
  – s_dev, s_bdev – device, device-driver
  – s_blocksize, s_maxbytes, s_type
  – s_flags, s_magic, s_count, s_root, s_dquot
  – s_dirty – dirty inodes for this filesystem
  – s_op – superblock operations
  – u – filesystem specific data
  
  
Superblock Operations
• filesystem-specific operations
  – read/write/clear/delete inode
  – write_super, put_super (release)
    • no get_super()! that lives in file_system_type descriptor
  – write_super_lockfs, unlockfs, statfs
  – file_handle ops (NFS-related)
  – show_options
  
  
Inode
• "index" node – unique file or directory descriptor
  – meta-data: permissions, owner, timestamps, size, link count
  – data: pointers to disk blocks containing actual data
    • data pointers are "indices" into file contents (hence "inode")
• inode # - unique integer (per-mounted filesystem)
• what about names and paths?
  – high-level fluff on top of a "flat-filesystem"
  – implemented by directory files (directories)
  – directory contents: name + inode
  
  
File Links
• UNIX link semantics
  – hard links – multiple dir entries with same inode #
    • equal status; first is not "real" entry
    • file deleted when link count goes to 0
    • restrictions
      – can't hard link to directories (avoids cycles)
      – or across filesystems
  – soft (symbolic) links – little files with pathnames
    • just aliases for another pathname
    • no restrictions, cycles possible, dangling links possible
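
The hard-link rules above can be checked with link() and stat(). This is a minimal sketch under the assumption that /tmp lives on a filesystem that supports hard links; check_hard_link and the ".alias" suffix are names chosen for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Creates a temp file, adds a second hard link, and checks that both names
 * resolve to the same inode with a link count of 2; then removes one name
 * and checks that the count drops to 1. Returns 0 on success, -1 otherwise.
 */
static int check_hard_link(void)
{
    char path[] = "/tmp/vfs_link_XXXXXX";
    char alias[sizeof(path) + 8];
    struct stat a, b;
    int ok = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    close(fd);

    snprintf(alias, sizeof(alias), "%s.alias", path);
    if (link(path, alias) == 0) {
        /* Two directory entries, one inode. */
        int same = stat(path, &a) == 0 && stat(alias, &b) == 0 &&
                   a.st_dev == b.st_dev && a.st_ino == b.st_ino &&
                   a.st_nlink == 2;
        /* Removing one name only decrements the link count. */
        int dropped = unlink(alias) == 0 && stat(path, &a) == 0 &&
                      a.st_nlink == 1;

        if (same && dropped)
            ok = 0;
    }
    unlink(path);
    return ok;
}
```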
  
  
Inode Fields
• large struct (~50 fields)
• important fields
  – i_sb, i_ino (number), i_nlink (link count)
  – metadata: i_mode, i_uid, i_gid, i_size, i_times
  – i_flock (lock list), i_wait (waitq – for blocking ops)
  – linkage: i_hash, i_list, i_dentry (aliases)
  – i_op (inode ops), i_fop (default file ops)
  – u (filesystem specific data – includes block map!)
  
  
Inode Operations
• create – new inode for regular file
• link/unlink/rename –
  – add/remove/modify dir entry
• symlink, readlink, follow_link – soft link ops
• mkdir/rmdir – new inode for directory file
• mknod – new inode for device file
• truncate – modify file size
• permission – check access permissions
  
  
(Open) File Object
• struct file (usual variable name - filp)
  – association between file and process
  – no disk representation
  – created for each open (multiple possible, even same file)
  – most important info: file pointer
• file descriptor (small ints)
  – index into array of pointers to open file objects
• file object states
  – unused (memory cache + root reserve (10))
    • get_empty_filp()
  – inuse (per-superblock lists)
• system-wide max on open file objects (~8K)
  – /proc/sys/fs/file-max
  
  
File Object Fields
• important fields
  – f_dentry (directory entry of file)
  – f_vfsmnt (fs mount point)
  – f_op (fs-specific functions – table of function pointers)
  – f_count, f_flags, f_mode (r/w, permissions, etc.)
  – f_pos (current position – file pointer)
  – info for read-ahead (more later)
  – f_uid, f_gid, f_owner
  – f_version (for consistency maintenance)
  – private_data (fs-specific data)
  
  
File Object Operations
• f_op field – table of function pointers
  – copied from inode (i_fop) initially (fs-specific)
  – possible to change to customize (per-open)
    • device-drivers do some tricks like this sometimes
• important operations
  – llseek(), read(), write(), readdir(), poll()
  – ioctl() – "wildcard" function for per-fs semantics
  – mmap(), open(), flush(), release(), fsync()
  – fasync() – turn on/off asynchronous i/o notifications
  – lock() – file-locks (more later)
  – readv(), writev() – "scatter/gather i/o"
    • read/write with discontiguous buffers (e.g. packets)
  – sendpage() – page-optimized socket transfer
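
Scatter/gather I/O is easiest to see from the system call side. The sketch below sends a header and a payload held in separate buffers through a pipe with one writev() call and reads them back into separate buffers with readv(); the function name check_scatter_gather and the buffer contents are illustrative assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Gathers two discontiguous buffers into one writev() on a pipe, then
 * scatters the bytes back into two buffers with readv(). Returns 0 if the
 * round trip preserved both pieces, -1 otherwise.
 */
static int check_scatter_gather(void)
{
    int fds[2];
    char hdr_out[] = "HDR:";
    char body_out[] = "payload";
    char hdr_in[4], body_in[7];
    int ok = -1;

    if (pipe(fds) != 0)
        return -1;

    struct iovec out[2] = {
        { .iov_base = hdr_out,  .iov_len = 4 },
        { .iov_base = body_out, .iov_len = 7 },
    };
    struct iovec in[2] = {
        { .iov_base = hdr_in,  .iov_len = sizeof(hdr_in) },
        { .iov_base = body_in, .iov_len = sizeof(body_in) },
    };

    /* 4 + 7 = 11 bytes travel in a single call each way. */
    if (writev(fds[1], out, 2) == 11 && readv(fds[0], in, 2) == 11 &&
        memcmp(hdr_in, "HDR:", 4) == 0 &&
        memcmp(body_in, "payload", 7) == 0)
        ok = 0;

    close(fds[0]);
    close(fds[1]);
    return ok;
}
```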
  
  
Dentry
• abstraction of directory entry
  – ex: line from ls -l
  – either files (hard links) or soft links or subdirectories
  – every dentry has a parent dentry (except root)
  – sibling dentries – other entries in the same directory
• directory api: dentry iterators
  – posix: opendir(), readdir(), scandir(), seekdir(), rewinddir()
  – syscall: getdents()
• why an abstraction?
  – Local filesystems: directories are really files with directory "records"
  – Network filesystems: often have separate directory operations (e.g., NFS, FTP)
  – Having abstraction allows unification, caching
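
The POSIX iterator calls listed above look the same whatever filesystem backs the directory. A minimal sketch, with count_dir_entries as an illustrative helper name:

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <string.h>

/*
 * Counts the entries of a directory using the POSIX iterator calls.
 * Returns -1 if the directory cannot be opened. "." and ".." are
 * skipped when the filesystem reports them.
 */
static int count_dir_entries(const char *path)
{
    DIR *dir = opendir(path);
    struct dirent *ent;
    int n = 0;

    if (dir == NULL)
        return -1;

    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        n++;   /* ent->d_name is the name, ent->d_ino the inode number */
    }
    closedir(dir);
    return n;
}
```

Each struct dirent returned by readdir() carries exactly the name + inode pair that a directory record holds.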
  
  
Dentry (continued)
• not disk-based (no dirty bit)
  – dentry_cache – slab cache
• important fields
  – d_name (qstr), d_count, d_flags
  – d_inode – associated inode
  – d_parent – parent dentry
  – d_child – siblings list
  – d_subdirs – my children (if i'm a subdirectory)
  – d_alias – other names (links) for the same object (inode)?
  – d_lru – unused state linkage
  – d_op – dentry operations (function pointer table)
  – d_fsdata – filesystem-specific data
  
  
Dentry Cache
• very important cache for filesystem performance
  – every file access causes multiple dentry accesses!
  – example: /tmp/foo
    • dentries for "/", "/tmp", "/tmp/foo" (path components)
• dentry cache "controls" inode cache
  – inodes released only when dentry is released
• dentry cache accessed via hash table
  – hash(dir, filename) -> dentry
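
The hash(dir, filename) -> dentry lookup can be sketched as a chained hash table keyed on the parent pointer plus the name. Everything here is hypothetical and heavily simplified: toy_dentry, toy_hash, toy_d_add and toy_d_lookup are invented names, the hash is an FNV-1a-style mix rather than the kernel's actual function, and locking and reference counting are left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_HASH_BUCKETS 64

/* Hypothetical, heavily simplified dentry: just enough for lookup. */
struct toy_dentry {
    struct toy_dentry *d_parent;
    const char *d_name;
    struct toy_dentry *hash_next;   /* chain within one bucket */
};

static struct toy_dentry *toy_hash_table[TOY_HASH_BUCKETS];

/* Mixes the parent pointer with an FNV-1a-style hash of the name. */
static unsigned toy_hash(const struct toy_dentry *parent, const char *name)
{
    uint32_t h = 2166136261u ^ (uint32_t)(uintptr_t)parent;

    for (; *name; name++) {
        h ^= (unsigned char)*name;
        h *= 16777619u;
    }
    return h % TOY_HASH_BUCKETS;
}

/* Inserts d at the head of its bucket chain. */
static void toy_d_add(struct toy_dentry *d)
{
    unsigned b = toy_hash(d->d_parent, d->d_name);

    d->hash_next = toy_hash_table[b];
    toy_hash_table[b] = d;
}

/* Returns the cached dentry for (parent, name), or NULL on a cache miss. */
static struct toy_dentry *toy_d_lookup(const struct toy_dentry *parent,
                                       const char *name)
{
    struct toy_dentry *d = toy_hash_table[toy_hash(parent, name)];

    for (; d != NULL; d = d->hash_next)
        if (d->d_parent == parent && strcmp(d->d_name, name) == 0)
            return d;
    return NULL;
}
```

Resolving /tmp/foo then becomes two lookups: (root, "tmp") followed by (tmp, "foo").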
  
  
Dentry Cache (continued)
• dentry states
  – free (not valid; maintained by slab cache)
  – in-use (associated with valid open inode)
  – unused (valid but not being used; LRU list)
  – negative (file that does not exist)
• dentry ops
  – just a few, mostly default actions
  – ex: d_compare(dir, name1, name2)
    • case-insensitive for MSDOS
  
  
Process-related Files
• current->fs (fs_struct)
  – root (for chroot jails)
  – pwd
  – umask (default file permissions)
• current->files (files_struct)
  – fd[] (file descriptor array – pointers to file objects)
    • 0, 1, 2 – stdin, stdout, stderr
  – originally 32, growable to 1,024 (RLIMIT_NOFILE)
    • complex structure for growing … see book
  – close_on_exec memory (bitmap)
    • open files normally inherited across exec
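
From user space, the close_on_exec bit is reached through fcntl(). The sketch below uses only standard POSIX calls; set_cloexec and is_cloexec are illustrative helper names.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/*
 * Marks fd close-on-exec. Returns 0 on success, -1 on error.
 * The flag belongs to the descriptor slot, not to the open file object,
 * so a descriptor created later by dup() starts with the flag clear.
 */
static int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0 ? -1 : 0;
}

/* Returns 1 if fd is close-on-exec, 0 if not, -1 on error. */
static int is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

A descriptor without the flag stays open in the program started by exec, which is the default inheritance the slide mentions.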
  
  
Filesystem Types
• Linux must "know about" filesystem before mount
  – multiple (mounted) instances of each type possible
• special (virtual) filesystems (like /proc)
  – structuring technique to touch kernel data
  – examples:
    • /proc, /dev (devfs)
    • sockfs, pipefs, tmpfs, rootfs, shmfs
  – associated with fictitious block device (major# 0)
    • minor# distinguishes special filesystem types
  
  
Registering a Filesystem Type
• must register before mount
  – static (compile-time) or dynamic (modules)
• register_filesystem() / unregister_filesystem()
  – adds file_system_type object to linked-list
    • file_systems (head; kernel global variable)
    • file_systems_lock (rw spinlock to protect list)
• file_system_type descriptor
  – name, flags, pointer to implementing module
  – list of superblocks (mounted instances)
  – read_super() – pointer to method for reading superblock
    • most important thing! filesystem specific
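
The registration list can be modeled in a few lines of user-space C. This is a sketch of the idea only: the toy_ names are invented, the descriptor holds just a name and a link, and the locking that file_systems_lock provides in the kernel is omitted.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified descriptor; the kernel's has more fields. */
struct toy_fs_type {
    const char *name;
    struct toy_fs_type *next;
};

static struct toy_fs_type *toy_file_systems;   /* list head */

/*
 * Returns the address of the link that points at the entry called `name`,
 * or of the terminating NULL link if the name is not registered.
 */
static struct toy_fs_type **toy_find_filesystem(const char *name)
{
    struct toy_fs_type **p;

    for (p = &toy_file_systems; *p != NULL; p = &(*p)->next)
        if (strcmp((*p)->name, name) == 0)
            break;
    return p;
}

/* Appends fs to the list; returns -1 if the name is already taken. */
static int toy_register_filesystem(struct toy_fs_type *fs)
{
    struct toy_fs_type **p = toy_find_filesystem(fs->name);

    if (*p != NULL)
        return -1;
    fs->next = NULL;
    *p = fs;
    return 0;
}

/* Unlinks fs from the list; returns -1 if it was not registered. */
static int toy_unregister_filesystem(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    for (p = &toy_file_systems; *p != NULL; p = &(*p)->next) {
        if (*p == fs) {
            *p = fs->next;
            fs->next = NULL;
            return 0;
        }
    }
    return -1;
}
```

A mount request would start with the same search by name, which is why a type has to be registered before it can be mounted.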
  
  
Integration with Memory Subsystem
• The address_space structure
  – One per file, device, etc.
  – Mapping between logical offset in object to page in memory
  – Pages in memory are called “page cache”
  – Files can be large: need efficient data structure
• You don’t have to use address_space for hw4. Use a simple array to maintain your offset->page mapping.
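
One possible shape for such an array-based offset->page mapping is sketched below. It is not the hw4 solution or the kernel's page cache: the toy_ names, the 4 KiB page size and the 256-page cap are assumptions made for the example, and pages are plain calloc() buffers.

```c
#include <stddef.h>
#include <stdlib.h>

#define TOY_PAGE_SHIFT 12                       /* assumes 4 KiB pages */
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)
#define TOY_MAX_PAGES  256                      /* arbitrary cap for the sketch */

/* Hypothetical per-object mapping: slot i caches page i, or is NULL. */
struct toy_mapping {
    void *pages[TOY_MAX_PAGES];
    unsigned long nrpages;                      /* number of cached pages */
};

/* Returns the cached page covering byte `offset`, or NULL if absent. */
static void *toy_find_page(struct toy_mapping *m, unsigned long offset)
{
    unsigned long index = offset >> TOY_PAGE_SHIFT;

    return index < TOY_MAX_PAGES ? m->pages[index] : NULL;
}

/* Returns the page covering `offset`, allocating a zeroed one on a miss. */
static void *toy_find_or_create_page(struct toy_mapping *m, unsigned long offset)
{
    unsigned long index = offset >> TOY_PAGE_SHIFT;

    if (index >= TOY_MAX_PAGES)
        return NULL;
    if (m->pages[index] == NULL) {
        m->pages[index] = calloc(1, TOY_PAGE_SIZE);
        if (m->pages[index] != NULL)
            m->nrpages++;
    }
    return m->pages[index];
}
```

A flat array wastes space for large, sparse files, which is the motivation for the more efficient structure mentioned above.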
  
  
The address_space structure

struct address_space {
    struct inode *host;                            /* owner: inode, block_device */
    struct radix_tree_root page_tree;              /* radix tree of all pages */
    spinlock_t tree_lock;                          /* and lock protecting it */
    unsigned int i_mmap_writable;                  /* count VM_SHARED mappings */
    struct prio_tree_root i_mmap;                  /* tree of private and shared mappings */
    struct list_head i_mmap_nonlinear;             /* list VM_NONLINEAR mappings */
    spinlock_t i_mmap_lock;                        /* protect tree, count, list */
    unsigned long nrpages;                         /* number of total pages */
    pgoff_t writeback_index;                       /* writeback starts here */
    const struct address_space_operations *a_ops;  /* methods */
    unsigned long flags;                           /* error bits/gfp mask */
    struct backing_dev_info *backing_dev_info;     /* device readahead, etc */
};
  
The Page Cache Radix Tree
  
  
address_space_operations structure

struct address_space_operations {
    int (*writepage)(struct page *page, struct writeback_control *wbc);
    int (*readpage)(struct file *, struct page *);

    int (*write_begin)(struct file *, struct address_space *mapping,
                       loff_t pos, unsigned len, unsigned flags,
                       struct page **pagep, void **fsdata);
    int (*write_end)(struct file *, struct address_space *mapping,
                     loff_t pos, unsigned len, unsigned copied,
                     struct page *page, void *fsdata);

    sector_t (*bmap)(struct address_space *, sector_t);
    void (*invalidatepage)(struct page *, unsigned long);
};
  
Buffer Cache Descriptors
  
  
Reverse Mapping for Memory Maps
• Problem: anon_vma is good for limited sharing
  – Memory maps can be shared by large numbers of processes
  – E.g., libc shared by everyone
  – I.e., need to do linear search for every eviction
  – Also, different processes may map different ranges of a memory map into their address space
• Need efficient data structure
  – Basic operation: given an offset in an object (such as a file), or a range of offsets, return vmas that map that range
  – Enter priority search trees
  – Allows efficient interval queries
• Note: you don’t need this for hw4. Use anon_vma
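
To make the "basic operation" concrete, here is the naive version of the interval query as a linear scan. It is a baseline sketch, not the priority search tree itself: toy_vma and toy_vmas_in_range are invented names, and offsets are counted in pages.

```c
#include <stddef.h>

/* Hypothetical, simplified vma: maps file pages [pgoff, pgoff + npages). */
struct toy_vma {
    unsigned long pgoff;
    unsigned long npages;
};

/*
 * Writes into `out` (capacity `max`) the vmas whose page range overlaps
 * [first, last], both inclusive. Returns how many matched in total, which
 * may exceed `max`. Cost is O(n) per query; this is the linear search that
 * a priority search tree is meant to avoid.
 */
static size_t toy_vmas_in_range(const struct toy_vma *vmas, size_t n,
                                unsigned long first, unsigned long last,
                                const struct toy_vma **out, size_t max)
{
    size_t found = 0;

    for (size_t i = 0; i < n; i++) {
        if (vmas[i].npages == 0)
            continue;                       /* empty mapping covers nothing */

        unsigned long v_first = vmas[i].pgoff;
        unsigned long v_last = vmas[i].pgoff + vmas[i].npages - 1;

        /* Two closed intervals overlap when each starts before the other ends. */
        if (v_first <= last && first <= v_last) {
            if (found < max)
                out[found] = &vmas[i];
            found++;
        }
    }
    return found;
}
```

With a widely shared object such as libc, n is the number of mappings across all processes, so running this scan on every eviction is what the tree-based structure replaces.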
  	
  
  
i_mmap Priority Tree
  
  
Part of struct address_space in fs.h
  	
  

12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 

L21-LinuxVFS.pdf

  • 1. Linux VFS
    COMS W4118
    Prof. Kaustubh R. Joshi
    krj@cs.columbia.edu
    http://www.cs.columbia.edu/~krj/os
    4/22/13 COMS W4118. Spring 2013, Columbia University. Instructor: Dr. Kaustubh Joshi, AT&T Labs.
    References: Operating Systems Concepts (9e), Linux Kernel Development, Understanding the Linux Kernel 3rd edition (Bovet and Cesati), previous W4118s
    Copyright notice: care has been taken to use only those web images deemed by the instructor to be in the public domain. If you see a copyrighted image on any slide and are the copyright owner, please contact the instructor. It will be removed.
  • 2. File Systems
    • old days – "the" filesystem!
    • now – many filesystem types, many instances
      – need to copy file from NTFS to Ext3
    • original motivation – NFS support (Sun)
    • idea – filesystem op abstraction layer (VFS)
      – Virtual File System (aka Virtual Filesystem Switch)
      – File-related ops determine filesystem type
      – Dispatch (via function pointers) filesystem-specific op
  • 3. File System Types
    • lots and lots of filesystem types!
      – 2.6 has nearly 100 in the standard kernel tree
    • examples
      – standard: Ext2, ufs (Solaris), svfs (SysV), ffs (BSD)
      – network: RFS, NFS, Andrew, Coda, Samba, Novell
      – journaling: Ext3, Veritas, ReiserFS, XFS, JFS
      – media-specific: jffs, ISO9660 (cd), UDF (dvd)
      – special: /proc, tmpfs, sockfs, etc.
    • proprietary
      – MSDOS, VFAT, NTFS, Mac, Amiga, etc.
    • new generation for Linux
      – Ext3, ReiserFS, XFS, JFS
  • 4. (VFS) Virtual File System
    • Object-oriented way of implementing FSs
    • Same API for different types of file systems
      – Separates file-system generic operations from implementation details
      – Implementation can be one of many file system types, or a network file system
      – Then dispatches operation to appropriate file system implementation routines
    • Syscalls program to VFS API rather than specific FS interface
  • 5. Linux Virtual File System (VFS)
    • Very flexible use cases:
      – User files remote and system files local? No problem.
      – Boot from USB? Network? RAM? No problem.
      – Boot from another file? No problem.
      – Interesting FSes: sshfs, gmailfs, FUSE (user space FS)
  • 6. VFS Stakeholders
    • VFS Objects
      – inode, file, superblock, dentry
      – VFS defines which ops exist on each object
      – Each object has a pointer to a function table
        • Addresses of routines to implement that function on that object
    • VFS Users
      – System calls that provide file related services
      – Use VFS function pointers and objects only
    • VFS Implementers
      – File systems that translate VFS ops into native operations
      – Store on disk, send over network, etc.
      – Provide the functions pointed to by function pointers
  • 7. Linux File System Model
    • basically UNIX file semantics
      – File systems are mounted at various points
      – Files identified by device + inode number
    • VFS layer just dispatches to fs-specific functions
      – libc read() -> sys_read()
        • what type of filesystem does this file belong to?
        • call filesystem (fs) specific read function
        • maintained in open file object (file)
      – example: file->f_op->read(…)
    • similar to device abstraction model in UNIX
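The dispatch step on this slide can be illustrated with a small userspace sketch. This is not kernel code: `toy_file`, `toy_file_ops`, `toy_vfs_read`, and the two toy "filesystems" are hypothetical names chosen for illustration, and only the `f_op->read` shape comes from the slide.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

struct toy_file;

/* Per-filesystem operation table (hypothetical, loosely shaped like f_op). */
struct toy_file_ops {
    ssize_t (*read)(struct toy_file *f, char *buf, size_t len);
};

/* Toy open-file object: dispatch table, file position, and fake contents. */
struct toy_file {
    const struct toy_file_ops *f_op;
    size_t f_pos;
    const char *data;   /* stand-in for on-disk contents */
};

/* Toy filesystem A: returns the stored bytes unchanged. */
ssize_t plainfs_read(struct toy_file *f, char *buf, size_t len)
{
    size_t left = strlen(f->data) - f->f_pos;
    size_t n = len < left ? len : left;

    memcpy(buf, f->data + f->f_pos, n);
    f->f_pos += n;
    return (ssize_t)n;
}

/* Toy filesystem B: same storage, but upper-cases what it returns. */
ssize_t upperfs_read(struct toy_file *f, char *buf, size_t len)
{
    ssize_t n = plainfs_read(f, buf, len);

    for (ssize_t i = 0; i < n; i++)
        buf[i] = (char)toupper((unsigned char)buf[i]);
    return n;
}

const struct toy_file_ops plainfs_ops = { .read = plainfs_read };
const struct toy_file_ops upperfs_ops = { .read = upperfs_read };

/* Generic layer: knows nothing about the filesystem, only the table. */
ssize_t toy_vfs_read(struct toy_file *f, char *buf, size_t len)
{
    if (!f->f_op || !f->f_op->read)
        return -1;
    return f->f_op->read(f, buf, len);
}
```

The caller of `toy_vfs_read` never names a filesystem; the behavior is selected entirely by the table stored in the open-file object.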
  • 8. VFS Users
    • fundamental UNIX abstractions
      – files (everything is a file)
        • ex: /dev/ttyS0 – device as a file
        • ex: /proc/123 – process as a file
      – processes
      – users
    • lots of syscalls related to files! (~100)
      – most dispatch to filesystem-specific calls
      – some require no filesystem action
        • example: lseek(pos) – change position in file
      – others have default VFS implementations
  • 9. VFS System Calls
    • filesystem ops – mounting, info, flushing, chroot, pivot_root
    • directory ops – chdir, getcwd, link, unlink, rename, symlink
    • file ops – open/close, (p)read(v)/(p)write(v), seek, truncate, dup, fcntl, creat
    • inode ops – stat, permissions, chmod, chown
    • memory mapping files – mmap, munmap, madvise, mlock
    • wait for input – poll, select
    • flushing – sync, fsync, msync, fdatasync
    • file locking – flock
  • 10. VFS-related Task Fields
    • task_struct fields
      – fs – includes root, pwd
        • pointers to dentries
      – files – includes file descriptor array fd[]
        • pointers to open file objects
  • 11. VFS Objects: The Big Four
    • struct file
      – information about an open file
      – includes current position (file pointer)
    • struct dentry
      – information about a directory entry
      – includes name + inode#
    • struct inode
      – unique descriptor of a file or directory
      – contains permissions, timestamps, block map (data)
      – inode#: integer (unique per mounted filesystem)
    • struct super_block
      – descriptor of a mounted filesystem
  • 12. Two More Data Structures
    • struct file_system_type
      – name of file system
      – pointer to implementing module
      – including how to read a superblock
      – On module load, you call register_filesystem() and pass a pointer to this structure
    • struct vfsmount
      – Represents a mounted instance of a particular file system
      – One superblock can be mounted in two places, with different covering submounts
      – Thus lookup requires parent dentry and a vfsmount
  • 13. Data Structure Relationships
    (Diagram; labels: task_struct, fds, open file objects, dentries (dentry cache), inodes (inode cache), superblock, disk.)
  • 14. Data Structure Relationships
    (Diagram; objects: open file objects, dentries, inodes, superblock. Pointer fields shown: f_dentry, d_subdirs, d_parent, d_inode, d_sb, i_sb, i_dentries.)
  • 15. Sharing Data Structures
    • calling dup() –
      – shares open file objects
      – example: 2>&1
    • opening the same file twice –
      – shares dentries
    • opening same file via different hard links –
      – shares inodes
    • mounting same filesystem on different dirs –
      – shares superblocks
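The first two cases on this slide can be observed from user space, because the file position lives in the open file object. The sketch below uses standard POSIX calls; the helper names `shares_offset` and `dup_vs_open` and the `/tmp` template path are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Returns 1 if moving fd_a's position also moves fd_b's position
 * (same open file object), 0 if not, -1 on error.
 */
int shares_offset(int fd_a, int fd_b)
{
    off_t before = lseek(fd_b, 0, SEEK_CUR);

    if (before < 0 || lseek(fd_a, before + 1, SEEK_SET) < 0)
        return -1;
    return lseek(fd_b, 0, SEEK_CUR) == before + 1;
}

/*
 * Creates a temporary file, then compares a dup()ed descriptor with a
 * descriptor obtained from a second open() of the same path.
 */
int dup_vs_open(int *dup_shared, int *reopen_shared)
{
    char path[] = "/tmp/vfs-demo-XXXXXX";   /* hypothetical location */
    int fd = mkstemp(path);
    int fd_dup, fd_again, ok;

    if (fd < 0)
        return -1;
    fd_dup = dup(fd);                 /* same open file object */
    fd_again = open(path, O_RDONLY);  /* new open file object  */
    ok = fd_dup >= 0 && fd_again >= 0;
    if (ok) {
        *dup_shared = shares_offset(fd, fd_dup);
        *reopen_shared = shares_offset(fd, fd_again);
    }
    if (fd_dup >= 0)
        close(fd_dup);
    if (fd_again >= 0)
        close(fd_again);
    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```

On a POSIX system the dup()ed pair is expected to report a shared position, while the second open() gets an independent one even though both name the same file.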
  • 16. Superblock
    • mounted filesystem descriptor
      – usually first block on disk (after boot block)
      – copied into (similar) memory structure on mount
        • distinction: disk superblock vs memory superblock
        • dirty bit (s_dirt), copied to disk frequently
    • important fields
      – s_dev, s_bdev – device, device-driver
      – s_blocksize, s_maxbytes, s_type
      – s_flags, s_magic, s_count, s_root, s_dquot
      – s_dirty – dirty inodes for this filesystem
      – s_op – superblock operations
      – u – filesystem specific data
  • 17. Superblock Operations
    • filesystem-specific operations
      – read/write/clear/delete inode
      – write_super, put_super (release)
        • no get_super()! that lives in file_system_type descriptor
      – write_super_lockfs, unlockfs, statfs
      – file_handle ops (NFS-related)
      – show_options
  • 18. Inode
    • "index" node – unique file or directory descriptor
      – meta-data: permissions, owner, timestamps, size, link count
      – data: pointers to disk blocks containing actual data
        • data pointers are "indices" into file contents (hence "inode")
    • inode # – unique integer (per-mounted filesystem)
    • what about names and paths?
      – high-level fluff on top of a "flat-filesystem"
      – implemented by directory files (directories)
      – directory contents: name + inode
  • 19. File Links
    • UNIX link semantics
      – hard links – multiple dir entries with same inode #
        • equal status; first is not "real" entry
        • file deleted when link count goes to 0
        • restrictions
          – can't hard link to directories (avoids cycles)
          – or across filesystems
      – soft (symbolic) links – little files with pathnames
        • just aliases for another pathname
        • no restrictions, cycles possible, dangling links possible
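These link semantics can be checked from user space with link(), symlink(), and stat(). The following is a sketch under assumptions: the `link_report` struct, the `link_demo` function, and the `/tmp` template directory are hypothetical, and the filesystem behind `/tmp` is assumed to support hard and symbolic links.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

struct link_report {
    int same_inode;       /* both names resolve to one (device, inode) pair */
    long nlink_both;      /* link count while both names exist */
    long nlink_after;     /* link count after the first name is removed */
    int symlink_dangles;  /* symlink no longer resolves after removal */
};

/* Creates a/ b (hard link) / s (symlink to a), then removes a. */
int link_demo(struct link_report *r)
{
    char dir[] = "/tmp/vfs-links-XXXXXX";   /* hypothetical location */
    char a[64], b[64], s[64];
    struct stat sa, sb;
    FILE *f;
    int rc = -1;

    if (!mkdtemp(dir))
        return -1;
    snprintf(a, sizeof a, "%s/a", dir);
    snprintf(b, sizeof b, "%s/b", dir);
    snprintf(s, sizeof s, "%s/s", dir);

    f = fopen(a, "w");
    if (f) {
        fclose(f);
        if (link(a, b) == 0 && symlink(a, s) == 0 &&
            stat(a, &sa) == 0 && stat(b, &sb) == 0) {
            r->same_inode = sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
            r->nlink_both = (long)sa.st_nlink;
            unlink(a);                       /* drop the "first" name */
            if (stat(b, &sb) == 0) {         /* file survives via b */
                r->nlink_after = (long)sb.st_nlink;
                /* stat() follows the symlink, whose target is now gone */
                r->symlink_dangles = stat(s, &sa) != 0;
                rc = 0;
            }
        }
    }
    /* Best-effort cleanup; failures here are ignored in this sketch. */
    unlink(a);
    unlink(b);
    unlink(s);
    rmdir(dir);
    return rc;
}
```

Removing the original name leaves the file reachable through the second hard link, which is the "equal status" point on the slide, while the symbolic link is left dangling.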
  • 20. Inode Fields
    • large struct (~50 fields)
    • important fields
      – i_sb, i_ino (number), i_nlink (link count)
      – metadata: i_mode, i_uid, i_gid, i_size, i_times
      – i_flock (lock list), i_wait (waitq – for blocking ops)
      – linkage: i_hash, i_list, i_dentry (aliases)
      – i_op (inode ops), i_fop (default file ops)
      – u (filesystem specific data – includes block map!)
  • 21. Inode Operations
    • create – new inode for regular file
    • link/unlink/rename –
      – add/remove/modify dir entry
    • symlink, readlink, follow_link – soft link ops
    • mkdir/rmdir – new inode for directory file
    • mknod – new inode for device file
    • truncate – modify file size
    • permission – check access permissions
  • 22. (Open) File Object
    • struct file (usual variable name – filp)
      – association between file and process
      – no disk representation
      – created for each open (multiple possible, even same file)
      – most important info: file pointer
    • file descriptor (small ints)
      – index into array of pointers to open file objects
    • file object states
      – unused (memory cache + root reserve (10))
        • get_empty_filp()
      – inuse (per-superblock lists)
    • system-wide max on open file objects (~8K)
      – /proc/sys/fs/file-max
  • 23. File Object Fields
    • important fields
      – f_dentry (directory entry of file)
      – f_vfsmnt (fs mount point)
      – f_op (fs-specific functions – table of function pointers)
      – f_count, f_flags, f_mode (r/w, permissions, etc.)
      – f_pos (current position – file pointer)
      – info for read-ahead (more later)
      – f_uid, f_gid, f_owner
      – f_version (for consistency maintenance)
      – private_data (fs-specific data)
  • 24. File Object Operations
    • f_op field – table of function pointers
      – copied from inode (i_fop) initially (fs-specific)
      – possible to change to customize (per-open)
        • device-drivers do some tricks like this sometimes
    • important operations
      – llseek(), read(), write(), readdir(), poll()
      – ioctl() – "wildcard" function for per-fs semantics
      – mmap(), open(), flush(), release(), fsync()
      – fasync() – turn on/off asynchronous i/o notifications
      – lock() – file-locks (more later)
      – readv(), writev() – "scatter/gather i/o"
        • read/write with discontiguous buffers (e.g. packets)
      – sendpage() – page-optimized socket transfer
  • 25. Dentry
    • abstraction of directory entry
      – ex: line from ls -l
      – either files (hard links) or soft links or subdirectories
      – every dentry has a parent dentry (except root)
      – sibling dentries – other entries in the same directory
    • directory api: dentry iterators
      – posix: opendir(), readdir(), scandir(), seekdir(), rewinddir()
      – syscall: getdents()
    • why an abstraction?
      – Local filesystems: directories are really files with directory "records"
      – Network filesystems: often have separate directory operations (e.g., NFS, FTP)
      – Having abstraction allows unification, caching
  • 26. Dentry (continued)
    • not disk-based (no dirty bit)
      – dentry_cache – slab cache
    • important fields
      – d_name (qstr), d_count, d_flags
      – d_inode – associated inode
      – d_parent – parent dentry
      – d_child – siblings list
      – d_subdirs – my children (if i'm a subdirectory)
      – d_alias – other names (links) for the same object (inode)?
      – d_lru – unused state linkage
      – d_op – dentry operations (function pointer table)
      – d_fsdata – filesystem-specific data
  • 27. Dentry Cache
    • very important cache for filesystem performance
      – every file access causes multiple dentry accesses!
      – example: /tmp/foo
        • dentries for "/", "/tmp", "/tmp/foo" (path components)
    • dentry cache "controls" inode cache
      – inodes released only when dentry is released
    • dentry cache accessed via hash table
      – hash(dir, filename) -> dentry
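The hash(dir, filename) -> dentry lookup and the per-component path walk can be sketched in user space as follows. Everything here is a toy: `toy_dentry`, the bucket count, the hash function, and the `toy_*` helpers are hypothetical and far simpler than the kernel's dcache (no locking, no LRU, no negative dentries).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_NBUCKETS 64

struct toy_dentry {
    const struct toy_dentry *parent;   /* directory this entry lives in */
    const char *name;                  /* last path component */
    struct toy_dentry *next;           /* hash chain */
};

static struct toy_dentry *toy_buckets[TOY_NBUCKETS];

/* Hash on (parent directory, name), as on the slide. */
size_t toy_dhash(const struct toy_dentry *parent, const char *name)
{
    size_t h = (size_t)(uintptr_t)parent;

    while (*name)
        h = h * 31 + (unsigned char)*name++;
    return h % TOY_NBUCKETS;
}

void toy_dcache_add(struct toy_dentry *d)
{
    size_t b = toy_dhash(d->parent, d->name);

    d->next = toy_buckets[b];
    toy_buckets[b] = d;
}

struct toy_dentry *toy_dcache_lookup(const struct toy_dentry *parent,
                                     const char *name)
{
    struct toy_dentry *d = toy_buckets[toy_dhash(parent, name)];

    for (; d; d = d->next)
        if (d->parent == parent && strcmp(d->name, name) == 0)
            return d;
    return NULL;
}

/*
 * Resolves an absolute path one component at a time, starting at root.
 * *lookups counts cache lookups performed; returns NULL on a miss.
 */
struct toy_dentry *toy_path_walk(struct toy_dentry *root, const char *path,
                                 int *lookups)
{
    struct toy_dentry *cur = root;
    char comp[256];

    *lookups = 0;
    while (cur && *path) {
        size_t n;

        while (*path == '/')
            path++;
        n = strcspn(path, "/");
        if (n == 0)
            break;
        if (n >= sizeof comp)
            return NULL;
        memcpy(comp, path, n);
        comp[n] = '\0';
        cur = toy_dcache_lookup(cur, comp);
        (*lookups)++;
        path += n;
    }
    return cur;
}
```

Resolving /tmp/foo in this sketch starts from the root dentry and performs one lookup per remaining component, which is why a single file access touches several dentries.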
  • 28. Dentry Cache (continued)
    • dentry states
      – free (not valid; maintained by slab cache)
      – in-use (associated with valid open inode)
      – unused (valid but not being used; LRU list)
      – negative (file that does not exist)
    • dentry ops
      – just a few, mostly default actions
      – ex: d_compare(dir, name1, name2)
        • case-insensitive for MSDOS
  • 29. Process-related Files
    • current->fs (fs_struct)
      – root (for chroot jails)
      – pwd
      – umask (default file permissions)
    • current->files (files_struct)
      – fd[] (file descriptor array – pointers to file objects)
        • 0, 1, 2 – stdin, stdout, stderr
      – originally 32, growable to 1,024 (RLIMIT_NOFILE)
        • complex structure for growing … see book
      – close_on_exec memory (bitmap)
        • open files normally inherited across exec
  • 30. Filesystem Types
    • Linux must "know about" filesystem before mount
      – multiple (mounted) instances of each type possible
    • special (virtual) filesystems (like /proc)
      – structuring technique to touch kernel data
      – examples:
        • /proc, /dev (devfs)
        • sockfs, pipefs, tmpfs, rootfs, shmfs
      – associated with fictitious block device (major# 0)
        • minor# distinguishes special filesystem types
  • 31. Registering a Filesystem Type
    • must register before mount
      – static (compile-time) or dynamic (modules)
    • register_filesystem() / unregister_filesystem()
      – adds file_system_type object to linked-list
        • file_systems (head; kernel global variable)
        • file_systems_lock (rw spinlock to protect list)
    • file_system_type descriptor
      – name, flags, pointer to implementing module
      – list of superblocks (mounted instances)
      – read_super() – pointer to method for reading superblock
        • most important thing! filesystem specific
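The linked-list bookkeeping described on this slide can be sketched in user space. This is a toy under stated assumptions: `toy_fs_type` and the `toy_*` functions are hypothetical names, the error codes are chosen for illustration, and the locking that file_systems_lock provides in the kernel is omitted.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy descriptor: only the name and the list linkage are modeled. */
struct toy_fs_type {
    const char *name;
    struct toy_fs_type *next;
};

/* Head of the registered-types list (no locking in this sketch). */
static struct toy_fs_type *toy_file_systems;

/*
 * Returns the address of the link that points at the entry called name,
 * or the address of the terminating NULL link if there is none.
 */
static struct toy_fs_type **toy_find_fs(const char *name)
{
    struct toy_fs_type **p;

    for (p = &toy_file_systems; *p; p = &(*p)->next)
        if (strcmp((*p)->name, name) == 0)
            break;
    return p;
}

/* Appends fs to the list; refuses a second type with the same name. */
int toy_register_fs(struct toy_fs_type *fs)
{
    struct toy_fs_type **p = toy_find_fs(fs->name);

    if (*p)
        return -EBUSY;
    fs->next = NULL;
    *p = fs;
    return 0;
}

/* Unlinks fs; fails if this exact object is not on the list. */
int toy_unregister_fs(struct toy_fs_type *fs)
{
    struct toy_fs_type **p = toy_find_fs(fs->name);

    if (*p != fs)
        return -EINVAL;
    *p = fs->next;
    fs->next = NULL;
    return 0;
}

/* What a mount would do first: look the type up by name. */
struct toy_fs_type *toy_get_fs_type(const char *name)
{
    return *toy_find_fs(name);
}
```

Returning a pointer to the link rather than to the node lets the same helper serve lookup, append, and unlink without special-casing the list head.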
  • 32. Integration with Memory Subsystem
    • The address_space structure
      – One per file, device, etc.
      – Mapping between logical offset in object to page in memory
      – Pages in memory are called "page cache"
      – Files can be large: need efficient data structure
    • You don't have to use address_space for hw4. Use a simple array to maintain your offset->page mapping.
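The "simple array" suggestion for an offset->page mapping might look like the userspace sketch below. It is not the hw4 solution and not kernel code: the `toy_mapping` struct, the 4 KiB page size, the 16-page limit, and the function names are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdlib.h>

#define TOY_PAGE_SHIFT 12                       /* assumed 4 KiB pages */
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)
#define TOY_MAX_PAGES  16                       /* assumed object size limit */

/* Array-backed stand-in for an address_space: page index -> page. */
struct toy_mapping {
    unsigned char *pages[TOY_MAX_PAGES];
    unsigned long nrpages;                      /* pages currently cached */
};

/* Returns the cached page covering byte offset, or NULL if absent. */
unsigned char *toy_find_page(struct toy_mapping *m, unsigned long offset)
{
    unsigned long idx = offset >> TOY_PAGE_SHIFT;

    return idx < TOY_MAX_PAGES ? m->pages[idx] : NULL;
}

/* Returns the page covering offset, allocating a zeroed one on a miss. */
unsigned char *toy_find_or_create_page(struct toy_mapping *m,
                                       unsigned long offset)
{
    unsigned long idx = offset >> TOY_PAGE_SHIFT;

    if (idx >= TOY_MAX_PAGES)
        return NULL;
    if (!m->pages[idx]) {
        m->pages[idx] = calloc(1, TOY_PAGE_SIZE);
        if (m->pages[idx])
            m->nrpages++;
    }
    return m->pages[idx];
}
```

An array costs one slot per possible page whether or not it is cached, which is acceptable for small objects and is the reason large files call for a sparser structure.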
  • 33. The address_space structure
    struct address_space {
        struct inode *host;                         /* owner: inode, block_device */
        struct radix_tree_root page_tree;           /* radix tree of all pages */
        spinlock_t tree_lock;                       /* and lock protecting it */
        unsigned int i_mmap_writable;               /* count VM_SHARED mappings */
        struct prio_tree_root i_mmap;               /* tree of private and shared mappings */
        struct list_head i_mmap_nonlinear;          /* list VM_NONLINEAR mappings */
        spinlock_t i_mmap_lock;                     /* protect tree, count, list */
        unsigned long nrpages;                      /* number of total pages */
        pgoff_t writeback_index;                    /* writeback starts here */
        const struct address_space_operations *a_ops;  /* methods */
        unsigned long flags;                        /* error bits/gfp mask */
        struct backing_dev_info *backing_dev_info;  /* device readahead, etc */
    };
  • 34. The Page Cache Radix Tree
  • 35. address_space_operations structure
    struct address_space_operations {
        int (*writepage)(struct page *page, struct writeback_control *wbc);
        int (*readpage)(struct file *, struct page *);

        int (*write_begin)(struct file *, struct address_space *mapping,
                           loff_t pos, unsigned len, unsigned flags,
                           struct page **pagep, void **fsdata);
        int (*write_end)(struct file *, struct address_space *mapping,
                         loff_t pos, unsigned len, unsigned copied,
                         struct page *page, void *fsdata);

        sector_t (*bmap)(struct address_space *, sector_t);

        void (*invalidatepage)(struct page *, unsigned long);
    };
  • 36. Buffer Cache Descriptors
  • 37. Reverse Mapping for Memory Maps
    • Problem: anon_vma is good for limited sharing
      – Memory maps can be shared by large numbers of processes
      – E.g., libc shared by everyone
      – I.e., need to do linear search for every eviction
      – Also, different processes may map different ranges of a memory map into their address space
    • Need efficient data structure
      – Basic operation: given an offset in an object (such as a file), or a range of offsets, return vmas that map that range
      – Enter priority search trees
      – Allows efficient interval queries
    • Note: you don't need this for hw4. Use anon_vma
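The "basic operation" on this slide is an interval-overlap query. The sketch below shows only the query semantics using the naive linear scan that the slide identifies as the problem; it is not a priority search tree, and `toy_vma` and `toy_vmas_mapping_range` are hypothetical names.

```c
#include <stddef.h>

/* Toy VMA: which pages of the file a process has mapped. */
struct toy_vma {
    unsigned long pgoff;    /* first file page covered by the mapping */
    unsigned long npages;   /* number of file pages covered */
    int owner_pid;          /* illustrative owner tag */
};

/*
 * Collects the vmas that map any page in [first, last] (inclusive).
 * Stores up to out_cap matches in out and returns the total number of
 * matches. Cost is O(n) per query, which is the linear search the
 * priority search tree is meant to avoid.
 */
size_t toy_vmas_mapping_range(const struct toy_vma *vmas, size_t n,
                              unsigned long first, unsigned long last,
                              const struct toy_vma **out, size_t out_cap)
{
    size_t found = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned long v_last;

        if (vmas[i].npages == 0)
            continue;
        v_last = vmas[i].pgoff + vmas[i].npages - 1;
        /* Two closed intervals overlap iff each starts before the other ends. */
        if (vmas[i].pgoff <= last && v_last >= first) {
            if (found < out_cap)
                out[found] = &vmas[i];
            found++;
        }
    }
    return found;
}
```

With a widely shared object such as libc, n is the number of mappings across all processes, so running this scan on every eviction is what motivates an index that answers the same query without visiting every vma.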
  • 38. i_mmap Priority Tree
    Part of struct address_space in fs.h