20100425 ocfs2 direct_write
 

    20100425 ocfs2 direct_write Presentation Transcript

    • The Story of a Bug in ocfs2 Direct Write Li Dongyang OPS Automation QA [email_address]
    • Novell Bugzilla, Bug 591039
        (5431,0):ocfs2_truncate_file:465 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
        (5431,0):ocfs2_truncate_file:465 ERROR: Inode 95483, inode i_size = 1105920 != di i_size = 1103252, i_flags = 0x1
        kernel BUG at fs/ocfs2/file.c:465!
        invalid opcode: 0000 [#1] SMP
        last sysfs file: /sys/fs/ocfs2/loaded_cluster_plugins
        Pid: 11076, comm: fsstress Not tainted (2.6.33.1-ocfs2 #1)
        EIP: 0061:[<d24701ba>] EFLAGS: 00010296 CPU: 1
        EIP is at ocfs2_setattr+0xc1a/0x1d10 [ocfs2]
        Process fsstress (pid: 11076, ti=c33fc000 task=c31b45b0 task.ti=c33fc000)
        Call Trace:
         [<c00d6191>] notify_change+0x141/0x320
         [<c00bf1a8>] do_truncate+0x68/0xa0
         [<c00bf547>] do_sys_truncate+0x177/0x220
         [<c000666d>] syscall_call+0x7/0xb
         [<f57fe424>] 0xf57fe424
    • Symptoms
      • Triggered by fsstress from LTP
      • The crash happens in the truncate system call
      • The in-memory inode size (i_size_read(inode)) is always a bit greater than the on-disk inode size (di->i_size)
      • notify_change calls setattr in the i_op of the inode, which is ocfs2_setattr in our case
      • ocfs2_setattr calls ocfs2_truncate_file, and we meet the bug expression, BOOM
      • But how did they become inconsistent?
    • Hunting
      • Look at every place calling i_size_write in ocfs2
      • ocfs2 updates the on-disk inode's i_size by hand wherever it calls i_size_write, so those call sites look consistent
      • printk around i_size_write in VFS
      • The i_size got modified in generic_file_direct_write
      • written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);
      • if (written > 0) {
        • loff_t end = pos + written;
        • if (end > i_size_read(inode) && !S_ISBLK(inode->i_mode)) {
          • i_size_write(inode, end);
          • mark_inode_dirty(inode);
        • }
        • *ppos = end;
      • }
      • How did that happen?
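The divergence described in the Hunting bullets can be sketched as a small user-space model. This is not kernel code: `struct model_inode`, `model_direct_write` and `model_truncate_would_bug` are hypothetical names invented for illustration. The sizes in the usage note come from the bug report, while the write position of 1101824 and the 4096-byte write length are assumptions chosen to match those sizes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model of the two sizes that went out of sync. */
struct model_inode {
    int64_t mem_size;   /* stands in for i_size_read(inode) */
    int64_t disk_size;  /* stands in for le64_to_cpu(fe->i_size) */
};

/*
 * Mirrors the generic_file_direct_write snippet quoted above: when the
 * write ends past the in-memory size, only the in-memory size is moved.
 * Nothing here touches disk_size.
 */
static int64_t model_direct_write(struct model_inode *ino,
                                  int64_t pos, int64_t written)
{
    assert(pos >= 0 && written >= 0);
    if (written > 0) {
        int64_t end = pos + written;
        if (end > ino->mem_size)
            ino->mem_size = end;
        return end;     /* new file position, like *ppos = end */
    }
    return pos;
}

/* Mirrors the bug expression checked in ocfs2_truncate_file. */
static bool model_truncate_would_bug(const struct model_inode *ino)
{
    return ino->disk_size != ino->mem_size;
}
```

Starting from `{ 1103252, 1103252 }`, a call to `model_direct_write(&ino, 1101824, 4096)` leaves `mem_size` at 1105920 and `disk_size` at 1103252, which is the pair of values printed in the bug report, and `model_truncate_would_bug` then returns true.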
    • Direct Write
      • open() with O_DIRECT, bypass page cache, databases like it.
      • Call trace:
      • generic_file_direct_write() ← i_size change
      • __generic_file_aio_write() ← check the O_DIRECT flag on file struct
      • ocfs2_file_aio_write() ← file->f_op->aio_write()
      • do_sync_write() ← file->f_op->write()
      • vfs_write()
      • generic_file_direct_write() will update the i_size if pos + written > i_size_read(inode)
      • We will meet the bug if we do a direct write extending the inode, then a truncate on it
    • ocfs2_file_aio_write()
      • Check if we have O_DIRECT flag
      • Can we do direct write? Not if end > i_size_read(inode)
      • down_read(&inode->i_alloc_sem) when doing a direct write
        • To protect us from truncate on the same node
      • Only takes a PR lock on the ocfs2_rw_lock of the inode when doing a direct write (we are not going to change metadata)
        • To protect i_size against other nodes
        • ocfs2 uses dlm to sync the metadata in the cluster
        • 3 levels of locks are used, NL, PR, EX
      • Better performance when multiple nodes write to a file. What is Oracle famous for? ;-)
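Why a PR lock lets several nodes write at once can be illustrated with the usual DLM compatibility rules for the three levels named above: NL (null) conflicts with nothing, PR (protected read) can be shared with other PR holders, and EX (exclusive) can be shared with nothing but NL. The sketch below is a stand-alone illustration; the enum and function names are hypothetical and are not the ocfs2 or DLM API.

```c
#include <stdbool.h>

/* The three lock levels named on the slide. */
enum lock_level { LOCK_NL = 0, LOCK_PR = 1, LOCK_EX = 2 };

/*
 * Returns true when two nodes may hold these levels on the same
 * lock resource at the same time.
 */
static bool levels_compatible(enum lock_level a, enum lock_level b)
{
    if (a == LOCK_NL || b == LOCK_NL)
        return true;                      /* NL conflicts with nothing */
    return a == LOCK_PR && b == LOCK_PR;  /* PR shares only with PR */
}
```

Under these rules two nodes doing direct writes can both hold PR on the inode's rw lock, while a node that needs EX has to wait until every PR holder has dropped its lock.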
    • ocfs2_file_aio_write() cont.
      • Call generic_file_direct_write directly if we can do direct write (end <= i_size_read(inode))
      • Call __generic_file_aio_write otherwise
        • Buffered write, no down_read on inode->i_alloc_sem
        • Take EX lock on the ocfs2_rw_lock of the inode
      • __generic_file_aio_write will check whether the file has the O_DIRECT flag and try generic_file_direct_write first, then fall back to buffered write
      • So if (end > i_size_read(inode)), we are still doing a direct write
      • So what? __generic_file_aio_write will fall back to buffered write if the direct write fails
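The dispatch described on these two slides can be summarised as a small decision function. This is a simplified user-space model built only from the bullets above, not the kernel source; `struct write_plan` and `plan_write_before_fix` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

enum rw_lock_level { RW_PR, RW_EX };

/* What ocfs2_file_aio_write sets up before handing off the write. */
struct write_plan {
    enum rw_lock_level rw_level;  /* level taken on the inode's rw lock */
    bool holds_alloc_sem;         /* down_read(&inode->i_alloc_sem) taken? */
    bool attempts_direct_io;      /* will direct_IO still be attempted? */
};

static struct write_plan plan_write_before_fix(bool o_direct, int64_t pos,
                                               int64_t count, int64_t i_size)
{
    struct write_plan p = { RW_EX, false, false };
    bool can_direct = o_direct && (pos + count <= i_size);

    if (can_direct) {
        /* Direct path: PR lock plus i_alloc_sem, then generic_file_direct_write. */
        p.rw_level = RW_PR;
        p.holds_alloc_sem = true;
        p.attempts_direct_io = true;
    } else {
        /*
         * "Buffered" path: EX lock, no i_alloc_sem. But
         * __generic_file_aio_write looks at O_DIRECT itself and
         * tries the direct write first.
         */
        p.attempts_direct_io = o_direct;
    }
    return p;
}
```

For an O_DIRECT write that extends the file, the model returns a plan with `attempts_direct_io` set and `holds_alloc_sem` clear, which is the combination the rest of the talk is about.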
    • ocfs2_direct_IO()
        if (i_size_read(inode) <= offset)
          return 0;
      • Only checking the offset is not enough
      • But wait, we have ocfs2_direct_IO_get_blocks()
        • ret = blockdev_direct_IO_no_locking(rw, iocb, inode,
        • inode->i_sb->s_bdev, iov, offset,
        • nr_segs,
        • ocfs2_direct_IO_get_blocks, ocfs2_dio_end_io);
      • In most cases, direct_IO method is a wrapper for the __blockdev_direct_IO() function
      • direct_IO will pass a function pointer to __blockdev_direct_IO() for translating the blocks of the inode
    • ocfs2_direct_IO_get_blocks()
      • We do have a check to see whether we are extending the inode, but it's not in the Linus tree
      • if (create && (iblock + max_blocks) > inode_blocks) {
        • ret = -EIO;
        • goto bail;
      • }
      • What if the inode has a partial block at the end, and we are writing up to the end of that block?
      • We still pass the check, because at that level we are working in blocks, not bytes
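The partial-block case can be worked through with the sizes from the bug report. The sketch below assumes a 4096-byte block size and assumes that `inode_blocks` is i_size rounded up to whole blocks; both are assumptions made for illustration, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of blocks needed to hold `bytes`, rounding up. */
static uint64_t blocks_for_bytes(uint64_t bytes, uint32_t block_size)
{
    assert(block_size > 0);
    return (bytes + block_size - 1) / block_size;
}

/* Block-granular check in the style of the slide's snippet: true means rejected. */
static bool block_check_rejects(uint64_t iblock, uint64_t max_blocks,
                                uint64_t i_size, uint32_t block_size)
{
    return iblock + max_blocks > blocks_for_bytes(i_size, block_size);
}

/* Byte-granular check: true means the write ends past i_size. */
static bool byte_check_rejects(uint64_t offset, uint64_t length,
                               uint64_t i_size)
{
    return offset + length > i_size;
}
```

With i_size = 1103252 the file occupies 270 blocks, the last one only partly used. A one-block write at block 269 gives 269 + 1 = 270, which is not greater than 270, so `block_check_rejects` lets it through. In bytes the same write ends at 270 * 4096 = 1105920, past i_size, so `byte_check_rejects` flags it. 1105920 is also the in-memory i_size printed in the bug report.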
    • The fix
      • We could check whether offset + length > i_size in ocfs2_direct_IO()
      • But ocfs2_file_aio_write does not take down_read(&inode->i_alloc_sem) when it decides a direct write is not possible
      • So for a direct write that extends i_size, ocfs2_file_aio_write() prepares only for a buffered write, yet __generic_file_aio_write tries the direct write first, and the direct write runs without down_read on i_alloc_sem
      • That races with allocation changes such as truncate
      • So when ocfs2_file_aio_write() decides we cannot do a direct write, it calls generic_file_buffered_write() instead of __generic_file_aio_write
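The property the fix aims for is that direct I/O is only ever issued on the path that holds i_alloc_sem. A minimal user-space sketch of the post-fix decision follows, based only on the bullets above; the enum and function names are hypothetical and it does not reproduce the actual patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the two helpers the slide names. */
enum write_path {
    PATH_DIRECT_WRITE,    /* generic_file_direct_write */
    PATH_BUFFERED_WRITE   /* generic_file_buffered_write */
};

/* Decision after the fix: an extending write never reaches the direct path. */
static enum write_path choose_path_after_fix(bool o_direct, int64_t pos,
                                             int64_t count, int64_t i_size)
{
    if (o_direct && pos + count <= i_size)
        return PATH_DIRECT_WRITE;
    return PATH_BUFFERED_WRITE;
}

/* Per the slides, i_alloc_sem is taken for reading only on the direct path. */
static bool path_holds_alloc_sem(enum write_path path)
{
    return path == PATH_DIRECT_WRITE;
}

/* Only the direct path issues direct I/O once the fallback is gone. */
static bool path_issues_direct_io(enum write_path path)
{
    return path == PATH_DIRECT_WRITE;
}
```

In this model every path that issues direct I/O also holds i_alloc_sem, so an extending O_DIRECT write is simply served through the page cache.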
    • Thanks to
      • Colyli
      • Jan Kara
      • JiaJu Zhang
      • Joel Becker
      • Mark Fasheh
      • Tao Ma
    • Q & A