20100425 ocfs2 direct_write



The Story of a Bug in ocfs2 Direct Write
Li Dongyang, OPS Automation QA, [email_address]

Novell Bugzilla, Bug 591039

    (5431,0):ocfs2_truncate_file:465 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
    (5431,0):ocfs2_truncate_file:465 ERROR: Inode 95483, inode i_size = 1105920 != di i_size = 1103252, i_flags = 0x1
    kernel BUG at fs/ocfs2/file.c:465!
    invalid opcode: 0000 [#1] SMP
    last sysfs file: /sys/fs/ocfs2/loaded_cluster_plugins
    Pid: 11076, comm: fsstress Not tainted ( #1)
    EIP: 0061:[<d24701ba>] EFLAGS: 00010296 CPU: 1
    EIP is at ocfs2_setattr+0xc1a/0x1d10 [ocfs2]
    Process fsstress (pid: 11076, ti=c33fc000 task=c31b45b0 task.ti=c33fc000)
    Call Trace:
     [<c00d6191>] notify_change+0x141/0x320
     [<c00bf1a8>] do_truncate+0x68/0xa0
     [<c00bf547>] do_sys_truncate+0x177/0x220
     [<c000666d>] syscall_call+0x7/0xb
     [<f57fe424>] 0xf57fe424

Symptoms
  • The bug is triggered by fsstress from LTP.
  • The failing operation is the truncate system call.
  • The in-memory inode size (i_size_read(inode)) is always slightly larger than the on-disk inode size (di->i_size).
  • notify_change() calls the setattr method in the inode's i_op, which is ocfs2_setattr() in our case.
  • ocfs2_setattr() calls ocfs2_truncate_file(), and we hit the bug expression. BOOM.
  • But how did the two sizes become inconsistent?
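
The check that fires can be modelled in user space as a plain comparison of the two size copies. This is only a sketch: `struct model_inode` and `model_truncate_check()` are hypothetical stand-ins, not ocfs2 code, and the sizes are the ones from the log in Bug 591039.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the two copies of the file size. */
struct model_inode {
    uint64_t mem_i_size;   /* what i_size_read(inode) would return */
    uint64_t disk_i_size;  /* what le64_to_cpu(fe->i_size) would return */
};

/* Returns 0 when both sizes agree, -1 where the real code would BUG. */
int model_truncate_check(const struct model_inode *inode)
{
    assert(inode != NULL);
    if (inode->disk_i_size != inode->mem_i_size) {
        fprintf(stderr, "inode i_size = %llu != di i_size = %llu\n",
                (unsigned long long)inode->mem_i_size,
                (unsigned long long)inode->disk_i_size);
        return -1;
    }
    return 0;
}
```

With `mem_i_size = 1105920` and `disk_i_size = 1103252` the model reports the mismatch; with equal sizes it passes.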

Hunting
  • Look at every place that calls i_size_write() in ocfs2.
  • In ocfs2, we write the i_size of the on-disk inode by hand wherever we call i_size_write().
  • Add printk() around i_size_write() in the VFS.
  • The i_size got modified in generic_file_direct_write():

        written = mapping->a_ops->direct_IO(WRITE, iocb, iov, pos, *nr_segs);
        if (written > 0) {
                loff_t end = pos + written;
                if (end > i_size_read(inode) && !S_ISBLK(inode->i_mode)) {
                        i_size_write(inode, end);
                        mark_inode_dirty(inode);
                }
                *ppos = end;
        }

  • How did that happen?
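
To see the effect of that snippet in isolation, here is a minimal user-space model of the size update. `model_size_after_direct_write()` is a hypothetical name, and the write used in the note below (4096 bytes at offset 1101824) is an assumed example chosen so that its end matches the in-memory size from the log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns the in-memory size after a direct write that transferred
 * `written` bytes starting at `pos`. Mirrors the quoted logic only:
 * the on-disk size is never touched here.
 */
int64_t model_size_after_direct_write(int64_t i_size, int64_t pos,
                                      int64_t written, bool is_blockdev)
{
    assert(pos >= 0);
    if (written > 0) {
        int64_t end = pos + written;
        if (end > i_size && !is_blockdev)
            i_size = end;   /* i_size_write(inode, end) in the kernel */
    }
    return i_size;
}
```

Starting from the on-disk size 1103252, the assumed write moves the in-memory size to 1105920, while a write that stays inside the file leaves it unchanged.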

Direct Write
  • open() with O_DIRECT bypasses the page cache; databases like it.
  • Call trace (innermost call first):

        generic_file_direct_write()    ← i_size change
        __generic_file_aio_write()     ← checks the O_DIRECT flag on the file struct
        ocfs2_file_aio_write()         ← file->f_op->aio_write()
        do_sync_write()                ← file->f_op->write()
        vfs_write()

  • generic_file_direct_write() updates i_size if pos + written > i_size_read(inode).
  • We hit the bug if we do a direct write that extends the inode and then truncate it.

ocfs2_file_aio_write()
  • Check whether the file has the O_DIRECT flag.
  • Can we do a direct write? Not if end > i_size_read(inode).
  • down_read(&inode->i_alloc_sem) if we are doing a direct write.
      ◦ This protects us from a truncate on the same node.
  • Take only a PR lock on the inode's ocfs2_rw_lock when doing a direct write (we are not going to change metadata).
      ◦ This protects i_size against other nodes.
      ◦ ocfs2 uses the DLM to sync metadata across the cluster.
      ◦ Three lock levels are used: NL, PR, and EX.
  • Better performance when multiple nodes write to the same file. What is Oracle famous for? ;-)
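
Why a PR lock helps multi-node writers can be sketched with the usual DLM compatibility rules: NL conflicts with nothing, PR can be shared with other PR holders, and EX excludes both PR and EX. The enum and function names below are hypothetical, not the kernel's identifiers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the three lock levels mentioned on the slide. */
enum lock_level { LOCK_NL, LOCK_PR, LOCK_EX };

/* compat[held][requested]: can `requested` be granted while `held` is held? */
static const bool compat[3][3] = {
    /* requested:   NL     PR     EX   */
    /* NL */     { true,  true,  true  },
    /* PR */     { true,  true,  false },
    /* EX */     { true,  false, false },
};

bool levels_compatible(enum lock_level held, enum lock_level requested)
{
    assert((int)held >= 0 && (int)held <= LOCK_EX);
    assert((int)requested >= 0 && (int)requested <= LOCK_EX);
    return compat[held][requested];
}
```

Two nodes doing direct writes both ask for PR and can proceed together, whereas a writer that needs EX has to wait for every PR holder.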

ocfs2_file_aio_write() (cont.)
  • Call generic_file_direct_write() directly if we can do a direct write (end <= i_size_read(inode)).
  • Call __generic_file_aio_write() otherwise.
      ◦ Buffered write, so no down_read() on inode->i_alloc_sem.
      ◦ Take an EX lock on the inode's ocfs2_rw_lock.
  • __generic_file_aio_write() checks whether the file has the O_DIRECT flag and tries generic_file_direct_write() first, then falls back to a buffered write.
  • So if end > i_size_read(inode), we are still doing a direct write.
  • So what? __generic_file_aio_write() falls back to a buffered write if the direct write fails.

ocfs2_direct_IO()

        if (i_size_read(inode) <= offset)
                return 0;

  • Checking only the offset is not enough.
  • But wait, we have ocfs2_direct_IO_get_blocks():

        ret = blockdev_direct_IO_no_locking(rw, iocb, inode,
                                            inode->i_sb->s_bdev, iov, offset,
                                            nr_segs,
                                            ocfs2_direct_IO_get_blocks, ocfs2_dio_end_io);

  • In most cases, the direct_IO method is a wrapper around __blockdev_direct_IO().
  • direct_IO passes __blockdev_direct_IO() a function pointer that translates the blocks of the inode.

ocfs2_direct_IO_get_blocks()
  • We do have a check for whether we are extending the inode, but it is not in the Linus tree:

        if (create && (iblock + max_blocks) > inode_blocks) {
                ret = -EIO;
                goto bail;
        }

  • What if the inode ends in a partial block, and we write up to the end of that block?
  • We still get a pass, because at this level we are talking in blocks, not bytes.
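
The partial-block case can be worked through with the sizes from the bug report. The sketch below assumes 4 KiB filesystem blocks and assumes that inode_blocks is i_size rounded up to whole blocks; neither is stated on the slides, but both fit the log, since 1103252 bytes round up to 270 blocks and 270 × 4096 = 1105920. All function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_BLOCK_SIZE 4096ULL   /* assumption: 4 KiB blocks */

/* Number of blocks needed to hold `bytes` (rounds up; an assumption). */
uint64_t blocks_for_bytes(uint64_t bytes)
{
    return (bytes + MODEL_BLOCK_SIZE - 1) / MODEL_BLOCK_SIZE;
}

/* Block-level check from the slide: true when the write is rejected. */
bool block_check_rejects(uint64_t i_size, uint64_t offset, uint64_t len)
{
    /* This model only handles block-aligned writes. */
    assert(len > 0);
    assert(offset % MODEL_BLOCK_SIZE == 0 && len % MODEL_BLOCK_SIZE == 0);

    uint64_t iblock = offset / MODEL_BLOCK_SIZE;
    uint64_t max_blocks = len / MODEL_BLOCK_SIZE;
    return iblock + max_blocks > blocks_for_bytes(i_size);
}

/* Byte-level check: true when the write would extend the file. */
bool byte_check_extends(uint64_t i_size, uint64_t offset, uint64_t len)
{
    return offset + len > i_size;
}
```

For i_size = 1103252, a one-block write at offset 1101824 (the start of block 269) covers blocks up to 270, so the block check lets it through, yet its end at byte 1105920 lies past i_size. A write starting at 1105920 is rejected by the block check as expected.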

The fix
  • We could check whether offset + length > i_size in ocfs2_direct_IO().
  • But ocfs2_file_aio_write() does not down_read(&inode->i_alloc_sem) when it decides we cannot do a direct write.
  • So for a direct write that extends i_size, ocfs2_file_aio_write() has only prepared for a buffered write, yet __generic_file_aio_write() tries a direct write first, and we end up doing a direct write without holding i_alloc_sem.
  • That races with allocation changes such as truncate.
  • So when ocfs2_file_aio_write() decides we cannot do a direct write, we call generic_file_buffered_write() instead of __generic_file_aio_write().
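
The difference the fix makes can be sketched as a small decision model. Everything here is hypothetical naming; the `fixed` flag selects between the old call to __generic_file_aio_write() and the new call to generic_file_buffered_write(), and the model assumes the retried direct write succeeds, as it does in the partial-block case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum io_path { PATH_DIRECT, PATH_BUFFERED };

struct write_outcome {
    enum io_path path;      /* path that ends up doing the I/O */
    bool holds_alloc_sem;   /* was down_read(&inode->i_alloc_sem) taken? */
};

/* The slide's rule: direct write only if the end stays within i_size. */
static bool can_direct(int64_t i_size, int64_t pos, int64_t count)
{
    return pos + count <= i_size;
}

struct write_outcome model_aio_write(bool o_direct, int64_t i_size,
                                     int64_t pos, int64_t count, bool fixed)
{
    struct write_outcome out = { PATH_BUFFERED, false };

    assert(pos >= 0 && count > 0);

    if (o_direct && can_direct(i_size, pos, count)) {
        /* ocfs2 takes i_alloc_sem and goes straight to direct I/O. */
        out.path = PATH_DIRECT;
        out.holds_alloc_sem = true;
        return out;
    }

    /* ocfs2 has prepared for a buffered write: i_alloc_sem is not held. */
    if (!fixed && o_direct) {
        /* Old path: the generic helper sees O_DIRECT and tries direct I/O. */
        out.path = PATH_DIRECT;
    }
    return out;
}
```

Before the fix, an extending O_DIRECT write ends up on the direct path without i_alloc_sem; after the fix the same write is buffered, and only non-extending writes go direct, with the semaphore held.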

Thanks to
  • Colyli
  • Jan Kara
  • JiaJu Zhang
  • Joel Becker
  • Mark Fasheh
  • Tao Ma

Q & A