Eat my data

Published in: Technology, Education

  1. 1. Eat My Data: (now with 20% more rant!) How everybody gets file I/O wrong Stewart Smith [email_address] Senior Software Engineer, MySQL Cluster MySQL AB
  2. 2. What I work on <ul><li>MySQL Cluster </li></ul><ul><li>High Availability </li></ul><ul><li>(Shared Nothing) Clustered Database </li></ul><ul><li>with some real time properties </li></ul>
  3. 3. Overview <ul><li>Common mistakes that lead to data loss </li></ul>
  4. 4. Overview <ul><li>Common mistakes that lead to data loss </li></ul><ul><li>Mistakes by: </li></ul><ul><ul><li>the application programmer </li></ul></ul>
  5. 5. Overview <ul><li>Common mistakes that lead to data loss </li></ul><ul><li>Mistakes by: </li></ul><ul><ul><li>the application programmer </li></ul></ul><ul><ul><li>the library programmer </li></ul></ul>
  6. 6. Overview <ul><li>Common mistakes that lead to data loss </li></ul><ul><li>Mistakes by: </li></ul><ul><ul><li>the application programmer </li></ul></ul><ul><ul><li>the library programmer </li></ul></ul><ul><ul><li>the kernel programmer </li></ul></ul>
  7. 7. Overview <ul><li>Common mistakes that lead to data loss </li></ul><ul><li>Mistakes by: </li></ul><ul><ul><li>the application programmer </li></ul></ul><ul><ul><li>the library programmer </li></ul></ul><ul><ul><li>the kernel programmer </li></ul></ul><ul><li>Mostly just concentrating on Linux </li></ul><ul><ul><li>will mention war stories on other platforms too </li></ul></ul>
  8. 8. In the beginning <ul><li>All IO was synchronous </li></ul>
  9. 9. In the beginning <ul><li>All IO was synchronous </li></ul><ul><li>When you called write things hit the platter </li></ul>
  10. 10. In the beginning <ul><li>All IO was synchronous </li></ul><ul><li>When you called write things hit the platter </li></ul><ul><li>Turns out that this is slow for a lot of cases </li></ul>
  11. 11. A world without failure
  12. 12. A world without failure <ul><li>doesn't exist </li></ul>
  13. 13. A world without failure <ul><li>doesn't exist </li></ul><ul><li>computers crash </li></ul>
  14. 14. A world without failure <ul><li>doesn't exist </li></ul><ul><li>computers crash </li></ul><ul><li>power goes out </li></ul>
  15. 15. A world without failure <ul><li>doesn't exist </li></ul><ul><li>computers crash </li></ul><ul><li>power goes out </li></ul><ul><ul><li>battery goes flat </li></ul></ul>
  16. 16. A world without failure <ul><li>doesn't exist </li></ul><ul><li>computers crash </li></ul><ul><li>power goes out </li></ul><ul><ul><li>battery goes flat </li></ul></ul><ul><ul><li>kick-the-cord out </li></ul></ul>
  17. 17. A world without failure <ul><li>doesn't exist </li></ul><ul><li>computers crash </li></ul><ul><li>power goes out </li></ul><ul><ul><li>battery goes flat </li></ul></ul><ul><ul><li>kick-the-cord out </li></ul></ul><ul><ul><li>suspend works, resume doesn't </li></ul></ul>
  18. 18. A world without failure <ul><li>doesn't exist </li></ul><ul><li>computers crash </li></ul><ul><li>power goes out </li></ul><ul><ul><li>battery goes flat </li></ul></ul><ul><ul><li>kick-the-cord out </li></ul></ul><ul><ul><li>suspend works, resume doesn't </li></ul></ul><ul><li>When the data is important, live in the world of failure </li></ul>
  19. 19. Data Consistency <ul><li>In the event of failure, what state can I expect my data to be in? </li></ul>
  20. 20. User Expectations <ul><li>If the power goes out, the last saved version of my data is there </li></ul>
  21. 21. User Expectations <ul><li>If the power goes out, the last saved version of my data is there </li></ul><ul><li>If there isn't an explicit save (e.g. RSS readers, IM logs) some recent version should be okay. </li></ul>
  22. 22. User Expectations <ul><li>If the power goes out, the last saved version of my data is there </li></ul><ul><li>If there isn't an explicit save (e.g. RSS readers, IM logs) some recent version should be okay. </li></ul><ul><li>Not Acceptable: </li></ul><ul><ul><li>I hit save, why is none of my work there? </li></ul></ul><ul><ul><li>Why have all my IM logs disappeared? </li></ul></ul><ul><ul><li>Why have all my saved passwords disappeared? </li></ul></ul>
  23. 23. Databases
  24. 24. Databases <ul><li>ACID </li></ul>
  25. 25. Databases <ul><li>ACID </li></ul><ul><ul><li>D is for Durability </li></ul></ul>
  26. 26. Databases <ul><li>ACID </li></ul><ul><ul><li>D is for Durability </li></ul></ul><ul><li>Transactions </li></ul><ul><ul><li>A committed transaction survives failure </li></ul></ul>
  27. 27. Databases <ul><li>ACID </li></ul><ul><ul><li>D is for Durability </li></ul></ul><ul><li>Transactions </li></ul><ul><ul><li>A committed transaction survives failure </li></ul></ul><ul><li>Isolation Levels </li></ul><ul><ul><li>Repeatable Read, Read Committed etc </li></ul></ul>
  28. 28. Databases <ul><li>ACID </li></ul><ul><ul><li>D is for Durability </li></ul></ul><ul><li>Transactions </li></ul><ul><ul><li>A committed transaction survives failure </li></ul></ul><ul><li>Isolation Levels </li></ul><ul><ul><li>Repeatable Read, Read Committed etc </li></ul></ul><ul><li>Very well known consistency issues </li></ul>
  29. 29. Databases <ul><li>ACID </li></ul><ul><ul><li>D is for Durability </li></ul></ul><ul><li>Transactions </li></ul><ul><ul><li>A committed transaction survives failure </li></ul></ul><ul><li>Isolation Levels </li></ul><ul><ul><li>Repeatable Read, Read Committed etc </li></ul></ul><ul><li>Very well known consistency issues </li></ul><ul><ul><li>that lots of people still get wrong </li></ul></ul>
  30. 30. Databases <ul><li>ACID </li></ul><ul><ul><li>D is for Durability </li></ul></ul><ul><li>Transactions </li></ul><ul><ul><li>A committed transaction survives failure </li></ul></ul><ul><li>Isolation Levels </li></ul><ul><ul><li>Repeatable Read, Read Committed etc </li></ul></ul><ul><li>Very well known consistency issues </li></ul><ul><ul><li>that lots of people still get wrong </li></ul></ul><ul><ul><li>different engines have different properties </li></ul></ul>
  31. 31. What databases are good at <ul><li>Lots of small objects (rows) </li></ul>
  32. 32. What databases are good at <ul><li>Lots of small objects (rows) </li></ul><ul><li>okay at larger objects (BLOBs) </li></ul>
  33. 33. What databases are good at <ul><li>Lots of small objects (rows) </li></ul><ul><li>okay at larger objects (BLOBs) </li></ul><ul><ul><li>named after “The Blob”, not Binary Large Objects </li></ul></ul>
  34. 34. What databases are good at <ul><li>Lots of small objects (rows) </li></ul><ul><li>okay at larger objects (BLOBs) </li></ul><ul><ul><li>named after “The Blob”, not Binary Large Objects </li></ul></ul><ul><li>Accessed in a variety of ways </li></ul><ul><ul><li>indexes </li></ul></ul>
  35. 35. Easy solution to data consistency <ul><li>put it in a database </li></ul><ul><ul><li>that gives data consistency guarantees </li></ul></ul><ul><li>We'll talk about this later </li></ul>
  36. 36. Revelation #1 <ul><li>Databases are not file systems! </li></ul>
  37. 37. Revelation #2 <ul><li>File systems are not databases! </li></ul>
  38. 38. Revelation #3 <ul><li>A database has different consistency semantics than a file system </li></ul>
  39. 39. Revelation #3 <ul><li>A database has different consistency semantics than a file system </li></ul><ul><ul><li>file systems are typically a lot more relaxed </li></ul></ul>
  40. 40. Eat my data <ul><li>Where can the data be to eat? </li></ul>
  41. 41. Where data can be <ul><li>Application Buffer - CoolApp </li></ul><ul><ul><li>application crash = loss of this data </li></ul></ul>
  42. 42. Where data can be <ul><li>Application Buffer - CoolApp </li></ul><ul><ul><li>application crash = loss of this data </li></ul></ul><ul><li>Library buffer - glibc </li></ul><ul><ul><li>application crash = loss of this data </li></ul></ul>
  43. 43. Where data can be <ul><li>Application Buffer - CoolApp </li></ul><ul><ul><li>application crash = loss of this data </li></ul></ul><ul><li>Library buffer - glibc </li></ul><ul><ul><li>application crash = loss of this data </li></ul></ul><ul><li>Operating System buffer – page/buffer cache </li></ul><ul><ul><li>system crash = loss of this data </li></ul></ul>
  44. 44. Where data can be <ul><li>Application Buffer - CoolApp </li></ul><ul><ul><li>application crash = loss of this data </li></ul></ul><ul><li>Library buffer - glibc </li></ul><ul><ul><li>application crash = loss of this data </li></ul></ul><ul><li>Operating System buffer – page/buffer cache </li></ul><ul><ul><li>system crash = loss of this data </li></ul></ul><ul><li>on disk </li></ul><ul><ul><li>disk failure = loss of data </li></ul></ul>
  45. 45. Data flow <ul><li>Application to library buffer </li></ul><ul><ul><li>fwrite(), fprintf() and friends </li></ul></ul>
  46. 46. Data flow <ul><li>Application to library buffer </li></ul><ul><ul><li>fwrite(), fprintf() and friends </li></ul></ul><ul><li>Library to OS buffer </li></ul><ul><ul><li>write(2) </li></ul></ul>
  47. 47. Data flow <ul><li>Application to library buffer </li></ul><ul><ul><li>fwrite(), fprintf() and friends </li></ul></ul><ul><li>Library to OS buffer </li></ul><ul><ul><li>write(2) </li></ul></ul><ul><li>OS buffer to disk </li></ul>
  48. 48. Data flow <ul><li>Application to library buffer </li></ul><ul><ul><li>fwrite(), fprintf() and friends </li></ul></ul><ul><li>Library to OS buffer </li></ul><ul><ul><li>write(2) </li></ul></ul><ul><li>OS buffer to disk </li></ul><ul><ul><li>paged out, periodic flushing (5 or 30secs usually) </li></ul></ul>
  49. 49. Data flow <ul><li>Application to library buffer </li></ul><ul><ul><li>fwrite(), fprintf() and friends </li></ul></ul><ul><li>Library to OS buffer </li></ul><ul><ul><li>write(2) </li></ul></ul><ul><li>OS buffer to disk </li></ul><ul><ul><li>paged out, periodic flushing (5 or 30secs usually) </li></ul></ul><ul><ul><ul><li>Can be very delayed with laptop mode </li></ul></ul></ul>
  50. 50. Data flow <ul><li>Application to library buffer </li></ul><ul><ul><li>fwrite(), fprintf() and friends </li></ul></ul><ul><li>Library to OS buffer </li></ul><ul><ul><li>write(2) </li></ul></ul><ul><li>OS buffer to disk </li></ul><ul><ul><li>paged out, periodic flushing (5 or 30secs usually) </li></ul></ul><ul><ul><ul><li>Can be very delayed with laptop mode </li></ul></ul></ul><ul><ul><li>fsync(2), fdatasync(2), sync(2) </li></ul></ul>
  51. 51. Data flow <ul><li>Application to library buffer </li></ul><ul><ul><li>fwrite(), fprintf() and friends </li></ul></ul><ul><li>Library to OS buffer </li></ul><ul><ul><li>write(2) </li></ul></ul><ul><li>OS buffer to disk </li></ul><ul><ul><li>paged out, periodic flushing (5 or 30secs usually) </li></ul></ul><ul><ul><ul><li>Can be very delayed with laptop mode </li></ul></ul></ul><ul><ul><li>fsync(2), fdatasync(2), sync(2) </li></ul></ul><ul><ul><ul><li>with caveats! </li></ul></ul></ul>
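The three hops listed above can be sketched in C roughly as follows. This is a minimal illustration, not code from the talk: the helper name `write_through_layers` and its parameters are made up for the example, and it assumes a POSIX system where `fsync()` actually reaches the disk.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: push `msg` through each buffering layer in turn.
 * Returns 0 on success, -1 on the first failure. */
int write_through_layers(const char *path, const char *msg)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    size_t len = strlen(msg);

    /* 1. Application -> library buffer: data may still sit inside libc. */
    if (fwrite(msg, 1, len, f) != len) {
        fclose(f);
        return -1;
    }

    /* 2. Library buffer -> OS buffer: fflush() makes libc call write(2). */
    if (fflush(f) != 0) {
        fclose(f);
        return -1;
    }

    /* 3. OS buffer -> disk: fsync() asks the kernel to write the file out. */
    if (fsync(fileno(f)) == -1) {
        fclose(f);
        return -1;
    }

    return fclose(f) == 0 ? 0 : -1;
}
```

Stopping after step 1 or step 2 leaves the data exposed to an application crash or a system crash respectively.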
  52. 52. Simple Application: Save==on disk <ul><li>User hits “Save” in Word Processor </li></ul><ul><ul><li>Expects that data to be on disk when “Saved” </li></ul></ul><ul><li>How? </li></ul>
  53. 53. Saving a simple document <ul><li>struct wp_doc { </li></ul><ul><ul><li>char *document; </li></ul></ul><ul><ul><li>size_t len; </li></ul></ul><ul><li>}; </li></ul><ul><li>struct wp_doc d; </li></ul><ul><li>... </li></ul><ul><li>FILE *f; </li></ul><ul><li>f= fopen("important_document","w"); </li></ul><ul><li>fwrite(d.document,d.len,1,f); </li></ul>
  54. 54. Bug #1 <ul><li>No fclose(3) </li></ul><ul><ul><li>Buffers for the stream may not be flushed from libc cache </li></ul></ul>
  55. 55. Word Processor Saving -1 Bug <ul><li>struct wp_doc { </li></ul><ul><ul><li>char *document; </li></ul></ul><ul><ul><li>size_t len; </li></ul></ul><ul><li>}; </li></ul><ul><li>struct wp_doc d; </li></ul><ul><li>... </li></ul><ul><li>FILE *f; </li></ul><ul><li>f= fopen("important_document","w"); </li></ul><ul><li>fwrite(d.document,d.len,1,f); </li></ul><ul><li>fclose(f); </li></ul>
  56. 56. Bug #2, 3 and 4 <ul><li>No error checking! </li></ul><ul><li>fopen </li></ul><ul><ul><li>Did we open the file </li></ul></ul><ul><li>fwrite </li></ul><ul><ul><li>did we write the entire file (ENOSPC?) </li></ul></ul><ul><li>fclose </li></ul><ul><ul><li>did we successfully close the file </li></ul></ul>
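A version of the save routine with the three missing checks added might look like the sketch below. The function name `save_checked` and its return convention (0 or an errno value) are assumptions for illustration. It only addresses error checking and makes no attempt yet to get the data safely onto disk.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: save `len` bytes to `path`, checking every call.
 * Returns 0 on success or an errno value on failure. */
int save_checked(const char *path, const char *buf, size_t len)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return errno;                 /* did we open the file? */

    /* Item size 1 and count len, so the return value is a byte count. */
    if (fwrite(buf, 1, len, f) != len) {
        int saved = errno ? errno : EIO;   /* e.g. ENOSPC */
        fclose(f);
        return saved;
    }

    /* fclose() flushes the libc buffer, so it can fail just like a write. */
    if (fclose(f) != 0)
        return errno;

    return 0;
}
```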
  57. 57. File System Integrity <ul><li>metadata journaling is just that </li></ul>
  58. 58. File System Integrity <ul><li>metadata journaling is just that </li></ul><ul><ul><li>meta data only </li></ul></ul>
  59. 59. File System Integrity <ul><li>metadata journaling is just that </li></ul><ul><ul><li>meta data only </li></ul></ul><ul><ul><li>no data is written to the journal </li></ul></ul>
  60. 60. File System Integrity <ul><li>metadata journaling is just that </li></ul><ul><ul><li>meta data only </li></ul></ul><ul><ul><li>no data is written to the journal </li></ul></ul><ul><ul><li>integrity of file system structures </li></ul></ul>
  61. 61. File System Integrity <ul><li>metadata journaling is just that </li></ul><ul><ul><li>meta data only </li></ul></ul><ul><ul><li>no data is written to the journal </li></ul></ul><ul><ul><li>integrity of file system structures </li></ul></ul><ul><ul><li>not internals of files </li></ul></ul>
  62. 62. Data journaling <ul><li>is nothing like a database transaction </li></ul>
  63. 63. Atomic write(2)
  64. 64. Atomic write(2) <ul><li>It isn't </li></ul>
  65. 65. Atomic write(2) <ul><li>It isn't </li></ul><ul><li>can half complete </li></ul>
  66. 66. Atomic write(2) <ul><li>It isn't </li></ul><ul><li>can half complete </li></ul><ul><li>A file system with atomic write(2) </li></ul>
  67. 67. Atomic write(2) <ul><li>It isn't </li></ul><ul><li>can half complete </li></ul><ul><li>A file system with atomic write(2) </li></ul><ul><ul><li>can't rely on it being there </li></ul></ul>
  68. 68. Atomic write(2) <ul><li>It isn't </li></ul><ul><li>can half complete </li></ul><ul><li>A file system with atomic write(2) </li></ul><ul><ul><li>can't rely on it being there </li></ul></ul><ul><ul><li>Essentially useless </li></ul></ul>
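One consequence of write(2) being able to half complete is that a single call may accept fewer bytes than requested, so callers usually loop. The sketch below shows that loop; `write_all` is a hypothetical helper name. It only handles short writes and interrupted calls. It does not make the write atomic with respect to a crash.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write(2) until all `len` bytes have
 * been accepted by the kernel. Returns 0 on success, -1 on error with
 * errno set by write(2). */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any byte: retry */
            return -1;
        }
        p += n;                 /* short write: advance and try again */
        len -= (size_t)n;
    }
    return 0;
}
```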
  69. 69. Eat My Data <ul><li>struct wp_doc { </li></ul><ul><ul><li>char *document; </li></ul></ul><ul><ul><li>size_t len; </li></ul></ul><ul><li>}; </li></ul><ul><li>struct wp_doc d; </li></ul><ul><li>... </li></ul><ul><li>FILE *f; </li></ul><ul><li>f= fopen("important_document","w"); </li></ul><ul><li>fwrite(d.document,d.len,1,f); </li></ul><ul><li>fclose(f); </li></ul>CRASH
  70. 70. Write to Temp file, rename <ul><li>Old trick of writing to temp file first </li></ul>
  71. 71. Write to Temp file, rename <ul><li>Old trick of writing to temp file first </li></ul><ul><li>Can catch any errors </li></ul><ul><ul><li>e.g. ENOSPC </li></ul></ul>
  72. 72. Write to Temp file, rename <ul><li>Old trick of writing to temp file first </li></ul><ul><li>Can catch any errors </li></ul><ul><ul><li>e.g. ENOSPC </li></ul></ul><ul><ul><li>don't rename on error </li></ul></ul>
  73. 73. Write to Temp file, rename <ul><li>Old trick of writing to temp file first </li></ul><ul><li>Can catch any errors </li></ul><ul><ul><li>e.g. ENOSPC </li></ul></ul><ul><ul><li>don't rename on error </li></ul></ul><ul><li>The idea is that if we crash while writing the temp file, the user's existing data is safe </li></ul>
  74. 74. Write to Temp file, rename <ul><li>Old trick of writing to temp file first </li></ul><ul><li>Can catch any errors </li></ul><ul><ul><li>e.g. ENOSPC </li></ul></ul><ul><ul><li>don't rename on error </li></ul></ul><ul><li>The idea is that if we crash while writing the temp file, the user's existing data is safe </li></ul><ul><ul><li>although we may leave a temp file lying around </li></ul></ul>
  75. 75. Temp file, rename <ul><li>struct wp_doc { </li></ul><ul><ul><li>char *document; </li></ul></ul><ul><ul><li>size_t len; </li></ul></ul><ul><li>}; </li></ul><ul><li>struct wp_doc d; </li></ul><ul><li>... </li></ul><ul><li>FILE *f; </li></ul><ul><li>f= fopen("important_document.temp","w"); </li></ul><ul><li>if(!f) return errno; </li></ul><ul><li>size_t w= fwrite(d.document,1,d.len,f); </li></ul><ul><li>if(w<d.len) return errno; </li></ul><ul><li>fclose(f); </li></ul><ul><li>rename("important_document.temp","important_document"); </li></ul>
  76. 76. Now all is good with the world...
  77. 77. Now all is good with the world... <ul><li>This is where a lot of people stop </li></ul>
  78. 78. Now all is good with the world... <ul><li>This is where a lot of people stop </li></ul><ul><li>close(2) and rename(2) do not imply sync </li></ul>
  79. 79. <ul><li>close and rename do not imply sync </li></ul>
  80. 80. Now all is good with the world... <ul><li>This is where a lot of people stop </li></ul><ul><li>close(2) and rename(2) do not imply sync </li></ul><ul><li>They make no guarantees on when (or in what order) changes hit the platter </li></ul>
  81. 81. Now all is good with the world... <ul><li>This is where a lot of people stop </li></ul><ul><li>close(2) and rename(2) do not imply sync </li></ul><ul><li>They make no guarantees on when (or in what order) changes hit the platter </li></ul><ul><li>It is quite possible (and common) for metadata to be flushed before data </li></ul>
  82. 82. File System Integrity <ul><li>data=ordered mode on ext3 </li></ul><ul><ul><li>writes data before metadata </li></ul></ul><ul><ul><li>other file systems are different </li></ul></ul>
  83. 83. File System Integrity <ul><li>data=ordered mode on ext3 </li></ul><ul><ul><li>writes data before metadata </li></ul></ul><ul><ul><li>other file systems are different </li></ul></ul><ul><li>ext3 ordered mode is an exception, not the rule </li></ul>
  84. 84. File System Integrity <ul><li>data=ordered mode on ext3 </li></ul><ul><ul><li>writes data before metadata </li></ul></ul><ul><ul><li>other file systems are different </li></ul></ul><ul><li>ext3 ordered mode is an exception, not the rule </li></ul><ul><ul><li>applications relying on this are not portable and depend on file system behaviour; such applications are buggy </li></ul></ul>
  85. 85. data=ordered <ul><li>write() </li></ul><ul><li>close() </li></ul><ul><li>rename() </li></ul><ul><li>Disk order: </li></ul><ul><ul><li>data from fwrite() </li></ul></ul><ul><ul><li>inode </li></ul></ul><ul><ul><li>directory entry </li></ul></ul>
  86. 86. other systems <ul><li>write() </li></ul><ul><li>close() </li></ul><ul><li>rename() </li></ul><ul><li>Disk order: </li></ul><ul><ul><li>any! </li></ul></ul>
  87. 87. flush and sync <ul><li>struct wp_doc { </li></ul><ul><ul><li>char *document; </li></ul></ul><ul><ul><li>size_t len; </li></ul></ul><ul><li>}; </li></ul><ul><li>struct wp_doc d; </li></ul><ul><li>... </li></ul><ul><li>FILE *f; </li></ul><ul><li>f= fopen("important_document.temp","w"); </li></ul><ul><li>if(!f) return errno; </li></ul><ul><li>size_t w= fwrite(d.document,1,d.len,f); </li></ul><ul><li>if(w<d.len) return errno; </li></ul><ul><li>if(fflush(f)!=0) return errno; </li></ul><ul><li>if(fsync(fileno(f))==-1) return errno; </li></ul><ul><li>fclose(f); </li></ul><ul><li>rename("important_document.temp","important_document"); </li></ul>Flush the buffers! Sync to disk before rename
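Pulled together as a self-contained function, the write, flush, sync, rename sequence could look something like the sketch below. The name `save_via_temp`, its parameters, and the choice to delete the temp file on failure are assumptions made for the example. It also checks `fclose()` and `rename()` and closes the stream on every error path, which the slide code leaves out for brevity.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `len` bytes from `buf` by
 * writing `tmp_path` first. Returns 0 on success or an errno value. */
int save_via_temp(const char *path, const char *tmp_path,
                  const char *buf, size_t len)
{
    int err = 0;
    FILE *f = fopen(tmp_path, "w");
    if (f == NULL)
        return errno;

    if (fwrite(buf, 1, len, f) != len)      /* short write, e.g. ENOSPC */
        err = errno ? errno : EIO;
    else if (fflush(f) != 0)                /* libc buffer -> kernel */
        err = errno;
    else if (fsync(fileno(f)) == -1)        /* kernel -> disk */
        err = errno;

    /* Always close the stream, but keep the first error we saw. */
    if (fclose(f) != 0 && err == 0)
        err = errno;

    /* Only rename once the temp file is known to be complete and synced. */
    if (err == 0 && rename(tmp_path, path) != 0)
        err = errno;

    if (err != 0)
        remove(tmp_path);   /* best effort; the old `path` is untouched */

    return err;
}
```

A caller would use it as `save_via_temp("important_document", "important_document.temp", d.document, d.len)` and treat any non-zero result as "the previous version is still the saved one".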
  88. 88. A tale of libxml2 <ul><li>libxml2 provides utility functions to “write XML to file” </li></ul>
  89. 89. A tale of libxml2 <ul><li>libxml2 provides utility functions to “write XML to file” </li></ul><ul><li>Nice application developer saves time by using libxml2's function </li></ul>
  90. 90. A tale of libxml2 <ul><li>libxml2 provides utility functions to “write XML to file” </li></ul><ul><li>Nice application developer saves time by using libxml2's function </li></ul><ul><ul><li>Application developer writes to temp file, renames </li></ul></ul>
  91. 91. A tale of libxml2 <ul><li>libxml2 provides utility functions to “write XML to file” </li></ul><ul><li>Nice application developer saves time by using libxml2's function </li></ul><ul><ul><li>Application developer writes to temp file, renames </li></ul></ul><ul><ul><li>User loses data after crash </li></ul></ul>
  92. 92. A tale of libxml2 <ul><li>libxml2 provides utility functions to “write XML to file” </li></ul><ul><li>Nice application developer saves time by using libxml2's function </li></ul><ul><ul><li>Application developer writes to temp file, renames </li></ul></ul><ul><ul><li>User loses data after crash </li></ul></ul><ul><ul><li>Nice application developer has to work around limitations of library </li></ul></ul>
  93. 93. so, replace <ul><li>xmlSaveFile(foo) </li></ul>
  94. 94. gint common_save_xml(xmlDocPtr doc, gchar *filename)
      {
          FILE *fp;
          char *xmlbuf;
          int fd, n;
          fp = g_fopen(filename, "w");
          if(NULL == fp)
              return -1;
          xmlDocDumpFormatMemory(doc, (xmlChar **)&xmlbuf, &n, TRUE);
          if(n <= 0) {
              errno = ENOMEM;
              return -1;
          }
          if(fwrite(xmlbuf, sizeof (xmlChar), n, fp) < n) {
              xmlFree (xmlbuf);
              return -1;
          }
          xmlFree (xmlbuf);
          /* flush user-space buffers */
          if (fflush (fp) != 0)
              return -1;
          if ((fd = fileno (fp)) == -1)
              return -1;
      #ifdef HAVE_FSYNC
          /* sync kernel-space buffers to disk */
          if (fsync (fd) == -1)
              return -1;
      #endif
          fclose(fp);
          return 0;
      }
  95. 95. Nearing Nirvana <ul><li>If there is any failure during writing, the previously saved copy is untouched and safe </li></ul><ul><ul><li>User won't get partial or no data </li></ul></ul>
  96. 96. Except if you want to be portable... <ul><li>On Linux, fsync(2) does actually sync </li></ul><ul><ul><li>unless the drive's write cache is enabled </li></ul></ul>
  97. 97. Except if you want to be portable... <ul><li>On Linux, fsync(2) does actually sync </li></ul><ul><ul><li>unless the drive's write cache is enabled </li></ul></ul><ul><li>On MacOS X, </li></ul>
  98. 98. Except if you want to be portable... <ul><li>On Linux, fsync(2) does actually sync </li></ul><ul><ul><li>unless the drive's write cache is enabled </li></ul></ul><ul><li>On MacOS X, not so much </li></ul>
  99. 99. on fsync, POSIX Says... <ul><li>If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily on the conformance document to tell the user what can be expected from the system. It is explicitly intended that a null implementation is permitted. </li></ul>
  101. 101. POSIX compliant fsync <ul><li>int fsync(int fd) </li></ul>
  102. 102. POSIX compliant fsync <ul><li>int fsync(int fd) </li></ul><ul><li>{ </li></ul>
  103. 103. POSIX compliant fsync <ul><li>int fsync(int fd) </li></ul><ul><li>{ </li></ul><ul><ul><li>return 0; </li></ul></ul>
  104. 104. POSIX compliant fsync <ul><li>int fsync(int fd) </li></ul><ul><li>{ </li></ul><ul><ul><li>return 0; </li></ul></ul><ul><li>} </li></ul>
  105. 105. POSIX compliant fsync <ul><li>int fsync(int fd) </li></ul><ul><li>{ </li></ul><ul><ul><li>return 0; </li></ul></ul><ul><li>} </li></ul>gcc
  106. 106. POSIX compliant fsync <ul><li>int fsync(int fd) </li></ul><ul><li>{ </li></ul><ul><ul><li>return 0; </li></ul></ul><ul><li>} </li></ul>pushl %ebp movl %esp, %ebp movl $0, %eax popl %ebp ret gcc
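Since a do-nothing fsync() is permitted when the Synchronized Input and Output option is absent, a program can at least ask the system whether it claims that option. The sketch below uses `sysconf(_SC_SYNCHRONIZED_IO)` for that; the helper name `has_synchronized_io` is made up for the example. A positive answer is only the system's own claim and says nothing about what the drive does with its write cache.

```c
#define _POSIX_C_SOURCE 200809L
#include <unistd.h>

/* Hypothetical helper: returns 1 if the system reports the POSIX
 * Synchronized Input and Output option, 0 if it does not (or if the
 * query itself is unavailable). */
int has_synchronized_io(void)
{
#ifdef _SC_SYNCHRONIZED_IO
    /* sysconf() returns -1 when the option is not supported. */
    return sysconf(_SC_SYNCHRONIZED_IO) > 0;
#else
    return 0;
#endif
}
```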
  107. 107. Tale of a really fast database server <ul><li>A while ago (pre MySQL 4.1.9) </li></ul>
  108. 108. Tale of a really fast database server <ul><li>A while ago (pre MySQL 4.1.9) </li></ul><ul><li>Seeing corruption of InnoDB pages </li></ul>
  109. 109. Tale of a really fast database server <ul><li>A while ago (pre MySQL 4.1.9) </li></ul><ul><li>Seeing corruption of InnoDB pages </li></ul><ul><ul><li>only on MacOS X </li></ul></ul>
  110. 110. Tale of a really fast database server <ul><li>A while ago (pre MySQL 4.1.9) </li></ul><ul><li>Seeing corruption of InnoDB pages </li></ul><ul><ul><li>only on MacOS X </li></ul></ul><ul><li>Also, things seemed pretty fast </li></ul>
  111. 111. fsync() doesn't have to sync <ul><li>On MacOS X, fsync() doesn't flush drive write cache </li></ul>
  112. 112. fsync() doesn't have to sync <ul><li>On MacOS X, fsync() doesn't flush drive write cache </li></ul><ul><li>An extra fcntl is provided to do this </li></ul>
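A small wrapper that uses the extra fcntl where it exists and plain fsync() elsewhere might look like the sketch below. The name `full_sync` is hypothetical. `F_FULLFSYNC` is the Mac OS X fcntl command in question, and the fallback to fsync() when the fcntl fails is a choice made for this sketch.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: flush `fd` as far down as the platform allows.
 * Returns 0 on success, -1 on error with errno set. */
int full_sync(int fd)
{
#ifdef F_FULLFSYNC
    /* Mac OS X: ask the drive itself to flush its write cache. */
    if (fcntl(fd, F_FULLFSYNC) == 0)
        return 0;
    /* Not every file system supports it: fall through to fsync(). */
#endif
    return fsync(fd);
}
```

Call it wherever the earlier examples call `fsync()`, for instance `full_sync(fileno(f))` just before closing and renaming the temp file.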
  113. 113. Standards are great <ul><li>everybody has their own </li></ul>
  114. 114. Standards are great <ul><li>everybody has their own </li></ul><ul><li>makes application developers' lives difficult </li></ul>
  115. 115. Standards are great <ul><li>everybody has their own </li></ul><ul><li>makes application developers' lives difficult </li></ul><ul><li>Let's see the InnoDB code for ensuring data is synced to disk </li></ul>
  116. 116. Standards are great <ul><li>everybody has their own </li></ul><ul><li>makes application developers' lives difficult </li></ul><ul><li>Let's see the InnoDB code for ensuring data is synced to disk </li></ul><ul><ul><li>if this doesn't work, transactions don't work </li></ul></ul>
  117. 117. #ifdef HAVE_DARWIN_THREADS
       # ifdef F_FULLFSYNC
           /* This executable has been compiled on Mac OS X 10.3 or later.
           Assume that F_FULLFSYNC is available at run-time. */
           srv_have_fullfsync = TRUE;
       # else /* F_FULLFSYNC */
           /* This executable has been compiled on Mac OS X 10.2 or earlier.
           Determine if the executable is running on Mac OS X 10.3 or later. */
           struct utsname utsname;
           if (uname(&utsname)) {
               fputs("InnoDB: cannot determine Mac OS X version!\n", stderr);
           } else {
               srv_have_fullfsync = strcmp(utsname.release, "7.") >= 0;
           }
           if (!srv_have_fullfsync) {
               fputs("InnoDB: On Mac OS X, fsync() may be"
                     " broken on internal drives,\n"
                     "InnoDB: making transactions unsafe!\n", stderr);
           }
       # endif /* F_FULLFSYNC */
       #endif /* HAVE_DARWIN_THREADS */
  118. 118. #if defined(HAVE_DARWIN_THREADS)
       # ifndef F_FULLFSYNC
           /* The following definition is from the Mac OS X 10.3 <sys/fcntl.h> */
       # define F_FULLFSYNC 51 /* fsync + ask the drive to flush to the media */
       # elif F_FULLFSYNC != 51
       # error "F_FULLFSYNC != 51: ABI incompatibility with Mac OS X 10.3"
       # endif
           /* Apple has disabled fsync() for internal disk drives in OS X. That
           caused corruption for a user when he tested a power outage. Let us in
           OS X use a nonstandard flush method recommended by an Apple
           engineer. */
           if (!srv_have_fullfsync) {
               /* If we are not on an operating system that supports this,
               then fall back to a plain fsync. */
               ret = fsync(file);
           } else {
               ret = fcntl(file, F_FULLFSYNC, NULL);
               if (ret) {
                   /* If we are not on a file system that supports this,
                   then fall back to a plain fsync. */
                   ret = fsync(file);
               }
           }
       #elif HAVE_FDATASYNC
           ret = fdatasync(file);
       #else
           /* fprintf(stderr, "Flushing to file %p\n", file); */
           ret = fsync(file);
       #endif
  119. 119. Yes, some OS Vendors hate you <ul><li>Thanks to all the permutations of reliably getting data to a disk platter, a simple call is now two screens of code </li></ul>
  120. 120. Big Files
  121. 121. Big Files <ul><li>Write to temp file, sync, rename works badly with large files </li></ul>
  122. 122. Big Files <ul><li>Write to temp file, sync, rename works badly with large files </li></ul><ul><ul><li>especially frequently modified large files </li></ul></ul>
  123. 123. Big Files <ul><li>Write to temp file, sync, rename works badly with large files </li></ul><ul><ul><li>especially frequently modified large files </li></ul></ul><ul><li>Time to start REDO/UNDO logging </li></ul>
  124. 124. Big Files <ul><li>Write to temp file, sync, rename works badly with large files </li></ul><ul><ul><li>especially frequently modified large files </li></ul></ul><ul><li>Time to start REDO/UNDO logging </li></ul><ul><ul><li>other cool tricks </li></ul></ul>
  125. 125. Big Files <ul><li>Write to temp file, sync, rename works badly with large files </li></ul><ul><ul><li>especially frequently modified large files </li></ul></ul><ul><li>Time to start REDO/UNDO logging </li></ul><ul><ul><li>other cool tricks </li></ul></ul><ul><li>Or not care so much </li></ul><ul><ul><li>e.g. DVD ripping </li></ul></ul>
  126. 126. Big Files <ul><li>Write to temp file, sync, rename works badly with large files </li></ul><ul><ul><li>especially frequently modified large files </li></ul></ul><ul><li>Time to start REDO/UNDO logging </li></ul><ul><ul><li>other cool tricks </li></ul></ul><ul><li>Or not care so much </li></ul><ul><ul><li>e.g. DVD ripping </li></ul></ul><ul><li>Some video software saves frame-per-file </li></ul>
  127. 127. Large directories <ul><li>Traditional directory is stored on disk as list of name,inode </li></ul>
  128. 128. Large directories <ul><li>Traditional directory is stored on disk as list of name,inode </li></ul><ul><li>lookup is search through this list </li></ul>
  129. 129. Large directories <ul><li>Traditional directory is stored on disk as list of name,inode </li></ul><ul><li>lookup is search through this list </li></ul><ul><li>Allocation of disk space to directories is block-at-a-time, leading to fragmentation </li></ul>
  130. 130. Large directories <ul><li>Traditional directory is stored on disk as list of name,inode </li></ul><ul><li>lookup is search through this list </li></ul><ul><li>Allocation of disk space to directories is block-at-a-time, leading to fragmentation </li></ul><ul><li>Directory indexes help </li></ul><ul><ul><li>some better than others </li></ul></ul>
  131. 131. Large directories <ul><li>Traditional directory is stored on disk as list of name,inode </li></ul><ul><li>lookup is search through this list </li></ul><ul><li>Allocation of disk space to directories is block-at-a-time, leading to fragmentation </li></ul><ul><li>Directory indexes help </li></ul><ul><ul><li>some better than others </li></ul></ul><ul><li>Can't always control the file system </li></ul><ul><ul><li>count on directories with more than a few thousand files being slow </li></ul></ul>
  132. 132. Where data is after being written <ul><li>Transactions in databases are often committed but only written to the log, not the main data file </li></ul>
  133. 133. Where data is after being written <ul><li>Transactions in databases are often committed but only written to the log, not the main data file </li></ul><ul><li>same with file systems </li></ul><ul><ul><li>metadata that's replayed from the log is committed </li></ul></ul>
  134. 134. Where data is after being written <ul><li>Transactions in databases are often committed but only written to the log, not the main data file </li></ul><ul><li>same with file systems </li></ul><ul><ul><li>metadata that's replayed from the log is committed </li></ul></ul><ul><li>Don't rely on reading the raw disk of a mounted fs </li></ul>
  135. 135. Where data is after being written <ul><li>Transactions in databases are often committed but only written to the log, not the main data file </li></ul><ul><li>same with file systems </li></ul><ul><ul><li>metadata that's replayed from the log is committed </li></ul></ul><ul><li>Don't rely on reading the raw disk of a mounted fs </li></ul><ul><ul><li>it harms kittens, </li></ul></ul>
  136. 136. Where data is after being written <ul><li>Transactions in databases are often committed but only written to the log, not the main data file </li></ul><ul><li>same with file systems </li></ul><ul><ul><li>metadata that's replayed from the log is committed </li></ul></ul><ul><li>Don't rely on reading the raw disk of a mounted fs </li></ul><ul><ul><li>it harms kittens, puppies </li></ul></ul>
  137. 137. Where data is after being written <ul><li>Transactions in databases are often committed but only written to the log, not the main data file </li></ul><ul><li>same with file systems </li></ul><ul><ul><li>metadata that's replayed from the log is committed </li></ul></ul><ul><li>Don't rely on reading the raw disk of a mounted fs </li></ul><ul><ul><li>it harms kittens, puppies and babies </li></ul></ul>
  144. 144. sqlite <ul><li>Many applications have structured data </li></ul><ul><li>Can be (easily) represented in an RDBMS </li></ul><ul><li>sqlite is ACID with a capital D for Durability </li></ul><ul><li>takes the hard work out of things </li></ul><ul><li>Not so good with many clients </li></ul><ul><li>Brilliant for a document format though </li></ul><ul><li>Scales up to “a few dozen GB of data” before not being as efficient as other RDBMSs </li></ul>
  148. 148. Performance of Large files <ul><li>Once in core, page to disk location is cached </li></ul><ul><ul><li>other OSs may vary </li></ul></ul><ul><li>Performance difference is in initial lookup </li></ul><ul><li>Extent-based file systems are much more efficient </li></ul><ul><li>Zeroing takes a long time </li></ul><ul><ul><li>support for unwritten extents means fast zeroing </li></ul></ul><ul><ul><li>think CREATE TABLESPACE </li></ul></ul><ul><ul><li>think bittorrent </li></ul></ul>
  149. 149. inode <ul><li>[Diagram: an inode holding its Infos, direct blocks, indirect blocks and double indirect blocks] </li></ul>
  150. 150. Extent <ul><li>start disk block </li></ul><ul><li>start file block </li></ul><ul><li>length </li></ul><ul><li>flags </li></ul><ul><ul><li>e.g. unwritten </li></ul></ul>
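The four fields listed on the Extent slide can be sketched as a C struct. This is an illustration only, not the on-disk layout of any real file system: the type name `struct extent`, the field widths, the flag value `EXTENT_FLAG_UNWRITTEN` and both helper functions are assumptions made for the example. It shows why a lookup is cheap: one record maps a whole run of file blocks with a single range check, where a block map needs one pointer per block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag value: space is reserved but has never been written. */
#define EXTENT_FLAG_UNWRITTEN 0x1u

/* One extent: a contiguous run of file blocks mapped to contiguous disk blocks. */
struct extent {
    uint64_t start_disk_block; /* first block of the run on disk */
    uint64_t start_file_block; /* first block of the run within the file */
    uint64_t length;           /* number of blocks in the run */
    uint32_t flags;            /* e.g. EXTENT_FLAG_UNWRITTEN */
};

/* Translate a file block to a disk block.
 * Returns false if the extent does not cover file_block. */
static bool extent_lookup(const struct extent *e, uint64_t file_block,
                          uint64_t *disk_block)
{
    assert(e != NULL && disk_block != NULL);
    if (file_block < e->start_file_block ||
        file_block - e->start_file_block >= e->length)
        return false;
    *disk_block = e->start_disk_block + (file_block - e->start_file_block);
    return true;
}

/* An unwritten extent reads back as zeros without any zeros being stored,
 * which is what makes "zeroing" a freshly allocated file fast. */
static bool extent_reads_as_zero(const struct extent *e)
{
    assert(e != NULL);
    return (e->flags & EXTENT_FLAG_UNWRITTEN) != 0;
}
```

With an extent of 5 blocks starting at file block 10 and disk block 1000, file block 12 maps to disk block 1002, and file blocks 9 and 15 fall outside the extent.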
  154. 154. Parallel writers <ul><li>Multiple threads writing large files </li></ul><ul><li>Not uncommon to compete for disk allocation </li></ul><ul><ul><li>especially with O_SYNC </li></ul></ul><ul><li>Files can get interleaved (ababab) </li></ul><ul><ul><li>extent-based file systems suffer </li></ul></ul><ul><ul><li>reading performance suffers </li></ul></ul><ul><ul><li>especially with slow growing files </li></ul></ul><ul><li>Preallocate disk space </li></ul><ul><ul><li>with no standard way to do it... :( </li></ul></ul>
  155. 155. Preallocation <ul><li>Often the file system will do it for you </li></ul><ul><ul><li>doesn't work as well with O_SYNC </li></ul></ul><ul><li>No (useful) standard way to preallocate space </li></ul><ul><ul><li>posix_fallocate doesn't work </li></ul></ul><ul><ul><li>xfsctl for files on XFS </li></ul></ul>
  156. 156. Tablespace allocation in NDB <pre>
#ifdef HAVE_XFS_XFS_H
  if( platform_test_xfs_fd(theFd) )
  {
    ndbout_c("Using xfsctl(XFS_IOC_RESVSP64) to allocate disk space");
    xfs_flock64_t fl;
    fl.l_whence= 0;
    fl.l_start= 0;
    fl.l_len= (off64_t)sz;
    if( xfsctl(NULL, theFd, XFS_IOC_RESVSP64, &fl) < 0)
      ndbout_c("failed to optimally allocate disk space");
  }
#endif
#ifdef HAVE_POSIX_FALLOCATE
  posix_fallocate(theFd, 0, sz);
#endif
</pre>
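The NDB snippet calls `posix_fallocate` without looking at its result. A minimal sketch of checking that result follows; it is not taken from the talk or from NDB. The helper names `preallocate_fd` and `demo_preallocate` and the scratch path under `/tmp` are invented for the example. One detail is easy to get wrong: `posix_fallocate` returns the error number directly instead of returning -1 and setting `errno`. The earlier caveat still applies: where the file system has no native support, glibc may emulate the call by writing to every block, which reserves the space but is slow.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: reserve len bytes starting at offset 0.
 * Returns 0 on success or an error number (EINVAL, ENOSPC, ...) on failure. */
static int preallocate_fd(int fd, off_t len)
{
    assert(len > 0);
    /* posix_fallocate() reports failure through its return value, not errno. */
    return posix_fallocate(fd, 0, len);
}

/* Demo: preallocate 1 MiB in a scratch file and confirm the resulting size.
 * Returns 0 if every step worked, -1 otherwise. */
static int demo_preallocate(void)
{
    char path[] = "/tmp/prealloc-demo-XXXXXX";
    const off_t want = 1 << 20;
    struct stat st;
    int ok = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    /* On success the file is at least offset + len bytes long. */
    if (preallocate_fd(fd, want) == 0 &&
        fstat(fd, &st) == 0 &&
        st.st_size >= want)
        ok = 0;

    close(fd);
    unlink(path);
    return ok;
}
```

Calling `demo_preallocate()` from `main` is enough to try it; a non-zero result means the scratch file could not be created or the space could not be reserved.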
  159. 159. Improvements in mysql-test-run <ul><li>Would run several nodes on one machine </li></ul><ul><ul><li>each creating tablespace files </li></ul></ul><ul><ul><li>all IO is O_SYNC </li></ul></ul><ul><li>number of extents for ndb_dd_basic tablespaces and log files </li></ul><ul><ul><li>BEFORE this code: 57, 13, 212, 95, 17, 113 </li></ul></ul><ul><ul><li>WITH this code: ALL 1 or 2 extents </li></ul></ul><ul><li>30-second reduction in each test that created tablespaces </li></ul>
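The slides do not say how the extent counts were measured. One way to get a comparable number on Linux is the `FS_IOC_FIEMAP` ioctl, sketched below. Using FIEMAP here is an assumption rather than the author's method, and the function name `count_extents` is invented for the example. Not every file system implements the ioctl, so the function reports -1 when the count is unavailable.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>      /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>  /* struct fiemap, FIEMAP_* constants */

/* Hypothetical helper: number of extents backing the file at path,
 * or -1 if the file cannot be opened or FIEMAP is not supported. */
static long count_extents(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct fiemap fm;
    memset(&fm, 0, sizeof fm);
    fm.fm_start = 0;
    fm.fm_length = FIEMAP_MAX_OFFSET; /* cover the whole file */
    fm.fm_flags = FIEMAP_FLAG_SYNC;   /* flush dirty data before mapping */
    fm.fm_extent_count = 0;           /* 0 = only count, return no records */

    int rc = ioctl(fd, FS_IOC_FIEMAP, &fm);
    close(fd);
    if (rc < 0)
        return -1;

    /* With fm_extent_count == 0 the kernel fills in the total count here. */
    return (long)fm.fm_mapped_extents;
}
```

Running this on a tablespace file before and after adding preallocation gives a quick check of whether the file ended up fragmented.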
  165. 165. Library Developers <ul><li>No problem cannot be solved by Abstraction! </li></ul><ul><li>...and making it complex </li></ul><ul><ul><li>mysys my_stat MY_STAT *my_stat(const char *path, MY_STAT *stat_area, myf my_flags) </li></ul></ul><ul><ul><li>versus POSIX stat int stat(const char *path, struct stat *buf); </li></ul></ul><ul><ul><li>my_fstat (grrr) int my_fstat(int Filedes, MY_STAT *stat_area, myf MyFlags __attribute__((unused))) </li></ul></ul><ul><ul><li>versus POSIX fstat int fstat(int filedes, struct stat *buf); </li></ul></ul>
  166. 166. There is hope <ul><li>You can do file IO correctly </li></ul><ul><li>You can prevent data loss </li></ul><ul><li>You can pester people to make life easier </li></ul>
  167. 167. Good Luck!
  168. 168. Good Luck! <ul><li>and please don't eat my data </li></ul>