Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Eat My Data: (now with 20% more rant!) How everybody gets file I/O wrong Stewart Smith [email_address] Senior Software Eng...
What I work on <ul><li>MySQL Cluster </li></ul><ul><li>High Availability </li></ul><ul><li>(Shared Nothing) Clustered Data...
Overview <ul><li>Common mistakes that lead to data loss </li></ul>
Overview <ul><li>Common mistakes that lead to data loss </li></ul><ul><li>Mistakes by: </li></ul><ul><ul><li>the applicati...
Overview <ul><li>Common mistakes that lead to data loss </li></ul><ul><li>Mistakes by: </li></ul><ul><ul><li>the applicati...
Overview <ul><li>Common mistakes that lead to data loss </li></ul><ul><li>Mistakes by: </li></ul><ul><ul><li>the applicati...
Overview <ul><li>Common mistakes that lead to data loss </li></ul><ul><li>Mistakes by: </li></ul><ul><ul><li>the applicati...
In the beginning <ul><li>All IO was synchronous </li></ul>
In the beginning <ul><li>All IO was synchronous </li></ul><ul><li>When you called write things hit the platter </li></ul>
In the beginning <ul><li>All IO was synchronous </li></ul><ul><li>When you called write things hit the platter </li></ul><...
A world without failure
A world without failure <ul><li>doesn't exist </li></ul>
A world without failure <ul><li>doesn't exist </li></ul><ul><li>computers crash </li></ul>
A world without failure <ul><li>doesn't exist </li></ul><ul><li>computers crash </li></ul><ul><li>power goes out </li></ul>
A world without failure <ul><li>doesn't exist </li></ul><ul><li>computers crash </li></ul><ul><li>power goes out </li></ul...
A world without failure <ul><li>doesn't exist </li></ul><ul><li>computers crash </li></ul><ul><li>power goes out </li></ul...
A world without failure <ul><li>doesn't exist </li></ul><ul><li>computers crash </li></ul><ul><li>power goes out </li></ul...
A world without failure <ul><li>doesn't exist </li></ul><ul><li>computers crash </li></ul><ul><li>power goes out </li></ul...
Data Consistency <ul><li>In the event of failure, what state can I expect my data to be in? </li></ul>
User Expectations <ul><li>If the power goes out, the last saved version of my data is there </li></ul>
User Expectations <ul><li>If the power goes out, the last saved version of my data is there </li></ul><ul><li>If there isn...
User Expectations <ul><li>If the power goes out, the last saved version of my data is there </li></ul><ul><li>If there isn...
Databases
Databases <ul><li>ACID </li></ul>
Databases <ul><li>ACID </li></ul><ul><ul><li>D is for Durability </li></ul></ul>
Databases <ul><li>ACID </li></ul><ul><ul><li>D is for Durability </li></ul></ul><ul><li>Transactions </li></ul><ul><ul><li...
Databases <ul><li>ACID </li></ul><ul><ul><li>D is for Durability </li></ul></ul><ul><li>Transactions </li></ul><ul><ul><li...
Databases <ul><li>ACID </li></ul><ul><ul><li>D is for Durability </li></ul></ul><ul><li>Transactions </li></ul><ul><ul><li...
Databases <ul><li>ACID </li></ul><ul><ul><li>D is for Durability </li></ul></ul><ul><li>Transactions </li></ul><ul><ul><li...
Databases <ul><li>ACID </li></ul><ul><ul><li>D is for Durability </li></ul></ul><ul><li>Transactions </li></ul><ul><ul><li...
What databases are good at <ul><li>Lots of small objects (rows) </li></ul>
What databases are good at <ul><li>Lots of small objects (rows) </li></ul><ul><li>okay at larger objects (BLOBs) </li></ul>
What databases are good at <ul><li>Lots of small objects (rows) </li></ul><ul><li>okay at larger objects (BLOBs) </li></ul...
What databases are good at <ul><li>Lots of small objects (rows) </li></ul><ul><li>okay at larger objects (BLOBs) </li></ul...
Easy solution to data consistency <ul><li>put it in a database </li></ul><ul><ul><li>that gives data consistency guarantee...
Revelation #1 <ul><li>Databases are not file systems! </li></ul>
Revelation #2 <ul><li>File systems are not databases! </li></ul>
Revelation #3 <ul><li>A database has different consistency semantics than a file system </li></ul>
Revelation #3 <ul><li>A database has different consistency semantics than a file system </li></ul><ul><ul><li>typically fi...
Eat my data <ul><li>Where can the data be to eat? </li></ul>
Where data can be <ul><li>Application Buffer - CoolApp </li></ul><ul><ul><li>application crash = loss of this data </li></...
Where data can be <ul><li>Application Buffer - CoolApp </li></ul><ul><ul><li>application crash = loss of this data </li></...
Where data can be <ul><li>Application Buffer - CoolApp </li></ul><ul><ul><li>application crash = loss of this data </li></...
Where data can be <ul><li>Application Buffer - CoolApp </li></ul><ul><ul><li>application crash = loss of this data </li></...
Data flow <ul><li>Application to library buffer </li></ul><ul><ul><li>fwrite(), fprintf() and friends </li></ul></ul>
Data flow <ul><li>Application to library buffer </li></ul><ul><ul><li>fwrite(), fprintf() and friends </li></ul></ul><ul><...
Data flow <ul><li>Application to library buffer </li></ul><ul><ul><li>fwrite(), fprintf() and friends </li></ul></ul><ul><...
Data flow <ul><li>Application to library buffer </li></ul><ul><ul><li>fwrite(), fprintf() and friends </li></ul></ul><ul><...
Data flow <ul><li>Application to library buffer </li></ul><ul><ul><li>fwrite(), fprintf() and friends </li></ul></ul><ul><...
Data flow <ul><li>Application to library buffer </li></ul><ul><ul><li>fwrite(), fprintf() and friends </li></ul></ul><ul><...
Data flow <ul><li>Application to library buffer </li></ul><ul><ul><li>fwrite(), fprintf() and friends </li></ul></ul><ul><...
Simple Application: Save==on disk <ul><li>User hits “Save” in Word Processor </li></ul><ul><ul><li>Expects that data to be...
Saving a simple document <ul><li>struct wp_doc { </li></ul><ul><ul><li>char *document; </li></ul></ul><ul><ul><li>size_t l...
Bug #1 <ul><li>No fclose(2) </li></ul><ul><ul><li>Buffers for the stream may not be flushed from libc cache </li></ul></ul>
Word Processor Saving -1 Bug <ul><li>struct wp_doc { </li></ul><ul><ul><li>char *document; </li></ul></ul><ul><ul><li>size...
Bug #2, 3 and 4 <ul><li>No error checking! </li></ul><ul><li>fopen </li></ul><ul><ul><li>Did we open the file </li></ul></...
File System Integrity <ul><li>metadata journaling is just that </li></ul>
File System Integrity <ul><li>metadata journaling is just that </li></ul><ul><ul><li>meta data  only </li></ul></ul>
File System Integrity <ul><li>metadata journaling is just that </li></ul><ul><ul><li>meta data  only </li></ul></ul><ul><u...
File System Integrity <ul><li>metadata journaling is just that </li></ul><ul><ul><li>meta data  only </li></ul></ul><ul><u...
File System Integrity <ul><li>metadata journaling is just that </li></ul><ul><ul><li>meta data  only </li></ul></ul><ul><u...
Data journaling <ul><li>is nothing like a database transaction </li></ul>
Atomic write(2)
Atomic write(2) <ul><li>It isn't </li></ul>
Atomic write(2) <ul><li>It isn't </li></ul><ul><li>can half complete </li></ul>
Atomic write(2) <ul><li>It isn't </li></ul><ul><li>can half complete </li></ul><ul><li>A file system with atomic write(2) ...
Atomic write(2) <ul><li>It isn't </li></ul><ul><li>can half complete </li></ul><ul><li>A file system with atomic write(2) ...
Atomic write(2) <ul><li>It isn't </li></ul><ul><li>can half complete </li></ul><ul><li>A file system with atomic write(2) ...
Eat My Data <ul><li>struct wp_doc { </li></ul><ul><ul><li>char *document; </li></ul></ul><ul><ul><li>size_t len; </li></ul...
Write to Temp file, rename <ul><li>Old trick of writing to temp file first </li></ul>
Write to Temp file, rename <ul><li>Old trick of writing to temp file first </li></ul><ul><li>Can catch any errors </li></u...
Write to Temp file, rename <ul><li>Old trick of writing to temp file first </li></ul><ul><li>Can catch any errors </li></u...
Write to Temp file, rename <ul><li>Old trick of writing to temp file first </li></ul><ul><li>Can catch any errors </li></u...
Write to Temp file, rename <ul><li>Old trick of writing to temp file first </li></ul><ul><li>Can catch any errors </li></u...
Temp file, rename <ul><li>struct wp_doc { </li></ul><ul><ul><li>char *document; </li></ul></ul><ul><ul><li>size_t len; </l...
Now all is good with the world...
Now all is good with the world... <ul><li>This is where a lot of people stop </li></ul>
Now all is good with the world... <ul><li>This is where a lot of people stop </li></ul><ul><li>close(2) and rename(2) do  ...
<ul><li>close and rename do  not  imply sync </li></ul>
Now all is good with the world... <ul><li>This is where a lot of people stop </li></ul><ul><li>close(2) and rename(2) do  ...
Now all is good with the world... <ul><li>This is where a lot of people stop </li></ul><ul><li>close(2) and rename(2) do  ...
File System Integrity <ul><li>data=ordered mode on ext3 </li></ul><ul><ul><li>writes data before metadata </li></ul></ul><...
File System Integrity <ul><li>data=ordered mode on ext3 </li></ul><ul><ul><li>writes data before metadata </li></ul></ul><...
File System Integrity <ul><li>data=ordered mode on ext3 </li></ul><ul><ul><li>writes data before metadata </li></ul></ul><...
data=ordered <ul><li>write() </li></ul><ul><li>close() </li></ul><ul><li>rename() </li></ul><ul><li>Disk order: </li></ul>...
other systems <ul><li>write() </li></ul><ul><li>close() </li></ul><ul><li>rename() </li></ul><ul><li>Disk order: </li></ul...
flush and sync <ul><li>struct wp_doc { </li></ul><ul><ul><li>char *document; </li></ul></ul><ul><ul><li>size_t len; </li><...
A tale of libxml2 <ul><li>libxml2 provides utility functions to “write XML to file” </li></ul>
A tale of libxml2 <ul><li>libxml2 provides utility functions to “write XML to file” </li></ul><ul><li>Nice application dev...
A tale of libxml2 <ul><li>libxml2 provides utility functions to “write XML to file” </li></ul><ul><li>Nice application dev...
A tale of libxml2 <ul><li>libxml2 provides utility functions to “write XML to file” </li></ul><ul><li>Nice application dev...
A tale of libxml2 <ul><li>libxml2 provides utility functions to “write XML to file” </li></ul><ul><li>Nice application dev...
so, replace <ul><li>xmlSaveFile(foo) </li></ul>
gint common_save_xml(xmlDocPtr doc, gchar *filename) { FILE  *fp; char  *xmlbuf; int  fd, n; fp = g_fopen(filename, &quot;...
Nearing Nirvana <ul><li>If any failure during writing, the previously saved copy is untouched and safe </li></ul><ul><ul><...
Except if you want to be portable... <ul><li>On Linux, fsync(2) does actually sync </li></ul><ul><ul><li>barring enabling ...
Except if you want to be portable... <ul><li>On Linux, fsync(2) does actually sync </li></ul><ul><ul><li>barring enabling ...
Except if you want to be portable... <ul><li>On Linux, fsync(2) does actually sync </li></ul><ul><ul><li>barring enabling ...
on fsync, POSIX Says... <ul><li>If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily on the conformance do...
on fsync, POSIX Says... <ul><li>If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily on the conformance do...
POSIX compliant fsync <ul><li>int fsync(int fd) </li></ul>
POSIX compliant fsync <ul><li>int fsync(int fd) </li></ul><ul><li>{ </li></ul>
POSIX compliant fsync <ul><li>int fsync(int fd) </li></ul><ul><li>{ </li></ul><ul><ul><li>return 0; </li></ul></ul>
POSIX compliant fsync <ul><li>int fsync(int fd) </li></ul><ul><li>{ </li></ul><ul><ul><li>return 0; </li></ul></ul><ul><li...
POSIX compliant fsync <ul><li>int fsync(int fd) </li></ul><ul><li>{ </li></ul><ul><ul><li>return 0; </li></ul></ul><ul><li...
POSIX compliant fsync <ul><li>int fsync(int fd) </li></ul><ul><li>{ </li></ul><ul><ul><li>return 0; </li></ul></ul><ul><li...
Tale of a really fast database server <ul><li>A while ago (pre MySQL 4.1.9) </li></ul>
Tale of a really fast database server <ul><li>A while ago (pre MySQL 4.1.9) </li></ul><ul><li>Seeing corruption of InnoDB ...
Tale of a really fast database server <ul><li>A while ago (pre MySQL 4.1.9) </li></ul><ul><li>Seeing corruption of InnoDB ...
Tale of a really fast database server <ul><li>A while ago (pre MySQL 4.1.9) </li></ul><ul><li>Seeing corruption of InnoDB ...
fsync() doesn't have to sync <ul><li>On MacOS X, fsync() doesn't flush drive write cache </li></ul>
fsync() doesn't have to sync <ul><li>On MacOS X, fsync() doesn't flush drive write cache </li></ul><ul><li>An extra fcntl ...
Standards are great <ul><li>everybody has their own </li></ul>
Standards are great <ul><li>everybody has their own </li></ul><ul><li>makes application developers life difficult </li></ul>
Standards are great <ul><li>everybody has their own </li></ul><ul><li>makes application developers life difficult </li></u...
Standards are great <ul><li>everybody has their own </li></ul><ul><li>makes application developers life difficult </li></u...
#ifdef HAVE_DARWIN_THREADS # ifdef F_FULLFSYNC /* This executable has been compiled on Mac OS X 10.3 or later. Assume that...
#if defined(HAVE_DARWIN_THREADS) # ifndef F_FULLFSYNC /* The following definition is from the Mac OS X 10.3 <sys/fcntl.h> ...
Yes, some OS Vendors hate you <ul><li>Thanks to all the permutations of reliably getting data to a disk platter, a simple ...
Big Files
Big Files <ul><li>Write to temp file, sync, rename works badly with large files </li></ul>
Big Files <ul><li>Write to temp file, sync, rename works badly with large files </li></ul><ul><ul><li>especially frequentl...
Big Files <ul><li>Write to temp file, sync, rename works badly with large files </li></ul><ul><ul><li>especially frequentl...
Big Files <ul><li>Write to temp file, sync, rename works badly with large files </li></ul><ul><ul><li>especially frequentl...
Big Files <ul><li>Write to temp file, sync, rename works badly with large files </li></ul><ul><ul><li>especially frequentl...
Big Files <ul><li>Write to temp file, sync, rename works badly with large files </li></ul><ul><ul><li>especially frequentl...
Large directories <ul><li>Traditional directory is stored on disk as list of name,inode </li></ul>
Large directories <ul><li>Traditional directory is stored on disk as list of name,inode </li></ul><ul><li>lookup is search...
Large directories <ul><li>Traditional directory is stored on disk as list of name,inode </li></ul><ul><li>lookup is search...
Large directories <ul><li>Traditional directory is stored on disk as list of name,inode </li></ul><ul><li>lookup is search...
Large directories <ul><li>Traditional directory is stored on disk as list of name,inode </li></ul><ul><li>lookup is search...
Where data is after being written <ul><li>txns in Dbs often committed but only written to log, not main data file </li></ul>
Where data is after being written <ul><li>txns in Dbs often committed but only written to log, not main data file </li></u...
Where data is after being written <ul><li>txns in Dbs often committed but only written to log, not main data file </li></u...
Where data is after being written <ul><li>txns in Dbs often committed but only written to log, not main data file </li></u...
Where data is after being written <ul><li>txns in Dbs often committed but only written to log, not main data file </li></u...
Where data is after being written <ul><li>txns in Dbs often committed but only written to log, not main data file </li></u...
sqlite <ul><li>Many applications have structured data </li></ul>
sqlite <ul><li>Many applications have structured data </li></ul><ul><li>Can be (easily) represented in RDBMS </li></ul>
sqlite <ul><li>Many applications have structured data </li></ul><ul><li>Can be (easily) represented in RDBMS </li></ul><ul...
sqlite <ul><li>Many applications have structured data </li></ul><ul><li>Can be (easily) represented in RDBMS </li></ul><ul...
sqlite <ul><li>Many applications have structured data </li></ul><ul><li>Can be (easily) represented in RDBMS </li></ul><ul...
sqlite <ul><li>Many applications have structured data </li></ul><ul><li>Can be (easily) represented in RDBMS </li></ul><ul...
sqlite <ul><li>Many applications have structured data </li></ul><ul><li>Can be (easily) represented in RDBMS </li></ul><ul...
Performance of Large files <ul><li>Once in core, page to disk location is cached </li></ul><ul><ul><li>other OSs may vary ...
Performance of Large files <ul><li>Once in core, page to disk location is cached </li></ul><ul><ul><li>other OSs may vary ...
Performance of Large files <ul><li>Once in core, page to disk location is cached </li></ul><ul><ul><li>other OSs may vary ...
Performance of Large files <ul><li>Once in core, page to disk location is cached </li></ul><ul><ul><li>other OSs may vary ...
inode Infos Direct blocks Indirect Blocks Double indirect blocks
Extent <ul><li>start disk block </li></ul><ul><li>start file block </li></ul><ul><li>length </li></ul><ul><li>flags </li><...
Parallel writers <ul><li>Multiple threads writing large files </li></ul>
Parallel writers <ul><li>Multiple threads writing large files </li></ul><ul><li>Not uncommon to compete for disk allocatio...
Parallel writers <ul><li>Multiple threads writing large files </li></ul><ul><li>Not uncommon to compete for disk allocatio...
Parallel writers <ul><li>Multiple threads writing large files </li></ul><ul><li>Not uncommon to compete for disk allocatio...
Preallocation <ul><li>Often the file system will do it for you </li></ul><ul><ul><li>doesn't work as well with O_SYNC </li...
Tablespace allocation in NDB <ul><li>#ifdef HAVE_XFS_XFS_H </li></ul><ul><li>if( platform_test_xfs_fd(theFd) ) </li></ul><...
Improvements in mysql-test-run <ul><li>Would run several nodes on one machine </li></ul><ul><ul><li>each creating tablespa...
Improvements in mysql-test-run <ul><li>Would run several nodes on one machine </li></ul><ul><ul><li>each creating tablespa...
Improvements in mysql-test-run <ul><li>Would run several nodes on one machine </li></ul><ul><ul><li>each creating tablespa...
Library Developers <ul><li>No problem cannot be solved by Abstraction!  </li></ul>
Library Developers <ul><li>No problem cannot be solved by Abstraction!  </li></ul><ul><li>...and making it complex </li></ul>
Library Developers <ul><li>No problem cannot be solved by Abstraction!  </li></ul><ul><li>...and making it complex </li></...
Library Developers <ul><li>No problem cannot be solved by Abstraction!  </li></ul><ul><li>...and making it complex </li></...
Library Developers <ul><li>No problem cannot be solved by Abstraction!  </li></ul><ul><li>...and making it complex </li></...
Library Developers <ul><li>No problem cannot be solved by Abstraction!  </li></ul><ul><li>...and making it complex </li></...
There is hope <ul><li>You can do file IO correctly </li></ul><ul><li>You can prevent data loss </li></ul><ul><li>You can p...
Good Luck!
Good Luck! <ul><li>and please don't eat my data </li></ul>
Upcoming SlideShare
Loading in …5
×

Eat my data

6,003 views

Published on

Published in: Technology, Education

Eat my data

  1. 1. Eat My Data: (now with 20% more rant!) How everybody gets file I/O wrong Stewart Smith [email_address] Senior Software Engineer, MySQL Cluster MySQL AB
  2. 2. What I work on <ul><li>MySQL Cluster </li></ul><ul><li>High Availability </li></ul><ul><li>(Shared Nothing) Clustered Database </li></ul><ul><li>with some real time properties </li></ul>
  3. 3. Overview <ul><li>Common mistakes that lead to data loss </li></ul>
  4. 4. Overview <ul><li>Common mistakes that lead to data loss </li></ul><ul><li>Mistakes by: </li></ul><ul><ul><li>the application programmer </li></ul></ul>
  5. 5. Overview <ul><li>Common mistakes that lead to data loss </li></ul><ul><li>Mistakes by: </li></ul><ul><ul><li>the application programmer </li></ul></ul><ul><ul><li>the library programmer </li></ul></ul>
  6. 6. Overview <ul><li>Common mistakes that lead to data loss </li></ul><ul><li>Mistakes by: </li></ul><ul><ul><li>the application programmer </li></ul></ul><ul><ul><li>the library programmer </li></ul></ul><ul><ul><li>the kernel programmer </li></ul></ul>
  7. 7. Overview <ul><li>Common mistakes that lead to data loss </li></ul><ul><li>Mistakes by: </li></ul><ul><ul><li>the application programmer </li></ul></ul><ul><ul><li>the library programmer </li></ul></ul><ul><ul><li>the kernel programmer </li></ul></ul><ul><li>Mostly just concentrating on Linux </li></ul><ul><ul><li>will mention war stories on other platforms too </li></ul></ul>
  8. 8. In the beginning <ul><li>All IO was synchronous </li></ul>
  9. 9. In the beginning <ul><li>All IO was synchronous </li></ul><ul><li>When you called write things hit the platter </li></ul>
  10. 10. In the beginning <ul><li>All IO was synchronous </li></ul><ul><li>When you called write things hit the platter </li></ul><ul><li>Turns out that this is slow for a lot of cases </li></ul>
  11. 11. A world without failure
  12. 12. A world without failure <ul><li>doesn't exist </li></ul>
  13. 13. A world without failure <ul><li>doesn't exist </li></ul><ul><li>computers crash </li></ul>
  14. 14. A world without failure <ul><li>doesn't exist </li></ul><ul><li>computers crash </li></ul><ul><li>power goes out </li></ul>
  15. 15. A world without failure <ul><li>doesn't exist </li></ul><ul><li>computers crash </li></ul><ul><li>power goes out </li></ul><ul><ul><li>battery goes flat </li></ul></ul>
  16. 16. A world without failure <ul><li>doesn't exist </li></ul><ul><li>computers crash </li></ul><ul><li>power goes out </li></ul><ul><ul><li>battery goes flat </li></ul></ul><ul><ul><li>kick-the-cord out </li></ul></ul>
  17. 17. A world without failure <ul><li>doesn't exist </li></ul><ul><li>computers crash </li></ul><ul><li>power goes out </li></ul><ul><ul><li>battery goes flat </li></ul></ul><ul><ul><li>kick-the-cord out </li></ul></ul><ul><ul><li>suspend works, resume doesn't </li></ul></ul>
  18. 18. A world without failure <ul><li>doesn't exist </li></ul><ul><li>computers crash </li></ul><ul><li>power goes out </li></ul><ul><ul><li>battery goes flat </li></ul></ul><ul><ul><li>kick-the-cord out </li></ul></ul><ul><ul><li>suspend works, resume doesn't </li></ul></ul><ul><li>When the data is important, live in the world of failure </li></ul>
  19. 19. Data Consistency <ul><li>In the event of failure, what state can I expect my data to be in? </li></ul>
  20. 20. User Expectations <ul><li>If the power goes out, the last saved version of my data is there </li></ul>
  21. 21. User Expectations <ul><li>If the power goes out, the last saved version of my data is there </li></ul><ul><li>If there isn't an explicit save (e.g. RSS readers, IM logs) some recent version should be okay. </li></ul>
  22. 22. User Expectations <ul><li>If the power goes out, the last saved version of my data is there </li></ul><ul><li>If there isn't an explicit save (e.g. RSS readers, IM logs) some recent version should be okay. </li></ul><ul><li>Not Acceptable: </li></ul><ul><ul><li>I hit save, why is none of my work there? </li></ul></ul><ul><ul><li>Why have all my IM logs disappeared? </li></ul></ul><ul><ul><li>Why have all my saved passwords disappeared? </li></ul></ul>
  23. 23. Databases
  24. 24. Databases <ul><li>ACID </li></ul>
  25. 25. Databases <ul><li>ACID </li></ul><ul><ul><li>D is for Durability </li></ul></ul>
  26. 26. Databases <ul><li>ACID </li></ul><ul><ul><li>D is for Durability </li></ul></ul><ul><li>Transactions </li></ul><ul><ul><li>A committed transaction survives failure </li></ul></ul>
  27. 27. Databases <ul><li>ACID </li></ul><ul><ul><li>D is for Durability </li></ul></ul><ul><li>Transactions </li></ul><ul><ul><li>A committed transaction survives failure </li></ul></ul><ul><li>Isolation Levels </li></ul><ul><ul><li>Repeatable Read, Read Committed etc </li></ul></ul>
  28. 28. Databases <ul><li>ACID </li></ul><ul><ul><li>D is for Durability </li></ul></ul><ul><li>Transactions </li></ul><ul><ul><li>A committed transaction survives failure </li></ul></ul><ul><li>Isolation Levels </li></ul><ul><ul><li>Repeatable Read, Read Committed etc </li></ul></ul><ul><li>Very well known consistency issues </li></ul>
  29. 29. Databases <ul><li>ACID </li></ul><ul><ul><li>D is for Durability </li></ul></ul><ul><li>Transactions </li></ul><ul><ul><li>A committed transaction survives failure </li></ul></ul><ul><li>Isolation Levels </li></ul><ul><ul><li>Repeatable Read, Read Committed etc </li></ul></ul><ul><li>Very well known consistency issues </li></ul><ul><ul><li>that lots of people still get wrong </li></ul></ul>
  30. 30. Databases <ul><li>ACID </li></ul><ul><ul><li>D is for Durability </li></ul></ul><ul><li>Transactions </li></ul><ul><ul><li>A committed transaction survives failure </li></ul></ul><ul><li>Isolation Levels </li></ul><ul><ul><li>Repeatable Read, Read Committed etc </li></ul></ul><ul><li>Very well known consistency issues </li></ul><ul><ul><li>that lots of people still get wrong </li></ul></ul><ul><ul><li>different engines have different properties </li></ul></ul>
  31. 31. What databases are good at <ul><li>Lots of small objects (rows) </li></ul>
  32. 32. What databases are good at <ul><li>Lots of small objects (rows) </li></ul><ul><li>okay at larger objects (BLOBs) </li></ul>
  33. 33. What databases are good at <ul><li>Lots of small objects (rows) </li></ul><ul><li>okay at larger objects (BLOBs) </li></ul><ul><ul><li>named after “The Blob”, not Binary Large Objects </li></ul></ul>
  34. 34. What databases are good at <ul><li>Lots of small objects (rows) </li></ul><ul><li>okay at larger objects (BLOBs) </li></ul><ul><ul><li>named after “The Blob”, not Binary Large Objects </li></ul></ul><ul><li>Accessed by a variety of ways </li></ul><ul><ul><li>indexes </li></ul></ul>
  35. 35. Easy solution to data consistency <ul><li>put it in a database </li></ul><ul><ul><li>that gives data consistency guarantees </li></ul></ul><ul><li>We'll talk about this later </li></ul>
  36. 36. Revelation #1 <ul><li>Databases are not file systems! </li></ul>
  37. 37. Revelation #2 <ul><li>File systems are not databases! </li></ul>
  38. 38. Revelation #3 <ul><li>A database has different consistency semantics than a file system </li></ul>
  39. 39. Revelation #3 <ul><li>A database has different consistency semantics than a file system </li></ul><ul><ul><li>typically file systems a lot more relaxed </li></ul></ul>
  40. 40. Eat my data <ul><li>Where can the data be to eat? </li></ul>
  41. 41. Where data can be <ul><li>Application Buffer - CoolApp </li></ul><ul><ul><li>application crash = loss of this data </li></ul></ul>
  42. 42. Where data can be <ul><li>Application Buffer - CoolApp </li></ul><ul><ul><li>application crash = loss of this data </li></ul></ul><ul><li>Library buffer - glibc </li></ul><ul><ul><li>application crash = loss of this data </li></ul></ul>
  43. 43. Where data can be <ul><li>Application Buffer - CoolApp </li></ul><ul><ul><li>application crash = loss of this data </li></ul></ul><ul><li>Library buffer - glibc </li></ul><ul><ul><li>application crash = loss of this data </li></ul></ul><ul><li>Operating System buffer – page/buffer cache </li></ul><ul><ul><li>system crash = loss of this data </li></ul></ul>
  44. 44. Where data can be <ul><li>Application Buffer - CoolApp </li></ul><ul><ul><li>application crash = loss of this data </li></ul></ul><ul><li>Library buffer - glibc </li></ul><ul><ul><li>application crash = loss of this data </li></ul></ul><ul><li>Operating System buffer – page/buffer cache </li></ul><ul><ul><li>system crash = loss of this data </li></ul></ul><ul><li>on disk </li></ul><ul><ul><li>disk failure = loss of data </li></ul></ul>
  45. 45. Data flow <ul><li>Application to library buffer </li></ul><ul><ul><li>fwrite(), fprintf() and friends </li></ul></ul>
  46. 46. Data flow <ul><li>Application to library buffer </li></ul><ul><ul><li>fwrite(), fprintf() and friends </li></ul></ul><ul><li>Library to OS buffer </li></ul><ul><ul><li>write(2) </li></ul></ul>
  47. 47. Data flow <ul><li>Application to library buffer </li></ul><ul><ul><li>fwrite(), fprintf() and friends </li></ul></ul><ul><li>Library to OS buffer </li></ul><ul><ul><li>write(2) </li></ul></ul><ul><li>OS buffer to disk </li></ul>
  48. 48. Data flow <ul><li>Application to library buffer </li></ul><ul><ul><li>fwrite(), fprintf() and friends </li></ul></ul><ul><li>Library to OS buffer </li></ul><ul><ul><li>write(2) </li></ul></ul><ul><li>OS buffer to disk </li></ul><ul><ul><li>paged out, periodic flushing (5 or 30secs usually) </li></ul></ul>
  49. 49. Data flow <ul><li>Application to library buffer </li></ul><ul><ul><li>fwrite(), fprintf() and friends </li></ul></ul><ul><li>Library to OS buffer </li></ul><ul><ul><li>write(2) </li></ul></ul><ul><li>OS buffer to disk </li></ul><ul><ul><li>paged out, periodic flushing (5 or 30secs usually) </li></ul></ul><ul><ul><ul><li>Can be very delayed with laptop mode </li></ul></ul></ul>
  50. 50. Data flow <ul><li>Application to library buffer </li></ul><ul><ul><li>fwrite(), fprintf() and friends </li></ul></ul><ul><li>Library to OS buffer </li></ul><ul><ul><li>write(2) </li></ul></ul><ul><li>OS buffer to disk </li></ul><ul><ul><li>paged out, periodic flushing (5 or 30secs usually) </li></ul></ul><ul><ul><ul><li>Can be very delayed with laptop mode </li></ul></ul></ul><ul><ul><li>fsync(2), fdatasync(2), sync(2) </li></ul></ul>
  51. 51. Data flow <ul><li>Application to library buffer </li></ul><ul><ul><li>fwrite(), fprintf() and friends </li></ul></ul><ul><li>Library to OS buffer </li></ul><ul><ul><li>write(2) </li></ul></ul><ul><li>OS buffer to disk </li></ul><ul><ul><li>paged out, periodic flushing (5 or 30secs usually) </li></ul></ul><ul><ul><ul><li>Can be very delayed with laptop mode </li></ul></ul></ul><ul><ul><li>fsync(2), fdatasync(2), sync(2) </li></ul></ul><ul><ul><ul><li>with caveats! </li></ul></ul></ul>
  52. 52. Simple Application: Save==on disk <ul><li>User hits “Save” in Word Processor </li></ul><ul><ul><li>Expects that data to be on disk when “Saved” </li></ul></ul><ul><li>How? </li></ul>
  53. 53. Saving a simple document <ul><li>struct wp_doc { </li></ul><ul><ul><li>char *document; </li></ul></ul><ul><ul><li>size_t len; </li></ul></ul><ul><li>}; </li></ul><ul><li>struct wp_doc d; </li></ul><ul><li>... </li></ul><ul><li>FILE *f; </li></ul><ul><li>f= fopen(“important_document”,”w”); </li></ul><ul><li>fwrite(d.document,d.len,1,f); </li></ul>
  54. 54. Bug #1 <ul><li>No fclose(2) </li></ul><ul><ul><li>Buffers for the stream may not be flushed from libc cache </li></ul></ul>
  55. 55. Word Processor Saving -1 Bug <ul><li>struct wp_doc { </li></ul><ul><ul><li>char *document; </li></ul></ul><ul><ul><li>size_t len; </li></ul></ul><ul><li>}; </li></ul><ul><li>struct wp_doc d; </li></ul><ul><li>... </li></ul><ul><li>FILE *f; </li></ul><ul><li>f= fopen(“important_document”,”w”); </li></ul><ul><li>fwrite(d.document,d.len,1,f); </li></ul><ul><li>fclose(f); </li></ul>
  56. 56. Bug #2, 3 and 4 <ul><li>No error checking! </li></ul><ul><li>fopen </li></ul><ul><ul><li>Did we open the file </li></ul></ul><ul><li>fwrite </li></ul><ul><ul><li>did we write the entire file (ENOSPC?) </li></ul></ul><ul><li>fclose </li></ul><ul><ul><li>did we successfully close the file </li></ul></ul>
  57. 57. File System Integrity <ul><li>metadata journaling is just that </li></ul>
  58. 58. File System Integrity <ul><li>metadata journaling is just that </li></ul><ul><ul><li>meta data only </li></ul></ul>
  59. 59. File System Integrity <ul><li>metadata journaling is just that </li></ul><ul><ul><li>meta data only </li></ul></ul><ul><ul><li>no data is written to the journal </li></ul></ul>
  60. 60. File System Integrity <ul><li>metadata journaling is just that </li></ul><ul><ul><li>meta data only </li></ul></ul><ul><ul><li>no data is written to the journal </li></ul></ul><ul><ul><li>integrity of file system structures </li></ul></ul>
  61. 61. File System Integrity <ul><li>metadata journaling is just that </li></ul><ul><ul><li>meta data only </li></ul></ul><ul><ul><li>no data is written to the journal </li></ul></ul><ul><ul><li>integrity of file system structures </li></ul></ul><ul><ul><li>not internals of files </li></ul></ul>
  62. 62. Data journaling <ul><li>is nothing like a database transaction </li></ul>
  63. 63. Atomic write(2)
  64. 64. Atomic write(2) <ul><li>It isn't </li></ul>
  65. 65. Atomic write(2) <ul><li>It isn't </li></ul><ul><li>can half complete </li></ul>
  66. 66. Atomic write(2) <ul><li>It isn't </li></ul><ul><li>can half complete </li></ul><ul><li>A file system with atomic write(2) </li></ul>
  67. 67. Atomic write(2) <ul><li>It isn't </li></ul><ul><li>can half complete </li></ul><ul><li>A file system with atomic write(2) </li></ul><ul><ul><li>can't rely on it being there </li></ul></ul>
  68. 68. Atomic write(2) <ul><li>It isn't </li></ul><ul><li>can half complete </li></ul><ul><li>A file system with atomic write(2) </li></ul><ul><ul><li>can't rely on it being there </li></ul></ul><ul><ul><li>Essentially useless </li></ul></ul>
  69. 69. Eat My Data <ul><li>struct wp_doc { </li></ul><ul><ul><li>char *document; </li></ul></ul><ul><ul><li>size_t len; </li></ul></ul><ul><li>}; </li></ul><ul><li>struct wp_doc d; </li></ul><ul><li>... </li></ul><ul><li>FILE *f; </li></ul><ul><li>f= fopen(“important_document”,”w”); </li></ul><ul><li>fwrite(d.document,d.len,1,f); </li></ul><ul><li>fclose(f); </li></ul>CRASH
  70. 70. Write to Temp file, rename <ul><li>Old trick of writing to temp file first </li></ul>
  71. 71. Write to Temp file, rename <ul><li>Old trick of writing to temp file first </li></ul><ul><li>Can catch any errors </li></ul><ul><ul><li>e.g. ENOSPC </li></ul></ul>
  72. 72. Write to Temp file, rename <ul><li>Old trick of writing to temp file first </li></ul><ul><li>Can catch any errors </li></ul><ul><ul><li>e.g. ENOSPC </li></ul></ul><ul><ul><li>don't rename on error </li></ul></ul>
  73. 73. Write to Temp file, rename <ul><li>Old trick of writing to temp file first </li></ul><ul><li>Can catch any errors </li></ul><ul><ul><li>e.g. ENOSPC </li></ul></ul><ul><ul><li>don't rename on error </li></ul></ul><ul><li>Idea that if we crash during writing temp file user data is safe </li></ul>
  74. 74. Write to Temp file, rename <ul><li>Old trick of writing to temp file first </li></ul><ul><li>Can catch any errors </li></ul><ul><ul><li>e.g. ENOSPC </li></ul></ul><ul><ul><li>don't rename on error </li></ul></ul><ul><li>Idea that if we crash during writing temp file user data is safe </li></ul><ul><ul><li>although we may leave around a temp file </li></ul></ul>
  75. 75. Temp file, rename <ul><li>struct wp_doc { </li></ul><ul><ul><li>char *document; </li></ul></ul><ul><ul><li>size_t len; </li></ul></ul><ul><li>}; </li></ul><ul><li>struct wp_doc d; </li></ul><ul><li>... </li></ul><ul><li>FILE *f; </li></ul><ul><li>f= fopen(“important_document.temp”,”w”); </li></ul><ul><li>if(!f) return errno; </li></ul><ul><li>size_t w= fwrite(d.document,d.len,1,f); </li></ul><ul><li>if(w<d.len) return errno; </li></ul><ul><li>fclose(f); </li></ul><ul><li>rename(“important_document.temp”,”important_document”); </li></ul>
  76. 76. Now all is good with the world...
  77. 77. Now all is good with the world... <ul><li>This is where a lot of people stop </li></ul>
  78. 78. Now all is good with the world... <ul><li>This is where a lot of people stop </li></ul><ul><li>close(2) and rename(2) do not imply sync </li></ul>
  79. 79. <ul><li>close and rename do not imply sync </li></ul>
  80. 80. Now all is good with the world... <ul><li>This is where a lot of people stop </li></ul><ul><li>close(2) and rename(2) do not imply sync </li></ul><ul><li>They make no guarantees on when (or in what order) changes hit the platter </li></ul>
  81. 81. Now all is good with the world... <ul><li>This is where a lot of people stop </li></ul><ul><li>close(2) and rename(2) do not imply sync </li></ul><ul><li>They make no guarantees on when (or in what order) changes hit the platter </li></ul><ul><li>Quite possible (and often) metadata is flushed before data </li></ul>
  82. 82. File System Integrity <ul><li>data=ordered mode on ext3 </li></ul><ul><ul><li>writes data before metadata </li></ul></ul><ul><ul><li>other file systems are different </li></ul></ul>
  83. 83. File System Integrity <ul><li>data=ordered mode on ext3 </li></ul><ul><ul><li>writes data before metadata </li></ul></ul><ul><ul><li>other file systems are different </li></ul></ul><ul><li>ext3 ordered mode is an exception , not the rule </li></ul>
  84. 84. File System Integrity <ul><li>data=ordered mode on ext3 </li></ul><ul><ul><li>writes data before metadata </li></ul></ul><ul><ul><li>other file systems are different </li></ul></ul><ul><li>ext3 ordered mode is an exception , not the rule </li></ul><ul><ul><li>applications relying on this are not portable and depend on file system behaviour. the applications are buggy. </li></ul></ul>
  85. 85. data=ordered <ul><li>write() </li></ul><ul><li>close() </li></ul><ul><li>rename() </li></ul><ul><li>Disk order: </li></ul><ul><ul><li>data from fwrite() </li></ul></ul><ul><ul><li>inode </li></ul></ul><ul><ul><li>directory entry </li></ul></ul>
  86. 86. other systems <ul><li>write() </li></ul><ul><li>close() </li></ul><ul><li>rename() </li></ul><ul><li>Disk order: </li></ul><ul><ul><li>any! </li></ul></ul>
  87. 87. flush and sync <ul><li>struct wp_doc { </li></ul><ul><ul><li>char *document; </li></ul></ul><ul><ul><li>size_t len; </li></ul></ul><ul><li>}; </li></ul><ul><li>struct wp_doc d; </li></ul><ul><li>... </li></ul><ul><li>FILE *f; </li></ul><ul><li>f= fopen(“important_document.temp”,”w”); </li></ul><ul><li>if(!f) return errno; </li></ul><ul><li>size_t w= fwrite(d.document,d.len,1,f); </li></ul><ul><li>if(w<d.len) return errno; </li></ul><ul><li>if(fflush(f)!=0) return errno; </li></ul><ul><li>if(fsync(fileno(f))==-1) return errno; </li></ul><ul><li>fclose(f); </li></ul><ul><li>rename(“important_document.temp”,”important_document”); </li></ul>Flush the buffers! Sync to disk before rename
  88. 88. A tale of libxml2 <ul><li>libxml2 provides utility functions to “write XML to file” </li></ul>
  89. 89. A tale of libxml2 <ul><li>libxml2 provides utility functions to “write XML to file” </li></ul><ul><li>Nice application developer saves time by using libxml2's function </li></ul>
  90. 90. A tale of libxml2 <ul><li>libxml2 provides utility functions to “write XML to file” </li></ul><ul><li>Nice application developer saves time by using libxml2's function </li></ul><ul><ul><li>Application developer writes to temp file, renames </li></ul></ul>
  91. 91. A tale of libxml2 <ul><li>libxml2 provides utility functions to “write XML to file” </li></ul><ul><li>Nice application developer saves time by using libxml2's function </li></ul><ul><ul><li>Application developer writes to temp file, renames </li></ul></ul><ul><ul><li>User looses data after crash </li></ul></ul>
  92. 92. A tale of libxml2 <ul><li>libxml2 provides utility functions to “write XML to file” </li></ul><ul><li>Nice application developer saves time by using libxml2's function </li></ul><ul><ul><li>Application developer writes to temp file, renames </li></ul></ul><ul><ul><li>User looses data after crash </li></ul></ul><ul><ul><li>Nice application developer has to work around limitations of library </li></ul></ul>
  93. 93. so, replace <ul><li>xmlSaveFile(foo) </li></ul>
  94. 94. gint common_save_xml(xmlDocPtr doc, gchar *filename) { FILE *fp; char *xmlbuf; int fd, n; fp = g_fopen(filename, &quot;w&quot;); if(NULL == fp) return -1; xmlDocDumpFormatMemory(doc, (xmlChar **)&xmlbuf, &n, TRUE); if(n <= 0) { errno = ENOMEM; return -1; } if(fwrite(xmlbuf, sizeof (xmlChar), n, fp) < n) { xmlFree (xmlbuf); return -1; } xmlFree (xmlbuf); /* flush user-space buffers */ if (fflush (fp) != 0) return -1; if ((fd = fileno (fp)) == -1) return -1; #ifdef HAVE_FSYNC /* sync kernel-space buffers to disk */ if (fsync (fd) == -1) return -1; #endif fclose(fp); return 0; }
  95. 95. Nearing Nirvana <ul><li>If any failure during writing, the previously saved copy is untouched and safe </li></ul><ul><ul><li>User wont get partial or no data </li></ul></ul>
  96. 96. Except if you want to be portable... <ul><li>On Linux, fsync(2) does actually sync </li></ul><ul><ul><li>barring enabling write cache </li></ul></ul>
  97. 97. Except if you want to be portable... <ul><li>On Linux, fsync(2) does actually sync </li></ul><ul><ul><li>barring enabling write cache </li></ul></ul><ul><li>On MacOS X, </li></ul>
  98. 98. Except if you want to be portable... <ul><li>On Linux, fsync(2) does actually sync </li></ul><ul><ul><li>barring enabling write cache </li></ul></ul><ul><li>On MacOS X, not so much </li></ul>
  99. 99. on fsync, POSIX Says... <ul><li>If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily on the conformance document to tell the user what can be expected from the system. It is explicitly intended that a null implementation is permitted. </li></ul>
  100. 100. on fsync, POSIX Says... <ul><li>If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily on the conformance document to tell the user what can be expected from the system. It is explicitly intended that a null implementation is permitted. </li></ul>
  101. 101. POSIX compliant fsync <ul><li>int fsync(int fd) </li></ul>
  102. 102. POSIX compliant fsync <ul><li>int fsync(int fd) </li></ul><ul><li>{ </li></ul>
  103. 103. POSIX compliant fsync <ul><li>int fsync(int fd) </li></ul><ul><li>{ </li></ul><ul><ul><li>return 0; </li></ul></ul>
  104. 104. POSIX compliant fsync <ul><li>int fsync(int fd) </li></ul><ul><li>{ </li></ul><ul><ul><li>return 0; </li></ul></ul><ul><li>} </li></ul>
  105. 105. POSIX compliant fsync <ul><li>int fsync(int fd) </li></ul><ul><li>{ </li></ul><ul><ul><li>return 0; </li></ul></ul><ul><li>} </li></ul>gcc
  106. 106. POSIX compliant fsync <ul><li>int fsync(int fd) </li></ul><ul><li>{ </li></ul><ul><ul><li>return 0; </li></ul></ul><ul><li>} </li></ul>pushl %ebp movl %esp, %ebp movl $0, %eax popl %ebp ret gcc
  107. 107. Tale of a really fast database server <ul><li>A while ago (pre MySQL 4.1.9) </li></ul>
  108. 108. Tale of a really fast database server <ul><li>A while ago (pre MySQL 4.1.9) </li></ul><ul><li>Seeing corruption of InnoDB pages </li></ul>
  109. 109. Tale of a really fast database server <ul><li>A while ago (pre MySQL 4.1.9) </li></ul><ul><li>Seeing corruption of InnoDB pages </li></ul><ul><ul><li>only on MacOS X </li></ul></ul>
  110. 110. Tale of a really fast database server <ul><li>A while ago (pre MySQL 4.1.9) </li></ul><ul><li>Seeing corruption of InnoDB pages </li></ul><ul><ul><li>only on MacOS X </li></ul></ul><ul><li>Also, things seemed pretty fast </li></ul>
  111. 111. fsync() doesn't have to sync <ul><li>On MacOS X, fsync() doesn't flush drive write cache </li></ul>
  112. 112. fsync() doesn't have to sync <ul><li>On MacOS X, fsync() doesn't flush drive write cache </li></ul><ul><li>An extra fcntl is provided to do this </li></ul>
  113. 113. Standards are great <ul><li>everybody has their own </li></ul>
  114. 114. Standards are great <ul><li>everybody has their own </li></ul><ul><li>makes application developers life difficult </li></ul>
  115. 115. Standards are great <ul><li>everybody has their own </li></ul><ul><li>makes application developers life difficult </li></ul><ul><li>Let's see the InnoDB code for ensuring data is synced to disk </li></ul>
  116. 116. Standards are great <ul><li>everybody has their own </li></ul><ul><li>makes application developers life difficult </li></ul><ul><li>Let's see the InnoDB code for ensuring data is synced to disk </li></ul><ul><ul><li>if this doesn't work, transactions don't work </li></ul></ul>
  117. 117. #ifdef HAVE_DARWIN_THREADS # ifdef F_FULLFSYNC /* This executable has been compiled on Mac OS X 10.3 or later. Assume that F_FULLFSYNC is available at run-time. */ srv_have_fullfsync = TRUE; # else /* F_FULLFSYNC */ /* This executable has been compiled on Mac OS X 10.2 or earlier. Determine if the executable is running on Mac OS X 10.3 or later. */ struct utsname utsname; if (uname(&utsname)) { fputs(&quot;InnoDB: cannot determine Mac OS X version!n&quot;, stderr); } else { srv_have_fullfsync = strcmp(utsname.release, &quot;7.&quot;) >= 0; } if (!srv_have_fullfsync) { fputs(&quot;InnoDB: On Mac OS X, fsync() may be&quot; &quot; broken on internal drives,n&quot; &quot;InnoDB: making transactions unsafe!n&quot;, stderr); } # endif /* F_FULLFSYNC */ #endif /* HAVE_DARWIN_THREADS */
  118. 118. #if defined(HAVE_DARWIN_THREADS) # ifndef F_FULLFSYNC /* The following definition is from the Mac OS X 10.3 <sys/fcntl.h> */ # define F_FULLFSYNC 51 /* fsync + ask the drive to flush to the media */ # elif F_FULLFSYNC != 51 # error &quot;F_FULLFSYNC != 51: ABI incompatibility with Mac OS X 10.3&quot; # endif /* Apple has disabled fsync() for internal disk drives in OS X. That caused corruption for a user when he tested a power outage. Let us in OS X use a nonstandard flush method recommended by an Apple engineer. */ if (!srv_have_fullfsync) { /* If we are not on an operating system that supports this, then fall back to a plain fsync. */ ret = fsync(file); } else { ret = fcntl(file, F_FULLFSYNC, NULL); if (ret) { /* If we are not on a file system that supports this, then fall back to a plain fsync. */ ret = fsync(file); } } #elif HAVE_FDATASYNC ret = fdatasync(file); #else /* fprintf(stderr, &quot;Flushing to file %pn&quot;, file); */ ret = fsync(file); #endif
  119. 119. Yes, some OS Vendors hate you <ul><li>Thanks to all the permutations of reliably getting data to a disk platter, a simple call is now two screens of code </li></ul>
  120. 120. Big Files
  121. 121. Big Files <ul><li>Write to temp file, sync, rename works badly with large files </li></ul>
  122. 122. Big Files <ul><li>Write to temp file, sync, rename works badly with large files </li></ul><ul><ul><li>especially frequently modified large files </li></ul></ul>
  123. 123. Big Files <ul><li>Write to temp file, sync, rename works badly with large files </li></ul><ul><ul><li>especially frequently modified large files </li></ul></ul><ul><li>Time to start REDO/UNDO logging </li></ul>
  124. 124. Big Files <ul><li>Write to temp file, sync, rename works badly with large files </li></ul><ul><ul><li>especially frequently modified large files </li></ul></ul><ul><li>Time to start REDO/UNDO logging </li></ul><ul><ul><li>other cool tricks </li></ul></ul>
  125. 125. Big Files <ul><li>Write to temp file, sync, rename works badly with large files </li></ul><ul><ul><li>especially frequently modified large files </li></ul></ul><ul><li>Time to start REDO/UNDO logging </li></ul><ul><ul><li>other cool tricks </li></ul></ul><ul><li>Or not care so much </li></ul><ul><ul><li>e.g. DVD ripping </li></ul></ul>
  126. 126. Big Files <ul><li>Write to temp file, sync, rename works badly with large files </li></ul><ul><ul><li>especially frequently modified large files </li></ul></ul><ul><li>Time to start REDO/UNDO logging </li></ul><ul><ul><li>other cool tricks </li></ul></ul><ul><li>Or not care so much </li></ul><ul><ul><li>e.g. DVD ripping </li></ul></ul><ul><li>Some video software saves frame-per-file </li></ul>
  127. 127. Large directories <ul><li>Traditional directory is stored on disk as list of name,inode </li></ul>
  128. 128. Large directories <ul><li>Traditional directory is stored on disk as list of name,inode </li></ul><ul><li>lookup is search through this list </li></ul>
  129. 129. Large directories <ul><li>Traditional directory is stored on disk as list of name,inode </li></ul><ul><li>lookup is search through this list </li></ul><ul><li>Allocation of disk space to directories is block-at-a-time, leading to fragmentation </li></ul>
  130. 130. Large directories <ul><li>Traditional directory is stored on disk as list of name,inode </li></ul><ul><li>lookup is search through this list </li></ul><ul><li>Allocation of disk space to directories is block-at-a-time, leading to fragmentation </li></ul><ul><li>Directory indexes help </li></ul><ul><ul><li>some better than others </li></ul></ul>
  131. 131. Large directories <ul><li>Traditional directory is stored on disk as list of name,inode </li></ul><ul><li>lookup is search through this list </li></ul><ul><li>Allocation of disk space to directories is block-at-a-time, leading to fragmentation </li></ul><ul><li>Directory indexes help </li></ul><ul><ul><li>some better than others </li></ul></ul><ul><li>Can't always control the file system </li></ul><ul><ul><li>count on over a few thousand files being slow </li></ul></ul>
  132. 132. Where data is after being written <ul><li>txns in Dbs often committed but only written to log, not main data file </li></ul>
  133. 133. Where data is after being written <ul><li>txns in Dbs often committed but only written to log, not main data file </li></ul><ul><li>same with file systems </li></ul><ul><ul><li>metadata that's replayed from the log is committed </li></ul></ul>
  134. 134. Where data is after being written <ul><li>txns in Dbs often committed but only written to log, not main data file </li></ul><ul><li>same with file systems </li></ul><ul><ul><li>metadata that's replayed from the log is committed </li></ul></ul><ul><li>Don't rely on reading the raw disk of a mounted fs </li></ul>
  135. 135. Where data is after being written <ul><li>txns in Dbs often committed but only written to log, not main data file </li></ul><ul><li>same with file systems </li></ul><ul><ul><li>metadata that's replayed from the log is committed </li></ul></ul><ul><li>Don't rely on reading the raw disk of a mounted fs </li></ul><ul><ul><li>it harms kittens, </li></ul></ul>
  136. 136. Where data is after being written <ul><li>txns in Dbs often committed but only written to log, not main data file </li></ul><ul><li>same with file systems </li></ul><ul><ul><li>metadata that's replayed from the log is committed </li></ul></ul><ul><li>Don't rely on reading the raw disk of a mounted fs </li></ul><ul><ul><li>it harms kittens, puppies </li></ul></ul>
  137. 137. Where data is after being written <ul><li>txns in Dbs often committed but only written to log, not main data file </li></ul><ul><li>same with file systems </li></ul><ul><ul><li>metadata that's replayed from the log is committed </li></ul></ul><ul><li>Don't rely on reading the raw disk of a mounted fs </li></ul><ul><ul><li>it harms kittens, puppies and babies </li></ul></ul>
  138. 138. sqlite <ul><li>Many applications have structured data </li></ul>
  139. 139. sqlite <ul><li>Many applications have structured data </li></ul><ul><li>Can be (easily) represented in RDBMS </li></ul>
  140. 140. sqlite <ul><li>Many applications have structured data </li></ul><ul><li>Can be (easily) represented in RDBMS </li></ul><ul><li>sqlite is ACID with a capital D for Durability </li></ul>
  141. 141. sqlite <ul><li>Many applications have structured data </li></ul><ul><li>Can be (easily) represented in RDBMS </li></ul><ul><li>sqlite is ACID with a capital D for Durability </li></ul><ul><li>takes hard work out of things </li></ul>
  142. 142. sqlite <ul><li>Many applications have structured data </li></ul><ul><li>Can be (easily) represented in RDBMS </li></ul><ul><li>sqlite is ACID with a capital D for Durability </li></ul><ul><li>takes hard work out of things </li></ul><ul><li>Not so good with many clients </li></ul>
  143. 143. sqlite <ul><li>Many applications have structured data </li></ul><ul><li>Can be (easily) represented in RDBMS </li></ul><ul><li>sqlite is ACID with a capital D for Durability </li></ul><ul><li>takes hard work out of things </li></ul><ul><li>Not so good with many clients </li></ul><ul><li>Brilliant for a document format though </li></ul>
  144. 144. sqlite <ul><li>Many applications have structured data </li></ul><ul><li>Can be (easily) represented in RDBMS </li></ul><ul><li>sqlite is ACID with a capital D for Durability </li></ul><ul><li>takes hard work out of things </li></ul><ul><li>Not so good with many clients </li></ul><ul><li>Brilliant for a document format though </li></ul><ul><li>Scales up to “a few dozen GB of data” before not being as efficient as other RDBMs </li></ul>
  145. 145. Performance of Large files <ul><li>Once in core, page to disk location is cached </li></ul><ul><ul><li>other OSs may vary </li></ul></ul>
  146. 146. Performance of Large files <ul><li>Once in core, page to disk location is cached </li></ul><ul><ul><li>other OSs may vary </li></ul></ul><ul><li>Performance difference is in initial lookup </li></ul>
  147. 147. Performance of Large files <ul><li>Once in core, page to disk location is cached </li></ul><ul><ul><li>other OSs may vary </li></ul></ul><ul><li>Performance difference is in initial lookup </li></ul><ul><li>Extents based file systems much more efficient </li></ul>
  148. 148. Performance of Large files <ul><li>Once in core, page to disk location is cached </li></ul><ul><ul><li>other OSs may vary </li></ul></ul><ul><li>Performance difference is in initial lookup </li></ul><ul><li>Extents based file systems much more efficient </li></ul><ul><li>Zeroing takes long time </li></ul><ul><ul><li>support for unwritten extents means fast zeroing </li></ul></ul><ul><ul><li>think CREATE TABLESPACE </li></ul></ul><ul><ul><li>think bittorrent </li></ul></ul>
  149. 149. inode Infos Direct blocks Indirect Blocks Double indirect blocks
  150. 150. Extent <ul><li>start disk block </li></ul><ul><li>start file block </li></ul><ul><li>length </li></ul><ul><li>flags </li></ul><ul><ul><li>e.g. unwritten </li></ul></ul>
  151. 151. Parallel writers <ul><li>Multiple threads writing large files </li></ul>
  152. 152. Parallel writers <ul><li>Multiple threads writing large files </li></ul><ul><li>Not uncommon to compete for disk allocation </li></ul><ul><ul><li>especially with O_SYNC </li></ul></ul>
  153. 153. Parallel writers <ul><li>Multiple threads writing large files </li></ul><ul><li>Not uncommon to compete for disk allocation </li></ul><ul><ul><li>especially with O_SYNC </li></ul></ul><ul><li>Files can get interweaved ababab </li></ul><ul><ul><li>extent based file systems suffer </li></ul></ul><ul><ul><li>reading performance suffers </li></ul></ul><ul><ul><li>especially with slow growing files </li></ul></ul>
  154. 154. Parallel writers <ul><li>Multiple threads writing large files </li></ul><ul><li>Not uncommon to compete for disk allocation </li></ul><ul><ul><li>especially with O_SYNC </li></ul></ul><ul><li>Files can get interweaved ababab </li></ul><ul><ul><li>extent based file systems suffer </li></ul></ul><ul><ul><li>reading performance suffers </li></ul></ul><ul><ul><li>especially with slow growing files </li></ul></ul><ul><li>Preallocate disk space </li></ul><ul><ul><li>with no standard way to do it... :( </li></ul></ul>
  155. 155. Preallocation <ul><li>Often the file system will do it for you </li></ul><ul><ul><li>doesn't work as well with O_SYNC </li></ul></ul><ul><li>No (useful) standard way to preallocate space </li></ul><ul><ul><li>posix_fallocate doesn't work </li></ul></ul><ul><ul><li>xfsctl for files on XFS </li></ul></ul>
  156. 156. Tablespace allocation in NDB <ul><li>#ifdef HAVE_XFS_XFS_H </li></ul><ul><li>if( platform_test_xfs_fd(theFd) ) </li></ul><ul><li>{ </li></ul><ul><li>ndbout_c(&quot;Using xfsctl(XFS_IOC_RESVSP64) to allocate disk space&quot;); </li></ul><ul><li>xfs_flock64_t fl; </li></ul><ul><li>fl.l_whence= 0; </li></ul><ul><li>fl.l_start= 0; </li></ul><ul><li>fl.l_len= (off64_t)sz; </li></ul><ul><li>if( xfsctl(NULL, theFd, XFS_IOC_RESVSP64, &fl) < 0) </li></ul><ul><li>ndbout_c(&quot;failed to optimally allocate disk space&quot;); </li></ul><ul><li>} </li></ul><ul><li>#endif </li></ul><ul><li>#ifdef HAVE_POSIX_FALLOCATE </li></ul><ul><li>posix_fallocate(theFd, 0, sz); </li></ul><ul><li>#endif </li></ul>
  157. 157. Improvements in mysql-test-run <ul><li>Would run several nodes on one machine </li></ul><ul><ul><li>each creating tablespace files </li></ul></ul><ul><ul><li>all IO is O_SYNC </li></ul></ul>
  158. 158. Improvements in mysql-test-run <ul><li>Would run several nodes on one machine </li></ul><ul><ul><li>each creating tablespace files </li></ul></ul><ul><ul><li>all IO is O_SYNC </li></ul></ul><ul><li>number of extents for ndb_dd_basic tablespaces and log files </li></ul><ul><ul><li>BEFORE this code: 57, 13, 212, 95, 17, 113 </li></ul></ul><ul><ul><li>WITH this code : ALL 1 or 2 extents </li></ul></ul>
  159. 159. Improvements in mysql-test-run <ul><li>Would run several nodes on one machine </li></ul><ul><ul><li>each creating tablespace files </li></ul></ul><ul><ul><li>all IO is O_SYNC </li></ul></ul><ul><li>number of extents for ndb_dd_basic tablespaces and log files </li></ul><ul><ul><li>BEFORE this code: 57, 13, 212, 95, 17, 113 </li></ul></ul><ul><ul><li>WITH this code : ALL 1 or 2 extents </li></ul></ul><ul><li>30 seconds reduction in each test that created tablespaces </li></ul>
  160. 160. Library Developers <ul><li>No problem cannot be solved by Abstraction! </li></ul>
  161. 161. Library Developers <ul><li>No problem cannot be solved by Abstraction! </li></ul><ul><li>...and making it complex </li></ul>
  162. 162. Library Developers <ul><li>No problem cannot be solved by Abstraction! </li></ul><ul><li>...and making it complex </li></ul><ul><ul><li>mysys my_stat MY_STAT *my_stat(const char *path, MY_STAT *stat_area, myf my_flags) </li></ul></ul>
  163. 163. Library Developers <ul><li>No problem cannot be solved by Abstraction! </li></ul><ul><li>...and making it complex </li></ul><ul><ul><li>mysys my_stat MY_STAT *my_stat(const char *path, MY_STAT *stat_area, myf my_flags) </li></ul></ul><ul><ul><li>versus POSIX stat int stat(const char *path, struct stat *buf); </li></ul></ul>
  164. 164. Library Developers <ul><li>No problem cannot be solved by Abstraction! </li></ul><ul><li>...and making it complex </li></ul><ul><ul><li>mysys my_stat MY_STAT *my_stat(const char *path, MY_STAT *stat_area, myf my_flags) </li></ul></ul><ul><ul><li>versus POSIX stat int stat(const char *path, struct stat *buf); </li></ul></ul><ul><ul><li>my_fstat (grrr) int my_fstat(int Filedes, MY_STAT *stat_area, myf MyFlags __attribute__((unused))) </li></ul></ul>
  165. 165. Library Developers <ul><li>No problem cannot be solved by Abstraction! </li></ul><ul><li>...and making it complex </li></ul><ul><ul><li>mysys my_stat MY_STAT *my_stat(const char *path, MY_STAT *stat_area, myf my_flags) </li></ul></ul><ul><ul><li>versus POSIX stat int stat(const char *path, struct stat *buf); </li></ul></ul><ul><ul><li>my_fstat (grrr) int my_fstat(int Filedes, MY_STAT *stat_area, myf MyFlags __attribute__((unused))) </li></ul></ul><ul><ul><li>versus POSIX fstat int fstat(int filedes, struct stat *buf); </li></ul></ul>
  166. 166. There is hope <ul><li>You can do file IO correctly </li></ul><ul><li>You can prevent data loss </li></ul><ul><li>You can pester people to make life easier </li></ul>
  167. 167. Good Luck!
  168. 168. Good Luck! <ul><li>and please don't eat my data </li></ul>

×