Data recovery consistency with check db

This is the session I did at TechEd India 2010 on Data Recovery and Consistency with CHECKDB.

Published in: Technology

  2. Data Recovery & Consistency with CHECKDB with SQL Server
     - Vinod Kumar
     - Technology Evangelist - Microsoft
     - @vinodk_sql
     - www.ExtremeExperts.com
     - http://blogs.sqlxml.org/vinodkumar
  3. Why Is This Session Important?
     - Corruption does happen, mostly caused by the IO subsystem
     - People don’t realize they have corruption until it is too late
     - People don’t know what to do when they do have corruption, leading to:
       - More data loss and downtime than necessary
       - Monetary and even job losses
  4. What Can Happen to an Unprepared DBA Confronted by Corruption?
  5. Session Takeaways
     - From this session you will learn:
       - The significance of CHECKDB
       - Guidance and options after corruption
       - How to get a database back online
       - How to distinguish repair from restore
     - DON’T TRY this on your production environments
  6. Agenda
     - Discovering corruption
     - Interpreting CHECKDB output
     - Choosing between restore and repair
     - Recovering from a ‘last resort’
     - With demos of common scenarios
  7. I/O Errors
     - Three types:
       - 823 (a hard I/O error)
       - 824 (a soft I/O error)
       - 825 (a read-retry error)
     - Nice error messages in 2005+:
       - Msg 824, Level 24, State 2, Line 1
       - SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x7232c940; actual: 0x720e4940). It occurred during a read of page (1:143) in database ID 8 at offset 0x0000000011e000 in file 'c:\broken.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
     - Logged in msdb..suspect_pages
       - Input into single-page restore operations
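     The suspect-pages table named on this slide can be queried directly. A minimal sketch, assuming read access to msdb; the columns are those of the documented `msdb.dbo.suspect_pages` table:

     ```sql
     -- List pages SQL Server has flagged after 823/824 errors, newest first.
     -- event_type: 1 = 823 or unspecified 824, 2 = bad checksum, 3 = torn page,
     --             4 = restored, 5 = repaired by DBCC, 7 = deallocated by DBCC.
     SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
     FROM msdb.dbo.suspect_pages
     ORDER BY last_update_date DESC;
     ```

     The `file_id:page_id` pair from a row (for example 1:143 in the message above) is the value a single-page restore expects.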
  8. Page Protection Options
     - SQL Server allows pages to be ‘protected’ on disk from corruptions
     - Allows fast detection of corruptions
     - Set using:
       - ALTER DATABASE <database> SET PAGE_VERIFY <option>
     - Three options:
       - NONE
       - TORN_PAGE_DETECTION
       - CHECKSUM
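     As a concrete illustration of the option above, the following sketch enables page checksums and then confirms the setting. `mydb` is a placeholder database name, not one from the session:

     ```sql
     -- Enable page checksums (mydb is a placeholder name).
     ALTER DATABASE mydb SET PAGE_VERIFY CHECKSUM;

     -- Confirm the current setting for every database on the instance.
     SELECT name, page_verify_option_desc
     FROM sys.databases;
     ```

     Note that a checksum is only written the next time a page is written to disk, so existing pages are not protected until they are modified or rebuilt.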
  9. DBCC CHECKDB
     - The only way to read all allocated pages in the database
     - Use to force page checksums to be checked
     - Choose between full checks and WITH PHYSICAL_ONLY
     - Many algorithms to minimize runtime, and runs ONLINE since SQL Server 2000
     - Blog post series:
       - http://www.sqlskills.com/blogs/paul/category/CHECKDB-From-Every-Angle.aspx
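     The two styles of check mentioned above look like this in practice. A sketch only, with `mydb` as a placeholder database name:

     ```sql
     -- Faster check: physical page consistency only (page checksums are still verified).
     DBCC CHECKDB (N'mydb') WITH PHYSICAL_ONLY;

     -- Full logical and physical check, suppressing informational messages.
     DBCC CHECKDB (N'mydb') WITH NO_INFOMSGS;
     ```

     A common compromise is to run PHYSICAL_ONLY frequently and the full check less often, but the right schedule depends on your maintenance window.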
  10. First Hints That Something Is Wrong…
      - Application/user connections get broken
      - Users report 823 or 824 errors
        - ‘Hard’ and ‘Soft’ IO errors
      - Backup jobs start failing
        - Error 3043 – backup detected checksum errors
      - Agent alerts start firing
        - Should have alerts on all errors with severity >= 19
        - Should have an alert on error 825
          - Informational (!) message that there are transient IO problems
      - Maintenance jobs start failing
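      The alerts recommended above can be created with the SQL Server Agent procedure `sp_add_alert`. This is a minimal sketch; the alert names are made up for illustration, and an alert only reaches someone once it is tied to an operator with `sp_add_notification`, which is not shown here:

      ```sql
      USE msdb;
      GO
      DECLARE @sev int, @alert_name sysname;
      SET @sev = 19;

      -- One alert per severity level from 19 through 25.
      WHILE @sev <= 25
      BEGIN
          SET @alert_name = N'Severity ' + CAST(@sev AS nvarchar(2)) + N' error';  -- illustrative name
          EXEC dbo.sp_add_alert
               @name = @alert_name,
               @message_id = 0,
               @severity = @sev,
               @include_event_description_in = 1;  -- include the error text in e-mail
          SET @sev = @sev + 1;
      END;

      -- Error 825 is only severity 10, so it needs its own alert by message number.
      EXEC dbo.sp_add_alert
           @name = N'Error 825 read-retry',  -- illustrative name
           @message_id = 825,
           @severity = 0,
           @include_event_description_in = 1;
      ```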
  11. As Soon As Corruption Is Suspected…
      - No need to panic!
      - Determine the extent of the corruption
        - Run DBCC CHECKDB
        - Look in the SQL Server error log
        - Check maintenance job history
      - Check what backups are available
      - Wait for CHECKDB to finish before doing anything else
        - You may not NEED to do anything intrusive/destructive
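      One way to check which backups are available is the backup history in msdb. A sketch, assuming the history has not been purged and using `mydb` as a placeholder database name:

      ```sql
      -- Most recent backups recorded for the database.
      -- type: D = full, I = differential, L = transaction log.
      SELECT database_name, type, backup_start_date, backup_finish_date,
             has_backup_checksums, is_damaged
      FROM msdb.dbo.backupset
      WHERE database_name = N'mydb'
      ORDER BY backup_finish_date DESC;
      ```

      The history only says that a backup was taken; whether the file still exists and restores cleanly has to be checked separately.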
  12. How To Run DBCC CHECKDB
      - By default, CHECKDB will:
        - Only return the first 200 errors
        - Return lots of info that’s distracting in a corruption situation
      - Use the following command with only these options:
        - DBCC CHECKDB (<<yourdb>>) WITH ALL_ERRORMSGS, NO_INFOMSGS
      - If it’s taking longer than usual, that may mean it has found some corruption
        - Check the error log for message 5268 from SQL Server 2005 SP2 onwards to see if it’s rescanning some data
      - Most importantly, wait for it to complete!
  13. Interpreting CHECKDB Output (1)
      - So, CHECKDB completes and you have a bunch of cryptic error messages. Now what?
      - There are over 150 errors that CHECKDB can output, some with over 200 states
      - Figuring out what one error means isn’t too bad
        - MSDN has most of them published for reference
      - There are some tips and tricks you can use…
  14. Interpreting CHECKDB Output (2)
      - Did CHECKDB fail?
        - If it stops before completing successfully, something bad has happened that is preventing CHECKDB from running
        - This means there is no choice but to restore from a backup, as CHECKDB cannot be forced to run (and hence repair)
      - Examples of fatal (to CHECKDB) errors:
        - 7984 – 7988: corruption in critical system tables
        - 8967: invalid states within CHECKDB itself
        - 8930: corrupt metadata in the database such that CHECKDB could not run
      - See ‘Understanding DBCC Error Messages’ in the Books Online (BOL) entry for DBCC CHECKDB for more details
  15. Interpreting CHECKDB Output
      - Demo: example fatal errors to CHECKDB
  16. Interpreting CHECKDB Output (3)
      - Are the corruptions only in non-clustered indexes?
        - If recommended repair level is REPAIR_REBUILD, then YES!
        - Otherwise, check all the index IDs in the errors – if they’re all greater than 1, then YES!
      - If YES, you *could* just rebuild the corrupt indexes
        - Depends on the error, and the size of the index
      - But, what caused the corruption?
        - If you just rebuild the indexes, the corruption will probably happen again (especially if caused by the IO subsystem)
        - Make sure you do root-cause analysis and take preventative measures
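      Rebuilding a single damaged non-clustered index, as suggested above, might look like the following. The table and index names are hypothetical. Disabling the index first is a common precaution so that the rebuild reads from the base table rather than from the damaged index structure:

      ```sql
      -- dbo.Orders and IX_Orders_CustomerID are hypothetical names.
      ALTER INDEX IX_Orders_CustomerID ON dbo.Orders DISABLE;
      ALTER INDEX IX_Orders_CustomerID ON dbo.Orders REBUILD;

      -- Re-check just that table afterwards.
      DBCC CHECKTABLE (N'dbo.Orders') WITH ALL_ERRORMSGS, NO_INFOMSGS;
      ```

      Be careful with indexes that enforce a unique or primary key constraint: disabling one of those also disables any foreign keys that reference it.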
  17. Interpreting CHECKDB Output
      - Demo: non-clustered index corruption only
  18. Interpreting CHECKDB Output (4)
      - Was there an un-repairable error found?
        - 8909, 8938, 8939 (page header corruption) errors where the type is ‘PFS’
        - 8970 error: invalid data for the column type
        - 8992 error: CHECKCATALOG (metadata mismatch) error
        - Plus a few more obscure ones
          - E.g. an 8904 error (extent is allocated to two objects). This is usually repairable except in the case where the extent is marked as mixed and dedicated, and has pages allocated to multiple objects. The repair is too complicated and/or destructive so is not attempted.
      - None of these can be automatically repaired
        - But if you don’t have a backup without these corruptions, you may be able to fix the 8970 and 8992 errors…
  19. Interpreting CHECKDB Output
      - Demo: manually repairing an invalid data value (2570) in SQL Server 2005+
  20. Interpreting CHECKDB Output
      - Demo: manually repairing a metadata corruption (8992) in SQL Server 2005+
  21. Recovering Using Backups
      - Best way to avoid data loss
      - Not necessarily the best way to avoid downtime
        - Depends what kind of backups are available
        - Although backup compression in SQL Server 2008 helps…
      - Plethora of options available
        - Full database backup is a good starting point
        - Series of transaction log backups as well is much better
        - Beyond the scope of this session…
      - Remember:
        - Backups have to exist to be useful
        - Backups have to be valid to avoid data loss
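      One way to gain some confidence that a backup is valid is to take it with checksums and verify it afterwards. A sketch with a placeholder database name and a hypothetical backup path:

      ```sql
      -- Verify page checksums while backing up, and checksum the backup itself.
      BACKUP DATABASE mydb
      TO DISK = N'D:\backups\mydb_full.bak'   -- hypothetical path
      WITH CHECKSUM, INIT;

      -- Re-read the backup and re-verify the checksums without restoring it.
      RESTORE VERIFYONLY
      FROM DISK = N'D:\backups\mydb_full.bak'
      WITH CHECKSUM;
      ```

      VERIFYONLY is only a partial check. The real test of a backup is restoring it somewhere and running DBCC CHECKDB against the restored copy.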
  22. Choosing Between Restore and Repair (1)
      - Multiple decision points that could short-circuit the decision process
      - Do you still have a database?
        - No – you must restore from a backup
      - Do you have working backups?
        - No – you must use repair, or restore a damaged backup with CONTINUE_AFTER_ERROR, or extract data to a new database
      - Is the log damaged?
        - Yes – you must restore, or run emergency mode repair, or extract to a new database
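      Restoring a damaged backup with CONTINUE_AFTER_ERROR, one of the options above, could be sketched as follows. The target database name, file paths, and logical file names are all hypothetical; the logical names in a real backup can be listed with RESTORE FILELISTONLY:

      ```sql
      -- Restore as much as possible from a damaged backup into a separate database.
      RESTORE DATABASE mydb_salvage
      FROM DISK = N'D:\backups\mydb_full.bak'
      WITH CONTINUE_AFTER_ERROR,
           MOVE N'mydb' TO N'D:\data\mydb_salvage.mdf',
           MOVE N'mydb_log' TO N'D:\data\mydb_salvage_log.ldf';

      -- The result is likely to be corrupt, so check it before trusting any data.
      DBCC CHECKDB (N'mydb_salvage') WITH ALL_ERRORMSGS, NO_INFOMSGS;
      ```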
  23. Choosing Between Restore and Repair (2)
      - Did CHECKDB fail?
        - Yes – you must restore or extract
      - Is it just non-clustered indexes that are damaged?
        - Yes – maybe rebuild them manually
      - Are there any un-repairable errors?
        - Yes – you must restore or extract
      - If you’re still able to make a repair/restore choice:
        - Consider your down-time and data-loss Service Level Agreements
        - Use whichever option you can which allows you to limit down-time and data-loss while still staying within the SLAs
  24. Repair vs. Restore
      - Demo: manually repairing a single page corruption with and without backups
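      When backups are available, a single damaged page can be restored on its own. The following is a sketch of the documented online page-restore sequence, not the demo script from the session. It assumes the full or bulk-logged recovery model and an unbroken log backup chain; the paths are hypothetical and page 1:143 is taken from the example 824 message earlier in the deck:

      ```sql
      -- 1. Restore only the damaged page from the last good full backup.
      RESTORE DATABASE mydb PAGE = '1:143'
      FROM DISK = N'D:\backups\mydb_full.bak'
      WITH NORECOVERY;

      -- 2. Apply the existing log backups taken since that full backup.
      RESTORE LOG mydb FROM DISK = N'D:\backups\mydb_log1.trn' WITH NORECOVERY;

      -- 3. Back up the current log so the page can be brought fully up to date.
      BACKUP LOG mydb TO DISK = N'D:\backups\mydb_log_tail.trn';

      -- 4. Restore that final log backup and recover.
      RESTORE LOG mydb FROM DISK = N'D:\backups\mydb_log_tail.trn' WITH RECOVERY;
      ```

      Online page restore depends on the SQL Server edition; where it is not available, the same idea applies with the database offline and the tail of the log backed up first.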
  25. Beware of REPAIR_ALLOW_DATA_LOSS
      - Repair fixes structural inconsistencies by de-allocating
        - (Not REPAIR_REBUILD, but indexes should be fixed manually)
        - This is the fastest and most provably correct way
      - Repair doesn’t take into account:
        - Foreign-key constraints
        - Inherent business logic and data relationships
        - Replication (see BOL for DBCC CHECKDB)
      - Before running repair, protect yourself
        - Take a backup and quiesce replication topologies involved
      - After running repair, check the data
        - Consider running DBCC CHECKCONSTRAINTS
        - Fix up any replication topologies involved
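      The precautions above can be strung together roughly as follows. This is a sketch under assumptions: `mydb` and the backup path are placeholders, and replication handling is left out entirely:

      ```sql
      -- Protect yourself first: back up the database in its current (damaged) state.
      BACKUP DATABASE mydb
      TO DISK = N'D:\backups\mydb_before_repair.bak'   -- hypothetical path
      WITH CONTINUE_AFTER_ERROR;

      -- Repair requires single-user mode.
      ALTER DATABASE mydb SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

      DBCC CHECKDB (N'mydb', REPAIR_ALLOW_DATA_LOSS) WITH ALL_ERRORMSGS, NO_INFOMSGS;

      ALTER DATABASE mydb SET MULTI_USER;

      -- Afterwards, look for rows that now violate foreign-key or check constraints.
      USE mydb;
      DBCC CHECKCONSTRAINTS WITH ALL_CONSTRAINTS;
      ```

      Keeping the pre-repair backup means the de-allocated data can still be examined later if the repair removes more than expected.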
  26. What If the Log Is Damaged?
      - Without a backup, two realistic choices:
        - Use EMERGENCY mode to access the data in the corrupt state
          - E.g. to extract to another database
          - ALTER DATABASE mydb SET EMERGENCY;
        - Use EMERGENCY mode repair
          - New feature of SQL Server 2005
          - Rebuilds the log and runs REPAIR_ALLOW_DATA_LOSS as an atomic operation
          - Database must be in EMERGENCY *and* SINGLE_USER
      - This is the 3rd worst state to be in
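      EMERGENCY mode repair, as described above, comes down to a short sequence. A sketch only, using the slide's `mydb` placeholder; treat it as a last resort and copy the database files somewhere safe first if at all possible:

      ```sql
      -- Put the database into EMERGENCY mode and take exclusive access.
      ALTER DATABASE mydb SET EMERGENCY;
      ALTER DATABASE mydb SET SINGLE_USER;

      -- In EMERGENCY mode this rebuilds the log if necessary and then runs repair.
      DBCC CHECKDB (N'mydb', REPAIR_ALLOW_DATA_LOSS) WITH ALL_ERRORMSGS, NO_INFOMSGS;

      -- If the repair succeeds, reopen the database to other users.
      ALTER DATABASE mydb SET MULTI_USER;
      ```

      Because the log is rebuilt, transactional consistency is lost: the data should be checked at the business level afterwards, not just structurally.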
  27. Things That People Often Try *First*
      - Restart SQL Server
        - Just wastes time and delays getting back online
      - Immediately jump to a last resort and cause data loss without working through options
        - Running repair
        - Rebuilding the transaction log
      - Detach a suspect database
        - It will fail to attach again – now the situation is even worse!
        - This is the 2nd worst state to be in
        - However, there’s a trick you can use…
  28. Repairing a Suspect Database
      - Demo: how to hack a detached suspect database back into the system and repair it
  29. What If You Don't Have a Database At All *OR* Any Kind of Backup to Restore From?
      - Total data loss - *this* is the worst state to be in
      - You might have no choice apart from manual re-entry, or
      - URLC
        - Update Resume, Leave City
  30. Summary: Pulling It All Together
      - Know the signs of corruption
      - When corruption occurs, be methodical:
        - Figure out the extent of the corruption
        - Figure out your options to limit downtime, data loss, or both
        - If you’re going to run repair, take a backup first
        - Fix the corruption
        - Finish with root-cause analysis
      - Test all of this before you have to do it for real
      - Good luck!
  31. Resources (Paul's Blog)
      - Example corrupt databases to play with
        - http://www.sqlskills.com/blogs/paul/post/Example-20002005-corrupt-databases-and-some-more-info-on-backup-restore-page-checksums-and-IO-errors.aspx
      - Everything you ever wanted to know about CHECKDB
        - http://www.sqlskills.com/blogs/paul/category/CHECKDB-From-Every-Angle.aspx
      - Tips and tricks for interpreting CHECKDB output
        - http://www.sqlskills.com/blogs/paul/post/CHECKDB-From-Every-Angle-Tips-and-tricks-for-interpreting-CHECKDB-output.aspx
      - Log rebuilding and repair
        - http://www.sqlskills.com/blogs/paul/post/Corruption-Last-resorts-that-people-try-first.aspx
      - Page checksums and SQLIOSim
        - http://www.sqlskills.com/blogs/paul/post/How-to-tell-if-the-IO-subsystem-is-causing-corruptions.aspx
      - EMERGENCY mode repair
        - http://www.sqlskills.com/blogs/paul/post/CHECKDB-From-Every-Angle-EMERGENCY-mode-repair-the-very-very-last-resort.aspx
  32. Thank you (આભાર, ধন্যবাদ, நன்றி, धन्यवाद, ಧನ್ಯವಾದಗಳು, ధన్యవాదాలు, ଧନ୍ୟବାଦ, ਧੰਨਵਾਦ, നിങ്ങള്‍‌ക്ക് നന്ദി)
  33. © 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
      The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
