Befriending Dragons

Transform Tech with Anti-bullying Cultures


1 Comment

DBCC CHECKDB / Database Integrity

It is very important to understand DBCC CHECKDB. Make sure you run DBCC CHECKDB periodically and that if you do find some sort of corruption you address it immediately – both the current problem in the SQL Server database and the underlying problem that caused it.

 

·         How often should we run CHECKDB?

o   There is no single best schedule for how often to run CHECKDB. You have to understand how much risk you are willing to take and balance that with the impact of running CHECKDB. Paul Randal’s blog says to run some sort of consistency check “at least once a week”.

o   On systems that are not used 24 hours a day where CHECKDB is able to complete during the low/no use period, maybe you run it every week or even every day. On systems where CHECKDB cannot complete within a downtime/slow period then either you take the performance hit or you run it less frequently.

o   If you have systems that have been up for years with no corruption, does that mean you run it less often? Maybe, but remember that the older the hardware is the more likely it is to begin to wear out or even fail, and bad hardware is the cause of almost all database corruption problems.

·         How can we make it run faster?

o   Indirectly, run it less often. 🙂

§  Instead of running it for every database every night, maybe run it for half the dbs one night and the other half the next night.

§  Do one database a night. Either round robin equally or do more important ones more frequently.

§  Check individual tables with CHECKTABLE and don’t do them all at once (and frequently do the other database level checks that are included in CHECKDB – CHECKALLOC and CHECKCATALOG).

§  Check one partition (assuming each is on a separate filegroup) at a time with CHECKFILEGROUP.

§  Run CHECKDB more frequently on data that changes frequently and less often on less frequently updated or less important data.

§  Create a program that runs one checkdb, then checks to see how much time is left and maybe runs another, looping through until it reaches the end of your batch window. Fancier versions might look at runtime history to estimate whether the “next” check can finish before the batch window ends, when each database was last checked, a priority you add to each database, etc.

§  Use whatever schedule balances your ability to handle the risk of having corruption without knowing vs. the overhead of checking for the corruption.

o   Make the IO faster and/or add more CPUs.

o   Reduce other activity on the system while CHECKDB is running, especially data modifications.

o   Reduce the amount of space used in the database.

§  Archive unused data.

§  Drop redundant, duplicate, or unused indexes.

o   Use different options such as PHYSICAL_ONLY combined with having CHECKSUM enabled. Run a full CHECKDB less often.

o   Restore the backup and run CHECKDB on the restored copy. This has the added benefit of testing the consistency of the log and testing the viability of the backup file itself (and the restore process).

o   Use the TABLOCK option for CHECKDB. This may make the CHECKDB itself run faster BUT can result in some blocking.

·         Options

o   Generally you will want WITH ALL_ERRORMSGS on. The default was OFF in older versions, but is ON in newer versions. So just specify it explicitly to avoid a mistake. A caveat here is that SSMS has been written to limit the output of CHECKDB to the first 1000 rows, so you’re better off running CHECKDB from a tool like SQLCMD or OSQL. This has the added benefit of avoiding the overhead of writing all that output to the memory of the client box (or possibly the server if you run SSMS them).

o   NO_INFOMSGS is commonly used so you only have actual errors in the output, but sometimes the informational info can be useful to have.

o   PHYSICAL_ONLY changes the impact from a CPU bound process to an IO bound process.

·         Finding errors

o   If you have corruption, DBCC CHECKDB will NOT automatically abend! It will “successfully” complete. In SQL 2005 and later a “successful” CHECKDB will update the last known good run for that db.

o   If SQL Agent is used to schedule the CHECKDB then under some conditions it may abend when corruption is detected. This is a feature of SQL Agent, not a feature of DBCC. So if you use another scheduling tool that tool would have to detect @@error or use another method to look for CHECKDB failures. Within SQL Agent, you will probably want to change your job steps so that a failure in checking one database does NOT fail the entire job, but instead continues checking the next database in the next job step.

o   When CHECKDB detects a corruption error it writes a log SQLDUMPnnnn.txt to the SQL log directory.

o   Try/Catch only catches exceptions and DBCC CHECK* errors are not exceptions. Therefore Try/Catch will not detect DBCC errors. 953503 The DBCC CHECKDB statement does not raise exceptions in SQL Server when the DBCC CHECKDB statement detect a consistency error in the database http://support.microsoft.com/default.aspx?scid=kb;EN-US;953503

·         Checking the log

o   CHECKDB checks the data files. It does NOT check the log. The only way to ensure the log is not corrupted is to test a restore and see if it’s successful.

·         Suspect_pages table

o   In SQL 2005 and later a table in MSDB called suspect_pages tracks page IO errors 823 and 824. It does NOT track metadata, header, or linkage corruption errors/problems. It also does not track 825 errors which are early indicators that often later become 823/824 errors.

o   It is populated ONLY if a corrupted page is read into memory and if that page has an 823 or 824 problem.

o   Only the first 1000 rows are kept, don’t rely on this to have every error if you have massive or long running corruption. You need to monitor and clean up this table.

o   These pages are “suspected” to have problems, but it is possible the problem was transient or has already been fixed.

·         Backups

o   The only way to verify a backup is good/usable is to restore it. Let me say it again – you don’t know you have a good backup until you have restored it!

o   You can check that the header is fine with RESTORE …. WITH VERIFYONLY but that only checks the header and checksums!

o   Enabling CHECKSUM on the database plus specifying WITH CHECKSUM on the backup will detect many types of problems, but not all.

o   Even if the file is good at the time you take it, there’s no guarantee it will remain good. The media it is on can potentially cause corruption of the backup file(s).

o   Did I mention that you don’t know you have a good backup until you have restored it?

·         Checksum database option

o   A SQL Server DBA myth a day: (27/30) use BACKUP WITH CHECKSUM to replace DBCC CHECKDB http://sqlskills.com/BLOGS/PAUL/post/A-SQL-Server-DBA-myth-a-day-(2730)-use-BACKUP-WITH-CHECKSUM-to-replace-DBCC-CHECKDB.aspx

o   Once CHECKSUM has been enabled at the database level (ALTER DATABASE … SET PAGE_VERIFY CHECKSUM) any data pages that are modified will have a checksum added. Until every page is modified there will be pages that have no checksum.

o   CHECKSUM is calculated for a data page the first time that page is modified after the ALTER DATABASE.

o   CHECKSUM values on pages are only checked when those pages are read into memory so rarely used pages can be corrupted for a long time without being detected! Common ways to read the pages and see if the checksum values are still valid: BACKUP WITH CHECKSUM and CHECKDB.

o   CHECKSUM causes an error to be reported, it cannot correct the problem.

o   CHECKSUM doesn’t do anything about in-memory corruptions or any corruption errors other than 823, 824, 825.

o   CHECKSUM does NOT apply to/report problems with logical meta data, page linkages, stale reads, or index ordering.

·         Avoid corruption

o   Make sure your hardware is sound.

o   Alert on hardware level errors and take action on them.

o   Keep your firmware and storport drivers updated.

o   Alert on 823, 824, and 825 errors. If you already alert on severity 19+ errors that will capture 823 and 824 errors but not 825.

o   Run SQLIOSim to check that the IO subsystem is still properly handling data correctness.

·         What should I do when there are errors?

o   Check the status of your backups and start thinking about the consequences of restoring them.

o   The REPAIR_ALLOW_DATA_LOSS option is your LAST resort. If you use it, you WILL lose data and you may very well never know how much or what was lost.

o   Fix the underlying problem! If you see corruption once it will almost certainly happen again on the same system. The vast majority of the time the problem is with the hardware layer. Sometimes it is that the actual disk failed, sometimes it’s something less direct like failed batteries on the write controller combined with a power outage. It could potentially be firmware, a filter driver (Diskkeeper recently had a bug), or other problems. There was a rare bug (KB 954734) in SQL 2005 prior to SP3 (and SP3 is now the minimum supported version of 2005) where using the old DBREINDEX syntax instead of the new ALTER INDEX REBUILD/REORG syntax on a heap could corrupt the index and you would have to rebuild the index again. So you MUST find the source of the problem and increase monitoring on any system that has had a corruption problem.

o   Understand the errors. Corruption in a non-clustered index can usually be fixed by rebuilding the index. Other types of errors can require a restore of a backup. Read this blog from the guy who wrote CHECKDB: http://www.sqlskills.com/BLOGS/PAUL/post/CHECKDB-From-Every-Angle-Tips-and-tricks-for-interpreting-CHECKDB-output.aspx

 

General info:

·         Paul Randal’s collection of Corruption and CHECKDB blogs

o   http://www.sqlskills.com/BLOGS/PAUL/category/Corruption.aspx

o   http://www.sqlskills.com/BLOGS/PAUL/category/CHECKDB-From-Every-Angle.aspx

·         Sample corrupt databases to play with  

o   http://www.sqlskills.com/BLOGS/PAUL/post/Sample-corrupt-databases-to-play-with.aspx

o   http://www.sqlskills.com/BLOGS/PAUL/post/Conference-corruption-demo-scripts-and-example-corrupt-databases.aspx

o   http://www.sqlskills.com/pastConferences.asp

·         SQL Server: Is CHECKDB A Necessity?

http://blogs.msdn.com/psssql/archive/2009/02/20/sql-server-is-checkdb-a-necessity.aspx

·         Transaction Log (block) CHECKSUM
http://sqlserverpedia.com/blog/sql-server-2005/transaction-log-block-checksum/

·         Which part of ‘REPAIR_ALLOW_DATA_LOSS’ isn’t clear? http://blogs.msdn.com/sqlserverstorageengine/archive/2006/06/16/633645.aspx

·         Misconceptions around database repair http://www.sqlskills.com/BLOGS/PAUL/post/Misconceptions-around-database-repair.aspx

·         SQL Server 2008 Database Checking  http://blogs.technet.com/josebda/archive/2009/03/20/sql-server-2008-checking-databases.aspx

 

Checking the output

·         Automate your Database Integrity checks using SMO http://www.simple-talk.com/sql/backup-and-recovery/automate-your-database-integrity-checks-using-smo/

·         Inserting DBCC CHECKDB Results Into A Table http://www.lockergnome.com/sqlsquirrel/2009/03/11/inserting-dbcc-checkdb-results-into-a-table/


Leave a comment

Backing up a corrupted SQL Server database

I had a question about how to do a backup and skip a corrupted block of data. First, DO NOT DO IT unless you absolutely have to, such as when you are taking a backup prior to trying to fix the corruption (which means you should be on the phone with Microsoft PSS). If you do skip corrupted data you have to consider the backup to be very suspect.

 

Do not ever ignore any indication of data inconsistency in the database. If you have corrupted data it is almost certainly a problem caused by something below the SQL Server level. If it happened once, chances are it will happen again… and again…. and again until the source of the problem is fixed. This means the instant you have any indication of a corrupt SQL Server database you should immediately ask for low-level hardware diagnostics and a thorough review of all logs (event viewer, SQL, hardware, etc.). Double check that if write caching is enabled on the hardware that it is battery backed and the battery is healthy. Double check that all firmware is up to date. Run a DBCC CHECKDB WITH ALL_ERRORMSGS and pay very close attention to the output. Find the source of your corruption and fix it.

 

There is a parameter CONTINUE_AFTER_ERROR for BACKUP and RESTORE, but it is a last ditch command that should only be used as a last resort. One example would be if it’s the only way to get a backup before you attempt to repair the corruption. It does not always work, it depends on what the error is. If you actually have to restore a database backup taken with this option, then you MUST fix the corruption before allowing users, applications, or other production processes back into the database. From BOL:

“We strongly recommend that you reserve using the CONTINUE_AFTER_ERROR option until you have exhausted all alternatives.”

“At the end of a restore sequence that continues despite errors, you may be able to repair the database with DBCC CHECKDB. For CHECKDB to run most consistently after using RESTORE CONTINUE_AFTER_ERROR, we recommend that you use the WITH TABLOCK option in your DBCC CHECKDB command.”

“Use NO_TRUNCATE or CONTINUE_AFTER_ERROR only if you are backing up the tail of a damaged database.”

 

Some suggestions:

·         For every 2005/2008 database, SET PAGE_VERIFY=CHECKSUM (in 2005 this cannot be turned on for TempDB, but it can be turned on for TempDB in 2008). For SQL Server 2000 set TORN_PAGE_DETECTION=ON. When upgrading from 2000 to newer versions set TORN_PAGE_DETECTION=OFF and SET PAGE_VERIFY=CHECKSUM.

·         For databases with CHECKSUM enabled, use the WITH CHECKSUM command on all backups.

·         Implement a “standards” or “best practices” document to handle corruption on each version of SQL Server.

·         Review your disaster recovery plans and upcoming testing. Testing of a full recovery of various scenarios should be done periodically. Some people think once a year is enough, others say monthly or quarterly is often enough. Having backups is not good enough, we have to know that they can be restored. There are also scenarios where backups are not the best way to recover from a problem.

 

Some great info from the person who wrote CHECKDB:

http://sqlskills.com/blogs/paul/post/Example-20002005-corrupt-databases-and-some-more-info-on-backup-restore-page-checksums-and-IO-errors.aspx

http://sqlskills.com/BLOGS/PAUL/category/Corruption.aspx