DBCC CHECKDB / Database Integrity

It is very important to understand DBCC CHECKDB. Make sure you run DBCC CHECKDB periodically and that if you do find some sort of corruption you address it immediately – both the current problem in the SQL Server database and the underlying problem that caused it.

· How often should we run CHECKDB?

o There is no single best schedule for how often to run CHECKDB. You have to understand how much risk you are willing to take and balance that with the impact of running CHECKDB. Paul Randal’s blog says to run some sort of consistency check “at least once a week”.

o On systems that are not used 24 hours a day where CHECKDB is able to complete during the low/no use period, maybe you run it every week or even every day. On systems where CHECKDB cannot complete within a downtime/slow period then either you take the performance hit or you run it less frequently.

o If you have systems that have been up for years with no corruption, does that mean you run it less often? Maybe, but remember that the older the hardware is the more likely it is to begin to wear out or even fail, and bad hardware is the cause of almost all database corruption problems.

· How can we make it run faster?

o Indirectly, run it less often. 🙂

§ Instead of running it for every database every night, maybe run it for half the dbs one night and the other half the next night.

§ Do one database a night. Either round robin equally or do more important ones more frequently.

§ Check individual tables with CHECKTABLE and don’t do them all at once (and frequently do the other database level checks that are included in CHECKDB – CHECKALLOC and CHECKCATALOG).

§ Check one partition (assuming each is on a separate filegroup) at a time with CHECKFILEGROUP.

§ Run CHECKDB more frequently on data that changes frequently and less often on less frequently updated or less important data.

§ Create a program that runs one checkdb, then checks to see how much time is left and maybe runs another, looping through until it reaches the end of your batch window. Fancier versions might look at runtime history to estimate whether the “next” check can finish before the batch window ends, when each database was last checked, a priority you add to each database, etc.

§ Use whatever schedule balances your ability to handle the risk of having corruption without knowing vs. the overhead of checking for the corruption.

o Make the IO faster and/or add more CPUs.

o Reduce other activity on the system while CHECKDB is running, especially data modifications.

o Reduce the amount of space used in the database.

§ Archive unused data.

§ Drop redundant, duplicate, or unused indexes.

o Use different options such as PHYSICAL_ONLY combined with having CHECKSUM enabled. Run a full CHECKDB less often.

o Restore the backup and run CHECKDB on the restored copy. This has the added benefit of testing the consistency of the log and testing the viability of the backup file itself (and the restore process).

o Use the TABLOCK option for CHECKDB. This may make the CHECKDB itself run faster BUT can result in some blocking.

· Options

o Generally you will want WITH ALL_ERRORMSGS on. The default was OFF in older versions, but is ON in newer versions. So just specify it explicitly to avoid a mistake. A caveat here is that SSMS has been written to limit the output of CHECKDB to the first 1000 rows, so you’re better off running CHECKDB from a tool like SQLCMD or OSQL. This has the added benefit of avoiding the overhead of writing all that output to the memory of the client box (or possibly the server if you run SSMS them).

o NO_INFOMSGS is commonly used so you only have actual errors in the output, but sometimes the informational info can be useful to have.

o PHYSICAL_ONLY changes the impact from a CPU bound process to an IO bound process.

· Finding errors

o If you have corruption, DBCC CHECKDB will NOT automatically abend! It will “successfully” complete. In SQL 2005 and later a “successful” CHECKDB will update the last known good run for that db.

o If SQL Agent is used to schedule the CHECKDB then under some conditions it may abend when corruption is detected. This is a feature of SQL Agent, not a feature of DBCC. So if you use another scheduling tool that tool would have to detect @@error or use another method to look for CHECKDB failures. Within SQL Agent, you will probably want to change your job steps so that a failure in checking one database does NOT fail the entire job, but instead continues checking the next database in the next job step.

o When CHECKDB detects a corruption error it writes a log SQLDUMPnnnn.txt to the SQL log directory.

o Try/Catch only catches exceptions and DBCC CHECK* errors are not exceptions. Therefore Try/Catch will not detect DBCC errors. 953503 The DBCC CHECKDB statement does not raise exceptions in SQL Server when the DBCC CHECKDB statement detect a consistency error in the database http://support.microsoft.com/default.aspx?scid=kb;EN-US;953503

· Checking the log

o CHECKDB checks the data files. It does NOT check the log. The only way to ensure the log is not corrupted is to test a restore and see if it’s successful.

· Suspect_pages table

o In SQL 2005 and later a table in MSDB called suspect_pages tracks page IO errors 823 and 824. It does NOT track metadata, header, or linkage corruption errors/problems. It also does not track 825 errors which are early indicators that often later become 823/824 errors.

o It is populated ONLY if a corrupted page is read into memory and if that page has an 823 or 824 problem.

o Only the first 1000 rows are kept, don’t rely on this to have every error if you have massive or long running corruption. You need to monitor and clean up this table.

o These pages are “suspected” to have problems, but it is possible the problem was transient or has already been fixed.

· Backups

o The only way to verify a backup is good/usable is to restore it. Let me say it again – you don’t know you have a good backup until you have restored it!

o You can check that the header is fine with RESTORE …. WITH VERIFYONLY but that only checks the header and checksums!

o Enabling CHECKSUM on the database plus specifying WITH CHECKSUM on the backup will detect many types of problems, but not all.

o Even if the file is good at the time you take it, there’s no guarantee it will remain good. The media it is on can potentially cause corruption of the backup file(s).

o Did I mention that you don’t know you have a good backup until you have restored it?

· Checksum database option

o A SQL Server DBA myth a day: (27/30) use BACKUP WITH CHECKSUM to replace DBCC CHECKDB http://sqlskills.com/BLOGS/PAUL/post/A-SQL-Server-DBA-myth-a-day-(2730)-use-BACKUP-WITH-CHECKSUM-to-replace-DBCC-CHECKDB.aspx

o Once CHECKSUM has been enabled at the database level (ALTER DATABASE … SET PAGE_VERIFY CHECKSUM) any data pages that are modified will have a checksum added. Until every page is modified there will be pages that have no checksum.

o CHECKSUM is calculated for a data page the first time that page is modified after the ALTER DATABASE.

o CHECKSUM values on pages are only checked when those pages are read into memory so rarely used pages can be corrupted for a long time without being detected! Common ways to read the pages and see if the checksum values are still valid: BACKUP WITH CHECKSUM and CHECKDB.

o CHECKSUM causes an error to be reported, it cannot correct the problem.

o CHECKSUM doesn’t do anything about in-memory corruptions or any corruption errors other than 823, 824, 825.

o CHECKSUM does NOT apply to/report problems with logical meta data, page linkages, stale reads, or index ordering.

· Avoid corruption

o Make sure your hardware is sound.

o Alert on hardware level errors and take action on them.

o Keep your firmware and storport drivers updated.

o Alert on 823, 824, and 825 errors. If you already alert on severity 19+ errors that will capture 823 and 824 errors but not 825.

o Run SQLIOSim to check that the IO subsystem is still properly handling data correctness.

· What should I do when there are errors?

o Check the status of your backups and start thinking about the consequences of restoring them.

o The REPAIR_ALLOW_DATA_LOSS option is your LAST resort. If you use it, you WILL lose data and you may very well never know how much or what was lost.

o Fix the underlying problem! If you see corruption once it will almost certainly happen again on the same system. The vast majority of the time the problem is with the hardware layer. Sometimes it is that the actual disk failed, sometimes it’s something less direct like failed batteries on the write controller combined with a power outage. It could potentially be firmware, a filter driver (Diskkeeper recently had a bug), or other problems. There was a rare bug (KB 954734) in SQL 2005 prior to SP3 (and SP3 is now the minimum supported version of 2005) where using the old DBREINDEX syntax instead of the new ALTER INDEX REBUILD/REORG syntax on a heap could corrupt the index and you would have to rebuild the index again. So you MUST find the source of the problem and increase monitoring on any system that has had a corruption problem.

o Understand the errors. Corruption in a non-clustered index can usually be fixed by rebuilding the index. Other types of errors can require a restore of a backup. Read this blog from the guy who wrote CHECKDB: http://www.sqlskills.com/BLOGS/PAUL/post/CHECKDB-From-Every-Angle-Tips-and-tricks-for-interpreting-CHECKDB-output.aspx

General info:

· Paul Randal’s collection of Corruption and CHECKDB blogs

o http://www.sqlskills.com/BLOGS/PAUL/category/Corruption.aspx

o http://www.sqlskills.com/BLOGS/PAUL/category/CHECKDB-From-Every-Angle.aspx

· Sample corrupt databases to play with

o http://www.sqlskills.com/BLOGS/PAUL/post/Sample-corrupt-databases-to-play-with.aspx

o http://www.sqlskills.com/BLOGS/PAUL/post/Conference-corruption-demo-scripts-and-example-corrupt-databases.aspx

o http://www.sqlskills.com/pastConferences.asp

· SQL Server: Is CHECKDB A Necessity?

http://blogs.msdn.com/psssql/archive/2009/02/20/sql-server-is-checkdb-a-necessity.aspx

· Transaction Log (block) CHECKSUM
http://sqlserverpedia.com/blog/sql-server-2005/transaction-log-block-checksum/

· Which part of ‘REPAIR_ALLOW_DATA_LOSS’ isn’t clear? http://blogs.msdn.com/sqlserverstorageengine/archive/2006/06/16/633645.aspx