One of the scariest moments for a DBA is to see a database in the "Suspect" mode or in the "Recovery Pending" mode...
First, let's understand the difference between these two database modes...
The "Recovery Pending" mode was added in SQL Server 2008, so before, there was only the "Suspect" mode. Generally speaking, when a database was "Suspect" in SQL Server 2005 or earlier version, it meant that the database couldn't be brought online, and thus, it wasn't available for user connections.
Starting with SQL Server 2008, we have two modes - "Suspect" and "Recovery Pending". The "Recovery Pending" mode means that the database can't start the crash recovery process. This is either because SQL Server cannot access the boot page in the primary data file in order to determine whether a crash recovery is needed, or it determined that crash recovery is needed, but it couldn't start the recovery process (for example, because the transaction log file is missing or corrupted).
The "Suspect" mode means that the database started the recovery process, but it failed to complete. For example, this can happen if the transaction log file becomes inaccessible during a transaction rollback operation.
Although it's important to understand the difference between the two modes, usually, the steps to fix the problem are the same. In order to keep it simple, I will list here the actions you should take regardless of the database mode ("Recovery Pending" or "Suspect").
The "Recovery Pending" mode means that the database can't start the crash recovery process
So what do you do when (sh)it happens?
1. Don't panic! I know it's easier said than done, but believe me - this is one of the most important tips I can give you. Because you might do something that will only make things worse, for example, when a database is in the "Suspect" mode, you still might be able to access the database and extract data from it. But if you try to restart SQL Server (because you're under pressure), then you might not be able to access the database anymore. So before you do anything - breathe, drink a glass of water, and then start planning your actions. OK? Are you breathing? Good. Then you can now continue to step #2...
2. Take a backup! Yes, I know you can't really perform a SQL Server backup at this point. But you should try to take some sort of backup of the database files as soon as you can. Sometimes, depending on the state of the database and the database files, SQL Server might not be holding any locks on the files. In this case, copy all the database files to a different location, preferably a separate storage. If this is not possible, then take a storage snapshot or use whatever 3rd-party tool you like, which can take a copy of the database files. Very important - make sure that the tool you are using only tries to read the files and doesn't try to change them in any way
3. Restore from backups. Hopefully, you should have backups available, which correspond to the business RPO and RTO. What's that? You don't know what RPO and RTO mean? Oops... So RPO stands for Recovery Point Objective, and it corresponds to the amount of data the business is willing to lose in case of a failure (like you are facing now). RTO stands for the amount of time it would take to bring the system back to life. In other words, the amount of downtime. If you have a backup and recovery strategy that supports the business RPO and RTO, then you're all good. There's no point in wasting time trying to recover data from the corrupted files. Just execute your recovery policy, and restore from backups.
4. Or you can try to recover data without backups. This is risky, and it can also be time-consuming. Why would you want to do that? Several reasons... First, you might not have backups at all (oh boy). Or maybe you have backups, but you haven't tested them for a while, and they turn out to be corrupted or otherwise unusable (oh boy). Maybe you have valid backups, but it would take 5 hours to restore everything from scratch. Even if it's acceptable according to the business RTO, you might want to try to recover the database using other methods, hoping you can do it in less than 5 hours and with an acceptable amount of data loss. You should be careful because this can be time-consuming, and if you give up after 2 hours and only then start the 5-hours restore process, then, well, it's 7 hours of downtime. What you should do is start the restore from backups ASAP to a new database with a new name on the same instance. Then, while the restore is running, you can start playing with the corrupted database. If you manage to bring it online in less than 5 hours, then you can drop the restored database. If not, then at least you have a fresh new database after 5 hours. Don't forget to rename the new database as well as its database files to match the original database.
Picture credit: Lunde Studio
In the next sections, I will describe the steps to (try to) recover data from the corrupted database without backups. I hope you're ready for that...
5. Put the database in emergency mode. This is the first thing you should try. Because otherwise, you can't access or use the database.
If you managed to put the database in emergency mode, then maybe things are not as bad as you thought (but just maybe). You can continue to the next step. If you get an error, then good luck (you'll need it). Go to step #9.
6. Run DBCC CHECKDB. It might take a while, and it will return all the corrupted pages in the database. I prefer to run it with NO_INFOMSGS, so that it only outputs errors, like this:
You can then examine the page numbers and the error messages associated with them. There can be all kinds of errors, and going into the details of each one is beyond the scope of this article. DBCC CHECKDB will also tell you the minimum repair level for the errors found. Here is an example:
Msg 8939, Level 16, State 98, Line 6
Table error: Object ID 1205579333, index ID 0, partition ID 72057594043105280, alloc unit ID 72057594049527808 (type In-row data), page (1:323). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 133129 and -4.
Msg 8928, Level 16, State 1, Line 6
Object ID 1205579333, index ID 0, partition ID 72057594043105280, alloc unit ID 72057594049527808 (type In-row data): Page (1:323) could not be processed. See other errors for details.
CHECKDB found 0 allocation errors and 2 consistency errors in table 'TableWithCorruption' (object ID 1205579333).
CHECKDB found 0 allocation errors and 2 consistency errors in database 'MissionCriticalProductionDB'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (MissionCriticalProductionDB).
This doesn't necessarily mean you should run repair with data loss. If, for example, all the corrupted pages belong to a non-clustered index, then you can just drop it and create it again, and you're done without any data loss. You can use DBCC PAGE to examine the contents of specific pages reported by DBCC CHECKDB, but if the page is heavily corrupted, you might not be able to do it.
7. Bring the database online. After you're done repairing and/or rebuilding indexes, and you have a clean DBCC CHECKDB run, you should be able to bring your database online. Hopefully, you managed to lose only an acceptable amount of data. Sometimes, you might not get so lucky. In this case, if you have backups, this might be a good time to use them. You can also try to restore the corrupted database (see step #2), and start all over again. But otherwise, this is the end of the road for you.
8. Extract as much data as you can. If the database is in emergency mode, but DBCC CHECKDB fails to run, then you can still try to extract data form the database. First, create a new database. Then, use a script to create all the database objects. You can try to generate a script from the corrupted database, but that might fail because of the corruption. If it fails, then hopefully you have a backup or some other source you can generate the script from. If not, then prepare for a long night of typing CREATE statements. Then, copy data between tables using INSERT-SELECT statements. This can be tricky, and you might need to filter the data or use index hints in order to be able to extract some data.
9. Try a Hack-Attach. If you are here, then it means you couldn't put the database in emergency mode. It also means you can't access the database or do anything at this point. You might be able to fool SQL Server by using a hack-attach. The idea is to set the corrupted database offline in order to release the locks on the database files (or shut down SQL Server if you have to), then create a new and healthy dummy database, set it offline, delete its files, replace them with the files of the corrupted database, and then set the dummy database online again. This might fool SQL Server to think that the database is healthy, and let you in. Sometimes it works, sometimes it doesn't. Paul Randal explains this method so well here. If you managed to pull this trick, then continue to step #5.
10. Fix a broken boot page. If even a hack-attach doesn't work, so there is still one last thing you can try. Page #9 in the mdf file is called the boot page, and it contains important information about the database needed during recovery. If the problem is that the boot page is corrupted, then this might be the reason all of your efforts haven't succeeded so far. If this is the case, then you can try to fix the broken boot page. Again, I will let Paul Randal explain how to do it. He does that much better than me.
11. Try a 3rd-party recovery tool. If you are here, then you really tried everything, but still no luck. Before you give up, you might be able to recover some data with a 3rd-party database recovery tool. There are many recovery tools in the market. Some are better than others, some are more expensive than others. Most of them offer a trial version you can use in order to see if it can extract anything from the corrupted database. One tip I can give you is that you should seek a tool that specializes in the data platform you are dealing with. So, in our case, look for a SQL Server recovery tool. From my experience, in some cases, when nothing else worked, such a tool managed to extract at least some data from the database.
Good luck!
This blog was initially published at our Linkedin blog
Comments