SQL Server and My Disaster Recovery Experience

How true when they say, you learn best from experience.

Last month, I got a call from a client requesting an onsite visit to check why their SQL Server backups are failing. So I went and immediately ran a full database backup. A dreaded error message greeted me. The following is the error:

10 percent processed.
20 percent processed.
30 percent processed.
Msg 3271, Level 16, State 1, Line 1
A nonrecoverable I/O error occurred on file “K:\DATA\xxxxx.mdf:” 1117(The request could not be performed because of an I/O device error.).
Msg 3013, Level 16, State 1, Line 1
BACKUP DATABASE is terminating abnormally.

I felt a brief terror and the first step that I did in solving the situation was post the error message as a status on my Facebook to convey to my friends how my day is gearing up to be undesirable. How genius of me. Right? 🙂

Then I checked the last known good backup and found out that the server has not been backing up successfully for the last 10 days from when they reported the problem. The client never thought the problem would be critical as the applications are running smoothly uninterrupted since they can remember.

As I investigate the problem, it was concluded that one disk in one RAID 10 array is defective. Worst, the array was meant to store the MDF files. The mirror in the RAID 10 array doesn’t seem to kick in too. The other arrays where the LOGs, the TEMPDBs are located SEEM to be working fine. Hardware wasn’t my expertise so I never bothered to spend time on what went wrong with the hard disks and RAID controllers and decided to just focus on how get the data intact through other means since backing up is failing.

Luckily for me, the TRANSACTION LOGs are doing fine and can be copied. However, as I restored from the last known known good full backup and restored all the transaction log backups that were done right after, it was determined that TRANSACTION LOG backups have stopped too when SQL Server Agent stopped running few days prior. When I attempted to backup the transaction log, I got a similar I/O error. I then remember one of Kimberly Tripps’ PASS SUMMIT videos on disaster recovery. In the video, she demo’d how SQL Server handled when an entire disk volume/array went black. Then hope came on to me. Though no one encounters this every day, and probably no SQL Server professional has encountered this in his/her entire SQL Server career, I felt some degree of hope that the problem I am about to deal with is solvable. The mere fact that the applications depending on the database that can’t be backed up are running perfectly fine tells me something in SQL Server is allowing the database to stay and used online.

My first instincts based on my limited actual experience on disaster and data recovery were to do the following:

  • Never to turn off/restart the server nor restart the services, which I did
  • Ask the client if they can go offline while I resolve the issue, which they allowed so I turned off web server running the applications
  • Find a way to get a full backup of the current state of the database, which I failed
  • Find a way to copy all the associated files (MDFs, LDFS, etc) and having it restored on another SQL Server instance, which I also failed. The MDF file simply just can’t be copied using various tools.
  • Google for relevant info. There were lots but none was as specific to assist me on how to go about it one step at a time
  • Ask fellow SQL Server practitioners about it, but I never came to actually go about this during the whole exercise of data recovery
  • Watch again Kimberly Tripp’s video, which I did and gave me some indirect insights how SQL Server behaves when it is crippled.

What eventually allowed me to recover the data wasn’t accidental but a result of some sound practice in database deployment. There were two things that allowed for recovery:

  1. The LOG files saved the day for me as they happened to be stored on a separate container (disk array) since the day the server was first put up.
  2. SQL Server had ample memory (RAM). It has more than enough memory (RAM) to contain more than twice the size of the database concerned.

So what really happened? How did the storing of the TRANSACTION LOG files on a separate disk array allowed the system to stay good and recoverable? How did memory helped out.

  • SQL Server has the ability to keep data for as long as it can in memory. For read only queries, SQL Server will never bother to read data from data files on disk for as long as it can, provided data aren’t flushed out of memory (e.g. when a need for another memory is requested and there is not enough memory for SQL Server to grant the request another chunk of memory).
  • UPDATES are also reflected on memory and the transaction is recorded in the TRANSACTION LOGs when a database is in Full Recovery Mode. SQL Server does not need to update data files for as long as it can. Data are persisted when TRANSACTION LOGS are backed up in this mode.
  • Coincidentally (well not really as the client followed the advice to have ample memory based on earlier recommendations), the ample amount of memory the SQL Server has allowed the memory manager not to flush out data from memory to accommodate other requests for other data.
  • What appeared to be a problem though was READing from the disks. It appears SQL Server can write to the LOG files but can’t back them up, thus, it allowed SQL Server to continue updating data in memory and record the transactions in the log files allowing the applications to still working. I can do a SELECT * FROM table despite the disk read problems (for SQL Server is getting it from memory). I just can’t backup the database whether full or transactional.

What I eventually did to successfully recover 100% of the critical data are the following:

  • I restored the last known good backup sets (FULL and TRANSACTION LOGS) to another defect-free location
  • I compared the row counts and contents of tables (FROM MEMORY vs LAST KNOWN GOOD BACKUP). Turned out only 2 tables had differences and these tables are the most critical and frequently used ones.
  • I simply copied data from memory to another database located in a defect-free disk array.

That was it. I recovered everything I needed.

After I made sure everything was fully recovered and have the database installed on a new server and have the applications running for a few days, the defective server was decommissioned. Before laying it to rest, I played with it with fear of losing data and knowing it is still essentially an active ‘production’ server.

Here are some experiments I did:

  • Tried various file recovery tools in copying the almost 2GB MDF file to no avail. The attempt always ends up with an I/O error.
  • I tried updating the database, insert new records in the two critical tables and it worked flawlessly as if nothing has gone wrong. I was able to INSERT 100k records with no redflags. I was able to edit 5k records and delete 5k records with no clamor whatsoever.
  • I attempted to try FULL BACKUP one last time and the error is now different. It says:The operating system returned error 21(The device is not ready.) to SQL Server during a write at offset 0x0000000023c000 in file ‘ K:\DATA\xxxxx.mdf ‘. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online. (Microsoft SQL Server, Error: 823)Apparently, the array can’t be accessed anymore. Maybe I had subjected it with added stress it gave out. But HEY, I can still access the database and do some updates.
  • Lastly, I rebooted the ‘newly decommissioned’ database server and it just won’t start anymore.

Hope I don’t get to experience this again until I retire from doing SQL Server 😀

Toto Gamboa is a consultant specializing on databases, Microsoft SQL Server and software development operating in the Philippines. He is currently a member and one of the leaders of Philippine SQL Server Users Group, a Professional Association for SQL Server (PASS) chapter and is one of Microsoft’s MVP for SQL Server in the Philippines. You may reach him by sending an email to totogamboa@gmail.com