School of Hard (Drive) Knocks

One of the problems with having a Windows Home Server with 12 hard drives is that hard drives do fail, and there are a dozen chances for that to happen. Add to that the 18TB of data those drives can hold and things are further complicated. The odds are not in my favor and it’s only a matter of time. I’d been thinking about that recently and had just begun looking at ways to monitor drive health when I came across some health problems while testing out some tools.

I installed a WHS add-in called Home Server Smart that shows the SMART stats from the hard drives. Sure enough, there were a couple bad or pending bad sectors on a couple drives. But all drives have bad sectors and manufacturers plan for it and re-allocate the sectors to some spares. I’ll monitor those drives and if the bad sectors increase I’ll act. But there was one drive with 160 bad sectors. And sure enough, checking the system log showed this drive had a bad block.
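The "monitor and act if the count increases" rule is simple enough to sketch in code. This is only an illustration in Python, not how Home Server Smart works: the drive names, the snapshot dictionaries, and the limit of 100 are all assumptions made up for the example. Only the 160 and 171 counts come from the problem drive.

```python
# Hypothetical snapshots: drive name -> bad (reallocated + pending) sector count.
# Only the 160 -> 171 figures come from the post; the rest is invented.

def drives_needing_attention(previous, current, hard_limit=100):
    """Return (drive, reason) pairs for drives that grew or passed hard_limit.

    hard_limit is an assumed threshold, not a manufacturer figure.
    """
    flagged = []
    for drive, count in sorted(current.items()):
        before = previous.get(drive, 0)
        if count >= hard_limit:
            flagged.append((drive, f"{count} bad sectors (limit {hard_limit})"))
        elif count > before:
            flagged.append((drive, f"grew from {before} to {count}"))
    return flagged


earlier = {"DRIVE-A": 2, "DRIVE-B": 1, "DRIVE-C": 160}
today = {"DRIVE-A": 2, "DRIVE-B": 3, "DRIVE-C": 171}

for drive, reason in drives_needing_attention(earlier, today):
    print(drive, reason)
```

A drive that holds steady at a couple of bad sectors stays off the list. One that grows, or is already far past the limit, gets flagged.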

Here’s the Home Server Smart screen from today, which is up to 171 bad sectors this morning:

[Screenshot: Home Server Smart]

Now, in retrospect the right thing to do would have been to remove this disk immediately and get a replacement. But I figured I’d be smart. It was time to get the hard drives under control and test them all. So I pulled out a spare 1TB drive and began running SpinRite at level 4 on it. For 1 TB, this would take an estimated 2 days to complete the testing. My plan was to replace one of the external 2TB drives first and then run SpinRite on that 2TB drive before using it to replace the problem drive. Naturally the problem drive was internal and inside a drive cage that would require removal to get at the drive. So the fewer times into the machine the better. But that assumed the drive lasted. I figured the problem wasn’t new, just newly noticed.

That was a couple days ago and I’ve been monitoring the drive since then. There weren’t any new bad sectors the first couple of days. I felt confident because I had file duplication on, so one drive failure wouldn’t lose data. I also had full recent backups of everything. I replaced the 2TB external drive and began running SpinRite on the freed up 2TB drive. I was almost home free.

Then last night things began to go very, very wrong. Streaming video or file copies would stop for no apparent reason. No heavy disk usage, no heavy CPU load, no high memory usage. I went into Home Server Smart again and the bad sectors were up to 176. So I immediately began the drive removal process on that drive.

It was slow. Very slow. Painfully slow. 1GB per hour slow. I didn’t have 1,400+ hours for it to go through the drive removal process. But it was late so I decided to give it the night to see if it improved. I woke up to find it had made no real progress. So I used the Shutdown command (since I couldn’t use the WHS console to do it), powered off the server, and powered it back on. I figured file duplication would save me from losing files, although my fear was some corrupt data that would go unnoticed for months.
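The math behind that "1,400+ hours" is worth spelling out. Here is a quick sketch in Python. The 1,400GB figure is inferred from the 1GB per hour rate and the 1,400+ hour estimate rather than measured, and the function name is just something made up for the example.

```python
def removal_eta_hours(data_gb, rate_gb_per_hour):
    """Estimate hours needed to move data_gb off a drive at a steady rate."""
    if rate_gb_per_hour <= 0:
        raise ValueError("rate must be positive")
    return data_gb / rate_gb_per_hour


# Assumed: roughly 1,400GB on the drive, moving at about 1GB per hour.
hours = removal_eta_hours(1400, 1.0)
days = hours / 24
print(f"{hours:.0f} hours, about {days:.0f} days")
```

At that rate the removal would run for roughly two months, which is why waiting it out was never a real option.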

After the startup I began the drive removal again. After giving it the day while I was at work it was faster, but not by much. At the current rate it would take another 6 days to remove the drive. So I again shut it down. Then I went inside and pulled the drive cable so the drive would go missing.

When the server came back up it told me the drive was missing, so it got that part right. I started the drive removal process again. It’s been running several hours now. The progress bar has moved, but since there’s no actual data I can’t tell how far along it is. But perfmon does show some heavy read/write drive activity that looks like file copies. So I’m going to be patient. With the drive gone I now have some files that aren’t duplicated. I figure it’s duplicating or verifying those files. A normal drive removal would last overnight for this much data, so I figure I need to wait until morning at the least. While I have them backed up, I’d hate to have to figure out which ones were lost if they get corrupted, so I’ll let WHS do its thing for a while.

Lessons Learned (the important ones):

  1. Test all drives before installing them. The problem drive is one of the new ones. Whether SpinRite would have caught the problem or not is unknown since it took a couple weeks to manifest itself, but as of now any new drive gets SpinRite Level 5 before it goes into the server (or my PCs for that matter). I’d been lazy and impatient in the past. I’d go through the drive removal process the night before the new drive was due to arrive and slap the new one in as soon as it arrived. No more.
  2. Write down the drive serial numbers (and the specific model numbers). Somewhere along the line my drive mappings got messed up. When I thought I was removing the drive from external bay 4 I was actually removing the one from bay 3. So when I rebooted I got a drive missing error. Luckily that was easily fixed by popping the drive back in. Still, it’s easily avoidable as the Disk Management add-in I use shows drive model and serial numbers (at least for most drives).
  3. Having Windows Home Server offline for drive removals really sucks. Timing-wise this wasn’t too bad because I haven’t needed the files on it (although if it were online I’d be streaming video now). But the server has basically been offline since Sunday night.
  4. Hard drives hate me at the moment. A 2TB drive I ordered for another project, but was going to use as a replacement, arrived DOA on Thursday and its replacement didn’t arrive until today. I can’t remember the last time I got a drive that was literally DOA and wouldn’t even spin up when I pulled it out of the box. Hopefully it’s just one of those things and not a manufacturing issue with a batch of them. The replacement is still in its unopened antistatic bag.
  5. If a hard drive seems to be going bad, pull it immediately. Either replace it or run diagnostics on it. I left the drive in because I figured the bad sectors weren’t new. I figured I’d “be smart” and minimize the effort and time spent opening up the server case. I waited until there were new bad sectors, but by then I was already in trouble.
  6. It’s not a new lesson but it reinforces my current beliefs. Windows Home Server file duplication is a good thing. Backups are a good thing. File duplication is not a backup.
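For lesson 2, even a tiny written inventory would have caught the bay 3 / bay 4 mix-up. Below is a minimal sketch in Python of what such a record could look like. The model names and serial numbers are placeholders invented for the example, and `bay_for_serial` is a hypothetical helper, not part of any WHS add-in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    bay: str
    model: str
    serial: str


# Placeholder entries; real values would come from the drive labels.
INVENTORY = [
    Drive("external bay 3", "EXAMPLE-2TB", "SN-0003"),
    Drive("external bay 4", "EXAMPLE-2TB", "SN-0004"),
]


def bay_for_serial(inventory, serial):
    """Return the bay holding the drive with this serial, or raise LookupError."""
    matches = [d.bay for d in inventory if d.serial == serial]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one drive with serial {serial!r}, found {len(matches)}"
        )
    return matches[0]


# Look up the bay by the serial the server reports before pulling anything.
print(bay_for_serial(INVENTORY, "SN-0004"))
```

The point is to go from the serial number the server reports to a physical bay, instead of trusting that the bay numbering still matches.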

For more information about the Home Server Smart add-in you can see the reviews at HomeServerLand or We Got Served. No sense in me repeating their reviews. I’ll just add my endorsement of the add-in. It’s a simple but well designed add-in that does its thing without getting in the way. The add-in is free but donations are accepted. I threw a small donation their way to encourage these types of add-ins.

Read the conclusion to my hard drive problems here.