School of Hard (Drive) Knocks – Part 2

At the beginning of the month I wrote about my Windows Home Server hard drive problems, and at the time I was waiting for WHS to finish removing the bad drive from its storage pool. When I posted the article I was letting Windows Home Server run through the drive removal process while I slept. I had pulled the plug on the bad drive since it brought the server to its knees, and I was now trying to remove the traces of it from the pool.

When I woke up in the morning the console had crashed and wouldn’t start up. Not good, but after a quick reboot the console started. The bad drive was still there and still part of the pool. There were still some file conflicts reported, but with the missing drive that was to be expected. I started up the drive removal wizard again and went off to work.

The console had again gone away by the time I got home. This time it started and the drive was still there. I decided to reboot the server and give the removal wizard one more try since I didn’t have time to do much else. This time when I got back to it the drive had been successfully removed and there weren’t any remaining file conflicts. I gave the WHS one final reboot just to be sure all was well. And it was.

The truth is, I wouldn’t have had the patience to run the wizard three times except for the fact that it was either that or do nothing while I slept and worked. Timing is everything, although in the future I’ll be sure to show some patience.

A few days later one of my 2TB drives finished its pass through a SpinRite level 5 check, so I brought the server down again and replaced the drive. Since it was a drive in the internal cage, the server needed to be pulled apart to get to it. Pulling the server apart was a bit tedious, but it went without a hitch and the server was happy with the new drive.

As it turns out, that was one of three Western Digital 2TB drives manufactured in October of 2009 that went bad on me. There was the previously mentioned one that arrived DOA. At the time of my previous post that DOA drive’s replacement had just arrived. While it did spin up, it failed Western Digital’s own testing after being a bit flaky in my test PC, so back it went. When I pulled out this bad drive I saw it was manufactured the same month. I have a second drive purchased at the same time that I’ll be pulling to see when it was manufactured and to run through a SpinRite test. While it only showed 4 pending bad sectors (compared to hundreds on its mate), I have already removed it from the storage pool.

Unfortunately it didn’t hit me that the bad drive was a recent purchase, and my delay in pulling it meant I missed the window for a hassle-free return to Amazon. So when I get a chance I’ll run Western Digital’s diagnostics on it and if it fails I’ll RMA it; otherwise it will get the SpinRite treatment.

As for the server, I decided I only wanted to make one more trip into that internal drive cage, so it means replacing two 2TB drives at once. I didn’t have the free space to remove both drives from the storage pool, so I turned off file duplication on my video files to free up the space. Once that was done I removed both drives from the storage pool. One of those was probably manufactured in the same batch as my three bad drives. I’ve got two 2TB drives going through SpinRite level 5 checks now, and if the time estimate is correct I should be able to replace the drives on Saturday and turn file duplication back on.

I have five more drives showing a couple of bad or pending bad sectors. A couple isn’t necessarily bad if the number doesn’t increase, but I still plan to work my way through them all with a SpinRite level 5 check. Unfortunately level 5 is very time-consuming. If I use my test PC (which is slow) it takes about 10 days to do a 2TB drive.
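To get a rough feel for how long that cycle could take, here is a small Python sketch of the arithmetic. It rests on two assumptions that are not stated above: that all five remaining drives are 2TB, and that test time scales linearly with capacity. The function name is just for illustration.

```python
def sequential_test_days(drive_sizes_tb, days_per_tb):
    """Total days to test drives one after another on a single machine."""
    return sum(size * days_per_tb for size in drive_sizes_tb)

# From the post: about 10 days for one 2TB drive on the slow test PC.
days_per_tb = 10 / 2

# ASSUMPTION: all five remaining drives are 2TB; the post does not say.
remaining_drives_tb = [2, 2, 2, 2, 2]

total_days = sequential_test_days(remaining_drives_tb, days_per_tb)
print(f"About {total_days:.0f} days of testing, one drive at a time")
```

Under those assumptions the one-at-a-time approach works out to roughly 50 days of machine time.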

By this weekend my Windows Home Server should be back to full strength and I can use a spare drive to cycle through the testing of the remaining drives. It’ll take a while, but since the drive testing doesn’t require me to do anything more than swap a disk and bang a few keys every few days it shouldn’t be too bad. What’s the saying? “An ounce of prevention…”

School of Hard (Drive) Knocks

One of the problems with having a Windows Home Server with 12 hard drives is that hard drives do fail and there are a dozen chances for that to happen. Add to that the 18TB of data those drives can hold and things are further complicated. The odds are not in my favor and it’s only a matter of time. I’d been thinking about that recently and had just begun looking at ways to monitor the drive health when I came across some health problems while testing out some tools.

I installed a WHS add-in called Home Server Smart that shows the SMART stats from the hard drives. Sure enough, there were a couple bad or pending bad sectors on a couple drives. But all drives have bad sectors and manufacturers plan for it and re-allocate the sectors to some spares. I’ll monitor those drives and if the bad sectors increase I’ll act. But there was one drive with 160 bad sectors. And sure enough, checking the system log showed this drive had a bad block.

Here’s the Home Server Smart screen from today, which is up to 171 bad sectors this morning:

[Screenshot: Home Server Smart]
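The "monitor it and act if the count increases" rule can be sketched in a few lines of Python. This is only an illustration: the drive labels, the snapshot format, and the limit of 100 are assumptions made up for the example, not anything Home Server Smart exposes. Only the 160 and 171 counts come from the post.

```python
def drives_needing_attention(previous, current, hard_limit=100):
    """Return drives whose bad-sector count grew or sits at/above hard_limit."""
    flagged = {}
    for drive, count in current.items():
        before = previous.get(drive, 0)
        if count > before:
            flagged[drive] = f"grew from {before} to {count}"
        elif count >= hard_limit:
            flagged[drive] = f"steady at {count}, at or above limit {hard_limit}"
    return flagged

# Hypothetical snapshots of bad/pending sector counts, taken on different days.
previous = {"disk-a": 2, "disk-b": 4, "disk-c": 160}
current = {"disk-a": 2, "disk-b": 4, "disk-c": 171}

for drive, reason in drives_needing_attention(previous, current).items():
    print(drive, reason)
```

A drive with a small, stable count passes quietly, while any growth gets flagged, which matches the idea that a couple of bad sectors are tolerable only as long as the number holds still.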

Now, in retrospect the right thing to do would have been to remove this disk immediately and get a replacement. But I figured I’d be smart. It was time to get the hard drives under control and test them all. So I pulled out a spare 1TB drive and began running SpinRite at level 4 on it. For 1 TB, this would take an estimated 2 days to complete the testing. My plan was to replace one of the external 2TB drives first and then run SpinRite on that 2TB drive before using it to replace the problem drive. Naturally the problem drive was internal and inside a drive cage that would require removal to get at the drive. So the fewer times into the machine the better. But that assumed the drive lasted. I figured the problem wasn’t new, just newly noticed.

That was a couple days ago and I was monitoring the drive since then. There weren’t any new bad sectors the first couple of days. I felt confident because I had file duplication on, so one drive failure wouldn’t lose data. I also had full recent backups of everything. I replaced the 2TB external drive and began running SpinRite on the freed up 2TB drive. I was almost home free.

Then last night things began to go very, very wrong. Streaming video or file copies would stop for no apparent reason. No heavy disk usage, no heavy CPU load, no high memory usage. Into Home Server Smart again and the bad sectors are up to 176. So I immediately began the drive removal process on that drive.

It was slow. Very slow. Painfully slow. 1GB per hour slow. I didn’t have 1,400+ hours for it to go through the drive removal process. But it was late so I decided to give it the night to see if it improved. I woke up to find it had made no real progress. So I used the Shutdown command (since I couldn’t use the WHS console to do it), powered off the server, and powered it back on. I figured drive duplication would save me from losing files, although my fear was some corrupt data that would go unnoticed for months.
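To put that rate in perspective, here is the back-of-the-envelope arithmetic as a small Python sketch. The 1,400 GB figure is an assumption inferred from the "1,400+ hours" estimate at roughly 1GB per hour, and the function name is only for illustration.

```python
def hours_to_move(data_gb, rate_gb_per_hour):
    """Estimate the hours needed to move data_gb at a steady rate."""
    if rate_gb_per_hour <= 0:
        raise ValueError("rate must be positive")
    return data_gb / rate_gb_per_hour

# ASSUMPTION: roughly 1,400 GB to move off the drive, at about 1 GB per hour.
hours = hours_to_move(1400, 1.0)
print(f"{hours:.0f} hours, about {hours / 24:.0f} days")
```

At that pace the removal would need close to two months of uninterrupted running.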

After the startup I began the drive removal again. After giving it the day while I was at work it was faster, but not by much. At the current rate it would take another 6 days to remove the drive. So I again shut it down. Then I went inside and pulled the drive cable so it would go missing.

When the server came back up it told me the drive was missing, so it got that part right. I started the drive removal process again. It’s been running several hours now. The progress bar has moved, but since it gives no actual numbers I can’t tell how far along it is. Perfmon does show some heavy read/write drive activity that looks like file copies, so I’m going to be patient. With the drive gone I now have some files that aren’t duplicated. I figure it’s duplicating or verifying those files. A normal drive removal would last overnight for this much data, so I figure I need to wait until morning at the least. While I have them backed up, I’d hate to have to figure out which ones were lost if they get corrupted, so I’ll let WHS do its thing for a while.

Lessons Learned (the important ones):

  1. Test all drives before installing them. The problem drive is one of the new ones. Whether SpinRite would have caught the problem or not is unknown since it took a couple weeks to manifest itself, but as of now any new drive gets SpinRite Level 5 before it goes into the server (or my PCs for that matter). I’d been lazy and impatient in the past. I’d go through the drive removal process the night before the new drive was due to arrive and slap the new one in as soon as it arrived. No more.
  2. Write down the drive serial numbers (and the specific model numbers). Somewhere along the line my drive mappings got messed up. When I thought I was removing the drive from external bay 4 I was actually removing the one from bay 3. So when I rebooted I got a drive missing error. Luckily that was easily fixed by popping the drive back in. Still, it’s easily avoidable as the Disk Management add-in I use shows drive model and serial numbers (at least for most drives).
  3. Having Windows Home Server offline for drive removals really sucks. Timing-wise this wasn’t too bad because I haven’t needed the files on it (although if it was online I’d be streaming video now). But the server has basically been offline since Sunday night.
  4. Hard drives hate me at the moment. A 2TB drive I ordered for another project, but was going to use as a replacement, arrived DOA on Thursday and its replacement didn’t arrive until today. I can’t remember the last time I got a drive that was literally DOA and wouldn’t even spin up when I pulled it out of the box. Hopefully it’s just one of those things and not a manufacturing issue with a batch of them. The replacement is still in its unopened antistatic bag.
  5. If a hard drive seems to be going bad, pull it immediately. Either replace it or run diagnostics on it. I left the drive in because I figured the bad sectors weren’t new. I figured I’d “be smart” and minimize the effort and time opening up the server case. I waited until there were new bad sectors but by then I was already in trouble.
  6. It’s not a new lesson but it reinforces my current beliefs. Windows Home Server file duplication is a good thing. Backups are a good thing. File duplication is not a backup.
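Lesson 2 is easy to turn into a habit with a tiny inventory record that gets checked before any drive comes out. The sketch below is hypothetical: the bay names, model strings, and serial numbers are made-up placeholders, and nothing here reads real data from the server or the Disk Management add-in.

```python
# Hypothetical inventory: bay label -> (model, serial). All values are placeholders.
INVENTORY = {
    "external-bay-3": ("EXAMPLE-2TB", "SERIAL-0003"),
    "external-bay-4": ("EXAMPLE-2TB", "SERIAL-0004"),
}

def bay_for_serial(serial, inventory=INVENTORY):
    """Return the bay recorded for this serial number, or None if unknown."""
    for bay, (_model, drive_serial) in inventory.items():
        if drive_serial == serial:
            return bay
    return None

def confirm_pull(bay, serial_on_label, inventory=INVENTORY):
    """True only if the serial printed on the drive matches the record for that bay."""
    record = inventory.get(bay)
    return record is not None and record[1] == serial_on_label

# About to pull bay 4, but the drive in hand carries bay 3's serial: stop and recheck.
print(confirm_pull("external-bay-4", "SERIAL-0003"))
print(bay_for_serial("SERIAL-0003"))
```

The point of the second function is to make the mismatch visible before the reboot rather than after it.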

For more information about the Home Server Smart add-in you can see the reviews at HomeServerLand or We Got Served. No sense me repeating their reviews. I’ll just add my endorsement of the plug-in. It’s a simple but well designed plug-in that does its thing without getting in the way. The plug-in is free but donations are accepted. I threw a small donation their way to encourage these types of add-ins.

Read the conclusion to my hard drive problems here.