Postmortem of a catastrophic RAID failure

Wednesday of last week, I came home to find my three new 1TB hard disks waiting for me, destined to upgrade our ReadyNAS NV+.

The NAS being a hot-plug-online-upgradable-all-singing-all-dancing sort of widget, I followed the recommended upgrade procedure: popped out one of the current 500GB drives, waited a few seconds, slotted in one of the new 1TB replacements, waited until it started resynchronizing the volume, and went down to make dinner.

And spent the next several days picking up the pieces…

One critical bit of background – the NAS had three disks in a single RAID-5 volume. RAID-5 can tolerate one disk failure without data loss, but if two disks fail (regardless of the number of disks in the volume), kiss your data goodbye.
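
For anyone who hasn’t internalized why one failure is survivable but two aren’t: RAID-5 keeps an XOR parity block alongside the data, so any single missing block can be recomputed from the blocks that remain. A toy sketch in Python – purely illustrative, not how the ReadyNAS actually lays things out (a real array stripes data and rotates parity across all the disks):

    # Toy illustration of RAID-5's single-failure tolerance via XOR parity.
    def xor_blocks(*blocks: bytes) -> bytes:
        """XOR equal-length blocks together, byte by byte."""
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                out[i] ^= b
        return bytes(out)

    # Data blocks living on disks 1 and 2; disk 3 holds their parity.
    d1 = b"photos.."
    d2 = b"music..."
    parity = xor_blocks(d1, d2)

    # One disk (say disk 1) dies: XOR of the survivors rebuilds its block.
    assert xor_blocks(d2, parity) == d1

    # Two disks die: only one block is left, the XOR relation has two
    # unknowns, and there is nothing left to reconstruct from.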

When I went back upstairs after dinner to check on progress, I discovered that the NAS had locked up and completely dropped off the network. It wouldn’t serve its web management UI and wasn’t responding to pings.

Hesitantly, I power-cycled it. It started booting, and hung about a quarter of the way through checking the volume.

After several reboot attempts, all of which locked up at the same place, I applied a bit of coercion and convinced the box to boot. I checked the system logs and found nothing telling, removed and re-seated the new 1TB drive, and watched it start the resync again.

A couple hours later, sync still proceeding, I went to bed.

And woke the next morning to find the unit again fallen off the network.

Buried in the log messages – which I’d left scrolling past overnight – was a message that disk 2 was issuing SMART warnings about having to relocate failing sectors.

In other words, one disk of the three was being rebuilt while another one was busy dying.

At this point it became a race – would the rebuild complete (leaving me with two good disks and intact data) before the failing one died completely?

To buy some insurance, I shut down the NAS, transplanted the failing drive into a spare PC, and started a disk-to-disk copy of its data onto the working 500GB disk I had removed at the start of this mounting disaster.

Despite valiant attempts by both dd_rescue and myrescue, the disk was dying faster than data could be retrieved, and after a day and a half of effort, I had to face the fact that I wasn’t going to be able to save it.
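
What those rescue tools do, stripped to the essence, is copy the disk block by block and keep going past unreadable regions instead of aborting the way plain dd would. A rough Python sketch of the idea – the paths are made up, and the real tools are far smarter about retry order, shrinking block sizes near bad areas, and resumable logs:

    BLOCK = 64 * 1024  # copy granularity; rescue tools shrink this near bad spots

    def rescue_copy(src_path: str, dst_path: str) -> None:
        bad = 0
        offset = 0
        with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
            while True:
                try:
                    src.seek(offset)
                    chunk = src.read(BLOCK)
                except OSError:
                    chunk = None  # read error: give up on this block
                if chunk == b"":
                    break  # reached the end of the device
                if chunk is None:
                    dst.write(b"\x00" * BLOCK)  # fill the hole and move on
                    bad += 1
                    offset += BLOCK
                else:
                    dst.write(chunk)
                    offset += len(chunk)
        print(f"done; {bad} unreadable block(s) skipped")

    # rescue_copy("/dev/sdb", "/mnt/spare/failing-disk.img")  # hypothetical paths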

Fortunately, I had set up off-site backups using CrashPlan, so I had Vince bring my backup drive to work and retrieved it from him on Friday.

Saturday was spent restoring our photos, music, and email (more on that later) from the backup.

Unfortunately, although CrashPlan claimed to have been backing up Dawnise’s inbox, it was nowhere to be found in the backup set, and the most recent “hand-made” backup I could find was almost exactly a year old (from her PC-to-Mac conversion). Losing a year of email is better than losing everything, but that seems like meager consolation under the circumstances.

By Saturday night I had things mostly back to rights, and had a chance to reflect on what had gone wrong.

The highlights:

1. SMART, as Google discovered (and published), is a terrible predictor of failure. The drive that failed (and is being RMA’d under warranty, for all the good it’ll do me) had never issued a SMART error before catastrophically failing.

2. In retrospect, I should have rebooted the NAS and done a full volume scan before starting the upgrade. That might have put enough load on the failing drive to make it show itself before I had made the critical and irreversible decision to remove a drive from the array.

3. Because it doesn’t do disk scrubbing (a process whereby the system periodically touches every bit of every hard disk), the ReadyNAS can’t detect failing drives early.

4. While I had done test restores during my evaluation of CrashPlan, I had never actually done a test restore to Dawnise’s Mac. Had I done so, I might have discovered the missing files and been able to avoid losing data.

I have a support ticket open with the CrashPlan folks, as it seems there’s a bug of some kind here. At the very least, I would have expected a warning from CrashPlan that it was unable to back up all the files in its backup set.

5. In my effort to be frugal, I bought a 500GB external drive to use as my remote backup destination – the sweet spot in the capacity/cost curve at the time.

Since I had more than 500GB of data, I had to pick and choose what I did and didn’t back up. My choices were OK, but not perfect. Some data that should have been in the backup set wasn’t, due to space limitations, and is now lost.

6. CrashPlan worked well – but not flawlessly – and without it, I’d have been in a world of hurt. Having an off-site backup meant I didn’t lose my 20GB of digital photos or several hundred GB of ripped music.

Aside from digital purchases, the bulk of the music would have been recoverable from the source CDs, but at a great cost in time. The photos would have just been lost.

7. In this case, the off-site aspect of CrashPlan wasn’t critical, but it’s easy to imagine a scenario where it would have been.

8. The belief that RAID improves your chances of retaining data is built largely on what I’m henceforth going to refer to as “the RAID fallacy” – that failure modes of the drives in the array are completely independent events. The reality is that many (most?) RAID arrays are populated with near-identical drives. Same manufacturer, same capacity (and model), and often the same or very similar vintage. So the drives age together under similar workloads, and any inherent defect (like, say, a firmware bug that causes the drives not to POST reliably) is likely to affect multiple drives, which spells disaster for the volume.
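
To put some rough numbers behind the fallacy: if you accept the independence assumption and plug in an illustrative 3% annual failure rate per drive (my number, purely for illustration), the odds of losing a second drive during a half-day rebuild look reassuringly tiny – which is exactly why the assumption is so seductive:

    # Naive "independent failures" arithmetic, with illustrative numbers.
    annual_failure_rate = 0.03   # assumed 3% chance a given drive dies in a year
    rebuild_hours = 12           # assumed length of the rebuild window
    hours_per_year = 365 * 24

    # Chance one specific surviving drive dies during the rebuild,
    # spreading its failure probability evenly across the year.
    p_one = annual_failure_rate * rebuild_hours / hours_per_year

    # Two survivors in a three-disk RAID-5: either one dying kills the volume.
    p_either = 1 - (1 - p_one) ** 2

    print(f"{p_either:.4%}")  # ~0.0082% -- comfortingly small, if you believe it

    # In reality the survivors are same-make, same-vintage drives under the same
    # rebuild stress, so their failures are anything but independent and the
    # real risk is far higher than this arithmetic suggests.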