Not All Movement Is Progress

I was having a conversation with a friend recently. We both work in “Big Tech.” Both our employers are sensitive to employees making even remotely public statements – and his in particular is pretty notorious for overreacting. So in an abundance of caution I’ll point out that this rambling represents neither of our companies – it’s just a conversation between two people who’ve been in tech “a fair while.”

We were talking about the most recent operating system released by a Big Tech company that runs on their fruit-themed hardware.

The release has caused a bit of a kerfuffle – it no longer runs a set of applications that were supported by its predecessor. Like other transitions in this company’s past, this one was deliberate and foreshadowed across a couple years. If a customer depends on an application that this OS won’t run, they find an alternative, convince themselves they don’t really need that application, or they don’t upgrade. This last is problematic. Not in the short term – they’ll still get critical updates for their old operating system for a while – but eventually. And if they need to buy new fruit-themed hardware, that new hardware likely won’t run that old operating system. So those customers are one hardware failure away from running out of options.

It’s also generally been a bit of a bumpy release. The initial release had more than its fair share of issues, and even after a couple minor releases there are ongoing sources of customer pain and breakage. I’ve encountered some of these bumps personally, and I reached out to this friend – on a personal basis – to relate my anecdote.

I think it’s safe to say that most tech consumers don’t have personal contacts inside “Big Tech.” They can potentially contact support if they have a problem, but that’s where it ends.

“I don’t expect you to fix this,” I started, “I just want you to hear the unfiltered voice of your customer.” I went on to explain the problem, and how I thought they could have given their customers more options – as they had during past large technology transitions.

I pointed out that, from my perspective, this transition was different from the big transitions in the past. In the past, I argued, customers saw a difference, and that visibility made the changes – even the unwelcome ones – easier to understand. During the transition from their “classic” operating system to their NeXT (sic) generation operating system c. 2001, everything looked different. When they changed the CPU their machines were built around c. 2006, customers bought new computers. Those changes were moments of transition – painful transitions for some customers – that enabled new things.

This time, a customer who “upgrades” their software gets to do less. And it’s pretty hard to explain to a customer how being able to do less enables new things.

That got me thinking about the idea of progress in computing and software. This is a case where we’re “improving” a computing system by making it do less than it could do before – and that doesn’t feel like progress.

Maybe I’m just old. Part of me can’t give up the Apple ][ that I could open the cover on, and basically understand from the component level up.

That’s not computers anymore. And as magical as carrying the internet around in my pocket, or on my wrist, is – and it is – I think we’ve lost some valuable things along the way.

Postmortem of a catastrophic RAID failure

Wednesday of last week, I came home to find my three new 1TB hard disks waiting for me, destined to upgrade our ReadyNAS NV+.

The NAS being a hot-plug-online-upgradable-all-singing-all-dancing sort of widget, I followed the recommended upgrade procedure: popped out one of the current 500GB drives, waited a few seconds, slotted in one of the new 1TB replacements, waited until it started resynchronizing the volume, and went down to make dinner.

And spent the next several days picking up the pieces…

One critical bit of background – the NAS had three disks in a single RAID-5 volume. RAID-5 can tolerate one disk failure without data loss, but if two disks fail (regardless of the number of disks in the volume), kiss your data goodbye.
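To make that concrete, here’s a minimal sketch of how RAID-5’s XOR parity works, using toy byte strings in Python rather than real disk blocks. Any one missing block can be rebuilt by XORing the survivors together; lose two, and there’s nothing left to XOR against.

```python
from functools import reduce

def parity(blocks):
    """XOR equal-length blocks together byte-by-byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three "disks": two data blocks plus their parity, as in a 3-disk RAID-5 stripe.
d0 = b"hello world, this is disk 0!"
d1 = b"and this block lives on d1.."
p = parity([d0, d1])

# Lose any ONE block, and XORing the two survivors reconstructs it...
assert parity([d1, p]) == d0
assert parity([d0, p]) == d1

# ...but lose TWO, and the lone survivor tells you nothing about the others.
```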

When I went back upstairs after dinner to check on progress, I discovered that the NAS had locked up and completely dropped off the network. It wouldn’t answer on its web management UI, and it wasn’t responding to pings.

Hesitantly, I power-cycled it. It started booting, and hung about a quarter of the way through checking the volume.

After several reboot attempts all locking up at the same place, I applied a bit of coercion and convinced the box to boot. I checked the system logs and found nothing telling, removed and re-seated the new 1TB drive, and watched it start the resync again.

A couple hours later, sync still proceeding, I went to bed.

And woke the next morning to find the unit again fallen off the network.

Buried in the log messages – which I’d left scrolling past overnight – was a warning that disk 2 was reporting SMART errors about having to reallocate failing sectors.

In other words, one disk of the three was being rebuilt while another one was busy dying.
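For the curious: the relevant counter is usually SMART attribute 5, “Reallocated Sector Count.” Here’s a rough sketch of the kind of check a scheduled job could run. It assumes smartctl (from the smartmontools package) is installed, that the script has the privileges to query the drive, and that /dev/sda is the drive you care about – illustrative details, not gospel.

```python
import subprocess

def reallocated_sectors(device):
    """Parse `smartctl -A` output for attribute 5 (Reallocated_Sector_Ct).
    Returns the raw count, or None if the attribute isn't reported."""
    out = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    ).stdout
    for line in out.splitlines():
        fields = line.split()
        # Attribute rows are: ID# NAME FLAG VALUE WORST THRESH TYPE
        # UPDATED WHEN_FAILED RAW_VALUE -- so the raw count is field 10.
        if len(fields) >= 10 and fields[0] == "5" and "Reallocated" in fields[1]:
            return int(fields[9])
    return None

count = reallocated_sectors("/dev/sda")
if count:
    # A non-zero -- and especially a growing -- count is a drive waving a red flag.
    print(f"WARNING: {count} reallocated sectors; plan a replacement")
```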

At this point it became a race – would the rebuild complete (leaving me with two good disks, and intact data) before the failing disk died completely?

To try to buy some insurance, I shut down the NAS, transplanted the failing drive into a spare PC, and started a disk-to-disk copy of its data onto the working 500GB disk I had removed at the start of this mounting disaster.

Despite valiant attempts by both dd_rescue and myrescue, the disk was dying faster than data could be retrieved, and after a day and a half of effort, I had to face the fact that I wasn’t going to be able to save it.
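The core trick these tools use is simple, even though their real implementations are far more sophisticated: read the disk in large chunks, and when a read fails, pad the gap and move on instead of dying on the first error. A bare-bones sketch of the idea in Python – the real tools add retries, shrinking block sizes, reverse passes, and logs of the bad regions:

```python
import os

CHUNK = 1 << 20  # read 1 MiB at a time; real tools shrink this near bad spots

def rescue_copy(src_path, dst_path):
    """Copy src to dst, zero-filling any chunk that can't be read.
    Returns the number of bytes lost to read errors."""
    lost = 0
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT)
    size = os.lseek(src, 0, os.SEEK_END)
    offset = 0
    while offset < size:
        want = min(CHUNK, size - offset)
        os.lseek(src, offset, os.SEEK_SET)
        try:
            data = os.read(src, want)
        except OSError:
            data = b"\x00" * want  # unreadable chunk: pad with zeros, move on
            lost += want
        os.lseek(dst, offset, os.SEEK_SET)
        os.write(dst, data)
        offset += want
    os.close(src)
    os.close(dst)
    return lost
```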

Fortunately, I had set up off-site backups using CrashPlan, so I had Vince bring my backup drive to work, and retrieved it from him on Friday.

Saturday was spent restoring our photos, music, and email (more later) from the backup.

Unfortunately, despite CrashPlan claiming to have been backing up Dawnise’s inbox, it was nowhere to be found in the backup set, and the most recent “hand-made” backup I found was almost exactly a year old (from her PC-to-Mac conversion). Losing a year of email is better than losing everything, but that seems like meager consolation under the circumstances.

By Saturday night I had things mostly back to rights, and had a chance to reflect on what had gone wrong.

The highlights:

1. SMART, as Google discovered (and published), is a terrible predictor of failure. The drive that failed (and is being RMAed under warranty, for all the good it’ll do me) had never issued a SMART error before catastrophically failing.

2. In retrospect, I should have rebooted the NAS and done a full volume scan before starting the upgrade. That might have put enough load on the failing drive to make it show itself before I had made the critical and irreversible decision to remove a drive from the array.

3. By failing to provide disk scrubbing (a process whereby the system periodically touches every bit of every hard disk), the ReadyNAS misses its chance to detect failing drives early. A rough sketch of the idea appears at the end of this post.

4. While I had done test restores during my evaluation of CrashPlan, I had never actually done a test restore to Dawnise’s Mac. Had I done so, I might have discovered the missing files and been able to avoid losing data.

I have a support ticket open with the CrashPlan folks, as it seems there’s a bug of some kind here. At the very least, I would have expected a warning from CrashPlan that it was unable to back up all the files in its backup set.

5. In my effort to be frugal, I bought a 500GB external drive to use as my remote backup destination – the sweet spot in the capacity/cost curve at the time.

Since I had more than 500GB of data, that meant I had to pick and choose what data I did and didn’t back up. My choices were okay, but not perfect. Some data that should have been in the backup set wasn’t, due to space limitations, and is now lost.

6. CrashPlan worked well – but not flawlessly – and without it, I’d have been in a world of hurt. Having an off-site backup means that I didn’t lose my 20GB worth of digital photos, or several hundred GB of ripped music.

Aside from digital purchases, the bulk of the music would have been recoverable from the source CDs, though at a great cost in time. The photos would simply have been lost.

7. In this case, the off-site aspect of CrashPlan wasn’t critical, but it’s easy to imagine a scenario where it would have been.

8. The belief that RAID improves your chances of retaining data is built largely on what I’ll henceforth refer to as “The RAID fallacy” – the assumption that the failure modes of the drives in the array are completely independent events. The reality is that many (most?) RAID arrays are populated with near-identical drives: same manufacturer, same capacity (and model), and often the same or very similar vintage. So the drives age together under similar workloads, and any inherent defect (like, say, a firmware bug that causes the drives not to POST reliably) is likely to affect multiple drives, which spells disaster for the volume.
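To put rough (and entirely made-up, but plausible) numbers on that: suppose each drive has a 3% chance of dying in a given year. If failures really were independent, losing two of three drives in the same year would be rare. But if one failure in a same-batch trio signals a shared defect – and the rebuild itself stresses the survivors – the second failure is no longer an independent coin flip. A quick back-of-envelope comparison:

```python
# Chance of losing >=2 of 3 drives in a year. All numbers are illustrative.
p = 0.03  # assumed annual failure probability of a single drive

# Truly independent drives: binomial, P(exactly 2) + P(exactly 3).
independent = 3 * p**2 * (1 - p) + p**3
print(f"independent drives: {independent:.3%}")  # ~0.265%

# Same-batch drives: suppose that once one fails, a sibling has a 20%
# chance of following it (shared defect plus rebuild stress).
p_sibling = 0.20
correlated = (1 - (1 - p) ** 3) * p_sibling  # P(a first failure) * P(a second)
print(f"same-batch drives:  {correlated:.3%}")  # ~1.747%, roughly 6-7x worse
```

And returning to point 3: a scrubber doesn’t need to be clever. Conceptually it just reads every sector on a schedule, so a block quietly rotting in some unvisited corner of the disk gets noticed – and reallocated, or at least logged – while the array is still redundant, rather than in the middle of a rebuild. A minimal sketch, run periodically from something like cron:

```python
import os
import sys

CHUNK = 1 << 20  # 1 MiB

def scrub(device):
    """Read every byte of a block device, returning offsets of unreadable
    chunks. The data itself is discarded -- the point is forcing the drive
    to visit every sector so latent errors surface early."""
    bad = []
    fd = os.open(device, os.O_RDONLY)
    size = os.lseek(fd, 0, os.SEEK_END)
    offset = 0
    while offset < size:
        os.lseek(fd, offset, os.SEEK_SET)
        try:
            os.read(fd, min(CHUNK, size - offset))
        except OSError:
            bad.append(offset)
        offset += CHUNK
    os.close(fd)
    return bad

if __name__ == "__main__":
    for off in scrub(sys.argv[1]):
        print(f"unreadable region near byte offset {off}", file=sys.stderr)
```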