{"id":72,"date":"2009-02-02T12:07:00","date_gmt":"2009-02-02T12:07:00","guid":{"rendered":"https:\/\/www.oubliette.org\/blog\/?p=72"},"modified":"2019-12-30T12:35:21","modified_gmt":"2019-12-30T12:35:21","slug":"postmortem-of-a-catastrophic-raid-failure","status":"publish","type":"post","link":"https:\/\/www.oubliette.org\/blog\/index.php\/2009\/02\/02\/postmortem-of-a-catastrophic-raid-failure\/","title":{"rendered":"Postmortem of a catastrophic RAID failure"},"content":{"rendered":"\n<p class=\"has-drop-cap\">Wednesday of last week, I came home to find my three new 1TB hard disks waiting for me, destined to upgrade our <a href=\"http:\/\/www.netgear.com\/Products\/Storage\/ReadyNASNVPlus.aspx\">ReadyNAS NV+<\/a>.<\/p>\n\n\n\n<p>Being a hot-plug-online-upgradable-all-singing-all-dancing sort of \nwidget, I followed the recommended upgrade procedure and popped out one \nof the current 500GB drives, waited a few seconds, slotted one of the \nnew 1TB replacements, waited until it started resynchronizing the \nvolume, and went down to make dinner.<\/p>\n\n\n\n<p>And spent the next several days picking up the pieces&#8230;<\/p>\n\n\n\n<p>One critical bit of background &#8211; the NAS had three disks in a single <a href=\"http:\/\/en.wikipedia.org\/wiki\/RAID_5#RAID_5\">RAID-5<\/a>\n volume.  RAID-5 can tolerate one disk failure without data loss, but if\n two disks fail (regardless of the number of disks in the volume), kiss \nyour data goodbye.<\/p>\n\n\n\n<p>When I went back upstairs after dinner to check on progress, I \ndiscovered that the NAS had locked up, and completely dropped off the \nnetwork.  It wouldn&#8217;t answer its web management UI, and wasn&#8217;t responding \nto pings.<\/p>\n\n\n\n<p>Hesitantly, I power-cycled it.  It started booting, and hung about a quarter of the way through checking the volume.<\/p>\n\n\n\n<p>After several reboot attempts, all locking up at the same place, I \napplied a bit of coercion and convinced the box to boot.  
I checked the \nsystem logs and found nothing telling, removed and re-seated the new 1TB\n drive, and watched it start the resync again.<\/p>\n\n\n\n<p>A couple hours later, sync still proceeding, I went to bed.<\/p>\n\n\n\n<p>And woke the next morning to find the unit again fallen off the network.<\/p>\n\n\n\n<p>Buried in the log messages &#8211; which I&#8217;d left scrolling past overnight &#8211; was a warning that disk 2 was reporting <a href=\"http:\/\/en.wikipedia.org\/wiki\/S.M.A.R.T.\">SMART<\/a> warnings about having to reallocate failing sectors.<\/p>\n\n\n\n<p>In other words, one disk of the three was being rebuilt while another one was busy dying.<\/p>\n\n\n\n<p>At this point it became a race &#8211; would the rebuild complete (leaving \nme with two good disks, and intact data) before the failing one died \ncompletely?<\/p>\n\n\n\n<p>In order to try to buy some insurance, I shut down the NAS, \ntransplanted the failing drive into a spare PC, and started a \ndisk-to-disk copy of its data onto the working 500GB disk I had removed\n at the start of this mounting disaster.<\/p>\n\n\n\n<p>Despite valiant attempts by both <a href=\"http:\/\/www.garloff.de\/kurt\/linux\/ddrescue\/\">dd_rescue<\/a> and <a href=\"http:\/\/myrescue.sourceforge.net\/\">myrescue<\/a>,\n the disk was dying faster than data could be retrieved, and after a day\n and a half of effort, I had to face the fact that I wasn&#8217;t going to be \nable to save it.<\/p>\n\n\n\n<p>Fortunately, I had set up off-site backups using <a href=\"http:\/\/www.crashplan.com\">CrashPlan<\/a>, so I had Vince bring my backup drive to work, and retrieved it from him on Friday.<\/p>\n\n\n\n<p>Saturday was spent restoring our photos, music, and email (more later) from the backup.<\/p>\n\n\n\n<p>Unfortunately, despite CrashPlan claiming to have been backing up Dawnise&#8217;s \ninbox, it was nowhere to be found in the backup set, and the \nmost recent &#8220;hand-made&#8221; backup I found was almost exactly a 
year old \n(from her PC-to-Mac conversion).  Losing a year of email is better than \nlosing everything, but that seems like meager consolation under the \ncircumstances.  <\/p>\n\n\n\n<p>By Saturday night I had things mostly back to rights, and had a chance to reflect on what had gone wrong.<\/p>\n\n\n\n<p>The highlights:<\/p>\n\n\n\n<p>1. SMART, as Google discovered (and <a href=\"http:\/\/research.google.com\/archive\/disk_failures.pdf\">published<\/a>),\n is a terrible predictor of failure.  The drive that failed (and is \nbeing RMA&#8217;d under warranty, for all the good it&#8217;ll do me) had never \nissued a SMART error before catastrophically failing.  <\/p>\n\n\n\n<p>2. In retrospect, I should have rebooted the NAS and done a full \nvolume scan before starting the upgrade.  That might have put enough \nload on the failing drive to make it show itself before I had made the \ncritical and irreversible decision to remove a drive from the array.  <\/p>\n\n\n\n<p>3. By failing to provide disk scrubbing (a process whereby the system\n periodically touches every bit of every hard disk), the ReadyNAS fails \nto detect failing drives early.<\/p>\n\n\n\n<p>4. While I had done test restores during my evaluation of CrashPlan, I\n had never actually done a test restore to Dawnise&#8217;s Mac.  Had I done \nso, I might have discovered the missing files and been able to avoid \nlosing data.<\/p>\n\n\n\n<p>I have a support ticket open with the CrashPlan folks, as it seems \nthere&#8217;s a bug of some kind here.  At the very least, I would have \nexpected a warning from CrashPlan that it was unable to back up all the \nfiles in its backup set.<\/p>\n\n\n\n<p>5. In my effort to be frugal, I bought a 500GB external drive to use \nas my remote backup destination &#8211; the sweet spot in the capacity\/cost \ncurve at the time.  <\/p>\n\n\n\n<p>Since I had more than 500GB of data, that meant I had to pick and \nchoose what data I did and didn&#8217;t back up.  
My choices were OK, but not \nperfect.  Some data was lost that should have been in the backup \nset, but wasn&#8217;t, due to space limitations.  <\/p>\n\n\n\n<p>6. CrashPlan worked well &#8211; but not flawlessly &#8211; and without it, I&#8217;d \nhave been in a world of hurt.  Having an off-site backup means that I \ndidn&#8217;t lose my 20GB worth of digital photos, or several hundred GB of \nripped music. <\/p>\n\n\n\n<p>Aside from digital purchases, the bulk of the music would have been \nrecoverable from the source CDs, but at a great cost in time.  The photos \nwould have just been lost.<\/p>\n\n\n\n<p>7. In this case, the off-site aspect of CrashPlan wasn&#8217;t critical, \nbut it&#8217;s easy to imagine a scenario where it would have been. <\/p>\n\n\n\n<p>8. The belief that RAID improves your chances of retaining data is \nbuilt largely on what I&#8217;m going to refer to henceforth as &#8220;The RAID \nfallacy&#8221; &#8211; that failures of the drives in the array are completely \nindependent events.  The reality is that many (most?) RAID arrays are \npopulated with near-identical drives.  Same manufacturer, same capacity \n(and model), and often the same or very similar vintage.  So the drives\n age together under similar workloads, and any inherent defect (like, \nsay, a <a href=\"http:\/\/seagate.custkb.com\/seagate\/crm\/selfservice\/search.jsp?DocId=207931\">firmware bug<\/a> that causes the drives not to POST reliably) is likely to affect multiple drives, which spells disaster for the volume.  <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Wednesday of last week, I came home to find my three new 1TB hard disks waiting for me, destined to upgrade our ReadyNAS NV+. 
Being a hot-plug-online-upgradable-all-singing-all-dancing sort of widget, I followed the recommended upgrade procedure and popped out one of the current 500GB drives, waited a few seconds, slotted one of the new 1TB &hellip; <a href=\"https:\/\/www.oubliette.org\/blog\/index.php\/2009\/02\/02\/postmortem-of-a-catastrophic-raid-failure\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Postmortem of a catastrophic RAID failure&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[7,4],"tags":[],"class_list":["post-72","post","type-post","status-publish","format-standard","hentry","category-i-hate-computers","category-selected-back-issues"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.oubliette.org\/blog\/index.php\/wp-json\/wp\/v2\/posts\/72","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.oubliette.org\/blog\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.oubliette.org\/blog\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.oubliette.org\/blog\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.oubliette.org\/blog\/index.php\/wp-json\/wp\/v2\/comments?post=72"}],"version-history":[{"count":1,"href":"https:\/\/www.oubliette.org\/blog\/index.php\/wp-json\/wp\/v2\/posts\/72\/revisions"}],"predecessor-version":[{"id":73,"href":"https:\/\/www.oubliette.org\/blog\/index.php\/wp-json\/wp\/v2\/posts\/72\/revisions\/73"}],"wp:attachment":[{"href":"https:\/\/www.oubliette.org\/blog\/index.php\/wp-json\/wp\/v2\/media?parent=72"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.oubliette.org\/blog\/index.php\/wp-json\/wp\/v2\/categ
ories?post=72"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.oubliette.org\/blog\/index.php\/wp-json\/wp\/v2\/tags?post=72"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}