Thread: Uh oh.

  1. #1

    Uh oh.

    As of 2:00pm, our server's RAID lost too many drives *at the same time*. The statistical odds are like lightning striking your car outside.

    Our server contains a *lot* of data, and we're busy reconciling our various backups now. I realized, however, that if these drives do not come back online, I will lose all my rAge 2007 footage, which was so huge it could only be stored on the server.

    NAG and SACM are fine, from the looks of it, and we're still examining the server to see what we can recover. NAG and SACM won't be late, don't worry - but this is going to really put a damper on the office atmosphere.

  2. #2

    NOOOOOOOOOOOOOOOOOOOO! :'( Sad times indeed... Prob why gallery.tidemedia.co.za/ is offline?

  3. #3

    That really sucks.... :( I hope that all isn't lost. Is there any specific reason why it happened?

  4. #4

    Um, no. Our hosting is in Germany, FrancoisWiid. All our websites are on one server: the forums, gallery, sacm, nag, oldskool - so if you can see the forums, you *should* see the gallery.

    But I didn't know the gallery was down, I'll look at it now.

    The server that failed is the one in our office.

  5. #5

    Originally posted by -Bouncer-:
    That really sucks.... :( I hope that all isn't lost. Is there any specific reason why it happened?

    A lot of theories, no solid evidence yet.

  6. #6

    We're sending it off to a data recovery lab now - we can't spin the drives up ourselves, this is going to require proper drive recovery equipment, which we don't have.

  7. #7

    Eish, not good.
    Also experienced that at my previous job, while we were having backup issues to boot.
    Faulty RAID controller killed 2 drives simultaneously.
    Coupla thousand Rand (quote was somewhere in the region of R8000) worth of expert drive recovery... gave us most but not all of the data back.
    Last edited by CaptainCrunch; 18-10-2007 at 02:40 PM.

  8. #8

    Hey, didn't NAG run an article a few years back about data recovery? According to you, total irrecoverable data loss is virtually impossible. So I'm cautiously optimistic. Which is really quite out of character for me.

  9. #9

    What RAID level do you run? 5? 10?

    If you are running RAID5 (which is very popular) then the odds of losing your data are in the billions because it works so well.
    You need a lot of HDDs to fail to lose data!

  10. #10

    Holding thumbs, I'm sure we all are. Best of luck.

  11. #11

    I just had the image of a worried looking Dragotaur pacing in front of a Server Operating Theatre.

    Hope it pulls through. We'll send balloons and get well cards

  12. #12

    We'll know what the data recovery guys say today. I realized this morning that all my Steam games are backed up there, as well as ALL the MP3s I've been buying over the last two years. ;..;

  13. #13

  14. #14

    woah bummer man, best of luck.....if you find the person responsible point me at him, I'll use my chuck norris like powers against him

  15. #15

    then again Disleckia's got ninja powers so you could use him instead

  16. #16

  17. #17

    We're still waiting. It's going to cost over R30,000 to recover the data (more than it cost to build the server).

  18. #18

    NAG server = Abyss

    Supermicro H8DCE-O
    Opteron 265
    4GB (2x2GB) ECC DDR-400
    Radeon x600 (you don't use system memory for video on a server kids, that's just retarded overhead to your memory bus)
    Areca ARC-1230 RAID controller
    8x Western Digital RE2 500GB in RAID6
    Thermaltake Eureka
    Aopen AO-700 PSU

    The server falls under my domain because I was the one who built this particular machine, and we don't have a dedicated IT staff, so you can imagine what my life is like right now. We already had one drive (#6) down from a previous failure, and I was waiting for a replacement, which came in that day. Found out another drive (#8) had failed in the meantime, killing the redundancy. The third drive (#7) failed while I was standing there trying to shut it down so I could install the replacement and get some redundancy back.

    Two drives (#7 and #8) just clicked on power-up; they wouldn't even spin up properly. Plus I don't know if #8 had just failed or failed a while ago, so no idea what state its data is in. The only hope we have is to recover #7, which is the one that failed last and brought the array down.

    So far evidence points to a faulty set of drives. 4 of the drives (#1-4) came new when we built the server. The other 4 (#5-8) were recycled from our old standalone RAID5 NAS server that used to hold a lot of our data. All 3 drives that failed were from that set, and had been showing random minor problems beforehand. I suspect they ran too hot in the NAS for the time we used them there and when the server room heated up they just crapped out.

  19. #19

    Thank you for choosing AMD.
    RAID 6?? *scuttles away to lecturer and asks about it*

    Moral of the story? Information is more valuable than hardware.

  20. #20

    Originally posted by Frozenfireside:
    Thank you for choosing AMD.
    RAID 6?? *scuttles away to lecturer and asks about it*

    Moral of the story? Information is more valuable than hardware.

    Mmm yes I never doubted that, a machine is only worth what you use it for. :)

    I'm wondering where the charred remains of our previous dual Xeon server are. I'm thinking of possibly using that so we can have two machines up, if we could get enough parts to fix it.

    As for RAID6: dual parity stripes. Same as RAID5 except there are two parity calcs per stripe set, which rotate the same way. Why bother? Because a two-drive failure can still kill a RAID10 or 0+1 if you're unlucky and it takes down a complete mirror. But it takes three drives to kill a RAID6. What are the odds you hit the two exact drives to kill the RAID10 vs a third drive? Math time!

    Assuming eight drives, each with the same Mean Time Between Failures (MTBF):

    RAID10: you need to kill two drives in a matched pair. Just to kill two drives, the odds are 1 in MTBF squared (there's some complexity about what unit of time you use for this, but let's just use hours and say it's fairly improbable).

    Then we have to hit a mirror. So what are the odds that the second drive failure is the mirrored partner of the first? 7 remaining functional drives, so 1/7.

    So RAID10's failure probability is 1 / (MTBF^2 * (n-1)), where n is the number of drives in the array.

    What about RAID6? You need to kill any 3 drives at once. That's MTBF cubed.

    RAID6 = 1 / (MTBF^3)

    So, for MTBF > n-1 (with MTBF counted in rebuild windows rather than hours), RAID6 is more secure. Most drives have an MTBF of around a million hours. Let's say it takes a pessimistic 10 hours to rebuild the RAID. So that's 100,000:1 odds against a given drive failing during a rebuild. I doubt anyone's running a 100,002 drive RAID10 anywhere.
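    For anyone who wants to poke at the comparison, here is a rough Python sketch of the back-of-envelope model above. The function names are made up for illustration, and it keeps the same simplifications as the post: every drive fails independently, and the odds against one drive failing in a rebuild window are just MTBF divided by rebuild time.

```python
from fractions import Fraction


def drive_odds(mtbf_hours, rebuild_hours):
    """Odds against one given drive failing inside a single rebuild window.

    Simplified model from the post: MTBF divided by the window length.
    """
    return Fraction(mtbf_hours, rebuild_hours)


def raid6_odds(mtbf_hours, rebuild_hours):
    """RAID6 dies when three drives fail in the same window: odds cubed."""
    return drive_odds(mtbf_hours, rebuild_hours) ** 3


def raid10_odds(mtbf_hours, rebuild_hours, n_drives):
    """RAID10 dies when two drives fail and the second one is the mirror
    partner of the first (a 1 in n-1 chance): odds squared times (n-1)."""
    return drive_odds(mtbf_hours, rebuild_hours) ** 2 * (n_drives - 1)


def raid6_is_safer(mtbf_hours, rebuild_hours, n_drives):
    """True when the odds against losing the RAID6 are strictly longer."""
    return raid6_odds(mtbf_hours, rebuild_hours) > raid10_odds(
        mtbf_hours, rebuild_hours, n_drives
    )
```

    With a million-hour MTBF and a 10-hour rebuild, raid6_is_safer(1_000_000, 10, 8) comes out True, and it only flips to False once the array is past roughly 100,000 drives.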


    Oh, and if you want the lightning strike number on what Abyss' failure was: the WD RE2 has a 1.2 million hour MTBF, and I know it only takes our array 3 hours to rebuild. So (1.2M / 3)^3 = 400,000^3 = 64,000,000,000,000,000 : 1.
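    The arithmetic for that last number can be checked in a few lines of Python. This is only a sketch under the same assumption as the estimate itself (three independent drive failures inside one 3-hour rebuild window); the constant and function names are just for illustration.

```python
from fractions import Fraction

MTBF_HOURS = 1_200_000  # WD RE2 MTBF figure quoted in the post
REBUILD_HOURS = 3       # rebuild time for the array quoted in the post


def triple_failure_odds(mtbf_hours, rebuild_hours):
    """Odds against three independent drives all failing in one rebuild window."""
    per_drive = Fraction(mtbf_hours, rebuild_hours)  # 400,000 : 1 per drive
    return per_drive ** 3


odds = triple_failure_odds(MTBF_HOURS, REBUILD_HOURS)
print(f"{int(odds):,} : 1")  # prints 64,000,000,000,000,000 : 1
```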
