Monday, August 3, 2009

What happened?

If you are a regular visitor to morgueFile, you will have noticed that the site was down for the last fifteen days. Today we are back, and this is a little explanation of what happened.

On July 15th, one of the drives in our storage array failed. That is all well and good: drives fail all the time, and an array consists of many disks so that they can easily be replaced, as long as no more than 2 drives fail. During the process to replace the drive... kaboom. [ Technically speaking, it didn’t immediately fail. It started rebuilding, but the older Seagate ES series drives manufactured in early to late 2008 had firmware issues. The rebuild failed part way in, the techs then told the array to rebuild again, then the OS froze when too many NFS requests were in the queue pulling memory from the OS, and then the system needed a hard reboot. It then came back and asked to run efsck on a large array, which takes time. ] After four days of file system checks, we had all our data back, just without the file names or folders, making it unusable.

This is really our worst nightmare come true. There are around 200,000 files in 4 sizes, around 450GB in total. Luckily we were clever enough to keep a backup. The host has a backup system, but we really operate on a tight budget, so we needed something cheap. More importantly, we made the decision not to keep all data in one location. The thinking was that, god forbid there was a fire or flood, backup and production files shouldn't be in the same physical location. So we had our resident Linux contractor back it up to Amazon.

Now we had the big problem that the files had to be downloaded from Amazon. Originally the files were to be stored as raw files, but because of issues setting up the files to sync, they were tarred and encrypted, then saved weekly. It wasn't such an issue at the time, but that is the real charm of hindsight.

We had started the download on day one, and it was taking a very, very long time. We estimated 500 days, give or take a week. Host and contractor worked night and day: they opened up all the connections, made adjustments, and we were on our way to recovery! One week later all the backups were downloaded. Then word came in an email: the backup script to Amazon may have, possibly, however unlikely, corrupted the data, and we were only at 50% recovered. Apparently the backup script had created multiple backups of the same directories. We had 800GB of files downloaded, but no way of knowing whether any of the directories were complete and valid. We were looking at a very real possibility of coming back online with only half the files.

This was the really bad part. After some heated calls, subsequent apologies, long walks through dark alleys sobbing quietly into the night... well, that's a little bit of an exaggeration, but the prospect of losing half the data was a bit disconcerting, to say the least.

The community had entrusted us with their images and going back and sharing the news of having lost them felt like letting everybody down in the worst possible way. If half the files are gone what can you do at this point?

Recover what you can, hope for the best, and prepare for the fallout.

So, for the next week the contractor decoded and re-synced directories, working basically around the clock to salvage what was possible. Mike from openphoto.net wrote us a Perl script to match files and directories by pattern. The week truly felt like an eternity.

The end result was that, even though there were multiple copies of the backups, there was in fact enough among them to make a complete restore. Huzzah! We ran a simple PHP script to compare the recorded file sizes and there they were: 98% of the files recovered, or only 4,500 files out of 220,000 lost. This would be the part where we break out into a Bollywood dance number on top of a train as it rolls through the country. Happy day!
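
For the curious, a check along those lines can be sketched in a few lines. This is a rough Python illustration, not the PHP script we actually ran. The function names and the `recorded` mapping (relative path to expected size in bytes, for example exported from a database) are made up for the example.

```python
import os

def find_missing(recorded, root):
    """Return relative paths whose file is absent or has the wrong size.

    recorded: dict mapping relative path -> expected size in bytes
              (hypothetical input, e.g. exported from the site's database).
    root:     directory holding the restored files.
    """
    missing = []
    for rel_path, expected_size in recorded.items():
        full_path = os.path.join(root, rel_path)
        try:
            actual_size = os.path.getsize(full_path)
        except OSError:
            # The file was not restored at all.
            missing.append(rel_path)
            continue
        if actual_size != expected_size:
            # The file exists but is truncated or otherwise different.
            missing.append(rel_path)
    return sorted(missing)

def recovery_rate(recorded, missing):
    """Fraction of recorded files that passed the size check."""
    if not recorded:
        return 1.0
    return 1 - len(missing) / len(recorded)
```

A size match does not prove that a file's contents are intact, so this is only a quick sanity check; comparing checksums would be stricter, if they had been recorded. With the numbers above, 4,500 failures out of 220,000 recorded files gives a recovery rate of about 0.98.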

Lots of really great people wrote in with very encouraging emails. We would like to thank everyone who did write in and tweet. It really kept us going! Now, we learned quite a bit from the experience as we move forward:

1) Lesson one: backup is important. You may know it is important, but it shouldn't be an afterthought. Backup should include a recovery plan as well. If it takes 3 years to recover the backup, it isn't a very good backup.
2) The second lesson comes from an old proverb: never underestimate the bandwidth of a station wagon full of backup tapes.
3) Lastly, we experienced the true power of the internet, its community. Even competitors wrote and offered their help. Something we can all be proud of.
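
To put numbers on lessons one and two, here is a back-of-the-envelope sketch in Python for checking how long a restore would take before you need it. The function names are invented for the example, it assumes decimal gigabytes and a steady transfer rate, and the rates it produces are derived from the figures in this post, not measured.

```python
SECONDS_PER_DAY = 86_400
MEGABITS_PER_GB = 8_000  # decimal units: 1 GB = 1000 MB = 8000 megabits

def restore_days(size_gb, rate_mbit_per_s):
    """Days needed to transfer size_gb at a sustained rate in Mbit/s."""
    if rate_mbit_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_gb * MEGABITS_PER_GB / rate_mbit_per_s / SECONDS_PER_DAY

def required_rate(size_gb, days):
    """Sustained rate in Mbit/s needed to transfer size_gb within days."""
    if days <= 0:
        raise ValueError("days must be positive")
    return size_gb * MEGABITS_PER_GB / (days * SECONDS_PER_DAY)
```

If the 500-day estimate applied to the full 800GB, `required_rate(800, 500)` works out to roughly 0.15 Mbit/s, while finishing in a week, `required_rate(800, 7)`, needs about 10.6 Mbit/s sustained. Running this kind of arithmetic while designing the backup shows whether the recovery plan is realistic, or whether shipping physical media would be faster.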

16 comments:

  1. Good!!!
    It's nice to hear that you're back online and rolling.
    I don't see my thumbnails though, were they some of the missing files?

  2. So glad to have you back! Thanks for sharing the story with us, as well as the lessons learned! I'm definitely going to be backing things up more often. :)

  3. My fingers have been crossed for 2 weeks. So happy to see you back in action!!!

  4. Very glad to have the 'File back. It really is a truism to say you don't know what you've got till it's gone.

  5. Been following the drama unfold at Twitter. I'm very happy you're back online!

  6. I, too, kept checking Twitter hoping for good news. I rely on your images more than you could imagine for work and personal use. So glad you are up and running again!

  7. Haven't held my breath like that since the Apollo moon landing!

  8. I had just been clued in to morguefile a day before the disaster and was really excited to use the images for a project. The next day I log in to find an implosion of sorts on what I hope will be a great source for photos. Cruel, cruel fate. Needless to say I am so glad you all are back up and running! Thanks to everyone for the hard work and long hours!!

  9. Bravo!
    You guys made my day! I've been following your tweets. Thank goodness you're up again.
    jeff

  10. Glad to see you back up, but none of your thumbnails will load! Is this being worked on?

  11. thanks for the explanation!! i was missing my favorite site, man!! you guys are awesome!!

  12. Great news, you're back in the saddle again! Keep up the good work!

  13. [...] a bit of down time, they are finally back up running!  Check out this blog post explaining what it is that happened.  In the post, they outlined several lessons they learned from [...]

  14. Hey.. but really.. what's going on NOW?
    seems like thousands of photos that used to be here aren't
    I love the crop and link back option.. but now a bunch of those photos are gone. What's the deal?
    I really want to keep loving you.

  15. @kerch We still have the same number of photos; we only lost 500 of 200,000. Maybe it's an issue with the search. Not sure why it seems like that many are missing.
