Something horrible just happened to Jeff Atwood aka CodingHorror.
“ugh, server failure at CrystalTech. And apparently their normal backup process silently fails at backing up VM images.”
“I had backups, mind you, but they were on the virtual machine itself :(”
It’s at times like these that we start wishing for a time machine, a cosmic undo button or reversible computing.
Jeff’s blog was read by tens of thousands of programmers and system administrators for many years. It contains information that is very valuable to these people, and represents an unthinkable number of hours spent by Jeff. An agency rate for somebody like Jeff is between $250 and $500 an hour, but this is like appraising a priceless family heirloom.
I am not going to go through the motions of telling everybody how to back things up, how important offsite backups are, how fragile disk drives are, how I don’t trust virtual servers, how RAID is not a backup strategy, how version control is not a backup strategy, etc., etc. JWZ wrote a good article about backups.
Here are the things I want to say. First, none of us is backed up sufficiently, and we have likely already lost data that we would want back.
I can’t find my grandmother’s recipe book (I still hope it’s only lost), my wife’s first email to me, my first web page through which she found me, my first job search web page that had a picture of the Twin Towers and said how I wanted to work there, my early school grading papers, a rare book about fishing in the Black Sea, a stamp from the Orange Republic that used to be in my father’s stamp album, the password to my very short-numbered ICQ account. A lot of stuff.
All of our digital information is susceptible to an electromagnetic pulse, fire, or flood. Spinning platter hard drives are particularly bad – they have very short lifespans, measured in low single-digit years. CDs are even worse – the aluminum inside them rots (I have a CD with a lot of Outlook emails that now reads as a blank filled with 1s).
So the first thing that I would like to mention is that if you never simulate a failure, you’ll never know if your stuff can be replaced. It’s not an easy thing to practice, though – restores and failovers are tricky to do.
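A restore drill does not have to be elaborate. Here is a minimal sketch in Python, assuming the backup is a plain zip archive of a single directory: it restores the archive into a scratch directory and compares checksums against the live copy. The names `make_backup`, `restore_drill` and `file_digests` are made up for this illustration, and a real setup would have its own backup tool in place of `make_backup`.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            name = path.relative_to(root).as_posix()
            digests[name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def make_backup(source: Path, archive: Path) -> None:
    """Stand-in for a real backup job: zip every file under source."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(source).as_posix())


def restore_drill(source: Path, archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare with source.

    Returns False if the archive is unreadable or if any file is missing,
    extra, or different from the live copy.
    """
    with tempfile.TemporaryDirectory() as scratch:
        try:
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(scratch)
        except zipfile.BadZipFile:
            return False
        return file_digests(Path(scratch)) == file_digests(source)
```

Run right after a backup, `restore_drill(Path("blog"), Path("blog-backup.zip"))` should return `True`. A `False` on a fresh backup means the backup is not doing what you think it is. The live files never get touched, so this is about the least nerve-wracking way to practice failure.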
A few jobs ago we were getting a fancy new load balancer set up. It was up and running, and supposedly we had failover: if one of the servers died, we would not even need to do anything, the backup servers would pick up the slack. I suggested that we should test it by pulling the network plug on one of the machines during off hours. My boss would not allow that, saying that we could possibly break things. My argument was that it’d be better for something like that to break when we were ready for it than during an actual failure. When the actual failure did occur, the load balancer did not switch, and we had an outage that was a good deal longer than it needed to be (it happened at night).
Load balancers are not backup solutions, but this story highlights an irrational streak in system administration: nobody wants to practice failure. It’s just too nerve-wracking, and a lot of hard work. It’s much easier to assume that somebody up the line did everything correctly: set up and tested backups, startup scripts, firewalls and load balancers. Setting up and validating backups and testing security are thankless jobs.
This brings me to another point. The act of taking a backup is not risk-free in itself. The biggest data losses that I have suffered happened in the process of setting up backups. As an example I’ll bring up the legendary story about Steve Wozniak (whom I met yesterday):
The Woz was creating a floppy driver under extreme time pressure, not sleeping much and feeling sick. The end result was a piece of software of unimaginable beauty: it bypassed a good deal of clunky hardware, and thanks to a special timing algorithm, was fast and quiet. Where other disk drives sounded like a machine gun (I dealt with a few of those when I was young), Woz’s purred like a kitten. Finally he wrote the final copy onto a floppy, and decided to make a backup of it. Being dead tired, he confused the source and destination drives, and copied an empty floppy onto the one with the precious driver. Afterward he proceeded to burnish his place at the top of engineering Olympus by rewriting the thing from memory in an evening.
It’s really the easiest thing in the world to confuse the source and destination of a backup, destroying the original in the act of backing it up! The moral of the story?
Do as much backing up as possible, while being careful not to destroy your precious data in the process. Have an offsite backup. Print out your blog on paper if it’s any good. In fact, print out as much stuff as you can. Your backup strategy should be like a squirrel’s: bury stuff in as many places as possible (well, except sensitive information, which is a whole other story in itself).
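As for being careful: one cheap safeguard against the swapped source and destination mistake is to make the copy step refuse anything that looks destructive. Below is a minimal sketch in Python, assuming single-file backups. `guarded_backup` and `BackupRefused` are hypothetical names invented for this example, not part of any backup tool.

```python
import shutil
from pathlib import Path


class BackupRefused(Exception):
    """Raised when a copy looks like it would destroy data."""


def guarded_backup(source: Path, destination: Path) -> None:
    """Copy source to destination, refusing suspicious copies.

    Refuses when the source is missing or empty, and when the
    destination already holds data that the copy would overwrite.
    """
    if not source.is_file() or source.stat().st_size == 0:
        raise BackupRefused(f"source {source} is missing or empty")
    if destination.exists() and destination.stat().st_size > 0:
        raise BackupRefused(f"destination {destination} already holds data")
    shutil.copy2(source, destination)
```

With this in place, copying a blank file over the precious one raises `BackupRefused` and leaves the original alone. Overwriting an old backup on purpose then takes a deliberate extra step, such as moving the old copy aside first, which is about the right amount of friction for a tired person late at night.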