It’s happened to the best of us: no matter what scalability plans we put into place, no matter how many horizontally scaled virtual hosts we spin up, no matter how we shard our carefully crafted data, no matter how many greens our automated tests show, there’s always one element that might just catch us out.
It’s easy to overlook a misplaced comma in a config file or, heaven forbid, to code a quick patch on a live running service. Human error is bound to creep in one day, and so you should be prepared.
And I don’t mean be prepared just for human error. The advice I’m going to give applies equally to hardware and software failures. You see, it’s all about how you communicate with your customers when disaster strikes.
Whilst writing this, I’m reminded of a recent incident involving a DNS outage at MyDomain.com, an example of how not to handle a disaster. In fact, had they handled the incident better, we would probably still be hosting our DNS with them. I’ll be taking this as an example, and comparing it with how Singing Horse Studio currently handles outages, using practices we partly inherited from our time at Linden Lab, the makers of Second Life.
The main thing is: don’t panic! No matter how bad it seems, taking a breath and stepping back from the problem is the best first step. You won’t make things any worse by taking your time, and I’ve watched more small errors turn into big problems through stupid mistakes made under stress than through slow responses.
Assign One Person As Point
If you’re not a one-man-band then you’ll have the luxury of appointing a single person as point. It’s this person’s job to co-ordinate events, taking the lead in organising who deals with what, and facilitating communication with others.
Here at Singing Horse Studio, the person on point is responsible for documenting events for the post-mortem (more on that later) and communicating with customers directly; in organisations with a dedicated customer services department, the person on point simply relays events to the customer service representatives, who in turn inform the customers.
The advantage of assigning someone as point is that the engineers can fix the problem without the additional pressure and stress of “get it fixed now!”; they can focus on what they need to do whilst one person handles the co-ordination. The point person, in turn, is free to regularly inform management, customer services, affected departments and so on of the progress of the fix. Which leads to …
Keep Your Customers Informed. Regularly.
Few things are more frustrating for a customer or a manager than a lack of information. Put yourself in the customer’s shoes: your service is down, you’ve no idea why, and nobody is telling you anything. Frustrating? Damn right. And being kept in the dark about progress (or the lack thereof) is going to make your customer worry even more than necessary. Even a message saying “we’re still looking into it” is better than silence.
With the MyDomain DNS outage, we first became aware of the problem through our own internal Nagios checks. A disappearing host isn’t a good sign, and it was quickly diagnosed as a DNS failure. I checked to see whether MyDomain had a maintenance window or a service status page, but unfortunately their site just pointed to the knowledge base. Not a good sign.
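The check itself is trivial, incidentally. Here’s a minimal Python sketch of the same idea, not our actual Nagios configuration; the hostnames are hypothetical, and the exit codes follow the standard Nagios plugin convention:

```python
import socket
import sys

# Hypothetical hosts to watch; substitute your own.
HOSTS = ["www.example.com", "api.example.com"]

def resolves(hostname):
    """Return True if the hostname currently resolves via DNS."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False

failures = [h for h in HOSTS if not resolves(h)]
if failures:
    # Exit code 2 means CRITICAL to Nagios, which raises the alert.
    print("DNS CRITICAL: cannot resolve %s" % ", ".join(failures))
    sys.exit(2)

print("DNS OK: all hosts resolve")
sys.exit(0)
```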
So what’s the best way of keeping up with news as it happens? Twitter of course! And checking their Twitter stream showed that they did indeed have problems.
Now, they did at least do something right by monitoring their Twitter messages and responding to them, but there was never a planned series of communications to customers. Instead, the only communication they offered was to reply to any question about the service being down with “I’m sorry, yes, we’re working on it”. No explanation of what the problem might be, no estimated time for a fix, no setting of expectations for future updates.
When customers are given an update such as “At 7:57 UTC the main East Coast server suffered an, as yet, undiagnosed problem, causing software connectivity issues. Engineers are currently diagnosing the problem. An update will be provided within 30 minutes”, they know what to expect: there’s a serious problem, it’s being looked at, and I’ll find out more in half an hour. Cool.
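Producing updates in that shape takes almost no machinery. Here’s a minimal sketch, assuming the thirty-minute cadence above; the function name and message template are illustrative, not any particular status-page API:

```python
from datetime import datetime, timedelta, timezone

def incident_update(status, detail, next_update_minutes=30):
    """Build a customer-facing update with a timestamp and an
    explicit promise of when the next update will arrive."""
    now = datetime.now(timezone.utc)
    next_at = now + timedelta(minutes=next_update_minutes)
    return "At %s UTC: [%s] %s An update will be provided by %s UTC." % (
        now.strftime("%H:%M"),
        status.upper(),
        detail,
        next_at.strftime("%H:%M"),
    )

print(incident_update(
    "investigating",
    "The main East Coast server has an as-yet undiagnosed problem; "
    "engineers are working on a diagnosis.",
))
```

The key detail is the last sentence: every update commits you to the next one, which is exactly the expectation-setting MyDomain never managed.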
Tell your customers what’s going on, even if it’s your own fault. Saying “we messed up, but we’re fixing it” will earn you respect, and you’ll be surprised at how easily customers forgive you if you’re open and honest and take responsibility.
Learn From The Experience With A Five-Whys Analysis
As part of our Agile process, we’re always looking at ways to improve what we do, whether that’s our coding practice, our project management style, or how we make sure faults don’t re-occur. I guess that’s why we love test-driven development: writing a failing test before fixing a bug should mean you never have to worry about that bug showing its face again, because the regression tests will catch it.
One excellent method for gradually improving a process or methodology is something we learned from Eric Ries and IMVU (oddly enough, a competitor of Linden Lab), called Five-Whys, which originated in the Toyota Production System.
It involves asking the question “why?” five times to drill down to the root cause of a problem. The method then suggests you invest a small amount of time addressing each of the five answers, effectively improving the process piece by piece without having to invest huge amounts of time, money or effort. It’s a very effective method, and I strongly urge you to read all the details.
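To make that concrete, here’s a hypothetical five-whys for a DNS outage like the one above (the answers are invented for illustration). Why did customer sites disappear? Because the name servers stopped answering. Why did they stop answering? Because a configuration push restarted them with a broken zone file. Why was the zone file broken? Because it was edited by hand and a syntax error crept in. Why was it edited by hand? Because there’s no tooling to generate and validate zone files. Why is there no tooling? Because nobody has been given the time to build it. Five answers, five proportional fixes: restore the file, validate before pushing, script the edit, schedule the tooling work, and budget the time for it.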
Publish A Post-Mortem
And finally, when you’ve suffered a disaster, the most important part is to let people know. The person on point will have recorded all the details in a timeline of events; you have your five-whys analysis and the actions you’re going to take to make sure this kind of disaster never happens again; so publish it all.
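The structure needn’t be elaborate. A skeleton along these lines (the headings are only a suggestion) covers everything a reader will want to know:

```
Post-Mortem: <incident title>, <date>

Impact:     what broke, who was affected, and for how long
Timeline:   the point person's record of events, with times in UTC
Root cause: the five-whys chain and the conclusion it led to
Actions:    the fixes you're committing to, with owners and dates
```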
If it’s an internal-facing problem, then there’s obviously no need to publish outside your own organisation, but it makes sense to distribute the post-mortem to at least those who were affected by the disaster. If it was a public-facing problem where your customers (or a percentage of them) suffered, then make it a public post-mortem. Release it to the customer services department to distribute, put it on the upcoming maintenance/service status page (assuming you have one that works!), or publish it as a blog post.
Show people the results of the five-whys, even if it displays some of that human error — that’s ok, we’re all human, remember. Show people the steps you’re going to take to make sure that this never happens again — your customers will appreciate it.
But one thing: don’t put it on your Facebook wall and then expect non-logged-in users to be able to view it. Yes, MyDomain, I’m looking at your “you need to log into Facebook to see this page” post-mortem. Ho hum.
So, there you have it. What to do when disaster strikes at your scalable service. Now let’s hope you never have to put this plan into action, eh …?