These famous words, coined by Winston Churchill, still ring true today in the context of the software industry.
The reality is that software crises are here to stay, a fact we all need to come to terms with. They are inevitable, and rather than running away from them, we should face them head-on. Crises do come at a price, that's true, but the price we pay depends on how we handle the situation and, most importantly, what we take away from it after the dust settles.
So, we had this new feature running. We fully tested it (or at least we thought so at the time) and it went live. A few days later our customers started using it, but unfortunately for us, the load hit a threshold we hadn't anticipated, and auto-scaling didn't fix the problem. That's when crisis mode went into full gear.
Following this stress-inducing event, I created a six-step guide to help me get through the "next time" (because you know there will always be a next time). Whether you are an individual contributor or a team leader, I hope you'll find it as useful as I have.
Step 1: Analyzing
First thing — keep calm. Stressing out won't help here. I start by trying to analyze the root cause of the error and aiming my focus, as well as the team's, at it. At this point, you will most likely have alerts and logs firing all over the place. Once you get a grip on the problematic component, you're off to a good start.
So, previously we had this use case where one service had to fetch data from another. Unfortunately, the second service suffered from data-fetching latency that pushed requests past their timeouts. Those timeouts propagated back to the caller, which added even more latency. Once we spotted the root cause, we focused and had the chance to give it a good fight.
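One common way to stop that kind of latency cascade is for the caller to enforce its own deadline on the downstream call. The sketch below is an illustration, not the original services: `fetch_from_downstream` is a hypothetical stand-in for the real cross-service call, and the `delay_s` parameter simulates downstream latency.

```python
import concurrent.futures
import time

# Shared pool so a timed-out call doesn't block the caller's return.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def fetch_from_downstream(delay_s: float) -> str:
    time.sleep(delay_s)  # simulated downstream latency
    return "payload"

def fetch_with_deadline(delay_s: float, timeout_s: float) -> str:
    future = _pool.submit(fetch_from_downstream, delay_s)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Degrade gracefully instead of stalling and propagating
        # the downstream latency back up to our own callers.
        return "fallback"
```

The key design choice is that the caller's latency budget is independent of the downstream service's behavior: a slow dependency costs at most `timeout_s`, not its full response time.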
Step 2: First Aid (the so-called band-aid solution)
When you're in crisis mode, there is no room to point fingers at who's to blame or how you even got into this position in the first place. All your resources should be united to get "us" out of there, and for that, we must regain focus. One case we had was a retry mechanism that got itself into an infinite loop, causing duplicated requests due to an internal serialization failure. The solution:
- Isolate the issue: As a team, we understood that we first had to disable the service to cut the loop. Once we had done that, the system went back to normal.
- Give first aid: Based on the issue type, you can take fast action with tools you already have in your arsenal: scaling, restarting, disabling, or any other band-aid you can apply to stop the bleeding.
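The infinite retry loop above can be avoided by bounding retries in the first place. Here is a hedged sketch of a capped retry with exponential backoff; the names are illustrative, not taken from the original incident.

```python
import time

def retry(operation, max_attempts=3, base_delay_s=0.01):
    """Run `operation`, retrying up to `max_attempts` times with backoff."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as err:  # in production, catch only retryable errors
            last_error = err
            time.sleep(base_delay_s * (2 ** attempt))  # exponential backoff
    # Give up and surface the error instead of looping forever.
    raise last_error
```

Pairing a bounded retry like this with idempotency keys or request deduplication also guards against the duplicated requests that such a loop produces.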
Once we were successful with the first-aid solution we took a deep breath, made coffee, and moved on to the next step.
Step 3: Hot Fix
We've all been there: chalking it up to "voodoo" or outright jinxing it by saying, "it happened, but it's unlikely to happen again." In my opinion, there's no such thing as voodoo or jinxes when it comes to this. I always suggest the team thoroughly investigate and release a hotfix to make sure the issue won't occur again. I recommend a fallback as well.
If you can't work out what happened, at least add more logs, so that if the same issue arises again you'll be better prepared to nail it. Left alone, the issue will come back to bite you over and over; it's just a matter of time, so be ready for it. A hotfix must ship either way, whether it resolves the issue or just decorates the code with logs.
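The "logs decoration" hotfix can be as small as wrapping the suspect flow so the next failure carries its own context. A minimal sketch, assuming a hypothetical `checkout` component; `do_process` stands in for the real business logic:

```python
import logging

logger = logging.getLogger("checkout")  # hypothetical component name

def do_process(payload: dict) -> str:
    return payload["order_id"]  # stand-in for the real business logic

def process(payload: dict) -> str:
    # Log enough context that the *next* occurrence of a mystery failure
    # is diagnosable: the input, the failure itself, and where it happened.
    logger.info("processing payload: %r", payload)
    try:
        return do_process(payload)
    except Exception:
        logger.exception("processing failed for payload: %r", payload)
        raise  # the log is decoration; the error still propagates
```

`logger.exception` records the full stack trace alongside the message, which is usually the single most valuable breadcrumb when the issue resurfaces.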
Step 4: Retrospective
Once you have a hotfix in place and are all "back to normal," you and the crisis stakeholders (infra team, dev team, etc.) should gather and start processing how you all got into the crisis state. Go through the flow. Draw the chain and sequence of events that led you there. As I see it, this is the most important step, as here we enter "learning" mode. In that session, don't try to find who to blame or who's at fault. Try, together as a team, to understand what to learn from the crisis. I suggest asking the following questions:
- Which dark parts of the system were we not familiar with (e.g., was it a regression, or a new area that was flaky)?
- Where in the process did it fail (miscommunication between teams, not enough testing, architecture design issue, etc.)?
- Which component must be refactored or improved?
- Were logs missing?
- Do better monitoring alerts need to be added?
- How quickly did you become aware of the crisis?
- How fast did you manage to analyze and understand the issue?
- What was the impact of the crisis on other components, and was it isolated enough?
- Might we have similar cases elsewhere in the system that could cause the same crisis?
I recommend raising all of these questions, as we learn the most through on-the-spot thinking, growing as individuals and as a team.
Step 5: Conclusions and Follow-Ups
Once we have learned from the crisis, we must take further action. Follow-ups must be put in place and their status checked regularly, or else the same crisis will recur again and again, and that's bad news for everyone concerned. New crises are fine, but the same one recurring must be avoided by all means necessary; if it recurs, it means we didn't learn anything and the whole painful learning curve went to waste. So all fixes and conclusions must be acted on (whether added to the backlog or to the next sprint(s)) to put your system back into a safer state and raise the team's confidence.
Step 6: Team Morale
At the end of each crisis, after you've followed all these steps, take responsibility as a team. And as much as we hate it, the feeling we might sense is one of failure: "How come we didn't test this use case?", "The code I just pushed was prone to error", "That part was missed due to miscommunication", etc. I find it truly important not to look for someone to blame, but instead to help rebuild confidence through honest action (see Steps 4 and 5 above).
Crises are here to stay; how we learn to handle them is what truly counts.