Pre-Mortems to Prevent Post-Mortems: Proactive Engineering
Aug 16, 2021 | By Joseph Miller
Systems fail. Our systems here at FloQast are subject to failure, as are the ones in your organization, and even those ones over there. With systems failing all the time, and seemingly at the worst times, it's a wonder that anything works at all in our modern world!
That being said, the reason that anything works at all in our modern world is that people have prepared for these failures.
These preparations sometimes go unused and often go unnoticed, but they're present and crucial. Distributed systems, contingency plans, backups, redundancy - these are all preparations that we rely on to recover from some system failure.
Another item in this toolbox is the pre-mortem, which is a method of planning that forces us to think about the ways in which our projects can fail.
A pre-mortem at the project scope can help us include contingencies in our development sprints, whereas a pre-mortem for a code deployment can help us identify risks surrounding particularly tricky code. By doing so, we reduce the chaos and confusion that can so easily accompany a failure in production - an already stressful situation.
In other words, a pre-mortem helps us avoid a post-mortem.
At FloQast, we recently had a pre-mortem for a project in which we changed a critical, time-sensitive process that runs periodically. We captured these details in a document so that we could easily share them with other team members. (And it came in handy when things started behaving abnormally. 😅 But that's a blog post for another day...)
Here are the considerations that we included in our pre-mortem:
(and an example at the bottom of this page)
Overview of the Project
A brief description of the what and why is always a reliable way to set the stage. How would we go about bringing someone up to speed on this topic if this were the first time they were hearing about it?
- What exactly is the project changing?
- Why is the change necessary?
- Is there any historical context that would be useful for us to share?
- How would we describe the risk level for the changes?
Expected Timeline and People
This is something that, at a glance, lets us determine progress and identifies the go-to folks at each particular step in case any questions arise. If different actions are happening concurrently, we can also call that out here.
- When do we expect the festivities to begin? (i.e., start time)
- Who is involved in the project, and what roles/responsibilities do they have? Who will be monitoring systems, and what systems should be monitored?
- How long will it take for the changes from the project to take effect? Are we operating within a strict window of time?
It's OK for us to use approximate or relative times here, even if they're rough guesses. We're just looking for a quick way to get a feel for how things are moving along.
What is the Happy Path?
Prepare for the worst... but expect the best! 😁
If everything went exactly according to plan, what would it look like?
We can use the Overview as a launching pad for this, giving us an outline of the steps involved. Next, we add details to each of the steps, including technical breakdowns of the code and systems that are involved along the way.
Presenting this information like a script is helpful because it lets us focus on what we're expecting to happen. Deviations from the script could be indicators that we're venturing into the Danger Zone.
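To make the "script" idea a bit more concrete, here is a minimal Python sketch of comparing what we actually observe against the happy path. Everything in it is an assumption for illustration: the `ExpectedStep` type, the `find_deviations` helper, the step names, and the time limits are all hypothetical, not something from our actual pre-mortem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedStep:
    name: str           # what the happy path says should happen
    max_minutes: float  # rough upper bound on duration (a guess is fine)


def find_deviations(script, observed):
    """Compare observed (name, minutes) pairs against the happy-path script.

    Returns a list of messages describing deviations. An empty list means
    everything observed so far matches the script.
    """
    deviations = []
    for index, (seen_name, seen_minutes) in enumerate(observed):
        if index >= len(script):
            # Something happened that the script never mentioned.
            deviations.append(f"step {index + 1}: unexpected extra step '{seen_name}'")
            continue
        step = script[index]
        if seen_name != step.name:
            deviations.append(
                f"step {index + 1}: expected '{step.name}', saw '{seen_name}'"
            )
        elif seen_minutes > step.max_minutes:
            deviations.append(
                f"step {index + 1}: '{step.name}' took {seen_minutes} min "
                f"(expected at most {step.max_minutes})"
            )
    return deviations


# Hypothetical happy path for a periodic job change.
script = [
    ExpectedStep("deploy the change", 10),
    ExpectedStep("job picks up first batch", 5),
    ExpectedStep("first batch completes", 30),
]

# What we have seen so far; the third step has not happened yet.
observed = [
    ("deploy the change", 8),
    ("job picks up first batch", 12),
]

for message in find_deviations(script, observed):
    print(message)
```

In practice the "observed" side is usually people watching dashboards and logs rather than a program, but writing the script down in this shape makes it obvious when a step is late or out of order.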
What Can Go Wrong?
Now for the fun part: time to brainstorm what kinds of problems can occur!
In typical brainstorming fashion, there are no wrong answers - we should call out anything that can mess things up.
It's important that we get specific with these issues and avoid overly general phrasing (e.g., "the queueing system stops responding" rather than "it doesn't work"), because we also want to use this opportunity to think of ways we can mitigate or even prevent these problems.
- What is a brief description of the issue?
- What are the symptoms? How will we know if we're in this state?
- What are the effects on the end user?
- Is data affected at all?
- What actions can we take to avoid this? Can we fix it without a code deployment?
- If we fix the issue, is there additional cleanup work afterward?
Patterns often start to emerge when we enumerate these issues. We might even be able to group some of them together, if they have similar symptoms or fixes.
This becomes really useful when we encounter a failure that we hadn't anticipated in the pre-mortem; it might fit into one of the groups we already identified, giving us a leg up on finding a solution!
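As a rough sketch of how the questions above could be captured and grouped, here is one possible shape in Python. The `FailureMode` fields simply mirror the questions in the list, and the `group_by_symptom` helper and the sample entries are hypothetical; treat it as an illustration under those assumptions, not the format we actually used.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FailureMode:
    description: str     # brief description of the issue
    symptoms: list       # how we will know we're in this state
    user_impact: str     # effects on the end user
    data_affected: bool  # is data affected at all?
    mitigation: str      # what we can do to avoid or fix it
    needs_deploy: bool   # does the fix require a code deployment?
    cleanup: str = ""    # additional cleanup work after the fix, if any


def group_by_symptom(failure_modes):
    """Map each symptom to the descriptions of the issues that show it."""
    groups = defaultdict(list)
    for mode in failure_modes:
        for symptom in mode.symptoms:
            groups[symptom].append(mode.description)
    return dict(groups)


# Hypothetical entries for illustration only.
modes = [
    FailureMode(
        description="the queueing system stops responding",
        symptoms=["jobs not completing", "queue depth growing"],
        user_impact="results are delayed",
        data_affected=False,
        mitigation="restart the queue consumers",
        needs_deploy=False,
    ),
    FailureMode(
        description="the new job crashes on startup",
        symptoms=["jobs not completing", "error alerts firing"],
        user_impact="results are delayed",
        data_affected=False,
        mitigation="merge the prepared revert PR",
        needs_deploy=True,
        cleanup="re-run the missed jobs",
    ),
]

groups = group_by_symptom(modes)
for symptom, descriptions in groups.items():
    print(f"{symptom}: {descriptions}")
```

If an unanticipated failure shows up with a symptom that is already a key in `groups`, the issues listed under that key are a reasonable first place to look for a fix.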
Unfortunately, we can't easily fix or work around all problems. In these situations, we need to consider how to roll back or undo our work. For instance, this could include merging a revert PR for a code deployment, stopping a service, or even deleting data that was created during the event.
We want to prepare for this situation up-front because chances are high that we'd be invoking it at a time when everyone has already been hyper-focused on other details. As a result, it's a time ripe for making simple mistakes. As part of our recent pre-mortem, we had a revert PR prepped and reviewed, ready-to-go before we got started. (We ended up opting to not deploy it at the time, but it was still nice peace of mind. 😅)
- Under what conditions should we consider rolling back changes?
- How do we perform a rollback?
- Who needs to perform the rollback? (e.g., someone with
Mind the Gap
The main participants of a pre-mortem are the people coordinating the event. This includes Product representatives and other team members that have a finger on the pulse of user experience. We're less likely to accidentally overlook something if we have a diverse set of ideas, and a variety of viewpoints from different levels of the tech stack.
But documentation only helps if someone reads it.
Therefore, we should consider our team members' preferences, and tailor the document to their needs and styles.
We enjoy a pretty casual atmosphere at FloQast, which is reflected in a lot of our documentation - plain language and a relaxed tone, with a strong chance of GIFs.
However, some teams may be more comfortable with a structured format: using a formal tone or more technically precise language. Similarly, we may want to control the general tenor of certain topics, to keep things on track.
Striking the right balance can be a challenge, but the important thing is to make it easy for everyone to read and contribute to the pre-mortem. That way we can make sure everyone is on the same page when the going gets rough.