Pre-Mortems to Prevent Post-Mortems: Proactive Engineering

Systems fail. Our systems here at FloQast are subject to failure, as are the ones in your organization, and even those ones over there. With systems failing all the time, and seemingly at the worst times, it’s a wonder that anything works at all in our modern world!

That being said, the reason anything works at all in our modern world is that people have prepared for these failures.

These preparations sometimes go unused and often go unnoticed, but they’re present and crucial. Distributed systems, contingency plans, backups, redundancy – these are all preparations that we rely on to recover from some system failure.

Astronaut accidentally leaves their keys in the lander

Another item in this toolbox is the pre-mortem, which is a method of planning that forces us to think about the ways in which our projects can fail.

A pre-mortem at the project scope can help us include contingencies in our development sprints, whereas a pre-mortem for a code deployment can help us identify risks surrounding particularly tricky code. By doing so, we reduce the chaos and confusion that can so easily accompany a failure in production – an already stressful situation.

In other words, a pre-mortem helps us avoid a post-mortem.

At FloQast, we recently had a pre-mortem for a project in which we changed a critical, time-sensitive process that runs periodically. We captured these details in a document so that we could easily share them with other team members. (And it came in handy when things started behaving abnormally. But that’s a blog post for another day…)

Here are the considerations that we included in our pre-mortem (you'll find an example at the bottom of this page):

Overview of the Project

A brief description of the what and why is always a reliable way to set the stage. How would we go about bringing someone up to speed on this topic if this were the first time they were hearing about it?

  • What exactly is the project changing?
  • Why is the change necessary?
  • Is there any historical context that would be useful for us to share?
  • How would we describe the risk level for the changes?

Expected Timeline and People

At a glance, this lets us gauge progress and identify the go-to folks at each step in case any questions arise. If different actions are happening concurrently, we can also call that out here.

  • When do we expect the festivities to begin? (i.e., start time)
  • Who is involved in the project, and what roles/responsibilities do they have? Who will be monitoring systems, and what systems should be monitored?
  • How long will it take for the changes from the project to take effect? Are we operating within a strict window of time?

It’s ok for us to use approximate or relative times here, even if they’re rough guesses. We’re just looking for a quick way to get a feel for how things are moving along.

What is the Happy Path?

Prepare for the worst… but expect the best!

If everything went exactly according to plan, what would it look like?

We can use the Overview as a launching pad for this, giving us an outline of the steps involved. Next, we add details to each of the steps, including technical breakdowns of the code and systems that are involved along the way.

Presenting this information like a script is helpful because it lets us focus on what we’re expecting to happen. Deviations from the script could be indicators that we’re venturing into the Danger Zone.

What Can Go Wrong?

Now for the fun part: time to brainstorm what kinds of problems can occur!

In typical brainstorming fashion, there are no wrong answers – we should call out anything that can mess things up.

A tiger attacks a baseball player sliding into base

It’s important that we get specific with these issues, and avoid overly general phrasing (e.g., “the queueing system stops responding” vs. “it doesn’t work”), because we also want to use this opportunity to think of ways we can mitigate or even prevent these problems.

  • What is a brief description of the issue?
  • What are the symptoms? How will we know if we’re in this state?
  • What are the effects on the end user?
  • Is data affected at all?
  • What actions can we take to avoid this? Can we fix it without a code deployment?
  • If we fix the issue, is there additional cleanup work afterward?

Patterns often start to emerge when we enumerate these issues. We might even be able to group some of them together, if they have similar symptoms or fixes.

This becomes really useful when we encounter a failure that we hadn’t anticipated in the pre-mortem; it might fit into one of the groups we already identified, giving us a leg up on finding a solution!
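If your pre-mortem lives somewhere scriptable, it can even be worth capturing each issue as structured data rather than free-form prose, so the grouping falls out naturally. Here's a minimal sketch of that idea in Python (the field names and example entries are purely illustrative, not our actual template):

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class FailureMode:
    """One "What Can Go Wrong?" entry from the pre-mortem."""
    description: str
    symptoms: list[str]
    user_impact: str
    data_affected: bool
    mitigations: list[str] = field(default_factory=list)
    needs_deploy_to_fix: bool = False

# Illustrative entries only.
failure_modes = [
    FailureMode(
        description="Queueing system stops responding",
        symptoms=["jobs pile up", "consumers idle"],
        user_impact="Reports are delayed",
        data_affected=False,
        mitigations=["restart the queue workers"],
    ),
    FailureMode(
        description="Worker crash-loops on malformed messages",
        symptoms=["jobs pile up", "error spike in logs"],
        user_impact="Reports are delayed",
        data_affected=False,
        mitigations=["quarantine the bad messages", "restart the queue workers"],
        needs_deploy_to_fix=True,
    ),
]

# Group issues that share a symptom, since similar fixes often cluster together.
by_symptom = defaultdict(list)
for mode in failure_modes:
    for symptom in mode.symptoms:
        by_symptom[symptom].append(mode.description)

for symptom, descriptions in by_symptom.items():
    print(f"{symptom}: {descriptions}")
```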

Rollback Considerations

Unfortunately, we can’t easily fix or work around all problems. In these situations, we need to consider how to roll back or undo our work. For instance, this could include merging a revert PR for a code deployment, stopping a service, or even deleting data that was created during the event.

Reversed animation of a racecar traffic jam

We want to prepare for this situation up front because chances are high that we’d be invoking it at a time when everyone has already been hyper-focused on other details. As a result, it’s a time ripe for making simple mistakes. As part of our recent pre-mortem, we had a revert PR prepped, reviewed, and ready to go before we got started. (We ended up opting not to deploy it at the time, but it was still nice peace of mind.)

  • Under what conditions should we consider rolling back changes?
  • How do we perform a rollback?
  • Who needs to perform the rollback? (e.g., someone with merge permissions?)

Mind the Gap

The main participants in a pre-mortem are the people coordinating the event, including Product representatives and other team members who have a finger on the pulse of the user experience. We’re less likely to accidentally overlook something if we have a diverse set of ideas and a variety of viewpoints from different levels of the tech stack.

But, documentation only helps if someone reads it.

Therefore, we should consider our team members’ preferences, and tailor the document to their needs and styles.

We enjoy a pretty casual atmosphere at FloQast, which is reflected in a lot of our documentation – plain language and a relaxed tone, with a strong chance of GIFs.

Characters from Regular Show unsuccessfully pass out flyers for their movie night

However, some teams may be more comfortable with a structured format, using a formal tone or more technically precise language. Similarly, we may want to control the general tenor of certain topics to keep things on track.

Striking the right balance can be a challenge, but the important thing is to make it easy for everyone to read and contribute to the pre-mortem. That way we can make sure everyone is on the same page when the going gets rough.


Example

(Note: this is not an actual FloQast pre-mortem; we’re already running on Atlas.)

Overview

With our unprecedented company growth, we’re seeing a huge influx of new customers. That means a lot of new data and traffic hitting our servers. We’ve been having problems keeping up with this increased load on our DB cluster lately, particularly with maintenance and scaling (see this related incident). We’ve decided to move to the MongoDB Atlas service to help alleviate some of this pressure and let us better focus on adding new features. We’re using the Atlas Live Migration Service, which should help minimize risk in this transition.

Expected Timeline and People

All times are PDT and approximate.

Monday, 6pm (after close of business): Jess (DevOps)
  • Verify the health of the Atlas instance
  • Verify that the data in Atlas is in sync with the legacy DB
  • Give go/no-go signal

Throughout the process: Jess (DevOps)
  • Monitor health of the new and legacy DBs

6:30pm: Liz and An (Eng)
  • Merge connection PR for the Rube Goldberg Service (RGS)
  • Create revert PR

6:45pm: Kris and Luis (QE)
  • Verify RGS is functional

7:00pm: Liz and An (Eng)
  • Merge connection PR for the app backend
  • Create revert PR

7:15pm: Kris and Luis (QE)
  • Verify the app is functional

7:15pm: Jess (DevOps)
  • Verify no lingering connections to the legacy DB (see the sketch after this timeline)
  • Cut access to the legacy DB

8:00pm: (team)
  • Transition complete
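For the “no lingering connections” step, here’s a rough sketch of the kind of quick check Jess might script against the legacy cluster, assuming direct admin access (the URI below is a placeholder, not a real connection string):

```python
from pymongo import MongoClient

# Placeholder URI for the legacy cluster; the real (secret) one comes from deploy config.
LEGACY_URI = "mongodb://legacy-db.internal.example:27017"

client = MongoClient(LEGACY_URI)

# serverStatus reports how many client connections the node currently has open.
connections = client.admin.command("serverStatus")["connections"]
print(f"current: {connections['current']}, available: {connections['available']}")

# Our own check holds a connection too, so we compare 'current' against the baseline
# we see with no app traffic; anything above that suggests something is still pointed
# at the legacy DB.
```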

What is the Happy Path?

Our new DB hosted on Atlas has been keeping in sync with our current (legacy) DB via the Atlas Live Migration Service. Starting Monday evening (per the timeline above), Engineering will merge two PRs to point the application to the new database:

  1. Rube Goldberg Service (RGS), which makes sure the planders don’t get frouzzled
  2. Backend, which provides the data needed by the app

After each PR is merged, QE will see that the application is behaving as expected without any interruption in service. Users will have no idea they are actually connected to the new DB.

DevOps will see that the application is no longer connected to the legacy DB, and will isolate it so that we can safely retire it next week.

The process is expected to take a couple of hours but could finish sooner depending on how long the deploys and testing take.
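For context, the connection PRs should be tiny, assuming each service reads its MongoDB URI from configuration; flipping to Atlas (and reverting, if needed) is then essentially a one-line change. Here’s a rough sketch of that pattern, with placeholder names (MONGO_URI, get_db) rather than our actual code:

```python
import os
from pymongo import MongoClient

# The connection PR only changes where MONGO_URI points (legacy cluster -> Atlas).
# The default below is a placeholder; real values live in deploy config / secrets.
MONGO_URI = os.environ.get(
    "MONGO_URI",
    "mongodb+srv://cluster0.example.mongodb.net",  # Atlas-style SRV URI (placeholder)
)

_client = MongoClient(MONGO_URI)

def get_db(name: str = "app"):
    """Hand back a database handle; callers don't care which cluster is behind it."""
    return _client[name]
```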

What Could Go Wrong?

Problem: Atlas Live Migration Service not syncing with the legacy DB
Symptoms:
  • We see data differences between the new and legacy DB instances
Possible actions:
  • Check firewall and ensure traffic can get through
  • Verify lag time to see if that explains the differences (see the sync check sketched after this table)
  • Consider canceling the transition if we can’t resolve it (and need to investigate more)

Problem: Sign-up page is unresponsive
Symptoms:
  • Planders are frouzzled
Possible actions:
  • Restart the Rube Goldberg Service container

Problem: User profiles take a while to load
Symptoms:
  • Planders are frouzzled
Possible actions:
  • Restart the Rube Goldberg Service container

Problem: Rube Goldberg Service keeps frouzzling
Symptoms:
  • Container restarts are not helping
Possible actions:
  • Merge the revert PR for RGS

Problem: Entire site is unresponsive
Symptoms:
  • Pages other than Sign-up and User Profile hang
  • Connection timeout errors
Possible actions:
  • Check permissions on the new DB
  • Check firewall and connectivity
  • Consider rollback
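For the “not syncing” row, here’s a rough sketch of the kind of sanity check we could run to compare the two clusters, assuming read access to both (the URIs and database name are placeholders). Comparing document counts per collection is crude, but it’s a fast way to spot obvious drift:

```python
from pymongo import MongoClient

# Placeholder URIs; real connection strings come from our secrets manager.
LEGACY_URI = "mongodb://legacy-db.internal.example:27017"
ATLAS_URI = "mongodb+srv://cluster0.example.mongodb.net"
DB_NAME = "app"  # placeholder database name

legacy_db = MongoClient(LEGACY_URI)[DB_NAME]
atlas_db = MongoClient(ATLAS_URI)[DB_NAME]

# Compare per-collection document counts between the legacy and Atlas clusters.
for collection in sorted(legacy_db.list_collection_names()):
    legacy_count = legacy_db[collection].estimated_document_count()
    atlas_count = atlas_db[collection].estimated_document_count()
    status = "OK" if legacy_count == atlas_count else "MISMATCH"
    print(f"{status:8} {collection}: legacy={legacy_count} atlas={atlas_count}")
```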

Rollback Considerations

Engineering will create revert PRs for the connection PRs upon merging. If we encounter any major problems during the transition, we can revert to hitting the legacy DB; better that we do this right than fast!

If problems occur after the transition has been completed, and the legacy DB has been isolated, Engineering will need to coordinate with DevOps on getting the DB available again before applying the revert PRs.

Joseph Miller

Joe is a Senior Software Engineer at FloQast, contributing to direct ERP integrations and supporting other teams. In his spare time he enjoys playing video games with his kids, and playing video games without his kids.


