Alex Bartlow · 2024-05-24 · platform, devops

The most important system to keep running

I heard a ring, and my heart rate spiked to 132. It was Tom Bailey, a colleague on the Product Success team.

I prepared myself for a cheerful British voice to deliver bad news — the only reason for Tom to call me would be to get engineering's attention on a major system outage. My suspicions were confirmed, and I felt a surge of adrenaline as if a driver in front of me had stood on their brakes. My body was getting me ready to fight, but I needed to think. I started to breathe, knowing that the most important system to keep running right now was my own limbic system.

Managing your own emotional state is a job that has only one qualified candidate, and the physiological systems entrusted to you do not have backups. Our most stressful moments as engineers come when the software has gone all wrong, and those are exactly the moments when it is most critical to keep our own wetware running. That is the primary topic of this post. But none of us live in a vacuum. We all function as part of a larger team, so this post also offers pointers for leaders and managers to consider. With all of this in mind, let's dive in.

Three goals for managing incidents

In true Aha! style, we want to be goal-first. Our three goals for incident management are to coordinate and control a response, to regain normalcy, and to do so sustainably.

Goal 1: Coordinate and control the response

The first goal is to coordinate and control the response to an extraordinary, dangerous, and stressful situation.

Of course, we must assess a situation's severity to understand how to manage it and who should be involved. Situations that are not extraordinary are routine. They might be dangerous and stressful, but they hopefully occur often enough that operators are trained to handle them safely. Situations that are not dangerous are low-stakes; failure would simply mean trying again later or absorbing the cost of that failure easily. Situations that are not particularly stressful simply do not exert enough force per engineer to warrant a special procedure, even if they are extraordinary and high-risk. (This might happen if your operations team is especially skilled.) On the other hand, critical situations often carry severe, legal, and even existential risks for a business that cannot resolve them.
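One way to read the three tests above is as a simple filter: a situation only earns a special procedure if it is extraordinary, dangerous, and stressful all at once. Here is a minimal sketch of that reading. The `Situation` type, the `triage` function, and the order of the checks are illustrative assumptions on my part, and each flag is still a human judgment call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Situation:
    # Hypothetical flags; each one is a judgment made by a person, not a metric.
    extraordinary: bool
    dangerous: bool
    stressful: bool


def triage(s: Situation) -> str:
    """Return a coarse label based on the three tests described in the text."""
    if not s.extraordinary:
        return "routine"  # Operators are trained to handle it safely.
    if not s.dangerous:
        return "low-stakes"  # Failure can be retried or absorbed.
    if not s.stressful:
        return "no special procedure"  # The team can take it in stride.
    return "incident"  # All three hold, so manage it deliberately.
```

For example, `triage(Situation(extraordinary=True, dangerous=True, stressful=False))` returns `"no special procedure"`, which matches the case of an especially skilled operations team.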

Extraordinary, dangerous, and stressful situations (hereafter, I will refer to these as "incidents") are often unmanaged during a company or team's early stages. There are simply not enough people dedicated to operations. Likewise, there are not enough lessons to draw from to build an effective body of institutional knowledge that will become a management framework.

You will know when it is time to adopt a management framework: right after your first unmanaged incident. But if your company is past the early startup stages and is in the process of scaling up, you would do well to adopt a framework early on regardless. Chapter 14 of Google's book Site Reliability Engineering, "Managing Incidents," outlines a pretty good framework. If you haven't read it yet, stop reading this article until you have.

Goal 2: Regain normalcy

The second goal is to regain normalcy by trading big problems for smaller ones. This is a topic for another post in the series, but I will touch on it here.

Your mindset when resolving incidents will be radically different than when you are doing other types of engineering. You have to think rationally and methodically about what the root cause of the incident could be, prove that theory, and then create a permanent solution. But you must also adopt mitigating strategies that restore service partially, keep the problem from getting worse, or even communicate effectively about the issue.

In addition to thinking like a detective attempting to identify the "culprit," you also have to think like an EMT whose primary concern is getting the patient to the hospital with a pulse. Tourniquets serve a purpose: even if they endanger the limb on which they are placed, their use is sometimes necessary to save the patient's life. Similarly, tools at our disposal such as rate limiting, temporary CAPTCHAs, or even choosing to bring the site down trade a big problem (like a security vulnerability or flooded servers) for smaller problems (like bad UX or poor performance).
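To make the tourniquet idea concrete, here is a minimal sketch of one such tool, a token-bucket rate limiter. It is an illustration under my own assumptions rather than anything we run in production: the `TokenBucket` class, its parameters, and the injectable `clock` are hypothetical names chosen for the example.

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate          # Tokens added per second.
        self.capacity = capacity  # Maximum tokens the bucket can hold.
        self.clock = clock        # Injectable so the limiter can be tested.
        self.tokens = capacity    # Start full so normal traffic is unaffected.
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        # Out of tokens: the caller would typically respond with HTTP 429.
        return False
```

Rejected requests are the smaller problem here. Some users see errors or slowness, but the servers behind the limiter stay up while you look for the root cause.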

Goal 3: Do so sustainably

The third goal is to do so without burning out your operations team.

Institutional knowledge is expensive. Hiring is difficult, and retaining good engineers is doubly so. Even if it weren't a moral imperative for managers to look out for the good of their employees, it would simply make good business sense to protect your team's morale and mental health. Most people talk about burnout in the context of large workloads. Although heavy workloads can cause burnout, I believe the far more dangerous cause is a lack of control. If a person's sense of agency is removed when they walk into work, they will choose to preserve it by making the one choice available to them: walking out. A constant stream of incidents constrains the freedom of operations team members. They cannot work on solving new or interesting problems because they are constantly fighting for survival. Similarly, a lack of psychological safety in an organization's culture might mean that they have to weather constant blame and criticism.

What engineers can do to operate effectively

Breathe

The Latin word "spiritus," meaning breath, gives us "spirit" and shares a root with "respiration." In other words, breath is life. Deep breathing activates your parasympathetic nervous system, which takes you out of "fight, flight, or freeze" mode. Operating out of a lower-brain panic and preparing for a fight is a wonderful mode to engage when threatened by a lion. It is marginally less useful when you have to inspect transmission control protocol traffic or help a failing database recover. For these sophisticated tasks, you require your full brain, and the best way to marshal all of your neurons is to take a few seconds to breathe.

No incident is going to spiral out of control due to six seconds of inaction, but many can be made worse by a decision made in haste. You owe it to yourself (and to your company) to compose yourself and think carefully before deciding on an intervention. As often as necessary, close your eyes, take a long two-second inhale, hold it for two seconds, and exhale for a long two seconds.

Be

Dialectical behavior therapy (DBT) is a type of talk therapy for people who experience strong emotions. And as mentioned earlier, incidents bring intense feelings. A core DBT skill is mindfulness: being fully rooted in the present moment without worrying about the past or the future.

This might sound too tangential for a blog post about managing an engineering incident. But if we are honest, many of our thoughts in the midst of a crisis are worries about the past: "Did I do something to cause this?" "Is there something I could have done differently?" "Did I miss an alert?" Likewise, our thoughts might also be worries about the future: "Am I going to get in trouble for this?" "Is this going to kill the company?" "Am I going to make it home for my kid's game?"

Applying the skill of mindfulness helps us be present in the here and now — so we can solve the problem at hand and root ourselves in what we can actually control. To practice it, try making three statements about things that you see, hear, feel, or smell. Then, make one clear statement about the problem at hand. For example:

My coffee is cold.
I am hungry.
There is a cardinal outside of my window.
The database is receiving too many connections.

This short exercise is based on the 5-4-3-2-1 method for calming anxiety. It works by rooting you in the problem at hand and pushing away irrelevant and unhelpful worries about what led to this present situation or the potential aftermath (which is a smaller problem for tomorrow).

Believe

How you view the world will greatly affect your actions within it. If you believe (and many do, whether or not they might admit to it) that good things happen to good people, bad things happen to bad people, and that the goal of life is to be happy, then an incident is not a technical problem. It becomes an existential crisis — because this is a bad thing happening to us (which means we are bad), and it is blocking our life goal.

Author Timothy Keller puts it like this in Walking With God Through Pain and Suffering:

"If you accept [this sort of worldview] … then that which gives your life purpose would have to be some material good or this-world condition — some kind of comfort, safety, and pleasure. But suffering inevitably blocks achievement of these kinds of life goods. Suffering either destroys them or puts them in deep jeopardy. As Dr. Paul Brand argues in the last chapter of his book The Gift of Pain, it is because the meaning of life in the United States is the pursuit of pleasure and personal freedom that suffering is so traumatic for Americans."

If your worldview has no room for chance accidents, for malevolent actors attempting to extort hardworking people, or for human mistakes and frailty, then you have no business being an engineer. Our profession exists to carve out an ordered space where humans can flourish in an otherwise chaotic world. Roads, bridges, ships, taxes, and even software form the framework of human flourishing. And that framework suffers from entropy, like everything else in our universe.

I am not suggesting that every engineer has to join a seminary, practice meditation, or read up on humanist literature (though our field would be better for it!). What I am suggesting is that as first responders to technical problems, we have to understand that these problems are a normal, expected part of life. Hard drives fail, code is written in error, and bad actors will use SQL injection and ransomware just as earlier generations of miscreants used clubs and rocks. Standing against the forces of entropy, chaos, and, yes, even evil is what we are paid to do. Incidents, then, are not primarily interruptions of our status quo. They are the highest and noblest moments of our profession.

What managers must do for their teams

There are things you must do as a leader beyond the obvious. Watching for burnout, encouraging employees to take PTO when required, and ensuring the engineering team has the resources (time, money, or hardware) necessary to solve problems are the minimum responsibilities expected from a manager. To create a truly hardy organization that can engineer a highly resilient system, good leadership is required.

Create compassionate accountability

Much has been said about creating cultures of blameless retrospectives. However, our nature is that when we know a mistake has been made, we quietly look at the person who made the mistake. Or perhaps even worse, we pointedly avoid eye contact. It is the leader's job to provide the needed accountability — and we as leaders are the ones accountable.

Gene Kranz, flight director for the Apollo 11 mission, said: "[My] job as flight director is to take the actions necessary for crew safety and mission success … [In my] line of work there is neither ambiguity or a higher authority. It is go, or no go. And I am accountable for the mission."

And Jocko Willink recounted a particularly relevant conversation between him and a junior officer in the Navy SEALs in The Dichotomy of Leadership: "We are responsible ... It was our strategy. We came up with it. We knew the risks. You planned the missions. I approved them. We were the leaders. And we are responsible for everything that happened during that deployment. Everything. That's the way it is. We can't escape that. That is what being a leader is."

During the after-action analysis, you should be the first to admit where decisions you made in the past contributed to an issue and what you personally could have done differently. You also should not expect your team members to similarly out themselves. If necessary, provide compassionate 1:1 coaching to team members who could have done a better job. But while you're together as a team, your job is to build the team.

Build a team

After they've shipped their first bug, I tell every new hire on my team some variation of the following: "You wrote that code. But someone else reviewed it, and I ultimately deployed it. And you followed all of our style guides and existing documentation when you did it. We are a team, and we are going to handle this as a team."

Leaders should already be well-read on building good, high-performing teams, but a reread of Patrick Lencioni's The Five Dysfunctions of a Team is never a bad idea. Lencioni highlights the importance of trust, productive debate, and commitment — and those attributes, when built into a team, pay off during an incident.

On a good team, we can trust our engineers' judgment. They are not afraid to put up a hand and stop someone else from making a mistake, and they are committed to the success of the mission. Those are attributes that have to be built into the team months before the metaphorical pager goes off. They grow at every daily standup and sprint planning meeting.

If you've done it right, you will know when your team is able to handle even a relatively severe incident without your help. If you've done it wrong, a minor incident will reveal the cracks in your organization. Take the opportunity to learn and shore up your foundation when these small fractures form; they will only worsen with time.

Give control to your engineers

Just because leaders are ultimately accountable for the team's success does not mean that they need to micromanage everything that happens in the organization. Engineers with a sense of ownership over the system will see the weak points before they become fractures. And those with a sense of empowerment will either fix the problem themselves or advocate for a feature to be put on the board and prioritized to solve the problem early.

Your job as a leader is to foster this creativity, ownership, and empowerment as much as possible — all while reining in ideas that are likely to result in failure down the line.

This approach might result in mistakes being made. But if your plan as a manager is to have a "mistakeless" organization, you are guaranteed to fail. We want to build a resilient organization instead, and one of the ways to do that is to empower our team to execute on clear objectives.

Putting it all together

I believe teams that function with high degrees of collaboration, delegated control, and compassionate accountability can solve most problems. When those attributes are also present in a team made up of mindful, spirited, and hardworking engineers, they can solve nearly any problem put before them.

The technical problems we must overcome are far less daunting than creating an organization capable of solving them without imploding. In later articles in this series, we'll look into creating effective communication channels as well as how to adopt an appropriate "triage" mindset when navigating problems. Regardless, creating this level of psychological safety in your team is sure to pay off — even if the "pager" never goes off.

And before we conclude, I'd be remiss not to come back to the incident referenced in the beginning. It was detected in less than five minutes (thanks to a great communications structure), was tied to a root cause in another three minutes (thanks to an engineer who built a very impressive operations dashboard), and was mitigated in another five minutes (thanks to a different engineer who specified the necessary procedures and integrated them into our runbook). It was a stressful day, but I couldn't have been more proud of the team that made solving a scary problem fairly trivial.

I'm grateful that because of the hard work of our team, we don't have too many days like that. If you'd like to be part of a high-performance, high-empathy team, we'd love it if you joined us.