8 February 2022
From checklist to service: scaling Stripe’s incident response
What: Eng blog post
Written for: Retool blog
Links: Blog post, Twitter thread
My role: Full stack work, from concept to completion. Came up with the topic, researched the piece, engaged the guest and conducted the interview. Wrote the piece. Designed and coded the companion open source Incident Central app.
The importance of names, the surprising benefits of restricting your product, and other lessons from an engineer who built Stripe’s “Big Red Button”
When Andreas Fuchs joined Stripe as an engineer in 2012, the company was about 25 people located in a former wedding chapel hidden behind a wine bar in Palo Alto, working on one main product, the payments API. Over the next nine years, he saw the company grow to over 3000 people across the globe and launch numerous new products that would make Stripe a leading online payments provider, handling payments for companies like Amazon, Google, Shopify and Lyft.
Among the many projects Andreas took on was building Stripe’s internal incident response tools. He helped launch the first version of the main tool (internally nicknamed the “Big Red Button”) and get it successfully adopted. He later joined as one of the first engineers on the Reliability Tooling team and adapted BRB to serve the entire incident response lifecycle.
With this experience at Stripe, Andreas has a uniquely broad view of how incident response evolves as a company grows from baby startup to big player.
We had the pleasure of sitting down with Andreas to mine his wisdom: from brass-tacks advice on how and when to build your own tools and how to roll them out, to reflections on the human factors important to a successful incident response program.
Want a tool like the Big Red Button for your org? Check out the open source Incident Central app, built in Retool.
Genesis of Stripe’s incident response tool
Stripe’s incident response tool, a service nicknamed the “Big Red Button” (or “BRB” for short), was first built around 2017 when the company was a few hundred people. There was already a “tool” in place at the time: a simple checklist that described the steps to starting and running an incident. This checklist had been spun up by a single employee years earlier and thereafter collaboratively edited. Out of the gate, any new tool would need to compete with this checklist. But there was opportunity too: the incident process required a number of tedious manual steps that a service could automate.
Even starting up an incident was harder than it should have been. One point of friction was simply choosing which Slack channel to talk in. At the time, Stripe coordinated incident response in one Slack channel named #warroom. On the off chance there was a second simultaneous incident, there was #warroom2. And then, #warroom3 came to be. As a form of courtesy, many teams began to coordinate “smaller” incidents in unofficial side channels, but this meant less visibility into the overall number of incidents.
Andreas and another engineer, Kiran Bhattaram, built BRB to tackle this problem first. The first version of BRB let anyone at Stripe report an incident through a modest user interface: a form specifically designed to be dead simple, so that reporting an incident was fast and easy. BRB would then create a single dedicated Slack channel for discussing the new incident, and invite first responders to that channel.
Notably, this was a fairly simple technical service, but it changed the way people were used to doing things. So, the team wanted to be thoughtful about how they rolled out BRB to the rest of the org.
Earning adoption for the “Big Red Button”
According to Andreas, there were two aspects of the initial version of BRB that made it successful out of the gate: first, the way the service set up an incident with a unique and user-friendly name; and second, the way the team earned adoption for the tool within Stripe.
First, BRB was designed to create a dedicated Slack channel for each incident using a very particular naming convention: it generated a unique but human-memorable name by randomly picking and combining two commonly used English words. For example, it might pick the words “friendly” and “llama” to form the name “friendly-llama”. Compared to something like a UUID, this kind of name is far easier to remember and to use in spoken conversation. The Slack channel that BRB created for the incident would look like “#incident-friendly-llama”, and this name became the canonical way people would refer to an incident: in speech, in writing, everywhere.
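As a rough sketch of how a scheme like this might work (the word lists and the collision check here are hypothetical; the post doesn’t describe BRB’s actual implementation), in Go:

```go
package main

import (
	"fmt"
	"math/rand"
)

// Hypothetical word lists; BRB's real lists aren't described here.
var adjectives = []string{"friendly", "brave", "quiet", "swift", "gentle"}
var nouns = []string{"llama", "otter", "falcon", "maple", "comet"}

// incidentName combines a random adjective and noun into a memorable,
// speakable name like "friendly-llama", retrying if the name is taken.
func incidentName(taken map[string]bool) string {
	for {
		name := fmt.Sprintf("%s-%s",
			adjectives[rand.Intn(len(adjectives))],
			nouns[rand.Intn(len(nouns))])
		if !taken[name] {
			return name
		}
	}
}

func main() {
	taken := map[string]bool{"friendly-llama": true} // names already in use
	fmt.Println("#incident-" + incidentName(taken))  // e.g. #incident-brave-otter
}
```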
Before BRB was rolled out, there had been internal debate about whether separate Slack channels for incidents were a good idea; but this naming convention, combined with the idea of separate channels, came to be universally loved by Stripes. Andreas points to the naming convention as one of the best design decisions in BRB. “Names have great power,” he said. “You definitely want to name incidents in a way people remember.”
Second, the team thought carefully about how to roll out BRB to the org. Andreas noted there were some early detractors, but the team focused on winning them over through product changes and turned them into supporters. For example, some people objected that having separate Slack channels would make it harder to keep track of all incidents within Slack. In response, the team set up an optional “incident watchers” group that these people could join to get automatically added to every incident channel. This change had an unplanned benefit: the “watchers” would notice when two people unknowingly kicked off separate responses to the same problem, and merge the duplicate channels.
“Names have great power. You definitely want to name incidents in a way people remember.”
One other important aspect of the rollout was that the team allowed people in the org to choose to use the tool, rather than forcing them to adopt it. There were two prongs to this approach.
The first prong was the fundamental choice to automate a part of the incident process that only required one user: kicking off an incident. (In contrast, automating other parts of the incident process would have required multiple users.) This meant only one person needed to learn about and use the new tool, which made onboarding easier. Other responders who came to help later would essentially continue to “do things the old way” via the checklist. (And Andreas noted that in general, it’s helpful to make a new tool compatible with the existing process where possible.)
Second, and more importantly, the team artificially restricted the initial version of BRB so that users could not create a Severity 1 (most severe) incident with it. Users could only create Severity 2 or Severity 3 incidents, and had to use the original checklist to kick off Severity 1 incidents. The rationale was that the tool was new and its builders didn’t yet consider it sufficiently vetted. They kept the tool restricted like this for a few months, until users were all but pulling it out of the team’s hands: people were repeatedly asking to create Severity 1 incidents in BRB. “That’s when we knew it was going to be a big success,” said Andreas.
Expanding BRB to the full incident lifecycle
Eventually, BRB’s functionality expanded beyond kicking off incidents to cover the rest of the lifecycle, from assisting during active response through post-incident follow-up.
One critical development that enabled BRB to expand its feature set was the founding of a dedicated team, the Reliability Tooling team, to own and continue adapting the service (and, more broadly, to safeguard the resilience of Stripe’s products). Up until this point, BRB had been a largely grassroots effort that found informal shelter under the wing of the Observability team.
Andreas joined the Reliability Tooling team as one of its first engineers. He thinks the team played an important role for Stripe and certainly for BRB. A dedicated owner was critical for BRB’s future, because as Stripe grew, the incident process inevitably evolved. And when a process evolves, tools must evolve with it; otherwise tools become irrelevant—or worse, a drag. Andreas highlighted that a lot of the work on the team was, as expected, organizational rather than purely technical.
Stepping back, Andreas reflected on the order in which BRB addressed problems in the incident lifecycle, and he recommends the order that it took:
First, make it easy to report an incident and get the right people together and talking to each other. Concrete examples: Creating a simple UI to report an incident, which then triggers the appropriate PagerDuty on-call rotation and spins up a dedicated Slack channel for the incident. (A minimal sketch of this step follows the list.)
Second, add human-overridable tooling/automation to reduce the most tedious and error-prone parts. Concrete examples: Making it easy to flip relevant feature flags, convey highlights about the incident inside the company, and update public-facing status pages.
Finally, maximize what the org learns from each incident. Concrete examples: Making it easy to collate data from an incident, and sending automatic nudges to people to complete post-incident remediation steps.
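As a rough illustration of that first step (not BRB’s actual code), here’s a minimal Go sketch that, given a generated name, creates the dedicated Slack channel through Slack’s public conversations.create method and pages the on-call through PagerDuty’s Events API v2. The environment variable names are assumptions, and retries, response parsing, and inviting responders are left out:

```go
package incident

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// Kickoff sketches step one of the lifecycle: create the #incident-<name>
// Slack channel and trigger the on-call rotation. Tokens come from the
// environment; error handling is pared down to the minimum.
func Kickoff(name, summary string) error {
	// 1. Create the dedicated channel via Slack's conversations.create method.
	channel, _ := json.Marshal(map[string]any{"name": "incident-" + name})
	req, _ := http.NewRequest("POST", "https://slack.com/api/conversations.create", bytes.NewReader(channel))
	req.Header.Set("Authorization", "Bearer "+os.Getenv("SLACK_TOKEN"))
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return fmt.Errorf("create slack channel: %w", err)
	}
	resp.Body.Close()

	// 2. Page the appropriate on-call rotation via PagerDuty's Events API v2.
	event, _ := json.Marshal(map[string]any{
		"routing_key":  os.Getenv("PAGERDUTY_ROUTING_KEY"),
		"event_action": "trigger",
		"payload": map[string]any{
			"summary":  summary,
			"source":   "incident-" + name,
			"severity": "critical",
		},
	})
	resp, err = http.Post("https://events.pagerduty.com/v2/enqueue", "application/json", bytes.NewReader(event))
	if err != nil {
		return fmt.Errorf("trigger pagerduty: %w", err)
	}
	resp.Body.Close()
	return nil
}
```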
Important organizational support for incident response
In the final part of our chat, Andreas turned to a part of incident response at Stripe that is equally important, if not more so: the organizational support structures. As we saw earlier, one enabler of BRB’s sustained success was the founding of the Reliability Tooling team, a dedicated team to work on the service. Andreas mentioned two other notable organizational structures: the Incident Review meeting, and full-time Incident Response Managers.
Incident Review is a regular meeting where executives and engineers meet to go over recent incidents and the follow-up action items. Andreas recalled that these meetings often involved uncomfortable questions, which he saw as a feature, not a bug. “It’s hard to overstate how powerful it is for the leader of your entire org branch to be there and to ask you why certain work hasn’t been prioritized,” he said. It was equally powerful to be able to give an honest answer and use the meeting as an open forum to discuss a solution. This meeting was what drove the most learning and positive outcomes from incidents. “You can have as much software as you want,” said Andreas, “but if people don’t draw the right conclusions and steer the org in the right direction, it’s not going to be helpful.”
More recently, Stripe has also hired a globally distributed team of full-time Incident Response Managers to serve the role of “incident commander”—the primary point of contact and overall coordinator during an incident. This role used to be filled by a rotation of volunteers, who would come from different teams and roles across Stripe. Incident Response Managers also attend all Incident Review meetings. To Andreas, the formation of the Incident Response Management team was a recognition that coordinating incidents was a hard thing to do well and a job in its own right. He recalls this team provided rich product feedback to the Reliability Tooling team because they were a relatively small, dedicated group who saw every single incident, and thus intimately knew where the process could be better.
“You can have as much software as you want. But if people don’t draw the right conclusions and steer the org in the right direction, it’s not going to be helpful.”
Through the way the initial BRB service was rolled out, the open discussion during Incident Review meetings, and the formation of specific teams to support both incident tooling and operations, it’s clear that Stripe kept people at the center of its incident response program.
And if Andreas had one takeaway to sum up his experience, it might be just that: incident response is an extremely human process. “The more you acknowledge that, and the more you shape the process such that it lets humans do their thing, the better.”
Technical design decisions when building your own service
If you want a service like BRB in your organization, how might you go about it?
The first question Andreas recommends thinking about is: do you want to build it? What parts can you outsource? In general, buy as much as you can, and build only the parts you need or want to customize. Back in 2017, there were few third-party solutions available, which played into Stripe’s decision to build a tool from the ground up. Even so, BRB offloaded critical pieces of functionality to off-the-shelf tools like PagerDuty, Slack, and Jira.
Today, there are more choices on the market. When evaluating a third-party tool for incident response, one question Andreas asks is, “What’s the integration story?” What tools do you already use internally, and does this new third-party tool integrate with them? For example, if you use a specific task tracker, does this tool let you create a ticket in that tracker?
If you go down the path of building a tool, Andreas offered a few specific technical tips that came out of his experience with BRB:
Consider hosting the incident data in your own database, as opposed to relying on a third-party tool like Jira to be your “database.” Initially, BRB had no database of its own and offloaded its storage to Jira; but this setup became unwieldy once the team wanted to expand BRB to stages of response beyond kicking off an incident. Having control of your data model gives you the flexibility to evolve the product experience as you like. In the end, the core database for BRB stored information like the ID of the PagerDuty incident, the ID of the related Jira ticket, etc., and BRB would display a combined view of data queried from all of those different tools.
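To make that concrete, here is one hedged sketch of what such a “glue” record might look like; the field names are illustrative, not BRB’s actual schema:

```go
package incident

import "time"

// Incident is the small row the service owns itself; it mostly holds
// pointers into the other systems of record, and the service renders a
// combined view by querying those systems with the stored IDs.
type Incident struct {
	ID                  string     // human-memorable name, e.g. "friendly-llama"
	Severity            int        // 1 = most severe
	State               string     // "open", "mitigated", "resolved", ...
	SlackChannelID      string     // the dedicated #incident-... channel
	PagerDutyIncidentID string     // the page that got responders moving
	JiraTicketKey       string     // follow-up / remediation tracking
	CreatedAt           time.Time
	ResolvedAt          *time.Time // nil while the incident is still open
}
```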
For the back end, pick a language that’s good at async operations. BRB was written in Go, and people were pretty happy with that choice and with the fact that it was decoupled from the main pieces of Ruby infrastructure at Stripe. Andreas predicts that if they’d built it in the default Ruby environment, with the dependency cycle that came with it, they “would have been in a ton of pain, and the tool would have been unavailable much more during the times that it was needed most.”
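To make the “good at async” point concrete, here’s a small sketch (again, not BRB’s code) of the kind of concurrent fan-out a kickoff involves, using goroutines via golang.org/x/sync/errgroup; the three helpers are stubs standing in for real API calls:

```go
package incident

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// notifyAll runs the independent side effects of opening an incident
// concurrently, so one slow dependency (Slack, PagerDuty, the status page)
// doesn't hold up the others.
func notifyAll(ctx context.Context, name string) error {
	g, ctx := errgroup.WithContext(ctx)
	g.Go(func() error { return createSlackChannel(ctx, name) })
	g.Go(func() error { return pageOnCall(ctx, name) })
	g.Go(func() error { return draftStatusPage(ctx, name) })
	return g.Wait() // first non-nil error, after every call has returned
}

func createSlackChannel(ctx context.Context, name string) error { return nil } // stub
func pageOnCall(ctx context.Context, name string) error         { return nil } // stub
func draftStatusPage(ctx context.Context, name string) error    { return nil } // stub
```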
Testing a tool like this is tricky. Consider scheduled e2e tests of the system. Incident tools are (hopefully) infrequently used, so traditional end-to-end monitoring is less effective due to the lack of traffic. At the same time, when incident tools are needed, it’s critical that they work. One tactic to increase the reliability of your incident tools is to schedule regular test runs of the system, in which the whole org knows it’s a test, the on-calls expect to be paged, and so on.
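One way such a drill might be structured, sketched on top of the hypothetical Kickoff helper above (the schedule, naming, and alerting details are all assumptions):

```go
package incident

import (
	"context"
	"fmt"
	"time"
)

// RunScheduledDrill opens a clearly labeled test incident on a fixed
// schedule, exercising the same path a real incident would take. The org
// and the on-call rotation know when drills run, so the page is expected.
func RunScheduledDrill(ctx context.Context, every time.Duration) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			name := fmt.Sprintf("test-drill-%d", time.Now().Unix())
			if err := Kickoff(name, "[TEST] scheduled incident-tooling drill"); err != nil {
				// A failing drill is itself worth treating as a real problem
				// for the tooling team.
				alertToolingTeam(err)
			}
		}
	}
}

func alertToolingTeam(err error) { /* stub: page the tooling team's own on-call */ }
```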
Finally, Andreas harkened back to the first incident “tool” at Stripe, the humble checklist. Even after Stripe built more complex tools, Andreas said they kept a checklist like this as a backup. This is because, in the very worst case, an incident can disrupt the incident tools themselves, so you want to have a fallback that won’t experience downtime. Documentation—ideally hosted in multiple places—is a practical form of this fallback.
Want a tool like the Big Red Button for your org? Check out the open source Incident Central app, built in Retool.
For more reading on incident response, Andreas recommends the Downtime Project podcast.