I recently met with a large industrial company with about 35,000 users. They have operations in 103 countries, and each country has its own designated administrator. All of these users and administrators are in a single tenant. I said that their environment sounded challenging and asked how they handled incident response (IR) for outages or security issues. “We have a detailed IR plan,” said my interlocutor, “and it’s printed out and stored in a safe, so we can still get it if we’re having a problem.”
“That’s amazing!” I said. “How often do you test the plan?”
“Well, we haven’t yet,” he admitted.
If that sounds familiar, you’re in good company. And if you’re thinking “Wow, they have a pre-built plan? And it’s written down?! I wish we had that,” then you’re not alone there either. But practical incident response is easier for you to implement than you probably think.
What Is Incident Response?
Simply put, “incident response” just means “what you do when there’s some kind of major incident.” This may seem completely obvious, but it’s surprising how complicated people want to make it sometimes. At its foundation, your IR plan is a giant if/then list: if X happens, then do Y. If you understand that, you have everything you need to get started. As a bonus, if you remember the three Mattis questions I wrote about a few months ago, you can use them to construct a more detailed version of the plan: “if a bad thing happens, what do I know about it? Who else needs to know what I know? How and when do I tell them? What do I do next?”
That last question—“What do I do next?”—is where the magic happens. Answering that question is what makes your IR plan into an actual plan and not just a playbook for sending people email to tell them stuff. The interesting part is that most of us already have a pretty good idea of the actions required during incident response. A few obvious examples:
- If something’s broken, fix it.
- If it’s down, bring it up.
- If data’s been lost, try to recover it.
But the actions alone aren’t enough to make a plan. If you just write down a list of everything you might do to respond to a specific type of incident, that’s a great start, but it’s only about half of the work required. The other half is deciding who does each step, in what order, and how you’ll know it worked.
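As a minimal sketch of the if/then idea, the list above can be written as a lookup table in Python. The incident names and the fallback message are illustrative assumptions, not part of any real tool; the actions are the examples given above.

```python
# Hypothetical incident types mapped to the example actions from the text.
PLAYBOOK = {
    "something_broken": "Fix it.",
    "service_down": "Bring it up.",
    "data_lost": "Try to recover it.",
}

# Assumed fallback: the text does not say what to do for unplanned incidents.
FALLBACK = "No planned response: escalate, then add an entry to the plan."


def respond(incident_type):
    """Return the planned action for an incident, or the fallback if none exists."""
    return PLAYBOOK.get(incident_type, FALLBACK)


print(respond("service_down"))
print(respond("office_flooded"))
```

A table like this captures only the actions. It says nothing yet about owners, ordering, or prerequisites, which is the rest of the work.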
Building Your First Incident Response Plan
To make this concrete, let’s take ransomware. IR after a successful ransomware attack can be enormously complicated because you have to figure out which users and devices were affected, how you’re going to get your data back, and how (and whether!) you can recover the affected devices and services. That’s obviously not a matter of a simple set of if/then statements… but you can still start from that framework.
You can think of this like a conversation or an interview question: “If you thought you had a ransomware incident, what would you do?” You might answer by saying “Well, if we detected ransomware, first I’d want to tell the security operations team to block network access outbound from the affected devices or networks. Then I’d update our antimalware signatures ASAP. I’d open an incident ticket with our service desk, and I’d probably open one with our backup vendor. Then I’d call the CISO and tell her.”
The specific steps are important, of course, but so is their order. In the answer above, cutting off the affected devices comes first, before anyone opens a ticket or calls the CISO.
Next, take your steps and convert them into a table. I like a simple format: the first column is the thing to be done, the second column lists who’s supposed to do it, and the third column is blank. When you execute the plan, use that empty column to mark down when the action was taken and who actually did it.
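Here is one way that three-column table might look as a small Python structure, using the ransomware answer from earlier as the rows. This is a sketch under assumptions: apart from the security operations team, the owners are placeholders I made up, and the names `Step`, `record`, and `render` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Step:
    action: str    # column 1: the thing to be done
    owner: str     # column 2: who is supposed to do it
    log: str = ""  # column 3: blank until the plan is executed


def build_ransomware_plan():
    """Steps in the order given in the text; most owners are assumed placeholders."""
    return [
        Step("Block outbound network access from affected devices or networks",
             "Security operations team"),
        Step("Update antimalware signatures", "Endpoint admin (assumed)"),
        Step("Open an incident ticket with the service desk", "Incident lead (assumed)"),
        Step("Open a ticket with the backup vendor", "Incident lead (assumed)"),
        Step("Call the CISO", "Incident lead (assumed)"),
    ]


def record(step, done_by, when=None):
    """Fill in the third column: when the action was taken and who actually did it."""
    when = when or datetime.now(timezone.utc)
    step.log = f"{when:%Y-%m-%d %H:%M} UTC by {done_by}"


def render(plan):
    """Return the plan as numbered rows of 'action | owner | log'."""
    return "\n".join(
        f"{number}. {step.action} | {step.owner} | {step.log}"
        for number, step in enumerate(plan, start=1)
    )


plan = build_ransomware_plan()
record(plan[0], "Alex")  # "Alex" is a made-up responder
print(render(plan))
```

Keeping the steps in a list preserves their order, and the log column records who actually did the work, which may differ from the planned owner.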
Preparing the Plan for Execution
The next question to ask is whether you can perform all the steps as listed, and, if not, what you need in order to get there. For example, suppose you’re writing the plan for “if my on-prem AD is compromised.” You’ll probably have a step involving logging into a domain controller to do something. How are you going to do that? Do you have an isolated DC that you can put back on the network? Is there another step somewhere that involves a break-glass account?
Now imagine that you gave this plan to the most literal-minded person you know. They will only do exactly what’s written in the plan. How far will they get before they hit a blocker because there’s a missing step, or a missing resource that’s required to complete a step? Fix those gaps and try your mental simulation again. For bonus points, ask one of your junior teammates to do the same. They may have questions you haven’t anticipated.
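The literal-minded walkthrough can also be roughed out in code. In this sketch, each step lists the resources it needs, and the function reports the first step that cannot be completed with what is actually on hand. The step wording, the resource names, and the `first_blocker` helper are all assumptions for illustration.

```python
from typing import NamedTuple


class Step(NamedTuple):
    action: str
    needs: frozenset  # resources that must exist before this step can be done


def first_blocker(plan, available):
    """Walk the plan in order and return (step number, action, missing resources)
    for the first step that cannot be done, or None if every step is doable."""
    on_hand = set(available)
    for number, step in enumerate(plan, start=1):
        missing = step.needs - on_hand
        if missing:
            return number, step.action, sorted(missing)
    return None


# Hypothetical fragment of an "on-prem AD is compromised" plan.
plan = [
    Step("Call the CISO", frozenset({"CISO phone number"})),
    Step("Log into a domain controller",
         frozenset({"isolated domain controller", "break-glass account"})),
]

available = {"CISO phone number", "isolated domain controller"}
print(first_blocker(plan, available))
```

With the resources listed here, the walkthrough stops at step 2 because there is no break-glass account. Adding one to `available` and rerunning is the code equivalent of fixing the gap and trying the mental simulation again.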
Testing and Delivery
Not everyone is going to be able to completely test their incident response plan in a realistic environment. My large industrial customer has a completely separate test tenant, with 200 purchased licenses, where they can perform testing, but that’s far beyond what most of us will have to work with.
To get the most from your plan, as you build it, try to identify how you can test each step. Some tests will be simple (“can I find the CISO’s phone number?”). Others will be more complicated, and some may be unfeasible. The more information you put into the plan about how to verify each step, the easier it will be to execute when you need it. How do you know when a step succeeds? For an action like “reboot all domain controllers,” success is easy to measure. For other actions, you’ll need to determine, and record, what success looks like.
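One simple way to keep yourself honest about this is to store a success criterion next to each step and list the steps that still lack one. The sketch below assumes a plan stored as (action, criterion) pairs; the criteria shown are examples I invented, not recommendations.

```python
def steps_missing_success_criteria(plan):
    """Return the actions whose success criterion is empty or missing.

    `plan` is a list of (action, success_criterion) pairs.
    """
    return [
        action
        for action, criterion in plan
        if not (criterion or "").strip()
    ]


# Hypothetical plan fragment; the success criteria are illustrative assumptions.
plan = [
    ("Reboot all domain controllers", "Every domain controller is back online"),
    ("Call the CISO", "The CISO confirms she has been briefed"),
    ("Block outbound network access from affected devices", ""),
    ("Update antimalware signatures", None),
]

for action in steps_missing_success_criteria(plan):
    print(f"No success criterion recorded for: {action}")
```

An empty result means every step says what success looks like.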
Once you’ve defined success for each step, you’ve got a plan! The more testing you can do, of course, the more confidence you can have in the plan, but even if you can’t perform full testing, you’ll be much better off having a partially tested plan than nothing at all.