If it seems like this column has been more focused lately on processes than on tools or configuration settings, you’re not wrong. That’s intentional. There are two reasons why I prefer to focus on process, education, and other non-technical topics when it makes sense. The first is that not everyone is always going to have the latest and greatest tools. In the real world, things such as budgets or change control requirements mean that not everyone can just click a few buttons in the Microsoft 365 Admin Center and deploy whatever new Microsoft-branded security tool has just shipped. The other, and arguably more important, reason is that what legendary software engineer Grady Booch said 20+ years ago is still true: “A fool with a tool is still a fool.” That is, having better tooling doesn’t help protect you if you don’t know what the tool can do or how to use it. (If you doubt me, look no further than recent breaches caused by elementary mistakes such as failing to enforce MFA for remote desktop access!)
Getting in the Repetitions
There are a lot of activities where expertise developed over time is critical. If you are going to have heart surgery, you probably don’t want to be a brand-new surgeon’s first patient. Incident response, I would argue, is one of those things. Until your organization has been through a few incidents, it is really hard to develop the institutional skill, knowledge, and muscle memory required to make sure that you can consistently and effectively respond to future events. Of course, the problem with that strategy is that no one wants to voluntarily have incidents that require recovery. Instead, what we typically see is that people try to practice incident recovery with simulated incidents: running tabletop exercises, hiring consultants, and so on. This is a lot better than doing nothing, but it still doesn’t necessarily give you the number of repetitions required to get good at such a perishable skill set.
There is another tactic that has been proven to work well, though. Learning from the successes and failures of other organizations’ attempts to recover from incidents is, dollar for dollar, the best investment you can make in incident recovery. Security culture has advanced a great deal in the last 10 or so years; it’s now commonplace for organizations to share what they learned as part of responding to security incidents and outages. That learning can be applied very profitably in your own organization.
Recognizing Differences
There is one important caveat to keep in mind when you are considering building your incident response plans and training around incidents that other people have responded to. Back in the day, it was very common to see enterprise customers insist on copying the architecture that Microsoft used for their own internal Exchange deployments, even when that architecture didn’t match their business needs or was beyond their organizational skills to build and manage. To be successful, you can’t just look at the way an organization ten times your size deals with incidents and copy it directly.
Microsoft has a number of internal teams that work with their largest or most sensitive customers to help them get through security incidents. It is both interesting and educational to look at Microsoft’s recent blog post about the lessons they have learned in helping other organizations do incident response.
You can separate the lessons they shared into three categories: people, processes, and technique. (This mirrors the old “people, processes, and technology” structure that they first began using around the time of the Trustworthy Computing memo.) Let me summarize the most significant ones.
People Issues in Incident Response
The first lesson I want to share is one I just made up because it isn’t in Microsoft’s article. The people at your organization who are supposed to respond to incidents have to know enough about the tools they’re using, your internal systems, and security in general to be effective. You cannot expect someone without that knowledge to contribute usefully to incident response. That isn’t fair to them, and it isn’t going to get you useful results. Microsoft probably presupposes that all of the mega-corporations and government agencies they work with have skilled and experienced staff already, but that may not be the case in your organization.
The second people lesson is that you need to have an incident manager. This is a well-understood principle from incident response in the physical world: every major fire, terrorist incident, chemical spill, airplane crash, or other disaster has a single individual who serves as the point of contact for decision-making about the incident and is accountable for it.
The incident manager has to understand enough about your business and its processes and systems to help guide the response. That does not mean that she has to be knowledgeable in the tiniest details of how every app you use is configured and managed. This role is more about breadth than about depth. If you don’t know who the incident manager in your organization is, finding out (or designating one) is an excellent first step that costs you nothing to implement.
Process Issues in Incident Response
Microsoft’s first process lesson is one you’ve probably heard before: your planning for incident response must begin with your disaster recovery planning. Those two areas are inextricably linked. The quality and depth of your disaster recovery plan is going to dictate how well you can actually recover from an incident. Your incident response plan, which has to include directions for how to carry out the disaster recovery process, is going to dictate the success of your response. The good news is that a robust disaster recovery plan and process is useful in many other ways besides responding to an actual incident. It can help you in cases where you have data loss, it is a useful way to test your organizational communication processes, and having a good recovery plan is often a prerequisite for getting cyber insurance. As with many other types of planning, it is better to have a simple, limited disaster recovery plan that you understand and can execute on demand than to have a giant, complicated process that hasn’t been thoroughly tested and might not work when you need it most.
Microsoft’s recommendations also highlight the importance of patch and update management, but maybe in a different way than you have seen before. They encourage organizations to take extra care in securing and auditing their software update distribution mechanisms, both because of the risk that an attacker will compromise them and because those mechanisms play an important role in recovery when you need to push updates out after an incident. This is an area that definitely deserves special attention, and it is good to see Microsoft highlighting it.
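To make that idea concrete, here is a minimal sketch (mine, not Microsoft’s) of the kind of integrity check you might add to an internal update distribution point: before any package goes out to clients, compare its hash against a manifest of known-good hashes that is stored and protected separately from the packages themselves. The file names and manifest format here are hypothetical; the point is the pattern, not the specifics.

```python
import hashlib
import json
import sys
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file without reading it all into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_packages(manifest_path: Path, package_dir: Path) -> bool:
    """Check every package listed in the trusted manifest before distribution.

    The manifest is assumed to be a JSON map of {filename: expected_sha256}
    that is maintained and protected separately from the packages themselves.
    """
    expected = json.loads(manifest_path.read_text())
    all_ok = True
    for name, expected_hash in expected.items():
        package = package_dir / name
        if not package.exists():
            print(f"MISSING: {name}")
            all_ok = False
        elif sha256_of(package) != expected_hash:
            print(f"HASH MISMATCH: {name}")
            all_ok = False
    return all_ok


if __name__ == "__main__":
    # Hypothetical paths; point these at your own distribution share and manifest.
    if not verify_packages(Path("trusted-manifest.json"), Path("packages")):
        sys.exit("Refusing to distribute: one or more packages failed verification.")
    print("All packages verified; OK to distribute.")
```

Running a check like this (and logging its results) every time the distribution point’s contents change gives you both a tamper tripwire and an audit trail you can consult during a response.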
In the least surprising part of the blog, Microsoft also recommends that you use a tool such as Microsoft Sentinel to manage your audit logs. I will reserve further comment on that recommendation, because of course you should be carefully maintaining, inspecting, and auditing your audit logs, period. At this point in the evolution of Microsoft 365 security, it’s hard to imagine that anyone still needs to be told how important that is.
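Whatever tool you land on, make sure someone on the response team can actually pull data out of it under pressure. As one illustration only (a sketch, not Microsoft’s guidance, assuming a Sentinel-backed Log Analytics workspace with Entra ID sign-in logs connected and the azure-identity and azure-monitor-query Python packages installed), here is roughly what retrieving a day’s worth of failed sign-ins might look like:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient, LogsQueryStatus

# Placeholder: use your own Log Analytics workspace ID.
WORKSPACE_ID = "<your-workspace-id>"

# KQL: failed sign-ins over the query timespan, grouped by account and source IP.
QUERY = """
SigninLogs
| where ResultType != "0"
| summarize FailedSignIns = count() by UserPrincipalName, IPAddress
| order by FailedSignIns desc
| take 20
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(
    workspace_id=WORKSPACE_ID,
    query=QUERY,
    timespan=timedelta(days=1),
)

if response.status == LogsQueryStatus.SUCCESS:
    for table in response.tables:
        for row in table.rows:
            print(dict(zip(table.columns, row)))
else:
    # Partial results come back with an error describing what was skipped.
    print("Partial results:", response.partial_error)
```

Whether you wrap something like this in a script, a notebook, or a saved hunting query matters far less than making sure the people on call have tested the access path before the day they need it.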
Does Technique Matter?
The rest of Microsoft’s recommendations revolve around specific techniques, like hardening identities or adding extra protection for critical services. Most of the Practical Protection columns I’ve written so far cover specific techniques, and I didn’t see anything particularly novel in the techniques Microsoft recommends in their blog.
For most organizations, the payoff will be bigger if they spend their time figuring out how to find the right people, with the right skills and attitudes, to lead the response to an incident, and then helping those people succeed with adequate planning and preparation.