Active Directory Forest Recovery is Difficult to do Correctly
Active Directory (AD) forest recovery is difficult to do correctly. Add in all the dependencies required for AD to function, and recovering AD can be a real challenge. This is not something you want to do for the first time after an actual forest failure. According to a 2022 poll conducted during the Hacked and Afraid – Dramatic Tales from AD Disaster Recovery Scenarios TEC session, 45% of organizations that responded had never tested their AD disaster recovery plans. Hopefully, by now, they have begun testing and continue to test at least annually. A well-documented and heavily rehearsed recovery plan is key to getting your business back on its feet as soon as possible.
Microsoft’s Active Directory Forest Recovery Guide provides a great foundation for the AD forest recovery process. It covers many facets of forest recovery and outlines the necessary high-level steps but doesn’t address every possible configuration. Every organization is different, so how could one document possibly account for all environments? It cannot. It is only meant to be the starting point for you to build a customized recovery plan for your organization.
Taking regular backups of AD domain controllers (DCs) using AD-aware backup methods is essential, because restoring from such backups is the only recommended way to recover AD. Other recovery methods, such as restoring virtual hard disk image files (VHDX), are not supported and could cause issues.
In this article, we’ll discuss AD forest recovery recommendations, proper backup methods, protection for the AD backups, and operational dependencies that you’ll want to make sure are a part of your AD recovery plan.
Using Virtual Machines to Recover Domain Controllers
Other methods exist for restoring DCs outside of AD-aware backups. Beginning with Windows Server 2012, Microsoft introduced new features that increase the capabilities for virtualizing DCs. Hosting DCs on virtual machines (VMs) facilitates VM cloning and the creation of hypervisor snapshots. Even though recovering the first DC in each domain of a multi-domain forest using a hypervisor snapshot is supported, it is not the recommended method. The recommended restoration method for a DC is to use an AD-aware backup solution. Recovery of DCs using hypervisor snapshots comes with caveats and, according to Microsoft, is not considered to be a substitute for performing AD-aware backups that capture system state data.
Why is reverting to a hypervisor snapshot not the recommended method for restoring DCs? Reverting to a hypervisor snapshot returns the VM back to the state the system was in at the time the snapshot was taken. This may sound desirable. However, if malware was present when the snapshot was taken, the restored server will also contain that malware.
Also included with the virtualization improvements in Windows Server 2012 is the ability for virtualized DCs to make use of VM-Generation ID (VMGenID). On supported hypervisors, when a snapshot is applied, the DC compares the VMGenID exposed by the hypervisor with the value stored in AD. If these values do not match, the DC recognizes this and carries out protection mechanisms to prevent issues within AD. The invocation ID is reset, the new VMGenID value is stored in AD, and the local RID pool for the DC is invalidated.
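The safeguard described above can be pictured with a small simulation. This is only a sketch of the logic, not how a DC implements it: the SimulatedDC class, its field names, and the "gen-1"/"gen-2" values are all made up for illustration.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class SimulatedDC:
    """Toy model of the state a virtualized DC tracks (hypothetical, not a real API)."""
    stored_vm_gen_id: str  # the VMGenID value last saved in AD
    invocation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    rid_pool_valid: bool = True

    def check_vm_generation_id(self, hypervisor_vm_gen_id: str) -> bool:
        """Compare the two IDs; run the safeguards on a mismatch.

        Returns True if the safeguards ran, False if the IDs matched.
        """
        if hypervisor_vm_gen_id == self.stored_vm_gen_id:
            return False  # nothing suggests a snapshot was applied
        self.invocation_id = str(uuid.uuid4())        # reset the invocation ID
        self.stored_vm_gen_id = hypervisor_vm_gen_id  # store the new VMGenID
        self.rid_pool_valid = False                   # invalidate the local RID pool
        return True


dc = SimulatedDC(stored_vm_gen_id="gen-1")
print(dc.check_vm_generation_id("gen-1"))  # False
print(dc.check_vm_generation_id("gen-2"))  # True
```

The point of the sketch is that all three protective actions hang on a single comparison. If the hypervisor never exposes a changed VMGenID, none of them run.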
If your environment meets all the requirements for VMGenID to work properly, you’re unlikely to experience a condition called “USN (Update Sequence Number) rollback.” It can still occur, however, if an improper restore method is used. An improper restore method is any method that doesn’t change the VMGenID and therefore leaves the AD database invocation ID unchanged.
USN rollback causes problems with replicating changes made to objects on the improperly restored DC. It can happen if you run DCs on old hypervisors that don’t meet the requirements for VMGenID, or if unsupported restore activities, such as restoring the virtual hard disk files of DCs, are allowed to take place.
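To see why an unchanged invocation ID matters, here is a deliberately simplified simulation. It assumes a replication partner that remembers the highest USN it has seen for each invocation ID and skips anything at or below that number. The ReplicationPartner class, the simulate function, and the "inv-A"/"inv-B" identifiers are hypothetical; real AD replication is considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicationPartner:
    """Toy partner that remembers the highest USN seen per invocation ID."""
    highest_usn_seen: dict = field(default_factory=dict)
    received: list = field(default_factory=list)

    def pull(self, invocation_id: str, changes: dict) -> None:
        """Accept only changes whose USN is above what was already seen."""
        seen = self.highest_usn_seen.get(invocation_id, 0)
        for usn in sorted(changes):
            if usn > seen:
                self.received.append(changes[usn])
                seen = usn
        self.highest_usn_seen[invocation_id] = seen


def simulate(reset_invocation_id: bool) -> list:
    """Return the post-restore changes the partner actually receives."""
    partner = ReplicationPartner()
    # Before the failure, the DC replicated out changes with USNs 1..100.
    partner.pull("inv-A", {usn: f"old change {usn}" for usn in range(1, 101)})
    partner.received.clear()
    # The DC is restored from an image taken at USN 50, then makes new changes.
    invocation_id = "inv-B" if reset_invocation_id else "inv-A"
    partner.pull(invocation_id, {51: "new user", 52: "password reset"})
    return partner.received


print(simulate(reset_invocation_id=False))  # []
print(simulate(reset_invocation_id=True))   # ['new user', 'password reset']
```

With the old invocation ID, the partner believes it already has USNs 51 and 52, so the new changes are silently skipped. With a fresh invocation ID, the same changes are treated as coming from a new source and are accepted.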
If you miss a step in the AD forest recovery process or perform a step incorrectly, you may have to restart the recovery process from the beginning, further delaying the resumption of normal operations. Examples include forgetting to isolate the recovered DCs from the unhealthy ones or restoring a DC in a way that reintroduces ransomware.
Essential Backup Points for AD Recovery
Following Microsoft’s documented forest recovery steps in the Active Directory Forest Recovery Guide is vital. However, the documentation can’t possibly address every scenario. It makes assumptions that might not apply to your environment, which could mean additional recovery work. For instance, you might use a third-party DNS solution and then discover that the AD Forest Recovery Guide only addresses AD-Integrated DNS.
Restoring AD should be performed using a backup and restore method that is AD-aware. The supported and recommended method for backup and restore of AD is using a backup utility that makes use of the Volume Shadow Copy Service (VSS). This will help reduce the chances that the restore process will introduce new issues into the AD environment.
It’s important to store backups in a location that is protected from ransomware, tampering, theft, and deletion. Backups that are compromised by an attacker can result in the inability to recover from an AD forest failure or allow an attacker to gain privileged knowledge of the production environment. If the backups are stolen, an adversary could use them to perform an offline attack on AD and obtain password hash information for sensitive accounts. If the backups are encrypted using ransomware, recovery may not be possible until the ransom is paid. If backups are deleted, recovery becomes impossible.
The restoration method should include the ability to restore to a clean OS. Backups used this way include only the files necessary to restore AD, not the OS or the complete file system of the backed-up DC. This helps prevent reintroducing malware, or whatever originally caused the AD failure, into the restored environment.
Ensure your AD forest recovery plan includes the ability to restore one DC in each domain in the forest to an isolated environment. There are a few reasons for this recommendation. If the failure was caused by an attacker gaining access to the environment, isolation gives you the opportunity to clean up the membership of privileged groups and reset passwords for privileged accounts. Hardening, and validation that the original threat no longer exists, should also take place in this environment. Finally, perform metadata cleanup of the DCs that were not restored and validate that replication is working.
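A quick way to sanity-check the "one DC in each domain" requirement is to compare the list of domains in the forest against the DCs your plan designates for restore. The sketch below assumes you maintain both lists yourself; the function name, domain names, and DC name are placeholders.

```python
def domains_without_restore_target(forest_domains, restore_plan):
    """Return the domains that have no DC designated for isolated restore."""
    return sorted(d for d in forest_domains if not restore_plan.get(d))


# Hypothetical two-domain forest; the plan only covers the root domain.
forest_domains = ["corp.example.com", "emea.corp.example.com"]
restore_plan = {"corp.example.com": "DC01"}

print(domains_without_restore_target(forest_domains, restore_plan))
# ['emea.corp.example.com']
```

An empty result means every domain has a designated DC. It says nothing about whether a usable backup of that DC exists, which still has to be verified separately.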
Don’t Forget the Operational Dependencies
Be sure to also plan for the operational dependencies. Do you have contact information for all decision-makers and the team members necessary to carry out a full forest recovery? Are you able to access this information if AD is completely down?
Where are your DCs deployed today? Are they on-premises or cloud-based VMs? Is access to the hypervisors possible if AD is unavailable? For cloud-based VMs, are you able to access the cloud provider’s portal in order to provision new VMs? Are there any DCs on physical servers? Do you have the capability to reimage them remotely or does it require that someone be onsite?
Does your recovery plan meet your organization’s RTO (Recovery Time Objective) and RPO (Recovery Point Objective)?
- RTO: The maximum acceptable time to restore your systems.
- RPO: The maximum acceptable amount of data loss, expressed as the maximum age of the backups that are suitable for recovery.
Meeting both objectives is crucial to ensure AD is back up and running within your organization’s recovery requirements and with fresh enough data.
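One way to make these objectives concrete is to check them against numbers from your last rehearsal. The sketch below is illustrative only: the function name, the dates, and the 24-hour targets are assumptions, and your own RTO and RPO will differ.

```python
from datetime import datetime, timedelta


def check_objectives(last_backup, failure_time, rehearsed_recovery, rto, rpo):
    """Compare a rehearsal result and backup age against the RTO and RPO."""
    backup_age = failure_time - last_backup  # changes newer than this are lost
    return {
        "rto_met": rehearsed_recovery <= rto,
        "rpo_met": backup_age <= rpo,
    }


result = check_objectives(
    last_backup=datetime(2023, 6, 1, 2, 0),
    failure_time=datetime(2023, 6, 1, 20, 0),  # the backup is 18 hours old
    rehearsed_recovery=timedelta(hours=30),    # measured in the last rehearsal
    rto=timedelta(hours=24),
    rpo=timedelta(hours=24),
)
print(result)  # {'rto_met': False, 'rpo_met': True}
```

In this made-up case the backups are fresh enough, but the rehearsed recovery takes longer than the business can tolerate, so the plan needs work even though the backup schedule is fine.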
Prepare for AD Forest Recovery or Prepare to Fail
AD forest recovery is tough. There are many pitfalls that you won’t want to encounter for the first time when recovering from an actual disaster. It’s essential to have a solid recovery plan that is updated regularly and after every major environmental change, and that is tested at least once every year. Keeping this plan up-to-date and validated will help prevent surprises if the time ever comes to perform an actual AD forest recovery.
How much does it cost your organization when AD is down? I know of a company where there are several “million-dollar-an-hour apps.” AD becoming unavailable would prevent these applications from functioning and the outages for each would cost the company millions of dollars every hour. Consider also how many other critical functions could be impacted during an AD outage. How many LOB apps are running on servers that are domain-joined?
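If you want a rough number to bring to decision-makers, the arithmetic is simple. The figures below are invented for illustration (three applications at one million dollars an hour each, and an eight-hour outage); substitute your own.

```python
def outage_cost(hourly_cost_by_app, outage_hours):
    """Total cost of an outage that takes every listed app down at once."""
    return sum(hourly_cost_by_app.values()) * outage_hours


# Hypothetical applications that all depend on AD being available.
apps = {"app-1": 1_000_000, "app-2": 1_000_000, "app-3": 1_000_000}

print(outage_cost(apps, 8))  # 24000000
```

This only counts the applications you list, so treat the result as a floor rather than an estimate of the full impact.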
Improper restore methods could result in USN rollback, reintroduction of malware, or other issues that are not recognized until weeks or months after the recovery. Ensure that you are using an AD-aware backup method so that additional complications are not introduced into the recovered environment.
Deciding to perform a full forest recovery is not a decision to be taken lightly, as it can be very disruptive to an organization. If you conclude that restoring the entire forest is the way to recover from the forest failure, ensure that you gain approval from your company’s appropriate decision-makers.
We hope that you never have to perform a full AD forest recovery in production, but it is something that you’ll want to be well prepared for. Continuous rehearsal and constant updating of your forest recovery plan are essential to ensure the recovery goes as smoothly as possible.
Come see us in Atlanta at TEC 2023 to hear more about what should and shouldn’t be in your AD recovery plan!