Microsoft’s cloud infrastructure suffered an outage for Azure DNS servers on April 1, 2021. I didn’t write about the incident at the time because it didn’t last very long (Microsoft’s root cause analysis reports a 39-minute outage “between 21:21 UTC and 22:00 UTC”) and I wanted to see what their diagnosis was.
The DNS Outage Unfolds
Microsoft disclosed that “an anomalous surge in DNS queries from across the globe targeting a set of domains hosted on Azure” hit the Azure DNS servers. Unlike Google’s public DNS servers, these servers are not used to resolve public DNS queries. Instead, they resolve lookups to find the right server among the millions of servers running inside Microsoft’s infrastructure. For instance, Exchange Online has over 175,000 mailbox servers. The Azure DNS servers find the right server when a user wants to access their mailbox.
Azure DNS includes code to mitigate query surges. In this case, a bug (aka “code defect”) meant that the code failed to mitigate the demand, and the DNS service became overloaded. Services under strain are less responsive and more likely to fail than services running under normal load, and that led to users experiencing problems connecting to a range of services from Bing to Xbox. Once Microsoft engineers noticed the degradation, they took corrective action to address the problem and restored service. And now they’re fixing the code defect.
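Microsoft hasn’t published the code that mitigates query surges, so the following is only a minimal sketch of one common technique, a token bucket that sheds excess queries before they overload a service. The `TokenBucket` class, the `simulate` helper, and the rate and capacity values are all hypothetical and are not taken from Microsoft’s implementation.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical query limiter; not Microsoft's implementation."""
    rate: float        # tokens (queries) replenished per second
    capacity: float    # maximum burst size
    tokens: float = 0.0
    last: float = 0.0  # timestamp of the last refill, in seconds

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True   # answer the query
        return False      # shed the query to protect the service


def simulate(bucket: TokenBucket, arrivals: list[float]) -> tuple[int, int]:
    """Return (answered, shed) counts for a list of arrival timestamps."""
    answered = sum(1 for t in arrivals if bucket.allow(t))
    return answered, len(arrivals) - answered


if __name__ == "__main__":
    # Assumed numbers: 10 queries/second sustained, bursts of up to 5.
    bucket = TokenBucket(rate=10.0, capacity=5.0, tokens=5.0)
    surge = [0.0] * 100  # 100 queries arriving at the same instant
    print(simulate(bucket, surge))
```

With these assumed values, the bucket answers the first five queries of the surge and sheds the rest, then recovers once time passes and tokens refill. If a defect stops a limiter like this from rejecting queries, the full surge reaches the service, which is the kind of overload described above.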
Azure’s Rough Period
Azure has had a tough time recently. March 15 saw an authentication outage starting at 19:00 UTC. While most services were online by 22:10 UTC, the incident continued until 09:37 UTC the following day. Microsoft is in the middle of deploying a new Safe Deployment Process (SDP) backend system to stop problems like key removal, which caused this issue. It’s part of the solution to a range of issues such as the previous major outage on September 28/29 2020.
Collectively, several incidents over a relatively short period took away much of the gloss Microsoft’s cloud infrastructure accrued over the initial stages of the pandemic. Services blinked and some functionality was reduced (such as dropping the resolution of Teams meeting recordings from 1080p to 720p), but the cloud delivered when people needed it to work from home. To its credit, Microsoft moved resources around and brought new resources online to handle a massive increase in demand, and things were good until the incidents erupted to erode customer confidence and cause people to ask (once again) if they can depend on the cloud.
Problems Happen All the Time
Azure is obviously a critical part of the Microsoft cloud infrastructure. So many major Microsoft 365 apps depend on microservices running inside Azure today that a failure in Azure invariably means more widespread problems. But stepping back and looking at the full picture, I don’t see that recent issues changed the overall situation. Consider these facts:
- Not every organization using Microsoft cloud services is affected by individual outages. Sure, if you are, it’s painful for the duration of the outage. But like any other utility, the impact of a failure depends on what parts of the network you use. Just as when a pumping or power station goes offline, cloud customers not served by the failed component continue working happily.
- You can be guaranteed that an outage is ongoing somewhere in the Microsoft 365 ecosystem at any time you care to look. However, the design of Microsoft 365 and the way customers receive service from self-contained datacenter regions mean that outages are usually localized. The exception, as we’ve seen, is when common services like DNS or authentication are affected. These have proven to be single points of failure that cross regional boundaries. Obviously, Microsoft has some work to do to bulletproof these aspects of its service.
- An outage doesn’t stop everyone working. Many apps can function without being affected by cloud failures. Outlook has cached complete mailboxes since Outlook 2003, Teams recently got basic capability to work with messages offline, and desktop apps like Word and Excel can save files locally if SharePoint Online or OneDrive for Business are inaccessible. Mobile apps usually have a local cache to allow some amount of offline activity.
- Browser apps are most sensitive to cloud outages, but even so, the authentication outage in March only affected them if apps needed to reauthenticate.
A Temporary Embarrassment
After spending hundreds of billions of dollars on software engineering, datacenters, and networks, any outage is embarrassing to Microsoft. Although losing the Azure DNS service for a while might create some discussion about how to stay productive when the cloud sits down, a 39-minute outage won’t stop the transfer of work from on-premises to the cloud. Fortunately, we have never experienced a multi-day, multi-region complete cloud shutdown. I hope we never do.