Failure to Authenticate Causes Users to Lose Access to Apps
Microsoft 365 users around the world were unimpressed on Monday, March 15 when Azure AD authentication requests failed, blocking their access to apps like Teams and Exchange Online. Downdetector.com started to report problems with users being unable to sign in to Teams and other Office 365 apps around 19:00 UTC. The number of reports for incident MO244568 grew rapidly (Figure 1).
The Azure status page said “Starting at approximately 19:15 UTC on 15 Mar 2021, a subset of customers may experience issues authenticating into Microsoft services, including Microsoft Teams, Office and/or Dynamics, Xbox Live, and the Azure Portal.” The gap between when users noticed issues and when Microsoft declared an incident isn’t unusual, as Microsoft engineers need time to figure out if reported problems are due to a transient hiccup or something deeper.
Why Some Users Could Continue to Work
Based on my own experience, it seemed as if apps continued to work if they didn’t need to make an authentication request to Azure AD. In my case, I was connected to the Microsoft tenant in Teams and couldn’t switch back to my home tenant because the Teams client wasn’t able to authenticate with that tenant. As anyone who has ever looked at the Teams activity in the Office 365 audit log can testify, Teams desktop clients authenticate hourly when their access token expires. This might be the reason why Microsoft highlighted the effect on Teams in their communications for the Microsoft 365 health status page (Figure 2).
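The hourly pattern can be illustrated with a minimal sketch. It assumes the access token is a JWT carrying a standard `exp` claim and that a client asks for a new token shortly before that time. The function names and the five-minute margin are my own illustration, not anything taken from the Teams client, and real applications would leave this to an authentication library rather than parse tokens themselves.

```python
import base64
import json
import time
from typing import Optional


def jwt_expiry(token: str) -> int:
    """Return the `exp` claim (Unix seconds) from a JWT. No signature check is done."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return int(claims["exp"])


def needs_new_token(token: str, now: Optional[float] = None, skew: int = 300) -> bool:
    """True when the token has expired or expires within `skew` seconds."""
    current = time.time() if now is None else now
    return current >= jwt_expiry(token) - skew


def make_demo_token(exp: int) -> str:
    """Hypothetical helper: build an unsigned token purely for this demonstration."""
    body = base64.urlsafe_b64encode(json.dumps({"exp": exp}).encode()).decode()
    return "e30." + body.rstrip("=") + ".demo-signature"


issued_at = 1_000_000
token = make_demo_token(issued_at + 3600)          # one-hour lifetime
print(needs_new_token(token, now=issued_at + 60))    # still valid, no request needed
print(needs_new_token(token, now=issued_at + 3500))  # close to expiry, must authenticate
```

A client holding a token with plenty of lifetime left makes no request to Azure AD and keeps working; a client whose token ran out during the outage has to authenticate and fails.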
The Microsoft 365 Service health dashboard wasn’t much help (Figure 3) because it needs to authenticate to the Office 365 Services Communications endpoint (here’s an example of how to retrieve the same incident data with PowerShell). Because authentication failed, the Service health dashboard couldn’t retrieve incident details. No doubt some ISVs will make the point that relying on Microsoft APIs during a major outage might be a case of putting all your eggs in one basket.
If an app didn’t need to authenticate, it continued to work quite happily. I used Outlook desktop and the browser interfaces to SharePoint Online, OneDrive for Business, OWA, and Planner during the outage. On the other hand, while writing this article, Word couldn’t upload the document to SharePoint Online or autosave changes because of an authentication failure. Another interesting problem was that messages sent to a Teams channel email address failed because the connector used to deliver the email to Teams couldn’t authenticate.
Similar Authentication Woes in September 2020
At first glance, this problem resembles the Azure AD outage of September 28/29, 2020. Microsoft said then that “A latent code defect in the Azure AD backend service Safe Deployment Process (SDP) system caused this to deploy directly into our production environment.” In other words, a code change containing a bug made its way into production, which seems very similar to the reason cited by the Microsoft 365 status Twitter account (@msft365status) at 20:10 UTC that the problem was due to “a recent change to an authentication system.”
Unfortunately, as I wrote in February 2019, Azure AD is the Achilles Heel of Office 365. When everything works, it’s great. When Azure AD falls over, everything comes to a crashing halt across Microsoft 365. This is despite Microsoft’s December 2020 announcement of a 99.99% SLA for Azure AD authentication, which comes into effect on April 1, 2021. That is, if you have Azure AD Premium licenses.
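To put 99.99% in perspective, here is a rough calculation. It assumes a 31-day month and, purely for illustration, treats the whole window reported in this article (roughly 19:00 to 22:10 UTC, about 190 minutes) as downtime. That is not how Microsoft measures the SLA, and the SLA was not yet in effect in March 2021, so read the numbers as a sense of scale only.

```python
def allowed_downtime_minutes(sla: float, days: int) -> float:
    """Minutes of downtime a given SLA permits over a period of `days` days."""
    return (1.0 - sla) * days * 24 * 60


def availability(downtime_minutes: float, days: int) -> float:
    """Fraction of the period during which the service was up."""
    return 1.0 - downtime_minutes / (days * 24 * 60)


budget = allowed_downtime_minutes(0.9999, days=31)
outage_minutes = 190  # assumption: the entire 19:00-22:10 UTC window counts in full

print(f"Monthly downtime budget at 99.99%: {budget:.2f} minutes")
print(f"Availability after a {outage_minutes}-minute outage: "
      f"{availability(outage_minutes, days=31) * 100:.3f}%")
```

Under these assumptions the monthly budget is a little under four and a half minutes, and a single outage of this length would leave the month at roughly 99.57%.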
Quite reasonably, people asked why Microsoft had deployed code changes on the first day of the working week. That’s a good question. Although people use Office 365 every day, the weekend demand on the service is much lower than Monday-Friday, so it’s fair to expect Microsoft to schedule changes that might affect users for the weekend. Or even better, test their software before allowing code to make it through to production services.
Current Status and Preliminary Root Cause
As of 21:15 UTC, Microsoft said “We’ve identified the underlying cause of the problem and are taking steps to mitigate impact. We’ll provide an updated ETA on resolution as soon as one is available.” At 21:25 UTC, the status became a little more optimistic: “We are currently rolling out a mitigation worldwide. Customers should begin seeing recovery at this time, and we anticipate full remediation within 60 minutes.” After the mitigation rolled out, the problem appeared to be resolved by 22:10 UTC, when Microsoft said “The update has finished its deployment to all impacted regions. Microsoft 365 services continue the process of recovery and are showing decreasing error rates in telemetry. We’ll continue to monitor service health as availability is restored.”
On March 16, Microsoft noted that some organizations might still see some Intune failures. They anticipate that remediation actions taken by 21:00 UTC will address these lingering problems. They also published a preliminary root cause analysis on the Azure Status History page saying: “an error occurred in the rotation of keys used to support Azure AD’s use of OpenID, and other, Identity standard protocols for cryptographic signing operations. As part of standard security hygiene, an automated system, on a time-based schedule, removes keys that are no longer in use. Over the last few weeks, a particular key was marked as “retain” for longer than normal to support a complex cross-cloud migration. This exposed a bug where the automation incorrectly ignored that “retain” state, leading it to remove that particular key. Metadata about the signing keys is published by Azure AD to a global location in line with Internet Identity standard protocols. Once the public metadata was changed at 19:00 UTC, applications using these protocols with Azure AD began to pick up the new metadata and stopped trusting tokens/assertions signed with the key that was removed. At that point, end users were no longer able to access those applications.”
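The failure mode described in the preliminary analysis can be modelled in a few lines. This is a toy sketch, not Microsoft’s code: the `SigningKey` class, the `retain` flag, the key names, and the `honor_retain` switch are all hypothetical, and the published metadata is reduced to a set of key IDs with no cryptography involved. It shows how a cleanup pass that ignores a “retain” marker removes a key from the published set, after which an application that trusts only published keys rejects tokens signed with it.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class SigningKey:
    kid: str                # key identifier carried in a token header
    in_use: bool = False    # the automation's own view of whether the key is active
    retain: bool = False    # hypothetical flag modelled on the "retain" state


def prune_keys(keys: List[SigningKey], honor_retain: bool = True) -> List[SigningKey]:
    """Return the keys that survive a scheduled cleanup pass.

    With honor_retain=False the function reproduces the kind of bug described
    above: keys flagged "retain" are treated like any other unused key.
    """
    survivors = []
    for key in keys:
        protected = key.in_use or (honor_retain and key.retain)
        if protected:
            survivors.append(key)
    return survivors


def published_metadata(keys: List[SigningKey]) -> Set[str]:
    """Stand-in for the public signing-key metadata: just the set of key IDs."""
    return {key.kid for key in keys}


def is_trusted(token_kid: str, metadata: Set[str]) -> bool:
    """An application trusts a token only if its signing key is still published."""
    return token_kid in metadata


keys = [
    SigningKey("key-current", in_use=True),
    SigningKey("key-migration", retain=True),  # kept longer than normal
]

correct = published_metadata(prune_keys(keys))
buggy = published_metadata(prune_keys(keys, honor_retain=False))

print(is_trusted("key-migration", correct))  # key still published
print(is_trusted("key-migration", buggy))    # key removed, token rejected
```

Once applications refresh their copy of the metadata, the second case applies to every token signed with the removed key, which fits the gradual spread of failures after 19:00 UTC.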
Problem sorted for now, but it will be interesting to see if more details are revealed in Microsoft’s full Post Incident Report.