Y2K-type Failure Causes Exchange Server to Stop Processing Email
The New Year came in with a snarl for Exchange Server administrators when Exchange 2016 and 2019 servers stopped processing inbound email because the transport service couldn’t check messages for malware. Microsoft fixed the problem by releasing a script that cleans up the files used by the malware engine and then downloads and applies the latest malware signature file. Administrators must apply the fix to every Exchange 2016 and Exchange 2019 server in an organization. Make sure that you use an account with permissions to update Exchange when you run the script.
Reminiscent of the Y2K problem, the root cause of the issue is a date check problem. Exchange Server downloads daily updates to ensure that the malware scanning engine can detect recent malware. After downloading, Exchange performs a version check using the date. After the clock ticked over to 2022, the date check failed and Exchange logged event 5300.
Log Name: Application
Source: FIPFS
Logged: 1/1/2022 1:03:42 AM
Event ID: 5300
Level: Error
Computer: server1.contoso.com
Description: The FIP-FS "Microsoft" Scan Engine failed to load. PID: 23092, Error Code: 0x80004005. Error Description: Can't convert "2201010001" to long.
From the description, it looks like the failure occurred when a routine attempted to convert the value 2201010001 (January 1, 2022, 00:01?) to a long value. As this Reddit thread points out, the value is too high to store in a long: a signed 32-bit integer tops out at 2,147,483,647.
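To make the arithmetic concrete, here is a minimal Python sketch of the range check. It assumes the "long" in the error message is a signed 32-bit integer. The function name fits_in_int32 and the sample value 2112310001 (a hypothetical version string from the last day of 2021) are illustrations only and are not taken from Exchange.

```python
# Largest value a signed 32-bit integer can hold: 2,147,483,647.
INT32_MAX = 2**31 - 1
INT32_MIN = -2**31


def fits_in_int32(version_text: str) -> bool:
    """Return True if the numeric string fits in a signed 32-bit integer."""
    value = int(version_text)
    return INT32_MIN <= value <= INT32_MAX


# Hypothetical version string for December 31, 2021 (assumed format).
print(fits_in_int32("2112310001"))  # True

# The value reported in event 5300.
print(fits_in_int32("2201010001"))  # False
```

Any version string that starts with 21 stays below the limit, while every string that starts with 22 exceeds it, which is consistent with the failure appearing as soon as the year changed.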
Microsoft emphasizes that this is not a security problem or a flaw with either malware scanning or the malware engine. However, as it stops mail flow, it’s a serious operational issue.
It seems like the fix works and Exchange servers begin clearing inbound mail queues after its application. Some people report that they have had to restart servers, but this shouldn’t be necessary. The problem doesn’t appear to affect earlier versions of Exchange Server.
If you run a hybrid deployment and route messages through your on-premises infrastructure, the problem also blocks inbound email to Exchange Online mailboxes until you apply the fix.
Now that the situation has calmed, the inevitable question is why Microsoft allowed such a problem to happen. Even viewed in the most benign light, this is an inexplicable catastrophic testing failure. It might be further evidence of the increasing lack of attention Microsoft pays to the on-premises version of Exchange in its desire to move seats to Exchange Online. Although some on-premises deployments do themselves no favors through lack of maintenance, especially in the timely application of security and other updates, this problem had nothing to do with on-premises administration: it is entirely due to Microsoft incompetence.
No disruption appears to have happened to Exchange Online. This could be because the code bases used by the on-premises and cloud versions separated some years ago; it could also be due to better testing and maintenance of the code. Or it’s because the on-call Microsoft engineers spotted the problem immediately after the New Year and fixed it before anyone noticed.
I still think many on-premises deployments would be better off in the cloud, but only because of the better feature set available there. Microsoft errors in on-premises software shouldn’t be a motivation for customers to move their mail traffic to Exchange Online.