I recently dissected a problem in an Exchange Server organization that I have decided to share here, as much for the technical root cause as for the troubleshooting steps that were involved.
The problem first appeared as an imbalance in the volume of email traffic that each Hub Transport server in one particular site was handling.
This was picked up in some routine performance monitoring. The daily email traffic for each of the five Hub Transport servers in this site for the last 30 days was calculated using message tracking log analysis, and the data was used to generate this graph.
The heavy days are week days (Monday to Friday) and the dips are the weekends.
That trend is not surprising, but what did catch my eye was the way SERVER4 is handling twice the traffic of any other server in that site. The other four servers are each handling roughly the same amount, but SERVER4 stands out above them all.
How Exchange Hub Transport Servers Load Balance Traffic
This particular site is not internet-facing. In other words it is not responsible for email traffic going in and out of the Exchange organization.
Therefore, the Hub Transport servers in this site are primarily handling messages between mailboxes within that site, or messages to/from other sites (whether for mailboxes in the other sites, or to/from the internet via one of the internet-facing sites).
Hub Transport servers in this scenario do not need any special load balancing configuration applied. Within an Active Directory site, and for email traffic between sites, Exchange performs its own form of automatic load balancing.
In effect, the Exchange server will look at the list of available Hub Transport servers for the route an email message needs to take, randomize that list of servers, and then try each of them starting with the first in the list until it finds one that is able to accept the message. Unless there is a fault of some kind it will usually send to the first Hub Transport server in the randomized list.
So while it doesn't perfectly load balance the traffic, it should do so within a pretty small degree of variation.
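The behaviour described above can be sketched as a small simulation. This is only a rough model, not Exchange's actual implementation: the function names, the message count and the idea of a simple "unavailable" set are all assumptions made for illustration.

```python
import random
from collections import Counter

def pick_server(servers, is_available, rng):
    """Shuffle the candidate list and return the first server that accepts."""
    candidates = list(servers)
    rng.shuffle(candidates)
    for server in candidates:
        if is_available(server):
            return server
    return None  # no server could accept the message

def simulate(servers, unavailable, messages, seed=0):
    """Count how many messages each server ends up handling."""
    rng = random.Random(seed)
    counts = Counter({s: 0 for s in servers})
    for _ in range(messages):
        chosen = pick_server(servers, lambda s: s not in unavailable, rng)
        if chosen is not None:
            counts[chosen] += 1
    return counts

servers = ["SERVER1", "SERVER2", "SERVER3", "SERVER4", "SERVER5"]

# All five servers accepting connections: roughly one fifth each.
healthy = simulate(servers, unavailable=set(), messages=50_000)

# Two servers refusing connections: the remaining servers absorb the load.
degraded = simulate(servers, unavailable={"SERVER1", "SERVER2"}, messages=50_000)

for name in servers:
    print(name, healthy[name], degraded[name])
```

In this model each healthy server receives close to an equal share, which is the "pretty small degree of variation" you would expect to see in a daily traffic graph.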
Checking for Hub Transport Load Spikes
One of my first thoughts was that perhaps SERVER4 was being hit with a spike of traffic at some period of the day that was causing it to record more email traffic each day. It is entirely possible that some rogue device or application has been hard-coded to directly address SERVER4 for its SMTP needs.
I once again used message tracking log analysis of the past 30 days to generate a graph of the average email traffic across each hour of the day.
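For anyone wanting to reproduce this kind of hourly view, here is a minimal sketch of the averaging step. It assumes the timestamps have already been extracted from the message tracking logs as UTC strings in the form shown; the function name and the sample values are made up for illustration and are not the tooling I used.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout, e.g. 2012-01-10T09:00:00.000Z
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"

def hourly_average(timestamps):
    """Return {hour: average number of messages per day in that hour}."""
    totals = Counter()
    days = set()
    for ts in timestamps:
        dt = datetime.strptime(ts, TIMESTAMP_FORMAT)
        days.add(dt.date())
        totals[dt.hour] += 1
    if not days:
        return {}
    # Divide by the number of distinct days seen in the data.
    return {hour: totals[hour] / len(days) for hour in range(24)}

# Made-up sample spanning two days.
sample = [
    "2012-01-10T09:05:00.000Z",
    "2012-01-10T09:45:00.000Z",
    "2012-01-10T14:10:00.000Z",
    "2012-01-11T09:30:00.000Z",
]
averages = hourly_average(sample)
print(averages[9], averages[14])  # 1.5 0.5
```

A load spike from a misbehaving device would show up as one hour standing well above its neighbours on a single server.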
The graph turned out to be unremarkable, with no load spikes visible.
Checking for Top Senders
Since there was no apparent load spike, but the possibility remained that a particular host or application was sending a high volume of email throughout the day, the next angle of attack was to check the top senders (i.e. remote IP addresses) for email traffic through this server.
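A top senders count of this kind can be sketched in a few lines. The sketch assumes the message tracking logs are comma-separated text with a `#Fields:` header line naming the columns, and that the columns of interest are called `client-ip` and `event-id`; check those names against your own log files. The sample lines are invented and use a cut-down set of fields.

```python
import csv
from collections import Counter

def top_senders(lines, n=10, event="RECEIVE"):
    """Count matching events per remote IP and return the n busiest senders."""
    fields = None
    rows = []
    for line in lines:
        if line.startswith("#Fields:"):
            # The header comment names the columns for the rows that follow.
            fields = line[len("#Fields:"):].strip().split(",")
        elif line.strip() and not line.startswith("#"):
            rows.append(line)
    if fields is None:
        raise ValueError("no #Fields: header found")
    counts = Counter()
    for record in csv.DictReader(rows, fieldnames=fields):
        if record.get("event-id") == event and record.get("client-ip"):
            counts[record["client-ip"]] += 1
    return counts.most_common(n)

# Invented sample with a reduced field list.
sample = [
    "#Software: Microsoft Exchange Server",
    "#Fields: date-time,client-ip,server-hostname,event-id",
    "2012-01-10T09:00:00.000Z,10.1.1.50,SERVER1,RECEIVE",
    "2012-01-10T09:01:00.000Z,10.1.1.50,SERVER1,RECEIVE",
    "2012-01-10T09:02:00.000Z,10.1.1.60,SERVER1,RECEIVE",
    "2012-01-10T09:03:00.000Z,10.1.1.50,SERVER1,SEND",
]
print(top_senders(sample))  # [('10.1.1.50', 2), ('10.1.1.60', 1)]
```

Running the same count against each Hub Transport server in the site, and comparing the lists against the known IP addresses of your Exchange servers, makes gaps easy to spot.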
The results again were fairly unremarkable. But I also took a few extra moments to run the same message tracking log analysis on the other four Hub Transport servers in the site. This is where things took an interesting turn.
Where SERVER3-5 each showed an expected result for the top senders, SERVER1-2 showed interesting results. Those two servers had plenty of remote IP addresses logging hits on them, but a complete lack of any other Exchange servers in that list.
In other words, it began to appear that SERVER1 and SERVER2 were not handling any Exchange -> Exchange email traffic.
So what was all the email traffic they were logging?
SMTP Relay Traffic
Within this site we make available a DNS alias for applications and devices that need to use an SMTP service to send alerts or reports via email. This DNS alias is load balanced across both SERVER1 and SERVER2.
Now, considering that SERVER1 and SERVER2 should be processing their fair share of normal Exchange traffic, as well as the additional load of SMTP relay from applications and devices, you would expect their daily traffic graphs to be higher than the others in the site.
Instead, this is what a single day's email traffic amounts to on each Hub Transport server.
To confirm the suspicion that SERVER1 and SERVER2 were not processing any Exchange -> Exchange traffic at all I analysed the message tracking logs for hits from a sample of Hub Transport servers in other sites. When this data was collated the graph looked like this.
SERVER1 and SERVER2 are processing no intra-org email traffic coming in from other sites, and SERVER4 is having to pick up the slack. Although I was curious why the traffic still hadn't evenly load balanced across the other three Hub Transport servers, that was not the primary concern.
The real concern is why this is happening, and how we fix it.
What is Causing the Intra-Org Email Traffic Imbalance?
The root cause comes back to the SMTP relay configuration that is in place. SERVER1 and SERVER2 each have an additional Receive Connector configured for SMTP relay.
At the time these were implemented the servers had no additional network interfaces available in them, so the new Receive Connectors are bound to the same interface as the Default Receive Connector.
While this is something you can get away with, it is generally recommended that you dedicate an interface to a relay connector like this for reasons that I'm about to demonstrate.
When two Receive Connectors share the same interface and IP address they use the list of remote IP addresses configured on the connectors to determine which one should handle a particular connection.
Generally speaking the most specific match will determine which connector accepts a connection.
The Default Receive Connector specifies a remote IP range that could be described as “everything”.
When the default connector and a relay connector share an IP address, it is easy for the server to determine that a connection from a specific IP address that is explicitly listed in the remote IP list of the relay connector should be handled by the relay connector.
The trouble begins when administrators take a short cut and add entire subnets to the remote IP address list on the relay connector. If that subnet also contains other Exchange servers, connections from those Exchange servers will be processed by the relay connector, not by the default connector, because the subnet is considered more specific than the “everything” range that is on the default connector.
A receive connector configured for SMTP relay usage by non-Exchange systems typically lacks the authentication and permission settings that Exchange servers expect for Exchange -> Exchange communications. So the connection fails, and the sending Exchange server will attempt to use a different Hub Transport server.
In our case the connections to SERVER1 or SERVER2 were continually failing and SERVER4 was handling the extra load.
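The "most specific match wins" rule can be illustrated with a short sketch. This is a simplified model rather than Exchange's real logic: the connector names and IP addresses are hypothetical, and remote IP ranges are written in CIDR notation for convenience.

```python
import ipaddress

def select_connector(connectors, remote_ip):
    """Return the connector whose remote range most specifically matches remote_ip."""
    ip = ipaddress.ip_address(remote_ip)
    best = None  # (connector name, prefix length of the matching range)
    for name, ranges in connectors.items():
        for cidr in ranges:
            net = ipaddress.ip_network(cidr, strict=False)
            # A longer prefix means a narrower, more specific range.
            if ip in net and (best is None or net.prefixlen > best[1]):
                best = (name, net.prefixlen)
    return best[0] if best else None

# Hypothetical layout: an Exchange server at 10.1.1.20, an application at 10.1.1.50.
with_subnet = {
    "Default": ["0.0.0.0/0"],    # the "everything" range
    "Relay": ["10.1.1.0/24"],    # whole subnet added as a short cut
}
with_hosts = {
    "Default": ["0.0.0.0/0"],
    "Relay": ["10.1.1.50/32"],   # only the application that needs to relay
}

print(select_connector(with_subnet, "10.1.1.20"))  # Relay
print(select_connector(with_hosts, "10.1.1.20"))   # Default
print(select_connector(with_hosts, "10.1.1.50"))   # Relay
```

With the subnet entry in place, the Exchange server at 10.1.1.20 lands on the relay connector, which is exactly the situation that caused the failed connections described above.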
How to Fix the Transport Load Imbalance
With the root cause identified, the possible solutions were clear. We could either:
- Configure dedicated IP addresses for the relay connectors so that there is no confusion as to which receive connector the server should use to handle Exchange traffic vs application relay traffic.
- Remove all of the subnet entries from the remote IP address lists and replace them with only the specific IP addresses that should be permitted to relay.
Because the situation with available network interfaces had not changed, we went with option 2, with a note to use dedicated interfaces for relay connectors in future designs.
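Before cleaning up a relay connector's remote IP list it helps to know which entries are actually catching Exchange servers. The sketch below is one way to check, assuming you have exported the relay connector's remote ranges and a list of your Exchange server IP addresses; all of the addresses shown are hypothetical.

```python
import ipaddress

def overlapping_entries(relay_ranges, exchange_ips):
    """Return (range, [exchange IPs inside it]) for each relay range that covers an Exchange server."""
    hits = []
    for cidr in relay_ranges:
        net = ipaddress.ip_network(cidr, strict=False)
        covered = [ip for ip in exchange_ips if ipaddress.ip_address(ip) in net]
        if covered:
            hits.append((cidr, covered))
    return hits

# Hypothetical relay connector entries and Exchange server addresses.
relay_ranges = ["10.1.1.0/24", "10.1.2.15/32"]
exchange_ips = ["10.1.1.20", "10.1.3.20"]

for cidr, servers in overlapping_entries(relay_ranges, exchange_ips):
    print(f"{cidr} also matches Exchange servers: {', '.join(servers)}")
```

Any range this reports is a candidate for replacement with the specific IP addresses of the applications and devices that genuinely need to relay.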
The outcome is a more balanced load among the Hub Transport servers in the site, now completely in line with expectations and providing better performance and resiliency.