Load balancing SMTP traffic is something that makes sense for a lot of organizations. They have an investment in load balancers for their CAS array, web server farm, etc., so SMTP seems like another logical protocol to run through the load balancers to get all the benefits that they deliver.
However it is also quite easy to create a situation where SMTP traffic is not being load balanced as intended, and worse still there are scenarios where the use of some load balanced configurations may actually diminish SMTP high availability, or even undermine security.
Let’s take a look at some of the issues and how they can be identified and resolved.
Issues with Load Balancer Configurations
The first issues are reasonably easy to correct if they exist. These are primarily related to the configuration and features of the load balancer itself, such as:
- the priority of the target servers
- the load balancing method/algorithm used
- whether source NATing is being used
- the health monitors/probes
Consider the following scenario where incoming internet email is passed through an email security server/appliance, which is configured to then send the traffic to a load balancer for distribution to the Hub Transport servers. Various internal applications and systems also use the load balancer as their SMTP target.
Priority of Target Servers
In most load balancer configurations you can configure a priority or weight for the servers that are the targets of the traffic. Different vendors use their own terminology for this, but the general idea is that it provides the option to have preferred servers that will be considered first for a new connection if they are available.
Now there are situations where this is a deliberate design choice, and if that is your case then you may not need to worry about this particular issue. However there are some considerations to be aware of if you find that your servers are weighted differently for no particular reason.
Here is a traffic graph of a typical day for two servers that were configured with different weightings/priorities in the load balancer. You can see that SERVER1 handled a higher volume of traffic than SERVER2.
This graph was created by gathering traffic stats from the message tracking logs. For more information see Calculate Daily Email Traffic using Message Tracking Logs and Log Parser.
Depending on your server resources and traffic load this may not be an issue for you, but in some environments it could lead to load issues that interrupt mail flow. So if your actual intention is to evenly distribute traffic across multiple Hub Transport servers, you should consider adjusting the server weight/priority accordingly.
In the above scenario when the weightings were adjusted the traffic became more evenly distributed (not perfectly, but that is due to other factors in that environment which I will cover next).
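To illustrate how much a weighting difference can skew connection counts, here is a minimal Python sketch of one common weighted scheduling scheme (smooth weighted round-robin). It is not taken from any particular vendor's implementation, and vendors that use strict priorities rather than weights will behave differently. The server names come from the example above; the weights of 3 and 1 are assumptions chosen purely for illustration.

```python
from collections import Counter


def weighted_round_robin(weights, connections):
    """Distribute new connections using smooth weighted round-robin.

    `weights` maps server name -> integer weight (hypothetical values).
    Returns a Counter of connections assigned to each server.
    """
    current = {server: 0 for server in weights}
    total = sum(weights.values())
    assigned = Counter()
    for _ in range(connections):
        # Raise every server's running score by its configured weight.
        for server, weight in weights.items():
            current[server] += weight
        # Pick the highest score, then penalise the winner by the total weight.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        assigned[chosen] += 1
    return assigned


# Assumed weights: SERVER1 preferred 3:1 versus an even 1:1 split.
uneven = weighted_round_robin({"SERVER1": 3, "SERVER2": 1}, 400)
even = weighted_round_robin({"SERVER1": 1, "SERVER2": 1}, 400)
print("weighted 3:1 ->", dict(uneven))
print("weighted 1:1 ->", dict(even))
```

With these assumed weights SERVER1 is handed 300 of the 400 connections, while equal weights give each server 200. Real traffic will not be this tidy, but the ratio is the point.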
Load Balancing Method/Algorithm
Along similar lines to the previous issue, a load balancer will usually have multiple methods for deciding which server should be used for a connection. For example, the Kemp load balancers have quite a few scheduling options available.
If you’re seeing SMTP traffic imbalances similar to those in the previous example, and your server weighting/priority is not the cause, you should look at the load balancing method and investigate whether your current configuration is really the best suited for that traffic.
As one specific example, if the load balancing is based on source IP it may inadvertently lead to traffic imbalances. In the example environment described at the beginning of this article, source IP-based load balancing would generally result in well balanced traffic from the internal applications and systems, assuming each internal IP is sending roughly equal volumes of email. If they are not, some imbalance can still occur.
But that configuration may result in imbalanced email traffic coming from the internet (via the email security server/appliance), because that all appears to come from a single IP.
As the earlier graph showed, this was causing some imbalance in overall SMTP traffic even after the server weight/priority was reconfigured. That change resolved the traffic imbalance from internal sources, which are all on different IP addresses, but the incoming internet email was still treated as coming from a single IP and was almost entirely being sent to a single Hub Transport server.
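The effect is easy to reproduce with a small Python sketch. It assumes the load balancer maps each source IP to a server with a stable hash, which is only one possible way of implementing source IP-based distribution. The IP addresses and connection counts are hypothetical.

```python
import hashlib
from collections import Counter

SERVERS = ["SERVER1", "SERVER2"]


def pick_server(source_ip, servers=SERVERS):
    """Map a source IP to a server with a stable hash (one possible scheme)."""
    digest = hashlib.sha256(source_ip.encode("ascii")).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]


def distribute(connections):
    """Count connections per server for (source_ip, connection_count) pairs."""
    totals = Counter()
    for source_ip, count in connections:
        totals[pick_server(source_ip)] += count
    return totals


# Hypothetical traffic: 50 internal hosts making 100 connections each...
internal = [(f"10.1.2.{host}", 100) for host in range(1, 51)]
# ...and a single email security appliance relaying all internet email.
gateway = [("192.168.0.31", 5000)]

print("internal:", dict(distribute(internal)))
print("internet:", dict(distribute(gateway)))
```

The internal hosts are spread across both servers, although not necessarily evenly, while every connection from the single gateway IP lands on the same server no matter how many there are.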
The obvious reaction here may be to choose a different load balancing algorithm, however my recommendation for environments where incoming internet email all traverses a single host like that is to consider not using the load balancer for distribution of that incoming internet traffic.
I will explain my reasons for that in the next sections.
Source NATing
One of my concerns with source NATing and load balanced SMTP traffic is the impact it has on the protocol logs generated by the Hub Transport servers.
Note that much of the data presented in this section relies on protocol logging being turned on for all receive connectors on the Hub Transport servers.
[PS] C:\>Get-ReceiveConnector -Server SERVER1 | select name,protocollogginglevel | ft -auto

Name                 ProtocolLoggingLevel
----                 --------------------
Default SERVER1      Verbose
Client SERVER1       Verbose
Internal Relay       Verbose
Internet via Gateway Verbose
For more on protocol logging see Troubleshooting Email Delivery with Exchange Server Protocol Logging.
With all internal and incoming SMTP traffic going via the load balancer, which is source NATing the connections, the protocol logs only recorded traffic from the load balancer (IP 10.1.1.12 below) and no other IP addresses.
IP             Name                    Hits
-------------- ----------------------- -----
10.1.1.12      10.1.1.12               25976

Statistics:
-----------
Elements processed: 1428114
Elements output:    1
Execution time:     13.49 seconds
The above stats were collected using protocol logs and Log Parser. For more information see Report Top Sender IP’s on Exchange Server 2010 using Log Parser.
Looking at hits per receive connector (recorded as “connector-id” in protocol logs) there was no traffic being handled by the receive connector that was configured for internet traffic.
Connector                                          Hits
-------------------------------------------------- -------
SERVER1\Internal Relay                             1422080
SERVER1\Default SERVER1                            4363
While this doesn’t necessarily result in an email disruption for your environment, if you have a receive connector for a specific purpose and it is not being used for that intended purpose then your environment is not operating as intended.
Aside from that there is also the issue of being able to identify the relative traffic volume of internal vs internet email, if you’re relying on protocol log data to give you that information about your email traffic patterns.
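If you don't have Log Parser handy, a rough equivalent of these per-IP and per-connector counts can be sketched in Python. This is not the query used for the stats above. It assumes the usual SMTP receive protocol log layout, where a `#Fields:` header names columns including `connector-id`, `remote-endpoint` and `event`, and it assumes that a `+` event marks a new connection and that connector IDs take the form `SERVER\Connector`. The sample rows are made up for the sketch.

```python
import csv
from collections import Counter


def summarise_protocol_log(lines):
    """Count new SMTP sessions per remote IP and per receive connector."""
    fields = None
    per_ip, per_connector = Counter(), Counter()
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("#Fields:"):
            # The header row tells us the column order for the data rows.
            fields = [name.strip() for name in line[len("#Fields:"):].split(",")]
            continue
        if not line or line.startswith("#") or fields is None:
            continue
        row = dict(zip(fields, next(csv.reader([line]))))
        if row.get("event") != "+":  # assumption: "+" marks a new connection
            continue
        # remote-endpoint is "ip:port"; keep only the IP part.
        remote_ip = row["remote-endpoint"].rsplit(":", 1)[0]
        per_ip[remote_ip] += 1
        per_connector[row["connector-id"]] += 1
    return per_ip, per_connector


# Made-up sample rows, for illustration only.
SAMPLE = r"""#Software: Microsoft Exchange Server
#Fields: date-time,connector-id,session-id,sequence-number,local-endpoint,remote-endpoint,event,data,context
2013-04-26T09:40:12.000Z,SERVER1\Internal Relay,08D00F1A2B3C4D01,0,10.1.1.21:25,10.1.1.12:50412,+,,
2013-04-26T09:40:12.010Z,SERVER1\Internal Relay,08D00F1A2B3C4D01,1,10.1.1.21:25,10.1.1.12:50412,>,"220 SERVER1.domain.local Microsoft ESMTP MAIL Service ready",
2013-04-26T09:40:13.000Z,SERVER1\Internal Relay,08D00F1A2B3C4D02,0,10.1.1.21:25,10.1.1.12:50413,+,,
""".splitlines()

per_ip, per_connector = summarise_protocol_log(SAMPLE)
print(dict(per_ip))
print(dict(per_connector))
```

To run it against a real log you would pass an open file object instead of `SAMPLE`. If every session is attributed to the load balancer's IP and a single connector, as in the sample, source NATing is hiding the real senders.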
Depending on your incoming email routes there are multiple ways to respond to this issue.
In the example scenario used in this article the email security server has its own load balancing capability for incoming email because you can specify multiple internal hosts to deliver email to. This would also apply to hosted email security services.
By configuring each Hub Transport as an internal delivery target instead of just using the load balancer, the protocol logs now log incoming internet email as coming from the IP addresses for the email security system, rather than the load balancer.
IP             Name                    Hits
-------------- ----------------------- -----
10.1.1.12      10.1.1.12               24819
192.168.0.32   192.168.0.32            115
192.168.0.31   192.168.0.31            105

Statistics:
-----------
Elements processed: 1397172
Elements output:    3
Execution time:     22.47 seconds
If you do not have an email security server/appliance or other hosted solution, and SMTP connections go directly from the internet to the load balancer, then you could look at using multiple MX records instead, although this would require the availability of multiple public IP addresses.
In addition, any traffic imbalance being caused by the use of source IP-based load balancing should no longer be present. This graph represents incoming internet SMTP connections per server, which began imbalanced and then evened out almost precisely once the load balancer was bypassed.
And importantly, with traffic bypassing the load balancer it should be getting handled by the intended receive connector (which I will explore more in the section further down on security implications).
Connector                                          Hits
-------------------------------------------------- -------
SERVER1\Internal Relay                             1257702
SERVER1\Internet via Gateway                       6374

Statistics:
-----------
Elements processed: 1267529
Elements output:    2
Execution time:     3.23 seconds
Health Monitors and Probes
Yet another issue with load balancing SMTP is the nature of how load balancers detect service availability.
Most load balancers that are service-aware have a health monitor or probe that makes an SMTP connection to the Hub Transport server, waits for a sign that the service is responding, then disconnects. That sign may be simply waiting for the SMTP banner to be returned, or waiting for a response to HELO.
For example, here is the protocol log data for a health check by a load balancer:
"220 SERVER1.domain.local Microsoft ESMTP MAIL Service ready at Fri, 26 Apr 2013 09:40:12 +1000"
helo domain.com
250 server1.domain.com Hello [10.1.1.10]
quit
221 2.0.0 Service closing transmission channel
That probe may detect complete service failures, but won’t necessarily detect back pressure if it only goes as far as a HELO.
For example, I pushed one of my test lab servers into “medium” back pressure and then used Telnet to connect and test the response.
As you can see below it was only when I progressed the SMTP conversation past HELO and into the “mail from:” stage that the server returned the familiar 452 4.3.1 Insufficient system resources error, but only for external senders.
220 HO-EX2010-MB1.exchangeserverpro.net Microsoft ESMTP MAIL Service ready at Mon, 29 Apr 2013 19:55:12 +1000
helo
250 HO-EX2010-MB1.exchangeserverpro.net Hello [10.1.1.4]
mail from: firstname.lastname@example.org
452 4.3.1 Insufficient system resources
mail from:email@example.com
250 2.1.0 Sender OK
So this server would be rejecting incoming internet email (the sender from @gmail.com), even though the load balancer considers the server to be healthy and available.
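As a rough idea of what a deeper check could look like, here is a Python sketch of a probe that continues past HELO to MAIL FROM and treats a 4xx reply as a degraded server. It is not how any particular load balancer implements its health monitor. The function names, the three verdicts and the choice of sender address are my own assumptions, and the sender needs to be an external address for the back pressure rejection shown above to appear.

```python
import smtplib


def classify_probe(banner_code, helo_code, mail_from_code):
    """Turn the three reply codes of a probe into a health verdict."""
    if banner_code != 220 or helo_code != 250:
        return "down"
    if mail_from_code == 250:
        return "healthy"
    if 400 <= mail_from_code < 500:
        return "degraded"  # e.g. 452 4.3.1 Insufficient system resources
    return "down"


def probe(host, sender, helo_name, port=25, timeout=10):
    """Open an SMTP session, go as far as MAIL FROM, then reset and quit."""
    try:
        smtp = smtplib.SMTP(timeout=timeout)
        banner_code, _ = smtp.connect(host, port)
        try:
            helo_code, _ = smtp.helo(helo_name)
            mail_code, _ = smtp.mail(sender)
            smtp.rset()  # abandon the transaction so no message is queued
        finally:
            smtp.quit()
    except (OSError, smtplib.SMTPException):
        # Connection refused, timeout or dropped session all count as down.
        return "down"
    return classify_probe(banner_code, helo_code, mail_code)
```

A call such as `probe("server1.domain.local", "probe@example.com", "monitor.domain.local")`, with hypothetical host names, would report "degraded" for the session shown above, whereas a banner-only or HELO-only check would report the server as healthy.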
If you combine this service-awareness issue with the problem of all email coming from one IP address (i.e. the email security server/appliance) being distributed only to the server that is suffering back pressure, you can end up with an email disruption for your end users.
Admittedly the combination of factors required to cause that problem scenario may be uncommon, but the potential impact is quite high.
Security Implications
Another issue with some load balanced SMTP configurations is how it can impact the security of your Exchange environment.
The first potential impact is for distribution groups that are configured to require that all senders be authenticated but are otherwise not restricted as to who can send to them (this is the default for distribution groups created in Exchange 2007 and later).
Because some administrators add the source NAT address(es) of the load balancers into the list of remote IP addresses on their internal relay connectors configured in Exchange, this results in any sender that is coming via the load balancer being considered as authenticated and therefore allowed to send to the distribution list.
For internal relay connectors that aren’t exposed to the outside world this may only be a minor inconvenience.
Where this becomes more serious is when incoming internet email traffic arrives via that same load balancer, and can send email to any recipient anywhere – in other words, you’ve got an open relay.
This is a Telnet session from outside of my test lab firewall, through to the load balancer’s IP address, and I am able to relay an email through my Exchange servers.
250 HO-EX2010-MB2.exchangeserverpro.net Hello [10.1.1.12]
mail from: firstname.lastname@example.org
250 2.1.0 Sender OK
rcpt to: email@example.com
250 2.1.5 Recipient OK
data
354 Start mail input; end with <CRLF>.<CRLF>
subject: test relay
test
.
250 2.6.0 <546b08e1-fd0f-4baa-a473-03fba110a1af@HO-EX2010-MB2.exchangeserverpro.net> [InternalId=334267] Queued mail for delivery
This occurs because the source NATing causes Exchange to believe that the email is originating from the load balancer (10.1.1.12), and that IP address is configured as a remote IP address on the internal relay connector.
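The following Python sketch models that behaviour in a simplified way. It assumes a connection is matched to the connector whose remote IP range most specifically covers the address Exchange sees, and it ignores local bindings and ports. The connector names and the load balancer IP come from this article, but the IP ranges assigned to each connector are hypothetical.

```python
import ipaddress

# Hypothetical connector scopes, loosely modelled on the example environment.
CONNECTORS = {
    "Default SERVER1": ["0.0.0.0/0"],
    "Internal Relay": ["10.1.1.12/32", "10.1.2.0/24"],
    "Internet via Gateway": ["192.168.0.31/32", "192.168.0.32/32"],
}

LOAD_BALANCER_SNAT_IP = "10.1.1.12"


def match_connector(client_ip, connectors=CONNECTORS):
    """Return the connector whose remote IP range most specifically covers client_ip."""
    address = ipaddress.ip_address(client_ip)
    best_name, best_prefix = None, -1
    for name, ranges in connectors.items():
        for cidr in ranges:
            network = ipaddress.ip_network(cidr)
            if address in network and network.prefixlen > best_prefix:
                best_name, best_prefix = name, network.prefixlen
    return best_name


def connector_for(real_client_ip, via_snat_load_balancer):
    """Exchange only sees the peer address of the TCP connection."""
    seen_ip = LOAD_BALANCER_SNAT_IP if via_snat_load_balancer else real_client_ip
    return match_connector(seen_ip)


# An arbitrary internet host (documentation address), with and without source NAT.
print(connector_for("203.0.113.50", via_snat_load_balancer=True))
print(connector_for("203.0.113.50", via_snat_load_balancer=False))
```

In this model the internet host is handled by the Internal Relay connector whenever it arrives via the source NATing load balancer, and by the default connector when it connects directly, which is the difference between being allowed to relay and not.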
Ideally if internet email traffic is coming in directly to a load balancer, and the load balancer has no other mechanism for preventing an open relay scenario, then you should ensure that the receive connectors configured for internal applications and systems to relay email are not also handling the internet email traffic.
This could be achieved by using a different VIP and source NAT pool on the load balancer for that traffic, so that it does not get included in the remote IP range for the internal relay connector.
Summary
I’ve covered a lot of points in this article and before you get too alarmed I want to make a few things clear.
Firstly, not all of these scenarios are necessarily bad. A traffic imbalance may not be a concern for smaller networks, and may even be a deliberate configuration in some situations.
The impact on protocol logs may not be a concern for administrators who simply do not make any use of the data they contain.
Limitations around health probes/monitoring by the load balancer may not be a concern if you have other robust enterprise monitoring systems alerting you to those conditions already.
Distribution groups being emailed by unauthenticated senders may not be an issue if there is spam filtering in place, and if the organization actually engages in a lot of group email with external parties.
And the sharing of a relay connector for both internal (trusted) and incoming (untrusted) email may not be an immediate issue if the incoming traffic first passes through another device or host that blocks the relay attempt (eg an email security server/appliance).
However, if you do have any concerns about any of these issues I’ve raised then it would be wise to review your configurations, perform some testing, and consider whether there is a better configuration you could move to that mitigates any issues you are actually experiencing.
We have Exchange 2016 Hybrid server with SMTP Relay. So how do I patch this server with downtime for our applications?
We have F5 load balancers. Currently smtp.mydomain.com points to the VIP of the load balancer.
Great article! Found this by searching for how to achieve HA for hybrid Exchange inbound SMTP. The traffic should be direct (no SMTP hosts in between EXO and on-prem), so two options come to mind: HLB and DNS RR. We will probably go with DNS RR. It not only looks much simpler but also avoids the potential risks that an HLB could involve.
We have an L7 Radware HLB through which we route SMTP traffic on port 587. When we disable NATing on the HLB we are not able to telnet to Exchange; it shows service not available, although the source IP has been added to the receive connector. On Exchange we see a remote socket error in the SMTP receive logs.
What needs to be done on Exchange to resolve this?
When we bypass the HLB everything works perfectly.
Can you advise me of a way to achieve SMTP high availability? I’m talking SMTP failover if Back Pressure occurs on one of my exchange servers. I’ve been struggling with this and I’m currently unable to find any useful solution.
Some load balancers can do SMTP health checks that will detect the back pressure condition. Others just check for a listening port 25. So it depends on what load balancer you’re using. I suggest referring to their documentation or contacting their support for assistance with the details.
This is a great post, so just wanted to leave some feedback and thank you for putting it up (even if I am 2 years late to the party)!
I’m from a networking background and working on an issue affecting a load balanced email environment across multiple DCs using Cisco OTV to extend VLANs and it definitely helped me understand the traffic flows better when using Email SEG’s and upstream cloud based proxies.
I had a little experience in that area but this post will definitely help me troubleshoot the issue further.
Thanks for the thorough explanation.
One of my Hub servers (E2K10) does not respond to telnet even locally (telnet localhost 25), so it's not routing any mail.
How can I find clues about the issue?
Check if the Transport service is running. Check if you can ping/tracert etc to the server. Check whether a firewall is blocking it.
Great post on some of the decisions regarding whether to use HLB or just plain MX records for SMTP traffic. The Health Probe discussion is particularly interesting however for the issues in traffic load and security related to Source NAT by the HLB there are very easy ways to resolve this that I’d like to discuss for anyone who wishes to continue using their HLB.
1. The problem with the load is actually with the “persistence” algorithm, not the “scheduling” algorithm. Something like Round Robin for scheduling would send an even number of new connections to each SMTP Server however with Source IP persistence enabled returning SMTP senders would be sent back to the same backend server. By keeping Round Robin enabled but setting your Persistence method to “None” you would ensure that each new SMTP connection (regardless of the Source IP) would be rebalanced and get an even spread.
2. While many people might have a configuration where the requests to the SMTP servers come from the IP of the HLB (called Non-Transparent by KEMP) this is optional in most cases. Providing you meet the necessary network configuration requirements so traffic routes properly there is no reason why the load balancer can’t pass on the request using the original source IP of the client. This would ensure both logging and your SMTP connector configuration function as you would expect. If you’re a KEMP customer just contact Support and ask them for advice on configuring “Transparency”.
3. If you can’t meet the necessary requirements for passing on the original source IP there is another way you can work around the problem of Source IP to ensure the security of your SMTP connectors (but you would have limited visibility in your logs as you mentioned). To do this you need to set up 2 separate virtual services, one for internal use and a 2nd for external use. It is possible to configure in KEMP (and I assume most HLB) for a Virtual Service to use a specific IP for connections to the Real Server. This would allow your SMTP servers to identify whether the internal or external VIP was being used for the connection. Your SMTP connectors would be set so that you had different security profiles depending on which of these 2 VIPs was used for the connection. By locking down access to the internally used VIP you can maintain security. I have also helped a customer use this method in order to allow unauthenticated SMTP traffic from 2 specific internal application servers where the application did not allow for SMTP authentication but where they still wanted all other connections to use auth. It worked perfectly in this situation.
I’d be very interested in more information on how the SMTP health check can be improved, might be something for the Dev team to work on 🙂
Thanks again for a great discussion on some of the pitfalls in load balancing SMTP
Hi Ben, thanks for clarifying that persistence vs scheduling point.
And for spelling out the options for working around these issues. Exactly the type of info I was hoping to draw out from the experts in this field.
Regarding the health checks, I sent a note to Bhargav explaining one particular back pressure scenario. I think he’ll see straight away where potential improvements can be made but I’m happy to discuss further if you want to loop me in via email.
Very nice Blog!
We too have had this discussion internally, which is why we now recommend L4 DR mode for the HT role (Kemp call this DSR).
With L4 DR mode it’s source IP transparent already, plus the only requirement is that you configure a loopback adapter with the VIP address on your real servers (also disabling strong host behaviour for Server 2008+). From my point of view this is the path of least resistance as I don’t need to implement 2 subnets or mess about with anything else.
For Exchange 2013 we recommend that the whole config is done using L4 DR mode except for cases where the real servers are a router hop away.