Score:1

Postfix - Troubleshooting connection timeouts for all servers except one

lr flag
Cos

I have inherited a postfix server running on RedHat. It is an undocumented build, but critical for business operations (Don't we all love those?)

It has developed issues with delayed and backlogged mail delivery. The issues were first reported a few weeks ago, but could go back an indefinite amount of time.

My *nix experience is rusty, but I have been able to poke around the system enough to determine that when the server is experiencing backlog it reports connection timeouts to the upstream SMTP relays at my organization.

Example Error:

*3D27412A016
4187 Tue Apr 19 17:04:26
[email protected]

(delivery temporarily suspended: connect to UpstreamRelayA4.doi.net[10.xx.xx.206]:25: Connection timed out)

[email protected]*

However, the upstream relay owners report they have no matching errors in their logs from this SMTP server. For my organization there is a single MX record with 4 relay servers included. All 4 can be reached from my SMTP server via telnet on port 25, however, 3 of the 4 are timing out in the postfix logs.

Any tips on how to track down why postfix thinks they are timing out?

Added 4/20/22 - postconf -n output

    [USERNAME@mailer ~]$ postconf -n

alias_database = hash:/etc/aliases

alias_maps = hash:/etc/aliases

broken_sasl_auth_clients = yes

command_directory = /usr/sbin

config_directory = /etc/postfix

daemon_directory = /usr/libexec/postfix

debug_peer_level = 2

debugger_command = PATH=/bin:/usr/bin:/usr/local/bin:/usr/X11R6/bin xxgdb
 
$daemon_directory/$process_name $process_id & sleep 5

disable_vrfy_command = yes

html_directory = no

inet_interfaces = all

inet_protocols = ipv4

local_recipient_maps =

mail_owner = postfix

mail_spool_directory = /var/mail

mailbox_size_limit = 0

mailq_path = /usr/bin/mailq.postfix

manpage_directory = /usr/share/man

maximal_queue_lifetime = 1d

message_size_limit = 30720000

mydestination = $myhostname, localhost.$mydomain, localhost

myhostname = mailer.domain.org.com

mynetworks = 

127.0.0.0/8,165.83.0.0/16,10.0.0.0/8,64.241.25.0/24,172.16.0.0/12

myorigin = $mydomain

newaliases_path = /usr/bin/newaliases.postfix

queue_directory = /var/spool/postfix

readme_directory = /usr/share/doc/postfix-2.10.1/README_FILES

relayhost = relayLOCATION.parentorg.com

sample_directory = /usr/share/doc/postfix-2.10.1/samples

sendmail_path = /usr/sbin/sendmail.postfix

setgid_group = postdrop

smtp_tls_note_starttls_offer = yes

smtp_use_tls = yes

smtpd_delay_reject = yes

smtpd_helo_required = yes

smtpd_helo_restrictions = 
permit_mynetworks,reject_non_fqdn_helo_hostname,reject_invalid_helo_hostname,permit

smtpd_policy_service_max_idle = 5s

smtpd_recipient_restrictions = permit_sasl_authenticated,permit_mynetworks,reject_unauth_destination

smtpd_sasl_auth_enable = yes

smtpd_sasl_authenticated_header = yes

smtpd_sasl_local_domain =

smtpd_sasl_security_options = noanonymous

smtpd_sender_restrictions = permit_mynetworks,reject_non_fqdn_sender,permit

smtpd_tls_CAfile = /etc/postfix/ssl/mailer_DOMAIN_ORG_COM.pem

smtpd_tls_auth_only = no

smtpd_tls_cert_file = /etc/postfix/ssl/mailer_DOMAIN_ORG_COM.crt

smtpd_tls_key_file = /etc/postfix/ssl/mailer_DOMAIN_ORG_COM.key

smtpd_tls_loglevel = 1

smtpd_tls_security_level = may

smtpd_use_tls = yes

tls_random_source = dev:/dev/urandom

transport_maps = hash:/etc/postfix/transport

[USERNAME@mailer ~]$

Added 4/20/22 - postconf -M output

[USERNAME@mailer ~]$ postconf -M
smtp       inet  n       -       n       -       -       smtpd
pickup     fifo  n       -       n       60      1       pickup
cleanup    unix  n       -       n       -       0       cleanup
qmgr       fifo  n       -       n       300     1       qmgr
tlsmgr     unix  -       -       n       1000?   1       tlsmgr
rewrite    unix  -       -       n       -       -       trivial-rewrite
bounce     unix  -       -       n       -       0       bounce
defer      unix  -       -       n       -       0       bounce
trace      unix  -       -       n       -       0       bounce
verify     unix  -       -       n       -       1       verify
flush      unix  n       -       n       1000?   0       flush
proxymap   unix  -       -       n       -       -       proxymap
smtp       unix  -       -       n       -       -       smtp
relay      unix  -       -       n       -       -       smtp -o fallback_relay=
showq      unix  n       -       n       -       -       showq
error      unix  -       -       n       -       -       error
discard    unix  -       -       n       -       -       discard
local      unix  -       n       n       -       -       local
virtual    unix  -       n       n       -       -       virtual
lmtp       unix  -       -       n       -       -       lmtp
anvil      unix  -       -       n       -       1       anvil
scache     unix  -       -       n       -       1       scache
maildrop   unix  -       n       n       -       -       pipe flags=DRhu user=vmail argv=/usr/local/bin/maildrop -d ${recipient}
old-cyrus  unix  -       n       n       -       -       pipe flags=R user=cyrus argv=/usr/lib/cyrus-imapd/deliver -e -m ${extension} ${user}
cyrus      unix  -       n       n       -       -       pipe user=cyrus argv=/usr/lib/cyrus-imapd/deliver -e -r ${sender} -m ${extension} ${user}
uucp       unix  -       n       n       -       -       pipe flags=Fqhu user=uucp argv=uux -r -n -z -a$sender - $nexthop!rmail ($recipient)
ifmail     unix  -       n       n       -       -       pipe flags=F user=ftn argv=/usr/lib/ifmail/ifmail -r $nexthop ($recipient)
bsmtp      unix  -       n       n       -       -       pipe flags=Fq. user=foo argv=/usr/local/sbin/bsmtp -f $sender $nexthop $recipient
retry      unix  -       -       n       -       -       error
proxywrite unix  -       -       n       -       1       proxymap
[USERNAME@mailer ~]$

Added 4/20/22 - Devices between mail relays

We do not have visibility into the network or security appliances between the relays. Traceroute only indicates 3 hops, all of which are most likely standard routers based on their IP addresses in our network layout.

Added 4/20/22 - Postfix Version

Postfix appears to be version 2.10.1 , which would place the install around 2013 per the Postfix Releases page

Added 4/22/22 - openssl Connection test

[USERNAME@mailer ~]$ openssl s_client -connect UPSTREAM_RELAY.ORG.net:25 -starttls smtp -crlf
CONNECTED(00000003)
depth=1 DC = net, DC = ORG, CN = CA_Server
verify error:num=20:unable to get local issuer certificate
---
Certificate chain
 0 s:/C=US/ST=STATE/L=CITY/O=PARENT_ORG/OU=PARENT_ORG/CN=UPSTREAM_RELAY.ORG.net
   i:/DC=net/DC=ORG/CN=CA_Server
 1 s:/DC=net/DC=ORG/CN=CA_Server
   i:/CN=ORGRootCA2
---
Server certificate
-----BEGIN CERTIFICATE-----

[Cert contents removed]

-----END CERTIFICATE-----
subject=/C=US/ST=STATEA/L=CITY/O=PARENT_ORG/OU=PARENT_ORG/CN=UPSTREAM_RELAY.ORG.net
issuer=/DC=net/DC=ORG/CN=CA_Server
---
No client certificate CA names sent
Peer signing digest: SHA1
Server Temp Key: ECDH, P-384, 384 bits
---
SSL handshake has read 5841 bytes and written 538 bytes
---
New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-SHA384
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : ECDHE-RSA-AES256-SHA384
    Session-ID: [REMOVED]
    Session-ID-ctx:
    Master-Key: [REMOVED]
    Key-Arg   : None
    Krb5 Principal: None
    PSK identity: None
    PSK identity hint: None
    Start Time: 1650649689
    Timeout   : 300 (sec)
    Verify return code: 20 (unable to get local issuer certificate)
---
250 XSHADOWREQUEST
 (Functional console after this)

Added 4/22/22 - Maillog grep for non-working server

[USERNAME@mailer ~]$  sudo mailq | grep  UPSTREAM_RELAY_103.ORG.net

(delivery temporarily suspended: conversation with UPSTREAM_RELAY_103.ORG.net[10.x.x.125] timed out while sending end of data -- message may be sent more than once)

(conversation with UPSTREAM_RELAY_103.ORG.net[10.x.x.125] timed out while sending end of data -- message may be sent more than once)

[Removed duplicates, All entries for that server are exactly the same two messages]

Final edit 4/27/2022

In troubleshooting last week, we discovered that /etc/resolv.conf had a nameserver that no longer exists. After removing this and restarting postfix we no longer appear to be getting timeouts in the logs, and mail flows rapidly.

As mentioned by @anx in the comments, this doesn't make a ton of sense in relation to the connection timeouts, but as soon as it was fixed and postfix was restarted our outbound sends drastically increased in speed, and we have not had any delay issues since, despite adding over 20,000 extra outbound test emails per day (Roughly a 30% increase over regular mail volume).

Cos avatar
lr flag
Cos
@anx Info added to post. As best as we can tell, name resolution is ok. There haven't been any documented changes recently and we can reach all 4 outbound relays via telnet on port 25 with via hostname and IP with no issue. We have some logging of outbound mail counts. No recent changes in volume. We average 50k-70K outbound per day, and are still within that range if you discard my bulk sends designed to intentionally induce the backlog for troubleshooting. Restarting postfix doesn't provide visible errors, but if there's a log\verbose to check I'm happy to give it a shot.
anx avatar
fr flag
anx
(n.b. inspecting `mailq` output is not equivalent with searching logs: mailq lists the reason for the *last attempt* per queue file, but would not necessarily contain additional cues that could have been written in the logs for *earlier* delivery attempts of the same message.)
Cos avatar
lr flag
Cos
@anx We may have figured this out. In /etc/resolv.conf we had a nameserver that no longer exists. After removing this we no longer appear to be getting timeouts in the logs. I will check back next week to see if the issue is fully resolved. Your guidance to check old logs was the tip we needed. I went back and checked the historical maillog archives. They indicated that the issue started years ago, but apparently no one was monitoring the server enough to notice. That would potentially line up with the old DNS server going offline, but no one has been around long enough to remember.
anx avatar
fr flag
anx
Thanks for the update. Even if it solves your immediate issue (please put the details in an answer!) there still seems to be something missing here, though: How could a DNS issue - resulting in an error after giving up after some *seconds* - lead to the server failing to respond (at least with "try again later") within the timeouts of some *minutes*?
Cos avatar
lr flag
Cos
@anx Final update has been added to post. We still can't figure out why we this was failing, but we can reproduce the issue immediately by adding any invalid nameserver IP to /etc/resolv.conf . If this was a modern build I'd probably try to report it as a bug and see if anyone with developer knowledge of the situation could track it down, but given the age of the installed version and OS build we're going to build a replacement from scratch with a fresh version of postfix. Thanks for all of your help.
Score:1
lr flag
Cos

As best as we were able to figure out, this issue was caused by having an invalid DNS entry in /etc/resolv.conf . Once the bad entry was removed we stopped having issues in the logs, and mail returned to flowing properly with minimal outbound delay.

mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.