I have inherited a postfix server running on RedHat. It is an undocumented build, but critical for business operations (Don't we all love those?)
It has developed issues with delayed and backlogged mail delivery. The issues were first reported a few weeks ago, but could go back an indefinite amount of time.
My *nix experience is rusty, but I have been able to poke around the system enough to determine that when the server is experiencing backlog it reports connection timeouts to the upstream SMTP relays at my organization.
Example Error:
*3D27412A016
4187
Tue Apr 19 17:04:26
[email protected]
(delivery temporarily suspended: connect to UpstreamRelayA4.doi.net[10.xx.xx.206]:25: Connection timed out)
[email protected]*
However, the upstream relay owners report they have no matching errors in their logs from this SMTP server. For my organization there is a single MX record with 4 relay servers included. All 4 can be reached from my SMTP server via telnet on port 25, however, 3 of the 4 are timing out in the postfix logs.
Any tips on how to track down why postfix thinks they are timing out?
Added 4/20/22 - postconf -n output
[USERNAME@mailer ~]$ postconf -n
alias_database = hash:/etc/aliases
alias_maps = hash:/etc/aliases
broken_sasl_auth_clients = yes
command_directory = /usr/sbin
config_directory = /etc/postfix
daemon_directory = /usr/libexec/postfix
debug_peer_level = 2
debugger_command = PATH=/bin:/usr/bin:/usr/local/bin:/usr/X11R6/bin xxgdb
$daemon_directory/$process_name $process_id & sleep 5
disable_vrfy_command = yes
html_directory = no
inet_interfaces = all
inet_protocols = ipv4
local_recipient_maps =
mail_owner = postfix
mail_spool_directory = /var/mail
mailbox_size_limit = 0
mailq_path = /usr/bin/mailq.postfix
manpage_directory = /usr/share/man
maximal_queue_lifetime = 1d
message_size_limit = 30720000
mydestination = $myhostname, localhost.$mydomain, localhost
myhostname = mailer.domain.org.com
mynetworks =
127.0.0.0/8,165.83.0.0/16,10.0.0.0/8,64.241.25.0/24,172.16.0.0/12
myorigin = $mydomain
newaliases_path = /usr/bin/newaliases.postfix
queue_directory = /var/spool/postfix
readme_directory = /usr/share/doc/postfix-2.10.1/README_FILES
relayhost = relayLOCATION.parentorg.com
sample_directory = /usr/share/doc/postfix-2.10.1/samples
sendmail_path = /usr/sbin/sendmail.postfix
setgid_group = postdrop
smtp_tls_note_starttls_offer = yes
smtp_use_tls = yes
smtpd_delay_reject = yes
smtpd_helo_required = yes
smtpd_helo_restrictions =
permit_mynetworks,reject_non_fqdn_helo_hostname,reject_invalid_helo_hostname,permit
smtpd_policy_service_max_idle = 5s
smtpd_recipient_restrictions = permit_sasl_authenticated,permit_mynetworks,reject_unauth_destination
smtpd_sasl_auth_enable = yes
smtpd_sasl_authenticated_header = yes
smtpd_sasl_local_domain =
smtpd_sasl_security_options = noanonymous
smtpd_sender_restrictions = permit_mynetworks,reject_non_fqdn_sender,permit
smtpd_tls_CAfile = /etc/postfix/ssl/mailer_DOMAIN_ORG_COM.pem
smtpd_tls_auth_only = no
smtpd_tls_cert_file = /etc/postfix/ssl/mailer_DOMAIN_ORG_COM.crt
smtpd_tls_key_file = /etc/postfix/ssl/mailer_DOMAIN_ORG_COM.key
smtpd_tls_loglevel = 1
smtpd_tls_security_level = may
smtpd_use_tls = yes
tls_random_source = dev:/dev/urandom
transport_maps = hash:/etc/postfix/transport
[USERNAME@mailer ~]$
Added 4/20/22 - postconf -M output
[USERNAME@mailer ~]$ postconf -M
smtp inet n - n - - smtpd
pickup fifo n - n 60 1 pickup
cleanup unix n - n - 0 cleanup
qmgr fifo n - n 300 1 qmgr
tlsmgr unix - - n 1000? 1 tlsmgr
rewrite unix - - n - - trivial-rewrite
bounce unix - - n - 0 bounce
defer unix - - n - 0 bounce
trace unix - - n - 0 bounce
verify unix - - n - 1 verify
flush unix n - n 1000? 0 flush
proxymap unix - - n - - proxymap
smtp unix - - n - - smtp
relay unix - - n - - smtp -o fallback_relay=
showq unix n - n - - showq
error unix - - n - - error
discard unix - - n - - discard
local unix - n n - - local
virtual unix - n n - - virtual
lmtp unix - - n - - lmtp
anvil unix - - n - 1 anvil
scache unix - - n - 1 scache
maildrop unix - n n - - pipe flags=DRhu user=vmail argv=/usr/local/bin/maildrop -d ${recipient}
old-cyrus unix - n n - - pipe flags=R user=cyrus argv=/usr/lib/cyrus-imapd/deliver -e -m ${extension} ${user}
cyrus unix - n n - - pipe user=cyrus argv=/usr/lib/cyrus-imapd/deliver -e -r ${sender} -m ${extension} ${user}
uucp unix - n n - - pipe flags=Fqhu user=uucp argv=uux -r -n -z -a$sender - $nexthop!rmail ($recipient)
ifmail unix - n n - - pipe flags=F user=ftn argv=/usr/lib/ifmail/ifmail -r $nexthop ($recipient)
bsmtp unix - n n - - pipe flags=Fq. user=foo argv=/usr/local/sbin/bsmtp -f $sender $nexthop $recipient
retry unix - - n - - error
proxywrite unix - - n - 1 proxymap
[USERNAME@mailer ~]$
Added 4/20/22 - Devices between mail relays
We do not have visibility into the network or security appliances between the relays. Traceroute only indicates 3 hops, all of which are most likely standard routers based on their IP addresses in our network layout.
Added 4/20/22 - Postfix Version
Postfix appears to be version 2.10.1 , which would place the install around 2013 per the Postfix Releases page
Added 4/22/22 - openssl Connection test
[USERNAME@mailer ~]$ openssl s_client -connect UPSTREAM_RELAY.ORG.net:25 -starttls smtp -crlf
CONNECTED(00000003)
depth=1 DC = net, DC = ORG, CN = CA_Server
verify error:num=20:unable to get local issuer certificate
---
Certificate chain
0 s:/C=US/ST=STATE/L=CITY/O=PARENT_ORG/OU=PARENT_ORG/CN=UPSTREAM_RELAY.ORG.net
i:/DC=net/DC=ORG/CN=CA_Server
1 s:/DC=net/DC=ORG/CN=CA_Server
i:/CN=ORGRootCA2
---
Server certificate
-----BEGIN CERTIFICATE-----
[Cert contents removed]
-----END CERTIFICATE-----
subject=/C=US/ST=STATEA/L=CITY/O=PARENT_ORG/OU=PARENT_ORG/CN=UPSTREAM_RELAY.ORG.net
issuer=/DC=net/DC=ORG/CN=CA_Server
---
No client certificate CA names sent
Peer signing digest: SHA1
Server Temp Key: ECDH, P-384, 384 bits
---
SSL handshake has read 5841 bytes and written 538 bytes
---
New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-SHA384
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
Protocol : TLSv1.2
Cipher : ECDHE-RSA-AES256-SHA384
Session-ID: [REMOVED]
Session-ID-ctx:
Master-Key: [REMOVED]
Key-Arg : None
Krb5 Principal: None
PSK identity: None
PSK identity hint: None
Start Time: 1650649689
Timeout : 300 (sec)
Verify return code: 20 (unable to get local issuer certificate)
---
250 XSHADOWREQUEST
(Functional console after this)
Added 4/22/22 - Maillog grep for non-working server
[USERNAME@mailer ~]$ sudo mailq | grep UPSTREAM_RELAY_103.ORG.net
(delivery temporarily suspended: conversation with UPSTREAM_RELAY_103.ORG.net[10.x.x.125] timed out while sending end of data -- message may be sent more than once)
(conversation with UPSTREAM_RELAY_103.ORG.net[10.x.x.125] timed out while sending end of data -- message may be sent more than once)
[Removed duplicates, All entries for that server are exactly the same two messages]
Final edit 4/27/2022
In troubleshooting last week, we discovered that /etc/resolv.conf had a nameserver that no longer exists. After removing this and restarting postfix we no longer appear to be getting timeouts in the logs, and mail flows rapidly.
As mentioned by @anx in the comments, this doesn't make a ton of sense in relation to the connection timeouts, but as soon as it was fixed and postfix was restarted our outbound sends drastically increased in speed, and we have not had any delay issues since, despite adding over 20,000 extra outbound test emails per day (Roughly a 30% increase over regular mail volume).