I've discovered that my sendmail configuration does not always try a
secondary MX host if the primary MX does not answer. Sometimes it does,
more often it doesn't.
I think my questions are: 1) how does sendmail decide when to give up on
a given MX and try the next one? and 2) how can I debug what is (or is
not) happening?
To work on this, I set up the name mytest.freefriends.org (my own domain)
with an unroutable 10.x primary MX, and a good secondary:
mytest IN MX 1 nonesuch.freefriends.org.
mytest IN MX 10 goodmx.freefriends.org.
nonesuch IN A 10.10.10.10
In the real cases, the primary MX is a regular host, reachable but
intentionally not answering on port 25. Apparently some sysadmins do
this to stop some spammers who never try the second MX. (I hesitate to
publish the names of the domains doing this, but could provide
privately.) I get the same results with my test setup as with the real
cases -- sometimes my sendmail gives up on the bad primary and correctly
falls over to the secondary, but more often not.
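
To be concrete about the behavior I expect, here is a minimal sketch of the general MX failover idea. It is not sendmail's actual algorithm; the names first_reachable_mx and tcp_connect are made up for illustration, and the 30-second default is only borrowed from my Timeout.connect setting.

```python
import socket


def tcp_connect(host, timeout):
    """Hypothetical connector: open (and immediately close) a TCP connection to port 25."""
    with socket.create_connection((host, 25), timeout=timeout):
        pass


def first_reachable_mx(mx_records, connect=tcp_connect, timeout=30.0):
    """Try each MX in ascending preference order.

    mx_records: iterable of (preference, hostname) pairs.
    connect:    callable(host, timeout) that raises OSError on failure.
    Returns (host, tried), where host is the first MX that accepted a
    connection (or None) and tried lists the hosts that failed before it.
    """
    tried = []
    for _preference, host in sorted(mx_records):
        try:
            connect(host, timeout)
            return host, tried
        except OSError:  # covers refused connections and timeouts
            tried.append(host)
    return None, tried
```

With the records above, I would expect a call such as first_reachable_mx([(1, "nonesuch.freefriends.org."), (10, "goodmx.freefriends.org.")]) to record nonesuch as failed and then return goodmx.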
I'm using the sendmail 8.14.7 binary that is distributed with CentOS 7,
on x86_64. I've customized sendmail.cf in various ways, but nothing that
seems remotely relevant except possibly the timeout values, which I'll
append below.
I'm sending my test mail to, e.g., [email protected]. The
/var/log/maillog entry just shows nonesuch being tried repeatedly until
the 5 days are up and it bounces:
Mar 15 18:26:45 tug sendmail[26132]: 22FHPiET026128: to=<[email protected]>, delay=00:01:00, xdelay=00:01:00, mailer=esmtp, pri=293911, relay=nonesuch.freefriends.org. [10.10.10.10], dsn=4.0.0, stat=Deferred: Connection timed out with nonesuch.freefriends.org.
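
To tally which relay each delivery attempt actually used, a small script along these lines could summarize maillog. This is only a sketch: it assumes the fields after "to=" are separated by ", " as in the entry above, and the helper names parse_maillog_fields and count_relays are made up for illustration.

```python
from collections import Counter


def parse_maillog_fields(line):
    """Split one sendmail delivery log line into a dict of key=value fields.

    Assumes the fields start at the first "to=" and are separated by ", ".
    """
    start = line.find("to=")
    if start < 0:
        return {}
    fields = {}
    for item in line[start:].split(", "):
        key, sep, value = item.partition("=")
        if sep:
            fields[key] = value
    return fields


def count_relays(lines):
    """Count delivery attempts per relay host name (first word of relay=)."""
    counts = Counter()
    for line in lines:
        relay = parse_maillog_fields(line).get("relay")
        if relay:
            counts[relay.split()[0]] += 1
    return counts
```

Running count_relays over the lines of /var/log/maillog would show at a glance whether goodmx ever appears as a relay.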
I'm trying to discern what's really happening with:
rm /tmp/f; sendmail -D/tmp/f -d0-99.99 [email protected]
but the voluminous /tmp/f debug output just shows the bad nonesuch MX
being tried over and over, although goodmx is found. Here's a little excerpt showing the final attempt on a given queue run:
hostsignature(mytest.freefriends.org.) = nonesuch.freefriends.org.:goodmx.freefriends.org.
...
dropenvelope 0x55db2c276ba0: id=<null>, flags=4405046<INQUEUE,NO_BODY_RETN,DELETE_BCC,GLOBALERRS,METOO,IS_MIME,SPLIT>
sendq=0x55db2e364ab0=<[email protected]>:
mailer 4 (esmtp), host `mytest.freefriends.org.'
user `[email protected]', ruser `<null>'
state=QUEUEUP, next=0x0, alias 0x0, uid 0, gid 0
flags=80000182<QPRIMARY,QPINGONFAILURE,QPINGONDELAY,QRCPTOK>
owner=(none), home="(none)", fullname="(none)"
orcpt="(none)", statmta=nonesuch.freefriends.org., status=4.4.1
finalrcpt="RFC822; [email protected]"
rstatus="(none)"
statdate=Tue Mar 15 18:28:59 2022
====finis: stat 75 e_id=NOQUEUE e_flags=4405046<INQUEUE,NO_BODY_RETN,DELETE_BCC,GLOBALERRS,METOO,IS_MIME,SPLIT>
I have not been able to capture a debug log of a successful delivery,
i.e., one where it falls back to the secondary. Is there any way to hook
into that?
I suppose I could work around this with mailertable (or maybe bestmx)
entries, but I don't know all the hosts that would need it. Besides,
failing over to the secondary MX seems like too fundamental an operation
(nowadays) not to be working.
I've searched around online, in the bat book, in the sendmail sources (e.g.,
domain.c), etc., but haven't yet found the handle. If anyone would like to
email me about this instead of/as well as replying here, my address is
karl (at) freefriends (dot) org.
Sorry for the long message. Thanks in advance for any clues.
# timeouts (many of these)
#O Timeout.initial=5m
O Timeout.connect=30s
O Timeout.aconnect=30s
O Timeout.iconnect=30s
O Timeout.helo=4m
O Timeout.mail=5m
O Timeout.rcpt=10m
O Timeout.datainit=2m
O Timeout.datablock=6m
O Timeout.datafinal=30m
O Timeout.rset=1m
O Timeout.quit=1m
O Timeout.misc=1m
O Timeout.command=5m
O Timeout.ident=0s
#O Timeout.fileopen=60s
#O Timeout.control=2m
O Timeout.queuereturn=5d
#O Timeout.queuereturn.normal=5d
#O Timeout.queuereturn.urgent=2d
#O Timeout.queuereturn.non-urgent=7d
#O Timeout.queuereturn.dsn=5d
O Timeout.queuewarn=2d
#O Timeout.queuewarn.normal=4h
#O Timeout.queuewarn.urgent=1h
#O Timeout.queuewarn.non-urgent=12h
#O Timeout.queuewarn.dsn=4h
#O Timeout.hoststatus=30m
#O Timeout.resolver.retrans=5s
#O Timeout.resolver.retrans.first=5s
#O Timeout.resolver.retrans.normal=5s
#O Timeout.resolver.retry=4
#O Timeout.resolver.retry.first=4
#O Timeout.resolver.retry.normal=4
O Timeout.lhlo=1m
#O Timeout.auth=10m
O Timeout.starttls=2m