TL;DR You may want to quickly jump to the answer to know what happened and not spend time reading the whole question.
I have a tool (ipmgr) to generate my zones (I had to manage about 35 of them, so that made it easier). All the zones are generated the same way, but one misbehaves and the secondary DNS server never gets a copy of that zone. It somehow refuses the copy.
IMPORTANT NOTE: Everything used to work just fine. It just decided to stop working on March 22nd, somehow...
Primary DNS (Master)
Logs
Logs of the loading of the main zone (with the ns1/ns2 IPs) on the primary:
25-Mar-2023 01:27:28.745 zoneload: debug 1: zone m2osw.com/IN: starting load
25-Mar-2023 01:27:28.745 general: debug 1: zone_startload: zone m2osw.com/IN: enter
25-Mar-2023 01:27:28.745 zoneload: debug 1: zone m2osw.com/IN: journal rollforward completed successfully: no journal
25-Mar-2023 01:27:28.745 zoneload: debug 1: zone m2osw.com/IN: loaded; checking validity
25-Mar-2023 01:27:28.745 general: debug 1: dns_zone_verifydb: zone m2osw.com/IN: enter
25-Mar-2023 01:27:28.745 general: debug 1: zone_settimer: zone m2osw.com/IN: enter
25-Mar-2023 01:27:28.745 zoneload: info: zone m2osw.com/IN: loaded serial 248
25-Mar-2023 01:27:28.749 general: debug 1: dns_zone_maintenance: zone m2osw.com/IN: enter
25-Mar-2023 01:27:28.749 general: debug 1: zone_settimer: zone m2osw.com/IN: enter
25-Mar-2023 01:27:28.757 notify: info: zone m2osw.com/IN: sending notifies (serial 248)
The loading of the flaky zone on primary looks the same:
(Note: I've now determined that all zones are "flaky", as in they do not transfer, so it makes sense that all the logs look alike.)
25-Mar-2023 01:27:28.745 zoneload: debug 1: zone best-gamblers.games/IN: starting load
25-Mar-2023 01:27:28.745 general: debug 1: zone_startload: zone best-gamblers.games/IN: enter
25-Mar-2023 01:27:28.745 zoneload: debug 1: zone best-gamblers.games/IN: journal rollforward completed successfully: no journal
25-Mar-2023 01:27:28.745 zoneload: debug 1: zone best-gamblers.games/IN: loaded; checking validity
25-Mar-2023 01:27:28.745 general: debug 1: dns_zone_verifydb: zone best-gamblers.games/IN: enter
25-Mar-2023 01:27:28.745 general: debug 1: zone_settimer: zone best-gamblers.games/IN: enter
25-Mar-2023 01:27:28.745 zoneload: info: zone best-gamblers.games/IN: loaded serial 233
25-Mar-2023 01:27:28.749 general: debug 1: dns_zone_maintenance: zone best-gamblers.games/IN: enter
25-Mar-2023 01:27:28.749 general: debug 1: zone_settimer: zone best-gamblers.games/IN: enter
25-Mar-2023 01:27:28.753 notify: info: zone best-gamblers.games/IN: sending notifies (serial 233)
I don't see any errors in the primary logs for any of my zones.
Zone Files
The main zone has the ns1 & ns2 definitions among other things:
$ORIGIN .
$TTL 3600
m2osw.com IN SOA ns1.m2osw.com. hostmaster.m2osw.com. (248 10800 180 1209600 300)
NS ns1.m2osw.com.
NS ns2.m2osw.com.
MX 10 mail.m2osw.com.
A 165.232.146.181
$ORIGIN m2osw.com.
mail A 165.232.146.181
ns1 A 165.232.146.181
ns2 A 96.67.192.225
www A 165.232.146.181
... more TXT / A records ...
And here is the one that fails:
$ORIGIN .
$TTL 3600
best-gamblers.games IN SOA ns1.m2osw.com. hostmaster.m2osw.com. (233 10800 180 1209600 300)
NS ns1.m2osw.com.
NS ns2.m2osw.com.
A 165.232.146.181
$ORIGIN best-gamblers.games.
www A 165.232.146.181
The named.conf includes a file which contains:
zone "m2osw.com" {
type primary;
file "/var/lib/bind/m2osw.com.zone";
allow-transfer { trusted-servers; };
check-names warn;
max-journal-size 2M;
};
And the failing zone is defined like so:
zone "best-gamblers.games" {
type primary;
file "/var/lib/bind/best-gamblers.games.zone";
allow-transfer { trusted-servers; };
check-names warn;
max-journal-size 2M;
};
Secondary DNS (Slave)
Zone Settings
The secondary has one file with zone references that looks like this:
zone "m2osw.com" {
type secondary;
primaries { list-of-primaries; };
allow-transfer { none; };
file "/var/cache/bind/m2osw.com.zone";
};
zone "best-gamblers.games" {
type secondary;
primaries { list-of-primaries; };
allow-transfer { none; };
file "/var/cache/bind/best-gamblers.games.zone";
};
The "/var/cache/bind/m2osw.com.zone" file (and about 20 others) gets created as expected. All of these work just fine. As I mentioned above, I use a tool to create all the files, so nothing differs except the zone name and the corresponding file. The rest is the same for all of them and, as we can see above, the two zones I present are set up exactly the same way.
The "/var/cache/bind/best-gamblers.games.zone" file does not get created at all. As a test, I tried removing the '-' from the filename, but that did not help.
Logs
Just like the primary logs, I can't find any errors in the secondary logs and they look alike for the primary domain (which works):
24-Mar-2023 18:47:29.074 general: debug 1: zone m2osw.com/IN: starting load
24-Mar-2023 18:47:29.075 general: debug 1: zone m2osw.com/IN: journal rollforward completed successfully: no journal
24-Mar-2023 18:47:29.075 general: debug 1: zone m2osw.com/IN: loaded; checking validity
24-Mar-2023 18:47:29.075 general: debug 1: zone_settimer: zone m2osw.com/IN: enter
24-Mar-2023 18:47:29.075 general: info: zone m2osw.com/IN: loaded serial 241
24-Mar-2023 18:47:29.077 general: debug 1: dns_zone_maintenance: zone m2osw.com/IN: enter
24-Mar-2023 18:47:29.077 general: debug 1: zone_settimer: zone m2osw.com/IN: enter
24-Mar-2023 18:47:29.086 notify: info: zone m2osw.com/IN: sending notifies (serial 241)
24-Mar-2023 18:52:22.075 general: debug 1: zone_timer: zone m2osw.com/IN: enter
24-Mar-2023 18:52:22.075 general: debug 1: zone_maintenance: zone m2osw.com/IN: enter
24-Mar-2023 18:52:22.075 general: debug 1: queue_soa_query: zone m2osw.com/IN: enter
24-Mar-2023 18:52:22.075 general: debug 1: zone_settimer: zone m2osw.com/IN: enter
24-Mar-2023 18:52:22.075 general: debug 1: soa_query: zone m2osw.com/IN: enter
24-Mar-2023 18:52:22.095 general: debug 1: refresh_callback: zone m2osw.com/IN: enter
24-Mar-2023 18:52:22.095 general: info: zone m2osw.com/IN: refresh: non-authoritative answer from master 165.232.146.181#53 (source 0.0.0.0#0)
24-Mar-2023 18:52:22.095 general: debug 1: queue_soa_query: zone m2osw.com/IN: enter
24-Mar-2023 18:52:22.575 general: debug 1: soa_query: zone m2osw.com/IN: enter
24-Mar-2023 18:52:22.575 general: debug 1: cancel_refresh: zone m2osw.com/IN: enter
24-Mar-2023 18:52:22.576 general: debug 1: zone_settimer: zone m2osw.com/IN: enter
and the other domain (which fails):
24-Mar-2023 18:47:29.075 general: debug 1: zone best-gamblers.games/IN: no master file
24-Mar-2023 18:47:29.075 general: debug 1: zone_settimer: zone best-gamblers.games/IN: enter
24-Mar-2023 18:47:29.076 general: debug 1: dns_zone_maintenance: zone best-gamblers.games/IN: enter
24-Mar-2023 18:47:29.076 general: debug 1: zone_settimer: zone best-gamblers.games/IN: enter
24-Mar-2023 18:47:29.083 general: debug 1: zone_timer: zone best-gamblers.games/IN: enter
24-Mar-2023 18:47:29.083 general: debug 1: zone_maintenance: zone best-gamblers.games/IN: enter
24-Mar-2023 18:47:29.083 general: debug 1: queue_soa_query: zone best-gamblers.games/IN: enter
24-Mar-2023 18:47:29.084 general: debug 1: zone_settimer: zone best-gamblers.games/IN: enter
24-Mar-2023 18:47:29.089 general: debug 1: soa_query: zone best-gamblers.games/IN: enter
24-Mar-2023 18:47:29.116 general: debug 1: refresh_callback: zone best-gamblers.games/IN: enter
24-Mar-2023 18:47:29.116 general: info: zone best-gamblers.games/IN: refresh: non-authoritative answer from master 165.232.146.181#53 (source 0.0.0.0#0)
24-Mar-2023 18:47:29.116 general: debug 1: queue_soa_query: zone best-gamblers.games/IN: enter
24-Mar-2023 18:47:29.583 general: debug 1: soa_query: zone best-gamblers.games/IN: enter
24-Mar-2023 18:47:29.583 general: debug 1: cancel_refresh: zone best-gamblers.games/IN: enter
24-Mar-2023 18:47:29.583 general: debug 1: zone_settimer: zone best-gamblers.games/IN: enter
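To check whether the refresh problem affects one zone or all of them, the secondary's log can be filtered for the zones that report a non-authoritative answer. This is only a minimal sketch: the function name nonauth_zones is made up for illustration, and it assumes the log lines have exactly the format shown above.

```shell
# Print each zone (once) that logged
# "refresh: non-authoritative answer from master ..." in the log read from stdin.
nonauth_zones() {
    sed -n 's|^.* zone \([^ ]*\)/IN: refresh: non-authoritative answer from master .*$|\1|p' | sort -u
}
```

It would be fed the named log on the secondary, for example `nonauth_zones < named.log`, where named.log stands for whatever file the logging configuration writes to (that path is not shown in this question).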
Issue
As mentioned above, the best-gamblers.games.zone never gets created in the /var/cache/bind folder.
The cancel_refresh could sound like a good reason for that, except that if this were the case, it should be an error, not a debug message. Also, the messages are the same for both zones.
However, it looks like zone m2osw.com was not updated on the secondary DNS since it says it loads version 241 and the primary is at version 248. To prove that, I updated the main zone to include volcan.m2osw.com and sure enough, after the necessary amount of time, I can find the new name on NS1 and NS2 says it doesn't know about it. So NS2 does not update anything. I, of course, restarted my bind9 service many times. That does not help at all.
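The serial mismatch described above (241 on the secondary, 248 on the primary) can be checked by comparing the SOA serial each server reports. The sketch below is illustrative only: soa_serial and serial_status are hypothetical helper names, the input is assumed to be the one-line output of `dig +short SOA`, and the comparison is plain integer arithmetic that ignores serial-number wrap-around.

```shell
# Extract the serial (third field) from one line of `dig +short SOA` output,
# e.g. "ns1.m2osw.com. hostmaster.m2osw.com. 248 10800 180 1209600 300".
soa_serial() {
    awk 'NF >= 7 { print $3; exit }'
}

# Compare two serials with plain integer arithmetic (wrap-around is ignored).
serial_status() {
    primary=$1
    secondary=$2
    if [ "$primary" -eq "$secondary" ]; then
        echo "in sync"
    elif [ "$primary" -gt "$secondary" ]; then
        echo "secondary behind by $((primary - secondary))"
    else
        echo "secondary ahead"
    fi
}
```

A possible use would be `p=$(dig +short SOA m2osw.com @ns1.m2osw.com | soa_serial)` and likewise with @ns2.m2osw.com, followed by `serial_status "$p" "$s"`. When run from the secondary itself, the answer attributed to ns1 may not be trustworthy if it also lacks the "aa" flag.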
So I think that the main issue is that my primary DNS does not set the "aa" flag when I query it from my secondary DNS. I tried from another server to which I have access and it does show me the "aa" flag on both, primary & secondary. So I think I'm good on that one.
What else could prevent the secondary from accepting those entries? Is the missing "aa" flag the issue? If so, how do I fix it on the secondary server?
Example of a dig from the secondary DNS server to the primary, no "aa" flag:
$ dig @ns1.m2osw.com m2osw.com
; <<>> DiG 9.11.3-1ubuntu1.18-Ubuntu <<>> @ns1.m2osw.com m2osw.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50582
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;m2osw.com. IN A
;; ANSWER SECTION:
m2osw.com. 3600 IN A 165.232.146.181
;; Query time: 24 msec
;; SERVER: 165.232.146.181#53(165.232.146.181)
;; WHEN: Wed Mar 22 17:30:34 PDT 2023
;; MSG SIZE rcvd: 54
Here is the same dig command from my 3rd server (without BIND installed); the "aa" flag is present:
$ dig @ns1.m2osw.com m2osw.com
; <<>> DiG 9.11.3-1ubuntu1.18-Ubuntu <<>> @ns1.m2osw.com m2osw.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40081
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 913b39f9d8226a8901000000641e51000a3e9c772af12db9 (good)
;; QUESTION SECTION:
;m2osw.com. IN A
;; ANSWER SECTION:
m2osw.com. 3600 IN A 165.232.146.181
;; Query time: 1 msec
;; SERVER: 165.232.146.181#53(165.232.146.181)
;; WHEN: Sat Mar 25 01:40:16 UTC 2023
;; MSG SIZE rcvd: 82
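The difference between the two outputs above is in the header flags line ("qr rd ra" versus "qr aa rd"). A small helper can pull that line out of dig output so the same query can be compared from several hosts. This is a sketch under assumptions: dig_flags and has_aa_flag are names invented here, and the parsing relies on the ";; flags: ...;" line format printed by the dig versions shown above.

```shell
# Print the header flags (e.g. "qr aa rd") from dig output read on stdin.
dig_flags() {
    sed -n 's/^;; flags: \([^;]*\);.*$/\1/p'
}

# Succeed only if the "aa" (authoritative answer) flag is among those flags.
has_aa_flag() {
    dig_flags | tr ' ' '\n' | grep -qx 'aa'
}
```

For example, `dig @ns1.m2osw.com m2osw.com | has_aa_flag && echo authoritative` could be run from each host in turn.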
Note that since I have valid DNS files in the cache of the secondary, I think that it was working just fine before. I don't recall changing anything in the settings, so I really don't see why it would all of a sudden stop working...
Update
As it felt like the update did not occur for other domains, I hid the existing cache after stopping bind:
$ sudo systemctl stop bind9
$ sudo mv /var/cache/bind /var/cache/bind-hidden
Then I created the cache folder again:
$ sudo mkdir /var/cache/bind
$ sudo chgrp bind /var/cache/bind
$ sudo chmod 775 /var/cache/bind
I restarted bind:
$ sudo systemctl start bind9
and looked into the folder:
$ ls /var/cache/bind
managed-keys.bind managed-keys.bind.jnl
and as we can see it created two files. So there are no read/write permission issues (and yes, the original folder is also root:bind and 775). I could see the transfer signals in the logs (like above) so I'm sure that new zone files should have appeared, but nothing at all.
This clearly proves that the secondary refuses all files from the primary. Probably because it sees the primary as non-authoritative?
dig AXFR ...
Output
I ran the following commands from the secondary:
$ dig afxr @ns1.m2osw.com best-gamblers.games
; <<>> DiG 9.11.3-1ubuntu1.18-Ubuntu <<>> afxr @ns1.m2osw.com best-gamblers.games
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 60945
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;afxr. IN A
;; Query time: 5 msec
;; SERVER: 165.232.146.181#53(165.232.146.181)
;; WHEN: Mon Mar 27 21:05:04 PDT 2023
;; MSG SIZE rcvd: 33
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 31521
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;best-gamblers.games. IN A
;; ANSWER SECTION:
best-gamblers.games. 3600 IN A 165.232.146.181
;; Query time: 57 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Mon Mar 27 21:05:04 PDT 2023
;; MSG SIZE rcvd: 64
Not sure whether this is success or not, but it looked exactly the same as from the 3rd party server so my take is that it's a failure even though the domain name appears in the ANSWER SECTION.
I repeated that directly on the primary and it shows an AUTHORITY SECTION:
$ dig axfr @ns1.m2osw.com best-gamblers.games
; <<>> DiG 9.18.12-0ubuntu0.22.04.1-Ubuntu <<>> axfr @ns1.m2osw.com best-gamblers.games
; (1 server found)
;; global options: +cmd
; Transfer failed.
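Since it is not obvious from raw dig output whether a transfer succeeded, here is a rough way to classify it. A completed AXFR begins and ends with the zone's SOA record, whereas dig prints "; Transfer failed." when the transfer does not go through. The function name axfr_status is hypothetical, and the sketch assumes dig's default presentation format (name, TTL, class, type, data).

```shell
# Classify dig output read from stdin as "failed", "complete" or "no transfer".
axfr_status() {
    awk '
        /^; Transfer failed/ { failed = 1 }
        # Record lines do not start with ";" and carry the type in field 4.
        /^[^;]/ && $4 == "SOA" { soa++ }
        END {
            if (failed) print "failed"
            else if (soa >= 2) print "complete"
            else print "no transfer"
        }'
}
```

It would be used as `dig axfr best-gamblers.games @ns1.m2osw.com | axfr_status`. An ordinary answer that contains only an A record is reported as "no transfer".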