I'm trying to find the maximum QPS (queries per second) that our DNS resolver VM can sustain.
Our infrastructure is hosted on Azure, with a BIND-based VM acting as a resolver that forwards queries to Azure's native DNS (168.63.129.16) as well as to on-prem DNS servers. The resolver does not cache any queries, and each A record has a TTL of 300 seconds.
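For anyone unfamiliar with this kind of setup, a minimal BIND forwarding configuration looks roughly like the sketch below; the zone name and the on-prem DNS address are placeholders rather than our actual values, and caching is tuned down separately::

options {
    recursion yes;
    // anything not matched by a forward zone goes to the Azure-provided resolver
    forwarders { 168.63.129.16; };
    forward only;
};
// on-prem names are resolved via a conditional forward zone
zone "corp.example.com" {
    type forward;
    forward only;
    forwarders { 10.1.0.53; };    // placeholder on-prem DNS server
};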
I'm using dnsperf and resperf to generate the load (A records only), since I'm preparing the DNS resolvers to withstand DDoS attacks of up to 100K QPS. The problem I'm hitting looks like query rate limiting between my resolver and the Azure native DNS resolver: as QPS increases, the resolver starts returning SERVFAIL responses to the clients. We don't see any SERVFAIL responses between the resolver and the on-prem DNS servers.
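The load-generation commands look along these lines; the query file and the target address are examples rather than the exact invocation::

# a_records.txt holds one "name type" pair per line, e.g. "SZxvvdyDYy.ns.westeurope.xx.yy.zz.net A"
# steady load of A-record queries at a fixed rate for 60 seconds
dnsperf -s 10.0.0.11 -d a_records.txt -l 60 -Q 100000
# ramp the query rate up to find the breaking point
resperf -s 10.0.0.11 -d a_records.txt -m 100000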
The maximum QPS I could reach while targeting Azure DNS is around 2,100. I've searched extensively online for any such rate limiting done by Azure but couldn't find anything related. My hunch is that the resolver VM itself is the bottleneck, since ~2K QPS seems very low for the scale of Azure's infrastructure.
Here are a few things (BIND and kernel sysctl changes) I tuned at my end, which improved throughput a little but not much; a consolidated sketch of these settings follows the list.
Bind config changes::
recursive-clients raised from 1000 to 30000

UDP buffers raised to a value higher than 26214400 to stop buffer failures::
net.core.rmem_max
net.core.rmem_default

Local port range widened from "32768 61000" to "1024 61000" to make the maximum number of ports available for DNS::
net.ipv4.ip_local_port_range

Miscellaneous changes::
txqueuelen raised from 1000 to 20000
ulimits raised to 100000
net.netfilter.nf_conntrack_max raised to a much higher value
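Put together, the tuning above amounts to roughly the following; the rmem and nf_conntrack_max numbers are examples above the baselines mentioned, and eth0 is assumed as the interface name::

# kernel/network tuning
sysctl -w net.core.rmem_max=33554432        # example value above the 26214400 baseline
sysctl -w net.core.rmem_default=33554432    # example value above the 26214400 baseline
sysctl -w net.ipv4.ip_local_port_range="1024 61000"
sysctl -w net.netfilter.nf_conntrack_max=1048576    # example for "a much higher value"
# interface transmit queue and file-descriptor limit
ip link set dev eth0 txqueuelen 20000
ulimit -n 100000
# plus, in named.conf options: recursive-clients 30000;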
In addition to the above, I increased the VM size from 1 core / 2 GB RAM to 4 cores / 8 GB RAM. After resizing, the packet errors disappeared (checked with netstat -s), but the SERVFAIL responses did not go away.
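The counters I was watching before and after the resize were along these lines (field names vary a bit between kernel versions)::

# UDP-level drops: "packet receive errors" / "receive buffer errors"
netstat -su
# broader view of drops and errors across the stack
netstat -s | grep -i -E "drop|error"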
I ran tcpdump to check the pattern of the SERVFAIL errors. In the failure cases, the resolver sends the query to Azure DNS 5 times (about 1 second apart), never hears back within that window, and then sends a SERVFAIL response to the client. Loading the pcap into Wireshark, I can see that Azure DNS does eventually send the response back to the resolver, but by then the resolver has already returned SERVFAIL to the client.
Why is the connection given up before the response arrives? net.netfilter.nf_conntrack_udp_timeout is left untouched at 30 seconds, yet the resolver sends SERVFAIL to the client after about 5 seconds.
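For completeness, this is the kind of check I'm running on the conntrack side (it assumes the conntrack-tools package is installed)::

# current UDP timeout and table usage
sysctl net.netfilter.nf_conntrack_udp_timeout
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
# UDP/53 flows tracked at the moment of the test
conntrack -L -p udp | grep "dport=53" | head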
Below are the tcpdump logs captured during a ServFail::
reading from file dns4.pcap, link-type EN10MB (Ethernet)
10.0.0.10.57710 > 10.0.0.11.domain: [udp sum ok] 1612+ A? SZxvvdyDYy.ns.westeurope.xx.yy.zz.net. (66)
10.0.0.11.44513 > 168.63.129.16.domain: [bad udp cksum 0xbecd -> 0x8cfd!] 52637+% [1au] A? SZxvvdyDYy.ns.westeurope.xx.yy.zz.net. ar: . OPT UDPsize=4096 DO (77)
10.0.0.11.32378 > 168.63.129.16.domain: [bad udp cksum 0xbecd -> 0x3950!] 20672+% [1au] A? SZxvvdyDYy.ns.westeurope.xx.yy.zz.net. ar: . OPT UDPsize=512 DO (77)
10.0.0.11.59973 > 168.63.129.16.domain: [bad udp cksum 0xbecd -> 0xe2e5!] 15199+% [1au] A? SZxvvdyDYy.ns.westeurope.xx.yy.zz.net. ar: . OPT UDPsize=512 DO (77)
10.0.0.11.29976 > 168.63.129.16.domain: [bad udp cksum 0xbec2 -> 0x051b!] 47104+ A? SZxvvdyDYy.ns.westeurope.xx.yy.zz.net. (66)
10.0.0.11.43442 > 168.63.129.16.domain: [bad udp cksum 0xbec2 -> 0xe791!] 41199+ A? SZxvvdyDYy.ns.westeurope.xx.yy.zz.net. (66)
10.0.0.11.domain > 10.0.0.10.57710: [bad udp cksum 0x2a89 -> 0x5e30!] 1612 ServFail q: A? SZxvvdyDYy.ns.westeurope.xx.yy.zz.net. 0/0/0 (66)
As you can see from the last line, the ServFail is sent after 5 attempts.
If you've come this far, thank you for reading this lengthy question. I know it's a lot to ask, but I'd appreciate any hints, as I can't figure out where the bottleneck is.
Originally posted on superuser here