Taking into consideration the following details:
- millions of requests daily
- Aurora Cluster on AWS
You may want to take a look at your system to make sure you're not exceeding the DNS quota for your account.
One of the stand-out items from the quota documentation is this:
Each Amazon EC2 instance limits the number of packets that can be sent to the Amazon-provided DNS server to a maximum of 1024 packets per second per network interface. This quota cannot be increased. The number of DNS queries per second supported by the Amazon-provided DNS server varies by the type of query, the size of response, and the protocol in use. For more information and recommendations for a scalable DNS architecture, see the Hybrid Cloud DNS Solutions for Amazon VPC whitepaper.
If you reach the quota, the Amazon Route 53 Resolver rejects traffic [...]
Note: Emphasis mine.
The "maximum of 1024 packets per second" bit is important because the actual number of packets per query can vary and there are typically multiple packets per DNS query.
If your server(s) are receiving millions of requests per day, then there is a high possibility that your server(s) are hitting that packet maximum:
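- 1,000,000 requests / 86,400 seconds ≈ 11.6 requests per second
- 11.6 requests per second * 4 packets¹ ≈ 46 packets per second, assuming one DNS lookup per request
- 1024 / 46 ≈ 22, so the quota is reached once DNS traffic runs at about 22 times that average, for example 22 lookups per request or a burst at 22 times the daily mean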
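To plug in your own numbers, here is a minimal Python sketch of the same back-of-the-envelope estimate. The function names and the `lookups_per_request` and `peak_factor` parameters are my own illustrative assumptions, not anything from AWS; the only figures taken from above are the 1024 packets per second quota and my measured average of 4 packets per lookup.

```python
SECONDS_PER_DAY = 86_400
PACKET_LIMIT_PER_SECOND = 1024  # per network interface, from the quoted AWS quota


def dns_packets_per_second(requests_per_day, lookups_per_request=1.0,
                           packets_per_lookup=4.0, peak_factor=1.0):
    """Estimate DNS packets per second sent by one network interface.

    lookups_per_request and peak_factor are hypothetical knobs for this
    sketch: how many DNS lookups each web request triggers, and how far
    peak traffic sits above the flat daily average.
    """
    requests_per_second = requests_per_day / SECONDS_PER_DAY * peak_factor
    return requests_per_second * lookups_per_request * packets_per_lookup


def headroom(requests_per_day, **kwargs):
    """Return how many times the estimated rate fits under the quota.

    A value at or below 1.0 means the estimate reaches the packet limit.
    """
    return PACKET_LIMIT_PER_SECOND / dns_packets_per_second(requests_per_day, **kwargs)


if __name__ == "__main__":
    rate = dns_packets_per_second(1_000_000)
    print(f"average rate: {rate:.1f} packets/s")       # 46.3
    print(f"headroom: {headroom(1_000_000):.1f}x")     # 22.1
    # Example of a rush period: 5 lookups per request at 5x the average load.
    print(f"rush headroom: {headroom(1_000_000, lookups_per_request=5, peak_factor=5):.2f}x")
```

Without rounding the intermediate rate, the headroom comes out at about 22.1. A headroom below 1.0 for your real peak numbers would suggest the interface is likely hitting the quota, though this is only an estimate and not a substitute for measuring actual DNS traffic.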
I cannot say this is definitely the problem, but this is a good place to start looking, particularly if your web servers have regular rush periods where traffic is not operating at a nice, flat average.
¹ Having been bitten by this problem in the past, I measured an average of about 4 packets per DNS lookup.