Score:2

Intermittent SERVFAIL from different locations - how to diagnose and fix?

jp flag

I run a service at https://asti.ga .

I get occasional reports that people are failing to look up the domain name (either NXDOMAIN in a browser or SERVFAIL when running `dig` against the name via a root DNS server). These reports seem to originate from certain parts of the world, particularly south-east Asia.

I rarely see these issues myself. However I set up a Route53 health check, and I do indeed see these issues from certain places:

[Screenshot: Route53 health check results]

In addition, I notice that the result is not consistent. Sometimes it fails in a location, sometimes it works. It can switch between SERVFAIL and a successful lookup on a minute by minute basis.

How do I work out what is going wrong in these locations?

cn flag
Who is hosting your DNS? Is it AWS? There's some more info on the AWS pages if it's them: https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/troubleshooting-domain-unavailable.html
cn flag
Doing `dig @8.8.8.8 +short NS asti.ga` comes back with what look like German nameservers. This could well be a loading/latency issue with the provider. If you're using AWS, have you considered migrating the zone to Route53?
cn flag
Bob
Unless you have a good reason for having your TTL at 300 seconds, research shows that increasing TTL values generally reduces latency and will make your DNS more robust. See https://www.sidnlabs.nl/en/news-and-blogs/how-to-choose-dns-ttl-values
jp flag
The DNS is hosted by Hetzner. I'm not using AWS for this service other than Route53 for the health checks.
jp flag
Thanks for that link @Bob. It looks like a possibility. Could it be that the Hetzner servers are refusing to reply because they are receiving too many requests, as the TTL is too low? But why would that only affect certain locations?
djdomi
za flag
Hetzner has a lot of restrictions. A 300s TTL should mostly only be used when you are moving your domain to another nameserver or IP; otherwise, as already mentioned, a higher value like 86400 should be fine.
jp flag
@djdomi "hetzner has a lot of restrictions" - can you say what these are? I can't see any obvious stuff in their docs. I'll try with the higher TTL.
jp flag
Things appear to have improved: the Route53 checkers are all now showing success, and another separate status page checker is now recording uptime. Would you like to promote your comment to an answer, @Bob?
djdomi
za flag
@dangravell I was using Hetzner for my business and left after one year. The support is mostly good, but in specific circumstances they leave you alone in the dark. The DNS servers restrict too many queries and just drop them, but I am not sure if this is still valid after around 5-10 years.
Score:1
cn flag
Bob

Expanded from my earlier comment:

Thank you for posting your actual domain. That allowed me to check your current settings.

One thing I noticed was that the TTL, the time-to-live of your DNS record, was set to 300 seconds.

300 seconds, 5 minutes, is quite a low value, which most people only choose when preparing for a change of IP-address or for example as part of a fail-over strategy.
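A check like this can be scripted. The following is a minimal sketch, not the exact method used above: it assumes `dig` is installed, asks one nameserver directly without recursion, and reads the response status and the TTLs from dig's default text output. The helper names (`dig_command`, `parse_dig`, `check`) are hypothetical, and `SAMPLE` is made-up output using the documentation names `example.org` and `192.0.2.10`, not a real capture.

```python
import re
import subprocess


def dig_command(name, server, rtype="A"):
    """Build a dig call that asks one server directly, without recursion."""
    return ["dig", f"@{server}", name, rtype, "+norecurse", "+time=2", "+tries=1"]


def parse_dig(output):
    """Return (status, [ttl, ...]) parsed from dig's default text output.

    status is None when no response header is present (e.g. a timeout).
    """
    match = re.search(r"status: ([A-Z]+)", output)
    status = match.group(1) if match else None

    ttls = []
    in_answer = False
    for line in output.splitlines():
        if line.startswith(";; ANSWER SECTION:"):
            in_answer = True
            continue
        if in_answer:
            # A blank line or a comment line ends the answer section.
            if not line.strip() or line.startswith(";"):
                in_answer = False
                continue
            fields = line.split()
            # Answer lines look like: name TTL class type rdata
            if len(fields) >= 5 and fields[1].isdigit():
                ttls.append(int(fields[1]))
    return status, ttls


def check(name, server, rtype="A"):
    """Run dig against one server and parse the result (needs network access)."""
    result = subprocess.run(
        dig_command(name, server, rtype),
        capture_output=True, text=True, timeout=10,
    )
    return parse_dig(result.stdout)


# Made-up sample output for illustration only.
SAMPLE = """\
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4242
;; flags: qr aa; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; QUESTION SECTION:
;example.org.\t\t\tIN\tA

;; ANSWER SECTION:
example.org.\t\t300\tIN\tA\t192.0.2.10

;; Query time: 20 msec
"""

print(parse_dig(SAMPLE))
```

To use it against a real zone, call `check()` once for each name returned by `dig +short NS asti.ga`. Querying each authoritative server separately, ideally from a host in an affected region, can help show whether one particular server is slow or dropping queries.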

Unless you have a good reason for having your TTL at 300 seconds, research shows that increasing TTL values generally reduces latency and will make your DNS more robust. See for example https://www.sidnlabs.nl/en/news-and-blogs/how-to-choose-dns-ttl-values

"For general zone owners, we recommend longer TTLs: at least one hour, and ideally four, eight, or 24 hours. Assuming planned maintenance can be scheduled in advance, long TTLs have little cost."
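To put rough numbers on this, here is a small worked calculation. It assumes a simplified model: a single resolver cache that is asked for the record continuously and re-fetches it from the authoritative servers once each time the TTL expires. Real resolvers may prefetch, evict entries early, or run several independent caches, so treat the figures as an order-of-magnitude illustration. The function name is hypothetical.

```python
SECONDS_PER_DAY = 86400


def max_refreshes_per_day(ttl_seconds):
    """Cache refreshes per day for one record in one resolver cache,
    assuming exactly one re-fetch each time the TTL expires."""
    return SECONDS_PER_DAY // ttl_seconds


for ttl in (300, 3600, 86400):
    print(f"TTL {ttl:>5}s -> {max_refreshes_per_day(ttl)} refreshes/day")
```

Under this model a 300-second TTL means 288 trips to the authoritative servers per day per cache, against 24 at one hour and 1 at 24 hours. Each trip is a chance for the lookup to fail if the authoritative servers are slow or unreachable from that resolver's location.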

jp flag
Thanks - this seemed to do the trick. It took about an hour or so to work, but then the Route53 health checkers started showing successful checks.