Score:0

Intermittent 500 Error caused by psycopg2.OperationalError: could not translate host name

Zev

20% of requests to our backend Django application (deployed on AWS using ECS and Postgres RDS) return 500 errors. The ECS logs show several related errors:

psycopg2.OperationalError: could not translate host name "abc.efg.us-east-1.rds.amazonaws.com" to address
OSError: [Errno 16] Device or resource busy
<built-in function getaddrinfo>) failed with OSError

We use gunicorn and gevent to serve our app:

gunicorn -t 1000 -k gevent -w 4 -b 0.0.0.0:8000 backend.wsgi

Patrick Mevzek
You do not say exactly which nameservers you use to resolve names. In many cases things improve a lot if you install a local caching resolver on the same box, something as simple as `unbound`, to get more stable and faster DNS resolution, especially when the same names are looked up over and over.
Zev
We use Route53 to route traffic to a CloudFront distribution, so it is awsdns. The lookups should almost always be for the same names, so a caching resolver makes sense.
Patrick Mevzek
I am specifically talking about a **recursive** nameserver installed as close as possible (ideally on the same box) to the applications making DNS calls. From experience, this improves things. Where and what the authoritative nameservers are is irrelevant (until you can prove that the problem is really between the recursive and authoritative servers and not between the application and the recursive one).
Zev
I guess the answer to your original question would be AmazonProvidedDNS. Thanks for the suggestion. I'll have to dig more into this area and understand it more before modifying anything but I like the sound of the solution you suggested.
Score:0
Zev

getaddrinfo is the standard name-resolution call; under gevent it is handled by gevent's resolver, which is documented here: https://www.gevent.org/dns.html

The documentation says that gevent offers 4 resolvers. The entry for the default, the "Native thread-based hostname resolver", notes that "there have been some reports of long delays, slow performance or even hangs, particularly in long-lived programs that make many, many DNS requests", and it recommends switching resolvers if that happens to you.

We switched the application to the ares resolver by setting GEVENT_RESOLVER when starting gunicorn, and we have not been able to reproduce the issue since:

GEVENT_RESOLVER=ares gunicorn -t 1000 -k gevent -w 4 -b 0.0.0.0:8000 backend.wsgi
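
To compare resolvers with numbers rather than waiting for 500s, a small probe along these lines might help. It is only a sketch: `probe_dns` and its parameters are hypothetical names, not part of gevent or psycopg2, and it assumes the probe runs in a gevent-patched process, where `socket.getaddrinfo` is expected to go through whichever resolver gevent is configured to use.

```python
import socket
import time


def probe_dns(hostname, attempts=100, port=5432, resolve=socket.getaddrinfo):
    """Resolve `hostname` repeatedly and return (failures, slowest_seconds).

    `resolve` is injectable so the loop can be exercised without real DNS;
    by default it is the standard socket.getaddrinfo.
    """
    failures = 0
    slowest = 0.0
    for _ in range(attempts):
        started = time.monotonic()
        try:
            resolve(hostname, port)
        except OSError:
            # socket.gaierror is a subclass of OSError, so this counts both
            # ordinary lookup failures and errors such as Errno 16.
            failures += 1
        # Track the slowest attempt, whether it succeeded or failed.
        slowest = max(slowest, time.monotonic() - started)
    return failures, slowest
```

Calling `probe_dns("abc.efg.us-east-1.rds.amazonaws.com", attempts=200)` once with the default resolver and once with `GEVENT_RESOLVER=ares` set would give a rough failure count and worst-case latency for each configuration.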
