BACKGROUND
I have a long-running Discord bot (3+ years) written in discord.py that has always run on GCP in zone us-east4-a. The bot runs in k8s using discord.py 1.7.2 and Python 3.9.
PROBLEM
Over the past month or two, I have started to see an increasing number of connection interruptions: [Errno 104] Connection reset by peer. The resets do not correlate with the amount of activity on the bot. They happen intermittently throughout the day in production (every few minutes on average).
These resets cause random failures in calls to the Discord HTTP API and a high rate of disconnects on the WebSocket. Many of the disconnected shards are able to RESUME, but many (~200 per day) end up issuing a fresh IDENTIFY as if for a new connection, which sometimes triggers extended backoff waits and partial outages.
EXAMPLE
Here is an example of a disconnect:
Traceback (most recent call last):
  File "/opt/venv/lib/python3.9/site-packages/discord/shard.py", line 187, in reconnect
    self.ws = await asyncio.wait_for(coro, timeout=60.0)
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 481, in wait_for
    return fut.result()
  File "/opt/venv/lib/python3.9/site-packages/discord/gateway.py", line 305, in from_client
    gateway = gateway or await client.http.get_gateway()
  File "/opt/venv/lib/python3.9/site-packages/discord/http.py", line 967, in get_gateway
    data = await self.request(Route('GET', '/gateway'))
  File "/opt/venv/lib/python3.9/site-packages/discord/http.py", line 192, in request
    async with self.__session.request(method, url, **kwargs) as r:
  File "/opt/venv/lib/python3.9/site-packages/aiohttp/client.py", line 1117, in __aenter__
    self._resp = await self._coro
  File "/opt/venv/lib/python3.9/site-packages/aiohttp/client.py", line 544, in _request
    await resp.start(conn)
  File "/opt/venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 890, in start
    message, payload = await self._protocol.read() # type: ignore
  File "/opt/venv/lib/python3.9/site-packages/aiohttp/streams.py", line 604, in read
    await self._waiter
aiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer
EXPERIMENT TO ISOLATE THE PROBLEM
I ran an experiment to isolate the cause. I deployed a container with my bot to a VM (not k8s), isolated it so that it communicates only with Discord (no outside database), and automatically sent it commands to simulate user behavior and load (about 60 commands per minute in the same server, well under my production load). I ran each test for 20 minutes or until I observed connection resets, with the following results:
- In us-east4-a, I am able to reproduce intermittent connection resets.
- In us-east4-b, I am able to reproduce intermittent connection resets.
- In us-east4-c, I am able to reproduce intermittent connection resets.
- In us-central1-a, I am not able to reproduce any connection resets (even after 3 hours -- no shard disconnects at all).
- In us-east1-b, I am not able to reproduce any connection resets.
- On my laptop (residential internet on the east coast), I am not able to reproduce any connection resets.
All experiments use the same container, same machine-type and same test procedure.
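For comparison across zones without the bot in the loop, the procedure above could be reduced to a small standalone probe. The following is only a minimal sketch, not what the experiment actually ran: the names PROBE_URL, classify, default_fetch and run_probe are hypothetical, the URL (including its version prefix) is an assumption based on the GET /gateway call in the traceback, and the User-Agent string is a placeholder. It uses only the standard library and opens a fresh connection for every request, so it does not exercise the pooled keep-alive connections that aiohttp reuses inside discord.py; a clean run here would not rule out resets on reused connections.

```python
import errno
import socket
import time
import urllib.error
import urllib.request
from collections import Counter

# Assumption: the unauthenticated gateway endpoint seen in the traceback.
# The version prefix is illustrative and may need adjusting.
PROBE_URL = "https://discord.com/api/v9/gateway"


def classify(exc):
    """Map an exception to a coarse label: 'reset', 'timeout' or 'other'."""
    # urllib wraps socket-level failures in URLError; look at the cause.
    reason = getattr(exc, "reason", exc)
    if isinstance(reason, ConnectionResetError):
        return "reset"
    if getattr(reason, "errno", None) == errno.ECONNRESET:
        return "reset"
    if isinstance(reason, (TimeoutError, socket.timeout)):
        return "timeout"
    return "other"


def default_fetch(url, timeout=10.0):
    """Issue one GET over a new connection and read the whole body."""
    # Placeholder User-Agent; an HTTP error status is counted as 'other'.
    req = urllib.request.Request(url, headers={"User-Agent": "reset-probe/0.1"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        resp.read()


def run_probe(fetch, url, attempts, interval=1.0, sleep=time.sleep):
    """Call fetch(url) repeatedly and tally the outcome of each attempt."""
    counts = Counter()
    for _ in range(attempts):
        try:
            fetch(url)
            counts["ok"] += 1
        except Exception as exc:
            counts[classify(exc)] += 1
        sleep(interval)
    return counts


if __name__ == "__main__":
    # 60 attempts, one per second; print the tally for this zone.
    print(dict(run_probe(default_fetch, PROBE_URL, attempts=60)))
```

Running the same script from a VM in each zone and comparing the "reset" counts would show whether plain HTTPS requests to Discord are affected, or only the bot's long-lived connections.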
I repeated the experiment in us-east4-a with multiple machine types up to 8 vCPU and with both the premium and standard network tiers, and I still saw resets. I also tried another VM in a different project, but the connection issues persisted in us-east4.
I have a support case open with GCP, as this appears to be a region-specific issue.
Are there any additional experiments I could run to narrow down the cause? Are there any common GCP configuration problems that could produce this behavior?
Short of moving to another region, I feel as though I am out of options.