Intermittent 104 Connection Reset By Peer in GCP us-east4

Daniel DeSousa

10/15/22, 2:50 PM

BACKGROUND

I have a long-running Discord bot (3+ years) written in discord.py, which has always run on GCP in zone us-east4-a. The bot runs in k8s using discord.py 1.7.2 and Python 3.9.

PROBLEM

Over the past month or two, I have seen an increasing number of connection interruptions: [Errno 104] Connection reset by peer. The resets do not correlate with the amount of activity on the bot. They happen intermittently throughout the day in production (every few minutes on average).

These resets cause random failures in calls to the Discord HTTP API and a high rate of disconnects on the WebSocket. Many of the shard disconnects are able to RESUME, but many (~200 per day) end up requiring an IDENTIFY call, like a new connection, and sometimes trigger extended backoff waits and partial outages.

EXAMPLE

Here is an example of a disconnect:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.9/site-packages/discord/shard.py", line 187, in reconnect
    self.ws = await asyncio.wait_for(coro, timeout=60.0)
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 481, in wait_for
    return fut.result()
  File "/opt/venv/lib/python3.9/site-packages/discord/gateway.py", line 305, in from_client
    gateway = gateway or await client.http.get_gateway()
  File "/opt/venv/lib/python3.9/site-packages/discord/http.py", line 967, in get_gateway
    data = await self.request(Route('GET', '/gateway'))
  File "/opt/venv/lib/python3.9/site-packages/discord/http.py", line 192, in request
    async with self.__session.request(method, url, **kwargs) as r:
  File "/opt/venv/lib/python3.9/site-packages/aiohttp/client.py", line 1117, in __aenter__
    self._resp = await self._coro
  File "/opt/venv/lib/python3.9/site-packages/aiohttp/client.py", line 544, in _request
    await resp.start(conn)
  File "/opt/venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 890, in start
    message, payload = await self._protocol.read()  # type: ignore
  File "/opt/venv/lib/python3.9/site-packages/aiohttp/streams.py", line 604, in read
    await self._waiter
aiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer
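
To count failures like the one above separately from other errors, a small classifier can help. The following is a minimal sketch, not code from the bot: `is_connection_reset` is a hypothetical helper name, and it assumes the exception (or something in its cause/context chain) is an `OSError` subclass carrying an `errno`, which to my knowledge is the case for aiohttp's `ClientOSError`.

```python
import errno


def is_connection_reset(exc):
    """Return True if exc, or any exception chained to it, is an ECONNRESET OSError.

    Hypothetical helper for illustration. errno.ECONNRESET is 104 on Linux,
    which matches the "[Errno 104]" in the traceback.
    """
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))  # guard against cycles in the exception chain
        if isinstance(exc, OSError) and exc.errno == errno.ECONNRESET:
            return True
        # Follow an explicit "raise ... from ..." first, then the implicit context.
        exc = exc.__cause__ or exc.__context__
    return False
```

Calling this from the bot's error handling and incrementing a counter whenever it returns True would give a reset rate that can be compared across zones.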

EXPERIMENT TO ISOLATE THE PROBLEM

I performed an experiment to isolate the cause. I deployed a container with my bot to a VM (not k8s), isolated so that it communicates only with Discord (no outside database), and automatically sent it commands to simulate user behavior and load (about 60 commands per minute in the same server, well under my production load). I ran each test for 20 minutes or until connection resets occurred, and observed the following:

In us-east4-a, I am able to reproduce intermittent connection resets.
In us-east4-b, I am able to reproduce intermittent connection resets.
In us-east4-c, I am able to reproduce intermittent connection resets.
In us-central1-a, I am not able to reproduce any connection resets (even after 3 hours -- no shard disconnects at all).
In us-east1-b, I am not able to reproduce any connection resets.
On my laptop (residential internet on the east coast), I am not able to reproduce any connection resets.

All experiments use the same container, the same machine type, and the same test procedure.
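
The test procedure can be sketched roughly as a fixed-rate loop that tallies outcomes. This is an illustrative sketch under assumptions, not the actual harness: `run_probe` and `operation` are hypothetical names, `operation` stands for any coroutine that performs one request against Discord, and the defaults mirror the figures above (20 minutes, about one call per second).

```python
import asyncio
import errno
import time
from collections import Counter


async def run_probe(operation, duration_s=1200.0, interval_s=1.0):
    """Call `operation` repeatedly for `duration_s` seconds and tally the results.

    Hypothetical harness for illustration. Returns a Counter with the keys
    "ok", "reset" (ECONNRESET), and "other_os_error".
    """
    tally = Counter()
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        try:
            await operation()
            tally["ok"] += 1
        except OSError as exc:
            # Separate connection resets from any other OS-level failure.
            key = "reset" if exc.errno == errno.ECONNRESET else "other_os_error"
            tally[key] += 1
        await asyncio.sleep(interval_s)
    return tally
```

Running the same probe from a VM in each zone and comparing the "reset" counts gives a like-for-like comparison between regions.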

I repeated the experiment in us-east4-a with multiple machine types (up to 8 vCPU) and with both the Premium and Standard network tiers, and I still see resets. I also tried another VM in a different project, but the connection issues persist in us-east4.

I have a support case open with GCP, as it appears to be a region-specific issue.

Are there any additional experiments I could run to narrow down the cause of this? Are there any common GCP configuration problems that could result in this behavior?

Short of moving to another region, I feel as though I am out of options.


python

google-compute-engine

google-cloud-platform
