Score:1

Intermittent 104 Connection Reset By Peer in GCP us-east4

br flag

BACKGROUND

I have a long running discord bot (3+ years) written in discord.py which has always run on GCP, zone us-east4-a. The bot runs in k8s using discord.py 1.7.2 and python 3.9.

PROBLEM

In the past month or two, I have started to see an increasing number of connection interruptions, [Error 104] Connection reset by peer. The resets are not tied directly with the amount of activity on the bot. They happen intermittent throughout the day in production (every few minutes on average).

These resets cause random failures to the discord HTTP API and result in a high level of disconnects on the WebSocket. Many of these shard disconnects are able to RESUME but many (~200 per day) end up resulting in an IDENTIFY call like a new connection and sometimes trigger extended backoff waits and partial outages.

EXAMPLE

Here is an example of a disconnect:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.9/site-packages/discord/shard.py", line 187, in reconnect
    self.ws = await asyncio.wait_for(coro, timeout=60.0)
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 481, in wait_for
    return fut.result()
  File "/opt/venv/lib/python3.9/site-packages/discord/gateway.py", line 305, in from_client
    gateway = gateway or await client.http.get_gateway()
  File "/opt/venv/lib/python3.9/site-packages/discord/http.py", line 967, in get_gateway
    data = await self.request(Route('GET', '/gateway'))
  File "/opt/venv/lib/python3.9/site-packages/discord/http.py", line 192, in request
    async with self.__session.request(method, url, **kwargs) as r:
  File "/opt/venv/lib/python3.9/site-packages/aiohttp/client.py", line 1117, in __aenter__
    self._resp = await self._coro
  File "/opt/venv/lib/python3.9/site-packages/aiohttp/client.py", line 544, in _request
    await resp.start(conn)
  File "/opt/venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 890, in start
    message, payload = await self._protocol.read()  # type: ignore
  File "/opt/venv/lib/python3.9/site-packages/aiohttp/streams.py", line 604, in read
    await self._waiter
aiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer 

EXPERIMENT TO ISOLATE THE PROBLEM

I performed an experiment to isolate what is causing the issue. I deployed a container with my bot to a VM (not k8s) and isolated it such that it only communicates with discord (no outside database) and automatically send it commands to simulate user behavior and load (I send about 60 commands per minute in the same server -- well under my production load). I run this for 20 minutes or until I observe if connection resets happen, and I see the following:

  • In us-east4-a, I am able to reproduce intermittent connection resets.
  • In us-east4-b, I am able to reproduce intermittent connection resets.
  • In us-east4-c, I am able to reproduce intermittent connection resets.
  • In us-central1-a, I am not able to reproduce any connection resets (even after 3 hours -- no shard disconnects at all).
  • In us-east1-b, I am not able to reproduce any connection resets.
  • On my laptop (residential internet on the east coast), I am not able to reproduce any connection resets.

All experiments use the same container, same machine-type and same test procedure.

I repeated the experiment in us-east4-a with multiple machine types up to 8 vCPU and with both the premium and standard network tiers and I still see resets. I also tried another VM in a different project, but the connection issues always persist in us-east4.

I have a support case open with GCP as it appears to be a region specific issue.

Are there any additional experiments I could provide to attempt to narrow down the cause of this? Are there any common GCP configuration problems that could result in this problem?

Short of moving to another region, I feel as though I am out of options.

Priya Gaikwad avatar
us flag
Can you confirm if your issue stands resolved or if you are still facing any issue ?
Score:0
gh flag

As mentioned in the Google groups discussion, “The Google Cloud Compute Engine Team is already investigating this regional issue happening on 'us-east4”. You can expect another update regarding the RCA (if any) in the public issue tracker report. Feel free to comment over there as well.” As mentioned in the update of another support channel, the progress for this issue can be tracked through the public issue tracker.

Score:0
pe flag

I reviewed the Public Issue tracker mentioned on the past answer but, it's already closed as it lacks enough elements to replicate the issue.

In addition as we aren't aware about your VPC configuration or your firewall rules, It sounds a bit tricky to troubleshoot further the issue with the given information.

This might not be the solution to your issue but I would recommend you to open a support ticket with GCP support to get your issue addressed in a better way.

mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.