Score:2

Google App Engine periodic failures


We recently noticed that our Google App Engine project was failing periodically, every 25 hours and 10 minutes (1510 minutes), on three consecutive days and for no apparent reason.

During each incident we saw requests failing with status code 499 (Client Closed Request) after a very long request duration (10 s). Requests normally take a few hundred milliseconds, or occasionally 2-3 seconds, but never anything close to 10 seconds. At the time we saw no uptick in traffic, and we don't run any background jobs. CPU and memory were fine until the issue started; then CPU increased somewhat (e.g. from around 10% to 60%) and even triggered a temporary scale-up from 3 to 5 hosts.

The project is a Python FastAPI image deployed to the flexible environment, configured with a minimum of 3 and a maximum of 12 hosts at the time.

Example of failures from the logs

The timing of these failures was interesting, as they happened almost exactly 25 hours and 10 minutes apart. We made a few deployments during these days at various times, so there is no correlation with server uptime either.

The timestamps below are in UTC:

2021-11-17 17:43
2021-11-18 18:53
2021-11-19 20:03
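
The interval can be checked directly from the three timestamps above. The following is only a minimal sketch: the variable names are illustrative, and the projected next occurrence assumes the pattern would simply repeat, which the data does not prove.

```python
from datetime import datetime, timedelta

# Failure timestamps (UTC) as listed above.
failures = [
    datetime(2021, 11, 17, 17, 43),
    datetime(2021, 11, 18, 18, 53),
    datetime(2021, 11, 19, 20, 3),
]

# Gap between each consecutive pair of failures, in whole minutes.
gaps_min = [
    int((later - earlier).total_seconds() // 60)
    for earlier, later in zip(failures, failures[1:])
]
print(gaps_min)  # [1510, 1510]

# Assumption: if the same interval held, the next failure would fall
# one gap after the last observed one.
next_expected = failures[-1] + timedelta(minutes=gaps_min[-1])
print(next_expected)  # 2021-11-20 21:13:00
```

Both gaps come out at 1510 minutes, i.e. 24 hours plus 70 minutes, so the failure time drifts 70 minutes later each day rather than following a fixed daily schedule.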

Has anyone seen anything similar happening on Google App Engine, or perhaps with the FastAPI image mentioned above?

Score:0

The 499 HTTP status code indicates that the client closed the request. A possible reason is that your client was disconnected during the time frames you mention.

I would recommend checking that your App Engine flexible instances were healthy at those moments by inspecting the Cloud Logging logs, especially the health checks. You can also use the App Engine dashboard to see whether the instances were throttled by high CPU or RAM usage. However, this issue seems to be on the client side, so it might also be worth checking the status of the machine from which you were issuing the requests.

I am also sharing the documentation on troubleshooting App Engine flexible serving errors, which I believe may be useful to you.

robert
Thank you for your answer. The 499 is not the cause but a side effect. The clients disconnected because their requests timed out, as responses were suddenly no longer being sent back. The GET 499s you see are from health checks, which were also timing out. We have investigated Cloud Logging and every metric available in GCP, and we could not find any correlation at all. The only correlation was the timing: it happened exactly 1510 minutes apart, 3 days in a row, as if on a schedule.
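
One way to capture more detail on the server side the next time responses stall is to log every request that exceeds a duration threshold. The following is only a minimal sketch, not something taken from the project: `SlowRequestLogger`, `threshold_s` and the `slow` list are hypothetical names, and it is written as a plain ASGI middleware so it does not depend on any particular framework version.

```python
import logging
import time

logger = logging.getLogger("slow_requests")


class SlowRequestLogger:
    """Hypothetical ASGI middleware that records HTTP requests slower than threshold_s."""

    def __init__(self, app, threshold_s=5.0):
        self.app = app
        self.threshold_s = threshold_s
        self.slow = []  # (path, duration in seconds) pairs, kept for inspection

    async def __call__(self, scope, receive, send):
        # Pass non-HTTP traffic (lifespan, websockets) through untouched.
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        start = time.monotonic()
        try:
            await self.app(scope, receive, send)
        finally:
            # Runs even if the handler raises or the task is cancelled,
            # for example when the client disconnects.
            elapsed = time.monotonic() - start
            if elapsed >= self.threshold_s:
                path = scope.get("path")
                self.slow.append((path, elapsed))
                logger.warning("slow request: %s took %.2fs", path, elapsed)
```

With FastAPI this could be registered as `app.add_middleware(SlowRequestLogger, threshold_s=5.0)`. The logged paths and durations would at least show whether the health checks and the regular endpoints stall at the same moment.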
robert
I also reached a dead end with the Google Issue Tracker. They suggested filing a support case, which we might eventually need to do if we can't find anything else.