Score:1

GCP Cloud Run job fails without a reason

A scheduled GCP Cloud Run job sometimes fails (most of the time it runs correctly) without a proper cause in the error message. The only message it returns is a very obscure one: "Execution JOB_NAME has failed to complete, 0/1 tasks were a success."

Context: The job runs a Docker container that is deployed to GCR. The container is built with GCP Cloud Build, triggered by a GitHub push.

How can I debug this, or get to the bottom of this?
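For concreteness, the kind of CLI digging I have in mind looks like this (a sketch only; JOB_NAME, REGION and EXECUTION_NAME are placeholders, and I am not certain the log label key below is the right one):

    # List recent executions of the job to find the failed one
    gcloud run jobs executions list --job=JOB_NAME --region=REGION

    # Show the status and conditions recorded for a specific execution
    gcloud run jobs executions describe EXECUTION_NAME --region=REGION

    # Pull all log entries attached to that execution, including
    # system-generated ones, not only the container's stdout/stderr
    gcloud logging read \
      'resource.type="cloud_run_job" AND labels."run.googleapis.com/execution_name"="EXECUTION_NAME"' \
      --limit=200 --format=json

But beyond that I don't know where to look.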

Score:2

I am experiencing the same problem and I have a few more things to add.

My Cloud Run Jobs work just fine most of the time. I see them starting and finishing just as expected.

Sometimes, though, they fail, and not for any obvious reason. No exception is raised in my program (I know this because I send exceptions to a third-party service) and I get no information about the cause in the GCP logs either. From the state of the touched database records of many failed runs, I can confirm that the job often executes parts of my code, but my custom logging just gets swallowed and doesn't show up at all.

The only log message is this GCP system message:

status: code 13.

message: "Execution xyz has failed to complete, 0/1 tasks were a success."

Looking deeper into that error, protoPayload -> response -> status only shows "Task xyz-task0 failed with message: Internal error."

Internal error... internal where? In my container, or in Google? And most importantly, where can I see the actual error?!
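For reference, a Logs Explorer filter along these lines is what I use to put those system entries next to my own container output (the job name is a placeholder, and I am not claiming this filter catches everything):

    resource.type="cloud_run_job"
    resource.labels.job_name="xyz"
    severity>=ERROR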

Another anomaly I've noticed is that my program gets interrupted by a SIGTERM just out of the blue. Sometimes it arrives even before my program has booted, so I cannot even trap it because it is too early.
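The earliest place I can think of to catch it would be a tiny entrypoint wrapper that installs a TERM handler before the real program even starts (a sketch only; bin/my_job is a placeholder for my actual command):

    #!/bin/sh
    # Install a SIGTERM handler as early as possible, log the signal,
    # and forward it to the real job process.
    on_term() {
      echo "entrypoint: received SIGTERM" >&2
      kill -TERM "$child" 2>/dev/null
    }
    trap on_term TERM

    bin/my_job &               # start the real job in the background
    child=$!
    wait "$child"              # returns when the job exits or the trap fires
    wait "$child" 2>/dev/null  # reap the job if the first wait was interrupted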

And another thing I noticed: failed jobs are retried even though I configured the job to NEVER be retried (maxRetries: 0). In one case the initial run did not show up in the logs at all, but the retry succeeded and did show up. I only figured that out through the state changes of my database records.
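For completeness, this is how I set and re-check the retry behaviour from the CLI (JOB_NAME and REGION are placeholders; treat it as a sketch rather than a definitive recipe):

    # Explicitly tell Cloud Run never to retry failed tasks
    gcloud run jobs update JOB_NAME --region=REGION --max-retries=0

    # Re-read the stored configuration to confirm maxRetries is really 0
    gcloud run jobs describe JOB_NAME --region=REGION --format=yaml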

So, what am I complaining about here:

  • If I say don't retry, don't do it
  • Don't swallow output from containers
  • Don't interrupt jobs without any obvious reason

I've spent many hours adding custom logging for more transparency, improving my code, clicking through the logs (the Logs Explorer is awful for that, by the way), correlating things and trying to find information in the documentation. Nothing has helped me understand whether I can do anything to make my jobs run more reliably.

For testing purposes I spun up another task, a dummy task: a very simple rake task that sleeps for 10 seconds and just logs a status every 2 seconds. A few hours later I saw that even this task sometimes fails in exactly the same way, without any trace of information.
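The real test is a rake task, but a plain shell equivalent of the same idea is short enough to sketch here (the log wording is made up):

    #!/bin/sh
    # Dummy job: log a heartbeat every 2 seconds for ~10 seconds, then exit 0
    for i in 1 2 3 4 5; do
      echo "dummy task: heartbeat $i of 5"
      sleep 2
    done
    echo "dummy task: done"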

I am quite frustrated because I am now questioning whether Cloud Run Jobs is still the right tool for me. Technically it is a perfect fit for my use case, but not if I cannot rely on the environment.

I am happy to be proven wrong, but at this point I see too much "evidence" that the problem lies in the Cloud Run Jobs platform.
