I've now had four crashes of our AWS ERP servers due to memory apparently maxing out, with the system essentially dying at 100% CPU and little to no available RAM.
Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-1060-aws x86_64) (AWS AMI)
Three times this occurred in the middle of a GitHub Actions run. The workflow was doing a database import followed by a Slack notification. You would therefore think one of those steps caused the issue, but oddly both steps completed normally: the database was fine and the Slack notification was sent.
GitHub itself lost its connection to the runner, and virtual memory usage went through the roof even after the workflow had completed.
The fourth time, this happened while NOTHING was running; the server was in fact idling with nothing going on. I don't have any logs or "top" screenshots of that incident, but I did catch it in the act one time:
So the system is an AWS VM with 4G of RAM. Note that I believe the SI that set up this system configured it with no swap space. This is arguably (very arguably) correct for a server, in the sense that if there's a memory leak you want the system to report out of memory and take corrective action, since with a leak you're going to run out eventually anyway.
In the short term, I was asked to just double the RAM. That seems somewhat unnecessary, as it's a very lightly loaded system (it normally runs with only about 2G of RAM in use even during a heavy batch job), and frankly, if the GitHub Runner.Worker maxes out at 7GB of RAM on a 4GB system, why wouldn't it max out at 16GB of RAM on an 8GB VM? Still, we'll see if it crashes again. I'm not averse to changing TFG's swap configuration, but I'm not sure it's a fix.
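In case it does happen again before anyone spots the cause, I'm considering leaving a small memory logger running so I actually capture which process balloons next time. A rough sketch of what I have in mind, using Python and the psutil library (the log path, sampling interval, and process count below are just placeholders, not anything currently deployed):

```python
#!/usr/bin/env python3
# Rough sketch of a memory logger -- log path, interval, and TOP_N are placeholders.
# Requires: pip install psutil
import time
import psutil

LOG_PATH = "/var/log/mem_watch.log"   # placeholder path
INTERVAL_SECONDS = 60                 # placeholder sampling interval
TOP_N = 5                             # how many of the largest processes to record

def snapshot() -> str:
    """Return one log line: overall memory use plus the biggest RSS consumers."""
    vm = psutil.virtual_memory()
    procs = []
    for p in psutil.process_iter(["pid", "name", "memory_info"]):
        mem = p.info.get("memory_info")
        if mem is None:          # process exited or access was denied
            continue
        procs.append((mem.rss, p.info["pid"], p.info["name"] or "?"))
    procs.sort(reverse=True)
    top = ", ".join(f"{name}({pid})={rss // (1024 * 1024)}MB"
                    for rss, pid, name in procs[:TOP_N])
    return (f"{time.strftime('%Y-%m-%d %H:%M:%S')} "
            f"used={vm.percent}% avail={vm.available // (1024 * 1024)}MB top: {top}")

if __name__ == "__main__":
    while True:
        with open(LOG_PATH, "a") as fh:
            fh.write(snapshot() + "\n")
        time.sleep(INTERVAL_SECONDS)
```

Even just the top few RSS consumers logged every minute should tell me whether it really is Runner.Worker ballooning or something else entirely.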
I have reported this to GitHub, but after more than three weeks of inaction I thought I'd check here and see if anyone has any ideas or fixes.
Thank you,
== John ==