Background:
We run a set of Ubuntu servers whose performance gradually degrades until they become completely unusable. The servers run only Java Spring Boot services on Java 8, several per server. They are virtual machines on ESXi, with an HAProxy load balancer in front of them distributing the load round-robin.
The problem:
Over time, system CPU usage climbs until all cores are maxed out and the load is around 10 times what the server should handle.
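For reference, this is a minimal sketch of how the user/system split can be sampled from the aggregate `cpu` line in /proc/stat. It is illustrative only and not necessarily the tooling we use; the function names (`parse_cpu_line`, `cpu_breakdown`, `sample`) are made up for this example, and the field order is taken from proc(5).

```python
import time
from typing import Dict

# Field order of the aggregate "cpu" line in /proc/stat, per proc(5).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> Dict[str, int]:
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_breakdown(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, float]:
    """Return each field's share (in percent) of the ticks elapsed between two samples."""
    deltas = {k: after.get(k, 0) - before.get(k, 0) for k in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {k: 0.0 for k in FIELDS}
    return {k: 100.0 * d / total for k, d in deltas.items()}


def sample(interval: float = 5.0, path: str = "/proc/stat") -> Dict[str, float]:
    """Take two samples `interval` seconds apart and return the percentage breakdown."""
    def read() -> Dict[str, int]:
        with open(path) as f:
            return parse_cpu_line(f.readline())

    first = read()
    time.sleep(interval)
    return cpu_breakdown(first, read())
```

Calling `sample()` on an affected server returns a dictionary of percentages; a high `system` share relative to `user` corresponds to what we describe as system CPU usage above.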
Observed behaviour:
The CPU usage is generally linked to one or two of the service PIDs.
Stopping that service leads to a PID from another service becoming the top CPU consumer.
Stopping all services on the machine brings CPU usage down to close to zero.
Starting the services back up results in CPU usage hitting the roof again.
TIME_WAIT connections are low during the period, usually around 30-40.
Open files are low and far from the configured limits.
Restarting the VM temporarily resolves the issue.
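To make the per-PID observation above reproducible, here is a minimal sketch of how user versus system time can be read for a single process from /proc/<pid>/stat. The helper names (`parse_pid_stat`, `system_share`, `read_pid_times`) are hypothetical; the field positions (utime is field 14, stime is field 15) come from proc(5).

```python
from typing import Tuple


def parse_pid_stat(stat_text: str) -> Tuple[int, int]:
    """Return (utime, stime) in clock ticks from the contents of /proc/<pid>/stat."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces
    # or ')', so split on the last ')' rather than on whitespace.
    _, _, rest = stat_text.rpartition(")")
    fields = rest.split()
    # fields[0] is field 3 (state), so field N sits at index N - 3.
    return int(fields[11]), int(fields[12])


def system_share(before: Tuple[int, int], after: Tuple[int, int]) -> float:
    """Percentage of the CPU time consumed between two samples that was system time."""
    user_delta = after[0] - before[0]
    system_delta = after[1] - before[1]
    total = user_delta + system_delta
    return 100.0 * system_delta / total if total > 0 else 0.0


def read_pid_times(pid: int) -> Tuple[int, int]:
    """Read (utime, stime) for a live process; Linux only."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_pid_stat(f.read())
```

Taking `read_pid_times(pid)` twice a few seconds apart for the busiest service PID and passing both results to `system_share` shows how much of that process's CPU time is spent in the kernel. The same file format is exposed per thread under /proc/<pid>/task/<tid>/stat.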
Ubuntu version: Ubuntu 20.04.4 LTS
Kernel: 5.4.0-128-generic
ESXi: 7.0.3
I hope this is enough information for people to suggest what to look at and what the problem might be.
Thank you.