Score:4

Is this server overloaded? (htop screenshots)

bd flag

I'm not a server guy; I think it looks overloaded, but I'm not sure. Would you say this server is overloaded? [htop screenshots]

jp flag
Yes, it is overloaded; the load average is too high for two CPUs.
Jack0220
bd flag
Thanks. You should make that an answer so you can get credit. @AlexD
Criggie
in flag
@Jack0220 Is this a physical machine or a virtual machine? I ask because a two-core physical machine would likely be getting a bit old by now (making replacement more pressing), while a virtual machine can often be up-sized with nothing more than a reboot (and possibly a higher monthly bill if you're on AWS or similar).
Craig Estey
kr flag
You have a lot of threads/processes. _If_ you can restructure the app/server and each request is "light" (i.e. the overhead of creating/joining a thread is higher than the processing it does), you may be able to implement a "thread pool". The server defines a pool of N threads (e.g. where N is the number of cores * 2) and starts them. It can queue the requests to a common queue. Each thread grabs a request from the queue, processes it, and then loops/sleeps on the queue, waiting for more work. Otherwise, just "spend the money" ;-)
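A minimal sketch of the thread-pool idea described above, in Python. The sizing rule (N = cores * 2), the `worker`/`run_pool` names, and the "request handling" (doubling a number) are all illustrative assumptions, not anything from the actual server:

```python
import os
import queue
import threading

def worker(q, results, lock):
    # Each pool thread loops on the shared queue, blocking until work arrives.
    while True:
        request = q.get()
        if request is None:               # sentinel: shut this worker down
            q.task_done()
            return
        with lock:
            results.append(request * 2)   # stand-in for real request handling
        q.task_done()

def run_pool(requests, n_threads=None):
    # Rule of thumb from the comment above: N = number of cores * 2.
    n_threads = n_threads or (os.cpu_count() or 1) * 2
    q = queue.Queue()
    results, lock = [], threading.Lock()
    threads = [threading.Thread(target=worker, args=(q, results, lock))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for r in requests:                    # queue the requests to a common queue
        q.put(r)
    for _ in threads:                     # one sentinel per worker
        q.put(None)
    q.join()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    print(sorted(run_pool(range(5))))     # → [0, 2, 4, 6, 8]
```

The point of the pool is that the threads are created once at startup; each request only pays the cost of a queue put/get rather than a thread create/join.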
James
in flag
"Is this *server* overloaded?" Impossible to tell from the provided data. What software is running, and is it heavily CPU-dependent? Do things run slowly, or are you actually OK at peak? In other words, is the resource demand being satisfied, albeit with the available resources at their maximum? The latter is generally not good, since you should keep some headroom for when something needs more than planned. "Is this *CPU* overloaded?" No, it's just at maximum usage.
Score:12
jp flag

Your server has only two CPUs and a load average (LA) in the range of 10-15. That means the running processes demand more CPU time than the two CPUs can provide. You can read much more about load average in this article by Brendan Gregg.

Please note that LA is only a single metric, and even though your system isn't getting all the CPU time it wants, it is still possible that it gets enough CPU time to serve end-user requests reasonably well. You need to check your other metrics before making any decisions about this server, but if your users are already complaining, then the solution is clear: get an instance with more CPUs.

Jack0220
bd flag
I appreciate that. The system keeps peaking. Overall it can handle the load, but not in a timely fashion: the server often doesn't respond, or responds too late. You confirmed my suspicion.
marcelm
ng flag
_"Your server has only two CPUs and LA (load average) in the range 10-15."_ - And yet, 2 out of 3 screenshots show that CPU usage is at about 60%. I wouldn't be so quick to judge that the server is CPU-bound. It could be I/O-bound. I also see relatively high memory pressure, which might not be helping the I/O situation. And either way, a high load does not mean a system is overloaded per se. A well-utilized non-latency-sensitive server (e.g. mail) can be perfectly fine with high loads. It depends on the situation.
Guntram Blohm
in flag
There's not one single process in D state though, and (a part of) redis seems to consume 100% CPU (which means it's single-threaded, or it would go above 100%). That might mean everything else is waiting on the (quite overworked) redis, in which case adding cores won't help much here. I'd check the redis config and log files before just throwing more cores at the problem.
jp flag
@marcelm I agree that there could be a significant I/O load due to `redis-rdb-bgsave` running, but it is hard to tell, as there is no iowait stat available and no processes in 'D' state. Please also note that on each screenshot the 1-minute LA is lower than the 15-minute LA, so the load has lasted a while — a bit too long for a 2 GB snapshot. Also, most of the CPU time is spent in the `chirpstack-network-server` process.
jp flag
As the system is running on AWS, I would recommend moving `redis` to a managed ElastiCache Redis instance, but this will introduce additional network delay that can affect system performance.
Jack0220
bd flag
Thank you all for the additional input. This is a LoRa network server and it is sensitive to latency. There are downlinks sent in response to uplinks that need to be delivered very quickly, and what I'm seeing is that they often come too late, and sometimes not at all. The uplinks are sporadic, so it's possible a bunch of them arrive at the same time, maxing out the system. @marcelm Guntram Blohm
Score:10
mx flag

Define ‘overloaded’.

If you’re just going by load average, then yes, it’s overloaded (by a factor of about 5-7.5). However, load average is only a reasonable metric to use if your workload is massively parallel and primarily CPU-bound. Load average essentially tracks the average number of runnable (and, on Linux, uninterruptible) processes over the past 1/5/15 minutes.

However, based on two of your screenshots, your instantaneous CPU utilization is not constantly 100% of what the system is capable of. Combined with a high load average, this means lots of processes need to run, but each runs quickly and then is done. That’s reasonably normal for a system providing network services, as most network services are not CPU-bound but I/O-bound. This means that load average is not a good metric for determining resource utilization on this system.

What you really should be looking at here (and, actually, what you should look at first for any network service) is the performance metrics of the service itself. In most cases, the relevant ones are latency measurements for the various request types the service serves (more specifically, you usually care about the average latency plus one of the 95th percentile, the 99th percentile, or the peak latency). htop quite simply cannot track this for you; you need another tool such as Netdata (disclaimer: I work for Netdata) or Prometheus.
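To illustrate the latency metrics mentioned above, here is a small sketch using Python's standard library. The sample numbers are made up; in practice the monitoring tool collects these for you:

```python
import statistics

def latency_percentiles(samples_ms):
    """Return (average, p95, p99, peak) latency from a list of samples in ms."""
    # statistics.quantiles with n=100 yields the 1st..99th percentile cut points.
    qs = statistics.quantiles(sorted(samples_ms), n=100)
    return (statistics.fmean(samples_ms), qs[94], qs[98], max(samples_ms))

if __name__ == "__main__":
    # Mostly-fast requests with an occasional slow outlier: the average stays
    # low while the tail percentiles expose the user-visible stalls.
    samples = [12, 15, 11, 14, 13, 250, 16, 12, 15, 13] * 20
    avg, p95, p99, peak = latency_percentiles(samples)
    print(f"avg={avg:.1f}ms p95={p95:.1f}ms p99={p99:.1f}ms peak={peak}ms")
```

The point of tracking tail percentiles rather than just the mean is exactly this shape of data: a service can have a healthy-looking average while a meaningful fraction of requests are far too slow.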

Better even than that, though: are users reporting issues? If no problems are being reported, then it’s probably irrelevant whether the server is ‘overloaded’ or not, because everything is working well enough.

jp flag
Network-bound processes don't affect `LA`, so you won't get `LA` > `number of CPUs` on network-I/O-bound systems. `LA` > `n CPUs` means there are a lot of processes waiting for a CPU but unable to run, not that "they run quickly and then are done" (in that case LA would be roughly equal to the number of CPUs). High LA means the system **is** CPU-bound or disk-I/O-bound. "The instantaneous CPU is not constantly 100%" means the system is past the load peak; you can see it from the 1-minute LA being less than the 5- and 15-minute LA.
Jack0220
bd flag
Yes, there are problems with the end service; the server is not always responding fast enough. There are uplinks coming in, and some of them require downlinks in response. The downlink latency sometimes exceeds 5 seconds, which is way too late (this is a LoRa system). I will take a look at Netdata; it looks nice. The problem is that the people responsible for this server put every service on the same instance instead of spreading them out. It probably worked at first, but as the system grows this is not sustainable. Many thanks to everyone for the good ideas!