We are facing a weird issue with a server in our laboratory. Specifically, the server shows high CPU utilization from low-priority processes (blue color in htop), with 50% of the cores appearing to be at 100% utilization, as shown in the screenshot below.
[Screenshot: htop high utilization]
However, the list of running processes shows no process that consumes this CPU:
$ ps aux --sort pcpu | head -n 20
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2 0.0 0.0 0 0 ? S 10:42 0:00 [kthreadd]
root 3 0.0 0.0 0 0 ? S 10:42 0:00 [ksoftirqd/0]
root 5 0.0 0.0 0 0 ? S< 10:42 0:00 [kworker/0:0H]
root 6 0.0 0.0 0 0 ? S 10:42 0:00 [kworker/u96:0]
root 8 0.0 0.0 0 0 ? S 10:42 0:00 [rcu_sched]
root 9 0.0 0.0 0 0 ? S 10:42 0:00 [rcu_bh]
root 10 0.0 0.0 0 0 ? S 10:42 0:00 [migration/0]
root 11 0.0 0.0 0 0 ? S 10:42 0:00 [watchdog/0]
root 12 0.0 0.0 0 0 ? S 10:42 0:00 [watchdog/1]
root 13 0.0 0.0 0 0 ? S 10:42 0:00 [migration/1]
root 14 0.0 0.0 0 0 ? S 10:42 0:00 [ksoftirqd/1]
root 16 0.0 0.0 0 0 ? S< 10:42 0:00 [kworker/1:0H]
root 17 0.0 0.0 0 0 ? S 10:42 0:00 [watchdog/2]
root 18 0.0 0.0 0 0 ? S 10:42 0:00 [migration/2]
root 19 0.0 0.0 0 0 ? S 10:42 0:00 [ksoftirqd/2]
root 21 0.0 0.0 0 0 ? S< 10:42 0:00 [kworker/2:0H]
root 22 0.0 0.0 0 0 ? S 10:42 0:00 [watchdog/3]
root 23 0.0 0.0 0 0 ? S 10:42 0:00 [migration/3]
root 24 0.0 0.0 0 0 ? S 10:42 0:00 [ksoftirqd/3]
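One caveat about the listing above: with `--sort pcpu`, ps sorts in ascending order, so `head` shows the least busy processes rather than the busiest ones. A minimal sketch of a descending sort follows, together with a filter for niced processes (the ones htop draws in blue). The function names `top_cpu` and `niced_procs` are made up for illustration.

```shell
# top_cpu [N]: header plus the busiest processes, N lines in total (default 20).
# The leading minus sign in --sort=-pcpu requests descending order.
top_cpu() {
    ps aux --sort=-pcpu | head -n "${1:-20}"
}

# niced_procs: only processes with a positive nice value, busiest first.
# "$2 + 0" forces a numeric comparison, so a "-" in the NI column counts as 0.
niced_procs() {
    ps -eo pid,ni,pcpu,comm --sort=-pcpu | awk 'NR == 1 || $2 + 0 > 0'
}

# Example:
# top_cpu 20
# niced_procs
```

If both listings stay empty while htop still shows the blue bars, the time is probably not being accounted to any ordinary userspace process.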
Cause of issue:
After digging around a bit, we found that when we disable the bridge interface we have set up on the server (ifdown br0), the CPU utilization drops to normal levels after 5-10 seconds. If we re-enable the bridge, the utilization spikes again, as in the screenshot above.
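To put numbers on the before/after difference instead of relying on htop, the "nice" time per core can be read from /proc/stat. The following is a rough sketch, not something we have validated on this machine. The function name `nice_share` and the snapshot file names are hypothetical.

```shell
# nice_share BEFORE AFTER: for every cpuN line, print the percentage of time
# spent in "nice" between two saved snapshots of /proc/stat.
nice_share() {
    awk '
        # First snapshot: remember nice ticks and total ticks per CPU.
        NR == FNR && /^cpu[0-9]/ {
            total = 0
            # Fields 2..9 are user, nice, system, idle, iowait, irq, softirq, steal.
            # The guest fields are skipped because they are already counted
            # inside user and nice.
            for (i = 2; i <= 9; i++) total += $i
            nice0[$1] = $3
            total0[$1] = total
            next
        }
        # Second snapshot: print the nice share of the elapsed ticks.
        NR != FNR && /^cpu[0-9]/ {
            total = 0
            for (i = 2; i <= 9; i++) total += $i
            dt = total - total0[$1]
            if (dt > 0) printf "%s %.1f\n", $1, 100 * ($3 - nice0[$1]) / dt
        }
    ' "$1" "$2"
}

# Example: one snapshot pair with br0 up, another after "ifdown br0"
# cat /proc/stat > /tmp/stat.before; sleep 5; cat /proc/stat > /tmp/stat.after
# nice_share /tmp/stat.before /tmp/stat.after
```

Comparing the output with the bridge up and down should show whether the same half of the cores is affected every time.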
What we have tried:
We tried disabling the libvirtd service in case this was an issue with the VMs on the server, but that did not help. We also disabled docker and containerd, but nothing changed either. We also removed and re-installed bridge-utils on the server and renamed the interface to br1, but the issue is still there. Lastly, we booted with a different kernel version, but still nothing.
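A further check that might help narrow this down is the packet rate on the bridge and on its port, since an unusually high rate would point at traffic rather than at a local process. This is only a sketch: the helper name `rx_rate` is hypothetical, and it simply reads the kernel's per-interface counters under /sys/class/net.

```shell
# rx_rate IFACE [SECONDS]: print received packets per second on IFACE,
# averaged over SECONDS (default 5), using the sysfs packet counter.
rx_rate() {
    iface=$1
    secs=${2:-5}
    counter="/sys/class/net/$iface/statistics/rx_packets"
    [ -r "$counter" ] || { echo "no such interface: $iface" >&2; return 1; }
    before=$(cat "$counter")
    sleep "$secs"
    after=$(cat "$counter")
    # Integer division is enough for a rough comparison.
    echo $(( (after - before) / secs ))
}

# Example: compare the bridge with its only port
# rx_rate br0 5
# rx_rate p4p1 5
```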
Has anyone faced any similar issue before?
Server specs:
$ uname -a
Linux cheetara 4.4.0-174-generic #204-Ubuntu SMP Wed Jan 29 06:41:01 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.7 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.7 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
---- Edit
Our server has two network interfaces, p4p1 and p4p2. We have assigned a static IP to each interface through the DHCP server (for convenience let's say they are 137.100.1.11 and 137.100.1.12). Our /etc/network/interfaces file looks as follows:
auto lo
iface lo inet loopback

auto p4p1
iface p4p1 inet manual

auto br0
iface br0 inet static
    address 137.100.1.11
    broadcast 137.100.1.255
    netmask 255.255.255.0
    gateway 137.100.1.200
    dns-nameservers 137.100.1.210 137.100.1.220 8.8.8.8 8.8.4.4
    bridge_ports p4p1

auto ib0
iface ib0 inet static
    address 10.1.0.2
    netmask 255.255.255.0

auto ib1
iface ib1 inet static
    address 10.0.0.2
    netmask 255.255.255.0
where ib0 and ib1 are InfiniBand interfaces not related to external networking.
Also, the routing is as follows:
$ ip route show
default via 137.100.1.200 dev br0 onlink
10.0.0.0/24 dev ib1 proto kernel scope link src 10.0.0.2 linkdown
10.1.0.0/24 dev ib0 proto kernel scope link src 10.1.0.2 linkdown
147.102.37.0/24 dev br0 proto kernel scope link src 147.102.37.24