I have about 300 server nodes that PXE boot and get their IP addresses over DHCP. All 300 nodes talk to a "central" server that acts as both the PXE server and the DHCP server. The nodes sometimes lose their network connection. When that happens, the NIC's MAC address is still visible in ifconfig, but the IP address is usually missing and pinging the central server fails. The loss can occur at seemingly arbitrary times: at boot (so the node cannot start PXE booting), while copying files with scp, or while running scripts that query the database. Sometimes, but not always, the connection comes back by itself after a while. Moreover, in some but not all of these cases, the node's BMC IP also becomes unreachable at the same time as the host IP.
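To pin down when the host IP and the BMC IP drop together and when only one of them drops, something like the following could be run on the central server. It is only a minimal sketch: the node names and addresses are placeholders, `can_ping`, `classify`, and `watch` are names made up for illustration, and it assumes the Linux iputils `ping` command.

```python
import subprocess
import time
from datetime import datetime


def can_ping(addr, timeout_s=2):
    """Return True if a single ICMP echo to addr is answered (Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), addr],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def classify(host_ok, bmc_ok):
    """Turn the two reachability results into one label for the log."""
    if host_ok and bmc_ok:
        return "ok"
    if not host_ok and not bmc_ok:
        return "host and bmc down"
    if not host_ok:
        return "host down, bmc up"
    return "host up, bmc down"


def watch(nodes, interval_s=30):
    """Poll every node forever and print a timestamped line on each state change.

    nodes maps a node name to a (host_ip, bmc_ip) tuple; both are placeholders here.
    """
    last = {}
    while True:
        for name, (host_ip, bmc_ip) in nodes.items():
            state = classify(can_ping(host_ip), can_ping(bmc_ip))
            if last.get(name) != state:
                stamp = datetime.now().isoformat(timespec="seconds")
                print(f"{stamp} {name}: {state}", flush=True)
                last[name] = state
        time.sleep(interval_s)
```

For example, `watch({"node001": ("192.0.2.11", "192.0.2.111")})` would log each transition for one node. With 300 nodes the sequential pings make one pass slow when many nodes are down, so the timestamps are only approximate, but they can still be lined up with the DHCP server log and the switch logs.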
Some other details:
- The 300 nodes and the DHCP/PXE server all run Red Hat 8.3.
- Almost all of the 300 nodes have BMC MAC 1 (dedicated) disabled and BMC MAC 8 (shared) enabled; a handful (fewer than 10) of a different model have MAC 1 enabled.
- The DHCP lease time is about 3 days, and the address pool should hold far more than 300 addresses.
- In the past I ran another set of nodes of the same model (fewer of them, or at a different facility) without this problem.
- The cables, switches, and DHCP server are the same ones used for the past few years; only the client nodes on the network have changed. (I am not sure whether something changed on the DHCP server that could have this effect.)
Any suggestions on how to troubleshoot or analyze this issue? If it could be related to settings on the DHCP/PXE server, which settings would those be, and how do I check and modify them?