Score:0

Servers lose network connection at seemingly arbitrary times

ywl

I have about 300 server nodes that PXE boot and get their IP addresses via DHCP. All 300 nodes communicate with a "central" server that acts as both the PXE server and the DHCP server. However, these nodes sometimes lose their network connection. When that happens, the MAC address of the NIC is still visible in ifconfig, but the IP address often does not show up and pinging the central server fails. The loss can occur at a seemingly arbitrary point in time: at boot (so the node cannot start PXE booting right away), while copying files with scp, or while running scripts that query the database. Sometimes, but perhaps not always, the connection recovers on its own after some time. Moreover, in some but not all of these instances, the node's BMC IP becomes unreachable at the same time as its network IP.

Some other relevant factors:

  1. The 300 nodes and the DHCP/PXE server all run Red Hat 8.3
  2. Almost all of the 300 nodes have BMC MAC 1 (dedicated) disabled and BMC MAC 8 (shared) enabled, but a handful (fewer than 10) of servers of a different model have MAC 1 enabled
  3. The DHCP server lease time is about 3 days, and the pool of available IPs should be far larger than 300
  4. In the past I had another set of nodes of the same model (but fewer of them, or at a different facility) that did not have this problem
  5. The cables, switches, and DHCP server are the same as in the past few years; only the client nodes on the network have changed (I am not sure whether something changed on the DHCP server that could have such an impact)

Any suggestions on how to troubleshoot or analyze this issue? If it might be related to settings on the DHCP/PXE server, which settings could those be, and how can I check and modify them?

Nikita Kipriyanov
Do the BMCs get addresses from the same pool? Is there anything in the DHCP server logs, especially around the time a node loses or regains its connection? Have you tried to reproduce the problem and capture the traffic while it happens?
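
To act on the log question, here is a minimal Python sketch that groups DHCP server log lines by client MAC and lists the clients whose most recent logged message was not a DHCPACK. It assumes the server is ISC dhcpd writing its usual `DHCPDISCOVER from <mac> via <interface>` style messages, which the thread does not confirm. The function names and sample addresses are hypothetical.

```python
import re
import sys
from collections import defaultdict

# Assumption: ISC dhcpd style messages, for example
#   DHCPDISCOVER from aa:bb:cc:dd:ee:ff via eth0
#   DHCPACK on 10.0.0.5 to aa:bb:cc:dd:ee:ff via eth0
# Other DHCP servers log differently and would need another pattern.
EVENT_RE = re.compile(
    r"\b(DHCPDISCOVER|DHCPOFFER|DHCPREQUEST|DHCPACK|DHCPNAK)\b.*?"
    r"\b((?:[0-9a-f]{2}:){5}[0-9a-f]{2})\b",
    re.IGNORECASE,
)


def events_by_mac(lines):
    """Group DHCP message types by client MAC, preserving log order."""
    events = defaultdict(list)
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            events[match.group(2).lower()].append(match.group(1).upper())
    return dict(events)


def unanswered_clients(events):
    """Return the MACs whose last logged message is not a DHCPACK."""
    return sorted(mac for mac, seq in events.items() if seq[-1] != "DHCPACK")


# Hypothetical sample lines, only to show the expected input shape.
SAMPLE = [
    "dhcpd[1]: DHCPDISCOVER from aa:bb:cc:00:00:01 via eth0",
    "dhcpd[1]: DHCPOFFER on 10.0.0.5 to aa:bb:cc:00:00:01 via eth0",
    "dhcpd[1]: DHCPREQUEST for 10.0.0.5 (10.0.0.1) from aa:bb:cc:00:00:01 via eth0",
    "dhcpd[1]: DHCPACK on 10.0.0.5 to aa:bb:cc:00:00:01 via eth0",
    "dhcpd[1]: DHCPDISCOVER from AA:BB:CC:00:00:02 via eth0",
    "dhcpd[1]: DHCPDISCOVER from aa:bb:cc:00:00:02 via eth0",
]

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: pass the path of a file containing the DHCP server log lines.
    with open(sys.argv[1], errors="replace") as log_file:
        for mac in unanswered_clients(events_by_mac(log_file)):
            print(mac)
```

A MAC that keeps appearing with repeated DHCPDISCOVER lines and no DHCPACK around the time a node drops off would suggest that the requests reach the server but are not being answered. No lines at all for that MAC would point at the path between the node and the server instead. Both the NIC and BMC MACs can be checked the same way to see whether they draw from the same pool.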