Why would a NIC on CentOS 7, attached to a Dell N2048P switch, only get a carrier when plugged in after boot?

My situation:

  • I have a new machine, with two NICs (Intel Corporation Intel(R) Ethernet Controller I225-LM (rev 02), according to lspci), but only one is connected.
  • It's connected into a larger network, but directly connects to a Dell N2048P switch.
  • The machine is running the latest CentOS 7 (for software compatibility reasons; this may change in the future), as a minimal installation, entirely vanilla.

My problem:

  • When I boot the machine, the NIC and switch both show activity and there appears to be a carrier, but when CentOS gets around to initialising the hardware or the network, the NIC seems to be disabled. The activity indicators on the switch go dead, briefly come on again, then go dead and never come back.
  • However, if I boot the machine with the same network cable disconnected and connect it once I get to the login prompt (or at least after that initialisation), the card connects properly and there are no issues.

Some notes:

  • The problem sometimes doesn't occur, in about 1 in 4-5 reboots. I've timed them, but there appears to be little pattern to when it works or doesn't, although I've never seen it work twice in a row. There are almost always two or three failures between successes when cycling through reboots quickly.
  • The NIC always works, without fail and immediately, when plugged in after boot.
  • In the logging on the switch, I see the port cycling from Forwarding to Blocking, from Learning to Forwarding and then from Forwarding to Blocking (and then 'down' and 'failed') three times in a row, within a second or so.

What I tried already:

  • I suspected it might be some issue with STP, but turning off STP for this port on the switch doesn't fix it (the network does use STP); nor did any combination of LINKDELAY and NETWORKDELAY I attempted to configure. The frequency of failure remained unchanged, or at least not significantly changed.
  • I tried assigning the NIC a static IP with BOOTPROTO set to none. This does assign it the correct IP, but the NIC still shows as having NO-CARRIER after booting with the cable connected and the link is dead (no activity lights), which seems to suggest that it's not something happening at this stage. Without knowing more than the basics, I'd suggest it's a 'layer 2' problem and not a 'layer 3' problem.
  • I tried configuring the NIC with ETHTOOL_OPTS to default to the settings it ends up negotiating (which work and are to my liking), but this appears to have no effect (ethtool reports that it is still configured for auto-negotiation, even though I set it to off) and the problem remains.
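
One way to check the 'layer 2' suspicion independently of any IP configuration is to read the kernel's own view of the link from sysfs. The following is a minimal Python sketch, assuming the standard `/sys/class/net/<iface>/` attributes (`carrier`, `operstate`, `speed`, `duplex`). The interface name `eth0` and the helpers `read_attr` and `link_state` are hypothetical placeholders, not taken from the setup described above.

```python
from pathlib import Path


def read_attr(iface_dir, name):
    """Return a sysfs attribute as a stripped string, or None if unreadable."""
    try:
        return (Path(iface_dir) / name).read_text().strip()
    except OSError:
        # Missing attribute, or the kernel refused the read
        # (for example `carrier` while the interface is administratively down).
        return None


def link_state(iface="eth0", sysfs_root="/sys/class/net"):
    """Collect the link attributes for one interface.

    `iface` is a placeholder name; `sysfs_root` is a parameter only so the
    function can be pointed at a test directory.
    """
    iface_dir = Path(sysfs_root) / iface
    state = {name: read_attr(iface_dir, name)
             for name in ("carrier", "operstate", "speed", "duplex")}
    # carrier reads "1" when the kernel sees a link, "0" for NO-CARRIER
    state["has_carrier"] = state["carrier"] == "1"
    return state


if __name__ == "__main__":
    for key, value in link_state().items():
        print(f"{key}: {value}")
```

Running this once after a boot with the cable attached and once after plugging in post-boot would show whether the difference is already visible at the carrier level, before any address is assigned.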

Any suggestions on what the issue might be?

Edit: adding some information, after suggestions in comments:

  • The system is now up to date with the latest LT kernel, the issue remains.
  • I've tried disabling NetworkManager altogether and, after that, retried the various settings I'd mentioned previously, realising they may have been ignored due to NetworkManager, but with the same result.
  • Interestingly, without NetworkManager, I noticed that having the network card disabled on boot and then using ifup to bring it online causes it to lose the link when I do that; also, running ifup before connecting the cable and then connecting the cable doesn't get me a connection; nothing works in this setup.
  • Also, with NetworkManager enabled, where I can get a working network connection by plugging in after boot, if I reboot with the cable connected and the connection goes dead, there is no (known) way to revive the connection, even if I disconnect the cable, bring the adapter down and back up, etc., in any order I could think of.
Michael Hampton
Update your system to the latest available kernel and try again. If the problem recurs, collect relevant information from `dmesg`.
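
A minimal Python sketch of how the relevant `dmesg` lines might be collected. The default keywords are assumptions: `igc` is, as far as I know, the in-kernel driver name for I225 parts (the thread itself does not confirm this), and `link is` is only a guess at how link state changes are worded. The helper names `filter_kernel_log` and `read_dmesg` are hypothetical, and `dmesg` may need root on systems that restrict kernel log access.

```python
import subprocess


def filter_kernel_log(lines, keywords=("igc", "link is")):
    """Keep only the lines containing any keyword (case-insensitive).

    The default keywords are assumptions, not confirmed driver output.
    """
    lowered = tuple(k.lower() for k in keywords)
    return [line for line in lines if any(k in line.lower() for k in lowered)]


def read_dmesg():
    """Return the kernel ring buffer as a list of lines."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


if __name__ == "__main__":
    try:
        for line in filter_kernel_log(read_dmesg()):
            print(line)
    except (OSError, subprocess.CalledProcessError) as exc:
        # dmesg missing, or access to the kernel log is restricted
        print(f"could not read the kernel log: {exc}")
```

Comparing the filtered output of a failing boot with that of a working one is probably the quickest way to spot where the two diverge.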
Gerard H. Pille
The two are identical? The other has the same problem?
Grismar
@GerardH.Pille both NICs have the same problem. Neither has a problem when Windows Server 2019 or Windows 10 Pro is installed; the problem always occurs with CentOS 7.
Grismar
@MichaelHampton I've done what you suggested, not a bad tip anyway. This caused me to move the system from the installed SSD to the installed HDD RAID array, because it would not boot off of the SSD, but that's not a problem I want to have to debug first. Regardless, after reinstallation, and full update and upgrade with the latest LT kernel on an up to date CentOS 7, still the same issue.
Gerard H. Pille
When the problem occurs, does de- and reconnecting solve it?
Grismar
No, once the link is dead, I cannot revive it, other than by rebooting the machine and plugging in the cable only after the boot process completes (or, presumably, until it passes some specific point)
Gerard H. Pille
Check with Intel for the most recent driver for your kernel. Try a current live cd linux.
Michael Hampton
The latest LT kernel? Exactly what kernel are you running?
Grismar
The most recent kernel available on `kernel-lt` from elrepo; I gave the latest on `kernel-ml` a try as well, but that's causing many other issues for me and does not appear to resolve the problem either. (I don't have the version number on hand as I write this.) Checking for the most recent driver from Intel themselves for the kernel I'm using and then building a custom kernel, @GerardH.Pille? That's worth a shot, though I have to wonder at the sanity of using this setup at this point.
Gerard H. Pille
With a driver for your kernel, a build would not be needed. The live cd is to check if the problem has been solved.
Grismar
Running the Live CD caused the same situation, so what remains is to find the appropriate drivers for the current kernel.
Grismar
In the end, it turns out there is no current Linux driver from Intel for this specific model (I225-LM) at this time, or at least none I was able to find. Intel lists its complete driver pack, but for this model only a win64 driver is included (it is the only such model; all other models have a Linux driver available as well). I've since added two fast USB-based NICs to work around the issue and hope Intel will release a Linux driver for this model later (or include an appropriate download if it's simply missing).