Score:0

Ethernet connection is dropped when restarting the openibd (InfiniBand) service

I have multiple servers, each with an onboard Ethernet controller and an InfiniBand controller installed in a PCI slot.

The problem is that when I restart openibd.service, which should control only the InfiniBand adapter, my Ethernet network restarts as well.

If I stop openibd, my Ethernet stops as well.

Ethernet and InfiniBand should be separate and independent of each other.

I need to be able to stop or restart openibd.service without dropping my Ethernet connection.

Operating System: AlmaLinux 8.7

Ethernet port in use (1 Gb): eno2np1

OFED version: MLNX_OFED_LINUX-5.9-0.5.6.0

When I restart openibd.service, I lose the Ethernet connection until openibd is running again.
I suspect both cards use the same driver, but I'm not sure how to proceed.

Firmware is updated on all cards.

./mlxfwmanager_LeSI_23B_OFED-23.04-1_build4_fw_update_aug_2023 --query :

Querying Mellanox devices firmware ...

Device #1:
----------

  Device Type:      ConnectX4LX
  Part Number:      Lenovo_Ultron_CX4Lx_2P_25GbE_1G-BaseT_Ax
  Description:      Lenovo Ultron ConnectX-4 Lx LOM 25GbE and 1G-BaseT
  PSID:             LNV0000000028
  PCI Device Name:  0000:65:00.0
  Base MAC:         088fc3a3cb9e
  Versions:         Current        Available
     FW             14.32.1010     14.32.1010
     PXE            3.6.0502       3.6.0502
     UEFI           14.25.0017     14.25.0017

  Status:           Up to date

Device #2:
----------

  Device Type:      ConnectX6
  Part Number:      SC57A40943_Ax
  Description:      ThinkSystem Mellanox ConnectX-6 HDR100/100GbE QSFP56 1-port VPI Adapter
  PSID:             LNV0000000016
  PCI Device Name:  0000:17:00.0
  Base GUID:        946dae030049bd14
  Versions:         Current        Available
     FW             20.37.1014     20.37.1014
     PXE            3.7.0102       3.7.0102
     UEFI           14.30.0013     14.30.0013

  Status:           Up to date

ethtool eno2np1:

Settings for eno2np1:
        Supported ports: [  ]
        Supported link modes:   1000baseKX/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: Yes
        Supported FEC modes: None        RS      BASER
        Advertised link modes:  1000baseKX/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: Yes
        Advertised FEC modes: None       RS      BASER
        Speed: 1000Mb/s
        Duplex: Full
        Auto-negotiation: on
        Port: None
        PHYAD: 0
        Transceiver: internal
        Supports Wake-on: g
        Wake-on: g
        Current message level: 0x00000004 (4)
                               link
        Link detected: yes

ethtool ib0:

Settings for ib0:
        Supported ports: [  ]
        Supported link modes:   Not reported
        Supported pause frame use: No
        Supports auto-negotiation: No
        Supported FEC modes: Not reported
        Advertised link modes:  Not reported
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: 100000Mb/s
        Duplex: Full
        Auto-negotiation: off
        Port: Other
        PHYAD: 0
        Transceiver: internal
        Link detected: yes

lspci -nnn :

17:00.0 Infiniband controller [0207]: Mellanox Technologies MT28908 Family [ConnectX-6] [15b3:101b]
65:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
65:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]

lshw -C network:

  *-network
       description: interface
       product: MT28908 Family [ConnectX-6]
       vendor: Mellanox Technologies
       physical id: 0
       bus info: pci@0000:17:00.0
       logical name: ib0
       version: 00
       serial: 00:00:0a:81:fe:80:00:00:00:00:00:00:94:6d:00:00:00:00:00:00
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress vpd msix pm bus_master cap_list rom physical
       configuration: autonegotiation=off broadcast=yes driver=mlx5_core[ib_ipoib] driverversion=5.9-0.5.5 duplex=full firmware=20.37.1014 (LNV0000000016) ip=192.168.0.3 latency=0 link=yes multicast=yes
       resources: iomemory:21f0-21ef irq:18 memory:21ffc000000-21ffdffffff memory:d4200000-d42fffff
  *-network:0
       description: Ethernet interface
       product: MT27710 Family [ConnectX-4 Lx]
       vendor: Mellanox Technologies
       physical id: 0
       bus info: pci@0000:65:00.0
       logical name: eno1np0
       version: 00
       serial: 08:8f:c3:a3:cb:9e
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress vpd msix pm bus_master cap_list rom ethernet physical autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=mlx5_core driverversion=5.9-0.5.5 firmware=14.32.1010 (LNV0000000028) latency=0 link=no multicast=yes
       resources: iomemory:24f0-24ef irq:18 memory:24ffc000000-24ffdffffff memory:e3500000-e35fffff memory:24ffe800000-24ffeffffff
  *-network:1
       description: Ethernet interface
       product: MT27710 Family [ConnectX-4 Lx]
       vendor: Mellanox Technologies
       physical id: 0.1
       bus info: pci@0000:65:00.1
       logical name: eno2np1
       version: 00
       serial: 08:8f:c3:a3:cb:9f
       size: 1Gbit/s
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress vpd msix pm bus_master cap_list rom ethernet physical autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=mlx5_core driverversion=5.9-0.5.5 duplex=full firmware=14.32.1010 (LNV0000000028) ip=10.0.26.3 latency=0 link=yes multicast=yes speed=1Gbit/s
       resources: iomemory:24f0-24ef irq:19 memory:24ffa000000-24ffbffffff memory:e3400000-e34fffff memory:24ffe000000-24ffe7fffff

/var/log/messages:

systemd[1]: Stopping openibd - configure Mellanox devices...
root[8303]: openibd: running in manual mode
systemd[1]: /usr/lib/systemd/system/ibacm.service:22: Unknown lvalue 'ProtectHostname' in section 'Service'
systemd[1]: /usr/lib/systemd/system/ibacm.service:23: Unknown lvalue 'ProtectKernelLogs' in section 'Service'
NetworkManager[1345]: <info>  [1692350943.3204] device (ib0): state change: activated -> unmanaged (reason 'removed', sys-iface-state: 'removed')
dbus-daemon[1341]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.1' (uid=0 pid=1345 comm="/usr/sbin/NetworkManager --no-daemon ")
systemd[1]: Starting Network Manager Script Dispatcher Service...
dbus-daemon[1341]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
systemd[1]: Started Network Manager Script Dispatcher Service.
systemd[1]: Stopping RDMA Node Description Daemon...
systemd[1]: rdma-ndd.service: Succeeded.
systemd[1]: Stopped RDMA Node Description Daemon.
NetworkManager[1345]: <info>  [1692350945.4769] device (eno2np1): state change: activated -> unmanaged (reason 'removed', sys-iface-state: 'removed')
NetworkManager[1345]: <info>  [1692350945.4912] dhcp4 (eno2np1): canceled DHCP transaction
NetworkManager[1345]: <info>  [1692350945.4913] dhcp4 (eno2np1): activation: beginning transaction (timeout in 45 seconds)
NetworkManager[1345]: <info>  [1692350945.4913] dhcp4 (eno2np1): state changed no lease
NetworkManager[1345]: <info>  [1692350945.4926] manager: NetworkManager state is now DISCONNECTED 

What I have tried so far:

Installing a clean operating system
Updating the server's UEFI firmware
Updating the Mellanox firmware and OFED

Score:0

As part of the openibd.service restart process, the script unloads and reloads the mlx5_core module, which serves as the PCIe device driver for Mellanox / NVIDIA InfiniBand and Ethernet cards, including both cards listed in the question.
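
To see which PCI functions would be affected by unloading the module, one option is to list the devices currently bound to it in sysfs. The following is a minimal sketch, not part of openibd: the function name `list_bound_devices` and the `SYSFS_ROOT` override are hypothetical and exist only so the snippet can be dry-run against a fake directory tree.

```shell
#!/bin/sh
# SYSFS_ROOT is a hypothetical override for dry runs; on a real host it is /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the PCI addresses bound to a driver (default: mlx5_core).
# Bound devices appear in the driver directory as entries named
# like 0000:65:00.1; other entries (bind, unbind, module) are skipped
# because they do not match the address pattern.
list_bound_devices() {
    driver="${1:-mlx5_core}"
    for entry in "$SYSFS_ROOT/bus/pci/drivers/$driver"/*:*:*.*; do
        if [ -e "$entry" ]; then
            basename "$entry"
        fi
    done
}
```

On the host described in the question, `list_bound_devices mlx5_core` would be expected to list the ConnectX-6 function (0000:17:00.0) alongside both ConnectX-4 Lx functions (0000:65:00.0 and 0000:65:00.1), since lshw reports driver=mlx5_core for all three.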

Oren:
I've noticed that. Is it a bug or a feature? It doesn't make any sense to install Mellanox Ethernet and InfiniBand on the same host if that's the way it is supposed to be. Can I exclude mlx5_core from being unloaded when restarting openibd?
haggai_e:
The question of excluding mlx5_core is interesting, but I guess it depends on why you need to restart `openibd` to begin with. If you want to unload and reload the module that handles your ConnectX-6 InfiniBand adapter, you would need to restart `mlx5_core`. If your goal is to restart the driver for a given device without reloading the module, perhaps you could use the bind/unbind sysfs interface instead.
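
The bind/unbind approach mentioned above could look roughly like the sketch below. It detaches a single PCI function from its driver and reattaches it, leaving the module loaded so other functions using the same driver stay up. The function name `rebind_pci_device`, the `SYSFS_ROOT` override, and the `REBIND_DELAY` pause are assumptions for illustration; whether the rest of the OFED stack tolerates this on a given system is untested here.

```shell
#!/bin/sh
# SYSFS_ROOT is a hypothetical override for dry runs; on a real host it is /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Unbind and rebind one PCI function without unloading the driver module.
# $1: PCI address, e.g. 0000:17:00.0 (the ConnectX-6 in the lspci output)
# $2: driver name, defaults to mlx5_core
rebind_pci_device() {
    addr="$1"
    driver="${2:-mlx5_core}"
    drv_dir="$SYSFS_ROOT/bus/pci/drivers/$driver"

    if [ ! -d "$drv_dir" ]; then
        echo "driver $driver is not loaded" >&2
        return 1
    fi

    # Detach the device from the driver.
    printf '%s' "$addr" > "$drv_dir/unbind" || return 1

    # Short pause before reattaching; the one-second default is an assumption.
    sleep "${REBIND_DELAY:-1}"

    # Reattach the device to the same driver.
    printf '%s' "$addr" > "$drv_dir/bind"
}
```

Run as root, `rebind_pci_device 0000:17:00.0` would target only the InfiniBand adapter, so eno2np1 on 0000:65:00.1 should not be touched.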
Oren:
It's not that I need to restart openibd, but mounting fails at startup because the network is dropped while openibd is starting. Network services such as NTP and syslog are interrupted at startup while openibd is starting. Updating OFED drops the network and requires console access.
haggai_e:
As far as I understand, only stopping/restarting openibd unloads the mlx5_core module.
Oren:
That's the way it should be, and that's why I've opened this issue. If I disable openibd, reboot the node, and then start openibd, the network goes down for a moment.