HPC node, Infiniband is DOWN

I have an HPC cluster with 17 nodes running CentOS 7 and a dedicated Mellanox SX6036 Infiniband switch. Each node has an Infiniband FDR interface.

Recently one node started giving errors and a quick look showed that the ib0 IPoIB interface was down.

4: ib0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 256
    link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:f4:52:14:03:00:f6:7c:41 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff

So I checked the ibstat output.

[root@node12 ~]# ibstat
CA 'mlx4_0'
    CA type: MT4099
    Number of ports: 1
    Firmware version: 2.36.5000
    Hardware version: 1
    Node GUID: 0xf452140300f67c40
    System image GUID: 0xf452140300f67c43
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 56
        Base lid: 22
        LMC: 0
        SM lid: 23
        Capability mask: 0x02594868
        Port GUID: 0xf452140300f67c41
        Link layer: InfiniBand

Seeing two conflicting states, I started checking what I could. Starting with the simple stuff, I checked the link lights (all were on), then tried a reboot, a new cable, and a different (known-working) card. None of these made any difference, though I didn't expect them to. I also checked the switch and verified that the port's logical and physical states both showed as good. Lastly, I compared the configuration on other nodes against the broken node. All nodes boot off the same network image (I'm using Bright CM for that), so I expected no differences, and I found none.
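For anyone who wants to reproduce the comparison on their own nodes, here is a minimal sketch in Python of the check I did by eye. It takes the text of the `ip link` and `ibstat` output (the samples below are abbreviated copies of what I pasted above) and flags the case where the IB port is Active/LinkUp but the IPoIB interface reports NO-CARRIER. The function names (`parse_ip_link`, `parse_ibstat`, `diagnose`) are made up for illustration, and the parsing assumes the output layout shown in this post. It also relies on the last 8 bytes of the 20-byte IPoIB hardware address being the port GUID, which is how I confirmed that ib0 really sits on mlx4_0 port 1.

```python
import re

# Abbreviated copies of the outputs shown in this post.
IP_LINK = """4: ib0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 256
    link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:f4:52:14:03:00:f6:7c:41 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
"""

IBSTAT = """CA 'mlx4_0'
    Port 1:
        State: Active
        Physical state: LinkUp
        Port GUID: 0xf452140300f67c41
"""


def parse_ip_link(text):
    """Extract flags, operational state, and port GUID from `ip link` text."""
    flags = re.search(r"<([^>]*)>", text).group(1).split(",")
    state = re.search(r"\bstate (\S+)", text).group(1)
    hwaddr = re.search(r"link/infiniband ([0-9a-f:]+)", text).group(1)
    # Assumption: the last 8 bytes of the IPoIB address are the port GUID.
    guid = "0x" + "".join(hwaddr.split(":")[-8:])
    return {"flags": flags, "state": state, "port_guid": guid}


def parse_ibstat(text):
    """Collect the 'Key: value' lines of `ibstat` output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep:  # skips lines such as "CA 'mlx4_0'" and "Port 1:"
            fields[key] = value
    return fields


def diagnose(ip_link_text, ibstat_text):
    """Report whether the IB port is up while the IPoIB interface has no carrier."""
    link = parse_ip_link(ip_link_text)
    port = parse_ibstat(ibstat_text)
    same_port = link["port_guid"] == port["Port GUID"]
    ib_up = port["State"] == "Active" and port["Physical state"] == "LinkUp"
    no_carrier = "NO-CARRIER" in link["flags"]
    return {
        "same_port": same_port,
        "ib_up": ib_up,
        "ipoib_no_carrier": no_carrier,
        "mismatch": same_port and ib_up and no_carrier,
    }


if __name__ == "__main__":
    print(diagnose(IP_LINK, IBSTAT))
```

On a live node the two strings could instead be captured with `subprocess.run(["ip", "link", "show", "ib0"], ...)` and `subprocess.run(["ibstat"], ...)`. I have left that out so the sketch runs anywhere.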

So that brings me here, again. I'm not an Infiniband expert so if anyone has any ideas I will gladly hear them.
