I have an HPC cluster with 17 nodes running CentOS 7 and a dedicated Mellanox SX6036 InfiniBand switch; each node has an InfiniBand FDR interface.
Recently one node started giving errors, and a quick look showed that the ib0 IPoIB interface was down:
4: ib0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 4092 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 256
link/infiniband 80:00:02:08:fe:80:00:00:00:00:00:00:f4:52:14:03:00:f6:7c:41 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
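For reference, that output is presumably from plain ip link show ib0. The kernel's view can also be read straight from sysfs (standard Linux net-class paths, nothing InfiniBand-specific), which should mirror the NO-CARRIER flag above:

cat /sys/class/net/ib0/operstate
cat /sys/class/net/ib0/carrier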
So I checked the ibstat output:
[root@node12 ~]# ibstat
CA 'mlx4_0'
    CA type: MT4099
    Number of ports: 1
    Firmware version: 2.36.5000
    Hardware version: 1
    Node GUID: 0xf452140300f67c40
    System image GUID: 0xf452140300f67c43
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 56
        Base lid: 22
        LMC: 0
        SM lid: 23
        Capability mask: 0x02594868
        Port GUID: 0xf452140300f67c41
        Link layer: InfiniBand
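As a cross-check on ibstat, the mlx4 driver exposes the same port state under sysfs; a quick sketch, assuming the mlx4_0 / port 1 layout from the output above:

cat /sys/class/infiniband/mlx4_0/ports/1/state       # logical state, should read "4: ACTIVE"
cat /sys/class/infiniband/mlx4_0/ports/1/phys_state  # physical state, should read "5: LinkUp"
cat /sys/class/infiniband/mlx4_0/ports/1/rate        # negotiated rate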
Seeing two conflicting things (ip link reporting NO-CARRIER and state DOWN while ibstat reports the port Active and LinkUp), I started checking what I could. Starting with the simple stuff: I checked the link lights (all were on), then tried a reboot, a new cable, and a different (known-working) card; none made any difference, though I didn't expect them to. I also checked the switch and verified that the port's logical and physical states both showed as good. Lastly, I compared the configuration on the other nodes against the broken node. Since all nodes boot off the same network image (I'm using Bright Cluster Manager for that), I would expect no differences, and I found none.
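In case it helps, here is roughly how the fabric side can be checked from the node itself (all commands from infiniband-diags; the LIDs come from the ibstat output above, and the grep pattern assumes the node description contains the hostname):

sminfo                        # confirm a subnet manager answers (SM lid 23 above)
iblinkinfo | grep -i node12   # the port's state/width/speed as the SM sees it
ibportstate 22 1 query        # PortInfo for this HCA (Base lid 22, port 1)
saquery -g                    # list the fabric's multicast groups

As I understand it, IPoIB only raises carrier once the port joins the IPoIB broadcast multicast group, which may be why ib0 can show NO-CARRIER even while ibstat reports the port Active.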
So that brings me here again. I'm not an InfiniBand expert, so if anyone has any ideas, I will gladly hear them.