I upgraded IBM LSF Suite for Enterprise from 10.2.0.10 to 10.2.0.12, and now, on only one of our GPU cluster servers (1 out of 8), I can't get the LIM service to stay running. It keeps crashing with a segmentation fault:
lim[42062]: segfault at 0 ip 00007f63476c07f7 sp 00007f6345218958 error 4 in libc-2.27.so[7f6347607000+1e7000]
The process generally segfaults after a job has been submitted to the server or has finished there. If a job is running on the server, the LIM and its child processes fail within a minute or so of starting.
Since we are using the IBM Academic Initiative at a university Bioinformatics chair, we have no access to support or Fix Packs, other than major releases.
Currently, nvidia-smi shows the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 8000     On   | 00000000:1A:00.0 Off |                  Off |
| 33%   40C    P8    25W / 260W |   3968MiB / 48601MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000     On   | 00000000:3E:00.0 Off |                  Off |
| 33%   25C    P8    12W / 260W |      1MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 8000     On   | 00000000:89:00.0 Off |                  Off |
| 33%   24C    P8    21W / 260W |      1MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 8000     On   | 00000000:B1:00.0 Off |                  Off |
| 33%   24C    P8    15W / 260W |      1MiB / 48601MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
I managed to get a core dump of the segmentation fault and ran it through gdb. Here is the backtrace, along with some further inspection:
(gdb) bt
#0 __strcat_sse2_unaligned () at ../sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S:298
#1 0x00000000004efa5c in getNvidiaGpu (index=-1408930708, dev=0x7f7dac056810, allDevices=0xbdd9, errorGPU=0x0, errorCount=0, warningGPU=0x7f7dac011730, warningCnt=2) at lim.gpu.c:580
#2 0x00000000004f074b in getGpuReportFullThreadFunc () at lim.gpu.c:858
#3 0x00000000004f11ad in collectGpuInfoThread (arg=0x7f7dac056c6d) at lim.gpu.c:949
#4 0x00007f7db92756db in start_thread (arg=0x7f7db5ec8700) at pthread_create.c:463
#5 0x00007f7db83d771f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Here is the assembly where it fails:
=> 0x00007f7db836f7f7 <+1255>: movdqu (%rsi),%xmm1
And here we see that rsi holds address 0, i.e. a NULL pointer:
rsi            0x0                 0
Here is the full backtrace with locals (bt full):
#0 __strcat_sse2_unaligned () at ../sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S:298
No locals.
#1 0x00000000004efa5c in getNvidiaGpu (index=-1408930708, dev=0x7f7dac056810, allDevices=0xbdd9, errorGPU=0x0, errorCount=0, warningGPU=0x7f7dac011730, warningCnt=2) at lim.gpu.c:580
fname = 0x7d6878 "getNvidiaGpu"
modelname = "QuadroRTX8000", '\000' <repeats 242 times>
device = 0x7f7db79b3e58
memory = {total = 50962169856, free = 42197254144, used = 8764915712}
pState = NVML_PSTATE_2
utilization = {gpu = 100, memory = 49}
computeMode = NVML_COMPUTEMODE_DEFAULT
temperature = 83
vsbecc = 0
vdbecc = 0
power = 249652
i = 0
j = 0
#2 0x00000000004f074b in getGpuReportFullThreadFunc () at lim.gpu.c:858
dev = 0x7f7dac056810
fname = "getGpuReportFullThreadFunc"
dGlobal = 0x7f7dac001c70
errorGPU = 0x0
warningGPU = 0x7f7dac011730
allDevices = 0x7f7dac00a850
ret = 2886036588
ret1 = 2886036588
ver = {major = 2885721120, minor = 32637, patch = 4294967168, build = 0x11 <error: Cannot access memory at address 0x11>}
rsmi_cnt = 0
nvml_cnt = 4
majorTmp = "11\000\000\000\000\000"
compMajorV = <optimized out>
compMinorV = <optimized out>
majorVer = <optimized out>
majorV = 470
minorV = 57
errorCount = 0
warningCnt = 2
i = 0
gpu_lib = -1408931824
nvmlOpened = 1
#3 0x00000000004f11ad in collectGpuInfoThread (arg=0x7f7dac056c6d) at lim.gpu.c:949
fname = "collectGpuInfoThread"
gpuinfo = 0x7f7dac001c70
gpuinfoError = 0
sampleInterval = 5
#4 0x00007f7db92756db in start_thread (arg=0x7f7db5ec8700) at pthread_create.c:463
pd = 0x7f7db5ec8700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140177899816704, -4327163297919163674, 140177899814848, 0, 0, 10252544, 4398249031032873702, 4398224247775797990}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <optimized out>
#5 0x00007f7db83d771f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
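So getNvidiaGpu() is apparently calling strcat() with a NULL pointer (note errorGPU=0x0 in frame #1). To convince myself that this matches the crash signature, I wrote a minimal repro; this is my own sketch, not LSF code, and only an assumption about what lim.gpu.c:580 might be doing:
/* strcat_null_repro.c - my own minimal sketch, not LSF code.
 * Demonstrates that strcat() with a NULL source argument faults inside
 * glibc's SSE2 strcat routine, matching the backtrace above.
 * Build: gcc -O0 -o strcat_null_repro strcat_null_repro.c
 */
#include <string.h>
#include <stdio.h>

int main(void)
{
    char buf[256] = "GPU warning: ";
    const char *src = NULL;   /* stands in for a GPU string that was never set */

    strcat(buf, src);         /* undefined behavior: glibc reads address 0 */
    puts(buf);
    return 0;
}
Under gdb this faults in the same glibc routine with rsi = 0 (at least on a CPU where glibc selects the SSE2 unaligned variant), so my working theory is that the upgraded lim builds an error/warning string via strcat() without first checking the pointer for NULL.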
With all that being said, we have another server with the exact same specifications that does not have this problem. The NVIDIA driver and CUDA versions are the same, and it runs the same version of Ubuntu, 18.04.6 LTS.
The LSF installation uses a shared configuration over NFS, meaning each server accesses the same configuration files and scripts.
The only difference I can see between the other servers and the problem server is in the command-line options used to start the LIM:
On all the other servers:
root 53635 1.8 0.0 277728 18844 ? S<sl Feb07 472:40 /opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/lim -d /opt/ibm/lsfsuite/lsf/conf/ego/rost_lsf_cluster_1/kernel
root 53639 0.0 0.0 18652 5976 ? S<s Feb07 0:11 \_ /opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/melim
root 53645 0.0 0.0 4681288 14400 ? S<l Feb07 6:26 | \_ /opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/lsfbeat -c /opt/ibm/lsfsuite/lsf/conf/lsfbeats/lsfbeat.yml
root 53640 0.0 0.0 21268 9136 ? S Feb07 7:56 \_ /opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/pim -d /opt/ibm/lsfsuite/lsf/conf/ego/rost_lsf_cluster_1/kernel
root 53641 0.0 0.0 39576 9604 ? Sl Feb07 0:42 \_ /opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/pem
On the one with the segmentation fault:
root 44902 1.8 0.0 272472 16680 ? D<sl 12:17 0:00 /opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/lim
root 44919 4.4 0.0 18656 6500 ? S<s 12:17 0:00 \_ /opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/melim
root 44924 2.2 0.0 468764 11280 ? S<l 12:17 0:00 | \_ /opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/lsfbeat -c /opt/ibm/lsfsuite/lsf/conf/lsfbeats/lsfbeat.yml
root 44920 5.6 0.0 19276 7364 ? S 12:17 0:00 \_ /opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/pim
root 44921 4.6 0.0 39576 10288 ? Sl 12:17 0:00 \_ /opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc/pem
I tried restarting the services using bctrld on both the master and the server, in addition to using the lsfd.service unit... even starting lim manually with the -d /opt/ibm/lsfsuite/lsf/conf/ego/rost_lsf_cluster_1/kernel option. All attempts produce a segmentation fault.
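To narrow down whether the NULL string originates from the driver side, I also put together a small standalone NVML query; again, this is my own sketch against the public NVML API, not anything from LSF, and the file and program names are made up:
/* nvml_check.c - standalone NVML sanity check (my own sketch, not LSF code).
 * Prints the name of each GPU, roughly the way lim's GPU-collection
 * thread presumably queries them.
 * Build: gcc -o nvml_check nvml_check.c -I/usr/local/cuda/include -lnvidia-ml
 */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlReturn_t rc = nvmlInit();
    if (rc != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed: %s\n", nvmlErrorString(rc));
        return 1;
    }

    unsigned int count = 0;
    rc = nvmlDeviceGetCount(&count);
    if (rc != NVML_SUCCESS) {
        fprintf(stderr, "nvmlDeviceGetCount failed: %s\n", nvmlErrorString(rc));
        nvmlShutdown();
        return 1;
    }

    for (unsigned int i = 0; i < count; i++) {
        nvmlDevice_t dev;
        char name[NVML_DEVICE_NAME_BUFFER_SIZE] = "";

        rc = nvmlDeviceGetHandleByIndex(i, &dev);
        if (rc != NVML_SUCCESS) {
            fprintf(stderr, "GPU %u: no handle: %s\n", i, nvmlErrorString(rc));
            continue;
        }
        rc = nvmlDeviceGetName(dev, name, sizeof(name));
        printf("GPU %u: name=\"%s\" (%s)\n", i, name, nvmlErrorString(rc));
    }

    nvmlShutdown();
    return 0;
}
If this prints sane names and NVML_SUCCESS for all four GPUs on the problem server, then the NULL presumably appears somewhere inside the new lim itself rather than coming from the driver.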
Does anyone have any idea what the problem is, or how to fix it? I'm going crazy here.
Thank you very much for taking the time to read this and offer your feedback!