I'm experiencing random segmentation faults and python errors I can't explain on one of my Ubuntu servers. I have been unable to reproduce these faults on any other server or on my development computer. I'm trying to figure out if this is software related (external library or my own code) or hardware related.
These problems first started happening just under a month ago and I've been trying everything I can think of to solve them.
The problematic server has the following OS/Kernel/Hardware/Configuration:
- OS: Ubuntu 22.04.2 LTS
- Kernel: 5.19.0-46-generic
- Memory: 2x16GB DDR4 @ 3200MHz
- CPU: i9-13900K @ 3GHz
- GPU: RTX 4080 16GB
- Storage: 250GB NVMe Sandisk SN770
- Swap is disabled. Note: it was not disabled initially; I disabled it in case it was part of the reason behind the faults.
I have been unable to reproduce the error on the healthy servers when running Ubuntu 20 or 22 or using Linux Kernel 5.15 or 5.19.
I'm not able to show exact code examples from the codebase - I know this makes helping significantly more difficult, but any pointers would be greatly appreciated. The codebase has the following external dependencies:
- opencv-python-headless
- numpy
- scipy
- scikit-learn
- torch
- torchvision
- transformers
- fastapi
- websockets
- requests
- google-cloud-storage
- google-cloud-pubsub
- firebase-admin
- Pillow
- loguru
- uvicorn
On the problematic server, I have 14 kubernetes pods - made up of 5 unique pods running services for the remaining 9 'client' pods. The faults can occur in any of the pods, but occur most frequently in the 9 client pods when all the pods are running. When starting the pods, the unique pods sometimes experience segmentation faults but eventually stabilise.
The client pods encounter segmentation faults at no particular interval - anywhere from 15 minutes to 15 hours apart. The client pods may be idling internally or they could be in the middle of processing.
Once they have started successfully, the unique pods rarely encounter segmentation faults - roughly once every 24 to 48 hours.
Systemd, AppArmor and Apport have all experienced segmentation faults as well - though only once or twice. I also had the entire server deadlock the other day and had to run sudo systemctl --force --force reboot to force the server to reboot as sudo reboot now refused to work.
Software-wise, there is one major difference between the problematic server and my healthy servers that I can see. The problematic server has a GPU and uses CUDA and nvidia-container-runtime while the healthy servers have smaller dedicated GPU servers (nvidia Jetsons) connected to them via an SSH tunnel.
My development computer has a GPU, CUDA and nvidia-container-runtime but does not encounter the same faults as the problematic server.
Regarding the python faults - I honestly can't explain them beyond something external altering the memory of existing objects. I've encountered MemoryErrors, SystemErrors, UnboundLocalErrors, TypeErrors, AttributeErrors and LookupErrors.
Some of my python objects were being passed to functions that never should be receiving them and are never given to them in my python code. For example:
- ObjectA stores a variety of subclass instances of ObjectB.
- FunctionA takes an instance of ObjectB or any of its subclasses as a keyword argument. FunctionA has positional arguments disabled.
- FunctionA received an instance of ObjectA in its call instead of an instance of ObjectB, resulting in an AttributeError when FunctionA attempted to access an instance variable that is present in instances of ObjectB and its subclasses.
- ObjectA is never passed to FunctionA in the codebase - this is known with certainty.
- FunctionA is called between 10 and 20 times per second.
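To make the scenario above concrete, here is a minimal sketch using placeholder names. The attribute `payload`, the subclass `ObjectBChild` and the `items` list are hypothetical stand-ins, not names from my codebase; only the keyword-only signature and the class relationship mirror the real code.

```python
class ObjectB:
    """Base class; `payload` is a hypothetical instance variable."""

    def __init__(self, payload):
        self.payload = payload


class ObjectBChild(ObjectB):
    """Hypothetical subclass of ObjectB."""


class ObjectA:
    """Container that stores instances of ObjectB subclasses."""

    def __init__(self, items):
        self.items = list(items)


def function_a(*, obj):
    # The bare `*` disables positional arguments: obj must be passed by keyword.
    return obj.payload


container = ObjectA([ObjectBChild(1), ObjectBChild(2)])

# The only kind of call that exists in the code: ObjectB instances by keyword.
results = [function_a(obj=item) for item in container.items]

# What the faulty process behaved like: the container itself arriving as `obj`.
try:
    function_a(obj=container)
except AttributeError as exc:
    observed = type(exc).__name__
```

In normal operation only the first kind of call can happen, which is why an ObjectA arriving in FunctionA looks like a reference being swapped underneath the interpreter rather than a logic error.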
I've seen the above situation happen once. Other things I've seen:
File "/usr/local/lib/python3.11/urllib/request.py", line 2521, in getproxies_environment
  for name, value in os.environ.items():
File "<frozen _collections_abc>", line 861, in __iter__
File "<frozen os>", line 680, in __getitem__
File "<frozen os>", line 761, in decode
LookupError: unknown error handler name 'ʚ;'
---
File "/my-repo/.my-venv/lib/python3.11/site-packages/src/subpkg1/subpkg1a/module1.py", line 99, in get_name
    for variable_name, variable in self.__dict__.items():
TypeError: 'dict_itemiterator' object is not callable
---
File "/my-repo/.my-venv/lib/python3.11/site-packages/src/subpkg1/subpkg1b/module2.py", line 136, in _validate_object
callback = self.on_validate_object(call_object)
SystemError: error return without exception set
---
File "/my-repo/.my-venv/lib/python3.11/site-packages/src/subpkg2/subpkg2a/module3.py", line 157, in jsonify_self
    return deepcopy(self.__dict__)
  File "/usr/local/lib/python3.11/copy.py", line 128, in deepcopy
    def deepcopy(x, memo=None, _nil=[]):
SystemError: unknown opcode
---
File "/my-repo/.my-venv/lib/python3.11/site-packages/src/subpkg1/subpkg1b/module2.py", line 52, in __call__
    logger.trace(f"Check called on {type(self).__name__}.")
File "/my-repo/.my-venv/lib/python3.11/site-packages/loguru/_logger.py", line 2006, in trace
    __self._log("TRACE", False, __self._options, __message, args, kwargs)
File "/my-repo/.my-venv/lib/python3.11/site-packages/loguru/_logger.py", line 1886, in _log
    raise ValueError("Level '%s' does not exist" % level) from None
ValueError: Level 'TRACE' does not exist
 - NB: This particular logging statement happens hundreds of times per second.
---
SystemError: Objects/dictobject.c:2509: bad argument to internal function
Those are just some I selected - so I've got a real variety of errors I can't explain. Here are some GDB/faulthandler traces:
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff3197193 in PyArray_TransferNDimToStrided ()
   from /home/myusername/folder/my-repo/.my-venv/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so
#2  0x00007ffff31bc56c in npyiter_copy_to_buffers ()
   from /home/myusername/folder/my-repo/.my-venv/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so
#3  0x00007ffff31b760a in npyiter_buffered_iternext ()
   from /home/myusername/folder/my-repo/.my-venv/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so
#4  0x00007ffff321f6b6 in execute_ufunc_loop ()
   from /home/myusername/folder/my-repo/.my-venv/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so
#5  0x00007ffff32275c5 in ufunc_generic_fastcall ()
   from /home/myusername/folder/my-repo/.my-venv/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so
#6  0x00005555556b0914 in ?? ()
#7  0x00005555556b0803 in PyObject_CallFunctionObjArgs ()
#8  0x00007ffff31cbace in array_multiply ()
   from /home/myusername/folder/my-repo/.my-venv/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so
#9  0x00005555556dc1c8 in PyNumber_Multiply ()
#10 0x000055555569a3c4 in _PyEval_EvalFrameDefault ()
#11 0x00005555556b14ec in _PyFunction_Vectorcall ()
#12 0x0000555555699c14 in _PyEval_EvalFrameDefault ()
#13 0x00005555556b14ec in _PyFunction_Vectorcall ()
#14 0x000055555569ad6b in _PyEval_EvalFrameDefault ()
#15 0x00005555556bef11 in ?? ()
#16 0x000055555569ad6b in _PyEval_EvalFrameDefault ()
#17 0x00005555556bef11 in ?? ()
#18 0x000055555569ad6b in _PyEval_EvalFrameDefault ()
#19 0x00005555556b14ec in _PyFunction_Vectorcall ()
#20 0x0000555555699c14 in _PyEval_EvalFrameDefault ()
#21 0x00005555556bef11 in ?? ()
#22 0x000055555569ad6b in _PyEval_EvalFrameDefault ()
#23 0x00005555556b14ec in _PyFunction_Vectorcall ()
#24 0x0000555555699c14 in _PyEval_EvalFrameDefault ()
#25 0x00005555556b14ec in _PyFunction_Vectorcall ()
#26 0x0000555555699a1d in _PyEval_EvalFrameDefault ()
#27 0x00005555556b14ec in _PyFunction_Vectorcall ()
#28 0x000055555569f75a in _PyEval_EvalFrameDefault ()
#29 0x00005555556b14ec in _PyFunction_Vectorcall ()
#30 0x0000555555699a1d in _PyEval_EvalFrameDefault ()
#31 0x0000555555696176 in ?? ()
#32 0x000055555578bc56 in PyEval_EvalCode ()
#33 0x00005555557b8b18 in ?? ()
#34 0x00005555557b196b in ?? ()
#35 0x00005555557b8865 in ?? ()
#36 0x00005555557b7d48 in _PyRun_SimpleFileObject ()
#37 0x00005555557b7a43 in _PyRun_AnyFileObject ()
#38 0x00005555557a8c3e in Py_RunMain ()
#39 0x000055555577ebcd in Py_BytesMain ()
#40 0x00007ffff7c29d90 in __libc_start_call_main (main=main@entry=0x55555577eb90, argc=argc@entry=16, argv=argv@entry=0x7fffffffddb8) at ../sysdeps/nptl/libc_start_call_main.h:58
#41 0x00007ffff7c29e40 in __libc_start_main_impl (main=0x55555577eb90, argc=16, argv=0x7fffffffddb8, init=<optimised out>, fini=<optimised out>, rtld_fini=<optimised out>, stack_end=0x7fffffffdda8) at ../csu/libc-start.c:392
#42 0x000055555577eac5 in _start ()
---
#0  __pthread_kill_implementation (no_tid=0, signo=11, threadid=140653751431168) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=11, threadid=140653751431168) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140653751431168, signo=signo@entry=11) at ./nptl/pthread_kill.c:89
#3  0x00007fec80c42476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#4  <signal handler called>
#5  0x0000000000000000 in ?? ()
#6  0x00007ffff3197193 in PyArray_TransferNDimToStrided ()
   from /home/myusername/folder/my-repo/.my-venv/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so
#7  0x00007ffff31bc56c in npyiter_copy_to_buffers ()
   from /home/myusername/folder/my-repo/.my-venv/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so
#8  0x00007ffff31b760a in npyiter_buffered_iternext ()
   from /home/myusername/folder/my-repo/.my-venv/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so
#9  0x00007ffff321f6b6 in execute_ufunc_loop ()
   from /home/myusername/folder/my-repo/.my-venv/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so
#10  0x00007ffff32275c5 in ufunc_generic_fastcall ()
   from /home/myusername/folder/my-repo/.my-venv/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so
#11  0x00005555556b0914 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimised out>, args=0x7fffffffc2d0,
    callable=<numpy.ufunc at remote 0x7ffff36b4840>, tstate=0x555555b5d960) at ../Include/cpython/abstract.h:114
#12  object_vacall (tstate=0x555555b5d960, base=<optimised out>, callable=<numpy.ufunc at remote 0x7ffff36b4840>,
    vargs=<optimised out>) at ../Objects/call.c:734
#13  0x00005555556b0803 in PyObject_CallFunctionObjArgs (callable=<optimised out>) at ../Objects/call.c:841
#14  0x00007ffff31cbace in array_multiply ()
   from /home/myusername/folder/my-repo/.my-venv/lib/python3.10/site-packages/numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so
#10 0x00005555556dc1c8 in binary_op1 (op_slot=16, w=<numpy.ndarray at remote 0x7fff287c34b0>,
    v=<numpy.ndarray at remote 0x7fff287c3ed0>) at ../Objects/abstract.c:891
#15 PyNumber_Multiply (v=<numpy.ndarray at remote 0x7fff287c3ed0>, w=<numpy.ndarray at remote 0x7fff287c34b0>)
    at ../Objects/abstract.c:1109
#16 0x000055555569a3c4 in _PyEval_EvalFrameDefault (tstate=<optimised out>, f=<optimised out>, throwflag=<optimised out>)
    at ../Python/ceval.c:2003
#17 0x00005555556b14ec in _PyEval_EvalFrame (throwflag=0,
    f=Frame 0x7ffee801a980, for file /home/myusername/folder/my-repo/src/subpkg3/mask.py, line 122, in mask_array (self=<Mask(_is_frozen=False, mask=<numpy.ndarray at remote 0x7fff287c3810>) at remote 0x7fff402c7580>, array=<numpy.ndarray at remote 0x7fff287c3ed0>, mask=<numpy.ndarray at remote 0x7fff287c3bd0>),
    tstate=0x555555b5d960) at ../Include/internal/pycore_ceval.h:46
---
Current thread 0x00007f08e4408740 (most recent call first):
  File "/my-repo/.my-venv/lib/python3.9/site-packages/src/subpkg3/module7.py", line 122 in mask_array
  File "/my-repo/.my-venv/lib/python3.9/site-packages/src/subpkg4/module6.py", line 674 in get_object
  File "/my-repo/.my-venv/lib/python3.9/site-packages/src/subpkg1/module5.py", line 362 in _run_iteration
  File "/my-repo/.my-venv/lib/python3.9/site-packages/src/subpkg1/module5.py", line 432 in _run
  File "/my-repo/.my-venv/lib/python3.9/site-packages/src/subpkg1/module5.py", line 531 in activate
  File "/my-repo/run_headless.py", line 81 in run_headless_main
  File "/my-repo/run_headless.py", line 179 in <lambda>
  File "/my-repo/run_headless.py", line 129 in main
  File "/my-repo/run_headless.py", line 202 in <module>
Segmentation fault (core dumped)
make: *** [Makefile:125: run] Error 139
What I've tried
- Running directly from source in python versions 3.8, 3.9, 3.10 and 3.11.
- Running in docker.
- Running in rootless docker.
- Running in microk8s.
- Using base docker images python3.11:slim-bookworm, python3.11:slim-bullseye, python3.9:slim-bookworm and python3.9:slim-bullseye.
- Shutting off parts of my code where GDB previously caught a segmentation fault.
- Producing an application separate from my codebase to see if the faults occurred - one occurred when importing numpy.
- Using root user and regular user inside the containers.
- Rolling back to earlier versions of my codebase.
- Reinstalling the server OS.
- Scaling down the number of client pods to 1 and scaling back up gradually. I encountered the faults with a single client pod.
- Comparing installed apt packages between the healthy and problematic servers. The versions of the packages they share are all the same. The problematic server has numerous nvidia-related packages installed which the healthy servers do not (CUDA, libnvidia, libcurand, nvidia-container, nvidia-driver, nvidia-kernel, etc.).
- Memory tests using memtester and stress - neither picked up any errors.
- Monitoring resources to look for sudden spikes/memory leaks etc. The server sits at 7GB of memory usage consistently. Most CPU cores are sitting at 12% usage, some go up to 40%, 45-60C consistently. 2GB of VRAM on the GPU is used consistently, 48-57C consistently.
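A standalone check in the same spirit as the separate test application could look like the sketch below. It is illustrative only and is not the application I actually ran. It uses only the standard library, and the function name and default sizes are arbitrary choices of mine. The idea is that hashing the same bytes must always give the same digest, so any mismatch would point away from my code and towards the machine.

```python
import hashlib


def run_consistency_check(iterations=200, size=1 << 20):
    """Hash the same data repeatedly and return the number of mismatching digests."""
    # Fixed, repeating byte pattern of roughly `size` bytes.
    buffer = bytes(range(256)) * (size // 256)
    expected = hashlib.sha256(buffer).hexdigest()

    mismatches = 0
    for _ in range(iterations):
        # Copy first so the data is written to and read from memory again.
        copy = bytearray(buffer)
        if hashlib.sha256(copy).hexdigest() != expected:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print("mismatches:", run_consistency_check())
```

Running it with `python -X faulthandler` makes the interpreter dump a Python traceback if the process itself segfaults. A clean run proves nothing, since the faults are intermittent, but a non-zero mismatch count on deterministic work would be hard to blame on a library.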
Finally (and it may not be relevant), the syslog of the server contains the following entries:
Jul 12 18:09:56 hostname kernel: [  542.966210] systemd-journald[545]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
---
Jul 13 16:19:47 hostname kernel: [77765.893975] python[3937030]: segfault at 0 ip 0000000000000000 sp 00007ffe617d0f98 error 14 in python3.11[55e903651000+1000]
Jul 13 16:19:47 hostname kernel: [77765.893983] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
 - NB: Whenever a seg fault happens in one of the containers, a syslog statement similar to above always occurs with error 4, 5, 14 or 15.
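For anyone reading these entries: as far as I understand it, the number after "error" is the x86 page-fault error code, and the kernel prints it in hexadecimal. The sketch below decodes it under that assumption, using the usual meaning of the low five bits. The function and table names are my own.

```python
# Low five bits of the x86 page-fault error code: (mask, meaning if set, meaning if clear).
FLAGS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "not a write"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_segfault_error(code_text):
    """Decode the 'error N' field of a kernel segfault line, treating N as hex."""
    code = int(code_text, 16)
    parts = []
    for mask, if_set, if_clear in FLAGS:
        label = if_set if code & mask else if_clear
        if label:
            parts.append(label)
    return parts


for text in ("4", "5", "14", "15"):
    print(text, "->", ", ".join(decode_segfault_error(text)))
```

Read this way, error 14 together with ip 0000000000000000 describes a user-mode instruction fetch from an unmapped address, i.e. a jump through a null function pointer, which matches frame #0 in the GDB traces above.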
Hopefully I've given enough detail, though I know this will be difficult without seeing my code and with me removing identifying information from the traces.
Thanks.