I'm having an issue with an NVIDIA RTX A6000 whenever the machine starts up or whenever there is load on the GPU.
dmesg reports "AER: buffer overflow in recovery" for three separate PCI addresses:
41:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
41:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
40:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
AER also reports these errors as corrected, but the messages also show snd_hda_intel 0000:41:00.1 as being affected by the issue.
[ 5.301395] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301397] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301399] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301401] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301402] snd_hda_intel 0000:41:00.1: [ 0] RxErr (First)
[ 5.301403] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301405] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301405] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301406] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301407] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301408] snd_hda_intel 0000:41:00.1: [ 0] RxErr (First)
[ 5.301409] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301410] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301411] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301411] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301413] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301414] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301414] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301416] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301416] snd_hda_intel 0000:41:00.1: [ 0] RxErr (First)
[ 5.301417] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301418] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301419] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301420] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301421] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301422] snd_hda_intel 0000:41:00.1: [ 0] RxErr (First)
[ 5.301422] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301424] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301424] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301425] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301426] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301427] snd_hda_intel 0000:41:00.1: [ 0] RxErr (First)
[ 5.301428] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301429] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301430] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301430] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301432] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301432] snd_hda_intel 0000:41:00.1: [ 0] RxErr (First)
[ 5.301433] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301435] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301435] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301436] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301437] pcieport 0000:40:01.1: AER: aer_status: 0x00001000, aer_mask: 0x00000000
[ 5.301438] pcieport 0000:40:01.1: [12] Timeout
[ 5.301439] pcieport 0000:40:01.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
[ 5.301440] pcieport 0000:40:01.1: AER: aer_status: 0x00001000, aer_mask: 0x00000000
[ 5.301441] pcieport 0000:40:01.1: [12] Timeout
[ 5.301442] pcieport 0000:40:01.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
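To make the pattern in the log above easier to see, the entries can be tallied per device and error type. The following is a minimal Python sketch, not a definitive tool. It assumes the `aer_status` values are correctable-error status bits (consistent with the `[ 0] RxErr` and `[12] Timeout` lines in the log), and the names `decode_status` and `summarize_aer` are made up for this illustration.

```python
import re
from collections import Counter

# Bit names of the PCIe AER Correctable Error Status register.
# Assumption: every aer_status value in the log is a correctable status.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

# Matches lines such as:
# [ 5.301395] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
STATUS_RE = re.compile(
    r"(?P<driver>\S+) "
    r"(?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"AER: aer_status: 0x(?P<status>[0-9a-fA-F]+)"
)


def decode_status(status):
    """Return the names of all known bits set in a status value."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if status & (1 << bit)]


def summarize_aer(log_text):
    """Count (PCI address, error name) pairs found in dmesg text."""
    counts = Counter()
    for match in STATUS_RE.finditer(log_text):
        status = int(match.group("status"), 16)
        for name in decode_status(status):
            counts[(match.group("addr"), name)] += 1
    return counts
```

Passing the full dmesg text to `summarize_aer` gives one count per address and error type, which makes it easier to compare boots or before/after a hardware change.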
PCI Address Corrected
Corrected messages are present for all three PCI addresses listed previously. An example of the correction:
[ 10.419954] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 512
[ 10.419957] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 10.419958] {3}[Hardware Error]: event severity: corrected
[ 10.419959] {3}[Hardware Error]: Error 0, type: corrected
[ 10.419960] {3}[Hardware Error]: section_type: PCIe error
[ 10.419960] {3}[Hardware Error]: port_type: 4, root port
[ 10.419961] {3}[Hardware Error]: version: 0.2
[ 10.419961] {3}[Hardware Error]: command: 0x0407, status: 0x0010
[ 10.419962] {3}[Hardware Error]: device_id: 0000:40:01.1
[ 10.419963] {3}[Hardware Error]: slot: 0
[ 10.419964] {3}[Hardware Error]: secondary_bus: 0x41
[ 10.419964] {3}[Hardware Error]: vendor_id: 0x1022, device_id: 0x1483
[ 10.419965] {3}[Hardware Error]: class_code: 060400
[ 10.419966] {3}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0012
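The `command` and `status` values in the record above can be decoded bit by bit. This is a rough sketch, assuming those two fields are the standard PCI Command and Status registers of the root port; the `decode` helper and the tables (which list only the commonly used bits) are written for this illustration.

```python
# Selected bits of the standard PCI Command register.
PCI_COMMAND_BITS = {
    0: "I/O Space Enable",
    1: "Memory Space Enable",
    2: "Bus Master Enable",
    6: "Parity Error Response",
    8: "SERR# Enable",
    10: "Interrupt Disable",
}

# Selected bits of the standard PCI Status register.
PCI_STATUS_BITS = {
    3: "Interrupt Status",
    4: "Capabilities List",
    8: "Master Data Parity Error",
    11: "Signaled Target Abort",
    12: "Received Target Abort",
    13: "Received Master Abort",
    14: "Signaled System Error",
    15: "Detected Parity Error",
}


def decode(value, names):
    """List the names of the bits set in a 16-bit register value."""
    return [names.get(bit, f"bit {bit}")
            for bit in range(16) if value & (1 << bit)]


print("command 0x0407:", decode(0x0407, PCI_COMMAND_BITS))
print("status  0x0010:", decode(0x0010, PCI_STATUS_BITS))
```

Under that assumption, the status value 0x0010 has only the Capabilities List bit set, so none of the error bits in the standard Status register are latched in this record.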
Tests & Attempts to Resolve the Messages
Working with the supplier, I have tried a fair number of things to narrow down the problem.
- Removing the GPU stops the issue completely.
- Updated the firmware on the two Western Digital SN850X NVMe SSDs in the machine.
- Installed the system to a different SSD model, following speculation that the WD SN850X was the problem.
- PNY have confirmed that there is no BIOS update available for the A6000 GPU.
- BIOS has been updated.
- Running Windows doesn't seem to pick up any specific issues.
- Kernel 6.2 has been tested to ensure all the components in the machine were catered for.
- ASPM has been turned off in the GRUB boot menu in case the power switching on the PCI lane was causing issues. There is no ASPM control in the BIOS for the GPU, only for storage.
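To double-check that the ASPM setting from the last item actually took effect, the running kernel's command line and the active ASPM policy can be inspected. This is a minimal sketch. It assumes the boot parameter used was `pcie_aspm=off` (the exact parameter is not stated above) and that the kernel exposes `/sys/module/pcie_aspm/parameters/policy`; the function names are made up for this illustration.

```python
from pathlib import Path


def aspm_disabled_on_cmdline(cmdline):
    """True if pcie_aspm=off appears as a kernel boot parameter."""
    return "pcie_aspm=off" in cmdline.split()


def active_aspm_policy(policy_text):
    """Return the policy shown in [brackets], e.g. 'default', or None."""
    for token in policy_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


if __name__ == "__main__":
    cmdline_file = Path("/proc/cmdline")
    policy_file = Path("/sys/module/pcie_aspm/parameters/policy")
    if cmdline_file.exists():
        print("pcie_aspm=off on cmdline:",
              aspm_disabled_on_cmdline(cmdline_file.read_text()))
    if policy_file.exists():
        print("active ASPM policy:",
              active_aspm_policy(policy_file.read_text()))
```

If the parameter is missing from `/proc/cmdline`, the GRUB change was probably not applied to the entry that was booted.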
A student has run a few computational jobs on this machine and didn't report any specific issues while using the GPU. FurMark in Windows and GPUburn in Ubuntu also appear to run without issue, which seems to indicate the errors are being corrected.
I'm still keen to better understand what is going wrong, to make sure these AER messages aren't going to affect future work on the machine, as it's going to be used for computation. It's still hard to tell whether this is a software issue in the OS or a hardware issue with the card.
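One way to judge whether the corrected errors grow under load is to snapshot the per-device AER counters before and after a GPU job and compare them. This is a minimal sketch under several assumptions: that the kernel exposes `aer_dev_correctable` in sysfs for these devices, that each line of that file is an error name followed by an integer count, and that the counters are updated even though the errors here appear to be reported through firmware (APEI), which I have not confirmed. The function names are hypothetical.

```python
from pathlib import Path

# The three PCI addresses from the lspci output above.
DEVICES = ["0000:40:01.1", "0000:41:00.0", "0000:41:00.1"]


def parse_aer_counters(text):
    """Parse 'name count' lines into a dict; names may contain spaces."""
    counters = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 1)
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def read_counters(addr, sysfs_root="/sys/bus/pci/devices"):
    """Read correctable-error counters for one device, or None if absent."""
    path = Path(sysfs_root) / addr / "aer_dev_correctable"
    if not path.exists():
        return None
    return parse_aer_counters(path.read_text())


def counter_delta(before, after):
    """Return only the counters that changed between two snapshots."""
    return {name: after[name] - before.get(name, 0)
            for name in after if after[name] != before.get(name, 0)}


if __name__ == "__main__":
    for addr in DEVICES:
        print(addr, read_counters(addr))
```

Taking one snapshot with `read_counters` before starting a job and another afterwards, then passing both to `counter_delta`, shows how many corrected errors accumulated during the run.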
Thanks in advance!