Score:0

PCI Errors Reported with RTX A6000

I'm having an issue with an NVIDIA RTX A6000 whenever the machine starts up or whenever there is load on the GPU.

dmesg reports "AER: buffer overflow in recovery" for three separate PCI addresses:

41:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
41:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
40:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge

AER also reports these errors as corrected, but the messages also show snd_hda_intel 0000:41:00.1 being affected by the issue.

[    5.301395] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301397] nvidia 0000:41:00.0:    [ 0] RxErr                  (First)
[    5.301399] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301401] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301402] snd_hda_intel 0000:41:00.1:    [ 0] RxErr                  (First)
[    5.301403] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301405] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301405] nvidia 0000:41:00.0:    [ 0] RxErr                  (First)
[    5.301406] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301407] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301408] snd_hda_intel 0000:41:00.1:    [ 0] RxErr                  (First)
[    5.301409] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301410] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301411] nvidia 0000:41:00.0:    [ 0] RxErr                  (First)
[    5.301411] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301413] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301414] nvidia 0000:41:00.0:    [ 0] RxErr                  (First)
[    5.301414] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301416] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301416] snd_hda_intel 0000:41:00.1:    [ 0] RxErr                  (First)
[    5.301417] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301418] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301419] nvidia 0000:41:00.0:    [ 0] RxErr                  (First)
[    5.301420] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301421] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301422] snd_hda_intel 0000:41:00.1:    [ 0] RxErr                  (First)
[    5.301422] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301424] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301424] nvidia 0000:41:00.0:    [ 0] RxErr                  (First)
[    5.301425] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301426] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301427] snd_hda_intel 0000:41:00.1:    [ 0] RxErr                  (First)
[    5.301428] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301429] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301430] nvidia 0000:41:00.0:    [ 0] RxErr                  (First)
[    5.301430] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301432] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301432] snd_hda_intel 0000:41:00.1:    [ 0] RxErr                  (First)
[    5.301433] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301435] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301435] nvidia 0000:41:00.0:    [ 0] RxErr                  (First)
[    5.301436] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301437] pcieport 0000:40:01.1: AER: aer_status: 0x00001000, aer_mask: 0x00000000
[    5.301438] pcieport 0000:40:01.1:    [12] Timeout               
[    5.301439] pcieport 0000:40:01.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
[    5.301440] pcieport 0000:40:01.1: AER: aer_status: 0x00001000, aer_mask: 0x00000000
[    5.301441] pcieport 0000:40:01.1:    [12] Timeout               
[    5.301442] pcieport 0000:40:01.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
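To summarise a long run of messages like the one above, the aer_status values can be decoded and counted per device. The following is a minimal sketch, not a definitive tool: it assumes the log lines have the exact shape shown above, and it assumes the status values are correctable-error bits (the RxErr and Timeout labels in the log are the kernel's short names for bits 0 and 12 of the PCIe Correctable Error Status register). The function names are made up for illustration.

```python
import re
from collections import Counter

# Short names the Linux AER code prints for bits of the PCIe
# Correctable Error Status register.
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}

# Matches lines such as:
#   nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
STATUS_RE = re.compile(
    r"(?P<driver>\S+) (?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"AER: aer_status: 0x(?P<status>[0-9a-fA-F]{8})"
)


def decode_status(status):
    """Return the names of the correctable-error bits set in status."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if status & (1 << bit)]


def tally_aer(log_text):
    """Count (PCI address, error name) pairs found in dmesg text."""
    counts = Counter()
    for match in STATUS_RE.finditer(log_text):
        status = int(match.group("status"), 16)
        for name in decode_status(status):
            counts[(match.group("addr"), name)] += 1
    return counts


sample = """\
[    5.301395] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301401] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301437] pcieport 0000:40:01.1: AER: aer_status: 0x00001000, aer_mask: 0x00000000
"""

for (addr, name), n in sorted(tally_aer(sample).items()):
    print(addr, name, n)
```

Feeding it the full output of dmesg instead of the three-line sample would show whether the receiver errors sit on the GPU functions and the replay timeouts on the bridge, which is the pattern in the excerpt above.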

PCI Address Corrected

Corrected messages are present for all three PCI addresses listed previously. An example of the correction:

[   10.419954] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 512
[   10.419957] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[   10.419958] {3}[Hardware Error]: event severity: corrected
[   10.419959] {3}[Hardware Error]:  Error 0, type: corrected
[   10.419960] {3}[Hardware Error]:   section_type: PCIe error
[   10.419960] {3}[Hardware Error]:   port_type: 4, root port
[   10.419961] {3}[Hardware Error]:   version: 0.2
[   10.419961] {3}[Hardware Error]:   command: 0x0407, status: 0x0010
[   10.419962] {3}[Hardware Error]:   device_id: 0000:40:01.1
[   10.419963] {3}[Hardware Error]:   slot: 0
[   10.419964] {3}[Hardware Error]:   secondary_bus: 0x41
[   10.419964] {3}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x1483
[   10.419965] {3}[Hardware Error]:   class_code: 060400
[   10.419966] {3}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0012

Tests & Attempts to Resolve the Messages

Working with the supplier, I have tried a fair number of things to narrow down the problem.

  • Removing the GPU stops the issue completely.
  • Updated the firmware on the two Western Digital SN850X NVMe SSDs in the machine.
  • Installed the system to a different SSD model, following speculation that the WD SN850X was the problem.
  • PNY have confirmed that there is no BIOS update available for the A6000 GPU.
  • BIOS has been updated.
  • Running Windows doesn't seem to pick up any specific issues.
  • Kernel 6.2 has been tested to ensure all the components were catered for.
  • ASPM has been turned off in the GRUB boot menu in case power switching on the PCI lane was causing issues. There is no ASPM control in the BIOS for the GPU, only for storage.
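For the ASPM item in the list above, it can be worth confirming that the setting actually reached the running kernel. The sketch below is an assumption-laden example: the post does not say which option was used, so it assumes the pcie_aspm=off kernel parameter, and it reads /proc/cmdline and /sys/module/pcie_aspm/parameters/policy, which may not exist on every kernel build. The helper names are hypothetical.

```python
from pathlib import Path


def aspm_disabled_on_cmdline(cmdline):
    """True if the (assumed) pcie_aspm=off parameter is on the kernel command line."""
    return "pcie_aspm=off" in cmdline.split()


def active_aspm_policy(policy_text):
    """Return the bracketed entry from the policy file, e.g. 'default'.

    The file lists all policies and marks the active one with brackets,
    for example: "[default] performance powersave powersupersave".
    """
    for token in policy_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_text_or_none(path):
    """Read a file if it exists; return None otherwise."""
    p = Path(path)
    try:
        return p.read_text()
    except OSError:
        return None


cmdline = read_text_or_none("/proc/cmdline")
policy = read_text_or_none("/sys/module/pcie_aspm/parameters/policy")

if cmdline is not None:
    print("pcie_aspm=off on command line:", aspm_disabled_on_cmdline(cmdline))
if policy is not None:
    print("active ASPM policy:", active_aspm_policy(policy))
```

If the parameter is missing from /proc/cmdline, the GRUB change did not take effect for that boot.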

A student ran a few computational jobs on this machine and didn't report any specific issues while using the GPU. FurMark in Windows and GPUburn in Ubuntu also appear to run without issue, which seems to indicate the problem is being corrected.

I'm still keen to better understand what is going wrong, to make sure that this AER message isn't going to affect future work on the machine, as it's going to be used for computation. It's still hard to tell whether this is a software issue in the OS or a hardware issue with the card.
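One way to watch whether corrected errors keep accumulating during real work is to snapshot the per-device AER counters before and after a job. This is only a sketch under assumptions: it relies on the aer_dev_correctable file that recent kernels expose under /sys/bus/pci/devices/<address>/, the counter names differ between kernel versions, and the counters may be absent or not updated when the firmware handles AER itself (as the APEI messages above suggest may be the case here). The function names are invented for the example.

```python
from pathlib import Path


def parse_aer_counters(text):
    """Parse 'name count' lines; the name may itself contain spaces."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.strip().rpartition(" ")
        if sep and value.isdigit():
            counters[name.strip()] = int(value)
    return counters


def counter_delta(before, after):
    """Return only the counters that changed between two snapshots."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] != before.get(name, 0)
    }


def snapshot(addr):
    """Read the correctable-error counters for one PCI address, if present."""
    path = Path("/sys/bus/pci/devices") / addr / "aer_dev_correctable"
    try:
        return parse_aer_counters(path.read_text())
    except OSError:
        return None


# Illustrative text only, not taken from the machine in the question.
before = parse_aer_counters("Receiver Error 2\nBad TLP 0\nTOTAL_ERR_COR 2\n")
after = parse_aer_counters("Receiver Error 5\nBad TLP 0\nTOTAL_ERR_COR 5\n")
print(counter_delta(before, after))
```

In practice snapshot("0000:41:00.0") would be called before and after a GPUburn run, and an empty delta would suggest no new corrected errors were logged for that device during the run.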

Thanks in advance!

Score:0

Not that this is a fix, but I moved the GPU to a different PCIe slot and the errors stopped coming through. A GPUburn test ran with no issues.

I am reporting the problem to the motherboard manufacturer, given it seems to be some obscure PCI address issue (ASUS TeK / Pro WS WRX80E-SAGE SE WIFI).
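When comparing slots like this, it may also help to record the negotiated link speed and width in each slot. The sketch below assumes the current_link_speed, current_link_width, max_link_speed and max_link_width sysfs attributes are available for the device; the helper names are hypothetical. Note that a GPU can lower its link speed at idle to save power, so a reduced speed alone is not proof of a fault, and the check is more meaningful under load.

```python
import re
from pathlib import Path


def parse_speed(text):
    """Extract the GT/s figure from strings like '16.0 GT/s PCIe' or '8 GT/s'."""
    match = re.match(r"\s*([0-9]+(?:\.[0-9]+)?)\s*GT/s", text)
    return float(match.group(1)) if match else None


def link_report(cur_speed, max_speed, cur_width, max_width):
    """Compare negotiated link parameters against the device maximum."""
    return {
        "speed_reduced": parse_speed(cur_speed) < parse_speed(max_speed),
        "width_reduced": int(cur_width) < int(max_width),
    }


def read_link(addr):
    """Read the four link attributes for one PCI address, if present."""
    base = Path("/sys/bus/pci/devices") / addr
    names = ("current_link_speed", "max_link_speed",
             "current_link_width", "max_link_width")
    try:
        return [(base / name).read_text().strip() for name in names]
    except OSError:
        return None


# Illustrative values only, not measured on the machine in the question.
print(link_report("16.0 GT/s PCIe", "16.0 GT/s PCIe", "8", "16"))
```

On the real system, link_report(*read_link("0000:41:00.0")) would be run with the card in each slot; a narrower width in the slot that produced the errors would point toward the slot or board rather than the card.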
