
RHEL server: what do the kernel messages about "Hung TX queue XX" mean?


We have 524 RHEL machines in our Hadoop cluster. All of them are Dell hardware running RHEL 7.2 (an old kernel version):

uname -r
3.10.0-327.el7.x86_64

Last week we saw the following kernel messages on 64 of the machines:

[Wed Mar 15 00:45:11 2023] i40e 0000:81:00.0 pap3: VSI_seid 388, Hung TX queue 43, tx_pending_hw: 3, NTC:0x90, HWB: 0x99, NTU: 0x9c, TAIL: 0x9c
[Wed Mar 15 00:45:11 2023] i40e 0000:81:00.0 pap3: VSI_seid 388, Issuing force_wb for TX queue 43, Interrupt Reg: 0x0
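
With 64 affected machines, it may help to extract the interface, queue number, and pending descriptor count from each "Hung TX queue" line so the hosts can be compared. The following is a minimal sketch, assuming the dmesg output has already been collected as text; the regular expression is modelled only on the two lines quoted above, and the function names are hypothetical.

```python
import re
from collections import Counter

# Pattern modelled on the "Hung TX queue" line quoted in the question.
# Other driver versions may format the message differently (assumption).
HUNG_TX = re.compile(
    r"i40e (?P<pci>\S+) (?P<iface>\S+): VSI_seid \d+, "
    r"Hung TX queue (?P<queue>\d+), tx_pending_hw: (?P<pending>\d+)"
)


def hung_tx_events(dmesg_text):
    """Return (pci, iface, queue, pending) for every 'Hung TX queue' line."""
    events = []
    for line in dmesg_text.splitlines():
        m = HUNG_TX.search(line)
        if m:
            events.append(
                (m["pci"], m["iface"], int(m["queue"]), int(m["pending"]))
            )
    return events


def count_by_queue(events):
    """Count how often each (interface, queue) pair was reported as hung."""
    return Counter((iface, queue) for _, iface, queue, _ in events)


# The two lines from the question, used as sample input.
sample = """\
[Wed Mar 15 00:45:11 2023] i40e 0000:81:00.0 pap3: VSI_seid 388, Hung TX queue 43, tx_pending_hw: 3, NTC:0x90, HWB: 0x99, NTU: 0x9c, TAIL: 0x9c
[Wed Mar 15 00:45:11 2023] i40e 0000:81:00.0 pap3: VSI_seid 388, Issuing force_wb for TX queue 43, Interrupt Reg: 0x0
"""

for (iface, queue), n in count_by_queue(hung_tx_events(sample)).items():
    print(f"{iface} queue {queue}: {n} hung event(s)")
```

Running this over the dmesg output of each affected host would show whether the same interface and queue keep recurring or whether the events are spread out.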

These messages make me think about a kernel upgrade, or an upgrade from RHEL 7.2 to 7.9.

As we understand it, upgrading all machines to RHEL 7.9 is a major task and will take time.

Because the messages above are not clear to me,

I would appreciate other opinions.

Here are more details from the dmesg output:

[Thu Mar  9 16:27:14 2023] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME ---DIGITS_0058--- SOCKET 0 APIC 0
[Thu Mar  9 16:27:15 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x284b463 offset:0xa80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:0)
[Thu Mar  9 16:27:37 2023] mce: [Hardware Error]: Machine check events logged
[Thu Mar  9 16:47:01 2023] {12}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Thu Mar  9 16:47:01 2023] {12}[Hardware Error]: It has been corrected by h/w and requires no further action
[Thu Mar  9 16:47:01 2023] {12}[Hardware Error]: event severity: corrected
[Thu Mar  9 16:47:01 2023] {12}[Hardware Error]:  Error 0, type: corrected
[Thu Mar  9 16:47:01 2023] {12}[Hardware Error]:  fru_text: A1
[Thu Mar  9 16:47:01 2023] {12}[Hardware Error]:   section_type: memory error
[Thu Mar  9 16:47:01 2023] {12}[Hardware Error]:   error_status: 0x---DIGITS_0038---
[Thu Mar  9 16:47:01 2023] {12}[Hardware Error]:   physical_address: 0x---DIGITS_0057---b467b80
[Thu Mar  9 16:47:01 2023] {12}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 0 bank: 3 row: 15545 column: 1000
[Thu Mar  9 16:47:01 2023] {12}[Hardware Error]:   error_type: 2, single-bit ECC
[Thu Mar  9 16:47:01 2023] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Mar  9 16:47:01 2023] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 1: ---DIGITS_0039---f
[Thu Mar  9 16:47:01 2023] EDAC sbridge MC0: TSC b656932cc18c
[Thu Mar  9 16:47:01 2023] EDAC sbridge MC0: ADDR 284b467b80
[Thu Mar  9 16:47:01 2023] EDAC sbridge MC0: MISC 0
[Thu Mar  9 16:47:01 2023] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME ---DIGITS_0059--- SOCKET 0 APIC 0
[Thu Mar  9 16:47:01 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x284b467 offset:0xb80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:0)
[Thu Mar  9 16:47:09 2023] {13}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Thu Mar  9 16:47:09 2023] {13}[Hardware Error]: It has been corrected by h/w and requires no further action
[Thu Mar  9 16:47:09 2023] {13}[Hardware Error]: event severity: corrected
[Thu Mar  9 16:47:09 2023] {13}[Hardware Error]:  Error 0, type: corrected
[Thu Mar  9 16:47:09 2023] {13}[Hardware Error]:  fru_text: A1
[Thu Mar  9 16:47:09 2023] {13}[Hardware Error]:   section_type: memory error
[Thu Mar  9 16:47:09 2023] {13}[Hardware Error]:   error_status: 0x---DIGITS_0038---
[Thu Mar  9 16:47:09 2023] {13}[Hardware Error]:   physical_address: 0x---DIGITS_0057---b465180
[Thu Mar  9 16:47:09 2023] {13}[Hardware Error]:   node: 0 card: 0 module: 0 rank: 0 bank: 3 row: 15545 column: 832
[Thu Mar  9 16:47:09 2023] {13}[Hardware Error]:   error_type: 2, single-bit ECC
[Thu Mar  9 16:47:09 2023] mce: [Hardware Error]: Machine check events logged
[Thu Mar  9 16:47:09 2023] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Mar  9 16:47:09 2023] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 1: ---DIGITS_0039---f
[Thu Mar  9 16:47:09 2023] EDAC sbridge MC0: TSC b65ab992762c
[Thu Mar  9 16:47:09 2023] EDAC sbridge MC0: ADDR 284b465180
[Thu Mar  9 16:47:09 2023] EDAC sbridge MC0: MISC 0
[Thu Mar  9 16:47:09 2023] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME ---DIGITS_0060--- SOCKET 0 APIC 0
[Thu Mar  9 16:47:10 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x284b465 offset:0x180 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:0)
[Thu Mar  9 16:47:37 2023] mce: [Hardware Error]: Machine check events logged
[Thu Mar  9 16:54:47 2023] perf: interrupt took too long (587547 > 458393), lowering kernel.perf_event_max_sample_rate to 1000
[Thu Mar  9 19:04:47 2023] INFO: NMI handler (ghes_notify_nmi) took too long to run: 761611.066 msecs
[Thu Mar  9 19:08:06 2023] INFO: NMI handler (ghes_notify_nmi) took too long to run: 418088.094 msecs
[Thu Mar  9 19:23:55 2023] INFO: NMI handler (ghes_notify_nmi) took too long to run: 377227.104 msecs
[Thu Mar  9 19:59:52 2023] hrtimer: interrupt took ---DIGITS_0061--- ns
[Thu Mar  9 20:32:02 2023] perf: interrupt took too long (998530 > 734433), lowering kernel.perf_event_max_sample_rate to 1000
[Thu Mar  9 20:35:25 2023] {14}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Thu Mar  9 20:35:25 2023] {14}[Hardware Error]: It has been corrected by h/w and requires no further action
[Thu Mar  9 20:35:25 2023] {14}[Hardware Error]: event severity: corrected
[Thu Mar  9 20:35:25 2023] {14}[Hardware Error]:  Error 0, type: corrected
[Thu Mar  9 20:35:25 2023] {14}[Hardware Error]:  fru_text: A5
[Thu Mar  9 20:35:25 2023] {14}[Hardware Error]:   section_type: memory error
[Thu Mar  9 20:35:25 2023] {14}[Hardware Error]:   error_status: 0x---DIGITS_0038---
[Thu Mar  9 20:35:25 2023] {14}[Hardware Error]:   physical_address: 0x0000001ce008b940
[Thu Mar  9 20:35:25 2023] {14}[Hardware Error]:   node: 0 card: 0 module: 1 rank: 0 bank: 0 row: 58882 column: 224
[Thu Mar  9 20:35:25 2023] {14}[Hardware Error]:   error_type: 2, single-bit ECC
[Thu Mar  9 20:35:25 2023] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Mar  9 20:35:25 2023] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 1: ---DIGITS_0039---f
[Thu Mar  9 20:35:25 2023] EDAC sbridge MC0: TSC d1c21b47e1a2
[Thu Mar  9 20:35:25 2023] EDAC sbridge MC0: ADDR 1ce008b940
[Thu Mar  9 20:35:25 2023] EDAC sbridge MC0: MISC 0
[Thu Mar  9 20:35:25 2023] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME ---DIGITS_0062--- SOCKET 0 APIC 0
[Thu Mar  9 20:35:26 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x1ce008b offset:0x940 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:4)
[Thu Mar  9 20:37:37 2023] mce: [Hardware Error]: Machine check events logged
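
The excerpt above consists mostly of corrected single-bit ECC memory errors reported through EDAC, not i40e messages. To see which DIMM the corrected errors (CE) point at, the "CE memory read error" lines can be tallied per DIMM label. This is a minimal sketch, assuming the log is available as text; the pattern is derived only from the EDAC lines shown above, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (...)
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): \d+ CE memory read error on (?P<dimm>\S+)"
)


def corrected_errors_per_dimm(dmesg_text):
    """Count corrected-error (CE) lines per DIMM label."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        m = CE_LINE.search(line)
        if m:
            counts[m["dimm"]] += 1
    return counts


# The four CE lines from the dmesg excerpt above.
sample = """\
[Thu Mar  9 16:27:15 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x284b463 offset:0xa80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:0)
[Thu Mar  9 16:47:01 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x284b467 offset:0xb80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:0)
[Thu Mar  9 16:47:10 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x284b465 offset:0x180 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:0)
[Thu Mar  9 20:35:26 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x1ce008b offset:0x940 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:4)
"""

for dimm, n in corrected_errors_per_dimm(sample).most_common():
    print(f"{dimm}: {n} corrected error(s)")
```

A DIMM that keeps accumulating corrected errors is worth raising with the hardware vendor, independently of the i40e question.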

Notes and points

The TX queue messages appear to come from the Intel i40e driver, and I found several references to the same error. The current version of the driver is 2.22.18, released Feb 14, 2023.

I identified i40e as the source of the messages and searched the web for i40e "Hung TX queue" and i40e "Issuing force_wb for TX queue". The results I found date from 2015 to 2017, and the gist is that it is a driver failure, possibly a bug. I then checked what Intel has to offer; the links are in the references. You need to validate whether they apply to your situation and decide on a course of action.

i40e is a kernel driver, so a kernel update will likely update the driver as well. Intel also offers installation instructions for anyone who wants the newest driver version, which is not included in any kernel.
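
Before deciding between a kernel update and Intel's standalone driver, it is useful to know which i40e version each host actually has loaded. The sketch below assumes the module exposes its version under `/sys/module/i40e/version`, which is where the kernel publishes a module's declared version when it has one; the function name and the `sysfs_root` parameter are hypothetical and exist only to make the example testable.

```python
from pathlib import Path


def read_module_version(module="i40e", sysfs_root="/sys/module"):
    """Return the version string of a loaded kernel module, or None.

    None means the module is not loaded, or it does not publish a
    version file under sysfs (assumption: i40e does publish one).
    """
    version_file = Path(sysfs_root) / module / "version"
    try:
        return version_file.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    version = read_module_version()
    if version is None:
        print("i40e is not loaded or exposes no version")
    else:
        print(f"loaded i40e driver version: {version}")
```

Comparing the reported value against the version offered by Intel shows how far behind the in-kernel driver is on a given host.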

References

https://www.intel.com/content/www/us/en/download/18026/intel-network-adapter-driver-for-pcie-40-gigabit-ethernet-network-connections-under-linux.html

https://www.intel.com/content/www/us/en/docs/programmable/683362/1-3-1/installing-the-xl710-driver.html
