Score:1

Precipitous page allocation failures leading to system crash

We run a platform that uses Linux bridging to filter traffic and logs that activity to a MySQL server. Occasionally the unit experiences very high latency, and leading up to that we often see a repeating page allocation failure in the mpt3sas driver, logged to /var/log/messages. These failures seem to occur at times of high system load, yet on a system with seemingly sufficient memory. I do not have the expertise to read these logs properly and was hoping someone might have some insight.

I have tried setting vm.min_free_kbytes = 65536 (and we are using vm.zone_reclaim_mode = 1), but that doesn't seem to alleviate the problem. Does anyone have any ideas? (Logs follow:)

localhost kernel: [21572436.601597] sas3ircu: page allocation failure: order:4, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
localhost kernel: [21572436.601601] CPU: 2 PID: 22663 Comm: sas3ircu Tainted: G        W  O      #1
localhost kernel: [21572436.601602] Hardware name: XXXXXXXXXXX , BIOS 3.1 06/06/2018
localhost kernel: [21572436.601602] Call Trace:
localhost kernel: [21572436.601609]  dump_stack+0x7c/0x9c
localhost kernel: [21572436.601612]  warn_alloc.cold+0x7b/0xdf
localhost kernel: [21572436.601615]  ? _cond_resched+0x15/0x30
localhost kernel: [21572436.601617]  ? __alloc_pages_direct_compact+0x141/0x150
localhost kernel: [21572436.601618]  __alloc_pages_slowpath+0xd88/0xdc0
localhost kernel: [21572436.601622]  ? node_reclaim+0x2b1/0x310
localhost kernel: [21572436.601624]  ? get_page_from_freelist+0xaf/0x3a0
localhost kernel: [21572436.601625]  __alloc_pages_nodemask+0x2bf/0x310
localhost kernel: [21572436.601628]  __dma_direct_alloc_pages+0x137/0x220
localhost kernel: [21572436.601630]  dma_direct_alloc_pages+0x1c/0x80
localhost kernel: [21572436.601639]  _ctl_do_mpt_command+0x724/0xc40 [mpt3sas]
localhost kernel: [21572436.601642]  ? ima_file_check+0x59/0x80
localhost kernel: [21572436.601646]  _ctl_compat_mpt_command+0xd1/0x100 [mpt3sas]
localhost kernel: [21572436.601651]  _ctl_ioctl_main+0x4e0/0xb80 [mpt3sas]
localhost kernel: [21572436.601655]  ? __ia32_compat_sys_ioctl+0x189/0x210
localhost kernel: [21572436.601656]  __ia32_compat_sys_ioctl+0x189/0x210
localhost kernel: [21572436.601659]  do_int80_syscall_32+0x6e/0x1d0
localhost kernel: [21572436.601660]  entry_INT80_compat+0x85/0x90
localhost kernel: [21572436.601669] Mem-Info:
localhost kernel: [21572436.601672] active_anon:9743919 inactive_anon:513867 isolated_anon:0
localhost kernel: [21572436.601672]  active_file:35892 inactive_file:14339 isolated_file:0
localhost kernel: [21572436.601672]  unevictable:0 dirty:398 writeback:1 unstable:0
localhost kernel: [21572436.601672]  slab_reclaimable:51419 slab_unreclaimable:4912133
localhost kernel: [21572436.601672]  mapped:18355 shmem:22661 pagetables:53364 bounce:0
localhost kernel: [21572436.601672]  free:1065699 free_pcp:351 free_cma:0
localhost kernel: [21572436.601675] Node 0 active_anon:38975676kB inactive_anon:2055468kB active_file:143568kB inactive_file:57356kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:73420kB dirty:1592kB writeback:4kB shmem:90644kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
localhost kernel: [21572436.601675] Node 0 DMA free:15884kB min:12kB low:24kB high:36kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15968kB managed:15884kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
localhost kernel: [21572436.601678] lowmem_reserve[]: 0 1784 64117 64117
localhost kernel: [21572436.601679] Node 0 DMA32 free:255804kB min:1892kB low:3788kB high:5684kB active_anon:170384kB inactive_anon:80484kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1965184kB managed:1899648kB mlocked:0kB kernel_stack:0kB pagetables:56kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
localhost kernel: [21572436.601682] lowmem_reserve[]: 0 0 62333 62333
localhost kernel: [21572436.601683] Node 0 Normal free:3991108kB min:63624kB low:127460kB high:191296kB active_anon:38805292kB inactive_anon:1974984kB active_file:143684kB inactive_file:57032kB unevictable:0kB writepending:1596kB present:65011712kB managed:63836092kB mlocked:0kB kernel_stack:5604kB pagetables:213400kB bounce:0kB free_pcp:1404kB local_pcp:232kB free_cma:0kB
localhost kernel: [21572436.601686] lowmem_reserve[]: 0 0 0 0
localhost kernel: [21572436.601687] Node 0 DMA: 1*4kB (U) 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15884kB
localhost kernel: [21572436.601694] Node 0 DMA32: 14687*4kB (UME) 10010*8kB (UME) 7183*16kB (UME) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB (H) 0*4096kB = 255804kB
localhost kernel: [21572436.601697] Node 0 Normal: 297793*4kB (UM) 129409*8kB (UM) 110330*16kB (UME) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3991724kB
localhost kernel: [21572436.601701] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
localhost kernel: [21572436.601702] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
localhost kernel: [21572436.601702] 107240 total pagecache pages
localhost kernel: [21572436.601707] 34281 pages in swap cache
localhost kernel: [21572436.601708] Swap cache stats: add 18740072, delete 18705912, find 159408767/161694352
localhost kernel: [21572436.601708] Free swap  = 4913860kB
localhost kernel: [21572436.601708] Total swap = 33554424kB
localhost kernel: [21572436.601709] 16748216 pages RAM
localhost kernel: [21572436.601709] 0 pages HighMem/MovableOnly
localhost kernel: [21572436.601709] 310310 pages reserved
localhost kernel: [21572436.601710] 0 pages cma reserved
localhost kernel: [21572436.601710] 0 pages hwpoisoned
localhost kernel: [21572436.601711] failure at drivers/scsi/mpt3sas/mpt3sas_ctl.c:763/_ctl_do_mpt_command()!
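
A note for reading the log above: the failing request is order:4, which on a system with 4 kB base pages (an assumption, though it is the x86-64 default) means 2^4 = 16 contiguous pages, or 64 kB. The per-zone free lists in the log show plenty of free memory, but only in small blocks. The following is a minimal Python sketch, not part of the original post, that parses the "Node 0 Normal" free-list line copied from the log and counts the free blocks large enough for a given order. The helper names (`parse_buddy_line`, `free_blocks_at_or_above`) are made up for illustration.

```python
import re

PAGE_KB = 4  # assumption: 4 kB base pages (x86-64 default)


def parse_buddy_line(line):
    """Return (zone, {block_size_kb: free_count}) from a free-list log line."""
    zone = re.search(r"Node \d+ (\w+):", line).group(1)
    # Matches entries such as "297793*4kB"; the trailing "= ...kB" total has
    # no "*" and is therefore skipped.
    blocks = {int(size): int(count)
              for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}
    return zone, blocks


def free_blocks_at_or_above(blocks, order):
    """Count free blocks big enough to satisfy an allocation of this order."""
    need_kb = PAGE_KB << order  # order 4 -> 64 kB
    return sum(count for size, count in blocks.items() if size >= need_kb)


# Line copied from the log above (kernel timestamp prefix dropped).
normal_line = (
    "Node 0 Normal: 297793*4kB (UM) 129409*8kB (UM) 110330*16kB (UME) "
    "0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB "
    "= 3991724kB"
)

zone, blocks = parse_buddy_line(normal_line)
total_free_kb = sum(size * count for size, count in blocks.items())
usable_for_order4 = free_blocks_at_or_above(blocks, 4)

# slab_unreclaimable is reported in pages in the Mem-Info section of the log.
slab_unreclaimable_gib = 4912133 * PAGE_KB / 1024 ** 2

print(f"{zone}: {total_free_kb} kB free, "
      f"{usable_for_order4} free blocks >= {PAGE_KB << 4} kB")
print(f"slab_unreclaimable: {slab_unreclaimable_gib:.1f} GiB")
```

Under these assumptions, the Normal zone has almost 4 GB free but no free block of 32 kB or larger, so a 64 kB contiguous request cannot be met there without compaction or reclaim. That points at fragmentation rather than a plain shortage of memory. The same arithmetic puts slab_unreclaimable at roughly 18.7 GiB, which may be worth investigating separately (for example with slabtop).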
Wilson Hauck
Additional information request, please. Are there any SSD or NVMe devices on the MySQL host server? Please post the following on pastebin.com and share the links. From your SSH root login, the text results of: A) SELECT COUNT(*) FROM information_schema.tables; B) SHOW GLOBAL STATUS; (after a minimum of 24 hours of uptime) C) SHOW GLOBAL VARIABLES; and, very helpfully, OS information including htop, top, ulimit -a (for a Linux/Unix list of limits), and iostat -xm 5 3 (for IOPS by device and core/CPU count), so that a server workload tuning analysis can provide suggestions.
Wilson Hauck
Why the confusion, about 12 lines from the end of the posted log, with hugepages_size of both 1G and 2M? The two lines follow here:
localhost kernel: [21572436.601701] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
localhost kernel: [21572436.601702] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Just curious: it looks like hugepages were never really used this session, so should they even be enabled? Some people suggest that for MySQL they should always be disabled.
Wilson Hauck
My workload analysis is still available to you once your data is posted. Thanks.