Every Saturday night around midnight the server in question experiences a sudden loss of available memory. Over the course of about 35-40 minutes the available memory drops until the OS is unable to function and locks up. The OS is unresponsive at that point until the server is rebooted.
There are many alerts in the Windows Event logs during the resource depletion from various processes complaining that they’re out of resources. For example, you can track the progression of SQL Server as more and more of its process memory is paged out. Eventually the log entries stop as the OS locks up, and the next entry is after the server is rebooted.
I checked Task Scheduler and didn’t see anything obvious running at midnight on Saturdays that could be the cause.
Last weekend I ran Windows Performance Monitor on a scheduled task to track and log the following parameters during the crash:
Total available memory
Page file bytes (by process)
Page file bytes (total)
Private bytes (by process)
Private bytes (total)
Virtual bytes (by process)
Virtual bytes (total)
Working set (by process)
Working set (total)
Working set - private (by process)
Working set - private (total)
Total processor usage
The total available memory column in the log shows a steady drop starting around midnight and reaching zero around 12:40 am. Over the same period, the working set of each individual process shrinks. However, there is no obvious culprit for the memory loss: I can’t find any process whose memory usage increases while everything else drops.
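For anyone wanting to repeat the check, here is a minimal sketch of how the log can be scanned for a counter that grows over the capture window. It assumes the log has been exported to CSV with the timestamp in the first column and one counter per remaining column; the function name `counter_growth` is made up for illustration and is not part of any Windows tooling.

```python
import csv
import io


def counter_growth(csv_text):
    """Return (counter_name, last - first) pairs, largest growth first."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    first, last = {}, {}
    for row in reader:
        # Column 0 is assumed to be the timestamp; the rest are samples.
        for name, cell in zip(header[1:], row[1:]):
            try:
                value = float(cell)
            except ValueError:
                continue  # blank or missing sample, skip it
            first.setdefault(name, value)  # keep the earliest valid sample
            last[name] = value             # overwrite with the latest one
    growth = [(name, last[name] - first[name]) for name in first]
    return sorted(growth, key=lambda item: item[1], reverse=True)
```

Any counter at the top of the returned list with a large positive difference would be a candidate for closer inspection. Comparing only the first and last samples is a deliberate simplification, so a process that spikes and then gets trimmed in the middle of the window would not stand out.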
I forced a Non-Maskable Interrupt (NMI) this time so I could look at the memory dump, but I could only find a minidump, which contained very little useful information. I’m not sure whether that was due to the size of the page file at the time, a side effect of the resource depletion, or Windows deleting the full dump automatically. As far as I can tell, the dump settings are the defaults for Windows Server 2019. The page file is now the same size as the RAM (16GB), so it’s possible Windows resized it (https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/automatic-memory-dump) and that another NMI will produce a correctly saved memory dump. I’ll try it again next Monday.
I am unsure how best to proceed from here and would appreciate any ideas. For example, are there any counters I should have included in my Performance Monitor log but missed?
Thanks.