Our application runs on multiple VMs (isolated instances). It comprises a few services and an ASP.NET website. The issue occurs on almost all of the VMs to some degree.
The service was crashing often, roughly every 2-4 hours, sometimes less, sometimes more. Surprisingly, the website worker process crashed at the same time. In some environments additional processes were involved in the combined crash (these usually had something to do with Windows/Defender updates or the Splunk event forwarder).
The service process is not always involved, but typically is. There is no clear trigger that I can identify. In some cases I could see that the crash was a direct result of the service opening a SQLite database, but not always. Sometimes the crash occurs when a user interacts with the website after a period of inactivity. Where there are hints of Windows/Defender updates, they seem to be an indirect trigger. Sometimes there is no real indication of anything that directly or indirectly led to the crash of multiple processes.
Based on the error codes, my guess is that there is a memory leak somewhere that is exhausting a kernel resource, but I am baffled as to why it would involve other processes. Most of these servers have 64 GB of RAM and use only about 15 GB. They are generally near idle, and nothing is hogging the CPU.
[Screenshot: Event Viewer showing multiple APPCRASH events in a very short time]
Our service crash would typically look like this:
Fault bucket , type 0
Event Name: APPCRASH
Response: Not available
Cab Id: 0
Problem signature:
P1: xxxxxxxxxxxxxxxxxx.exe
P2: 2.126.2266.1245
P3: 6385e363
P4: SQLite.Interop.dll
P5: 1.0.106.0
P6: 59cba20e
P7: c0000005 /* memory access violation */
P8: 000000000011a659
P9:
P10:
The website crash looks like this:
Fault bucket , type 0
Event Name: APPCRASH
Response: Not available
Cab Id: 0
Problem signature:
P1: w3wp.exe
P2: 10.0.14393.0
P3: 57899135
P4: ChakraCore.DLL
P5: 1.11.5.0
P6: 5c374f16
P7: 8007000e /* 0xE - ERROR_OUTOFMEMORY */
P8: 0001663d
P9:
P10:
Here are some more from other processes:
Event Name: APPCRASH
Problem signature:
P1: splunk-winevtlog.exe
P2: 2304.769.25500.50706
P3: 639cc7e5
P4: ucrtbase.dll
P5: 10.0.14393.3659
P6: 5e9140a1
P7: 000000000006de4e
P8: c0000409
P9: 0000000000000007
Event Name: APPCRASH
P1: splunk-winevtlog.exe
P2: 2304.769.25500.50706
P3: 639cc7e5
P4: KERNELBASE.dll
P5: 10.0.14393.5427
P6: 633689d4
P7: eeab5254
P8: 0000000000026ea8
Event Name: WindowsUpdateFailure3
Problem signature:
P1: 10.0.14393.5127
P2: 8007000e
P3: 00000000-0000-0000-0000-000000000000
P4: Scan
P5: 0
P6: 1
P7: 0
P8: Windows Defender
P9: {3DA21691-E39D-4DA6-8A4B-B43877BCB1B7}
P10: 0
Event Name: WindowsUpdateFailure3
Problem signature:
P1: 10.0.14393.5127
P2: 80070008
P3: 00000000-0000-0000-0000-000000000000
P4: Scan
P5: 0
P6: 1
P7: 8024500b
P8: Windows Defender
P9: {9482F4B4-E343-43B6-B170-9A65BC822C77}
P10: 0
Event Name: WindowsUpdateFailure3
Problem signature:
P1: 10.0.14393.5127
P2: 80070643
P3: CAC11B5A-55A0-4E6E-A5AC-F2DD8411BF7C
P4: Install
P5: 200
P6: 0
P7: 65a
P8: CcmExec
P9: {3DA21691-E39D-4DA6-8A4B-B43877BCB1B7}
P10: 0
Event Name: WindowsWcpOtherFailure3
Problem signature:
P1: 10.0.14393.5351:3
P2: wcp\sil\merged\ntu\ntsystem.cpp
P3: Windows::Rtl::SystemImplementation::DirectFileSystemProvider::SysSetInformationFile
P4: 3903
P5: c0190037
P6: 0xfe1c914f
Event Name: CLR20r3
Problem signature:
P1: SCNotification.exe
P2: 5.0.9068.1000
P3: a6f95f90
P4: System
P5: 4.8.4545.0
P6: 62bd3c75
P7: 204d
P8: 8e
P9: System.Net.Sockets.Socket
Event Name: PowerShell
Problem signature:
P1: powershell.exe
P2: 10.0.14393.5127
P3: System.IO.FileLoadException
P4: System.IO.FileLoadException
P5: .Automation.Internal.TelemetryWrapper.TraceMessage
P6: .Automation.Internal.TelemetryWrapper.TraceMessage
P7: Consol.. main thread
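To make the reasoning about the error codes concrete, here is a minimal sketch that decodes a few of the codes from the signatures above. It assumes the standard HRESULT layout (failure bit, 11-bit facility, 16-bit code) and uses a small hand-written lookup table of values from winerror.h and ntstatus.h. The function name `describe` and the tables are my own illustration, not part of any tool.

```python
# Win32 error codes that appear in the signatures above (winerror.h).
WIN32_ERRORS = {
    0x0008: "ERROR_NOT_ENOUGH_MEMORY",
    0x000E: "ERROR_OUTOFMEMORY",
    0x0643: "ERROR_INSTALL_FAILURE",
}

# NTSTATUS exception codes that appear in the signatures above (ntstatus.h).
NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}

FACILITY_WIN32 = 7  # HRESULTs of the form 0x8007xxxx wrap a Win32 error


def describe(code_hex: str) -> str:
    """Name a code from the event log, or report its raw HRESULT fields."""
    value = int(code_hex, 16)
    if value in NTSTATUS:
        return NTSTATUS[value]
    failed = (value >> 31) == 1          # top bit set means failure
    facility = (value >> 16) & 0x7FF     # bits 16-26
    code = value & 0xFFFF                # low 16 bits
    if failed and facility == FACILITY_WIN32:
        return WIN32_ERRORS.get(code, f"Win32 error {code}")
    return f"unrecognized (facility {facility}, code 0x{code:04X})"


for c in ("c0000005", "8007000e", "80070008", "80070643", "c0000409"):
    print(c, "->", describe(c))
```

Two of the three 0x8007xxxx codes map to out-of-memory style Win32 errors, which is what makes me suspect resource exhaustion rather than a bug in any single process.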
I have tried to configure a postmortem debugger to get crash dumps for these, since Windows Error Reporting did not produce dumps for most of them. But somehow I'm not getting it right: the crashes happen, but I'm not getting the dumps I am expecting. Here is the script I used to set up WinDbg Preview:
# Postmortem debugger for 64-bit processes
Set-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AeDebug' -Name 'Auto' -Value 1
Set-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AeDebug' -Name 'Debugger' -Value `
'"E:\CrashDumps\windbgx64\DbgX.Shell.exe" -p %ld -e %ld -c ".dump -ma -j %p -u E:\CrashDumps\dump64.dmp; qd" /accepteula'
# Postmortem debugger for 32-bit processes (WOW6432Node view)
Set-ItemProperty 'HKLM:\SOFTWARE\WOW6432Node\Microsoft\Windows NT\CurrentVersion\AeDebug' -Name 'Auto' -Value 1
Set-ItemProperty 'HKLM:\SOFTWARE\WOW6432Node\Microsoft\Windows NT\CurrentVersion\AeDebug' -Name 'Debugger' -Value `
'"E:\CrashDumps\windbgx86\DbgX.Shell.exe" -p %ld -e %ld -c ".dump -ma -j %p -u E:\CrashDumps\dump32.dmp; qd" /accepteula'
I also tried to configure WER to write full dumps for all processes, but that doesn't work either. I'm stumped. How can I proceed to hunt this issue down? What tools should I be using? What should I be looking at or for?
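For reference, WER full dumps are normally controlled through the LocalDumps registry key. Below is a minimal sketch of those settings, assuming the documented value names (`DumpFolder`, `DumpCount`, `DumpType`). The folder `E:\CrashDumps\wer` and the count of 10 are placeholders for illustration, not necessarily what is set on these servers, and the helper names are hypothetical.

```python
import sys

# Documented WER key holding the machine-wide defaults for user-mode dumps.
LOCAL_DUMPS_KEY = r"SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps"


def local_dumps_values(folder: str, count: int = 10, full: bool = True) -> dict:
    """Return the LocalDumps values as {name: (registry type name, data)}."""
    return {
        "DumpFolder": ("REG_EXPAND_SZ", folder),
        "DumpCount": ("REG_DWORD", count),
        "DumpType": ("REG_DWORD", 2 if full else 1),  # 2 = full dump, 1 = mini dump
    }


def apply_local_dumps(values: dict) -> None:
    """Write the values under HKLM. Windows only; needs an elevated prompt."""
    import winreg  # imported here so the module still loads on non-Windows hosts

    access = winreg.KEY_SET_VALUE | winreg.KEY_WOW64_64KEY
    with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, LOCAL_DUMPS_KEY, 0, access) as key:
        for name, (type_name, data) in values.items():
            winreg.SetValueEx(key, name, 0, getattr(winreg, type_name), data)


if __name__ == "__main__" and sys.platform == "win32":
    # Placeholder folder; it must already exist and be writable.
    apply_local_dumps(local_dumps_values(r"E:\CrashDumps\wer"))
```

Per-application overrides go in a subkey named after the executable (for example `LocalDumps\w3wp.exe`), which may matter if only some processes should produce full dumps.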