I have a fleet of ~70 servers sending logs to Papertrail using Rsyslog.
On September 20th, Papertrail encountered an issue and most of our servers logged these messages:
Sep 20 11:42:30 server-name rsyslogd[7400]: unexpected GnuTLS error -53 - this could be caused by a broken connection. GnuTLS reports: Error in the push function. [v8.32.0 try http://www.rsyslog.com/e/2078 ]
Sep 20 11:42:30 server-name rsyslogd[7400]: omfwd: TCPSendBuf error -2078, destruct TCP Connection to logs.papertrailapp.com:xxxxx [v8.32.0 try http://www.rsyslog.com/e/2078 ]
Sep 20 11:42:30 server-name rsyslogd[7400]: action 'action 7' suspended (module 'builtin:omfwd'), retry 0. There should be messages before this one giving the reason for suspension. [v8.32.0 try http://www.rsyslog.com/e/2007 ]
Sep 20 11:42:43 server-name rsyslogd[7400]: action 'action 7' resumed (module 'builtin:omfwd') [v8.32.0 try http://www.rsyslog.com/e/2359 ]
However, 3 of the servers never logged the last line, the one saying action 'action 7' resumed (module 'builtin:omfwd').
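For anyone who wants to check their own hosts for the same symptom, here is a minimal sketch of how the stuck state could be detected from the local log. It only relies on the "suspended" and "resumed" message formats shown above; the function name stuck_actions is hypothetical, and the log location (for example /var/log/syslog) is an assumption that depends on the distribution.

```python
import re

# Matches the state-change lines rsyslog writes for an action, e.g.
#   action 'action 7' suspended (module 'builtin:omfwd'), retry 0. ...
#   action 'action 7' resumed (module 'builtin:omfwd') ...
ACTION_RE = re.compile(
    r"action '(?P<name>[^']+)' (?P<state>suspended|resumed) \(module"
)


def stuck_actions(lines):
    """Return the names of actions whose most recent state is 'suspended'.

    `lines` is any iterable of log lines (a list, or an open file object).
    Hypothetical helper, not part of rsyslog.
    """
    last_state = {}
    for line in lines:
        match = ACTION_RE.search(line)
        if match:
            # Later lines overwrite earlier ones, so only the final state counts.
            last_state[match.group("name")] = match.group("state")
    return sorted(name for name, state in last_state.items() if state == "suspended")
```

Passing an open log file to stuck_actions on each host would list the actions that were suspended and never reported as resumed, which is the state the 3 affected servers ended up in.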
Since then, these servers have been sending delayed logs to Papertrail in batches, as we can see on the velocity graph.
Two of them are sending batches of ~750 lines and the last one is sending batches of ~1500 lines.
All of our servers have an identical configuration, deployed with Ansible. Most of the rsyslog configuration is the default one, except for this part:
$ActionResumeInterval 10
$ActionQueueSize 100000
$ActionQueueDiscardMark 97500
$ActionQueueHighWaterMark 80000
$ActionQueueType LinkedList
$ActionQueueFileName papertrailqueue
$ActionQueueCheckpointInterval 100
$ActionQueueMaxDiskSpace 2g
$ActionResumeRetryCount -1
$ActionQueueSaveOnShutdown on
$ActionQueueTimeoutEnqueue 2
$ActionQueueDiscardSeverity 0
Restarting the rsyslog service fixes the issue, but I would like to prevent this from happening again. Has anyone experienced this?
Thanks!