I have a RHEL server that intermittently becomes inaccessible with AD accounts but still lets a local account log in. We use a PAM tool as our AD broker; it lets us log in with AD credentials and enforces MFA. The issue appears at random every now and then, and when it does, any login attempt with an AD account fails with the error "Remote side unexpectedly closed network connection". We can still log in with a local account that has su privileges and restart the PAM tool, after which the server becomes accessible again.
- we have already asked our vendor to analyze and debug the logs, but nothing in them indicates that the PAM tool is having trouble or crashing
- we compared the sshd config with a working server and everything is the same
- we compared the nsswitch config with a working server and got the same result
- we also checked nscd caching and compared it with a working server, no difference
- VM utilization is also not the cause, the server has low CPU/memory usage while the issue is happening
- the PAM tool agent has been upgraded as well, which did not help
- compared the hosts file with a working server, no difference
- compared the resolv.conf file with a working server, no difference
- compared the PAM tool config file with a working server, no difference
- compared the OS/kernel version with a working server, no difference
- compared the SELinux config and logs with a working server, no difference
- no issue with firewall/ports
- analyzed a tcpdump capture, it does not show any connection drops
- nothing in /var/log/secure points to where the errors are coming from
- we were previously seeing "MsgType: 507" socket errors; those have since disappeared, but the problem still persists
- the app team runs another agent on this server, and whenever that agent raises an alarm that it is unavailable, our PAM agent tends to stop accepting AD credentials and just throws the error. This is not always the case, though, because sometimes their alert triggers and our PAM tool is just fine. Still, I noticed there is some connection between the two, which gives us the impression that our PAM tool is getting overloaded.
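
One way to check the suspected link between the two agents would be to line up the timestamps of the other agent's alarms against our failed AD logins. This is only a minimal sketch, assuming both sets of timestamps have already been exported as ISO 8601 strings; the sample values, the `correlate` helper, and the 5-minute window are made up for illustration and are not taken from our logs.

```python
from datetime import datetime, timedelta

def correlate(alarm_times, failure_times, window=timedelta(minutes=5)):
    """Pair each alarm with the failed logins that fall within +/- window of it."""
    alarms = sorted(datetime.fromisoformat(t) for t in alarm_times)
    failures = sorted(datetime.fromisoformat(t) for t in failure_times)
    pairs = []
    for alarm in alarms:
        # abs() makes the window symmetric, since we do not know which event comes first
        hits = [f for f in failures if abs(f - alarm) <= window]
        pairs.append((alarm, hits))
    return pairs

# Hypothetical sample data (placeholders, not real log entries)
alarm_times = ["2024-01-10T09:00:00", "2024-01-10T14:30:00"]
failure_times = [
    "2024-01-10T09:02:10",
    "2024-01-10T09:03:45",
    "2024-01-10T18:00:00",
]

pairs = correlate(alarm_times, failure_times)
for alarm, hits in pairs:
    print(alarm.isoformat(), "->", len(hits), "failed AD login(s) nearby")
```

If most alarms come with nearby login failures, that would support the overload theory; alarms with no failures nearby would match the cases where their alert fires and our PAM tool is fine.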
To add: this is the only server in the whole company that has this issue. It is a VM hosted in Azure.
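
Since only this one server is affected, the file comparisons listed above could also be repeated mechanically rather than by eye, so that comments and whitespace do not hide or fake a difference. A rough sketch, assuming the contents of the two files have already been read into strings; the `normalize` and `diff_configs` helpers and the sample settings are illustrative placeholders, not our actual configuration.

```python
import difflib

def normalize(text):
    """Drop blank lines and '#' comments, and collapse runs of whitespace."""
    lines = []
    for raw in text.splitlines():
        line = " ".join(raw.split())
        if line and not line.startswith("#"):
            lines.append(line)
    return lines

def diff_configs(working_text, affected_text):
    """Return unified-diff lines between the two normalized configs."""
    return list(difflib.unified_diff(
        normalize(working_text),
        normalize(affected_text),
        fromfile="working",
        tofile="affected",
        lineterm="",
    ))

# Hypothetical sample contents (placeholders, not our real sshd config)
working = "# sample\nUsePAM yes\nClientAliveInterval 300\n"
affected = "UsePAM   yes\n\nClientAliveInterval 0\n"

changes = diff_configs(working, affected)
print("\n".join(changes) or "no effective difference")
```

An empty result means the two files are effectively identical once comments and spacing are ignored.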
Has anyone experienced this issue before and managed to get it resolved? Is there anything you can suggest to get this working? I appreciate any help on this, thank you!