Something happened on one of our servers running Ubuntu 20.04 that caused a co-worker, who was logged in via ssh at the time, to get a "user not in passwd" error when they tried to use the su command. The user they were trying to switch to existed, and they had the privileges to switch to that user.
At the same time, every attempt I made to connect to the server via ssh was immediately refused ("connection refused"). Because this was very odd, I checked the website that's hosted on this server, and it was returning 500 errors.
I decided to try to connect to the server using our server host's (DigitalOcean) web console, which also refused my connection. As a last attempt, I logged into one of our other servers and tried to ssh into the affected server, thinking that maybe my IP address had been blocked by accident, but again the connection was immediately refused.
At this point, having no access to the server, and my co-worker being unable to perform any privileged commands, I decided to attempt rebooting the server.
After the server rebooted, it would no longer respond at all. Our website was no longer returning anything, and the DigitalOcean metrics were gone as well. So I took a snapshot of the system, restored the server to a backed-up version, and updated all of the login credentials. I then tried to spin up a new server from the snapshot I took, but that failed.
I've been talking with DigitalOcean's support team, but they haven't been that helpful. Have any of you had a similar problem occur? Or do you have any idea what may have caused such issues on an Ubuntu system?
Our system was up to date (the most recent updates were installed two weeks ago), and I didn't notice any symptoms of a problem before this actually occurred. All of our server metrics were normal: our disk wasn't anywhere near capacity, bandwidth wasn't higher than usual, and memory and CPU use were well below what we have available on our system.
Update:
After continuing to talk with DigitalOcean Support, it seems that the snapshot I took can't be used to create a new server because the shadow file is missing. I'm guessing this would explain at least a few of the issues we saw, and would explain why my co-worker was unable to use su.
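
For anyone who wants to check for the same condition, below is a minimal sketch in Python of how one could look for missing account files. It is not something DigitalOcean provided. The function name missing_account_files and the idea of pointing it at a mounted copy of the disk are my own assumptions, and it assumes the standard Ubuntu locations /etc/passwd, /etc/shadow and /etc/group. It only checks that the files exist and are non-empty, not that their contents are valid.

```python
import os
import sys

# Standard locations of the local account databases on Ubuntu.
# su and sshd (through PAM) normally depend on these for local users.
ACCOUNT_FILES = ("etc/passwd", "etc/shadow", "etc/group")


def missing_account_files(root="/"):
    """Return the account files that are absent or empty under `root`.

    `root` can be "/" for the running system, or the mount point of a
    disk image or recovery volume (a hypothetical example: "/mnt/recovery").
    """
    missing = []
    for rel in ACCOUNT_FILES:
        path = os.path.join(root, rel)
        # A zero-byte file is treated as missing, since it is just as unusable.
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append("/" + rel)
    return missing


if __name__ == "__main__":
    # Optional first argument: the root directory to inspect.
    root = sys.argv[1] if len(sys.argv) > 1 else "/"
    problems = missing_account_files(root)
    if problems:
        print("Missing or empty under", root + ":", ", ".join(problems))
    else:
        print("All account files are present under", root)
```

Run with no arguments, it inspects the live system. Passing a directory as the first argument inspects a mounted copy of the disk instead, which is the only option once the server itself can no longer be reached.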