
Something is closing connections in my CentOS VMs - how best to troubleshoot?


I have a setup with 3 VMs (1 application server on CentOS 6 and 2 database servers on CentOS 7). For the last 1-2 weeks we have had issues with timeouts when connecting to the database servers (and between the two database servers, which form a cluster).

The database provider (Couchbase) can see from logs that the connections are forcibly closed:

WARN com.couchbase.endpoint - [com.couchbase.endpoint][UnexpectedEndpointDisconnectedEvent] The remote side disconnected the endpoint unexpectedly

The logs also show that packets are dropped, for example:

[warn] Interface 'ens32' (removedip) failures: RX:2863 / TX:0 - Details:
- RX packets: 308,593,167  errors: 0  dropped: 2,863  overruns: 0  frame: 0
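
A quick way to see whether those drop counters are still climbing is to watch the interface statistics over time; a minimal sketch, assuming the interface is `ens32` as in the log above:

```
# Print the interface's RX/TX counters (including drops) every 5 seconds
watch -n 5 'ip -s link show ens32'
```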

The VMs are hosted on the same host, a VMware ESXi (version 6.5), so they should have a good network path to each other.

And what has changed over the last couple of weeks? Security updates on the VM OSes, and the database server version (from 6.6.0 to 7.0.0). The database upgrade shouldn't change anything in the network, but it is obviously the reason why I first contacted the database provider...

Any ideas to find the culprit much appreciated :-)

Edit:

Following Cameron's suggestion I just ran a short network trace and loaded it into Wireshark on my local machine. Then I opened the "Expert information" and got this: (screenshot: Wireshark - Expert information). I should add that there is an Nginx proxy server in front of the application server; it handles SSL and "lifts it off" before requests hit the app server. Just looking at the info, I would expect the two "red" blocks to be related to requests coming from the outside - and not from the app server to the database servers.

But I'm not really sure what to look for in the results - and I guess I need to let it run a little longer, perhaps filtering out the traffic from the outside?
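
One way to keep the outside traffic out of the next capture is to restrict tcpdump to the database hosts. A minimal sketch; the interface name and the `DB1_IP`/`DB2_IP` placeholders are assumptions to adapt:

```
# Capture only traffic between this machine and the two database servers,
# ignoring the public-facing traffic that arrives via the Nginx proxy.
# Replace DB1_IP and DB2_IP with the database servers' actual addresses.
tcpdump -i ens32 -s0 -w /tmp/db-only.pcap 'host DB1_IP or host DB2_IP'
```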

Edit 2

While I was sitting and looking at it, the issue actually arose - so I quickly started the tcpdump again. The results may not contain the root cause, but they should be more relevant than the first: (screenshot: Wireshark - Expert info (2)). The blocks I have expanded seem to be related to communication with one of the database servers... :-)

But what do these results mean and how do I get closer to finding the cause?

Cameron Kerr:
TCP window being full would point to something hanging for some reason and not reading its available input. The 'Couchbase' packets are of particular concern; if you click on them, a packet is selected in the main Wireshark window. Right-click and use the Follow TCP Stream function to see what was actually being said. I suspect you are dealing with a client version incompatibility, or that the server is now more sensitive to some types of request, such as illegal characters in a header name.
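
For reference, the same views are available from the command line with tshark; a minimal sketch, assuming the capture was saved as `/tmp/capture.pcap` (the stream index `0` is just an example):

```
# Summarise the capture's expert info (errors, warnings, notes)
tshark -r /tmp/capture.pcap -q -z expert

# Reassemble one TCP conversation in ASCII, roughly equivalent to
# Wireshark's "Follow TCP Stream" (here: stream index 0)
tshark -r /tmp/capture.pcap -q -z follow,tcp,ascii,0
```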
John Dalsgaard:
Thanks @CameronKerr. I cannot tell from that whether something is not as it should be... but I have sent the last capture to an engineer at Couchbase. We'll see what it means to him :-)
Answer:

Welcome to Server Fault.

Given the age (CentOS 6 is out of support now), it is very likely that you are suffering from SSL/TLS incompatibilities, assuming of course that you are connecting over TLS. We've certainly experienced plenty of such events over our time with RHEL 6 as SSLv2 etc. got progressively disabled by default. Similarly with various point versions of Java (some point releases in the 1.7 series were particularly fractious).

Another possible reason, seeing as you are running a CentOS workload on ESXi, is that you could be running short of entropy, which causes blocking behaviour that can lead to timeouts and cluster issues, and in turn to connection aborts. Up to somewhere within Java 8, Java was particularly susceptible to this. You can judge whether this is a problem for you by looking at /proc/sys/kernel/random/entropy_avail over time; if it gets to under 128 or so and doesn't bounce back, then you have entropy starvation. This is common on a VM where there is no keyboard/mouse activity; you might try running an entropy-gathering daemon if this is the case.
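
A minimal sketch of such a check, sampling the counter once a second (the ~128 threshold follows the rule of thumb above):

```
# Watch the kernel's available-entropy estimate over time.
# If it sits below ~128 and never bounces back, suspect entropy starvation.
while true; do
    cat /proc/sys/kernel/random/entropy_avail
    sleep 1
done
```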

BTW, I wouldn't conclude from those logs that something [else] is actively forcing those connections closed; it's just that the connection closed at a time when one party wasn't expecting it to. This could be due to things like a timeout, an exception, a process crash, etc.

You say that the database server was upgraded... was that an OS upgrade from CentOS 6? Was the application also upgraded, or was it lifted and shifted?

Cheers, Cameron

John Dalsgaard:
Thanks for responding, Cameron! I'm running without TLS on "the inside", where the servers are not publicly reachable, so that is hardly the problem. entropy_avail on the CentOS 6 server has been between 129 and 177 over the last 5-10 mins. On the CentOS 7 box it is approx. 3500. Only the DB software was updated (yum). When I last checked there didn't seem to be an "upgrade" path for CentOS 6 -> 7, which to some extent explains why the app server is still on 6 ;-)
John Dalsgaard:
By _"actively forcing connections to close"_ I meant exactly something "around" the application. The database server (and the SDK) did not expect it to close. I agree that this most likely could be due to either timeouts or running out of ressources somewhere.... Just not sure where to find it!
Cameron Kerr:
I would be concerned that your available 'entropy' (it's a misnomer, really) is rather depressed. I've generally run my servers in a similar setting with the following: `echo 1024 > /proc/sys/kernel/random/read_wakeup_threshold` (the default is 64). I've done this for years on RHEL 5, 6, 7 and 8 in production environments, including vendor appliances that run on VMware (it's nice because it's very low-touch).
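
To make that survive a reboot, the same value can go into a sysctl file; a minimal sketch, assuming a sysctl.d-style setup as on RHEL/CentOS 6 and 7 (the file name is an arbitrary example):

```
# Apply immediately (as above)
echo 1024 > /proc/sys/kernel/random/read_wakeup_threshold

# Persist across reboots
echo 'kernel.random.read_wakeup_threshold = 1024' >> /etc/sysctl.d/99-entropy.conf
sysctl -p /etc/sysctl.d/99-entropy.conf
```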
Cameron Kerr:
What was the version jump on the database side? Does the database client library need to match? (I'm not familiar with Couchbase.) Since you're running clear-text, look at a traffic capture (e.g. `tcpdump -i eth0 -s0 -w /tmp/capture.pcap`), then copy the completed capture to a machine with Wireshark; you might find useful clues using the 'Expert Info' and 'Follow TCP Stream' functions.
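
If a full capture turns out too noisy, it can be narrowed to the database traffic; a minimal sketch, assuming the cluster uses Couchbase's default ports (8091 for the REST/admin interface, 11210 for key-value data):

```
# Capture only Couchbase traffic on its default ports;
# adjust the interface name and ports if your setup differs.
tcpdump -i eth0 -s0 -w /tmp/couchbase.pcap 'port 8091 or port 11210'
```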
John Dalsgaard:
The database was updated from version 6.6.0 to 7.0.0. The SDK that was installed should happily talk to both. I have also upgraded the SDK to the latest version to rule that issue out. Logging from the SDK shows that the problems most likely are in the underlying network layer. I searched for more info about `read_wakeup_threshold` and `entropy` and found this: https://redhatlinux.guru/2016/04/03/increase-system-entropy-on-rhel-centos-6-and-7/ - it suggests that `entropy_avail` should be in the range 3,000-3,500. So perhaps I should try to increase it on the app server?
John Dalsgaard:
Hmmm... not knowing anything about `entropy` I read a couple of articles... It seems to be related to cryptography (SSL) and randomness... Can this still be the problem in my situation, where I don't use SSL? But I see that `entropy_avail` is much lower than 3000 on my app server (128-190ish).
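
One way to check whether anything on the app server actually reads from the blocking pool; a minimal sketch (note that `lsof` only shows processes holding the device open at that instant):

```
# List processes that currently have /dev/random open.
# Reads from /dev/random block when entropy runs low; /dev/urandom never blocks.
lsof /dev/random
```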
John Dalsgaard:
Especially this article: https://www.2uo.de/myths-about-urandom/ - though I don't pretend to understand all of it, it also points in the direction of SSL...
John Dalsgaard:
I increased the `read_wakeup_threshold` the way you described, just before it reported a timeout again (when I took the second tcpdump). So now `entropy_avail` reports values in the range of 2,500-2,600.
John Dalsgaard:
I just learned from the hosting people that they host these servers on 3 ESXi hosts. So the VMs could have been running on different hosts at some point - I have asked them to see if they can verify that. After the change to the `read_wakeup_threshold` (or by coincidence) the servers seem to have been running better. I have not yet had a response from the Couchbase engineer.
John Dalsgaard:
If they keep running smoothly, then I'll consider taking a new trace on the app server and the two db servers - just to see whether the previous problems are still there.