TL;DR: Redeploy by restarting the server (check the final update).
This is an old problem, but I haven't found a solution just yet and at this point I don't know where else to look.
We have three Java (Spring, non-Boot) webapps deployed on two servers: one app runs on a GCP Compute Engine instance on Tomcat 9 (port 80), and the other two run on our in-house server on a single Tomcat 8.5 instance (port 8080, redirected from port 80). Both systems use MySQL 8 and pretty much the same configuration to connect to it: a DataSource for the main DB and a ConnectionProvider for the tenants.
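For context, the tenant side goes through Hibernate's multi-tenancy settings, roughly like this (the provider and resolver passed in stand in for our own implementations; the property keys are Hibernate 5's standard ones):

```java
import java.util.HashMap;
import java.util.Map;

import org.hibernate.MultiTenancyStrategy;
import org.hibernate.cfg.AvailableSettings;
import org.hibernate.context.spi.CurrentTenantIdentifierResolver;
import org.hibernate.engine.jdbc.connections.spi.MultiTenantConnectionProvider;

// Rough sketch of the JPA properties handed to the EntityManagerFactory.
// The two arguments stand in for our own provider/resolver implementations.
public class TenantJpaProperties {

    public static Map<String, Object> build(MultiTenantConnectionProvider connectionProvider,
                                             CurrentTenantIdentifierResolver tenantResolver) {
        Map<String, Object> props = new HashMap<>();
        props.put(AvailableSettings.MULTI_TENANT, MultiTenancyStrategy.DATABASE);
        props.put(AvailableSettings.MULTI_TENANT_CONNECTION_PROVIDER, connectionProvider);
        props.put(AvailableSettings.MULTI_TENANT_IDENTIFIER_RESOLVER, tenantResolver);
        return props;
    }
}
```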
The issue is that, on redeploy, some of the older connections from the pool (HikariCP) are not killed, while others are. The connections that stay behind belong to our multi-tenant connection provider; the "main" tenant, so to speak, kills its older connections properly. This of course eventually leaves us with so many connections that we run out of them and start throwing SQLExceptions. I've patched that by increasing the number of connections, but that is not a solution.
We are redeploying by just replacing the WAR file, without using the GUI. I am positive this is what causes the issue, but it doesn't really explain why some connections are properly closed and others aren't.
What I've tried:
- I've seen related answers about this (mostly dealing with PHP) where MySQL connections stay in Sleep status well after their work is done. I also tried the fixes suggested in those questions, since they seemed reasonable for my scenario too, such as decreasing `wait_timeout` and `interactive_timeout` to 30 minutes.
- Our HikariCP config idles connections after 10 minutes and gives them a `maxLifetime` of 15 minutes. Even after hours, the connections don't close, and they actually get refreshed after those 30 minutes. By this I mean that the time shown by the query `SELECT * FROM information_schema.processlist GROUP BY db;` climbs to 1799 (or less) and then drops back to 0. Why? I know the system is not being used by users at the time, and the logs show that HikariCP is only aware of 4 connections (the ones I configured) instead of the up to 20 that are sometimes 'active'.
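For reference, each tenant pool is built with settings along these lines (the URL and credentials are placeholders; the pool size and timeouts are the ones I described above):

```java
import java.util.concurrent.TimeUnit;

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

// Sketch of a tenant pool with the limits mentioned above.
// The URL and credentials are placeholders, not our real ones.
public class TenantPoolFactory {

    public static HikariDataSource create(String jdbcUrl, String user, String password) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(jdbcUrl);                            // e.g. jdbc:mysql://host:3306/tenant_db
        config.setUsername(user);
        config.setPassword(password);
        config.setMaximumPoolSize(4);                          // the 4 connections Hikari reports
        config.setIdleTimeout(TimeUnit.MINUTES.toMillis(10));  // idle connections retired after 10 min
        config.setMaxLifetime(TimeUnit.MINUTES.toMillis(15));  // hard cap on a connection's lifetime
        return new HikariDataSource(config);
    }
}
```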
We're using Spring Data JPA, so all the connection management is handled by Hibernate/JPA. The connections are also being properly refreshed by Hikari, so I don't think it's a case of connections being left open in the code.
So far, I'm positive it is not a problem with Hikari (and by this I mean our configuration of it). This leads me to believe there is something odd with the database configuration, or we are simply not redeploying properly.
I believe this problem would go away if I restructured the server layout (pardon my lack of vocabulary) by giving each webapp its own Tomcat instance and using Apache or Nginx to proxy to them. I've set up this configuration in my test environment and I've been wanting to do it for a while now, but it's hard to justify such a change in my position (pretty much a not-quite-junior-yet backend dev who somehow ended up in charge of this). Even so, it's a big change, it would take me a couple of days while I'm working on other things, and I'd really rather fix the current configuration (properly) than rebuild the server.
Another option is to schedule server + DB restarts. Our system is regional and our still-few users work regular hours, so they would never notice a daily restart at, say, 3 AM. I just don't like this approach and think it is as inefficient as blindly increasing `max_connections` every so often.
There is also the option of rebuilding how we handle our multiple tenants. We are using a ConnectionProvider, and these connections are the "faulty" ones. I've seen several examples of other approaches using DataSource, and I know DataSource doesn't have this problem because the "main" database connections go down as expected on redeploy. Even so, I still believe this is a configuration issue.
Due to my inexperience and the number of things I need to look into, my guess is that I've overlooked something in the docs or I just don't truly understand the configurations I've touched. And as lost as I am, I came seeking others' experience on the matter. Is there something else I should be looking into? I've also enabled `slow_query_log`, but that file is still empty after days.
Has anyone had this type of issue before? If you need any more information about our structure or deployment, please feel free to ask. As you might have guessed, we are a small company still learning the ropes of this.
UPDATE:
I rolled out a few extra methods in our backend that should help with the extra connections. Some methods were not overridden, and since we're extending another class, the super method might not have been doing the work. These methods specifically target the data structure the connections are accessed from.
Also, after one redeployment, I saw the connections go from 4 to 8 (expected: the 4 from the first deployment plus the extra 4 from the redeployment), but after a few hours the number of connections went down to 6. I hoped that was the end of it, but the next day we had those 8 connections again.
Even worse, today I had the chance to restart some services and experimented with restarting only the database service. At first it seemed to bring the connections down to the expected 4 per tenant, but after a while they ramped back up to the same value as before the restart. This tells me the connections are being held hostage(?) by Tomcat, meaning that perhaps there is something in the documentation addressing this behaviour. I haven't found the right keywords yet, but my bet is that it's around the context, the realm or one of the valves.
If I can't find anything, I'll roll out a custom ConnectionProvider I extended from an `EntityManagerFactoryBean`. In it, I set up a `stop()` method triggering a `@PreDestroy` method that accesses the data structure holding the tenants' connections and manually shuts them down with Hikari's own methods. In theory, this is the most I can do from the code to close these connections. If that doesn't work and I can't find anything in Tomcat's docs either, I'll need to speak up and choose between scheduled restarts or rebuilding the server plus "proper redeployments" (stop, update, start).
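A rough sketch of what I have in mind, assuming the provider is registered as a Spring bean so the `@PreDestroy` actually fires on undeploy (all class and field names here are made up):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import javax.annotation.PreDestroy;
import javax.sql.DataSource;

import com.zaxxer.hikari.HikariDataSource;
import org.hibernate.engine.jdbc.connections.spi.AbstractDataSourceBasedMultiTenantConnectionProviderImpl;

// Hypothetical connection provider that keeps one HikariDataSource per tenant
// and closes them all when the Spring context is destroyed, i.e. on undeploy.
public class ClosableTenantConnectionProvider
        extends AbstractDataSourceBasedMultiTenantConnectionProviderImpl {

    private final Map<String, HikariDataSource> tenantPools = new ConcurrentHashMap<>();
    private final HikariDataSource defaultPool;

    public ClosableTenantConnectionProvider(HikariDataSource defaultPool) {
        this.defaultPool = defaultPool;
    }

    public void registerTenant(String tenantId, HikariDataSource pool) {
        tenantPools.put(tenantId, pool);
    }

    @Override
    protected DataSource selectAnyDataSource() {
        return defaultPool;
    }

    @Override
    protected DataSource selectDataSource(String tenantIdentifier) {
        return tenantPools.getOrDefault(tenantIdentifier, defaultPool);
    }

    // Explicitly shut down every pool with Hikari's own close() method.
    @PreDestroy
    public void stop() {
        tenantPools.values().forEach(HikariDataSource::close);
        defaultPool.close();
    }
}
```

In theory, calling `close()` on each `HikariDataSource` shuts that pool down and closes its connections.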
UPDATE 2:
I spent yesterday trying to manually close the connections using the method described in the last update, and with another approach relying on a `ServletContextListener`. Neither worked, and I found out that the `close()` method in HikariCP's connection provider wasn't referring to the connections, so yikes. I also decided to try dynamically generating the ConnectionProviders in a bean, with a proper close/destroy method, but since the method I thought of using wasn't meant for that, I'll partially drop that idea.
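For reference, the listener variant was roughly this (the `tenantPools` attribute name is made up; the point is only that `contextDestroyed()` runs when Tomcat stops or undeploys the webapp):

```java
import java.util.Collection;

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.annotation.WebListener;

import com.zaxxer.hikari.HikariDataSource;

// Hypothetical listener that closes whatever HikariDataSources were stashed in
// the ServletContext when the webapp is stopped or undeployed.
@WebListener
public class PoolShutdownListener implements ServletContextListener {

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        // nothing to do at startup
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        Object pools = sce.getServletContext().getAttribute("tenantPools"); // made-up attribute name
        if (pools instanceof Collection) {
            for (Object pool : (Collection<?>) pools) {
                if (pool instanceof HikariDataSource && !((HikariDataSource) pool).isClosed()) {
                    ((HikariDataSource) pool).close();
                }
            }
        }
    }
}
```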
Next up: change from `ConnectionProvider`s to `DataSource`s. If this works, then we can keep redeploying as we always do. In case the connections present the same issue on redeployment, I'll try the three methods I came up with: setting up a `@PreDestroy` method that manually iterates the DataSource map and closes all relevant connections (see the sketch below); dynamically generating and registering all `DataSource`s as beans (probably "grouping" them with an interface or something so the `MultiTenantResolver` can work with them); or taking the first approach but closing the connections in a `ServletContextListener`.
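The first option would look something like this (class and method names are mine, and `buildPool()` stands in for however each tenant's JDBC settings get looked up):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import javax.annotation.PreDestroy;

import com.zaxxer.hikari.HikariDataSource;
import org.springframework.stereotype.Component;

// Hypothetical holder: one HikariDataSource per tenant, created lazily and
// closed in bulk when the Spring context shuts down.
@Component
public class TenantDataSources {

    private final Map<String, HikariDataSource> pools = new ConcurrentHashMap<>();

    public HikariDataSource forTenant(String tenantId) {
        // a tenant's pool is created the first time that tenant is requested
        return pools.computeIfAbsent(tenantId, this::buildPool);
    }

    private HikariDataSource buildPool(String tenantId) {
        HikariDataSource ds = new HikariDataSource();
        ds.setJdbcUrl("jdbc:mysql://localhost:3306/" + tenantId); // placeholder URL scheme
        ds.setUsername("app_user");                               // placeholder credentials
        ds.setPassword("secret");
        ds.setMaximumPoolSize(4);
        return ds;
    }

    @PreDestroy
    public void closeAll() {
        pools.values().forEach(HikariDataSource::close);
    }
}
```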
Another thing I found is that the connections are being kept at a level above the webapp's context. This is key information, but I honestly don't understand enough to say why one app's set of connections isn't closing while the other's is, nor why Tomcat isn't letting those threads/connections die after they time out. The source of this information is this question from Stack Overflow.
I managed to silently "cut myself a piece of the server" and set up a personal testing environment within the testing environment. Since I'm technically in charge of that, and this is an effort to fix things currently going on in production, I think I'm justified.
I might try asking on SO and in HikariCP's Google Group, although framed differently so my question stays relevant to each community.
UPDATE 3:
Changing from ConnectionProvider to DataSource solved half the problems and brought new, more confusing errors:
- While most of the pools were properly initialized at 4 connections on redeploy, two of those pools kept the old behaviour (4 from the original deploy + 4 from the new deploy), and one somehow ended up with 12 on redeploy: the original 4, the 4 from the redeployment and a random extra 4.
- While testing the system for any odd behaviour, I noticed that every time I changed tenants, a new pool was created. Later I found out that, in fact, only two pools were being created at startup, and every other pool was created only when requested. That was fine, really, but I still had one tenant with some random connections at startup that grew when that particular database was used.
I then tried all my options to close the connections manually during shutdown, but I can't really say that any of it worked.
It seems that I just need to change how the server works. I'm a bit surprised I can't seem to find an answer no matter where I look, and I'm getting frustrated that, after all the time I've invested in this, it will likely come down to a batch file taking care of the redeployment by shutting down, replacing the WAR and starting up again.
The Hikari documentation states that for hot deployments (and, by extension, hot redeployments) one needs to close the connections, but it talks about DataSources, not ConnectionProviders. At this point I'm even considering ditching Hikari for another solution, but I also feel this is unnecessary and a product of my frustration.
Anyway, I will keep trying things, I guess. There's little left for me to try.
UPDATE 4:
Well, I finally gave up. I talked with the people I needed to talk to and actually got a deadline to finish other things, including a small overhaul of our servers, which was part of the reason I started looking into this in the first place. Given that deadline, and given that I just didn't find a solution, I will rebuild the server structure: I'll use a proxy server to give each app its own Tomcat instance on different, secured ports, so the clients don't need to change anything. Internally, I will provide the project leaders with deployment scripts that update their deployment branch, generate the updated WAR, stop their particular Tomcat service, clean previous builds, add the new build and start the Tomcat service again. This way I don't need to worry about the connections, each project finally gets the independence it needs, and deployments are automated to avoid as many errors as possible.
Not gonna lie, it kinda sucks that it ends this way, but we don't always win, right?