Score:1

Why does our Tomcat Java application suddenly open hundreds of connections to our database?


We have a Tomcat application running on Elastic Beanstalk, and our MySQL database is hosted on AWS RDS (2 or 3 t3.medium instances). Ever since we upgraded from MySQL 5 to MySQL 8 (currently 8.0.23), we've been having an issue that occurs about once a week. Most of the time the database is fine, but then, all of a sudden, the number of connections skyrockets, sometimes even surpassing the 307-connection limit within a one-minute window (which is also something we don't understand: how can it go beyond that limit?). That causes the Elastic Beanstalk instances to become degraded, and sometimes the entire database crashes after those connection peaks.

While monitoring the application's JVM with VisualVM, I've noticed that, during those connection peaks, Tomcat suddenly creates dozens of worker threads. My guess is that each of those threads establishes a new connection to the database. While we could limit the number of those threads (the servers wouldn't be able to handle that many threads in the first place, after all), we want to understand what's causing this. Why is Tomcat creating so many threads and connections to our database? Is that a cause or a consequence of issues in the database? Where should we look to find the root of the issue?

I've Googled a lot, trying to find people who've had similar issues, in order to shed light on the problem. We've also tried analyzing the most expensive queries and other database performance insights, but there seems to be no clear pattern.

Wilson Hauck
Does the spike clear on its own, and after how long? How do you get your system back online?
Helder Sérvio
@WilsonHauck, when the spike happens, the load balancer health checks start failing, which causes Elastic Beanstalk to bring the instances down and replace them, which, in turn, solves the issue.
Score:1

Where should we look to find the root of the issue?

  • Enable the slowlog in MySQL and (after the spike) investigate what queries were running at that time. If the slowlog does not show much, lower long_query_time before the next spike (see the first sketch after this list).
  • (I don't know if Tomcat has a log.)
  • Does it happen at the same time every day or week?
  • When does Amazon do backups?
  • If you are online when it happens, see if you can run SHOW PROCESSLIST; (see the second sketch after this list). Keep yourself connected; it may be difficult to connect once you see the spike.
  • The MySQL variable max_connections controls the 307 limit. Increasing it may postpone the top of the spike, but it could make things worse. (I don't see this as a "solution".)
  • Tomcat can [probably] hold off excess connections without hurting things too much; it is likely to be better to throttle Tomcat than to change the 307. When MySQL has "lots of busy connections", it gives each one equal access to resources; this has the effect of slowing down all connections.
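
As a rough illustration of the first bullet, here is a minimal sketch in plain SQL. On RDS you would normally change these settings through the DB parameter group rather than with SET GLOBAL (which needs privileges RDS accounts usually lack), and the threshold value below is only an example:

```sql
-- Sketch: turn on the slow query log and lower the threshold before the next spike.
-- On RDS, set slow_query_log / long_query_time in the DB parameter group instead.
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;   -- seconds; example value, lower it further if the log stays quiet

-- Verify what is currently in effect.
SHOW GLOBAL VARIABLES LIKE 'slow_query_log%';
SHOW GLOBAL VARIABLES LIKE 'long_query_time';
```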
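And a minimal sketch for the PROCESSLIST and max_connections bullets, assuming you manage to stay connected during a spike; the 120-character truncation is only for readability:

```sql
-- Sketch: what is running right now, longest-running statements first.
SELECT id, user, host, db, command, time, state, LEFT(info, 120) AS query
FROM information_schema.processlist
WHERE command <> 'Sleep'
ORDER BY time DESC;

-- How close the server is getting to the 307 limit.
SHOW GLOBAL VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS LIKE 'Threads_connected';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
```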
Helder Sérvio
We have already taken a look at the slow query log, and we were able to remove or refactor a few expensive queries when the situation was really dire (DB crashing all the time), but that still doesn't explain why the issue only started happening after the transition to MySQL 8. Tomcat has a log, but we haven't been storing it after the instances are destroyed. We'll do that next time and take a look at the threads. And no, it varies a lot in frequency and timing, and it does not overlap with the backups.
Wilson Hauck
@HelderSérvio Additional information request, please: AWS instance type, RAM size, number of cores, and whether the MySQL host server has any SSD or NVMe devices. Post the results on pastebin.com and share the links. From your SSH login as root, text results of: A) SELECT COUNT(*) FROM information_schema.tables; B) SHOW GLOBAL STATUS; after a minimum of 24 hours uptime; C) SHOW GLOBAL VARIABLES; D) SHOW FULL PROCESSLIST; E) STATUS; (not SHOW STATUS, just STATUS;) for server workload tuning analysis to provide suggestions.
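(For reference, those requests as they would be typed into the mysql client; note that STATUS; is a client command, not a server statement:)

```sql
SELECT COUNT(*) FROM information_schema.tables;  -- A) rough count of tables on the instance
SHOW GLOBAL STATUS;       -- B) after at least 24 hours of uptime
SHOW GLOBAL VARIABLES;    -- C) full server configuration
SHOW FULL PROCESSLIST;    -- D) current activity, untruncated
STATUS;                   -- E) mysql client summary (same as \s), not SHOW STATUS
```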
Helder Sérvio
@WilsonHauck The application servers are 2-3 t4g.small instances (2 GiB, 2 vCPUs), while the database is a single t3.medium instance (4 GiB, 2 vCPUs) with gp2 SSD storage (I was mistaken when I said it was 2 or 3). I don't have direct access to the database, so I'm afraid I can't show you the results of those queries. However, my boss has given me a dump of the slow queries table. Essentially, what happens is that, at a certain moment, all queries start getting slower (the density of slow queries increases a lot), until some take around 2 or 3 minutes. The RDS performance insights show long LOCK_table_cache waits.
Wilson Hauck
@HelderSérvio Could you post the slow query information provided by your boss? Could your boss run the statements listed above, post the data to pastebin.com, and share the links with us for workload analysis of your t3.medium instance?
(reply from the answer's author)
"Density increases" -- Often a single query precipitates the crowding. `SHOW PROCESSLIST` can sometimes spot that, but getting that is difficult. The raw slowlog can sometimes show which query is the naughty query. ("Digesting" the query is better for figuring out which query is the most burden for the system.)