Application sometimes slows down at peak times (Tomcat behind an Apache reverse proxy, deployed on AWS)

As a preface: I'm a software developer. At first no one knew what was going on, so I did some testing and reading, and I'd like to help my colleagues fix this problem.

The issue:

At peak times the server becomes so slow that browsers such as Chrome time out (after 30 seconds), yet the server is still up and does serve pages after roughly 100 seconds (tested with Insomnia). I've reproduced the issue with abs ... against the production server. It is related to the number of concurrent requests and is probably caused by our Apache configuration.

More info:

For development we run Tomcat 8 locally, and testing it with "abs -c 200 -n 2000 https:/[link]" gives fine response times. Against the production server, however, even 50 concurrent requests slowed the API I was testing dramatically: from its usual 800 ms to 27846 ms.
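To cross-check the abs numbers from a second client, a rough Python equivalent of a run like "abs -c 50 -n 500" can be sketched with the standard library alone. This is an illustration, not a replacement for ApacheBench: the function names are hypothetical, and the URL passed in should be whatever endpoint is under test. Like abs without -k, it opens a fresh connection (and, for HTTPS, a fresh TLS handshake) for every request.

```python
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def timed_get(url: str, timeout: float = 120.0) -> float:
    """Fetch `url` once and return the wall-clock latency in milliseconds."""
    start = time.perf_counter()
    # urlopen uses a new connection per call (no keep-alive), like abs without -k
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()  # include the body transfer in the measurement
    return (time.perf_counter() - start) * 1000.0


def load_test(url: str, concurrency: int, total: int) -> dict:
    """Send `total` GET requests with at most `concurrency` in flight at once."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda _: timed_get(url), range(total)))
    return {
        "count": len(latencies),
        "median_ms": statistics.median(latencies),
        # simple index-based 95th percentile on the sorted list
        "p95_ms": latencies[int(0.95 * (len(latencies) - 1))],
        "max_ms": latencies[-1],
    }
```

Running load_test(url, concurrency=50, total=500) against the development machine and against production, with the same path, gives two latency summaries that can be compared directly.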

Things tried and more details:

We have JavaMelody running, and I thought we might be hitting the limit on Tomcat's request-processing threads, so we raised the number of threads from the default of 200 to 500 (this was before I did the testing). When running the test above, I can see busy threads go up to 50-something out of 500 (this is production, so real users are on the system as well), but it still slows down a lot.

At peak times we have around 1000 HTTP sessions, but threads, memory and CPU are nowhere near 100%. Just to be sure, we upgraded to the most powerful server we can, but of course that wasn't it. We use an SQL database, but the SQL server isn't peaking either, so I doubt that's the issue.

I know I shouldn't blindly copy JVM arguments, but based on similar problems I tried adding "-XX:ReservedCodeCacheSize=512M"; that didn't help either. I also increased acceptCount to 1000 in server.xml, with no effect. Should I revert these changes? I haven't noticed any performance change, and as far as I can tell from the documentation it's fine to leave them as they are.

We also have an odd feature: after a period of inactivity the webapp navigates to the home page and then keeps refreshing it every xx minutes. I suspect this hurts performance, especially when a user has many tabs open and they all start refreshing. It's probably not what's causing our issue, but it's worth mentioning.
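The extra load from auto-refreshing tabs is simple arithmetic: sessions times open tabs, divided by the refresh interval, times the requests each home-page load triggers. The sketch below uses purely hypothetical inputs (the real interval is the "xx minutes" above, and the tab count and requests per page are guesses). It gives an average rate; if many tabs refresh at the same moment, the bursts will be higher.

```python
def refresh_load_rps(sessions: int, tabs_per_session: float,
                     refresh_interval_min: float, requests_per_page: float) -> float:
    """Average requests/second caused only by auto-refreshing home-page tabs.

    All inputs are estimates; requests_per_page should count every request a
    home-page load triggers that is not served from the browser cache.
    """
    page_loads_per_s = sessions * tabs_per_session / (refresh_interval_min * 60.0)
    return page_loads_per_s * requests_per_page


# Hypothetical numbers: 1000 sessions, 3 idle tabs each, a 5-minute interval,
# 20 uncached requests per home-page load.
print(refresh_load_rps(1000, 3, 5, 20))  # 200.0
```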

Next I'll try tuning the Apache configuration. I'm reading a tuning guide, and MaxRequestWorkers (called MaxClients before Apache 2.3.13) looks like it might explain what we're experiencing. Quoting the guide: "If this directive is too low, Apache under-utilizes the available hardware which translates to wasted money and long delays in page load times during peak hours."
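One way to reason about MaxRequestWorkers is Little's law: the average number of busy workers equals the rate of new connections times how long each connection holds a worker. Under the prefork MPM a connection occupies a whole process while its requests are proxied to Tomcat and while it idles in keep-alive (up to KeepAliveTimeout, 5 seconds by default in Apache 2.4). When that product reaches MaxRequestWorkers (256 by default for prefork), further connections wait in the listen queue, which would fit "slow, but eventually served". The sketch below is an estimate under those assumptions; all traffic figures are hypothetical.

```python
def workers_busy(conn_per_s: float, requests_per_conn: float,
                 avg_response_s: float, keepalive_idle_s: float) -> float:
    """Little's law (L = lambda * W): average number of prefork workers in use.

    Under prefork each client connection occupies one process for its whole
    lifetime: while its requests are proxied to Tomcat and while it sits idle
    in keep-alive (at most KeepAliveTimeout after the last request).
    """
    hold_s = requests_per_conn * avg_response_s + keepalive_idle_s
    return conn_per_s * hold_s


def is_saturated(conn_per_s: float, requests_per_conn: float,
                 avg_response_s: float, keepalive_idle_s: float,
                 max_request_workers: int) -> bool:
    """True when demand reaches MaxRequestWorkers, so new connections must queue."""
    return workers_busy(conn_per_s, requests_per_conn,
                        avg_response_s, keepalive_idle_s) >= max_request_workers


# Hypothetical peak: 40 new connections/s, 5 requests each at 0.8 s,
# 5 s of idle keep-alive, against the prefork default of 256 workers.
print(workers_busy(40, 5, 0.8, 5))       # 360.0
print(is_saturated(40, 5, 0.8, 5, 256))  # True
```

Note that the keep-alive term alone accounts for 200 of those 360 workers here, which is why idle keep-alive time matters so much under prefork.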

I'd appreciate any tips. Hopefully it's just Apache and I can at least make the server usable today. Are there any other settings that might cause this slowdown?

Your web server might be choking on too many concurrent connections, which effectively emulates a Slowloris DoS attack. Please check my previous answer on the topic.

Stefan Horvath: Yeah, that was my assumption as well: the web server is to blame somehow. I'll try the steps from your answer and post an update when I have results.
Tim: Sounds right. When things are slow, you could SSH into the server, check the CPU, and request a few pages directly from Tomcat. If those are fast, it's definitely the web server. Increasing the web server's connection limits is likely to help. You could also consider Nginx instead of Apache; from memory it uses a different model, which may (or may not) scale better. Benchmarking as you're doing is a great idea.
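The direct-versus-proxied check can be scripted so that both paths are measured back to back during a slow period. This is a minimal sketch, run on the server itself; the URLs are hypothetical and assume Tomcat also has an HTTP connector listening locally (8080 in the stock server.xml), for example http://127.0.0.1:8080/ plus the app path, compared with the public URL of the same page.

```python
import statistics
import time
import urllib.request


def fetch_ms(url: str, timeout: float = 120.0) -> float:
    """Time one complete GET of `url`, in milliseconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return (time.perf_counter() - start) * 1000.0


def compare_direct_vs_proxied(direct_url: str, proxied_url: str,
                              samples: int = 5) -> dict:
    """Median latency of the same page fetched from Tomcat directly and via Apache."""
    direct = [fetch_ms(direct_url) for _ in range(samples)]
    proxied = [fetch_ms(proxied_url) for _ in range(samples)]
    return {
        "direct_ms": statistics.median(direct),
        "proxied_ms": statistics.median(proxied),
    }
```

If direct_ms stays near its normal value while proxied_ms balloons, the delay is being added in front of Tomcat.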
Stefan Horvath: I've just made the changes from your previous answer. All seems well so far, and I'll wait for tomorrow's peak to confirm the problem is fixed. However, ```net.ipv4.ip_local_port_range = "15000 61000"``` gave me an invalid parameter error. I'll try to fix it, but any ideas are welcome.
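One thing worth checking: in /etc/sysctl.conf (or a file under /etc/sysctl.d/) the value is normally written without quotes, as net.ipv4.ip_local_port_range = 15000 61000. The quotes are only needed on a shell command line such as sysctl -w net.ipv4.ip_local_port_range="15000 61000", where the shell strips them. If the quotes reach the kernel as part of the value, it is likely to reject it. The helper below is a hypothetical format sanity check (two integers separated by whitespace, as /proc shows the value), not a full reimplementation of the kernel's own validation.

```python
from pathlib import Path

PROC_PATH = Path("/proc/sys/net/ipv4/ip_local_port_range")


def parse_port_range(value: str) -> tuple[int, int]:
    """Parse a port-range value such as '15000 61000' (tab or space separated).

    Raises ValueError for anything else, including a quoted string like
    '"15000 61000"', which is what the kernel would see if the quotes were
    copied into sysctl.conf.
    """
    parts = value.split()
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected two integers, got {value!r}")
    low, high = int(parts[0]), int(parts[1])
    if not (1 <= low <= high <= 65535):
        raise ValueError(f"invalid range {low}-{high}")
    return low, high


def current_port_range() -> tuple[int, int]:
    """Read the live value from procfs (Linux only)."""
    return parse_port_range(PROC_PATH.read_text())
```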
Stefan Horvath: Nope, it's still slow. At peak today it took more than 50 seconds to load the page I've been testing (the one that usually takes 500 ms).
Stefan Horvath: I've narrowed it down: only the HTTPS links seem to be slowed down. HTTP links (like going to /manager) work fine.
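Since only HTTPS is affected, it may help to split one request into phases: TCP connect, TLS handshake, and time to first response byte. The kernel completes the TCP handshake and queues the connection on its own, but the TLS handshake cannot progress until a server process accepts the connection. So a large tls_ms alongside a small connect_ms would suggest the connection is waiting for a free Apache worker, or that the handshake itself is slow. This is a minimal sketch; the function name and the choice of a raw HTTP/1.1 request are illustrative assumptions.

```python
import socket
import ssl
import time


def time_phases(host: str, port: int, use_tls: bool, path: str = "/",
                timeout: float = 120.0) -> dict:
    """Split one GET into TCP connect, TLS handshake and time-to-first-byte (ms)."""
    t0 = time.perf_counter()
    sock = socket.create_connection((host, port), timeout=timeout)
    t_connect = time.perf_counter()
    try:
        if use_tls:
            ctx = ssl.create_default_context()
            # on an already-connected socket the handshake runs inside wrap_socket
            sock = ctx.wrap_socket(sock, server_hostname=host)
        t_tls = time.perf_counter()
        request = (f"GET {path} HTTP/1.1\r\nHost: {host}\r\n"
                   "Connection: close\r\n\r\n").encode("ascii")
        sock.sendall(request)
        sock.recv(1)  # block until the first response byte arrives
        t_first_byte = time.perf_counter()
    finally:
        sock.close()
    return {
        "connect_ms": (t_connect - t0) * 1000.0,
        "tls_ms": (t_tls - t_connect) * 1000.0,
        "first_byte_ms": (t_first_byte - t_tls) * 1000.0,
    }
```

Calling time_phases with use_tls=True on port 443 at peak, and again off-peak, shows which phase absorbs the extra time.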
Marcel: Just out of curiosity, what's your AWS instance type? It might just be that you're running out of RAM. Do you have something like netdata installed to get an overview of server health under peak load?
Stefan Horvath: We're using the 2xl server, and I've been monitoring with htop and JavaMelody; system resources are definitely not the problem (less than 50% usage at peak). I've been reading and experimenting a lot on a dev server I set up. I think the issue is that the production server uses the prefork MPM. I'm looking into switching to the event MPM, but I'm a bit paranoid about thread and session safety.
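With the event MPM, capacity comes from threads rather than processes: MaxRequestWorkers cannot exceed ServerLimit × ThreadsPerChild (defaults 16 × 25 = 400), and idle keep-alive connections are handed to a listener thread instead of tying up a worker. On the safety concern: HTTP sessions live in Tomcat, so the Apache MPM does not change how Tomcat handles them; the usual compatibility issue with threaded MPMs is non-thread-safe modules loaded into Apache itself (mod_php is the common example). The arithmetic is sketched below; the function names and the 1000-worker target are hypothetical.

```python
import math
from typing import Optional


def event_mpm_capacity(server_limit: int = 16, threads_per_child: int = 25,
                       max_request_workers: Optional[int] = None) -> int:
    """Requests Apache's event (or worker) MPM can process at the same time.

    Defaults are httpd's: ServerLimit 16 x ThreadsPerChild 25 = 400.
    MaxRequestWorkers cannot exceed ServerLimit * ThreadsPerChild.
    """
    ceiling = server_limit * threads_per_child
    if max_request_workers is None:
        return ceiling
    return min(max_request_workers, ceiling)


def server_limit_for(target_workers: int, threads_per_child: int = 25) -> int:
    """Smallest ServerLimit that allows `target_workers` with this ThreadsPerChild."""
    return math.ceil(target_workers / threads_per_child)


print(event_mpm_capacity())              # 400
print(event_mpm_capacity(16, 25, 1000))  # 400: capped by ServerLimit
print(server_limit_for(1000))            # 40
```

Raising ThreadsPerChild instead of ServerLimit is the other option, but it is itself capped by ThreadLimit.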