Some background: I'm a software developer, and at first no one knew what was going on, so I did some testing and reading and would like to help my colleagues fix this problem.
The issue:
At peak times the server becomes so slow that connections time out in browsers like Chrome (after 30 seconds), although the server is still up and does serve pages after ~100 seconds (tested with Insomnia). I've replicated the issue using abs ...
on the production server, and it seems to be related to the number of concurrent requests, probably caused by our Apache server config.
More info:
When developing we run Tomcat 8 locally, and I've tested it with "abs -c 200 -n 2000 https:/[link]" and the response times are fine. But against the production server, even with 50 concurrent requests, the API I was testing slowed down significantly: from about 800 ms normally to 27846 ms.
Things tried and more details:
We have JavaMelody running, and I thought we might be hitting the limit on Tomcat's request-processing threads, so we increased the number of threads to 500 from the default of 200 (this was before I did the testing). When running the test described above, I can see the busy threads go up to 50-something out of 500 (this is production, so some people are actually using the application as well), but it still slows down a lot.
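A rough back-of-envelope check on the test numbers (50 concurrent requests, about 800 ms unloaded, 27846 ms under load), using Little's law (throughput = concurrency / response time). This is only a sketch: it assumes 27846 ms is a typical response time under load and that each request still needs about 800 ms of actual work. The function name is mine, not from any tool.

```python
def effective_parallelism(concurrency: int, base_ms: float, loaded_ms: float) -> float:
    """Estimate how many requests are actually being worked on at once.

    Little's law: throughput = concurrency / response time.
    Requests in service = throughput * per-request service time.
    """
    throughput_per_s = concurrency / (loaded_ms / 1000.0)
    return throughput_per_s * (base_ms / 1000.0)


# Figures from the test against production.
estimate = effective_parallelism(concurrency=50, base_ms=800, loaded_ms=27846)
print(f"~{estimate:.1f} requests served in parallel")
```

If that estimate is in the right ballpark, most of the 50-something busy threads are waiting on something rather than working, which would fit a small pool or a lock somewhere in the chain better than a CPU or memory shortage.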
At peak times, I see around 1000 HTTP sessions, but threads, memory and CPU are nowhere near 100%. Just to make sure, we upgraded the server to the best one available to us, but of course that wasn't it. We are using SQL, but the SQL server isn't peaking either, so I doubt that's the issue.
I know I shouldn't ape JVM arguments, but after looking at similar problems I tried adding "-XX:ReservedCodeCacheSize=512M", which didn't help either. I've also tried increasing acceptCount to 1000 in server.xml, but it's still not working. Should I revert these changes? I haven't noticed any performance change, and as far as I can tell from reading the documentation it's OK to leave them as they are.
We have a weird feature where the webapp goes to the home page after some time of inactivity and then keeps refreshing the home page every xx minutes. I think this is bad for performance, especially if a user has a lot of tabs open and they all start refreshing. It's probably not what's causing our issues, but worth mentioning.
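To get a feel for how much load the auto-refresh could add, here is a small estimate. It is a sketch only: the ~1000 sessions figure is from our monitoring, but the tabs-per-session and refresh-interval values are made-up placeholders, since I haven't measured either.

```python
def refresh_requests_per_second(sessions: int, tabs_per_session: float,
                                refresh_minutes: float) -> float:
    """Average request rate generated by idle tabs that auto-refresh."""
    if refresh_minutes <= 0:
        raise ValueError("refresh_minutes must be positive")
    return sessions * tabs_per_session / (refresh_minutes * 60.0)


# 1000 sessions is observed; 3 tabs and 5 minutes are hypothetical placeholders.
rate = refresh_requests_per_second(sessions=1000, tabs_per_session=3, refresh_minutes=5)
print(f"{rate:.1f} requests/s from auto-refresh")
```

Plugging in the real interval and a realistic tab count would show whether this background traffic is negligible or a meaningful share of peak load.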
The next thing I will try today is adjusting the Apache server settings. I'm reading a tuning guide, and MaxRequestWorkers / MaxClients looks like something that might explain what we're experiencing. Quote: "If this directive is too low, Apache under-utilizes the available hardware which translates to wasted money and long delays in page load times during peak hours."
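To see how a low worker cap could turn into the delays we're seeing, here is a deliberately simplified model. It assumes every request takes the same time, and that requests beyond the cap wait their turn and get served in "waves". `workers` stands in for the MaxRequestWorkers cap, and both function names are my own.

```python
import math


def worst_case_wait_s(burst: int, workers: int, service_s: float) -> float:
    """Time until the last request of a burst finishes, served in waves of `workers`."""
    return math.ceil(burst / workers) * service_s


def min_workers_for_deadline(burst: int, service_s: float, deadline_s: float) -> int:
    """Smallest worker cap that finishes the whole burst before the deadline."""
    waves_allowed = math.floor(deadline_s / service_s)
    if waves_allowed < 1:
        raise ValueError("a single request already exceeds the deadline")
    return math.ceil(burst / waves_allowed)


# 50 concurrent requests at ~0.8 s each, against Chrome's 30 s timeout.
for cap in (1, 2, 10, 50):  # hypothetical cap values, for illustration only
    print(cap, "workers ->", round(worst_case_wait_s(50, cap, 0.8), 1), "s")
print("min workers for 30 s:", min_workers_for_deadline(50, 0.8, 30))
```

Real traffic is messier than this, but the shape is the point: once the cap is well below the number of simultaneous requests, the wait grows roughly in proportion to the shortfall.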
I would appreciate any tips. Hopefully it's just the Apache server and I can at least make the server usable today. Are there any other configs that might cause this slowdown?