I have an AWS ALB that load balances requests round-robin to four servers.
Each server uses pm2 in cluster mode to round-robin those requests across six Node.js worker processes, one per CPU.
The Node.js processes (a React/Next.js app) run on each of those six CPUs, served by Express.js. One of the first things they do is log the incoming request. (They are not fronted by a web server like Apache or nginx; requests go straight to Express.js.)
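For reference, the per-worker entry point is roughly like this (a simplified sketch; the real dist/server.js isn't shown here, so the wiring is illustrative):

const express = require('express');
const nextjs = require('next');

const port = Number(process.env.PORT) || 3000;
const app = nextjs({ dev: false });
const handle = app.getRequestHandler();

app.prepare().then(() => {
  const server = express();

  // First middleware: log every request this worker actually receives.
  server.use((req, res, next) => {
    console.log(`${new Date().toISOString()} pid=${process.pid} ${req.method} ${req.url}`);
    next();
  });

  // Everything else goes to Next.js.
  server.all('*', (req, res) => handle(req, res));

  // listen(port[, backlog]): Node's default accept backlog is 511.
  server.listen(port, () => {
    console.log(`worker ${process.pid} listening on ${port}`);
  });
});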
Usually every single request that hits the ALB gets forwarded successfully and logged by a Node.js process. However, at high-traffic times some requests simply get dropped and never reach any Node.js process. Our server logs obviously can't record these failures, since the requests never arrive; we only see the gap by comparing our log counts to the ALB request counts.
I'm trying to understand the mechanism that could lead to them being dropped. Could a Node.js internal queue be timing out? Or could it be a Linux kernel thing? We are seeing indications that during periods of higher traffic some of the CPUs are busy while others are idle, which makes me think of queue length (Kingman's formula, Little's law, etc.).

I can think of a few ways to decrease the probability of this happening, from increasing server capacity, to reducing response time, to changing the server-level load-balancing strategy. But I'm mostly trying to understand where the request actually gets stuck and what determines whether and how it drops or disappears, especially whether I could log it or send some kind of signal when it happens.
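For the Node side of that, this is the kind of per-worker instrumentation I have in mind (a sketch, attached to the http.Server that Express's listen() returns; monitorEventLoopDelay and the net 'drop' event are standard Node APIs, but the numbers here are illustrative, not our current settings):

const { monitorEventLoopDelay } = require('perf_hooks');

function instrument(httpServer) {
  // Event-loop delay histogram: if p99 climbs into hundreds of ms during busy
  // periods, this worker is the saturated one while its siblings sit idle.
  const loopDelay = monitorEventLoopDelay({ resolution: 20 });
  loopDelay.enable();

  setInterval(() => {
    httpServer.getConnections((err, count) => {
      if (err) return;
      console.log(JSON.stringify({
        pid: process.pid,
        eventLoopP99Ms: loopDelay.percentile(99) / 1e6, // histogram reports nanoseconds
        openConnections: count,
      }));
      loopDelay.reset();
    });
  }, 5000).unref();

  // Node only refuses connections itself (and emits 'drop', Node >= 18.6) when
  // maxConnections is set; without a cap, un-accepted connections pile up in the
  // kernel's accept backlog instead, where Node never sees them.
  httpServer.maxConnections = 1000; // illustrative cap
  httpServer.on('drop', (data) => {
    console.warn(`worker ${process.pid} dropped connection from ${data.remoteAddress}:${data.remotePort}`);
  });
}

module.exports = { instrument };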
Snippets of pm2 config:
module.exports = {
  apps: [
    {
      name: 'community',
      script: 'dist/server.js',
      instances: -1,
      exec_mode: 'cluster',
      autorestart: true,
      watch: false,
      log_date_format: 'YYYY-MM-DD HH:mm Z',
      max_memory_restart: '2G',
      // ...
      // and env-specific configs, such as
      env_production: {
        NODE_ENV: 'production',
        NODE_OPTIONS: '--max-old-space-size=3584 --max-http-header-size=16380',
        LOG_LEVEL: 'INFO',
        PORT: 3000,
      },
    },
  ],
  deploy: {
    // ...
  },
};
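And for the kernel side, this is the kind of check I mean: a sketch (Linux only; the counter names are the standard TcpExt fields in /proc/net/netstat) that reads the listen-queue overflow counters, so I could tell whether connections are dying in the accept backlog before any Express middleware ever runs:

const fs = require('fs');

// /proc/net/netstat has a header line and a value line per protocol extension;
// pair up the two TcpExt lines to get named counters.
function readListenDrops() {
  const lines = fs.readFileSync('/proc/net/netstat', 'utf8').split('\n');
  const tcpExt = lines.filter((l) => l.startsWith('TcpExt:'));
  if (tcpExt.length < 2) return null;

  const headers = tcpExt[0].trim().split(/\s+/).slice(1);
  const values = tcpExt[1].trim().split(/\s+/).slice(1).map(Number);
  const stats = Object.fromEntries(headers.map((h, i) => [h, values[i]]));

  return {
    // Times the accept (listen) queue of a socket overflowed.
    listenOverflows: stats.ListenOverflows,
    // SYNs to LISTEN sockets dropped (includes the overflows above).
    listenDrops: stats.ListenDrops,
  };
}

console.log(readListenDrops());

If ListenOverflows/ListenDrops climb during the busy periods, the loss is happening in the kernel's accept queue (sized by the listen() backlog and capped by net.core.somaxconn) rather than anywhere inside Node.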