We run nginx as a reverse proxy with TLS termination.
We recently moved from a 4-core machine to a 1-core machine by mistake, and noticed under load that our "request queuing" metric shot up from a normal of ~10ms to around 10 seconds on average. When we realised the sizing error and upgraded back to 4 cores, the problem went away, but I'd like to understand exactly what was responsible for the slowdown, so that we can tell whether we're approaching a similar situation in the future as traffic increases.
The "request queuing" metric is simply calculated as a delta between a timestamp that the nginx proxy adds to proxied requests (X-Request-Start
), and the time that that request is processed by our application on the backend web servers.
This value normally indicates queuing on our backend servers, as there's an explicit request queuing mechanism built into the backend web servers. In this case, there wasn't any queuing on the backend servers.
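For concreteness, this is roughly how the delta is derived on the backend side (a minimal sketch rather than our actual code, assuming a Python/WSGI-style app; the function name is hypothetical):

import time

def request_queue_ms(environ):
    # nginx sets X-Request-Start to "t=<seconds.millis>" via $msec
    raw = environ.get("HTTP_X_REQUEST_START", "")
    if not raw.startswith("t="):
        return None
    started = float(raw[2:])
    return (time.time() - started) * 1000.0  # queueing delay in ms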
Our worker_connections/worker_processes configuration is:
worker_processes auto;
worker_rlimit_nofile 20000;
events {
    worker_connections 10000;
}
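If I understand the limits correctly, this is the back-of-envelope ceiling (a sketch, assuming worker_processes auto resolves to the CPU core count, and that each proxied request holds two connections, one to the client and one to the upstream):

import os

def max_proxied_clients(worker_connections=10000, workers=None):
    workers = workers or os.cpu_count()  # what "auto" resolves to
    # worker_connections counts *all* connections per worker, including
    # those to upstream servers, so halve it for a proxying workload
    return workers * worker_connections // 2

print(max_proxied_clients(workers=1))  # 1 core: ~5,000 in-flight clients
print(max_proxied_clients(workers=4))  # 4 cores: ~20,000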
Observations at the time:
- The $request_time figures displayed in nginx's access logs were in the normal range (~100ms average)
- The system metrics showed a fairly steady average CPU utilisation of 60%
- Active connections reported by nginx's stub_status were around 6k, and rose to around 10k after the upgrade (polled roughly as in the sketch below)
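For reference, this is approximately how we poll those counters (a sketch; it assumes stub_status is exposed on a local /nginx_status location, which is our setup rather than an nginx default):

import re
import time
import urllib.request

STATUS_URL = "http://127.0.0.1/nginx_status"  # assumed stub_status location

def poll():
    body = urllib.request.urlopen(STATUS_URL, timeout=2).read().decode()
    active = int(re.search(r"Active connections:\s*(\d+)", body).group(1))
    r, w, q = re.search(r"Reading: (\d+) Writing: (\d+) Waiting: (\d+)", body).groups()
    return active, int(r), int(w), int(q)

while True:
    print(time.strftime("%H:%M:%S"), "active/reading/writing/waiting:", poll())
    time.sleep(10)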
What is the technical explanation, in terms of nginx internals, for where this delay came from? It seems to have arisen after the connection was accepted and the header was added, but before the request reached the backend. How can I monitor it, and can I calculate the level of traffic at which we might hit it again on 4 cores?
Nginx config (the bits I felt were pertinent):
user www-data;
worker_processes auto;
worker_rlimit_nofile 20000;
pid /run/nginx.pid;

include /etc/nginx/modules-enabled/*.conf;

events {
    worker_connections 10000;
}

http {
    log_format upstreamlog '[$time_local] $remote_addr - $remote_user - $host '
                           'to: $upstream_addr: $request '
                           'upstream_response_time $upstream_response_time '
                           'msec $msec request_time $request_time '
                           'status $status sent_bytes $bytes_sent';

    upstream app_backend {
        least_conn;
        server 100.64.32.11:8080; # app01
        server 100.64.32.41:8080; # app02
        # ... etc ...
    }

    server {
        listen 80;
        listen [::]:80;
        server_name ~^[^\.]+\.app\.net$;
        return 301 https://$host$request_uri;
    }

    server {
        listen 443 ssl;
        listen [::]:443 ssl;
        server_name ~^[^\.]+\.app\.net$;

        ssl_certificate /etc/nginx/ssl/app.crt;
        ssl_certificate_key /etc/nginx/ssl/app.key;

        location / {
            proxy_pass http://app_backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_set_header X-Request-Start "t=${msec}";

            access_log /var/log/nginx/access.log upstreamlog;

            proxy_send_timeout 300;
            proxy_read_timeout 300;
        }
    }
}