Given the following configuration (reduced to the relevant parts):
/etc/nginx/nginx.conf:
http {
    # ... general configuration stuff here ...
    map $http_user_agent $isbot_ua {
        default 0;
        ~*(GoogleBot|bingbot|YandexBot|mj12bot|PetalBot|SemrushBot|AhrefsBot|DotBot|oBot) 1;
    }
    map $isbot_ua $limit_bot {
        0 "";
        1 $binary_remote_addr;
    }
    limit_req_zone $limit_bot zone=bots:10m rate=2r/m;
    limit_req_log_level warn;
    limit_req_status 429;
    include sites.d/vhost_*.conf;
}
/etc/nginx/sites.d/vhost_example.org.conf:
server {
    # ... general vhost config here ...
    location / {
        index index.php index.html index.htm;
        try_files $uri $uri/ /index.php$is_args$args;
    }
    location ~ ^(.+?\.php)(/.*)?$ {
        try_files /does-not-exist-099885c5caef6f8ea25d0ca26594465a.htm @php;
    }
    location @php {
        try_files $1 =404;
        include /etc/nginx/fastcgi_params;
        fastcgi_split_path_info ^(.+\.php)(/.+)$;
        fastcgi_param SCRIPT_FILENAME $document_root$1;
        fastcgi_param PATH_INFO $2;
        fastcgi_param HTTPS on;
        fastcgi_pass unix:/var/lib/php/11-example.org-php-fpm.socket;
        fastcgi_index index.php;
    }
}
/etc/nginx/fastcgi_params:
limit_req zone=bots burst=5 nodelay;
# ... more fastcgi_param here ...
The issue is the following:
Each request matching a bot UA (no matter whether it is a virtual URL mapping to /index.php or a native URL pointing directly to index.php) results in a 404 response instead of the expected 200 - unless I exceed the rate limit, at which point it suddenly responds with the expected 429.
No 404 or 429 is generated if I change the map to:
map $request_uri $is_req_limited {
    default 0;
    # ~*(GoogleBot|bingbot|YandexBot|mj12bot|PetalBot|SemrushBot|AhrefsBot|DotBot|oBot) 1;
}
In this case, all requests are answered with 200. The same happens with the original map when the UA does not match any of the bots.
The thing is: it worked correctly in our pre-deployment tests, which used a simpler vhost config (we moved limit_req from the global config into the fastcgi section during deployment, because we only want to rate-limit page generation; cached pages and static resources are fine). This has completely killed the SEO rankings of our sites.
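Roughly (simplified - this is not the exact original file), the working pre-deployment setup had limit_req next to the zone definition at the http level:
http {
    # ...
    limit_req_zone $limit_bot zone=bots:10m rate=2r/m;
    limit_req zone=bots burst=5 nodelay;
    # ...
}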
Commands used for testing:
# Causes the problem:
for i in $(seq 1 30); do curl -Is -A GoogleBot https://example.org/ | head -n1; done
# Does not cause the problem:
for i in $(seq 1 30); do curl -Is -A ThisIsNotABot https://example.org/ | head -n1; done
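The "native URL" case behaves the same way; for example (assuming index.php is directly reachable in the docroot):
# Also causes the problem:
for i in $(seq 1 30); do curl -Is -A GoogleBot https://example.org/index.php | head -n1; done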
Is this a bug or a misconfiguration? If it is a bug, is it possible to work around it?
Side note: it is almost impossible to avoid this somewhat strange configuration because it is generated by the hosting management software (Froxlor), though I think it may play into the problem. We also cannot add to or modify any configuration here:
location ~ ^(.+?\.php)(/.*)?$ {
    try_files /does-not-exist-099885c5caef6f8ea25d0ca26594465a.htm @php;
}
location @php {
    try_files $1 =404;
    # ...
I wonder if limit_req would be better placed inside location ~ ^(.+?\.php)(/.*)?$, but OTOH location @php should be equally fine.
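A sketch of what I mean (untested, and as noted above we cannot actually edit this Froxlor-generated block):
location ~ ^(.+?\.php)(/.*)?$ {
    limit_req zone=bots burst=5 nodelay;
    try_files /does-not-exist-099885c5caef6f8ea25d0ca26594465a.htm @php;
}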