// Not sure if this question is a better fit for Server Fault or Webmasters Stack Exchange...
I am thinking of rate limiting access to my sites, because identifying and blocking bad bots takes up most of my time.
For example, I have bots accessing the site through VPNs/proxies, and each individual IP makes 10k-15k requests over 1-2 hours before I find and block it. Daily I see around 100-200 of them. They slow down the site and maybe try to copy the content (I'm not sure what their purpose is).
Because rate limiting is a fairly drastic measure and there is plenty that can go wrong, I am hoping to find out:
- some best practices/techniques/ways to block bad bots
- what are some things that can go wrong? There are always more :)
What I have in mind is counting requests to content pages (e.g. not images) from the same IP over the last 24 hours; if an IP exceeds 100 requests in 24 hours, display a captcha with a rate-limited response code (roughly like the sketch below).
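Roughly what I mean, as a minimal Python sketch; the names (WINDOW, MAX_HITS) and the in-memory store are just placeholders, a real setup would keep the counters in the web server or something like Redis:

```python
import time
from collections import defaultdict, deque

WINDOW = 24 * 60 * 60   # 24 hours, in seconds
MAX_HITS = 100          # content-page requests allowed per IP per window

hits = defaultdict(deque)  # ip -> timestamps of its content-page requests

def is_rate_limited(ip, now=None):
    """True if this IP went over MAX_HITS content-page requests in WINDOW."""
    now = time.time() if now is None else now
    q = hits[ip]
    while q and now - q[0] > WINDOW:   # drop timestamps outside the window
        q.popleft()
    q.append(now)
    return len(q) > MAX_HITS

# In the request handler: if the request is for a content page (not an
# image/asset) and is_rate_limited(client_ip), answer with 429 + captcha.
```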
Then, to allow "good/known bots", I want to identify them by the hostname of their IP (only counting it if the hostname resolves back to the same IP). I've seen that ~90% of known bots use proper hostnames (e.g. crawl-[foo].googlebot.com).
So by that I can whitelist Google, Yandex, Bing, etc. (something like the check sketched below).
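The suffix list here is my own assumption; the point is the two-step check: reverse-resolve the IP, then forward-resolve the hostname and require the original IP to come back (forward-confirmed reverse DNS):

```python
import socket

ALLOWED_SUFFIXES = (".googlebot.com", ".search.msn.com",
                    ".yandex.ru", ".yandex.net", ".yandex.com")

def is_verified_bot(ip):
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)              # reverse DNS
    except OSError:
        return False
    if not hostname.endswith(ALLOWED_SUFFIXES):
        return False
    try:
        _, _, forward_ips = socket.gethostbyname_ex(hostname)  # forward DNS
    except OSError:
        return False
    return ip in forward_ips   # hostname must resolve back to the same IP
```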
Whitelisting .google.com (rather than .googlebot.com) is a bad idea, because there are hosts like google-proxy-[foo].google.com and rate-limited-proxy-[foo].google.com that are used by apps like Google Translate, which some people use/abuse to read blocked websites.
I cannot block these Google service IPs/hostnames, because I think they are also used by Gmail: when I send an email containing an image from my site, Gmail might fetch it through these hostnames.
I cannot block non-residential IPs either (e.g. datacenters), because doing that reliably is hard, and some normal users use VPNs with similar IPs.
Blocking by user agent is worthless; blocking (or allowing) by hostname is worthless too if the hostname doesn't verify back to the same IP.
- What are some other similar things to consider? I don't want to risk dropping in search results or blocking normal users.
Cloudflare doesn't help much:
1) they might serve cached copies of my pages to scrapers without me being able to see/block those requests
2) they didn't block many of the suspicious requests I'm seeing now
3) they often show me a captcha on various sites when I'm making no automated/suspicious requests from my IP, so I don't think they identify bad traffic properly despite having huge amounts of data about IP reputation and user behaviour.
