Score:1

What things should I consider when identifying and rate limiting bots?


(Not sure if this question is a better fit for Server Fault or the Webmasters Stack Exchange...)

I am thinking of rate limiting access to my sites because identifying and blocking bad bots takes most of my time.

For example, I have bots accessing the site through VPNs/proxies, and each individual IP makes 10k-15k requests within 1-2 hours before I find and block it. Daily I see around 100-200 of them. They slow down the site and maybe try to copy the content (I'm not sure what their purpose is).

Because rate limiting is a fairly drastic action and there are plenty of things that can go wrong, I am hoping to find out:

  • some best practices/techniques/ways to block bad bots
  • what are some things that can go wrong? There are always more :)

What I have in mind is counting requests to content pages (e.g. not images) from the same IP over the last 24 hours; if an IP makes more than 100 requests in 24 hours, display a captcha and return a rate-limited response code.
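As a rough illustration of that threshold idea, here is a minimal Python sketch. It is in-memory and single-process; a real setup would keep the counters in something shared like Redis and return the captcha with an HTTP 429. The helper names and the static-asset check are my own assumptions; only the 100 requests / 24 hours figures come from above.

import time
from collections import defaultdict, deque

WINDOW = 24 * 3600   # 24-hour window
THRESHOLD = 100      # content-page requests allowed per window and per IP

hits = defaultdict(deque)   # ip -> timestamps of recent content-page requests

def is_content_page(path):
    # crude placeholder: treat anything that is not a static asset as "content"
    return not path.lower().endswith((".jpg", ".png", ".gif", ".css", ".js"))

def should_challenge(ip, path, now=None):
    """True when this IP should get a captcha + rate-limited response instead of the page."""
    if not is_content_page(path):
        return False
    now = now if now is not None else time.time()
    window = hits[ip]
    window.append(now)
    while window and now - window[0] > WINDOW:
        window.popleft()          # drop requests older than 24 hours
    return len(window) > THRESHOLD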

Then, to allow "good/known bots", I want to identify them by the hostname of their IP (only if that hostname resolves back to the same IP). I've seen that about 90% of known bots use proper hostnames (e.g. crawl-[foo].googlebot.com).

That way I can whitelist Google, Yandex, Bing, etc.
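The usual name for that check is forward-confirmed reverse DNS: reverse-resolve the IP, check the hostname suffix, then forward-resolve the hostname and confirm it contains the original IP. A minimal Python sketch (the suffix list is illustrative only; each search engine documents its own hostnames):

import socket

# illustrative suffixes; verify against each engine's own documentation
GOOD_BOT_SUFFIXES = (".googlebot.com", ".search.msn.com", ".yandex.com", ".yandex.ru", ".yandex.net")

def is_verified_bot(ip):
    """Forward-confirmed reverse DNS: IP -> hostname -> back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # reverse lookup
    except (socket.herror, socket.gaierror):
        return False
    if not hostname.endswith(GOOD_BOT_SUFFIXES):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]   # forward lookup
    except socket.gaierror:
        return False
    return ip in forward_ips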

Whitelisting .google.com (rather than .googlebot.com) is a bad idea, because there are hosts like google-proxy-[foo].google.com and rate-limited-proxy-[foo].google.com that are used by apps like Google Translate, which some people use/abuse to read blocked websites.

I cannot block these Google service IPs/hostnames because I think they are also used by Gmail: when I send an email containing an image from my site, Gmail might fetch the image from these hostnames.

I cannot block non-residential IPs (e.g. datacenters) either, because ... that is hard, and some normal users connect through VPNs with similar IPs.

Blocking by user agent achieves nothing; blocking by hostname achieves nothing if the hostname doesn't verify back to the same IP.

  • What are some other similar things to consider? I don't want to risk dropping in search results or blocking normal users.

Cloudflare doesn't help much:

  • they might serve cached copies of my site to scrapers without me being able to see/block those requests;
  • they didn't block many of the suspicious requests I have now;
  • they often show me a captcha on various sites even when I make no automated/suspicious requests from my IP, so I don't think they properly identify bad traffic despite having huge amounts of data about IP reputation and user behaviour.


Score:3

"identifying and blocking bad bots takes most of my time."

Generally when I see such a remark I get the idea that an administrator has too much time on their hands and may be focussing on the wrong issue.

Bots are part of the background noise of the internet, and when they are not running at denial-of-service-level rates, any "problems" you might see are either:

  • the result of a bad signal-to-noise ratio - your site sees so few "real" visitors that the usual level of bot activity appears much more significant than it objectively is.
  • the result of bad code and too little/no optimisation on the site (back-end).

There are of course always exceptions.


Generally, all efforts to improve the performance of your site will reduce the effect of bots and will also directly benefit your real users.

Identify bottlenecks, improve expensive database queries, add caching etc.
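For example, even a tiny time-based cache in front of an expensive read-only query takes most of the bot load off the database. A sketch in Python, for illustration only; in practice you would more likely cache at the HTTP/CDN layer or in Redis/Memcached:

import time
import functools

def ttl_cache(seconds):
    """Cache a function's result for a number of seconds (positional args only)."""
    def decorator(fn):
        store = {}
        @functools.wraps(fn)
        def wrapper(*args):
            cached = store.get(args)
            if cached and time.time() - cached[1] < seconds:
                return cached[0]
            value = fn(*args)
            store[args] = (value, time.time())
            return value
        return wrapper
    return decorator

@ttl_cache(seconds=300)
def popular_articles(category):
    ...   # placeholder for an expensive database query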

Consider delegating the problem by subscribing to a CDN. Configure your site to only allow requests originating from the CDN and block direct requests from everywhere else. Let the CDN deal with bots and crawlers.
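Cloudflare, for instance, publishes its edge IP ranges, so "block direct requests" can be as simple as allowing only those ranges at the firewall or web server. A hedged Python sketch of the check (the URL is Cloudflare's published IPv4 list; in practice you would use this to generate firewall rules rather than run it per request):

import ipaddress
import urllib.request

CDN_RANGES_URL = "https://www.cloudflare.com/ips-v4"   # Cloudflare's published edge ranges

def load_cdn_networks(url=CDN_RANGES_URL):
    with urllib.request.urlopen(url) as resp:
        lines = resp.read().decode().splitlines()
    return [ipaddress.ip_network(line.strip()) for line in lines if line.strip()]

def is_from_cdn(ip, networks):
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)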


This part might be a question better suited for https://webmasters.stackexchange.com/ because I don't have a clue here, but maybe the relevant search engines have a site-owner management panel that allows webmasters to tune the amount of crawling they'll do on your sites.

Real bots honour the robots.txt crawl directives, so it should be safe to block everything that ignores the instructions there.

You can even consider adding a section disallowing access to a currently unused URI path such as /junk/:

User-agent: *
Disallow: /junk/

and then use that as a "trap": feed crawlers that access that URI directly into your automation (such as fail2ban) and immediately block them.
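A minimal sketch of that trap in Python, assuming a combined-format access log at a hypothetical path; in practice a fail2ban filter matching requests for /junk/ does the same job and handles the blocking for you:

import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"   # hypothetical path, adjust to your setup
TRAP_PATH = "/junk/"

# combined log format: the client IP is the first field, the request line is quoted
line_re = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)')

def trapped_ips(log_path=LOG_PATH):
    """IPs that requested the disallowed trap path despite robots.txt."""
    offenders = Counter()
    with open(log_path) as fh:
        for line in fh:
            match = line_re.match(line)
            if match and match.group(2).startswith(TRAP_PATH):
                offenders[match.group(1)] += 1
    return offenders

if __name__ == "__main__":
    for ip, count in trapped_ips().most_common():
        print(ip, count)      # candidates to feed into fail2ban / your firewall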


General recommendations before blocking IP-addresses:

  • Before you start blocking valid bots and crawlers, identify which search engines drive actual traffic and visitors to your site and ensure that you can exclude those from your block-lists.

    For example for Google: https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot

  • Consider that many organisations don't allow their users direct internet access; all visitors from such an organisation will appear to use the same (few) IP addresses, namely the IPs used by their proxy servers.

