Score:0

Server

How to avoid emails sent to Google's deep web crawler

miguelmorin

2/20/23, 4:40 PM

My website has an area restricted to users who sign up with a valid email. I have received sign-up requests with bogus email addresses, and I want to avoid sending emails to non-existent addresses lest they increase the bounce rate and hurt my sending reputation.

The emails are:

kwqchvznypecdv@hnwbkfod.my
kwqchvzny.pecdv@hnwbk.fod
kWQcHVzn%40ypEcDvh.NwB

The last one contains %40, the URL percent-encoding of @. The addresses are truncations of the same character sequence.
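As a rough check of the two observations above, here is a small Python sketch that URL-decodes each address and compares the letters that remain. The helper name `letters_only` is made up for illustration; only `urllib.parse.unquote` comes from the standard library.

```python
from urllib.parse import unquote


def letters_only(address: str) -> str:
    """URL-decode an address, lowercase it, and keep only its letters."""
    decoded = unquote(address)  # turns %40 back into @
    return "".join(ch for ch in decoded.lower() if ch.isalpha())


# The three addresses quoted in the question.
addresses = [
    "kwqchvznypecdv@hnwbkfod.my",
    "kwqchvzny.pecdv@hnwbk.fod",
    "kWQcHVzn%40ypEcDvh.NwB",
]

normalized = [letters_only(a) for a in addresses]
longest = max(normalized, key=len)

# True when every address is a truncation (prefix) of the longest one.
shares_prefix = all(longest.startswith(n) for n in normalized)
print(normalized)
print(shares_prefix)
```

A filter built on this idea only catches this one sequence, so it is a stopgap rather than a general defence.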

Inspecting the IP addresses of the requests with reverse DNS, all three requests come from cache.google.com. If the requests come from Google's crawler, I would expect these email addresses to be documented, but I could not find any reference.

In case it is the Google crawler, I want it to index the website while avoiding sending emails to bogus addresses. I have already implemented filtering on the address, looking for that character sequence.

Is there a list of bogus addresses that deep web crawlers use to gain access and index hidden pages?

Update

Following the answer and the comment about verifying whether the crawler is Googlebot, I confirmed that it is not:

$ host 212.113.167.197
197.167.113.212.in-addr.arpa domain name pointer cache.google.com.
$ host cache.google.com
Host cache.google.com not found: 3(NXDOMAIN)

So indeed, it seems to be a malicious user, which explains why that email address is not documented as coming from Google.


google

email-bounces

web-crawler

ceejayoz

2/20/23, 4:51 PM

Consider blocking the email form's URL in robots.txt. Or a captcha? I *presume* Google bot won't try to crack their own captchas...
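To illustrate the robots.txt suggestion, here is a minimal sketch that checks a rule with Python's standard `urllib.robotparser`. The `/signup` path and the `example.com` host are placeholders, not taken from the question. Keep in mind that robots.txt is advisory: well-behaved crawlers honour it, but nothing forces a malicious client to.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all crawlers away from the sign-up form.
ROBOTS_TXT = """\
User-agent: *
Disallow: /signup
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler would skip the form but may still fetch other pages.
signup_allowed = parser.can_fetch("Googlebot", "https://example.com/signup")
home_allowed = parser.can_fetch("Googlebot", "https://example.com/")
print(signup_allowed, home_allowed)
```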

miguelmorin

2/21/23, 3:33 PM

That's a nice idea. Can you write an answer?

Score:3

Server

Bob

2/21/23, 5:54 AM

Inspecting the IP addresses of the requests with reverse DNS, all three requests come from cache.google.com.

When doing a reverse lookup, do not forget to check if a forward lookup of the host name points to the IP-address you are investigating.

> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

When the reverse and forward DNS records align, like in this example, then you can trust it. Otherwise you may have a sloppy administrator or an attempt by an attacker to hide their origin.
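The same reverse-then-forward check can be scripted. Below is a minimal Python sketch, not a complete verifier: the function names are my own, the resolvers are passed in as parameters so the logic can be exercised without network access, and `socket.gethostbyname_ex` only returns IPv4 addresses, so IPv6 would need `socket.getaddrinfo` instead.

```python
import socket
from typing import Callable, List


def reverse_lookup(ip: str) -> str:
    """Return the PTR host name for an IP address, or "" if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return ""


def forward_lookup(hostname: str) -> List[str]:
    """Return the IPv4 addresses a host name resolves to, or [] on failure."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def is_forward_confirmed(
    ip: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """True only if the PTR name of `ip` resolves back to `ip` itself."""
    hostname = reverse(ip)
    if not hostname:
        return False
    return ip in forward(hostname)
```

Calling `is_forward_confirmed("66.249.66.1")` performs live DNS queries, so the result depends on the records at the time. A PTR name whose forward lookup fails or returns a different address is rejected.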

Please use a Whois query on the IP address rather than a reverse DNS lookup to determine the owner when investigating abuse.

Whatever the reverse DNS record of an IP address resolves to, especially an attacker's, is not always reliable information.

Note that the owner of an IP-address range can set any value they want on reverse DNS records. There is no limitation that they can only use host names that they own, nor is there any inherent technical limitation that a reverse DNS record must match a forward DNS record.
(Although most diligent providers do try to enforce that when they allow their customers to set up custom reverse DNS records on the public IP addresses they use.)

Setting up a fake reverse DNS record is one trick in the arsenal attackers can use to hide their tracks and/or to appear more benign when attempting to circumvent access controls.


miguelmorin

2/21/23, 4:05 PM

Thank you! The Whois query on the IP address (https://www.whois.com/whois/x.x.x.x) shows it comes from an Internet Service Provider and does not list `cache.google.com` anywhere in the records. If the requests are indeed from the Google bot, should they list a `google.com` domain name?

Bob

2/22/23, 6:33 AM

Please refer to https://developers.google.com/search/docs/advanced/crawling/verifying-googlebot for their recommendation which includes verification that the reverse DNS record used actually matches the forward record.

