Score:1

Why is Google spamming my WordPress site with dating keywords?

br flag

I have a WordPress site getting hit with over 100k requests per day, all resembling the request shown below. These GETs are coming from about 200 different IPs within the same Google netrange (66.249.x.x). There is no /search/ route on the site, but something in WordPress (Relevanssi?) must be processing the request, because the DB is logging UTF-8 collation errors, probably due to the emojis or other non-ASCII characters:

WordPress database error Illegal mix of collations (utf8_general_ci,IMPLICIT) and (utf8mb4_unicode_ci,COERCIBLE) for operation 'like' for query \n\t\t\tSELECT COUNT(DISTINCT(relevanssi.doc))\n\t\t\t\tFROM 49qi0c_relevanssi AS relevanssi\n\t\t\t\t WHERE (relevanssi.term LIKE 'berbat\xf0\x9f\xaa\x80\xe2\x9d\xa4\xef\xb8\x8f\xef\xb8\x8fwww%' OR relevanssi.term_reverse LIKE CONCAT(REVERSE('berbat\xf0\x9f\xaa\x80\xe2\x9d\xa4\xef\xb8\x8f\xef\xb8\x8fwww'), '%')) made by require('wp-blog-header.php'), wp, WP->main, WP->query_posts, WP_Query->query, WP_Query->get_posts, apply_filters_ref_array('posts_pre_query'), WP_Hook->apply_filters, relevanssi_query, relevanssi_do_query, relevanssi_search, relevanssi_search, relevanssi_generate_df_counts, QM_DB->query

I checked the Relevanssi forum and found someone posting almost the same issue. It was said to be 'harmless' and didn't appear to concern anyone, so the thread was closed. The thing is, the sheer volume of these requests is starting to slow the site down, and the errors being generated are filling up the logs on the /var/ partition. I've got the /19 from Google blocked right now, but that's probably not the right answer since it's Google (page ranking and all that). Has anyone seen this kind of thing from Google before?

GET /search/%F0%9F%AA%80BEST+DATING+SITE%E2%9D%A4%EF%B8%8F%EF%B8%8F%C4%B0ngiliz+kad%C4%B1n+i%C3%A7+%C3%A7ama%C5%9F%C4%B1r%C4%B1+gal+r%C3%B6ntgenci+%C3%B6n%C3%BCnde+berbat%F0%9F%AA%80%E2%9D%A4%EF%B8%8F%EF%B8%8FWww.MtSp.XyZ%F0%9F%AA%80%E2%9D%A4%EF%B8%8F%EF%B8%8F%C4%B0ngiliz+kad%C4%B1n+i%C3%A7+%C3%A7ama%C5%9F%C4%B1r%C4%B1+gal+r%C3%B6ntgenci+%C3%B6n%C3%BCnde+berbat+%C4%B0ngiliz+kad%C4%B1n+i%C3%A7+%C3%A7ama%C5%9F%C4%B1r%C4%B1+gal+r%C3%B6ntgenci+%C3%B6n%C3%BCnde+berbat+%C4%B0ngiliz+kad%C4%B1n+i%C3%A7+%C3%A7ama%C5%9F%C4%B1r%C4%B1+gal+r%C3%B6ntgenci+%C3%B6n%C3%BCnde+berbat/feed/rss2/?page_number_9=1&page_number_15=7&page_number_14=3&page_number_16=3&page_number_10=1&page_number_12=33&page_number_17=3&page_number_13=3&page_number_11=17 HTTP/1.1" 200 718084 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

vn flag
Are you able to give an actual example IP of one of these? This looks basically like referrer spam (some sites show "top/recent searches" in a sidebar); it might just be a compromised server in Google's cloud platform, not the actual GoogleBot. Verify it at https://developers.google.com/search/docs/advanced/crawling/verifying-googlebot.
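For reference, the check described on that page is a reverse DNS lookup followed by a forward lookup on the returned host name. Below is a minimal Python sketch of that idea, not Google's own tooling. The function name `is_verified_googlebot` and the injectable `reverse_lookup` / `forward_lookup` parameters are hypothetical, added so the logic can be exercised without network access. The suffix list is my reading of Google's documentation and should be checked against the linked page.

```python
import socket
from typing import Callable, List

# Host name suffixes that Google's verification page lists for its crawlers
# (assumption: confirm against the current documentation before relying on it).
GOOGLE_CRAWLER_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def _reverse(ip: str) -> str:
    # gethostbyaddr returns (hostname, aliases, addresses)
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    # gethostbyname_ex returns (hostname, aliases, addresses); IPv4 only
    return socket.gethostbyname_ex(host)[2]


def is_verified_googlebot(
    ip: str,
    reverse_lookup: Callable[[str], str] = _reverse,
    forward_lookup: Callable[[str], List[str]] = _forward,
) -> bool:
    """Return True if ip reverse-resolves to a Google crawler host name
    and that host name resolves back to the same ip."""
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
        if not host.endswith(GOOGLE_CRAWLER_SUFFIXES):
            return False
        return ip in forward_lookup(host)
    except OSError:
        # Covers socket.herror / socket.gaierror when a lookup fails
        return False
```

Calling `is_verified_googlebot("66.249.70.19")` with the default resolvers performs real DNS queries, so the result depends on the network it runs from. The forward-lookup step matters because anyone who controls reverse DNS for their own address space can publish a Google-looking PTR record.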
Nstevens avatar
br flag
Sure, one IP was 66.249.70.19. It's in range #21 here: https://www.gstatic.com/ipranges/goog.json. I think there were a few IPs not in that netblock, but the bulk of them were. I'll check out the info you posted. Thanks!
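Checking an address against a published range list can be scripted rather than done by eye. The sketch below uses Python's `ipaddress` module. `SAMPLE_RANGES` is an illustrative excerpt written in the shape of Google's range files (a `prefixes` list with `ipv4Prefix` / `ipv6Prefix` keys); its entries are assumptions for the example, not a copy of the live file, and `find_matching_prefix` is a hypothetical helper name.

```python
import ipaddress
from typing import Optional

# Illustrative data only: same shape as the published JSON, but the
# entries are assumptions for this example, not the live file.
SAMPLE_RANGES = {
    "prefixes": [
        {"ipv4Prefix": "66.249.64.0/19"},
        {"ipv6Prefix": "2001:4860::/32"},
    ]
}


def find_matching_prefix(ip: str, ranges: dict) -> Optional[str]:
    """Return the first CIDR prefix in ranges that contains ip, else None."""
    addr = ipaddress.ip_address(ip)
    for entry in ranges.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr is None:
            continue
        network = ipaddress.ip_network(cidr)
        # Only compare addresses and networks of the same IP version
        if addr.version == network.version and addr in network:
            return cidr
    return None
```

To run it against real data, load the downloaded JSON with `json.load` and pass the resulting dict in place of `SAMPLE_RANGES`. A match only shows that the address belongs to Google's address space, which also includes cloud customers, so it does not by itself prove the client is Googlebot.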
vn flag
Interesting; it tracks back to GoogleBot. I wonder if someone's abusing the "crawl as GoogleBot" in the Google Search Console, or making a page pointing at these search URLs that Google crawls and assumes is good faith.
Nstevens avatar
br flag
Not sure what that feature is, but I'll run it by our WP admin. He was concerned that something in his SEO plugin may be telling Google to crawl the site for those terms. Someone else suggested these may be Google App Engine hosts (VM hosting?). I'm not well versed in Google's services, but it seems like a possible fit.
in flag
Use Google Search Console; it might tell you why, or at least give you some hints after a few days.
jp flag
Add `/search/` to `robots.txt`.
Nstevens avatar
br flag
I wish it were that easy. `robots.txt` is purely discretionary. Any client is free to ignore it.
vn flag
@Nstevens While true, GoogleBot **definitely** respects it.
Nstevens avatar
br flag
Ah, OK. I see what you're saying now @AlexD. Thanks.
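A quick way to sanity-check a `robots.txt` rule before deploying it is Python's standard `urllib.robotparser`. This is a sketch under assumptions: the `User-agent: *` group and the exact file contents are hypothetical, and only the `/search/` path comes from the thread. The parser applies simple prefix matching, which is enough for this rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; only the /search/ path is from the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path: str, agent: str = "Googlebot") -> bool:
    """Return True if the given agent may fetch path under ROBOTS_TXT."""
    return parser.can_fetch(agent, path)


print(allowed("/search/spam+terms/feed/rss2/"))
print(allowed("/about/"))
```

The first call covers the kind of URL seen in the access log and the second an ordinary page. Note that `robots.txt` stops compliant crawlers from fetching the URLs; it does nothing about clients that ignore it.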
Score:1
ag flag

I blocked them in robots.txt:

Disallow: /*?s=*

I had a lot of requests similar to the following from Googlebot:

https://example.com/es/?s=%20Levitra%2010mg%20filmtabletten%20rezeptfrei%20Viagra%20original%20bei%20pfizer%20100mg%20kaufen%20in%20deutschland%F0%9F%92%88%E2%9C%97%20www.MayoClinic.store%20%E2%9C%97%F0%9F%92%88%20Rezeptfrei%20viagra%20oder%20%C3%A4hnliche%20mittel%20Kamagra%20in%20pattaya%20kaufen%20Kamagra%20deutschland%20100mg%20online%20kaufen%20kaufen%20Cialis%20sicher%20kaufen%20forum
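To see which URLs a wildcard rule such as `/*?s=*` would cover, the pattern can be translated into a regular expression. This is a rough sketch of the wildcard semantics Google documents for `robots.txt` (`*` matches any run of characters, a trailing `$` anchors the end, matching starts at the beginning of the path). It is not Google's matcher, and the helper names are hypothetical. As far as I know, Python's built-in `urllib.robotparser` does not handle these wildcards, hence the hand-rolled version.

```python
import re


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern with * and optional trailing $
    into a compiled regex anchored at the start of the path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each wildcard
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(path_and_query: str, pattern: str) -> bool:
    """Return True if the path (including query string) matches the pattern."""
    return robots_pattern_to_regex(pattern).match(path_and_query) is not None


print(is_blocked("/es/?s=%20Levitra%2010mg", "/*?s=*"))
print(is_blocked("/es/about/", "/*?s=*"))
print(is_blocked("/es/?page=2&s=spam", "/*?s=*"))
```

The third call is the case worth checking: under this reading the rule only catches `s` when it is the first query parameter, so a URL with `&s=` later in the query string would need its own `Disallow` line.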

Now they are gone; the last one was on Dec 3, 2022.

These spam requests had two downsides for me:

  1. They were wasting my crawl budget.

  2. I am an Ezoic partner, and Ezoic has an app for monitoring published content, called Objectionable Content. There you can see a list of pages that may have objectionable content on them.

In my case I had a long list of 404 error pages from these spam requests. Now that they are gone, my site is clean.
