Score:0

SpamAssassin and UTF-8 Base64-encoded subjects


I have a problem with some spam messages whose Subject field is encoded as UTF-8 Base64 and contains unusual characters meant to fool the filter rules.

Example:

Raw subject of the incoming email:

Subject: =?UTF-8?B?UklGSVVU0J4gREkgUklOTtCeVtCe?=#821538

As decoded by SpamAssassin, it contains the character О instead of O:

__SUBJ_NOT_SHORT ======> got hit: "RIFIUTО DI RINNOVO"

so the rule does not trigger:

header     __SUBJECT_PHISHING_3     Subject=~ /(RIFIUTО DI RINNОVО)/i

However, email clients (Outlook or Thunderbird) display these characters as an O, so the subject reads as correct Italian and fools the user:

RIFIUTО DI RINNОVО

So the spammer inserts unusual characters, knowing that the client will show them as correct Italian while SpamAssassin will not trigger the rule.

Is there a way to match these characters, or to decode them the way the email client does, without having to create a new rule every time the spammer inserts a special character to bypass the filter?

I found the same problem described, with some hints, at https://users.spamassassin.apache.narkive.com/LhGDKXkm/utf-8-spam-rules

anx
What do you mean by *correctly*? The email header unambiguously instructs the reader to treat the Base64-encoded payload as UTF-8.
hcomputer
Hi, the spammer uses these special characters, `RIFIUTО DI RINNOVO` (О instead of O), which the mail client displays as `Rifiuto di rinnovo`, correct Italian. So if I create a rule to block emails with the subject `Rifiuto di rinnovo`, the spammer manages to bypass it. I would like to understand whether SpamAssassin has a way to decode special characters into the defined language (Italian), so that I do not have to create ad hoc rules every time a newly modified subject arrives.
Score:1

I don't think there is an easy solution for this.

The problem here is that the email client correctly decodes the Base64-encoded text as containing not an "O" (as in "Latin capital letter O") but a Cyrillic one ("Cyrillic capital letter O"). The former is U+004F, the latter is U+041E.
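To see the difference concretely, the subject from the question can be decoded and inspected code point by code point. This is a minimal Python sketch using only the standard library; the helper names `decode_subject` and `non_ascii_chars` are made up for illustration and are not part of SpamAssassin.

```python
import unicodedata
from email.header import decode_header

# Raw header value copied from the question.
RAW_SUBJECT = "=?UTF-8?B?UklGSVVU0J4gREkgUklOTtCeVtCe?=#821538"


def decode_subject(raw: str) -> str:
    """Decode RFC 2047 encoded words into a single Unicode string."""
    parts = []
    for payload, charset in decode_header(raw):
        if isinstance(payload, bytes):
            # Unencoded fragments come back with charset None; treat as ASCII.
            parts.append(payload.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(payload)
    return "".join(parts)


def non_ascii_chars(text: str):
    """Return (char, code point, Unicode name) for every non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


if __name__ == "__main__":
    subject = decode_subject(RAW_SUBJECT)
    print(subject)
    for ch, code_point, name in non_ascii_chars(subject):
        print(code_point, name)
```

Every "O" in this particular subject turns out to be U+041E; the decoded text contains no Latin U+004F at all.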

So your regexp will not match, simply because for the regexp parser (and for programs in general), those two characters are not the same. For a human, they are, since they look exactly like one another, so it doesn't really matter which one is displayed. I'm not aware of any simple solution which allows you to match texts based on appearance.
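One workaround, outside SpamAssassin itself, is to fold known look-alike characters to their Latin counterparts before matching. The Python sketch below assumes a small hand-written table (`CONFUSABLES` is hypothetical and far from complete; the Unicode consortium publishes a much larger confusables list) and only illustrates the idea of matching on a normalized form rather than on appearance.

```python
import re

# Hypothetical, deliberately tiny table of Cyrillic letters that look like
# Latin ones. A real deployment would need a far larger mapping.
CONFUSABLES = str.maketrans({
    "\u0410": "A", "\u0412": "B", "\u0415": "E", "\u041a": "K",
    "\u041c": "M", "\u041d": "H", "\u041e": "O", "\u0420": "P",
    "\u0421": "C", "\u0422": "T", "\u0425": "X",
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p",
    "\u0441": "c", "\u0445": "x",
})

# Pattern written with plain Latin letters only.
PATTERN = re.compile(r"RIFIUTO DI RINNOVO", re.IGNORECASE)


def fold(text: str) -> str:
    """Replace known look-alike characters with their Latin counterparts."""
    return text.translate(CONFUSABLES)


def matches(subject: str) -> bool:
    """Match the pattern against the folded subject."""
    return PATTERN.search(fold(subject)) is not None


if __name__ == "__main__":
    spoofed = "RIFIUT\u041e DI RINN\u041eV\u041e#821538"
    print(PATTERN.search(spoofed) is not None)  # direct match on the raw text
    print(matches(spoofed))                     # match after folding
```

The table has to be maintained by hand, so this only moves the problem: a spammer can switch to a look-alike that is not yet in the mapping.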

By the way, SpamAssassin should recognize the Cyrillic character and should have displayed that instead of the garbage "О" (but, truth be told, that would have been even more confusing). You should check the server's default character encoding.
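One plausible explanation for the garbage, offered as an assumption rather than a diagnosis of this particular server, is that the two UTF-8 bytes of the Cyrillic letter were interpreted as a single-byte encoding such as Latin-1. The Python sketch below reproduces that effect and shows that it is reversible.

```python
# Cyrillic capital letter O, as found in the spoofed subject.
cyrillic_o = "\u041e"

# Its UTF-8 representation is the two bytes 0xD0 0x9E.
utf8_bytes = cyrillic_o.encode("utf-8")

# Reading those bytes as Latin-1 yields two characters:
# U+00D0 (shown as "Ð") followed by the control character U+009E.
mojibake = utf8_bytes.decode("latin-1")

# Reversing the wrong decoding step recovers the original letter.
repaired = mojibake.encode("latin-1").decode("utf-8")

if __name__ == "__main__":
    print(utf8_bytes)
    print([f"U+{ord(ch):04X}" for ch in mojibake])
    print(repaired == cyrillic_o)
```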
