SpamAssassin unable to read Japanese when it is HTML-encoded

lepe

3/18/23, 2:21 AM

I would like to block emails that contain certain Japanese words, but SpamAssassin fails to detect such words when the email is HTML-encoded (the Japanese text is written as numeric character references), for example:

This is a multi-part message in MIME format.
--------------050206070005060005050706
Content-Type: text/plain; charset=ISO-2022-JP; format=flowed
Content-Transfer-Encoding: quoted-printable

&#12371;&#12435;&#12395;&#12385;&#12399;&#65281;&#27531;&#24565;&#12394;&=
#12364;&#12425;&#20982;&#22577;&#12364;&#12372;&#12374;&#12356;&#12414;&#=
12377;&#12290;&#25968;&#12534;&#26376;&#21069;&#12289;&#12354;&#12394;...

--------------050206070005060005050706
Content-Type: text/html; charset="ISO-2022-JP"
Content-Transfer-Encoding: quoted-printable

<html>
  <head>

    <meta http-equiv=3D"content-type" content=3D"text/html; =
charset=3DISO-2022-JP">
  </head>
  <body bgcolor=3D"#FFFFFF" text=3D"#000000">
    &#12371;&#12435;&#12395;&#12385;&#12399;&#65281;</br>
</br>
&#27531;&#24565;&#12394;&#12364;&#12425;&#20982;&#22577;&#12364;&#12372;&=
#12374;&#12356;&#12414;&#12377;&#12290;</br>
...
  </body>
</html>
--------------050206070005060005050706--

Example rule in SpamAssassin:

body     JAP_BAD_1  /残念ながら凶報がございます/
score    JAP_BAD_1  5.0

However, when I run the test:

spamassassin -D textcat -t spam.test

The rule does not show up as a match. What do I have to do?
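For reference, the sample can be decoded by hand to confirm that it really contains the phrase from the rule. The following is a minimal Python sketch that runs outside SpamAssassin and says nothing about how SpamAssassin itself processes the message. It takes the first lines of the `text/plain` part (cut off after `&#12290;`, which is an assumption made to keep the example short), undoes the quoted-printable soft line breaks, and then decodes the numeric character references.

```python
import html
import quopri

# Start of the text/plain part from the sample message. Each trailing "="
# is a quoted-printable soft line break. The third line is truncated after
# "&#12290;" for this illustration.
qp_body = (
    b"&#12371;&#12435;&#12395;&#12385;&#12399;&#65281;&#27531;&#24565;&#12394;&=\n"
    b"#12364;&#12425;&#20982;&#22577;&#12364;&#12372;&#12374;&#12356;&#12414;&#=\n"
    b"12377;&#12290;"
)

# Step 1: undo quoted-printable, which rejoins the soft-wrapped lines.
entity_text = quopri.decodestring(qp_body).decode("ascii")

# Step 2: turn the decimal character references into real characters.
decoded = html.unescape(entity_text)

phrase = "残念ながら凶報がございます"  # the pattern used in JAP_BAD_1
print(decoded)
print(phrase in decoded)
```

The phrase only appears after both decoding steps, so a pattern written in literal Japanese can match only text that has already been through both of them.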


spam

spam-filter

spamassassin

anx

3/28/23, 10:09 AM

Any reason to not simply reject *all* mail with numeric html entities in supposedly `text/plain` type parts?

lepe

3/29/23, 1:46 AM

@anx I'm not sure if doing that may reject authentic messages.
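As a rough way to explore that trade-off, here is a minimal Python sketch of the check anx describes, run outside SpamAssassin with the standard `email` package. The function name `plain_part_has_entities` and the `threshold` parameter are hypothetical. The threshold is one possible way to avoid flagging a legitimate message that merely mentions an entity or two, and the value 5 is an arbitrary assumption.

```python
import re
from email import message_from_string, policy

# Decimal (&#12371;) or hexadecimal (&#x3053;) numeric character references.
NUMERIC_ENTITY = re.compile(r"&#(?:[0-9]+|[xX][0-9a-fA-F]+);")


def plain_part_has_entities(raw_message: str, threshold: int = 5) -> bool:
    """Return True if any text/plain part holds at least `threshold` numeric entities."""
    msg = message_from_string(raw_message, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        # get_content() undoes the transfer encoding and the declared charset.
        text = part.get_content()
        if len(NUMERIC_ENTITY.findall(text)) >= threshold:
            return True
    return False


suspicious = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=ISO-2022-JP\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "&#12371;&#12435;&#12395;&#12385;&#12399;&#65281;\n"
)
legitimate = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=UTF-8\n"
    "\n"
    "In HTML, write &#169; to get a copyright sign.\n"
)

print(plain_part_has_entities(suspicious))
print(plain_part_has_entities(legitimate))
```

Whether any threshold is safe depends on the mail a given server actually receives, so this is better suited to measuring how often such messages occur than to rejecting them outright.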

Adam Katz

5/13/23, 3:55 PM

I'm not really an expert on [ISO-2022-JP](https://en.wikipedia.org/wiki/ISO/IEC_2022#ISO-2022-JP), but it is my understanding that this encoding uses escape codes while your sample instead uses high-value character codes via HTML entities. Were this Unicode, these would be [Cuneiform](https://en.wikipedia.org/wiki/Cuneiform_(Unicode_block)) signs (starting with `𒍱` assuming your font can render that), though `&#65281;` is not defined by Unicode afaict.

lepe

5/14/23, 1:58 AM

@AdamKatz If you decode the HTML entities, for example, with this [tool](https://mothereff.in/html-entities), you will find that `&#12371;&#12435;&#12395;&#12385;&#12399;&#65281;` is actually `こんにちは！`.

Adam Katz

5/14/23, 6:11 AM

Hah, I was thinking in hexadecimal. Still, that's not a typical use of ISO-2022-JP by my understanding, since ISO-2022-JP would be filled with escape characters.
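The mix-up is easy to reproduce. In a minimal Python sketch (standard library only, variable names chosen for illustration), the same digits give a hiragana letter when read as decimal, which is what the `&#N;` syntax means, and a code point in the Cuneiform block when read as hexadecimal, which would need the `&#xN;` syntax:

```python
import html
import unicodedata

digits = "12371"

# "&#12371;" is a decimal reference: code point 12371 == U+3053.
as_decimal = html.unescape(f"&#{digits};")

# Reading the same digits as hexadecimal gives U+12371 instead,
# which is what "&#x12371;" would produce.
as_hex = chr(int(digits, 16))

print(as_decimal, unicodedata.name(as_decimal))
print(hex(ord(as_hex)), 0x12000 <= ord(as_hex) <= 0x123FF)  # Cuneiform block range

# 65281 decimal is U+FF01 (fullwidth exclamation mark), while U+65281
# is an unassigned code point (general category "Cn").
print(unicodedata.name(chr(65281)))
print(unicodedata.category(chr(0x65281)))
```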

lepe

5/14/23, 7:03 AM

@AdamKatz You are probably right. I'm not familiar with ISO-2022-JP.

