SpamAssassin unable to read Japanese when it is HTML-encoded

I would like to block emails that contain certain Japanese words, but SpamAssassin fails to detect those words when the text is written as HTML numeric entities, for example:

This is a multi-part message in MIME format.
--------------050206070005060005050706
Content-Type: text/plain; charset=ISO-2022-JP; format=flowed
Content-Transfer-Encoding: quoted-printable

こんにちは!残念な&=
#12364;ら凶報がございま&#=
12377;。数ヶ月前、あな...

--------------050206070005060005050706
Content-Type: text/html; charset="ISO-2022-JP"
Content-Transfer-Encoding: quoted-printable

<html>
  <head>

    <meta http-equiv=3D"content-type" content=3D"text/html; =
charset=3DISO-2022-JP">
  </head>
  <body bgcolor=3D"#FFFFFF" text=3D"#000000">
    &#12371;&#12435;&#12395;&#12385;&#12399;&#65281;</br>
</br>
&#27531;&#24565;&#12394;&#12364;&#12425;&#20982;&#22577;&#12364;&#12372;&=
#12374;&#12356;&#12414;&#12377;&#12290;</br>
...
  </body>
</html>
--------------050206070005060005050706--

Example rule in SpamAssassin:

body     JAP_BAD_1  /残念ながら凶報がございます/
score    JAP_BAD_1  5.0

However, when I run the test:

spamassassin -D textcat -t spam.test

The rule does not show up as a match. What do I have to do?

anx
Any reason to not simply reject *all* mail with numeric html entities in supposedly `text/plain` type parts?
lepe
@anx I'm not sure whether doing that might reject legitimate messages.
AdamKatz
I'm not really an expert on [ISO-2022-JP](https://en.wikipedia.org/wiki/ISO/IEC_2022#ISO-2022-JP), but it is my understanding that this encoding uses escape codes while your sample instead uses high-value character codes via HTML entities. Were this Unicode, these would be [Cuneiform](https://en.wikipedia.org/wiki/Cuneiform_(Unicode_block)) signs (starting with `` assuming your font can render that), though `&#65281;` is not defined by Unicode afaict.
lepe
@AdamKatz If you decode the HTML entities, for example, with this [tool](https://mothereff.in/html-entities), you will find that `&#12371;&#12435;&#12395;&#12385;&#12399;&#65281;` is actually `こんにちは!`.
AdamKatz
Hah, I was thinking in hexadecimal. Still, that's not a typical use of ISO-2022-JP by my understanding, since ISO-2022-JP would be filled with escape characters.
lepe
@AdamKatz You are probably right. I'm not familiar with ISO-2022-JP.
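
For anyone who wants to check the decoding discussed in the comments without an online tool, here is a minimal Python sketch. It assumes the fragment is copied from the `text/html` part of the sample message above, and it uses only the standard library (`quopri` for the quoted-printable soft line break, `html.unescape` for the numeric entities). The function name `decode_fragment` is made up for illustration. This only shows what the entities stand for; it does not reproduce how SpamAssassin itself renders a message body.

```python
import html
import quopri

# Fragment copied from the text/html part of the sample message.
# The trailing "=" on the first line is a quoted-printable soft line break.
raw = (
    b"&#27531;&#24565;&#12394;&#12364;&#12425;&#20982;&#22577;&#12364;&#12372;&=\n"
    b"#12374;&#12356;&#12414;&#12377;&#12290;"
)


def decode_fragment(data: bytes) -> str:
    """Undo quoted-printable folding, then resolve HTML numeric entities."""
    # Step 1: quoted-printable decoding joins the two physical lines.
    unfolded = quopri.decodestring(data).decode("ascii")
    # Step 2: each &#NNNNN; becomes the Unicode character with that
    # decimal code point.
    return html.unescape(unfolded)


decoded = decode_fragment(raw)
print(decoded)

# The phrase used in the JAP_BAD_1 rule is contained in the decoded text.
print("残念ながら凶報がございます" in decoded)
```

The second `print` shows whether the phrase from the example rule appears once both encoding layers are removed, which is the text a rule would need to see in order to match.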