I'm not sure of the status of multi-byte character support in GNU tr.
In perl, you will need to make at least the stdin and stdout streams UTF-8 aware using the -C perlrun option. You can then use Unicode properties as described in the perluniprops documentation - there is even an \p{Emoji} codepoint group. So for example:
$ printf '%s\n' 'وسائل الاتصال الحديثة & -eYrBcHOx2Jf.mp4' |
perl -C -pe '$_ =~ s/[\p{Emoji}]//g'
وسائل الاتصال الحديثة & -eYrBcHOxJf.mp
Unfortunately it looks like \p{Emoji} includes at least the decimal digits (note that the 2 and the 4 were removed from the output above), although you can exclude those using the (currently experimental) regex_sets feature, for example:
$ printf '%s\n' 'وسائل الاتصال الحديثة & -eYrBcHOx2Jf.mp4' |
perl -Mexperimental=regex_sets -C -pe 's/(?[\p{Emoji} - \p{ASCII}])//g'
وسائل الاتصال الحديثة & -eYrBcHOx2Jf.mp4
At least in File::Rename version 1.30, you can make the perl-based rename command encoding-aware, in a similar manner to perl's -C, via its -u option:
-u, --unicode [encoding]
Treat filenames as perl (unicode) strings when running the
user-supplied code.
Decode/encode filenames using encoding, if present.
encoding is optional: if omitted, the next argument should be
an option starting with '-', for instance -e.
So given
$ ls *.mp4
'وسائل الاتصال الحديثة & -eYrBcHOx2Jf.mp4'
then
$ rename -n -u utf8 'use experimental qw(regex_sets); s/(?[\p{Emoji} - \p{ASCII}])//g' *.mp4
rename(وسائل الاتصال الحديثة & -eYrBcHOx2Jf.mp4, وسائل الاتصال الحديثة & -eYrBcHOx2Jf.mp4)
You could instead specify the sets of characters to keep, for example:
$ rename -n -u utf8 's/[^\p{ASCII}\p{Arabic}]//g' *.mp4
rename(وسائل الاتصال الحديثة & -eYrBcHOx2Jf.mp4, وسائل الاتصال الحديثة & -eYrBcHOx2Jf.mp4)
or
$ rename -n -u utf8 'use experimental qw(regex_sets); s/(?[\P{ASCII} & \P{Arabic}])//g' *.mp4
rename(وسائل الاتصال الحديثة & -eYrBcHOx2Jf.mp4, وسائل الاتصال الحديثة & -eYrBcHOx2Jf.mp4)
which doesn't seem to have the same issue with the 4 character.