Score:1

How to delete emojis from YouTube filenames?


I'm trying to remove emoji from this YouTube filename:

وسائل الاتصال الحديثة & -eYrBcHOx2Jf.mp4

I'm using `perl -p -e 's/[^[:ascii:]]//g'` and `tr -dc '[:print:]'`, but both also strip the Arabic text, and I get this:

& -eYrBcHOx2Jf.mp4

How can I delete the emoji but keep the Arabic characters?

Cyrus: `| tr -d ''`?
s3idani: @Cyrus This was just an example; I have about a hundred videos with many emoji characters in their filenames.
SridharSarnobat: I have the same problem using `yt-dlp` extensively :)
s3idani: @SridharSarnobat `yt-dlp` has a `--restrict-filenames` option that can do the trick.
SridharSarnobat: @s3idani - oh fantastic, I wish I had known this sooner. Thank you so much.
Score:1

I'm not sure of the current status of multi-byte character support in GNU tr; as far as I know it still operates on single bytes.

In perl, you will need to make at least the stdin and stdout streams UTF-8 aware using the -C option (see perlrun). You can then use Unicode properties as described in the perluniprops documentation; there is even a \p{Emoji} property. So for example:

$ printf '%s\n' 'وسائل الاتصال الحديثة  &   -eYrBcHOx2Jf.mp4' | 
    perl -C -pe '$_ =~ s/[\p{Emoji}]//g'
وسائل الاتصال الحديثة  &   -eYrBcHOxJf.mp

Unfortunately it looks like \p{Emoji} includes at least the decimal digits (note the missing 2 and 4 in the output above), although you can exclude those using the (currently experimental) regex_sets feature, for example:

$ printf '%s\n' 'وسائل الاتصال الحديثة  &   -eYrBcHOx2Jf.mp4' | 
    perl -Mexperimental=regex_sets -C -pe 's/(?[\p{Emoji} - \p{ASCII}])//g'
وسائل الاتصال الحديثة  &   -eYrBcHOx2Jf.mp4
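
If you would rather not enable an experimental feature, the same set difference can probably be written as a double negative in an ordinary bracketed class: `[^\P{Emoji}\p{ASCII}]` matches any character that is neither a non-emoji nor ASCII, i.e. a non-ASCII emoji. A minimal sketch, assuming a perl recent enough to know `\p{Emoji}` (5.32 or later, as far as I know). `strip_emoji` is a hypothetical helper name, and `-CSD` is used instead of bare `-C` because bare `-C` only enables UTF-8 handling when the locale says so:

```shell
# strip_emoji: hypothetical helper; reads text on stdin, writes it to stdout
# with every non-ASCII character that has the Emoji property removed.
strip_emoji() {
    perl -CSD -pe 's/[^\P{Emoji}\p{ASCII}]//g'
}

# U+1F4F1 (mobile phone) is spelled out as octal UTF-8 bytes so the example
# does not depend on the editor or terminal preserving the emoji.
printf 'video \360\237\223\261 2.mp4\n' | strip_emoji
```

The ASCII digit in the sample name survives, while the emoji between the two spaces is dropped.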

At least in File::Rename version 1.30, you can make the perl-based rename command encoding-aware in a similar manner to perl's -C via its -u option:

   -u, --unicode [encoding]
           Treat filenames as perl (unicode) strings when running the
           user-supplied code.

           Decode/encode filenames using encoding, if present.

           encoding is optional: if omitted, the next argument should be
           an option starting with '-', for instance -e.

So given

$ ls *.mp4
'وسائل الاتصال الحديثة  &   -eYrBcHOx2Jf.mp4'

then

$ rename -n -u utf8 'use experimental qw(regex_sets); s/(?[\p{Emoji} - \p{ASCII}])//g' *.mp4
rename(وسائل الاتصال الحديثة  &   -eYrBcHOx2Jf.mp4, وسائل الاتصال الحديثة  &   -eYrBcHOx2Jf.mp4)

Alternatively, you could specify the character classes to keep, e.g.

$ rename -n -u utf8 's/[^\p{ASCII}\p{Arabic}]//g' *.mp4
rename(وسائل الاتصال الحديثة  &   -eYrBcHOx2Jf.mp4, وسائل الاتصال الحديثة  &   -eYrBcHOx2Jf.mp4)

or

$ rename -n -u utf8 'use experimental qw(regex_sets); s/(?[\P{ASCII} & \P{Arabic}])//g' *.mp4
rename(وسائل الاتصال الحديثة  &   -eYrBcHOx2Jf.mp4, وسائل الاتصال الحديثة  &   -eYrBcHOx2Jf.mp4)

which doesn't seem to have the same issue with the 4 character.

Score:1

I was able to delete the emoji and keep the Arabic characters in the filename using sed:

echo "وسائل الاتصال الحديثة & -eYrBcHOx2Jf.mp4" | sed 's/\xf0\x9f/\r&/g; s/\s*\r.//g'

output

وسائل الاتصال الحديثة & -eYrBcHOx2Jf.mp4

I know this might not be the best or cleanest way, but it fixes my current issue.
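
As far as I can tell, it works because every code point from U+1F000 to U+1FFFF, where most pictographic emoji live, starts with the bytes `f0 9f` in UTF-8. Older symbols such as U+2764 (heavy black heart) are encoded differently and would be missed. A quick way to check the bytes of a given character (the octal escapes below are just those two characters spelled out byte by byte):

```shell
# U+1F4F1 (mobile phone): a 4-byte sequence starting with f0 9f
printf '\360\237\223\261' | od -An -tx1

# U+2764 (heavy black heart): a 3-byte sequence starting with e2
printf '\342\235\244' | od -An -tx1
```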
