Score:0

Unable to open a text file created by exiftool: There was a problem opening the file; the file you opened has some invalid characters

s.k

I get this error when opening a text file in gedit on Ubuntu 22.04: "There was a problem opening the file; the file you opened has some invalid characters".

The file was created by redirecting the output of exiftool (12.57), which I ran to extract metadata from a collection of 100,000 photos:

for photo in $(find "/media/data/photos" -type f \( -iname "*.jpg" -o -iname "*.png" \));
do
  exiftool -a -G0 -s -c '%.7f' "${photo}" >> "/media/data/output/processing.txt"
done

I also tried to open the file in Python 3.10.6 but the same kind of error occurred:

 with open('/media/data/output/processing.txt','r') as f:
     data = f.readlines()

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 3107: invalid continuation byte

How can I find out what went wrong and, if possible, debug it?

Edits

$ file processing.txt
processing.txt: ASCII text

I discovered these lines by grepping for the pipe character in my file:

grep -inr "|" processing.txt
28978:MakerNotes :    BabyAge                         : |.z.�.�.�.�.�.�.�.g.
153969:MakerNotes :    BabyAge                         : �...�.�.@.3.�.|.�.x.
239759:MakerNotes :    BabyAge                         :  .�.�.;.|.A.>.Q.=.�.
287018:MakerNotes :    BabyAge                         : �.�.�.|.�.....K.�.5.
466938:MakerNotes :    InternalSerialNumber            : |.�.x.�\.v.B...
473829:MakerNotes :    BabyAge                         : |.�.�.�...�.X.'.�.X.
475416:MakerNotes :    BabyAge                         : |.�.�.�...�.X.'.�.X.
475777:MakerNotes :    BabyAge                         : 4.�.|.1...P.:...�
477305:MakerNotes :    InternalSerialNumber            : �.|.r.|.h.�.�.O.
483418:MakerNotes :    BabyAge                         : 6.�.J.c.`.�.H.�.|.�.
557390:MakerNotes :    InternalSerialNumber            : �.�.�..|.o&.3.
604471:MakerNotes :    BabyAge                         : �.�.|.Q..A.=.9.�.�.
636619:MakerNotes :    BabyAge                         : �.�.N.�.�.�.G.�.|...
799895:MakerNotes :    InternalSerialNumber            : .|�..�.�`.�.
944200:MakerNotes :    BabyAge                         : �.�.�.|.�.r.�.�...�
A UTF-8 file would be reported as `UTF-8 Unicode text` or `UTF-8 Unicode (with BOM) text, with CRLF line terminators`. Python will probably open the file if you pass `encoding='latin-1'` to `open()`; the result will show the same characters that `grep -inr` shows.
Score:3

My first suggestion would be to try the editor notepadqq: it is less fussy about opening files that have problematic encodings (or actual data errors), and it also offers some re-encoding options for doing something practical about them.

Ultimately, though, a given situation might also require inspection at the byte level and/or a good understanding of how character encodings work.
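As a minimal sketch of byte-level inspection, the following Python reads a file in binary mode and reports which lines fail to decode as UTF-8, with a hex dump of each. The function name `find_invalid_utf8_lines` and the sample bytes are made up for illustration; they are not taken from the real processing.txt.

```python
import tempfile
from pathlib import Path


def find_invalid_utf8_lines(path):
    """Yield (line_number, byte_offset_in_line, raw_line) for each line
    that is not valid UTF-8."""
    with open(path, "rb") as f:  # binary mode: nothing is decoded on read
        for line_number, raw_line in enumerate(f, start=1):
            try:
                raw_line.decode("utf-8")
            except UnicodeDecodeError as exc:
                # exc.start is the offset of the first offending byte
                yield line_number, exc.start, raw_line


# Demo on a throwaway file; these bytes are invented for the example.
sample = b"EXIF : Make : Example\nMakerNotes : BabyAge : |\x00z\x00\xc4\x00\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.txt"
    demo_path.write_bytes(sample)
    bad_lines = list(find_invalid_utf8_lines(demo_path))

for line_number, offset, raw_line in bad_lines:
    # bytes.hex(" ") prints the raw bytes as space-separated hex pairs
    print(line_number, offset, raw_line.hex(" "))
```

Pointing the same function at the real file should give the line numbers to look at in a hex editor.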

There are command line tools for this, of course, but the program I apparently installed the last time I dealt with such files is ghex, a GUI hex editor.

I don't claim to be an expert in this, but I did write some public general notes on the topic at Grrr Character Encoding.

To understand why some byte sequences can be invalid for UTF-8 you'll need to understand both Unicode as a concept and UTF-8 as an encoding of it.
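To make that concrete, here is a small sketch of why a byte such as 0xc4 (the one in the traceback) can be rejected. In UTF-8, 0xc4 has the bit pattern 110xxxxx, which announces a two-byte sequence, so the next byte must look like 10xxxxxx. The helper `describe_byte` is a hypothetical name, and its classification is deliberately coarse.

```python
def describe_byte(b):
    """Coarsely classify one byte by its UTF-8 bit pattern.

    This looks at the leading bits only; a few values that pass here
    (0xC0, 0xC1, 0xF5 and above) are still rejected by real decoders.
    """
    if b < 0x80:
        return "ASCII, a complete character on its own"
    if b < 0xC0:
        return "continuation byte (10xxxxxx)"
    if b < 0xE0:
        return "lead byte of a 2-byte sequence (110xxxxx)"
    if b < 0xF0:
        return "lead byte of a 3-byte sequence (1110xxxx)"
    if b < 0xF8:
        return "lead byte of a 4-byte sequence (11110xxx)"
    return "never valid in UTF-8"


print(describe_byte(0xC4))

# 0xC4 followed by a proper continuation byte decodes fine (U+0100).
valid = b"\xc4\x80".decode("utf-8")

# 0xC4 followed by something else is an error.
try:
    b"\xc4\x00".decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason

print(repr(valid), reason)
```

Arbitrary binary data, such as raw maker-note values, will hit this kind of error sooner or later, because its bytes were never meant to follow these rules.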

I also have some unpublished notes about handling unexpected character encodings in Python 3, but the context there was file names, not file contents. In that context, it was the os module that "read" in the encoded string, and I could then worry about how Python 3 chose to store and interpret it. I've not yet needed to do the same for reading file contents.

I expect there are ways to alter or replace readlines so that it either decodes with the correct encoding or gracefully handles bytes that don't fit the assumed encoding. That's RTFM territory, but best done once you're clear about what's actually in the file.
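One way this can be done, as a sketch rather than a recommendation for this particular file: Python's built-in `open()` accepts an `errors` argument that controls what happens to undecodable bytes. The wrapper name `read_lines_lenient` and the sample data are hypothetical.

```python
import tempfile
from pathlib import Path


def read_lines_lenient(path, errors="backslashreplace"):
    """Read text lines as UTF-8 without raising on invalid bytes.

    errors="backslashreplace" keeps each bad byte visible as \\xNN;
    errors="replace" substitutes U+FFFD, which loses the original byte.
    """
    with open(path, "r", encoding="utf-8", errors=errors) as f:
        return f.readlines()


# Demo on a throwaway file; the bytes are invented for the example.
sample = b"Make : Example\nBabyAge : |\xc4.\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.txt"
    demo_path.write_bytes(sample)
    escaped = read_lines_lenient(demo_path)
    replaced = read_lines_lenient(demo_path, errors="replace")

print(escaped[1], end="")
print(replaced[1], end="")
```

The `backslashreplace` handler is handy for debugging because the offending byte values stay readable in the output.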

FWIW, in my experience the most likely explanation is that what you think is encoded in UTF-8 simply isn't, which leads you to the art of guessing and trying which encoding it really is. Dealing with formats that are archaic now but were normal in their day was why I wrote my article, especially material produced by problematic software that followed Microsoft's broken conventions.
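The guess-and-try approach can be sketched as below. The function name `first_working_encoding` and the candidate list are assumptions chosen for illustration. Note that latin-1 maps every possible byte to a character, so it never fails; a "success" with it proves nothing about the real encoding, and for raw binary values there may be no meaningful text encoding at all.

```python
def first_working_encoding(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes data without error,
    or None if every candidate fails."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Made-up samples: valid UTF-8, Windows-1252 text, and a byte that is
# undefined in Python's cp1252 codec.
guess_utf8 = first_working_encoding("café".encode("utf-8"))
guess_cp1252 = first_working_encoding(b"caf\xe9")
guess_fallback = first_working_encoding(b"\x81")

print(guess_utf8, guess_cp1252, guess_fallback)
```

A decode that succeeds only tells you the bytes are legal in that encoding, so it is still worth eyeballing the decoded text to see whether it makes sense.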
