Score:1

Ubuntu 20.04.2 MySQL upgrade mangled our data...how to fix?

mx flag

We upgraded from Ubuntu Server 18.04.5 LTS to Ubuntu Server 20.04.2 LTS. Everything seemed fine until today when we noticed that one of our applications was no longer working. MySQL appears to have mangled Unicode characters (stored as UTF-8) in the database when the database software was upgraded.

For example, UTF-8 character:

0xF0 0x9F 0x8C 0xB5

Became:

0xC3 0xB0 0xC5 0xB8 0xC5 0x92 0xC2 0xB5

Looks like something during the upgrade double-encoded the data as UTF-8 even though the data was already valid UTF-8.

The character set on the table prior to the upgrade was 'utf8'. Now the character set of the table is 'utf8mb3'.

Not sure what to do here to fix the problem.

in flag
Is the issue with a Django application?
mx flag
No. Also not sure why someone's downvoting this. Downvotes are only for "not showing any research or effort, unclear or not useful." None of that applies. I've done research and the best I've found is: https://stackoverflow.com/questions/11436594/how-to-fix-double-encoded-utf8-characters-in-an-utf-8-table But a single SQL query is a far cry from an actual solution. The upgrade to Ubuntu upgraded MySQL to v8.0 and changed the tables and the data in them in such a way that is broken.
in flag
The reason I asked about the use of Django is because some systems, such as that, have very particular expectations from the DB engine based on the connection configurations. You said this is happening to *one* system, not all, so the assumption was a rigid application was in need of an update. UTF8 (in MySQL 5.x and lower) has been deprecated in favour of Utf8mb4, but a table rebuild is required to convert legacy database tables to the current character set. As the question stands, there is an insufficient amount of detail to offer any specific suggestions
mx flag
It's almost certainly not just restricted to one application. It's just the one application that we've noticed so far that has a tendency to use Unicode (emojis, in particular). The data is serialized in the database (which we do a LOT of) and the deserializer is barfing on the data because MySQL is returning double-encoded UTF-8 characters. Using the MySQL client itself shows that the data is double-encoded (i.e. not a problem with the application).
mx flag
The point is that anyone who runs MySQL in Ubuntu Server LTS 18.04 and upgrades to 20.04 and has tables in their database with a 'utf8' charset is going to run into the exact same problem. Therefore, fully relevant and useful and needs a working fix.
mx flag
Okay, after wrestling with this for hours, I have a PARTIAL solution: First convert the table character set to `utf8mb4` via `ALTER TABLE tablename CONVERT TO CHARACTER SET utf8mb4`. Then for each impacted column, use the SQL query from the SO post to alter the data in the column. Do all of that first in a separate, cloned set of database tables so as to avoid accidentally making anything worse. But as I said, this is a partial, tentative solution that *appears* to make the data valid but it's a far cry from an answer. I'm going to need to build a tool to identify how widespread the issue is.
mx flag
Okay, after MUCH more time, I have processed all of our data. The above method works, sort of. I *highly* recommend exporting your data using a `utf8` connection, doing the MySQL upgrade, altering just fields with Unicode characters, and then restoring corrupt data from the exported data (i.e. your export is correct, MySQL occasionally does it wrong). Because https://stackoverflow.com/questions/11436594/how-to-fix-double-encoded-utf8-characters-in-an-utf-8-table WILL absolutely corrupt data somewhere in the database! Processing millions of rows of data is how I know that to be true.
mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.