How safe is my pseudonymization procedure?

Igor stands with Ukraine

6/16/23, 8:14 AM

I work for an institution where patient data is collected and I am supposed to encrypt it. At the moment I do the following steps (with R):

Randomly assign an ID to each patient. The procedure avoids duplicates (using sample(), among other things).
Create a salt for each patient (using salt <- bcrypt::gensalt(log_rounds = 5)).
Create a hashed ID for each patient from the ID and the salt (using id_hashed <- bcrypt::hashpw(id, salt = salt)).
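The three steps above can be sketched roughly as follows. This is only an illustration, not the actual code from the post: the ID range `id_pool` and the choice of three patients are assumptions, and `set.seed()` is there only to make the sketch reproducible. It also shows one property worth knowing: in the standard bcrypt string format the salt is the prefix of the resulting hash.

```r
library(bcrypt)

# Hypothetical pool of possible IDs; the post does not state the real range.
id_pool <- 100000:999999

set.seed(1)  # reproducibility of the sketch only; not something to do in production
ids <- sample(id_pool, size = 3)  # sampling without replacement avoids duplicates

# One salt per patient, generated with the cost factor from the post.
salts <- vapply(ids, function(i) gensalt(log_rounds = 5), character(1))

# hashpw() works on a password string, so the numeric ID is converted first.
ids_hashed <- mapply(function(id, salt) hashpw(as.character(id), salt = salt),
                     ids, salts, USE.NAMES = FALSE)

# The salt is embedded at the start of each bcrypt hash string,
# so whoever can read the hash can also read the salt.
salt_is_prefix <- all(startsWith(ids_hashed, salts))

# checkpw() re-derives the hash using the salt embedded in the hash string.
first_id_ok <- checkpw(as.character(ids[1]), ids_hashed[1])
```

Because the salt travels with the hash, keeping the salts in a separate file does not by itself hide them from someone who can read file one.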

I save the data in three different files

The first file contains pairs of patient data (name and birthdate) and the hashed ID.
The second file contains pairs of the plain (unhashed) IDs and their salts.
The third file is the actual database with the plain IDs and a number of variables of interest (e.g. smoker, weight, ...).

In practice this will be used as follows:

While working in the database (third file) we know the IDs but not the names of the patients. Sometimes we need to find out which person an ID belongs to. I wrote an app (shinyApp) where we can type in the ID and the app returns the name and birthdate. For this the app goes into the second file, takes the ID and the corresponding salt and generates the hashed ID. This hashed ID is compared to the ones in file one. The app returns the name and birthdate of the patient with the same hashed ID as the one just created.
If a patient comes to us and we want to collect new data, we know their name and birthdate but not their ID. In this case we can type the name and birthdate into the app and find the corresponding ID. For this the app goes to the second file and uses the IDs and salts to create hashed IDs. While doing so, the app checks whether each hashed ID matches the one stored for that patient in file one. If it does, we have found the patient's ID. This process takes a while because the app needs to go through every ID and salt pair until the correct hashed ID is found.
If we have a new patient, we can type in his name and birthdate into the app. This automatically generates an entry in file one (name + birthdate and hashed ID) and in file two (ID and salt).
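The two lookups described above could look roughly like this. It is a sketch with in-memory data frames standing in for the files; the column names, the example patients and the function names `id_to_person()` and `person_to_id()` are all made up for illustration.

```r
library(bcrypt)

# Hypothetical stand-ins for file two (ID + salt) and file one (person + hashed ID).
ids   <- c("4711", "0815")
salts <- c(gensalt(log_rounds = 5), gensalt(log_rounds = 5))
file2 <- data.frame(id = ids, salt = salts, stringsAsFactors = FALSE)
file1 <- data.frame(
  name      = c("Alice Example", "Bob Example"),
  birthdate = c("1980-01-01", "1975-05-05"),
  id_hashed = c(hashpw(ids[1], salt = salts[1]), hashpw(ids[2], salt = salts[2])),
  stringsAsFactors = FALSE
)

# ID -> person: one hash computation, then a match against file one.
id_to_person <- function(id) {
  salt <- file2$salt[file2$id == id]
  h <- hashpw(id, salt = salt)
  file1[file1$id_hashed == h, c("name", "birthdate")]
}

# Person -> ID: read the stored hash from file one, then try every
# ID/salt pair in file two until the recomputed hash matches.
person_to_id <- function(name, birthdate) {
  target <- file1$id_hashed[file1$name == name & file1$birthdate == birthdate]
  if (length(target) != 1) return(NA_character_)  # unknown or ambiguous person
  for (i in seq_len(nrow(file2))) {
    if (hashpw(file2$id[i], salt = file2$salt[i]) == target) return(file2$id[i])
  }
  NA_character_
}
```

The second lookup is linear in the number of patients, which matches the observation that it "takes a while".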

Question: Is there some obvious pitfall in this procedure? If you could name a weakness and how to resolve it, that would be great. Please be gentle since I am new to this.

Notes:

I know that there is no theoretical need for the randomly generated IDs because we could use patient data (name and birthdate) and a salt to generate a hashed ID. We did not want this approach because my co-workers dislike having the very long hashed IDs in the actual database (file three).
The description of bcrypt::hashpw() says "Bcrypt is used for secure password hashing. The main difference with regular digest algorithms such as MD5 or SHA256 is that the bcrypt algorithm is specifically designed to be CPU intensive in order to protect against brute force attacks. The exact complexity of the algorithm is configurable via the log_rounds parameter. The interface is fully compatible with the Python one."

hash

salt

password-hashing

Score:1

Crypto

fgrieu

6/16/23, 8:55 AM

The worst weakness is that read access to the first file reveals the name and birthdate of patients.

Beyond that, read access to the other files lets an adversary with knowledge of the system (as assumed in cryptography) obtain the medical data for each patient identified by name and birthdate, at bearable computational cost.
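To make the "bearable computational cost" concrete: the linking step is just the app's own lookup run once per row of the second file, i.e. one bcrypt evaluation at log_rounds = 5 per patient. A minimal sketch, assuming the adversary has copies of all three files as data frames (column names, example patients and the helper `hash_ids()` are hypothetical):

```r
library(bcrypt)

# Hypothetical helper: hash each ID with its own salt.
hash_ids <- function(ids, salts) {
  mapply(function(id, s) hashpw(id, salt = s), ids, salts, USE.NAMES = FALSE)
}

# Hypothetical copies of the three files, as seen by someone with read access.
file2 <- data.frame(id = c("4711", "0815"), stringsAsFactors = FALSE)
file2$salt <- vapply(file2$id, function(x) gensalt(log_rounds = 5),
                     character(1), USE.NAMES = FALSE)
file1 <- data.frame(
  name      = c("Alice Example", "Bob Example"),
  birthdate = c("1980-01-01", "1975-05-05"),
  id_hashed = hash_ids(file2$id, file2$salt),
  stringsAsFactors = FALSE
)[2:1, ]  # reversed so that row order carries no information
file3 <- data.frame(id = c("0815", "4711"), smoker = c(TRUE, FALSE),
                    stringsAsFactors = FALSE)

# The linking step: recompute every hashed ID from file two, then join
# names (file one) to medical data (file three) through it.
candidates <- data.frame(id = file2$id,
                         id_hashed = hash_ids(file2$id, file2$salt),
                         stringsAsFactors = FALSE)
linked <- merge(merge(file1, candidates, by = "id_hashed"), file3, by = "id")
```

After this, `linked` holds name, birthdate and medical variables side by side, which is exactly what the pseudonymization was meant to prevent for anyone without authorization.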

This is an IT security issue with no complete cryptographic solution. The standard solution is to restrict read access to the files. The best that I see practically possible without such restriction is that knowing/guessing exactly the name and birthdate of a patient is necessary to de-anonymize their data, and there's a computational cost to verifying a guess. The general idea is to either

not store the name and birthdate at all; this seems possible without changing the functionality as stated in "in practice", but we can no longer de-anonymize, nor detect that a mistyped name/birthdate created duplicate entries for the same patient.
store name and birthdate encrypted under a public key, with the private key kept with extra precautions and used (to decipher) only in the exceptional case that patient data must be de-anonymized.
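The second option could be sketched with the `openssl` R package roughly as below. This is an illustration under assumptions, not a vetted design: the key size, the `name|birthdate` serialization and the example record are made up, and in a real deployment the private key would be generated and stored away from the machine holding the files.

```r
library(openssl)

# Key pair. Only the public key needs to be available to the app.
key    <- rsa_keygen(bits = 2048)
pubkey <- key$pubkey

# Hypothetical serialization of the identifying data. Direct RSA encryption
# only fits short messages (well under the key size in bytes).
record <- "Alice Example|1980-01-01"
ciphertext <- rsa_encrypt(charToRaw(record), pubkey)

# Encryption is randomized: encrypting the same record again gives a
# different ciphertext, so ciphertexts cannot be used to search by name.
ciphertext2 <- rsa_encrypt(charToRaw(record), pubkey)

# Exceptional case only: de-anonymize with the private key.
recovered <- rawToChar(rsa_decrypt(ciphertext, key))
```

Since the ciphertext cannot serve as a lookup key, finding a returning patient's ID from name and birthdate would still need a separate mechanism, such as the slow hash of the identifying data mentioned in the question's notes.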

As a relatively minor aside, "Randomly assigning an ID to each patient" requires something unstated to avoid duplicate IDs, and a weakness could creep there.
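As one example of how a weakness could creep in there, here is a base R sketch of drawing a fresh ID that is guaranteed not to collide with existing ones. The function name `new_id()` and the default pool are hypothetical. Note also that R's default random number generator is not designed for cryptographic use, so the IDs should not be relied on to be unpredictable.

```r
# Draw one unused ID from a pool, given the IDs already assigned.
new_id <- function(existing, pool = 100000:999999) {
  free <- setdiff(pool, existing)
  if (length(free) == 0) stop("ID pool exhausted")
  # Index into `free` rather than calling sample(free, 1): when `free` is a
  # single number x, sample(x, 1) draws from 1:x instead of returning x,
  # which could silently produce an ID that is already taken.
  free[sample.int(length(free), 1)]
}

existing <- c(100001L, 100002L)
id <- new_id(existing)
```

The `sample()` behavior on a length-one vector only shows up when the pool is nearly exhausted, so it is easy to miss in testing.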

Igor stands with Ukraine

6/16/23, 9:00 AM

"Worst weakness is that read access to first file reveals name and birthdate of patients." - The guidelines recommend using pseudonymization in our context, i.e. a reversible encryption (as opposed to anonymization, where no one is supposed to be able to restore the original data). Therefore I need to store the actual personal data somewhere. At least, I do not know of an alternative.

caveman

6/16/23, 1:02 PM

Would zero-knowledge proofs help?