Score:0

Pseudonymization hashing using public key

be flag

I have logs containing sensitive data (mail addresses, usernames etc.) that need not only be made GDPR compliant, but in general be secured as best as possible, so anonymization/hashing would be an easy solution. However, at the same time I need to be able to correlate the information for monitoring purposes (e.g. identifying attacks). For that, the plaintext values are not relevant, but I need the pseudonymized hashes to be equal for the same plaintext source (determinism). As I cannot store the data in plaintext, but only in pseudonymized fashion, I also need to be able to decrypt the values if necessary . As there might be multiple data sources that need to be decrypted individually (rarely would all at once be needed), my initial idea would be to use a plain old public/private key approach using RSA, which I know from my SSH day-to-day: Encrypt the values using a pub key per source, store the private keys in a safe. However, RSA introduces randomness to the hashes, which makes sense from a security standpoint, but ruins my "must be able to correlate" deterministic requirement.

Any ideas on a good solution? Is there a similar algorithm that I can use that is not too easy to bruteforce, but can use the public/private key approach? Am I missing other weakspots in e.g. attempting to use raw RSA without randomized padding?

In terms of operations: The risk is a lot higher that an attacker might access a large number of hashes than that he gain access to the public key, which already is highly secured. The biggest identified risk (still very unlikey) is that an attacker might be able to inject his own data at the source of the pipeline to be processed/encrypted and then view the hashes, allowing him to compare plaintext and hashes ( = guessing the hashes). Encryption speed does not need to be very high, and decryption speed is even less important.

If relevant, the actual implementation will need to be done using Python.

Maarten Bodewes avatar
in flag
With hashing we generally mean that you run your data through a cryptographic hash such as SHA-256. Do you mean that kind of hashing, or do you mean encryption which creates a ciphertext to obtain confidentiality?
kelalaka avatar
in flag
Hash for equality and at the same time use AES-GCM with IV=log number.
Senshi avatar
be flag
@MaartenBodewes I mean encryption, because I need to be able to restore the plaintext in some cases. My understanding is that hashing is a "oneway" method, meaning I cannot get the plaintext from the generated hash. Is that correct?
Score:3
ar flag

What you seem to be asking for is some type of public-key convergent encryption, for which various schemes do exist.

However, all such schemes are unavoidably vulnerable to brute force guessing attacks if the number of plausible plaintexts is low — say, less than about a septillion ($10^{24} ≈ 2^{80}$).

Specifically, an attacker who has a convergent ciphertext (or just a cryptographic hash) of a message and the public key (if any) used to generate it can guess a plausible plaintext, encrypt it and compare it with the ciphertext / hash to verify whether or not their guess was correct.

A typical desktop CPU can do on the order of about a billion ($10^9 ≈ 2^{30}$) encryptions a second, so if there are fewer likely plaintexts than that, the scheme offers no real security at all. And if the attacker is willing to spend a bit more time and/or buy (or steal) a bit more computing power than that, they can probably brute force their way through a thousand or a million or a billion times that many plaintexts. Also, attacking multiple ciphertexts simultaneously in this manner is just as fast as attacking a single one.

A key stretching scheme may be used to slow down such attacks, but only at the cost of slowing down legitimate encryption (and decryption) operations by approximately the same factor. So for example, if you can live with a single encryption operation taking approximately one CPU-second, you can force the attacker to also take about one CPU-second (give or take a factor of 10 or 100 depending on implementation efficiency) to guess and test a single plausible plaintext. But that still means that an attacker with access to 100 CPUs (or possibly ~one fast GPU) can still test a million possible plaintexts in about three hours.

Typically, data such as mail addresses, usernames, phone numbers, etc. has rather low cardinality — and worse, even if there might be millions or billions of possible addresses, the likely address combinations in use are much fewer, and easily discoverable using public maps and address databases. So, in practice, using cryptographic hashing or convergent encryption to pseudonymize such data is doomed to fail.

What you can do, instead, is one of the following (in roughly descending order of preference):

  1. Don't store or process such data if you can avoid it.

  2. If you must store such data, store it encrypted (using a conventional chosen-plaintext attack resistant encryption scheme) and only decrypt it for processing on a secure system. Store the decryption key securely.

  3. If you must pseudonymize such data, e.g. for processing by untrusted third parties, do it preferably by associating each datum (address, username, etc.) with a fully random identifier and storing the associations in an encrypted database. Make sure to keep this association database secure as described above.

Alternatively, you may generate the pseudonymous identifiers by applying a PRF (such as HMAC or some key-stretching KDF) to the sensitive data items with a secret key. If so, you must protect the PRF key from compromise, as it can be used to de-pseudonymize all data.

Either way, it is generally not feasible to do pseudonymization "at the source". Instead, you should encrypt any sensitive data using a secure public-key scheme at the collection point (which should only have access to the public half of the key pair), collect this encrypted data into a secure system and decrypt, process and (optionally) pseudonymize it there.

ar flag
Ps. Related earlier answer: https://crypto.stackexchange.com/questions/25808/one-way-deterministic-hash-for-low-entropy-input/25813#25813
Senshi avatar
be flag
Thank you very much. All of that was super informative and gave me plenty to think about. I underestimated bruteforce vulnerability with chosen-plaintext and limited combinations. Recommendation 2 sounds like an attractive solution for this particular case. Automated analysis happens on an isolated workstation where the decrypt key is stored inaccessible to all users, and the generated reports will only be made accessible to select few analysts using our IDM. Ad-hoc decryption is never supposed to happen anyway except in extraordinary forensic cases.
mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.