Score:5

What is the best way to pseudonymise IP addresses while retaining the ability to identify those that share a subnet?

Background: I'm developing an app that is based around registered users voting on stuff, and I want to create a heuristic that involves IP addresses as one way to flag accounts for further investigation of potential multiple account+vote abuse. In the interests of privacy/data minimization/GDPR obligations, it appears the best strategy is to store keyed hashes of the IPs, which would be sufficiently pseudonymous but deterministic for checking matches. So I could just use HMAC-SHA256 or similar on the whole address and be done with it.
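For concreteness, the whole-address approach could look like the sketch below. It is only an illustration: the function name `pseudonymise_ip` and the hard-coded key are made up for the example, and a real key would be loaded from a secret store rather than written in source.

```python
import hashlib
import hmac
import ipaddress


def pseudonymise_ip(ip: str, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag of the address in its packed binary form."""
    # Parsing first means different spellings of the same address
    # (e.g. compressed vs. expanded IPv6) produce the same tag.
    packed = ipaddress.ip_address(ip).packed  # 4 bytes for IPv4, 16 for IPv6
    return hmac.new(key, packed, hashlib.sha256).hexdigest()


# Placeholder key for the example only.
key = b"example-only-key-do-not-use"

a = pseudonymise_ip("192.0.2.10", key)
b = pseudonymise_ip("192.0.2.10", key)
c = pseudonymise_ip("192.0.2.11", key)

print(a == b)  # same address, same tag
print(a == c)  # neighbouring address, unrelated tag
```

This gives exact-match detection only: the tags for `192.0.2.10` and `192.0.2.11` reveal nothing about the two addresses being neighbours.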

However, it occurs to me that it might be useful to identify not only IPs that are identical to each other, but also those that share a subnet, which would require something not quite so opaque. The obvious way to do this would be to hash each part of the IP separately. The problem is that with HMAC-SHA256 (for example) the complete output is far too large, especially for IPv6 addresses (8 x 256 = 2048 bits for each 128-bit address). It would also cut down the size of the input space substantially (1-byte values for IPv4, 2-byte values for IPv6). I assume it would be best to use a different key for each part if I were to do this, which doesn't sound fun.

What's a good way to achieve my goal while keeping the stored size relatively small? Maybe it's ok to truncate the output when using SHA256? Maybe since it's an HMAC and the key is secret it's ok to use a smaller and weaker hashing function in the first place like MD5? Maybe there's another hash function that is uniquely suited to this kind of use case? Any guidance is appreciated.
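To make the size question concrete, here is one possible variation, sketched under assumptions rather than offered as a recommendation: instead of hashing each octet independently, compute a truncated HMAC over each cumulative prefix (/16, /24, /32 for IPv4). The names `truncated_tag` and `prefix_tags`, the 8-byte truncation and the chosen prefix lengths are all arbitrary choices for the example. Truncation lowers collision resistance (a 64-bit tag starts to collide by chance at around 2^32 distinct inputs), so the length would need to be chosen with the expected data volume in mind.

```python
import hashlib
import hmac
import ipaddress


def truncated_tag(key: bytes, data: bytes, n_bytes: int = 8) -> bytes:
    """HMAC-SHA256 truncated to its first n_bytes bytes."""
    return hmac.new(key, data, hashlib.sha256).digest()[:n_bytes]


def prefix_tags(ip: str, key: bytes, prefix_lengths=(16, 24, 32)) -> dict:
    """Map each prefix length to a tag of the network containing `ip`."""
    tags = {}
    for plen in prefix_lengths:
        net = ipaddress.ip_network(f"{ip}/{plen}", strict=False)
        # The prefix length is part of the input, so a /16 tag can never
        # be confused with a /24 tag even though the key is shared.
        data = bytes([plen]) + net.network_address.packed
        tags[plen] = truncated_tag(key, data)
    return tags


key = b"example-only-key-do-not-use"  # placeholder key

t1 = prefix_tags("192.0.2.10", key)
t2 = prefix_tags("192.0.2.200", key)

print(t1[24] == t2[24])  # same /24, tags match
print(t1[32] == t2[32])  # different hosts, tags differ
```

With these parameters each IPv4 address costs 3 x 8 = 24 bytes of storage, and one key suffices because every input includes the whole prefix rather than a single octet.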

kelalaka:
See format-preserving encryption.
knaccc:
"it appears the best strategy is to store keyed hashes of the IPs, which would be sufficiently pseudonymous but deterministic for checking matches" <- What are your criteria for sufficient pseudonymity? There are only 4 billion IPv4 addresses, so anyone who holds the HMAC key could trivially go backwards from the hash to the IP address in a few seconds by enumerating them all on a modern GPU. Perhaps you intend to use an HSM so that you have no access to the HMAC key?
eddydee123:
I think more than just format preservation is needed to preserve relations such as sharing a common subnet.
Score:1

After some more searching (thanks for the pointer in the comments) I came across the Crypto-PAn scheme (often spelled without the hyphen as CryptoPAn), which was described/developed for precisely this purpose. It has a handful of software implementations in various languages, a few of which support IPv6.

The property I was looking for is termed "prefix-preserving", and the paper that introduced Crypto-PAn offers a mathematical proof that every prefix-preserving scheme takes the same general form: each output bit is the corresponding input bit XORed with a value that depends only on the bits that come before it (as opposed to the independent per-part scheme I proposed).

Crypto-PAn involves the (repeated) use of a pseudorandom function (PRF), which in the reference implementation and most others is AES-128-ECB. Pseudonymised IP addresses can be deciphered when the secret key is known, even if a hash function is used for the PRF, because of how the algorithm works: the PRF is only ever applied to the bits that precede the one being transformed, so the original bits can be recovered one at a time from the left. In other words, Crypto-PAn is an encryption scheme regardless of the PRF.
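The bit-by-bit structure can be sketched as follows. This is a simplified illustration in the style of Crypto-PAn, not the reference algorithm: it swaps in HMAC-SHA256 as the PRF instead of AES, uses its own input encoding, and so will not produce the same output as any real Crypto-PAn implementation. All function names and the key are invented for the example.

```python
import hashlib
import hmac
import ipaddress


def _prf_bit(key: bytes, prefix: int, length: int) -> int:
    """One pseudorandom bit that depends only on the first `length` address bits."""
    msg = length.to_bytes(1, "big") + prefix.to_bytes(16, "big")
    return hmac.new(key, msg, hashlib.sha256).digest()[0] >> 7


def pseudonymise(ip: str, key: bytes) -> str:
    addr = ipaddress.ip_address(ip)
    n = addr.max_prefixlen  # 32 for IPv4, 128 for IPv6
    value = int(addr)
    out = 0
    for i in range(n):
        prefix = value >> (n - i)         # the i original bits to the left
        bit = (value >> (n - 1 - i)) & 1  # original bit at position i
        out = (out << 1) | (bit ^ _prf_bit(key, prefix, i))
    return str(type(addr)(out))


def depseudonymise(pseudo: str, key: bytes) -> str:
    addr = ipaddress.ip_address(pseudo)
    n = addr.max_prefixlen
    value = int(addr)
    orig = 0
    for i in range(n):
        # `orig` holds the i original bits recovered so far, which is
        # exactly the PRF input that was used when pseudonymising bit i.
        bit = (value >> (n - 1 - i)) & 1
        orig = (orig << 1) | (bit ^ _prf_bit(key, orig, i))
    return str(type(addr)(orig))


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading bits two same-family addresses have in common."""
    x, y = ipaddress.ip_address(a), ipaddress.ip_address(b)
    return x.max_prefixlen - (int(x) ^ int(y)).bit_length()


key = b"example-only-key-do-not-use"  # placeholder key

p1 = pseudonymise("192.0.2.10", key)
p2 = pseudonymise("192.0.2.200", key)
print(shared_prefix_len("192.0.2.10", "192.0.2.200"))  # prefix shared by the originals
print(shared_prefix_len(p1, p2))                       # same length after pseudonymising
print(depseudonymise(p1, key))                         # recovers the original address
```

Two inputs that agree on their first k bits receive identical pad bits for those positions and for position k+1, so the outputs agree on exactly the same number of leading bits as the inputs.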

Crypto-PAn also happens to be format-preserving, so this makes it possible to work with pseudonymised addresses just like you would with the originals.
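As a small illustration of what that allows, the sketch below groups stored addresses by /24 using only Python's standard `ipaddress` module. The addresses in `stored` are made-up values standing in for pseudonymised output, and `group_by_subnet` is a hypothetical helper, not part of any Crypto-PAn library.

```python
import ipaddress
from collections import defaultdict


def group_by_subnet(pseudonymised, prefix_len=24):
    """Group address strings by the network of the given prefix length."""
    groups = defaultdict(list)
    for ip in pseudonymised:
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        groups[str(net)].append(ip)
    return dict(groups)


# Made-up values standing in for pseudonymised addresses.
stored = ["203.0.113.7", "203.0.113.99", "198.51.100.4"]

groups = group_by_subnet(stored)
# Subnets seen more than once are candidates for further investigation.
flagged = {net: ips for net, ips in groups.items() if len(ips) > 1}
print(flagged)
```

The network labels produced here are themselves pseudonyms: they show that two accounts share a subnet without showing which real subnet it is.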

Being deterministic and prefix-preserving and operating on a small value space comes with the downside that the scheme is necessarily weak to semantic analysis. It's clear that in order to have the desired utility, there is an unavoidable tradeoff in privacy (which I knew from the start). In other words it's best-effort, but that's better than storing the original IP addresses. Of course there are additional techniques that could or should be employed to help mitigate the risk (e.g. key rotation, deletion after a set time, partitioning with different keys). Obviously I hope that my other security practices prevent the data from ever being exposed in the first place.
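One way key rotation and time-limited retention might fit together is sketched below, assuming one independent random key per calendar month. `KeyRing` is a hypothetical in-memory class for illustration; a real deployment would keep keys in a secret store or HSM.

```python
import secrets
from datetime import date


class KeyRing:
    """Holds one independent random key per calendar month."""

    def __init__(self):
        self._keys = {}

    def key_for(self, day: date) -> bytes:
        period = (day.year, day.month)
        if period not in self._keys:
            self._keys[period] = secrets.token_bytes(32)
        return self._keys[period]

    def purge_before(self, day: date) -> None:
        """Delete every key for months earlier than the month of `day`."""
        cutoff = (day.year, day.month)
        for period in [p for p in self._keys if p < cutoff]:
            del self._keys[period]

    def periods(self):
        return sorted(self._keys)


ring = KeyRing()
k_jan = ring.key_for(date(2024, 1, 15))
k_feb = ring.key_for(date(2024, 2, 3))

ring.purge_before(date(2024, 2, 1))
print(ring.periods())  # only February's key remains
```

Because the keys are independent, pseudonyms from different months cannot be linked to each other, and once a month's key is purged its pseudonyms can no longer be deciphered or matched against new traffic. The cost is that the abuse heuristic only sees matches within a single key period.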

There is an IETF RFC from 2020 that includes a table of IP address anonymisation/pseudonymisation techniques. Besides Crypto-PAn, the only one that falls under both the "pseudonymization" and "prefix preserving" categories is something called "Top-Hash Subtree-Replicated Anonymization (TSA)", which is apparently optimised for speed (probably not a good thing in this context?), but it comes with a note that suggests that it might be too memory hungry for IPv6 addresses, and I haven't been able to find any implementations.
