Score:0

can you generate an ID number quickly, with no collisions, and without IDs revealing information?

ru flag

Is there a standard way to generate ID numbers one after the other such that:

  • You can guarantee, or almost guarantee, that you avoid collisions. (By "almost guarantee", I mean for example if you generated completely random 24-digit numbers, and you "only" generated 1 million of them, then even with the birthday paradox, the odds of a collision would be small.)
  • You want the ID numbers to be short, not unwieldy - specifically, you don't want to rely on the length of the ID number (and picking random values) to avoid collisions, as described in the previous bullet point. You have to do it some other way.
  • You don't want to avoid collisions each time you generate a new value by looking at all the pre-existing values to see if it's already been used. This would only be a log(n) lookup each time on a sorted list, but suppose I want to avoid this anyway.
  • You don't want the ID number to reveal any information about when it was generated, or how many ID numbers were generated in between ID number X and ID number Y. Without this condition, the problem is trivial; you could just use the clock time (plus some random value that's large enough to avoid collisions between numbers generated in the same clock time value), or you could just use sequential integers (except now an attacker knows that if someone generated ID value 5000 on March 1st and ID value 6000 on April 1st, there were 1000 other values generated in between then).

I tried to find a trivial answer, but none of the ones I tried seemed to work. You could just take the SHA-256 hash of the numbers 1, 2, 3, etc. (plus some secret key), but this has the same problem as just picking random numbers from the available space -- if you're relying on the length of the hash (e.g. SHA-256) to avoid collisions, the resulting ID numbers are long and unwieldy, and if you make the hash shorter, you increase the chance of collisions.

Or you could generate new IDs by incrementing each time by a random value between 1 and n, instead of always incrementing by 1. The problem is that depending on what the attacker can do, they could figure out what n is -- if they have the ability to generate two IDs in sequence, and do it repeatedly, they could figure out n, or if they have the ability to check which IDs are valid, they could check every number in some small range, to see how densely packed the valid IDs are, and figure out n from that.

The only sort-of solution I could find is as follows: First, do some prep work in advance. For however many ID values you expect to generate (say, 1 million), take all the integers from 1 to 1 million, and in order, start computing the hash of each integer plus a secret key. Truncate the hash to whatever value you think is short enough. But, with a short enough truncation, you expect to see collisions. So each time you generate a new truncated hash for a given integer, check it against previously generated values, and if there's a collision, add that integer to a list L of integers where the hash of that integer collides with that of a smaller integer. (So in fact if your plan is to generate 1 million IDs, during his prep work you'll have to go a little bit last 1 million integers, to make up for the ones that you skipped.)

Then, at run time when you generate your IDs, you just start with an integer counter. Each time you generate a new ID, you increment the integer and check if it's on your list L, and if it is, you skip and to go the next integer. (This does involve a "log n lookup", apparently breaking one of my stated rules, but what I really wanted to do was avoid having to check each new ID value against every value generated so far; checking L should be much faster.) And you can fine-tune this for tradeoffs (the longer you make the truncated hashes, the shorter L will be and hence the shorter the check will be each time you generate a new ID; but longer ID values might not be desirable).

But this feels like a hack. Is there a standard way? If not, can you think of a better way?

kelalaka avatar
in flag
For a fixed key, encrypt with AES-ECB mode. AES is a family of permutations and a key select one of them. Besides, I would go for SHA-512, SHAKE, or BLAKE-512, etc. Finding collision is not expected, however, once you find, you will be way famous!
poncho avatar
my flag
@kelalaka: SHA-512? So, you're suggesting a 512 bit (154 digit) id???
kelalaka avatar
in flag
Do you think that you can have more than $2^{100}$ ID's?
poncho avatar
my flag
@kelalaka: if you're relying on the collision resistance of SHA-512, you need to output the full hash - a truncated hash would not imply a collision in the original hash function.
Bennett avatar
ru flag
The problem with using any kind of hash to generate the ID is the same problem with just using random numbers, as described in the problem statement -- if you avoid collisions by simply making them long, that's too unwieldy, and if you truncate them to make them shorter, you increase the chance of a collision. I'm looking for a way to avoid collisions without simply using very long values.
Score:2
my flag

The most efficient way would be to use a Format Preserving Encryption algorithm; that is an algorithm that is a permutation over an arbitrary-sized set (for example, a sequence of 10 decimal digits).

Using it would be simple: you'd pick a random key and store it; you'd also keep a sequence number (in the same format as the output, for example, you might start with 0000000000). Then, when it comes time to generate the next ID, you'd increment the sequence number, and send that through the FPE algorithm; that's your next ID [1].

Because the FPE algorithm is a permutation, you will never generate the same ID twice until the sequence number wraps; hence, no collisions. You can make the ID as short as needed (current FPE algorithms have issues with truly tiny spaces; if you keep the ID to be, say, at least 6 digits, you'd be safe). And, because FPE algorithms are secure, any ID does not provide any information about any other ID (including the relative generation order).

The drawback: there are no (as far as I know) common FPE libraries available. For use, I'd suggest FF1 from this document; implementing that from scratch would be a bit of work (but would satisfy your needs).


A less efficient, but easier to implement method would be to do a variation on what you suggested: keep a list of IDs you have not yet assigned.

Here, during setup, you would initialize a list of all possible IDs in sequential order, say 000000 through 999999. In addition, you would set a value N to the highest unassigned ID (in this example, N = 999999).

Then, when it comes time to issue a new ID, you would pick a random number x between 0 and N (inclusive). Then, you would swap the IDs at indices N and x (and if x=N, then this swap operation doesn't do anything); then, you would output the value that's not in index N (and then decrement N).

That's it; this is actually the Fisher-Yates shuffle, you could do this on-demand (as I wrote it), or you could do all the shuffling at setup time (and just read things off the list when generating IDs).

This is less efficient than the FPE idea - both because the setup is more involved, and you need to keep a largish list of IDs around. On the other hand, it is a lot easier to implement than trying to implement FF1 from scratch...


[1]: Format Preserving Encryption algorithms also take a 'tweak'; you'd want to keep that a fixed value (for example, the empty string); a tweak provides a service that your specific use does not need.

mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.