Score:2

How to generate n unique numbers from the output of a hash function?

bq flag

What would be an easy way to use the hexadecimal output of a hash function (like md5) to generate n unique numbers from, say 0 to 15. Of course I could generate n numbers by using each digit of the digest, but I need those numbers to be unique.

kelalaka avatar
in flag
What is the aim? What is the actual range? Do you really need to use hash functions? Discarding is the easiest....
Maarten Bodewes avatar
in flag
@kelalaka The problem with the simple discard method is that it uses an undetermined amount of random bits. And, as [I've shown](https://github.com/owlstead/RNG-BC/blob/master/README.md), it is **very** inefficient if the number is discarded. With an average of 1.7 bits of overhead at least the RNG-BC method that I've described has a chance of *usually* keeping within 128 bits produced by MD5.
Eugene Ryabtsev avatar
cz flag
Are there any requirements imposed on the resulting numbers? The simplest way is [MD5 + 0, MD5 + 1, ..., MD5 + n] and they will all be unique. Or do you mean 0..15 is the range of the resulting numbers? In that case, you can index into [0, 1, ..., 15] and use that as the first number and the remaining n as the rest.
kelalaka avatar
in flag
@MaartenBodewes sure it is undetermined, however, the range is not specified, instead it was made so simple. Fisher-Yates shuffle and it's varianst's is the usual way, however, the OP mentioned hash so I've wanted to clarify. Eugene's solution is more precise.
U. Windl avatar
cn flag
The easiest way to get *n* unique numbers is *counting* (from 1 to *n*).
Score:9
ru flag

You should use the combinatorial number system to bijectively map an integer in the range 0 to $\binom{16}n-1$ an $n$-element subset of the first 16 numbers. I wrote a related answer on mapping messages to error positions in the Niederreiter cryptography system.

Maarten Bodewes avatar
in flag
Hmm, seems very much like the method that I'm proposing, not sure if it is entirely the same.
Starscream512 avatar
bq flag
So we find the combinadics representation of an integer in the range 0 to ${16 \choose n} - 1$ for degree `n`?
Daniel S avatar
ru flag
@Starscream512 Yes, that's exactly it. Use the hash function to generate such an integer uniformly and then convert it to a subset. It's clearly information theory optimal.
Maarten Bodewes avatar
in flag
Getting more and more sure that it is functionally the same as what is in my answer. Fisher-Yates uses it to index and shuffle so that lists other than ranges are supported, but I guess that's about the difference.
kelalaka avatar
in flag
Is there a work for the quality of the randomness? I have not seen around, interestingly Wiki says it i used also used for the analysis of lottery games. If the lotteries not used in the real application then there is a something that they see but I've failed.
Starscream512 avatar
bq flag
Your answer then leads to my next question about modulo bias in generating an integer using a hash function. I suppose if ${16 \choose n}$ wasn't too big, and I used a big hash like sha256, then the modulo bias would be negligible?
Maarten Bodewes avatar
in flag
I'd certainly argue for SHA-256 in this case, it would get rid of any possible bias issues, however small. MD5 has a lot of problems, one is that it is severely broken especially w.r.t. collision resistance, which **may** very well play a role here, and the other important one is the output size, which is not enough to assure collision resistance in the first place.
Score:4
in flag

TL;DR: use the Fisher-Yates shuffle based on a seed of the range represented as a list; the hash value is the seed.


First of all, MD5 generates 16 bytes that are generally well distributed. It doesn't generate hexadecimals; those are just used to represent the bytes in a textual form.

You can generate values in a range using the Fisher-Yates shuffle. However, this requires you first to draw an number in the range [0..n), then [0..n - 1) until [0..2). Generally getting random numbers from bits is tricky because most algorithms use a non-deterministic number of bits.

In this case you can use a factorial of n, then choose a vector using that to compute the required number in a range.

So you'll get something like the following Python code, which implements the shuffle based on a 128 bit seed:

from decimal import Decimal, getcontext
from hashlib import md5
import math

# Setting the precision for decimal operations
getcontext().prec = 50

def fisher_yates_shuffle(n, seed):
    # Guard clause: Check the type and size of the seed
    if not isinstance(seed, int) or seed.bit_length() != 128:
        raise ValueError("Seed must be a 128-bit integer.")

    # Initialize the array from 0 to n-1
    arr = list(range(n))

    # Convert the seed to a decimal
    random_decimal = Decimal(seed) / Decimal(2**128)

    # Calculate the product x of all numbers in [0, 1, ..., n-1]
    x = math.factorial(n)
    
    # Pick a starting point within [0, x)
    current_vector = random_decimal * Decimal(x)

    for i in range(n - 1, 0, -1):
        # Generate a value for i-th index using current_vector
        i_value = int(current_vector % Decimal(i + 1))

        # Perform the shuffle step
        arr[i], arr[i_value] = arr[i_value], arr[i]

        # Update the current_vector for the next iteration
        current_vector //= Decimal(i + 1)

    return arr

# Generate a 128-bit random seed from MD5 hash of "Hello World"
seed_str = "Hello World"
md5_hash = md5(seed_str.encode('utf-8')).digest()
seed = int.from_bytes(md5_hash, byteorder='big')

# Number of elements in the array (max)
n = 16

# Perform the Fisher-Yates shuffle
result = fisher_yates_shuffle(n, seed)

print("Shuffled array:", result)

Written with the help of ChatGPT, but altered & logically validated.

Note that some bias is introduced, somewhere around $\frac{1}{2^{88}}$ for 16 values. Usually that's considered negligible, but it pays to be careful when it comes to cryptography.

I sit in a Tesla and translated this thread with Ai:

mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.