Score:1

Can there be an injective function that maps a large set of integers to a smaller set while being "collision-aware"

in flag

Consider two sets:

The "big set" contains all integers between $0$ and $2^{160}$ exactly once.

The "small set" contains all integers between $0$ and $2^{32}$ exactly once.

Given that the number of members in the "big set" is greater than those in the "small set", there can't be an injective function $f(n_b) = n_s$ mapping any input being a member of the "big set" $n_b$ to an output that's a member of the "small set" $n_s$. If that function $f$ existed, it'd be the questions' answer.

For practical reasons, we assume that there can however still be a construction/algorithm with a practical function $f_p$ where the results of all inputs into $f_p(n_b)$ are a member of the "small set" and each $n_b$ points to a distinct $n_s$.

For a lack of understanding about cryptographic concepts, I'll call this property "collision-aware". E.g. to implement this construction, assuming a storage capacity of the size of $2^{256}$ (256-bit unsigned integer), is there a function $f_p$ or an algorithm that for any $n_b$ either returns a "collision" ("collision-aware") or a distinct member of $n_s$?

cn flag
Is very unclear to me what you're looking for. How does what you're describing in your second paragraph differ from an injective function?
cn flag
What does "returning a collision" mean?
user10030 avatar
in flag
I'm looking for a function where card(domain) > card(codomain) but each number that I take as input to the function from domain maps to a distinct number in the codomain. From my understanding, this can only be possible if card(domain) >= card(codomain). So since this is not possible, can I have a function that allows this kind of mapping but e.g. "errors" if a collision has been found for the first time.
cn flag
That's impossible, yes. But what exactly is the behavior you want? The function either returns a value from the domain or an error symbol?
user10030 avatar
in flag
Here's what I practically want: Every now and then I get a random number out of uint160 (it's an Ethereum acccount/address) and I need a function that distinctively maps each address to a specific slot of a 2^32 length list. Say I've already allocated 100 addresses, now address #101 comes a long and its slot is the same as e.g. #91, then I want to be aware that there's a collision for the slots now #101 shouldn't overwrite #91. However, I can't just store all previously stored addresses. I only have a storage capacity slot in size e.g. uin256.
cn flag
Without any additional requirements the function defined as "If $x\leq2^{32}$, return $x$, else return $\bot$." seems to satisfy your requirements from the question. But I'm sure that you have additional requirements that you for some reason refuse to specify.
user10030 avatar
in flag
It's not refusal. It's rather not being able to express myself properly. I'll try specifiying more. Thanks for helping so far. To continue: If we used "if $x <= 2^{32}$, return x else return ⊥", the problem would be that we'd discriminate any input from the domain that is bigger than $2^{32}$. What I'm looking for, however, is e.g. that we allow the creation of a mapping between a number $0 < x < 2^{160}$ in the domain to the codomain until there's no numbers in the codomain left that we haven't already created a pointer from domain to codomain. "I want to fill up the codomain progressively"
fgrieu avatar
ng flag
What about a truncated hash? In our use case, that works up to and including #101 with less than 1 chance in 850 thousand of the contrary.
cn flag
Next obviously wrong attempt to find out what you're really looking for: Assign indices sequentially. Maintain a 33-bit counter n initialized at zero. Everytime you see an input, if $n<2^{32}$ return n and increment the state. If $n=2^{32}$ return $\bot$.
user10030 avatar
in flag
@fgrieu Interesting. What I like about this is that it's filling up the 2^32 codomain space progressively and evenly spread out. The problem I see, however, is that given the hash function: I want to store an Ether balance in relation to it. E.g. I want to say that $n_b$ e.g. 0xabc... → $n_s$ owns 1 Ether. However, as balances become larger, at one point - similar to how it is with Bitcoin mining - it may become economically sound to run a brute force algorithm to e.g. find a collision for $n_s$ for an account $n_b$ that currently holds a large balance.
user10030 avatar
in flag
Hey @Maeher. Given a counter that increases with each input and has a 1 bit larger capacity than the MAX number in the codomain, the problem is that for any number in the domain, I want to assign it exactly one number in the codomain even on repeated insertions. I work in computer science so I'd call such a function "deterministic". Given a particular input, it always produces the same output. From what I understand is that if we used an increasing counter, if we repeatedly inserted the same input multiple times, we'd end up with different codomain results each time (increasing by e.g. +1).
cn flag
You could try using a truncated hash and maintaining an approximate set membership datastructure (e.g. a Bloom filter) of the used up indices. But it's likely that that will end up being too large.
Score:2
cn flag

What about:- $$ n_s = \mathcal{H}(n_b) \& (2^{32} - 1) $$ where $n_b \in N_b$, etc? $\mathcal{H}$ can be a hash function of your choosing. Since this is a crypto site, I suggest SHA-256. $\&$ means bitwise AND, but could be replaced with right or left shifts of the appropriate number of bits (128 in SHA-256's case). Too slow perhaps(?)

Cryptographic hash functions are surjective, meaning their outputs occasionally collide. That rate of collisions will greatly increase if you truncate $\mathcal{H}$ to 32 bits. Even more so will the effect of the pigeon hole principle. So you'll have set $N_b$ filling up with uniformly distributed numbers, giving $n_b \to n_s$ from domain to codomain.


I don't know about Ethereum, but 160 looks suspiciously like the output of SHA-1. If so, just truncate the account/address to 32 bits as it's already uniformly distributed.

ma flag
"Cryptographic hash functions are surjective" - I don't think so. Surjective means that for every possible hash value, there exists some message that generates it. You can't prove that a cryptographic hash function has no "blind spots".
user10030 avatar
in flag
Hey @paul-uszak. I've commented in my original answer why I believe truncating is not the optimal solution. Wrt to uint160 and SHA-1: In Ethereum an account is derived from an ECDSA public key, then hashed using Keccak-256 to a 32bytes output and then - surprisingly - truncated using only the last 20 bytes. Finally it's prefixed with 0x. See more: https://ethereum.stackexchange.com/a/3619/47031
Paul Uszak avatar
cn flag
@user10030 So you have your answer :-) Truncate to four bytes and there you have it, as in my post scriptum.
user10030 avatar
in flag
While I appreciate the existence of your solution, I believe it's not what I'll ultimately need. In case I truncate to 4 bytes, I start giving my users an incentive to mine for $n_b$s that result in the same $n_s$. Sure, for addresses $n_b$ that only hold a low amount of monetary value this may not be a real problem or an economically sound attack: But I'd like to generally avoid this type of problem. There either cannot be collisions or there can be collisions and I'm immediately aware of them as I can look back at all other hashing operation inputs and outputs.
Paul Uszak avatar
cn flag
@user10030 Absolutely there will be collisions due to the pigeon hole principle. At a best rate of $2^{-32}$ per account. Then it will increase asymptomatically towards 1 as the codomain fills up.
Score:1
ph flag

If I understand the requirements, what you are asking for is not a "function" according to the normal definition. It sounds like you want some $f$ that given a sequence of inputs ${x_i}$ will return either a deterministic value $y_i$ or an error symbol if it has already returned $y_i$. But suppose $x_k$ is the first input to return an error. What should happen if you call $f$ with the sequence starting from $x_k$? I think you would want it to not return an error, so the value of $f(x_k)$ is not well-defined.

The way to resolve that mathematically is to change the definition of $f$ to accept a sequence. But practically, what that probably means is having your system record all the inputs that is has been called with. And you'll find that if you do that, you might as well just return $0, ..., n-1$ for the first distinct inputs, and errors for everything following.

user10030 avatar
in flag
Yes, practically being "collision-aware" would probably mean having to pass a sequence to $f$. Actually, passing in a sequence would be no problem as long as it wouldn't lead to an increase in storage for each new hashing operation. I'm wondering if there's e.g. a way where the previously passed value could be factorized or compressed such that they won't take up much storage and where it'd become practical to always pass all past values and a new one to make the function "collision-aware", as in "it's returning an error symbol if it encounters an output for more than the first time".
mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.