Score:0

Public commitments to subsets

no flag

Problem

Suppose there is an entity with some users. The users are split into subsets (determined by the entity) and the entity needs to create public commitments to these subsets such that only a user can determine which set they are in. The sets need to be some minimum size, $m$.

More formally

Let $U$ be a set of user IDs--we want some flexibility here so the definition is left loose for now. Only the entity knows $U$, and each user only knows their ID. Let $\mathbb{V}$ be the power set of $U$: $\mathbb{V}=\mathcal{P}(U)=\{V:V \subseteq U\}$, and let $\mathbb{V}^* \subset \mathbb{V}$ be the subset split chosen by the entity. We need 2 functions: $$ \begin{align*} f&:\mathbb{V}^*\to K \\ g&:K \times U \to \mathbb{Z}_2 \end{align*} $$

where $K$ some finite space. $f$ is the commitment function that is used by the entity to create a commitment $k$ to a specific subset of users. $k$ is made public. $g$ is the function that the user would use to check if they are in a given subset.

$f$ and $g$ need to have the following properties:

  1. $\forall \space V \in \mathbb{V}^* \quad f(V)=k \implies g(v,k)=1 \space \forall \space v \in V$
  2. $\forall \space V \in \mathbb{V}^* \quad f(V)=k \implies g(u,k)=0 \space \forall \space u \in U-V$
  3. [not sure about the wording here] If an adversary has access to $k=f(V)$ then they are not able to determine any of the following in polynomial time with non-negligible probability:
    • any $v \in V$
    • any $u \in U \text{ s.t. } u \notin V$
    • $|V|$

Find $f, g, K$.

One possible solution

We can make a start on $f, g, K$ by using polynomials. First, construct the disjoint split of the users: $$ \mathbb{V}^*= \{ V_1, V_2, ... V_n : V_i \in \mathbb{V}, V_i \text{ disjoint}, \cup_{i}V_i=U, V_i \neq \emptyset \} $$

Let $k_i=f(V_i)$ and $v_{i,j} \in V_i$ be the $j^{th}$ element of $V_i$. We define our polynomial $p$ like so:

$$ \begin{align*} p(x)=&\prod_{i=1}^{n}\prod_{j=1}^{|V_i|}(x-k_iv_{i,j}) \end{align*} $$

and our $g$ like so:

$$ g(u,k)= \begin{cases} 1 : p(uk)=0 \\ 0 : \text{otherwise} \end{cases} $$

and our $f$ can be a cryptographic hash function, which defines $K$. Depending on which space we want $p$ & $v$ to live in we would have to adjust the definition of $p$ & $g$ so $k_iv_{i,j}$ is well defined.

A first look at our setup tells us we have properties #1 & #2, as long as our space chosen for $p$ is sufficiently large such that $\text{Pr}[k_av_{a,b}=k_cv_{c,d}, a \neq c] < \epsilon$. Property #3 is not achieved because any adversary that can factor the polynomial at least once will be able to access a users identity and know which subset they are in, in polynomial time.

It may be possible to achieve #3 by embedding the polynomial into an elliptic curve group via the function $p \to G^p$ where $G$ is a generator of the group. The intractability of the discrete logarithm problem there could be what gives us #3.

Richard Thiessen avatar
mx flag
Why use polynomials at all? For a user's secret ID `u` ∈ `U`, map it through `f(u,salt) -> x` where `x` is a 128 bit number and `salt` is a per set value to prevent cross set correlations, and then stick the value in a list, possibly with extra random dummy `x` values to hide the set size. More efficient structures like bloom filters could also work. Without knowing a user's secret ID `u`, an adversary can't find `x` and so can't determine membership. If the space of all `U` values can't be practically searched this is secure.
Stent avatar
no flag
This is definitely better than the polynomial suggestion but the size of the commitment is rather large, O(n) if there are n users. Ideally the commitment size should be O(1) per set, and the time taken for the user to determine which set they are in O(m) where m is the number of sets
Richard Thiessen avatar
mx flag
The set membership data structure in the `X` domain has no special cryptographic requirements. Probabilistic data structures like bloom filters or [one of their many alternatives](https://lemire.me/blog/2019/12/19/xor-filters-faster-and-smaller-than-bloom-filters/) are a good match. These have some probability of accidentally including some other values not in `V` but this can be checked for during set generation. Cost to generate the set is linear in total number of entries since we need to map all `u` to their corresponding `x`. Set membership tests are constant-ish time.
Richard Thiessen avatar
mx flag
If the generated set does accidentally include a value in `U`, pick a new salt and try again, obviously. If this takes too long increase the size of the (bloom filter)/whatever to reduce collision probability.
Richard Thiessen avatar
mx flag
"Ideally the commitment size should be O(1) per set" There are information theoretic bounds on how small such a structure can be. The structure has to get bigger for a larger number of ID values since it must represent more possible combinations of IDs. For `n` IDs you need at least `n` bits to represent an arbitrary combination (yes/no for each possible member).
Score:0
mx flag

The notion of mapping from the U domain to some other domain X (EG:128 bit strings) is mostly all you need. With a secret mapping from U to X we then need a data structure that does g(x,struct)-->{True/False} to indicate set membership. The data structure should additionally hide the number of true/false values.

Data Structure size

The minimum number of bits to represent an arbitrary combination of n items is n bits. One bit to indicate if each item is present in the set. The extra secrecy requirements in this application (don't indicate how many items are in set) will likely require >2n bits to make this work.

Hiding the total number of IDs issued isn't really possible except by making the data structure very large so that a very large n is possible. There's an obvious tradeoff though.

False positive rate and new IDs

Efficient g(x,struct)-->{True/False} data structures will not handle new x values well. There will be some false positive rate if people test new IDs against old sets.

A simple fix for this is to include a date_created for each ID and set and then an ID is not part of a set unless it was created before that set meaning the set creation took it into account.

If you want to build a distributed system with no centrally managed list of IDs there will be a nonzero false positive rate and a big bloom filter or similar data structure is the best option.

Data structure candidates

One option is a probabilistic set membership testing data structure like bloom filters or one of their many alternatives. The problem with these is they're designed mostly to have no false negatives (IE:g(x)-->{true} for all set members) with tunable false positive rates. Getting these to not leak information about the size of the set is hard. Naive approaches like padding the set out with random values are hard to implement efficiently without leaking some probabilistic information about set sizes.

A better option might be a hash table using an approach similar to pre-computed perfect hash functions. Since we're storing True/False values this can be quite compact.

The hash table is made up of a series of progressively smaller arrays that contain True/False/Collide ternary values (5 values can be packed in 8 bits). Each array is a hash table with it's own hash function. Keys are initially placed in the largest array. If there's a collision with keys having different true/false values they get put in the next one down etc. During construction we track an additional don't care state for each bucket indicating it can be set to an arbitrary value. After filling the table, we modify the don't care buckets so the final structure's distribution of True/False/Collide values does not leak info about set size.

To test membership, check for the key in the first array, if the bucket contains a collide value, check the next one etc.

Removing ID secrecy requirements for distributed applications

One interesting choice for x = f(u,salt) is to have both u and salt be elliptic curve public key points with f being the Diffie Hellman shared secret.

Protocol participants pick an ID as a public point u=t*G

Anyone can then pick a secret salt value y and calculate x=u*y (the resulting DH shared secret) for each client's public key u. The corresponding public point salt=y*G is published in the final data structure.

To test set membership, anyone knowing the secret key t for a given ID can re-calculate x=salt*t. They can also create a proof that x=salt*t for their private key u=t*G and publish this to show they are/aren't a member of a given set.

Stent avatar
no flag
Thanks for all the info! Would the hash table hide the number of elements in each subset?
Richard Thiessen avatar
mx flag
Depends, if constructed naively, no, collisions between ID's with the same `{true/false}` value do mean there isn't a 1:1 correspondence between hash table entries and ID's. Still after applying a statistical correction factor it'll be enough to get a very good estimate of the set size. The `don't care` marked entries can be changed to `True/false/collide` so the statistical properties of an all `true`, all `false` and `50:50` split set are identical. That might require making the table a bit bigger.
I sit in a Tesla and translated this thread with Ai:

mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.