Score:0

Revealing percentiles of an ordered dataset without revealing its size

br flag
N J

Given an ordered set $S$ of positive integers (eg. $S=\{503, 503, 520, 551...N\}$) I want to be able to reveal the percentile rank (eg. 503 is in the top 10th percentile) for each element of a contiguous subset of $S$ (ie. $\{s_i,s_{i+1},... s_k\} \;|\; i \ge 0, k \lt N$). However I don't want to leak information that can be used to efficiently deduce $N$.

Using the formula for calculating a percentile rank of a given score from wikipedia:

$$P = \frac{\text{# values below score } s - (0.5 \times \text{# of scores with value }s)}{N}$$

We should be able to solve for $N$ with two percentiles $p_1$ and $p_2$ and the number of scores between them, $n$ using this formula.

$$ N = \frac{n}{p_2-p_1} $$

As a demonstration, given a randomly generated dataset of $N$ of $10,000$ and values

$p_1=0.0751, p_2 = 0.0951 \text{ and } n=200$

$$N = \frac{200}{0.0951-0.0751}=10,000$$

Is there anything that can be done to maintain as much accuracy as possible while preventing the efficient determination of $N$ (something like differential privacy)? If this is possible I'm assuming I'll need to introduce some randomness, however I'm not sure how to formulate how much will be required.

ph flag
Your prose description talks about "X is in the top 10th percentile" while your example gives an answer to 4 decimal places. Which model do you want to support?
br flag
N J
@bmm6o integer precision is fine (eg. 29th Percentile).
mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.