In an online community, you sometimes have to ban certain IP addresses or even entire IP ranges due to abuse. You may hire moderators to help you with this, but you might not trust them enough to show them everybody’s plain IP address.
The question is: Is there a scheme to encrypt or hash IP addresses in a way that preserves prefixes, i.e. such that you can still do range bans, but cannot easily recover the original IP address from the encrypted/hashed address?
For example, let’s say the real (“plain”) IP address is 172.42.3.50 and the encrypted/hashed (“scrambled”) form of that address is 68.4.127.42. Then we want the following properties:
- If a moderator bans the scrambled range 68.4.127.42/24, this results in a ban for the plain range 172.42.3.50/24, i.e. 172.42.3.0 - 172.42.3.255.
- Likewise, a ban on 68.4.127.42/16 results in a ban for 172.42.3.50/16, i.e. 172.42.0.0 - 172.42.255.255.
- etc.
- A moderator cannot reconstruct the plain address 172.42.3.50 from the scrambled address 68.4.127.42 (with some exceptions, see below).
I can see some inherent weaknesses with this:
- If two scrambled addresses share the same prefix, then the moderator knows that the plain addresses must also share the same prefix. In particular, if the scrambled addresses are equal, then the moderator knows that the plain addresses are also equal (but this is actually desirable for this use case).
- Each moderator knows both their own plain address and their scrambled address. If the moderator’s scrambled address and another user’s scrambled address share a common prefix, then the moderator knows that their plain addresses also share a common prefix of equal length. For example, if the moderator has the plain address 172.42.3.50 and the scrambled address 68.4.127.42, and some other user has a scrambled address 68.4.x.x, then the moderator knows that that user has a plain address of 172.42.x.x. This way they could deduce that they live in the same area or use the same ISP.
- But it’s worth noting that this by itself does not give the moderator any new information they could not otherwise have obtained: Even if they did not know anything about the IP address of the other user and could only create bans blindly, they could still create a (short) ban for a user and observe if their own address is in the banned range. If it is in the range, then they know that they share a prefix with the user. By starting with a large range ban and making it progressively smaller, they could even determine exactly how long the common prefix is.
My idea was to use a scheme like the following. For simplicity, I’ll pretend we are dealing with an IPv4 address here and are encrypting octet-by-octet, even though this would restrict range bans to subnet sizes of /8, /16, /24 and /32. But the same principle could be used for bitwise encryption, where we would not have that restriction.
\begin{align}
x &: \text{The plain address consisting of four octets}\\
y &: \text{The scrambled address consisting of four octets}\\
k &: \text{A sufficently large fixed global salt (e.g. 128 bits)}\\
H &: \text{A cryptographic hash function}\\
\\
\text{Encryption:} \\
x &= x_1.x_2.x_3.x_4 \\
H_0 &= H(k) \\
H_1 &= H(k \mathbin {||} x_1) \\
H_2 &= H(k \mathbin {||} x_1 \mathbin {||} x_2) \\
H_3 &= H(k \mathbin {||} x_1 \mathbin {||} x_2 \mathbin {||} x_3) \\
y_1 &= H_0 \oplus x_1 \\
y_2 &= H_1 \oplus x_2 \\
y_3 &= H_2 \oplus x_3 \\
y_4 &= H_3 \oplus x_4 \\
y &= y_1.y_2.y_3.y_4 \\
\\
\text{Decryption:} \\
y &= y_1.y_2.y_3.y_4 \\
x_1 &= H_0 \oplus y_1; \quad H_1 = H(k \mathbin {||} x_1) \\
x_2 &= H_1 \oplus y_2; \quad H_2 = H(k \mathbin {||} x_1 \mathbin {||} x_2) \\
x_3 &= H_2 \oplus y_3; \quad H_3 = H(k \mathbin {||} x_1 \mathbin {||} x_2 \mathbin {||} x_3) \\
x_4 &= H_3 \oplus y_4 \\
x &= x_1.x_2.x_3.x_4 \\
\end{align}
Another variant that might (?) be more secure but slightly less convenient in practice would be to omit the xor step and just use $y = H_1.H_2.H_3.H_4$. This way you would get a one-way hash, instead.
Cryptography is not my area of expertise, so I want to know if I’m missing something obvious here, other than the inherent weaknesses I already pointed out. What I do not want is for someone to be able to decrypt all addresses just by knowing one address pair, for example. That would make the endeavor pointless. Another concern is the hash function: I’m thinking of using SHA2, but I’d only use 8 bits of the hash (or 1 bit in the case of bit-wise encryption) and I’m not sure what security implications this has, if any.
After having a second look, I believe I’m always leaking one extra “chunk” of information with this scheme. In particular, if the chunks are octets as in the above example, it would always be possible to recover the first octet, since $x_1 = y_1 \oplus H_0$ and $H_0 = H(k)$ is constant. More generally, we would always leak one more “chunk” than the length of the common prefix. But I believe it would not be possible to decrypt the remainder, as long as the hash function is strong. Here’s my analysis:
Let Eve be a moderator and Alice be a user. Eve can see her own plain address $\tilde{x}$ and her own scrambled address $\tilde{y}$. Eve can also see Alice’s scrambled address $y$, but not her plain address $x$. Eve wants to find out as much information as possible about Alice’s plain address $x$.
\begin{align}
\text{Eve’s plain address:} \quad
\tilde{x} &= \tilde{x}_1.\tilde{x}_2.\tilde{x}_3.\tilde{x}_4 \\
\text{Eve’s scrambled address:} \quad
\tilde{y} &= \tilde{y}_1.\tilde{y}_2.\tilde{y}_3.\tilde{y}_4 \\
\\
\text{Hence, Eve also knows $\tilde{H}_0..\tilde{H}_3$:} \quad
\tilde{y}_1 &= \tilde{x}_1 \oplus \tilde{H}_0 \implies
\tilde{H}_0 = \tilde{y}_1 \oplus \tilde{x}_1 \\
\tilde{y}_2 &= \tilde{x}_2 \oplus \tilde{H}_1 \implies
\tilde{H}_1 = \tilde{y}_2 \oplus \tilde{x}_2 \\
\tilde{y}_3 &= \tilde{x}_3 \oplus \tilde{H}_2 \implies
\tilde{H}_2 = \tilde{y}_3 \oplus \tilde{x}_3 \\
\tilde{y}_4 &= \tilde{x}_4 \oplus \tilde{H}_3 \implies
\tilde{H}_3 = \tilde{y}_41 \oplus \tilde{x}_4 \\
\\
\text{Alice’s scrambled address:} \quad
y &= {y}_1.{y}_2.{y}_3.{y}_4 \\
&= ({x}_1 \oplus {H}_0).
({x}_2 \oplus {H}_1).
({x}_3 \oplus {H}_2).
({x}_4 \oplus {H}_3)\\
\text{where}\quad H_0 &= H(k) \\
H_1 &= H(k \mathbin {||} x_1) \\
H_2 &= H(k \mathbin {||} x_1 \mathbin {||} x_2) \\
H_3 &= H(k \mathbin {||} x_1 \mathbin {||} x_2 \mathbin {||} x_3) \\
\\
\text{Using} \quad
H_0 &= H(k) = \tilde{H}_0 \\
\text{Eve can always decrypt $x_1$:} \quad
{y}_1 &= {x}_1 \oplus {H}_0 \\
\iff {y}_1 &= {x}_1 \oplus \tilde{H}_0 \\
\iff {x}_1 &= {y}_1 \oplus \tilde{H}_0 \\
\\
\text{Eve now tries to decrypt $x_2$:} \quad
\\
{y}_2 &= {x}_2 \oplus {H}_1 \\
\iff {x}_2 &= {y}_2 \oplus {H}_1 \\
\iff {x}_2 &= {y}_2 \oplus \underbrace{H(k \mathbin {||} x_1)} \\
&\text{If $x_1 \neq \tilde{x}_1$, Eve cannot compute this} \\
&\text{because she only knows $x_1$, but not $k$.} \\
&\text{Therefore she cannot decrypt $x_2$.} \\
&\text{If $x_1 = \tilde{x}_1$, then $H_1 = \tilde{H}_1$, which Eve} \\
&\text{knows, so she can decrypt $x_2$.}
\end{align}
The decryption attempts for $x_3, x_4$ and so on would then follow the same pattern as for $x_2$. In other words, as long as $x_2 \neq \tilde{x}_2$, Eve cannot decrypt $x_3$ and so on. Formally: Eve can decrypt $x_{n+1}$ if and only if $\forall i \in \{1..n\}: x_i = \tilde{x}_i$. At least that is my conclusion.
If the chunks were 1 bit in size, this means we would only leak one additional bit to Eve that she could not have known already (see third point in the list of inherent weaknesses above). I think that is not too bad. Certainly better than having no obfuscation at all.
I appreciate if you read this far. Thoughts? Does this make sense? Is there a better way for doing this?