Score:1

Compression algorithm with multiple valid same-sized outputs

lb flag

Is there a lossless compression algorithm that has hashing-like properties where there are multiple solutions to it?

As in, for example: when a 1000-bit data sequence is compressed into a 500-bit data sequence, there are multiple possible 500-bit data sequences that can be generated as outputs. Each of these 500-bit data sequences, once decompressed, would output the original 1000-bit data sequence. Additionally, upon decompression, any one of the 500-bit data sequences would also have multiple 1000-bit data sequence solutions upon decompression (around $2^{1000}/2^{500}$ valid solutions). Collisions are fine, since otherwise it would be impossible for multiple 500-bit data sequences to be valid compression outputs of the 1000-bit data sequence.

The number of "pigeonholes" available for compression and decompression doesn't need to be symmetric. Also, it may not be appropriate to call a compression algorithm "lossless" when it produces multiple valid outputs when compressing and multiple valid same-sized outputs when decompressing. But all the original data would still be there; the algorithm would just generate additional data on top of it. This additional data can be dealt with (without help, the decompression would be stochastic).

One more thing: the compression degree should be customizable (e.g., from 1000 bits to 900 bits, or from 1000 bits to 950 bits, etc.), and the same goes for decompression. Otherwise, if such an algorithm requires fixed bit-size-change steps, then those steps need to be symmetric and fine enough to allow multiple repeats, for example 1000 bits to 900 bits to 800 bits to 700 bits, and so on, with the decompression process backtracking through those exact sizes.
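To make the counting above concrete, here is a toy sketch in Python (my own illustration with made-up sizes, not an existing algorithm) that models the decompression half of this setup at 8-bit → 4-bit scale: "compression" simply drops the high bits, so every 4-bit output has exactly $2^8/2^4 = 16$ valid preimages, and "decompression" has to guess the missing bits, which is the stochastic behaviour described above. (Compression here is still deterministic, so the multiple-valid-compressed-outputs half of the question is not modelled.)

```python
import random

IN_BITS, OUT_BITS = 8, 4  # toy stand-ins for 1000 and 500

def compress(x: int) -> int:
    """Keep only the low OUT_BITS bits; the high bits are discarded."""
    return x & ((1 << OUT_BITS) - 1)

def decompress(y: int) -> int:
    """Guess the discarded high bits at random (stochastic without help)."""
    guess = random.getrandbits(IN_BITS - OUT_BITS)
    return (guess << OUT_BITS) | y

# Every 4-bit output has exactly 2^8 / 2^4 = 16 valid preimages:
y = compress(0b1011_0011)
preimages = [x for x in range(1 << IN_BITS) if compress(x) == y]
print(len(preimages))  # 16
```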

poncho avatar
my flag
"Additionally, upon decompression, any one of the 500-bit data sequences would also have multiple 1000-bit data sequence solutions upon decompression (around (2^1000/2^500) valid solutions)." - how is this compatible with the assumption that the compression is "lossless"? If there are 2^500 possible decompression outputs, how does the decompressor know which one is correct?
ph flag
The difference between the original 1000 bits and the decompressed 1000 bits is the loss. Lossless means there is no difference.
kodlu avatar
sa flag
You need to clarify what your goal is; it seems you don't understand the meaning of "lossless".
ph flag
@fgrieu I think we're saying the same thing, or at least I was trying to. "Without any change" = "no difference".
D.W. avatar
fr flag
Cross-posted: https://cs.stackexchange.com/q/159858/755, https://crypto.stackexchange.com/q/106305/351. Please [do not post the same question on multiple sites](https://meta.stackexchange.com/q/64068).
Score:0
ng flag

Is there a lossless compression algorithm that has hashing-like properties where there are multiple solutions to it?

No. A lossless compression algorithm† can't map two different inputs to the same output, unlike a hash.

In the question, the proposition "Additionally, upon decompression, any one of the 500-bit data sequences would also have multiple 1000-bit data sequence solutions upon decompression" is wrong. In lossless compression, decompression (of the result of compression) is always deterministic. And "a 1000-bit data-sequence is compressed into a 500-bit data sequence" can apply to at most $2^{500}$ of the $2^{1000}$ existing 1000-bit data sequences.
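The pigeonhole count is easy to verify exhaustively at toy sizes. The sketch below (an illustration of the answer's counting argument, with an arbitrarily chosen toy compressor) shows the collision that the pigeonhole principle guarantees for any function from 4-bit strings to 3-bit strings:

```python
# There are 2^4 = 16 possible 4-bit inputs but only 2^3 = 8 possible
# 3-bit outputs, so *any* function between them must collide somewhere.
def toy_compress(x: int) -> int:  # an arbitrary candidate compressor
    return x % 8

seen = {}
for x in range(16):
    y = toy_compress(x)
    if y in seen:
        print(f"collision: {seen[y]:04b} and {x:04b} both map to {y:03b}")
        break
    seen[y] = x
```

A decompressor receiving the colliding 3-bit output cannot tell the two inputs apart, which is exactly what disqualifies such a function as lossless compression.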


† That is, formally, a pair of algorithms called compression and decompression, each accepting a finite amount of data as input and producing a finite amount of data as output, such that feeding the output of compression to the input of decompression ensures that the output of decompression exactly matches whatever the input of compression was.
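As a sanity check of that definition against a real codec (zlib from Python's standard library, an arbitrary choice for illustration): the round trip is exact and decompression is a deterministic function of the compressed bytes.

```python
import os
import zlib

data = os.urandom(1000 // 8)       # a random 1000-bit message

c = zlib.compress(data)
assert zlib.decompress(c) == data  # round trip: the input is recovered exactly

# Decompression is a deterministic function of the compressed bytes:
assert zlib.decompress(c) == zlib.decompress(c)

# Random data is incompressible, so c is typically *longer* than data,
# illustrating that no lossless scheme can shrink every input.
print(len(data), len(c))
```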

Score:0
in flag

The background idea of compression algorithms is due to the Typical Set and Asymptotic Equipartition Property (AEP) concepts from Information Theory.

So, roughly speaking, the compressed message is a rewriting of the original symbols. If some snippet, e.g. $0000:0000$, is more frequent in the original message, AEP gives us a measure of the most economical symbol that uniquely represents it, taking advantage of its redundancy: the same content is rewritten with fewer bits. The general way to achieve that is to build a mapping table, which is a completely deterministic thing.
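For instance, a Huffman code, one standard way of building such a mapping table, deterministically assigns shorter codewords to more frequent symbols. A minimal sketch (the message and symbols are made up for illustration):

```python
import heapq
from collections import Counter

def huffman_table(message: str) -> dict:
    """Build a deterministic prefix-code table: frequent symbols get shorter codes."""
    freq = Counter(message)
    # Heap entries: (frequency, tie-breaker, partial {symbol: codeword} table).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# '0' dominates this message, so it gets the shortest codeword:
table = huffman_table("000000000001:2")
print(table)  # e.g. {'0': '0', '1': '100', ':': '101', '2': '11'} (shapes vary)
```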

So, in the given reference we can read: "First, let's note that a lossless coding scheme needs to satisfy the following requirement:"

  • Unique decodability: $x_i^n \neq x_i^m \implies C(x_i^n) \neq C(x_i^m)$
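Continuing the sketch above, the prefix property of that table is what delivers this requirement: distinct messages yield distinct codes, and decoding is deterministic. A quick check, reusing the hypothetical `table` from the previous sketch:

```python
def encode(message: str, table: dict) -> str:
    return "".join(table[s] for s in message)

def decode(bits: str, table: dict) -> str:
    inverse = {c: s for s, c in table.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:  # prefix property: at most one codeword matches
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

# No codeword is a prefix of another...
codes = list(table.values())
assert not any(a != b and b.startswith(a) for a in codes for b in codes)

# ...so decoding is deterministic and the round trip is exact:
msg = "0001:2000"
assert decode(encode(msg, table), table) == msg
```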