Score:5

Does using multiple hashes (to check if a file has been spoofed) reduce collisions?

fi flag

I'm trying to create a script that will take a snapshot of the contents of a directory. For each file, all possible metadata will be recorded and written to the database. The point is that with some transfer of this directory, you can take a new snapshot and compare it with the original one taken in the "safe zone" to check whether any third-party changes have been made to the files.

Along with the metadata, I want to capture the hash of each file. I am aware of the existence of a collision problem and the possibility of file substitution so that its hash remains unchanged. Does using multiple hashes increase the complexity of implementing such an attack? Or does it not make sense?

And I am actually interested in this question from a mathematical point of view, if it is possible to consider it using formulas. If this is the case, how can I try to prove the usefulness/uselessness of using multiple hashes for my purpose mathematically?

Edit: Thanks for all the answers, it helped me a lot

Score:6
in flag

For the following answer I'll assume that the hashing is performed twice over the data (a 2-pass system) and that the output is the output of the two hash operations combined, i.e.:

$$\text{H}'(m) = \text{H}_1(m) \| \text{H}_2(m)$$

Along with the metadata, I want to capture the hash of each file. I am aware of the existence of a collision problem and the possibility of file substitution so that its hash remains unchanged.

Any unbroken hash function, at the time of evaluation, does not have this problem. In other words, if this problem is present then the hash has broken collision resistance; it is considered broken.

The old hash functions MD5 and SHA-1 have been broken for collision resistance, the MD5 function is broken instantaneously but for SHA-1 a lot of computation has to be performed - although having data files that are colliding is a problem all in itself.

Does using multiple hashes increase the complexity of implementing such an attack? Or does it not make sense?

It does make some sense. There has been a precedence of using this kind of scheme in a well known protocol as OpenSSL used signatures over both MD5 and SHA-1, which were already considered to be problematic at the time.

If this is the case, how can I try to prove the usefulness/uselessness of using multiple hashes for my purpose mathematically?

I don't think you can create a mathematical proof for any good degree. The premise would be that one of the hashing algorithms would be broken and that the other hash is not affected by this unknown attack. Of course you can start proving that you'd need a collision on both, but that's hardly any effort and can be understood without any math.

Beware that concatenating two hash outputs may not be as secure as you would expect. In short, it is not as strong as the output bitsize may indicate. This may be useful against breaking one of the algorithms, not so much to provide higher security in bits.


For very high, long term assurance you could use a SHA-2 and SHA-3 / SHAKE hash combined. If the double pass mechanism and double output size are worth it is highly questionable though. We've got pretty high confidence in the current hash functions especially since SHA-2 held up fine during the SHA-3 hash competition. If SHA-3 is too slow you might want to consider BLAKE3 as well.

For normal hashing using SHA-256 is good enough, and it is high speed as it is supported in hardware by most modern processors, which is perfect for file hashing.

sh flag
Arguably it's not "double pass" if you can do it in parallel.
Maarten Bodewes avatar
in flag
Huh, I thought it was "double pass" if you'd have to process the data twice. If you would create e.g. a MAC-then-encrypt method from HMAC and CTR, would that be single pass? Nice to see you around by the way :)
sh flag
I understand double-pass as "need the result from the first pass before doing the second one". If you have some kind of MAC-then-encrypt where you use the MAC result to generate your key (convergent encryption?), that would be double-pass, but if you do both in parallel (as is usual), it's single-pass. (Even encrypt-then-mac is usually single-pass for encryption (but for decryption you might want to do two passes, to not start the decryption before checking the MAC). The difference is that for double-pass you'd need to store the full data before giving any output, and can't do any streaming.
sh flag
(I'm occasionally looking in for hot network questions.)
Score:2
in flag

Concatenating 2 hash functions will give at least the collision resistence of any one of them, it may or may not give more then the best of the two.

If we use Merkle Damgard constructions and have multi-collisions from collision easily, and the weaker function doesn't need more than birthday collision to be practical. We get little to no extra safety from concatenating. This is the SHA-1 + MD5 case which can be solved with cost of aproximately $2^{67}$ hash operations.

If we have no practical attack on both, we gain security not measured in compute cost, we gain safety from future cryptanaytic advantages. We assume one might break in the future, but if we use two dissimilar functions the chnace of both being broken is noticeably reduced.

If both functions have(or will have) a better than generic (birthday) attack, we can hope these methods will be hard to combine and I would even say this is likely, but not something we can prove in the general case.

mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.