Score:1

Avoiding MD5 collision with secondary partial hashes


I am trying to design a VCS-like program that determines whether files are the same by comparing their MD5 hashes.

I then read about MD5 collisions here, and I wonder whether I can work around them with a secondary check that hashes only parts of the file whenever the first check leads to a collision.

What issues with this approach can already be foreseen?

bk2204:
Please don't use MD5 or even SHA-1 for this purpose (or any purpose). Git recently added support for SHA-256 specifically because the other two are terrible, and as kelalaka proposed, BLAKE2, BLAKE3, or SHA-2 are all great choices.
Score:3

This could be a serious problem in your case, because MD5 is vulnerable to identical-prefix collisions:

 |identical prefix | free part of file A | identical suffix |
 |identical prefix | free part of file B | identical suffix |
                                         ^
                 they have collision here| the rest is the same

Although finding MD5 collisions is very easy today when the attacker can control a middle block of the input, the probability may be lower if the files are not arbitrary, since that gives the attack fewer possible candidates for a collision.

In your case there is no attacker, and you are concerned about accidental collisions. In a VCS, heavily edited files naturally fit the shape of the collision scenario: same prefix, some changed parts, and an identical suffix. Your main problem will be deciding which part to test: just the second block (MD5 has 512-bit blocks), just the third block, the second and third blocks, and so on.

Why bother with MD5 plus a secondary check when better and faster alternatives exist?
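
To make the "which part to test" problem concrete, here is a minimal sketch of a partial hash over a range of 64-byte blocks. The helper `partial_md5` and the toy inputs `file_a` and `file_b` are hypothetical names invented for this illustration, and the inputs are not real collision pairs; they only mimic the prefix / changed part / suffix layout shown above.

```python
import hashlib

BLOCK_SIZE = 64  # MD5 processes its input in 512-bit (64-byte) blocks


def partial_md5(data: bytes, first_block: int, last_block: int) -> str:
    """Return the MD5 hex digest of blocks first_block..last_block (0-based, inclusive)."""
    start = first_block * BLOCK_SIZE
    end = (last_block + 1) * BLOCK_SIZE
    return hashlib.md5(data[start:end]).hexdigest()


# Toy files: identical prefix (block 0), differing middle (block 1), identical suffix (block 2).
prefix = b"P" * BLOCK_SIZE
suffix = b"S" * BLOCK_SIZE
file_a = prefix + b"A" * BLOCK_SIZE + suffix
file_b = prefix + b"B" * BLOCK_SIZE + suffix

# A secondary check only helps if it happens to cover the part that differs.
same_on_block_1 = partial_md5(file_a, 1, 1) == partial_md5(file_b, 1, 1)
same_on_block_2 = partial_md5(file_a, 2, 2) == partial_md5(file_b, 2, 2)
```

Here the check on block 2 reports the two files as equal even though they differ, because the difference sits entirely in block 1. Without knowing in advance where the files differ, any fixed choice of blocks can miss the change.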

  • BLAKE2 used to be the fastest option, and BLAKE3 is now even faster: BLAKE2 is roughly 2 times and BLAKE3 roughly 9 times faster than MD5. BLAKE2b with 512-bit output offers $2^{256}$-time collision resistance, so creating a collision is computationally infeasible. Note that BLAKE3 targets a 128-bit security level regardless of output length, so its collision resistance is about $2^{128}$, which is still infeasible to attack.
  • SHA-512 has almost the same speed as MD5 and offers far better collision resistance than MD5 can match by any means.
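
As a rough sketch of what the file comparison could look like with one of these hashes, the following uses only Python's standard `hashlib`, which provides `blake2b` and `sha512` (BLAKE3 is not in the standard library, so it is left out here). The function names `file_digest_hex` and `same_content` and the 1 MiB chunk size are assumptions made for this example, not part of any existing VCS.

```python
import hashlib

CHUNK_SIZE = 1 << 20  # read 1 MiB at a time so large files are not loaded whole


def file_digest_hex(path: str, algorithm: str = "blake2b") -> str:
    """Hash a file's full contents with BLAKE2b (512-bit output) or SHA-512."""
    if algorithm == "blake2b":
        hasher = hashlib.blake2b(digest_size=64)  # 64 bytes = 512-bit digest
    elif algorithm == "sha512":
        hasher = hashlib.sha512()
    else:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" at EOF.
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def same_content(path_a: str, path_b: str, algorithm: str = "blake2b") -> bool:
    """Treat two files as identical when their full-file digests match."""
    return file_digest_hex(path_a, algorithm) == file_digest_hex(path_b, algorithm)
```

With a collision-resistant hash, a single full-file digest comparison like `same_content("old.txt", "new.txt")` is enough, and no secondary partial check is needed.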

The conclusion of Corkami:

Kill MD5!

Unless you actively check for malformations or collisions blocks in files, don't use MD5!

It's not a cryptographic hash, it's a toy function!

ZachB:
https://github.blog/2017-03-20-sha-1-collision-detection-on-github-com/ is a more correct link about Git SHA-1 collisions. The collision there was engineered though, not "natural."
kelalaka:
Well, [Git was using SHA-1](https://www.zdnet.com/article/linus-torvalds-on-sha-1-and-git-the-sky-isnt-falling/), and now [they are improving it](https://github.blog/2021-09-01-improving-git-protocol-security-github/).
Morrolan:
Mind that Git's default, and in fact the only thing supported by major Git hosts such as GitHub, is *still* SHA-1. See e.g. [this answer](https://stackoverflow.com/a/65874596). That second link of yours is about GitHub dropping support for SHA-1 in conjunction with SSH. That is, it is about one mode of transport by which one can access Git, not about Git as a way to address content.