What hashing algorithm is fast and good enough for checking if source data is changed?

Gasim

11/5/23, 10:03 AM

I am not sure whether this counts as a crypto question, but it is about hashing algorithms. I have two directories: assets/ and cache/. Any time a file is added, deleted, or changed in the assets/ directory, a corresponding application-specific file is generated in the cache/ directory. On top of that, an additional "cache file" is created that stores the following information:

assetFile: my-file.png # relative to assets directory
cacheFile: my-file.ktx2 # relative to cache directory
assetHash: 709a0aef5d1ecda90fb3f3542aa71bef08b9fab8 # hash from contents of asset file
cacheHash: 0511420356589c5669c83daeff059d68078aef80 # hash from contents of cache file

These files are generally large. Some of them can be 15-20 MB in size.

The purpose is not security or privacy; it is only to check whether the source data has changed. So, in the example above, if my-file.png is changed, the application will hash the file's contents and compare the result against the hash stored in the "cache file". If the hashes do not match, it will recreate my-file.ktx2 and update the "cache file" with the new hash. The assets are generally added by the users themselves, so if they willfully tamper with this system, they will only be breaking their own workflow.
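A minimal sketch of the check described above, not the application's actual code. It assumes the "cache file" has already been parsed into a dictionary with the `assetHash` and `cacheHash` keys shown earlier; the helper names `file_digest` and `needs_rebuild` are hypothetical, and SHA-256 stands in for whichever hash is eventually chosen.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents (algorithm is a placeholder)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_rebuild(asset: Path, cache: Path, record: dict) -> bool:
    """Return True if the cache file must be regenerated from the asset.

    `record` is assumed to hold the stored 'assetHash' and 'cacheHash' values.
    """
    if not cache.exists():
        return True  # nothing has been generated yet
    if file_digest(asset) != record.get("assetHash"):
        return True  # the source asset changed since the record was written
    if file_digest(cache) != record.get("cacheHash"):
        return True  # the generated file was modified or corrupted
    return False
```

After a rebuild, the record would be rewritten with fresh digests of both files so that the next check compares against the new state.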

What kind of hashing algorithm can I use here that is fast enough to compute but also reliable enough to avoid false negatives (i.e., collisions)?

I am currently using SHA-256 and it is quite slow, especially on large files.


collision-resistance

CodesInChaos

11/5/23, 10:13 AM

Are you sure you're bottlenecked by hashing the file, and not by reading it from disk?

samuel-lucas6

11/5/23, 10:23 AM

You don't have to worry about collisions when using any modern hash function with a 256-bit output. For large files, accelerated BLAKE3 is probably the fastest you can get. However, there's an issue with some BLAKE3 implementations that causes bugs with large files (e.g. 2 GiB). Also, SHA-2 with acceleration is [fast too](https://github.com/BLAKE3-team/BLAKE3/issues/207).

DannyNiu

11/5/23, 12:02 PM

"user **them**selves; ... **tamper** with this system". Upon reading this part, randomized hashing using a static secret key came into my mind (think HMAC).

Gasim

11/5/23, 4:25 PM

@CodesInChaos This also came to mind, but I have not done any profiling yet; however, my test file is around 80 KB and it was quite slow. I also wanted to know if SHA-2 is overkill for my situation.

Gasim

11/5/23, 4:26 PM

@samuel-lucas6 What do you mean by acceleration in this case? Do CPUs have specialized instructions to make hashing faster?

samuel-lucas6

11/5/23, 7:11 PM

@Gasim Yes, there are different instruction sets that provide hardware acceleration, improving performance. In the grand scheme of things, these are not large file sizes. You should investigate the performance of how you're reading from disk and the SHA-2 implementation you're using. I have a program for hashing with various algorithms, and hashing a 35 MiB file with SHA-256 is practically instant.

Maarten Bodewes

11/6/23, 11:11 AM

As long as you stream or memory map it. It's amazing how many file hash & encryption routines on e.g. Stack Overflow simply read everything into memory first (and, if you are unlucky, perform some interesting and unnecessary encoding actions on top of that; for some reason or other, 3 in-memory copies seems to be the minimum).
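As an illustration of the streaming approach, here is a minimal Python sketch that hashes a file in fixed-size chunks using one reusable buffer, so memory use stays constant regardless of file size. The function name `hash_file_streamed` and the 1 MiB chunk size are assumptions chosen for the example, not recommendations from this thread.

```python
import hashlib


def hash_file_streamed(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file chunk by chunk, reusing a single buffer for every read."""
    hasher = hashlib.new(algorithm)
    buffer = bytearray(chunk_size)
    view = memoryview(buffer)
    # buffering=0 gives a raw file object, so readinto fills our buffer directly.
    with open(path, "rb", buffering=0) as f:
        while True:
            n = f.readinto(view)
            if not n:  # 0 bytes read means end of file
                break
            hasher.update(view[:n])
    return hasher.hexdigest()
```

The result is identical to hashing the whole file at once; only the peak memory use and the number of copies differ.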

CodesInChaos

11/7/23, 9:19 AM

I don't know what you mean when you say "quite slow", but even without specialized instructions SHA-256 should be able to hash several hundred megabytes per second per core. If you're much slower than that, you're either IO bound, or you're using an inefficient implementation.
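One way to tell an IO bottleneck from a slow hash implementation is to time the two steps separately. The sketch below is a rough measurement aid under stated assumptions: the helper name `time_read_and_hash` is hypothetical, it deliberately reads the whole file into memory so that the read and the hash can be timed apart, and a second run on the same file will usually be served from the operating system's cache, which makes the read look faster than a cold read from disk.

```python
import hashlib
import time


def time_read_and_hash(path):
    """Time reading a file and hashing its bytes as two separate steps."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        data = f.read()  # whole file in memory: acceptable for a one-off measurement
    after_read = time.perf_counter()

    digest = hashlib.sha256(data).hexdigest()
    after_hash = time.perf_counter()

    return {
        "bytes": len(data),
        "read_seconds": after_read - start,
        "hash_seconds": after_hash - after_read,
        "digest": digest,
    }
```

Dividing `bytes` by `hash_seconds` gives the hashing throughput, which can then be compared with the several hundred megabytes per second mentioned above.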
