Score:0

Checking encoded strings for a hash collision in Python

in flag

There is a common term used in cryptography called a hash collision. If I am reading the definition correctly on Wikipedia, this can occur if two different data values give rise to the same hash value.

Duplicate hash, different input:

text1 encoded = hash1 
text2 encoded = hash1

The first code block is a binary value with a hash obtained from the digest() function, which I found on a website. The section code block is what I modified, which is what I'm understanding is a hash collision. Notice that the second code block is checking if the hash is a duplicate but the original string is different.

Can anyone explain if my second code block is a hash collision and if not, why? And explain how the first and second code blocks differ in terms of the definition.

https://www.learnpythonwithrune.org/birthday-paradox-and-hash-function-collisions-by-example/

Code Block #1:

import hashlib
import os

collision = 0
for _ in range(1000):
    lookup_table = {}
    for _ in range(16):
        random_binary = os.urandom(16)
        result = hashlib.md5(random_binary).digest()
        result = result[:1]
        if result not in lookup_table:
            lookup_table[result] = random_binary
        else:
            collision += 1
            break
print("Number of collisions:", collision, "out of", 1000)

Code Block #2:

Codes 0 through 31 and 127 (decimal) are unprintable control characters. Code 32 (decimal) is a nonprinting spacing character. Codes 33 through 126 (decimal) are printable graphic characters.

string.ascii_lowercase + string.ascii_uppercase + string.ascii_letters + string.digits + string.punctuation + string.whitespace + string.printable

import hashlib
import os
import random
import string

collision = 0
total_attempts = 100000
lookup_table = {}
for _ in range(total_attempts):

    str = ''.join(random.choice(string.printable) for i in range(3))
    str_encode = str.encode('utf-8')
    hash = hashlib.md5(str_encode).hexdigest()
    hash = hash[:3]
    
    if hash in lookup_table:
        if str not in lookup_table[hash]: # hash is the same; string is different
            collision += 1 
            print(lookup_table[hash] + '_' + hash)
            lookup_table[hash] = lookup_table[hash] + ';' + str
    else:
        lookup_table[hash] = ';' + str    
print("Number of collisions:", collision, "out of", total_attempts)
fgrieu avatar
ng flag
The question twice makes a confusion between code and what it attempts to perform, in "first code block is a binary value with a hash" and in "second code block is a hash collision".
Score:1
ng flag

A hash collision is the circumstance in which two distinct inputs of a hash function have the same hash. A colliding pair is two such distinct inputs. For examples of MD5 collisions and colliding pairs, see there.

The first code block finds and counts events that are mostly hash collisions for MD5 restricted to it's first byte, but could also be accidental collisions between the values returned by os.urandom(16) (the later are so extremely rare that it's negligible in the result). The experiment is restricted to collisions detected among 16 inputs, and is repeated 1000 times. The code seems written to illustrate a variant of the usual birthday problem.

The second code block (in version 7) can't find any hash collision, for two independent reasons:

  • It's attempted to find MD5 collisions in a naive way: by hashing random strings, but (by the aforementioned birthday problem) we would likely need in the order of $2^{64}$ hashes to find one MD5 collision in this way, and we only do $10^5=2^{16.6\ldots}$. Hint: shorten the hash (not the input of the hash) to it's first three bytes so that there are collisions to observe.
  • lookup_table never contains more than one element, for it's reset inside the loop; FIX THAT!

The code distinguish hash collisions from accidental collisions in the hashed strings.

Note: one simple way to avoid collisions among the inputs of the hash is to generate the inputs incrementally. This allows to simplify the code.

MacGyver avatar
in flag
I think I understand the first issue. You're simply saying there isn't enough variation of characters based on the string length, correct? By shortening the length, and by maybe introducing all ASCII characters, there would be a better probability, correct? My intention wasn't to find a collision, but to understand when it happens. Can you take another look and see if my logic is correct now?
MacGyver avatar
in flag
I'm using this to figure out duplicate hashes with DIFFERENT string input in an application, that should not exist. It'll probably never happen, but I'm just doing it as a precaution to check for them. What I learned today is that there are four ways to denote the number of variances for the string input. It's based on ordering the items and repeating items in each position. In our case, order matters and repeating items can happen with string input for a hash. https://math.stackexchange.com/a/4663863/37313
MacGyver avatar
in flag
The other answers in this question were not very comprehensive, so I got all combinations of the combinations, if that makes sense. :)
fgrieu avatar
ng flag
The issue in both bullets stands uncorrected and both are a show stopper.
MacGyver avatar
in flag
Okay, I fixed the second issue. The concatenation of ";" + str was in the wrong code block. Now it's nested in the second IF. As for the first issue, I will truncate the hash to the first three characters. Is that correct? hash = hash[:3]. In my actual program, I will use a 2 dimensional dictionary, where the 1st dimension is hash and the second dimension is str. But I just kept it simple here and just printed the collisions.
fgrieu avatar
ng flag
No the second issue is not fixed. Again, `lookup_table` never contains more than one element, for it's reset **inside** the loop, rather than **before** as it should. That's a programming error, and the fact it was not fixed yet shows the question is more about programming than about anything else. Don't worry; it takes practice to build a mind model of how (here, sequential) computing works. There's another lesser programming errors: the separator `;` occurs in `string.printable` and thus is not fully guaranteed to play it's intended role. Also: see the final note for faster/cleaner code.
MacGyver avatar
in flag
If I set it before the IF conditional, I cannot append additional str values for the same hash key. By having it in the loop, we set a single str value to the hash key when the hash is not found as a key in the dictionary. And if we do find the key but a different str value, we append. I could have used the append function as an alternative. I see what you mean with the ";". When I have a 2D dictionary, that won't happen.
fgrieu avatar
ng flag
Your main problem is with where `lookup_table = {}` is in the second code block. It must be moved up one line, and left one level. As is, there can't be any collision detected.
MacGyver avatar
in flag
Oh da! I never ran the code in the second block. I just wrote it based on the first code block. Thank you for your help on this.
MacGyver avatar
in flag
I was reading through some of the answers you have on this site. You must be an extremely brilliant person!
I sit in a Tesla and translated this thread with Ai:

mangohost

Post an answer

Most people don’t grasp that asking a lot of questions unlocks learning and improves interpersonal bonding. In Alison’s studies, for example, though people could accurately recall how many questions had been asked in their conversations, they didn’t intuit the link between questions and liking. Across four studies, in which participants were engaged in conversations themselves or read transcripts of others’ conversations, people tended not to realize that question asking would influence—or had influenced—the level of amity between the conversationalists.