"Hash collision" is a common term in cryptography. If I am reading the Wikipedia definition correctly, a collision occurs when two different inputs produce the same hash value.
Duplicate hash, different input:
text1 encoded = hash1
text2 encoded = hash1
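To make the definition concrete, here is a minimal sketch that searches for two different inputs sharing the same truncated MD5 value. The helper names `truncated_md5` and `find_collision` are made up for illustration, and the one-byte truncation is an assumption chosen so that a collision is guaranteed within 257 distinct inputs (pigeonhole principle). Full, untruncated MD5 collisions are far harder to find this way.

```python
import hashlib


def truncated_md5(data: bytes, n_bytes: int = 1) -> bytes:
    """Return the first n_bytes of the MD5 digest of data."""
    return hashlib.md5(data).digest()[:n_bytes]


def find_collision(n_bytes: int = 1):
    """Try the inputs b"0", b"1", b"2", ... until two share a truncated hash."""
    seen = {}  # truncated hash -> first input that produced it
    i = 0
    while True:
        data = str(i).encode("utf-8")
        h = truncated_md5(data, n_bytes)
        if h in seen:
            # Different input (every i is unique), same truncated hash.
            return seen[h], data, h
        seen[h] = data
        i += 1


text1, text2, hash1 = find_collision()
print(text1, "and", text2, "both map to", hash1.hex())
```

The two returned inputs are different by construction, so a hit here matches the "duplicate hash, different input" pattern above.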
The first code block, which I found on a website, hashes random binary values and takes the hash from the digest() function. The second code block is my modified version, which is what I understand to be a hash collision. Notice that the second code block checks whether the hash is a duplicate while the original string is different.
Can anyone explain whether my second code block produces a hash collision and, if not, why not? Please also explain how the first and second code blocks differ with respect to the definition.
https://www.learnpythonwithrune.org/birthday-paradox-and-hash-function-collisions-by-example/
Code Block #1:
import hashlib
import os

collision = 0
for _ in range(1000):
    lookup_table = {}
    for _ in range(16):
        random_binary = os.urandom(16)
        result = hashlib.md5(random_binary).digest()
        result = result[:1]
        if result not in lookup_table:
            lookup_table[result] = random_binary
        else:
            collision += 1
            break
print("Number of collisions:", collision, "out of", 1000)
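For reference, the count printed by Code Block #1 can be compared against the birthday-paradox estimate. The sketch below assumes the first byte of an MD5 digest is uniformly distributed over 256 values, so each trial is 16 independent draws into 256 buckets. The function name `prob_at_least_one_collision` is hypothetical.

```python
def prob_at_least_one_collision(samples: int, buckets: int) -> float:
    """Probability that at least two of `samples` uniform draws share a bucket."""
    p_no_collision = 1.0
    for i in range(samples):
        # The (i+1)-th draw must avoid the i buckets already taken.
        p_no_collision *= (buckets - i) / buckets
    return 1.0 - p_no_collision


p = prob_at_least_one_collision(16, 256)
print(f"P(at least one collision) = {p:.3f}")
print("Expected trials with a collision out of 1000:", round(1000 * p))
```

Under that assumption the probability comes out a little under 40%, so a result in the high 300s out of 1000 trials would be consistent with the estimate.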
Code Block #2:
# Codes 0 through 31 and 127 (decimal) are unprintable control
# characters. Code 32 (decimal) is a nonprinting spacing character.
# Codes 33 through 126 (decimal) are printable graphic characters.
# string.ascii_lowercase + string.ascii_uppercase + string.ascii_letters + string.digits + string.punctuation + string.whitespace + string.printable
import hashlib
import os
import random
import string

collision = 0
total_attempts = 100000
lookup_table = {}
for _ in range(total_attempts):
    str = ''.join(random.choice(string.printable) for i in range(3))
    str_encode = str.encode('utf-8')
    hash = hashlib.md5(str_encode).hexdigest()
    hash = hash[:3]
    if hash in lookup_table:
        if str not in lookup_table[hash]: # hash is the same; string is different
            collision += 1
            print(lookup_table[hash] + '_' + hash)
            lookup_table[hash] = lookup_table[hash] + ';' + str
    else:
        lookup_table[hash] = ';' + str
print("Number of collisions:", collision, "out of", total_attempts)
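A possible variant of Code Block #2 makes the "same hash, different string" check explicit by storing a set of inputs per truncated hash. This is only a sketch: the function name `count_truncated_collisions`, the `seed` parameter, and the names `text` and `prefix` (used to avoid shadowing the built-ins `str` and `hash`) are introduced here for illustration.

```python
import hashlib
import random
import string
from collections import defaultdict


def count_truncated_collisions(total_attempts, prefix_len=3, seed=None):
    """Count new inputs whose truncated MD5 matches that of a different, earlier input."""
    rng = random.Random(seed)
    inputs_by_hash = defaultdict(set)  # truncated hex digest -> distinct inputs seen
    collisions = 0
    for _ in range(total_attempts):
        text = "".join(rng.choice(string.printable) for _ in range(3))
        prefix = hashlib.md5(text.encode("utf-8")).hexdigest()[:prefix_len]
        bucket = inputs_by_hash[prefix]
        # Collision: the bucket already holds some input, and this one is new.
        if bucket and text not in bucket:
            collisions += 1
        bucket.add(text)
    return collisions


print("Number of collisions:", count_truncated_collisions(100000), "out of", 100000)
```

The set membership test replaces the substring test on a `;`-joined string. Because `string.printable` includes `;`, a 3-character input can contain the separator, and the substring test could then match across entry boundaries.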