Score:-1

Find file duplicates and convert them into links [WINDOWS]

gb flag

My users tend to save tons of duplicate files, which consume more and more space and drive up hardware and archiving costs.

I'm thinking of creating a scheduled job to:

  1. find duplicate files (by comparing MD5 checksums, not only file name / size)
  2. keep only one original file
  3. replace the other redundant copies with a link (shortcut) to the file kept in step 2
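
The three steps above can be sketched roughly as follows. This is only a minimal sketch, assuming Python 3 is available, that all files live on one NTFS volume, and that hard links (rather than .lnk shortcuts) are acceptable. The function names and the `.dupe-tmp` suffix are made up for illustration.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root_dir):
    """Group files under root_dir by (size, MD5); return groups of 2+ files."""
    by_size = defaultdict(list)
    for folder, _subdirs, names in os.walk(root_dir):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            groups[(size, file_md5(path))].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]


def replace_with_hardlinks(group, dry_run=True):
    """Keep the first file of a group and hard-link the others to it."""
    original, *copies = group
    for copy in copies:
        if os.path.samefile(original, copy):
            continue  # already the same file on disk
        print('link', copy, '->', original)
        if not dry_run:
            # Create the link under a temporary name first, so the copy
            # is only replaced once the link exists.
            temp_name = copy + '.dupe-tmp'  # hypothetical suffix
            os.link(original, temp_name)
            os.replace(temp_name, copy)
```

Calling `replace_with_hardlinks(group)` for each group returned by `find_duplicates(...)` only prints what would happen; passing `dry_run=False` applies the changes.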

Any idea how to achieve that?

Script / tool / tips?

EDIT 28.10.2021

In the meantime I've found finddupe: https://www.sentex.ca/~mwandel/finddupe/

It can create hard links to the original files. I've tried it: it correctly reports what is duplicated and seems to create the hard links, but I can't see any difference in the HDD usage stats afterwards.

Why is that? Could Windows be calculating free space incorrectly?
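
One way to check whether two paths really ended up as hard links to the same data, and what the volume reports as free space, might look like this. It is a minimal sketch assuming Python 3; `hardlink_info` and `free_bytes` are hypothetical helper names.

```python
import os
import shutil


def hardlink_info(path_a, path_b):
    """Return (same_file, link_count) for two paths."""
    same_file = os.path.samefile(path_a, path_b)  # True if both names share one file
    link_count = os.stat(path_a).st_nlink         # number of names pointing at the data
    return same_file, link_count


def free_bytes(path):
    """Return the free space reported for the volume that contains path."""
    return shutil.disk_usage(path).free
```

If `hardlink_info` returns `(True, 2)` for an original and its former copy, the link was created; comparing `free_bytes` before and after a run shows what the volume itself reports.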

Score:1
cn flag

I made a small Python script that does what you need.

It uses fdupes -r <dir> to list all duplicate files (even those with different names). It then iterates over the output, deletes the duplicates, and replaces each one with a symbolic link.

Uncomment the two commented-out lines at the end to actually apply the changes.

You may want to pass parameters to this script (such as the path); I'll leave that to you :)

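import os
import shlex
import subprocess

root_dir = '/home/user/directory'

# fdupes prints each set of duplicates as a block of paths;
# the blocks are separated by a blank line
output = subprocess.run(['fdupes', '-r', root_dir],
                        capture_output=True, text=True, check=True).stdout
blocks_of_dup_files = [block for block in output.split('\n\n') if block.strip()]

for block in blocks_of_dup_files:
    files = block.strip().split('\n')
    kept_file = files.pop()  # keep the last file of each set
    for file in files:
        # quote the paths so names containing spaces survive the shell
        rm_cmd = 'rm -f ' + shlex.quote(file)
        ln_cmd = 'ln -s ' + shlex.quote(kept_file) + ' ' + shlex.quote(file)
        print(rm_cmd)
        print(ln_cmd)

        #os.system(rm_cmd)
        #os.system(ln_cmd)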

gb flag
Thanks. It seems your solution is meant for Linux. I need something like that for Windows (sorry, I forgot to mention that in my post; now corrected).
gb flag
OK, I've found that this can be installed on Windows via Choco. I'll give it a try.
Score:1
in flag

For Windows I authored https://github.com/Caspeco/BlobBackup/tree/master/DuplicateFinder

You will need Visual Studio to compile the code. Note, though, that with links, if one "file" is modified then all of them are (or rather, there is only one file). That could be unwanted behaviour for users.
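
The shared-content behaviour of hard links can be illustrated with a small sketch like the one below. It assumes Python 3 and a file system that supports hard links; the function and file names are made up for the example and are not part of DuplicateFinder.

```python
import os
import tempfile


def demo_shared_content():
    """Write through one hard link and read the result through the other."""
    with tempfile.TemporaryDirectory() as folder:
        original = os.path.join(folder, 'report.txt')
        linked = os.path.join(folder, 'report_copy.txt')

        with open(original, 'w') as handle:
            handle.write('version 1')

        os.link(original, linked)  # second name for the same data

        # Overwrite the content in place through the linked name
        with open(linked, 'w') as handle:
            handle.write('version 2')

        # The original name now shows the new content as well
        with open(original) as handle:
            return handle.read()
```

Here the edit made through `report_copy.txt` is visible through `report.txt`, because both names refer to a single file.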

gb flag
Thanks for sharing this. I've compiled it, but I can't find any info about command-line parameters, how it works, etc. I did a quick check passing one parameter (the directory to scan) and it returned: Duplicate done 4 items traversed in xxx - but there was no info on whether duplicates were found (there are some), and no info about (hard) linking either.
in flag
It matches on size and checksums (only testing if duplicates are found). Files that are already linked are "skipped". If duplicates are found, it will print them.