Score:4

How do I combine two files while excluding lines that exist in both files?


I have 2 files that cannot be sorted. Each contains a list of words, one per line. I am trying to compare the two files and create a new one without any lines that match between them. This means that if a line in file A is also found in file B, it should not appear in the output.

There is a huge issue with many questions and sites whose titles say "Deleting Duplicates" when they in fact mean "Merging Duplicates & Showing A Unique One". These 2 things are very different: the latter does not actually delete duplicate lines, it only merges them.

For this particular case I need to DELETE THEM for real: if lines are found in both files, they should not appear in the result.

I have already tested comm, and it fails. I have also tested several other approaches I have seen, such as awk and grep. The rules for both files are as follows:

  • They have different sizes (they do not have the same number of lines)
  • A line counts as a duplicate only if the whole line matches a line anywhere in the other file
  • The files cannot be sorted

Here is some information about the files: they contain lists of emails, one email per line. Because they are not the same size, they do not contain all the same emails, and within each file every email is unique. It is just that some emails might be in both files. In those cases, the output should not show those emails.

FedKad
What is the reason that "_Files can not be sorted_"?
N0rbert
I'm not sure, but you can try `dwdiff` utility for comparison; see https://askubuntu.com/a/1073389/66509 for reference.
steeldriver
Can duplicates occur *within* either file? In what order should the results be merged? Please consider supplying a minimal example.
Luis Alvarado
@steeldriver, there are no duplicates (thank god) within each file. All of the lines are unique.
steeldriver
*"if a line on file A is found on file B, it should not show as an output result"* sounds like `grep -vFxf fileB fileA`, whereas *" if they are found in both files, they do not show as a result"* sounds like `awk '!seen[$0]++' fileA fileB`. This is where a short, representative example would be helpful.
Luis Alvarado
@steeldriver Thank you friend, I have tested both of them but with no luck. The output is still wrong. For example, one file has 700 emails and the other has 80. I know for a fact that almost all of the 80 emails are duplicates of ones in the 700 file, so the output should contain around 620 emails.
Luis Alvarado
Correction: I just tested the 1st one again and noticed the f at the end, which I had missed. With it, your grep command worked, although the awk one did not show the correct results. Would you please post this as an answer, since it ACTUALLY worked for me after several hours?
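The three behaviours discussed in the comments can be told apart on a tiny example. The following is a minimal Python sketch, not code from any poster: the function names and the sample addresses are made up for illustration. `only_in_first` mirrors what `grep -vFxf fileB fileA` prints, `merged_unique` mirrors `awk '!seen[$0]++' fileA fileB` (which keeps one copy of each line rather than dropping shared ones), and `in_exactly_one` drops every line that appears in both inputs. None of them requires sorted input, and all keep the original order.

```python
def only_in_first(a, b):
    """Lines of a that do not occur in b, in a's order (like grep -vFxf fileB fileA)."""
    b_set = set(b)  # set membership tests avoid rescanning b for every line
    return [line for line in a if line not in b_set]


def merged_unique(a, b):
    """Every distinct line once, first occurrence kept (like awk '!seen[$0]++' fileA fileB)."""
    seen = set()
    result = []
    for line in a + b:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result


def in_exactly_one(a, b):
    """Lines that occur in only one of the two lists: a's leftovers first, then b's."""
    return only_in_first(a, b) + only_in_first(b, a)


# Made-up sample data, not taken from the question.
file_a = ["ann@example.com", "bob@example.com", "cat@example.com"]
file_b = ["bob@example.com", "dan@example.com"]

print(only_in_first(file_a, file_b))   # shared line removed, nothing from file_b added
print(merged_unique(file_a, file_b))   # shared line still present once
print(in_exactly_one(file_a, file_b))  # shared line removed, file_b's unique lines kept
```

With 700 lines in one file and roughly 80 shared lines, only the first and third functions would shrink the output; the asker's expected count of around 620 matches the first one.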
Score:1

There are more efficient ways, but here is a solution. I was unsure how you would want the files merged, so in this solution the lines unique to file1 are written to the new file first, followed by the lines unique to file2.

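# remove_dupes.py
# Write the lines that appear in only one of two files, preserving their order.
import sys


def read_lines(path):
    # Strip line endings so a missing final newline does not hide a match.
    with open(path, "r") as f:
        return [line.rstrip("\r\n") for line in f]


lines1 = read_lines(sys.argv[1])
lines2 = read_lines(sys.argv[2])
# The output file name is optional and defaults to 'out'.
out_path = sys.argv[3] if len(sys.argv) > 3 else "out"

# Lines present in both files; set lookups avoid rescanning a list per line.
common = set(lines1) & set(lines2)

with open(out_path, "w") as outfile:
    # Lines unique to file1 come first, then lines unique to file2.
    for line in lines1 + lines2:
        if line not in common:
            outfile.write(line + "\n")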

To run:

python3 remove_dupes.py <file1> <file2> <output_file>

If you'd like to turn this into a quicker command-line tool, move the script to a permanent location and add the following line to your .bashrc, .bash_aliases, .zshrc, or equivalent file.

alias mydiff='python3 <path_to_script> '

You can replace 'mydiff' with whatever you'd like to call it. After that you can run the script with:

mydiff <file1> <file2> <output_file>