Score:8

Ubuntu

Compare new txt file with old txt file and remove all data that matches

Robert Carroll

6/22/24, 6:51 PM

I have a new file with the following data, one entry per line (newline-separated)

a
a
b
c
d
d

I have an old file that is also newline-separated

b
d

How do I remove b & d from the new file and remove one of the a's from the first file?

The desired output, again one entry per line, would be

a
c

I have tried sort -u, which removes the b & d but also removes both a's. I have also tried grep -vxFf; however, the duplicates from the new file are still there.

973

7 + 3

command-line

text-processing

glenn jackman

6/22/24, 7:38 PM

Look into the `uniq` command to remove duplicates

Robert Carroll

6/22/24, 8:03 PM

I did that with `sort -u` (uniq) and it removes both a's. I would like one of the a's to stay. Maybe a merge command would be better?

Barmar

6/23/24, 11:15 PM

You say "remove one of the a's". Does that mean if there are 3 a's, the result should have 2 of them?

Score:11

Ubuntu

user unknown

6/22/24, 8:06 PM

 grep -F -f oldfile -v newfile | uniq

Use oldfile as the pattern list for grep, then remove duplicate lines at the end.
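A minimal sketch of this pipeline on the sample data from the question, run in a scratch directory. The file names are placeholders, and -x is added here as an assumption that only whole-line matches should be removed; without it, -F patterns also match as substrings.

```shell
#!/bin/sh
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data from the question (placeholder file names).
printf '%s\n' a a b c d d > newfile
printf '%s\n' b d > oldfile

# -F: treat patterns as fixed strings, -f oldfile: read patterns from oldfile,
# -v: keep only non-matching lines, -x (assumption): match whole lines only.
# uniq then collapses adjacent duplicates.
result=$(grep -F -x -v -f oldfile newfile | uniq)

printf '%s\n' "$result"   # prints a and c, one per line
```

Note that uniq only collapses duplicates that sit next to each other, which is the case in the sample data.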

+ 5

Robert Carroll

6/22/24, 8:24 PM

Thank you!! that worked :) I really appreciate it.

vanadium

6/23/24, 7:42 AM

If this worked for you, then please show your appreciation by "accepting" it: click the checkmark next to the answer. This also shows other users of this site that a useful answer is available here. Feel free also to upvote other answers with different approaches that also work.

dedunumax

6/23/24, 11:50 AM

`grep -F -f oldfile -v newfile | sort | uniq` will guarantee the unique results.

Robert Carroll

6/23/24, 10:15 PM

Thank you @dedunu. I will add the sort and test it out. I appreciate all of the replies.

dedunumax

6/24/24, 5:05 PM

https://gist.github.com/dedunumax/38fc581d337df9b442f4bffce3960492 https://www.onlinegdb.com/LNJSGfsFB this might help!

Score:7

Ubuntu

steeldriver

6/22/24, 9:02 PM

Using awk, print only the lines of newfile that haven't previously occurred in either file:

awk '!(seen[$0]++ || NR==FNR)' oldfile newfile
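For readers who find the one-liner dense, here is a sketch of the same idea split into two rules, run on the sample data from the question. The two-rule layout is an illustrative rewrite, not the form given in the answer, and the file names are placeholders.

```shell
#!/bin/sh
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data from the question (placeholder file names).
printf '%s\n' a a b c d d > newfile
printf '%s\n' b d > oldfile

result=$(awk '
    # NR==FNR holds only while the first file (oldfile) is being read:
    # remember each of its lines and print nothing.
    NR == FNR { seen[$0] = 1; next }
    # For newfile, print a line only the first time it is encountered.
    !seen[$0]++
' oldfile newfile)

printf '%s\n' "$result"   # prints a and c, one per line
```

One caveat that applies to both forms: if oldfile is empty, NR==FNR stays true while newfile is read, so nothing is printed.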

+ 2

Robert Carroll

6/23/24, 10:20 PM

Thank you @steeldriver. Is this better than the grep I am using? I'll try it, but I am happy with grep.

steeldriver

6/23/24, 10:57 PM

@RobertCarroll it's not necessarily *better*, but it is subtly *different* from the [currently accepted answer](https://askubuntu.com/a/1474183/178692) in that it will remove duplicates wherever they occur, without sorting, whereas `grep ... | uniq` will only remove *adjacent* duplicates, while `grep ... | sort | uniq` will remove all duplicates but may result in re-ordered output. So it depends what you want.
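To make the difference concrete, here is a small sketch that runs the three variants on a hypothetical input where the duplicates are not adjacent. The input data and variable names are made up for illustration.

```shell
#!/bin/sh
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical input: the two "c" lines are separated by other lines.
printf '%s\n' c a b c > newfile
printf '%s\n' b > oldfile

# uniq alone only collapses adjacent duplicates, so both "c" lines survive.
adjacent_only=$(grep -F -f oldfile -v newfile | uniq)

# sort | uniq removes every duplicate but reorders the output.
sorted_unique=$(grep -F -f oldfile -v newfile | sort | uniq)

# awk keeps the first occurrence of each line in its original position.
first_seen=$(awk '!(seen[$0]++ || NR==FNR)' oldfile newfile)

printf 'grep | uniq:\n%s\n\n' "$adjacent_only"          # c a c
printf 'grep | sort | uniq:\n%s\n\n' "$sorted_unique"   # a c
printf 'awk:\n%s\n' "$first_seen"                       # c a
```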

Score:3

Ubuntu

Guntram Blohm

6/23/24, 7:31 AM

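If you can sort the files (which I assume, as you said you tried sort -u), you can run comm oldfile.sorted newfile.sorted, which shows the contents in three columns: old file only, new file only, both files. The -1, -2, -3 options allow you to suppress some of the columns, so comm -13 oldfile.sorted newfile.sorted | uniq should do what you want, provided the files were sorted with sort -u. comm pairs lines one-to-one, so with a plain sort the second d in the new file would have no partner in the old file and would be reported as new-file-only.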
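A minimal sketch of the comm approach on the sample data from the question, assuming the inputs are sorted with sort -u first. The plain-sort variant is included only to show how duplicates change the result; the file and variable names are placeholders.

```shell
#!/bin/sh
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data from the question (placeholder file names).
printf '%s\n' a a b c d d > newfile
printf '%s\n' b d > oldfile

# comm expects sorted input; -u also drops duplicate lines up front.
sort -u oldfile > oldfile.sorted
sort -u newfile > newfile.sorted

# -1 hides lines only in oldfile, -3 hides lines present in both files,
# leaving the lines that occur only in newfile.
result=$(comm -13 oldfile.sorted newfile.sorted)
printf '%s\n' "$result"            # prints a and c

# For contrast: with duplicates kept in newfile, the second "d" has no
# partner in oldfile and therefore shows up as a new-file-only line.
sort newfile > newfile.plain
with_plain_sort=$(comm -13 oldfile.sorted newfile.plain | uniq)
printf '%s\n' "$with_plain_sort"   # prints a, c and d
```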

+ 0

Score:1

Ubuntu

waltinator

6/22/24, 7:00 PM

Read man grep and do something like:

grep -F -f oldfile -v newfile

+ 1

Robert Carroll

6/22/24, 7:36 PM

That doesn't work either. I still have the issue of both a's ending up in updatedFile. `grep -vxFf oldfile newfile > updatedFile` produced the same result.

Score:1

Ubuntu

Raffa

6/25/24, 3:53 PM

In perl:

perl -ne 'print if ! ( $seen{$_}++ || $#ARGV == 0 )' oldfile newfile

Or:

perl -ne '( $seen{$_}++ || $#ARGV == 0 ) || print' oldfile newfile

+ 0

Score:0

Ubuntu

zzz

6/27/24, 8:33 AM

Use comm:

comm -1 -3 <(sort -u old) <(sort -u new)

+ 0

Score:0

Ubuntu

Mike

6/28/24, 10:56 AM

Though the grep answer works for you, sort and uniq can do the trick also.

This assumes the construct <(command) is available (e.g. in bash). The shell replaces the construct with a file name (typically /dev/fd/N or a named pipe) from which the output of command can be read.

This will work, with your example:

sort <(uniq new) old | uniq -u

Though more general would be:

sort <(uniq new) old old | uniq -u

Which also works if the old file contains lines that are not in new.

What happens is that first the new file has its duplicate lines removed. (This assumes that duplicates in the new file are adjacent. Otherwise replace <(uniq new) by <(sort -u new).)

Then the output of sort ... is taken as input to uniq -u.

uniq -u prints only the lines that occur exactly once.

Because the old file is presented twice to the sort ... command, none of the lines in old will make it to the output of uniq -u.

Also any lines common between old and new will not be present.
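The steps above can be traced with a short sketch. Assumptions: a temporary file new.dedup stands in for <(uniq new) so that it also runs in plain sh, and an extra line e (not in the question's data) is added to old to exercise the more general form.

```shell
#!/bin/sh
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' a a b c d d > new
printf '%s\n' b d e > old    # "e" is a hypothetical line that exists only in old

# Step 1: remove adjacent duplicates from new.
uniq new > new.dedup                          # a b c d

# Step 2: merge with old listed twice, so every old line occurs at least twice.
merged=$(sort new.dedup old old)              # a b b b c d d d e e

# Step 3: keep only the lines that occur exactly once.
result=$(printf '%s\n' "$merged" | uniq -u)

printf '%s\n' "$result"                       # prints a and c
```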

sort and uniq can be combined in several ways to perform mathematical set operations:

Set union: f ∪ g

sort f g | uniq

Set intersection: f ∩ g

sort f g | uniq -d ## uniq -d only prints duplicate lines

Set difference: f \ g

sort f g g | uniq -u ## uniq -u only prints unique lines

Note that the files f and g represent sets. So they must not have duplicate lines. If that is not the case, replace e.g. f by <(sort -u f).
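As a sketch, the three operations can be wrapped in small shell functions that deduplicate each input first, so the "no duplicate lines" requirement is met automatically. The function names and the sample contents of f and g are hypothetical.

```shell
#!/bin/sh
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample sets; f deliberately contains duplicate lines.
printf '%s\n' a a b c d d > f
printf '%s\n' b d e > g

# Union: every line that occurs in either file, once.
set_union() { sort -u "$1" "$2"; }

# Intersection: after deduplicating each file, a line seen twice is in both.
set_intersection() { { sort -u "$1"; sort -u "$2"; } | sort | uniq -d; }

# Difference $1 \ $2: feeding $2 twice guarantees none of its lines is unique.
set_difference() { { sort -u "$1"; sort -u "$2"; sort -u "$2"; } | sort | uniq -u; }

set_union f g          # a b c d e
set_intersection f g   # b d
set_difference f g     # a c
```

Here set_difference f g plays the role of "new minus old" from the question.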

+ 0
