Using sed or awk to remove near-duplicates

pee2pee

5/4/23, 1:46 PM

I currently use the following to get as close as I can to a list of unique messages from a log file:

cut -d ' ' -f 3- /var/log/issues.log | sed -E 's/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}//g' | sort -u

So far it gets rid of the timestamp at the start of each line and removes the IP address.

However, I'm still left with dozens of lines of the following format(s):

Failed login from for A
Failed login from for B
Failed login from for C
Failed login from for D
Failed login from for E
Invalid heartbeat 'A' from 
Invalid heartbeat 'B' from 
Invalid heartbeat 'C' from 
Invalid heartbeat 'D' from
Invalid heartbeat 'E' from

How would I further amend my command to take these "near" duplicates away, leaving only the two lines below? A, B, C, D and E could be any string.

Failed login from for 
Invalid heartbeat from

Thanks


command-line

bash

sed

awk

Nate T

5/4/23, 3:06 PM

What is the input data, and what output are you trying for? You might check [U&L]; if yours is a common use case, I'm guessing that someone has already asked there.

Philippos

5/4/23, 3:11 PM

Why not add `/Failed login from for/d;/Invalid heartbeat.*from/d` to your `sed` command?
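
Note that the `d` commands drop the matching lines entirely. If the goal is to keep one representative line per message type, one possible approach is to substitute the variable part away and let `sort -u` collapse what is left. The following is only a sketch under assumptions: the sample input is made up (the real contents of /var/log/issues.log are not shown in this thread), the function name `normalize` is hypothetical, and the two patterns cover only the two message shapes quoted in the question.

```shell
#!/bin/sh
# Sketch only: normalize() is a hypothetical helper, not part of the original command.
normalize() {
  cut -d ' ' -f 3- |                                        # drop the two timestamp fields
    sed -E \
      -e 's/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}//g' \
      -e 's/^(Failed login from) +for .*/\1 for/' \
      -e "s/^(Invalid heartbeat) '[^']*' (from).*/\1 \2/" |
    sort -u                                                 # collapse the now-identical lines
}

# Made-up sample lines; the timestamp layout and IP positions are assumptions.
printf '%s\n' \
  "2023-05-04 13:46:01 Failed login from 10.0.0.1 for alice" \
  "2023-05-04 13:46:02 Failed login from 10.0.0.2 for bob" \
  "2023-05-04 13:46:03 Invalid heartbeat 'n1' from 10.0.0.3" \
  "2023-05-04 13:46:04 Invalid heartbeat 'n2' from 10.0.0.4" |
  normalize
```

Against the real log this would be run as `normalize < /var/log/issues.log`. Every new message shape needs its own `-e` expression, so this does not scale to arbitrary "near" duplicates; it only handles patterns you already know about.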

