How do you append the first pattern of a regular expression to the end of a line using sed?

Justin

12/19/22, 3:27 AM

I have a .fasta (text) file containing DNA sequence data in the format as follows:

>uce-8374_Genus_species
ACGTACGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTACGATCGCGGTATATCGGCGATTCGATCG

>uce-239_Genus_species
ATCGTAGCATGCGCTAGCTAGCTAGCTCGCGGTACGCATGTCTGACTGCGTCTGGTCGTACGATTACTACGACTGCG

>uce-83_Genus_species
ATCGATCTAGCGTAGCATGCGATCGATATCTGCGATCGACTCGATGCATGCATGCATCGATGCTAGCTAGCTAGCTA

>uce-902_Genus_species
AGCTGACTAGCTGGCGATACTGGCGATATCGGATTACGCGGCATATCGAGCGAGTCGATCGATGCATCTGATGCAGC

I am trying to append everything before the first underscore, preceded by a |, to the end of only the lines that start with >. So for example, the first header would read: >uce-8374_Genus_species|uce-8374, followed by the DNA sequence beneath it. Is there a way to do this in sed? I tried storing ^[^_]+(?=_) in a variable, but it didn't work: sed just appended the literal text ^[^_]+(?=_) to the end of the line instead of the text that pattern matches. Any help, as well as explanations (as I am new to regex), would be appreciated. If there is a better way to go about this, I am open to other options!

So far, I have tried the following (I will just show the first DNA sequence, but I want to change all of them):

sed -E 's/species/species|^[^_]+(?=_)/' sample_file.fasta

Result: uce-8374_Genus_species|^[^_]+(?=_)

and I have also tried:

x="^[^_]+(?=_)"
sed -E "s/species/species|$x/" "sample_file.fasta"

Result: uce-8374_Genus_species|^[^_]+(?=_)

command-line

text-processing

Score:2

Ubuntu

steeldriver

12/19/22, 3:35 AM

Unlike Perl, sed doesn't support the PCRE lookahead syntax (?=_) but you could fake it as follows:

- match `>` anchored to the start of the line: `^>`
- then match and capture zero or more non-`_` characters: `([^_]*)`
- then match everything else: `.*`

then replace with

- the entire matched pattern: `&`
- followed by a literal `|` and then the first captured group: `\1`

$ sed -E 's/^>([^_]*).*/&|\1/' sample_file.fasta 
>uce-8374_Genus_species|uce-8374
ACGTACGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTACGATCGCGGTATATCGGCGATTCGATCG

>uce-239_Genus_species|uce-239
ATCGTAGCATGCGCTAGCTAGCTAGCTCGCGGTACGCATGTCTGACTGCGTCTGGTCGTACGATTACTACGACTGCG

>uce-83_Genus_species|uce-83
ATCGATCTAGCGTAGCATGCGATCGATATCTGCGATCGACTCGATGCATGCATGCATCGATGCTAGCTAGCTAGCTA

>uce-902_Genus_species|uce-902
AGCTGACTAGCTGGCGATACTGGCGATATCGGATTACGCGGCATATCGAGCGAGTCGATCGATGCATCTGATGCAGC
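
To see what each half of the replacement contributes, here is a minimal sketch that runs the pieces separately on a single header line. The helper name `append_id` is made up for illustration, and the commands assume a sed that accepts `-E` (GNU and BSD sed both do).

```shell
#!/bin/sh
header='>uce-8374_Genus_species'

# \1 alone: only the text captured by ([^_]*),
# i.e. everything between > and the first underscore
printf '%s\n' "$header" | sed -E 's/^>([^_]*).*/\1/'

# & alone: the whole match, so the line comes out unchanged
printf '%s\n' "$header" | sed -E 's/^>([^_]*).*/&/'

# Regex syntax on the replacement side is not interpreted as a pattern;
# it is inserted as literal text, which is why the attempt in the
# question printed ^[^_]+(?=_) verbatim
printf '%s\n' "$header" | sed -E 's/species/species|^[^_]+(?=_)/'

# Hypothetical helper wrapping the full command. Lines that do not
# start with > never match, so sequence lines pass through untouched.
append_id() {
    sed -E 's/^>([^_]*).*/&|\1/' "$@"
}
printf '%s\n' "$header" 'ACGT' | append_id
```

The last command prints the header with `|uce-8374` appended and leaves the `ACGT` line as it was. Redirecting the output (for example `append_id sample_file.fasta > new_file.fasta`, where `new_file.fasta` is just a placeholder name) keeps the original file intact.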

Justin

12/19/22, 1:27 PM

That worked! Thank you! If you don't mind, can you explain what this is doing? Specifically, the second and third parts of the `sed` syntax here?

steeldriver

12/19/22, 1:40 PM

@Justin please see updated answer
