Score:0

How do you append the first pattern of a regular expression to the end of a line using sed?

I have a .fasta (text) file containing DNA sequence data in the format as follows:

>uce-8374_Genus_species
ACGTACGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTACGATCGCGGTATATCGGCGATTCGATCG

>uce-239_Genus_species
ATCGTAGCATGCGCTAGCTAGCTAGCTCGCGGTACGCATGTCTGACTGCGTCTGGTCGTACGATTACTACGACTGCG

>uce-83_Genus_species
ATCGATCTAGCGTAGCATGCGATCGATATCTGCGATCGACTCGATGCATGCATGCATCGATGCTAGCTAGCTAGCTA

>uce-902_Genus_species
AGCTGACTAGCTGGCGATACTGGCGATATCGGATTACGCGGCATATCGAGCGAGTCGATCGATGCATCTGATGCAGC

I am trying to append everything before the first underscore, preceded by a |, to the end of only the lines that start with >. So, for example, the first header would read >uce-8374_Genus_species|uce-8374, followed by the DNA sequence beneath it. Is there a way to do this in sed? I tried storing ^[^_]+(?=_) in a variable, but it didn't work: sed just appended the literal text ^[^_]+(?=_) to the end of the line instead of the text matched by the pattern. Any help, as well as explanations (I am new to regex), would be appreciated. If there is a better way to go about this, I am open to other options!

So far, I have tried the following (I only show the result for the first DNA sequence, but I want to change all of them):

sed -E 's/species/species|^[^_]+(?=_)/' sample_file.fasta

Result: uce-8374_Genus_species|^[^_]+(?=_)

and I have also tried:

x="^[^_]+(?=_)"
sed -E "s/species/species|$x/" "sample_file.fasta"

Result: uce-8374_Genus_species|^[^_]+(?=_)

Score:2

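Unlike Perl, sed doesn't support the PCRE lookahead syntax (?=_). In addition, the replacement side of s/// is inserted as literal text (apart from & and backreferences such as \1), so a regex placed there is never evaluated, which is why your attempts appended the pattern itself. You could fake the lookahead as follows: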

  • match > anchored to the start of the line ^>
  • then match and capture zero or more non-_ characters ([^_]*)
  • then match everything else .*

then replace with

  • the entire matched pattern &
  • followed by literal | and then the first captured group \1
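
To see the two replacement tokens in isolation, here is a minimal sketch that applies the same expression to a single header line. The variable names header and result are only illustrative, and it assumes a sed that accepts -E (GNU and BSD sed both do).

```shell
# Illustrative one-line input: the first header from the question.
header='>uce-8374_Genus_species'

# & expands to the whole match (the full header line);
# \1 expands to the text captured by ([^_]*), i.e. "uce-8374".
result=$(printf '%s\n' "$header" | sed -E 's/^>([^_]*).*/&|\1/')
printf '%s\n' "$result"
```

Because the pattern is anchored with ^>, lines that do not start with > never match and pass through unchanged.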

So

$ sed -E 's/^>([^_]*).*/&|\1/' sample_file.fasta 
>uce-8374_Genus_species|uce-8374
ACGTACGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTACGATCGCGGTATATCGGCGATTCGATCG

>uce-239_Genus_species|uce-239
ATCGTAGCATGCGCTAGCTAGCTAGCTCGCGGTACGCATGTCTGACTGCGTCTGGTCGTACGATTACTACGACTGCG

>uce-83_Genus_species|uce-83
ATCGATCTAGCGTAGCATGCGATCGATATCTGCGATCGACTCGATGCATGCATGCATCGATGCTAGCTAGCTAGCTA

>uce-902_Genus_species|uce-902
AGCTGACTAGCTGGCGATACTGGCGATATCGGATTACGCGGCATATCGAGCGAGTCGATCGATGCATCTGATGCAGC
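
Since the question was open to other options, the same edit can also be sketched with awk. This is only one possible alternative, not part of the sed approach above; the temporary file, the shortened sequences, and the variable names tmp and awk_out are assumptions made for the demonstration.

```shell
# Build a small throwaway FASTA file: two headers from the question,
# with shortened placeholder sequences to keep the example brief.
tmp=$(mktemp)
printf '%s\n' '>uce-8374_Genus_species' 'ACGT' '' \
              '>uce-239_Genus_species' 'ATCG' > "$tmp"

# -F_ splits each line on underscores, so $1 is everything before the first one.
# On header lines, substr($1, 2) drops the leading ">" before it is appended.
# All other lines are printed unchanged.
awk_out=$(awk -F_ '/^>/ { print $0 "|" substr($1, 2); next } { print }' "$tmp")
printf '%s\n' "$awk_out"

rm -f "$tmp"
```

Either command writes to standard output, so redirecting to a new file (rather than overwriting the input) keeps the original data intact.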
Justin
That worked! Thank you! If you don't mind, can you explain what this is doing? Specifically, the second and third parts of the `sed` expression?
@Justin please see updated answer