I have a .fasta (text) file containing DNA sequence data in the following format:
>uce-8374_Genus_species
ACGTACGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTACGATCGCGGTATATCGGCGATTCGATCG
>uce-239_Genus_species
ATCGTAGCATGCGCTAGCTAGCTAGCTCGCGGTACGCATGTCTGACTGCGTCTGGTCGTACGATTACTACGACTGCG
>uce-83_Genus_species
ATCGATCTAGCGTAGCATGCGATCGATATCTGCGATCGACTCGATGCATGCATGCATCGATGCTAGCTAGCTAGCTA
>uce-902_Genus_species
AGCTGACTAGCTGGCGATACTGGCGATATCGGATTACGCGGCATATCGAGCGAGTCGATCGATGCATCTGATGCAGC
I am trying to append everything before the first underscore, preceded by a |, to the end of only those lines that contain the >. So, for example, the first sequence would read:
uce-8374_Genus_species|uce-8374
followed by the DNA sequence beneath it. Is there a way to do this in sed? I tried storing ^[^_]+(?=_) in a variable, but it didn't work: sed just appended the literal text ^[^_]+(?=_) to the end of the line instead of the text that the pattern matches. Any help, as well as explanations (as I am new to regex), would be appreciated. If there is a better way to go about this, I am open to other options!
So far, I have tried the following (I am only showing the result for the first sequence, but I want to change all of them):
sed -E 's/species/species|^[^_]+(?=_)/' sample_file.fasta
Result: uce-8374_Genus_species|^[^_]+(?=_)
and I have also tried:
x="^[^_]+(?=_)"
sed -E "s/species/species|$x/" "sample_file.fasta"
Result: uce-8374_Genus_species|^[^_]+(?=_)
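For reference, here is a minimal sketch of the direction that seems to fit, not a definitive solution. Two things appear to explain the results above: the replacement side of s/// is plain text rather than a regex, so a pattern written there is inserted literally, and the extended regex syntax enabled by sed -E has no lookahead, so (?=_) is not available. The usual way around both is to capture the wanted text with a group ( ... ) in the pattern and refer to it as \1 in the replacement. The function name append_locus_id is hypothetical, and the sketch assumes every header line has the form >ID_rest with at least one underscore.

```shell
# Hypothetical helper: append "|ID" to each FASTA header line, where ID is
# everything between ">" and the first underscore. Sequence lines are untouched.
# Reads the files given as arguments, or standard input if there are none.
append_locus_id() {
  # /^>/        only act on lines that start with ">"
  # ([^_]+)     capture one or more non-underscore characters as group 1
  # _.*$        the first underscore and the rest of the line
  # &|\1        the whole matched line, a literal "|", then group 1
  sed -E '/^>/ s/^>([^_]+)_.*$/&|\1/' "$@"
}

printf '>uce-8374_Genus_species\nACGTACGT\n' | append_locus_id
# prints:
# >uce-8374_Genus_species|uce-8374
# ACGTACGT
```

On the real file this would be called as append_locus_id sample_file.fasta, which writes the result to standard output and leaves the file unchanged.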