Score:5

How to add double quotes around string and number pattern?


Hi, I need to add double quotes around a pattern in a file of 300k lines. I'm trying to use sed, and I have read several questions here and in other sources, but I can't seem to understand its syntax.

I have:

chr1    StringTie       exon    191964  192299  1000    -       .       gene_id MSTRG.201; transcript_id MSTRG.201.53; exon_number 2;
chrY    StringTie       exon    26420508        26420531        1000    +       .       gene_id MSTRG.49889; transcript_id MSTRG.49889.11; exon_number 1;

And I need to have:

chr1    StringTie       exon    191964  192299  1000    -       .       gene_id "MSTRG.201"; transcript_id "MSTRG.201.53"; exon_number 2;
chrY    StringTie       exon    26420508        26420531        1000    +       .       gene_id "MSTRG.49889"; transcript_id "MSTRG.49889.11"; exon_number 1;

I'm using sed as follows:

sed 's/MSTRG./"MSTRG."/g' filename

But I can only get:

chr1    StringTie       exon    191964  192299  1000    -       .       gene_id "MSTRG."201; transcript_id "MSTRG."201.53; exon_number 2;
chrY    StringTie       exon    26420508        26420531        1000    +       .       gene_id "MSTRG."49889; transcript_id "MSTRG."49889.11; exon_number 1;

I've tried:

sed -Ei 's|MSTRG[[:digit:]]+|"&"|g' filename
sed 's/M/"M/; s/$/"/' filename
sed 's/MSTRG.[[:digit:]]+/"MSTRG.[[:digit:]]+"/g' filename

But none of these will work.

I was wondering if I could use awk, but I don't have any skills in this language.

Any help, please?

Thanks in advance.

terdon:
This question is perfectly on topic and welcome here, but based on your input file, you might also be interested in checking out our sister site: [bioinformatics.se].
Score:5

Why limit yourself to this particular gene name? Here's a more general solution that will put anything after gene_id or transcript_id and until the first ; in quotes:

$ sed -E 's/(transcript_id|gene_id)  *([^;]+)/\1 "\2"/g' file
chr1    StringTie       exon    191964  192299  1000    -       .       gene_id "MSTRG.201"; transcript_id "MSTRG.201.53"; exon_number 2;
chrY    StringTie       exon    26420508        26420531        1000    +       .       gene_id "MSTRG.49889"; transcript_id "MSTRG.49889.11"; exon_number 1;

Explanation

  • -E: this enables extended regular expressions which let us use ( ) unescaped (not \( \)) to capture groups, and also gives us + for "one or more" and allows us to use unescaped | for "either this or that".
  • s/(transcript_id|gene_id)  *([^;]+)/\1 "\2"/g: we are looking for either transcript_id or gene_id (that's why the | is used, for "OR"), followed by one or more spaces (written here as a space followed by " *", that is, one space and then zero or more further spaces), and then one or more non-; characters. The parentheses capture what is matched so it can be used on the right-hand side of the substitution operator. The first captured group (here, transcript_id or gene_id, without the spaces) will be \1, the second will be \2, and so on. The whole match is then replaced with whatever was captured first (\1), a single space, and whatever was captured second, surrounded by quotes ("\2").
  • s///g: the g is needed to make the substitution global, to substitute all matches found on the same line. Without the g, only the first match would be substituted.

You can use this on arbitrary gene names, even an entire GTF file and it should work just fine.
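As a quick check before touching a 300k-line file, the substitution above can be wrapped in a small function and tried on a single attribute string. This is only a sketch: the function name quote_ids is made up for illustration, and the sample line is trimmed from the question's input.

```shell
# Hypothetical helper (name is illustrative): reads GTF lines on stdin and
# quotes the values of gene_id and transcript_id, using the sed command above.
quote_ids() {
  sed -E 's/(transcript_id|gene_id)  *([^;]+)/\1 "\2"/g'
}

# Try it on one attribute column taken from the question.
printf 'gene_id MSTRG.201; transcript_id MSTRG.201.53; exon_number 2;\n' | quote_ids
```

This should print gene_id "MSTRG.201"; transcript_id "MSTRG.201.53"; exon_number 2;. Note that running the command a second time on already-quoted output would add another pair of quotes, so it is safest to write the result to a new file rather than edit in place.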

Score:4

With GNU sed:

sed -E 's/MSTRG[0-9.]+/"&"/g' file

Output:

chr1    StringTie       exon    191964  192299  1000    -       .       gene_id "MSTRG.201"; transcript_id "MSTRG.201.53"; exon_number 2;
chrY    StringTie       exon    26420508        26420531        1000    +       .       gene_id "MSTRG.49889"; transcript_id "MSTRG.49889.11"; exon_number 1;

&: refers to the portion of the pattern space that matched
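To see what & does, here is a minimal sketch that wraps the same command in a function (the name quote_mstrg is invented for this example) and feeds it one attribute string from the question. It also hints at why the attempts in the question failed: the replacement side of s/// is plain text, not a regex, so writing [[:digit:]]+ there inserts those characters literally, and without -E the + on the pattern side is an ordinary character.

```shell
# Hypothetical helper (name is illustrative): wraps the command above.
# "&" in the replacement stands for the whole text matched by the pattern,
# so each MSTRG identifier is re-emitted with quotes around it.
quote_mstrg() {
  sed -E 's/MSTRG[0-9.]+/"&"/g'
}

line='gene_id MSTRG.49889; transcript_id MSTRG.49889.11; exon_number 1;'
printf '%s\n' "$line" | quote_mstrg
```

Because [0-9.]+ only matches digits and dots, the match stops at the ; and exon_number is left alone.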

See: man sed and The Stack Overflow Regular Expressions FAQ

Raffa:
Making use of `;` which seems consistent .... `sed 's/MSTRG.[^;]*/"&"/g'` might be a shorter alternative ... Also, most likely the extended regex flag `-E` won't be needed.
terdon:
The `;` is consistent, yes: the OP didn't mention it, but this is [GTF or GFF](https://www.ensembl.org/info/website/upload/gff.html?redirect=no), standard formats in bioinformatics.