Score:1

How to make separate files based on a pattern within a single file

id flag

I have a file containing 2.3M lines, which looks like this:

$ less V2.fastq

>TS19_EWP4IQK02JPFP5
CATGCTGCCTCCCGTAGGAGTTTGGTCCGTGTCTCAGTACCAATGTGGGGGACCTTCCTC
TCAGAACCCTATCCATCGTCGGTTTGGTGGGCCGTTACCCGCCAACTGCCTAATGGAACG
CATGCCTATCTATCAGCGATGAATCTTTAGCAAATATCCCCATGCGGGGCCCTGCTTCAT
GCGGTATTAGTCCGACTTTCGCCGGGTTATCCCCTCTGATAGGTAAGTTGCATACGCGTT
ACTCACCGTGCGCCGG
>TS20_EWP4IQK02FSQQL
CATGCTGCCTCCCGTAGGAGTTTGGACCGGTGTCTCAGTTCCAACTGTGGGGGGACCTTC
CTCTCCAGAACCCCCTATCCCATCGAAG
>TS19_EWP4IQK02GBB8K
CATGCTGCCTCCCGTAGGAGTCTGGGCCGTGTCTCAGTCCCAGTGTGGCCGATCACCCTC
TCAGGTCGGCTATGTATCGTCGCCTAGGTGAGCCGTTACCTCACCTACTAGCTAATACAA
CGCAGGTCCATCTTGTAGTGGAGCATTTGCCCCTTTCAAATAAATGACATGAGTCACCCA
TTGTTATGCGGTATTAGCTATCGTTTCCAATAGTTATCCCCCGCTACAAGGCAGGTTACC
TACGCG
>TS19_EWP4IQK02FUJRM
CATGCTGCCTCCCGTAGGAGTTTGGACCGTGTCTCAGTTCCAATGTGGGGGACCTTCCTC
TCAGAACCCCTATCCATCGAAGACTAGGTGGGCCGTTACCCCGCCTACTATCTAATGGAA
CGCACCCCCATCTTACACCGGTAAACCTTTAATCATGCGAAAATGCTTACTCATGATAAC
ATCTTGTATTAATCTCCCTTTCAGAAGGCTGTCCAAGAGTGTAAGGCAGGTTGGATACGC
GTTACTCACCCGTGCGCCGGTCG
>TS119_EWP4IQK02I2KHZ
CATGCTGCCTCCCGTAGGAGTTTGGACCGTGTCTCAGTTCCAATGTGGGGGACCTTCCTC
TCAGAACCCCTATCCATCGATGGCTTGGTGGGCCGTTACCCCGCCAACAACCTAATGGAA
CGCATCCCCATCAATGACCGAAATTCTTTAATAGCTGAAAGATGCCTTTCAGATATACCA
TCGGGTATTAATCTTTCTTTCGAAAGGCTATCCCCGAGTCATCGGCAGGTTGGATACGTG
TTACTCACCCGTGCGCCGTCG

Each line that starts with ">" denotes a single SampleID. The sample name is the term before "_" in that line, for example: TS19, TS20, TS119, etc. I want to make a separate output file for each such sample, containing the SampleIDs and the sequences under them. Can anyone please help me?

Many thanks

edit:1 To get the output for sample TS19 we can use this command, which returns the following output: Command

sed -n '/>TS19_/, />/p' V2.fasta 

Output (a few lines out of thousands)

>TS19_ok4.40713 CTAACGCAGTCA
TTGGGCCGTGTCTCAGTCCCAATGTGGCCGGTCACCCTCTCAGGTCGGCTACTGATCGTCGGCTTGGTAGGCCGTTACCCCACCAACTACCTAATCAGACGCGGGTCCATCTCATACCACCGGAGCTTTTTCACACCGTACCATGCGGTACTGTGCGCTTATGCGGTATTAGCAGTCGTTTCCAACTGTTATCCCCTGTATGAGGCAGGTTACCCACGCGTTACTCACCCGTCCG
>TS6.2_ok4.40714 CGTCAGACGGAT
>TS19_ok4.40771 CTAACGCAGTCA
TTGGGCCGTGTCTCAGTCCCAATGTGGCCGGTCACCCTCTCAGGTCGGCTACTGATCGTCGCTTTGGTAGGCCGATACCCCACCAACCGGCTAATCAGACGCGGGTCCATCTCATACCACCGGAGTTTTTACCCCTCGCACCATGCGGTGCTGTGGTCTTATGCGGTATTAGCAGTCATTTCTTGACTGTTTATTTCCCCTCGTATGAGGCAGGTTACCCACGCGTTACTCACCCG
>TS8_ok4.40772 TCGAGACGCTTA
>TS19_ok4.40971 CTAACGCAGTCA
CTGGGCCGTGTCTCAGTCCCAATGTGGCCGGTCACCCTCTCAGGTCGGCTACTGATCATCGCCTTGGTGGGCCGTTACCCCGCCAACAAGCTAATCAGACGCGGGTCCATCTCATACCACCGGAGTTTTTCACACTGTACCATGTGGTACTGTGCGCTTATGCGGTATTACCAGCCGTTTCCAGCTGCTATCCCCATCTGAAGGGCAGGTTGCTTACGCGG
>TS127_ok4.40972 GACCGAGCTATG

I just want to remove the lines that start with > but don't begin with >TS19_. Can anyone help me?

edit:2 https://drive.google.com/file/d/17MC0tiIE6axOJqNZukzsQX5bVpuvV312/view?usp=sharing

kanehekili avatar
zw flag
Look at [Stackoverflow](https://stackoverflow.com/questions/11313852/split-one-file-into-multiple-files-based-on-delimiter). You'll find a huge number of similar questions there
hr flag
... or U&L, ex. [Split file into multiple files based on pattern](https://unix.stackexchange.com/questions/263904/split-file-into-multiple-files-based-on-pattern)
es flag
A simple C program would do it with very little fuss
DEEP avatar
id flag
Hi @steeldriver- The solution in the link works only when the SampleID is always the same. In my case, only a portion of each line starting with ">" denotes the sample name.
DEEP avatar
id flag
Hi, @steeldriver- the following command, which includes the start pattern (**>TS19**) and end pattern (**>**), works to some extent. But it also returns the line with the end pattern, which I don't want. Can you please adjust the command so that it works perfectly? **sed -n '/>TS19_/, />/p' V2.fasta**
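A sed range always prints the line that closes it, and it also loses records that sit back to back, so an awk state machine is easier to get right here. A minimal sketch (an assumption: headers always start at column 1) sets a flag on every header line and prints while the flag is on:

```shell
# Keep every ">TS19_" record, including its multi-line sequence,
# and drop all other records. On each header line, (re)decide whether
# to keep; non-header lines inherit the current decision.
awk '/^>/ {keep = /^>TS19_/} keep' V2.fasta
```

Unlike the sed range, this handles two consecutive TS19_ records correctly, because each header line makes a fresh decision.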
Score:1
tr flag

I wrote a Perl script a while ago, specifically for this.

The script takes a fasta file and creates individual files for all sequences. It will also clean the fasta file: line breaks within the sequence, empty lines, and leading whitespace in headers (> id) are removed by default. Additionally, non-ACGT characters can be converted to N, and lowercase sequence characters can be converted to uppercase.

The script is called split_fasta.pl and you can find it on my GitHub: https://github.com/nterhoeven/sequence_processing

mondotofu avatar
cn flag
Specifically tailored scripts for working with fasta or fastq are welcome
Score:1
jp flag

With awk, you can set > as the record separator and process (match) whole records instead of lines, searching for e.g. records containing "TS19" like so:

awk 'BEGIN {RS=">"; ORS=RS} /TS19/' V2.fasta

Or automatically split each record type into a file with a .split extension (i.e. TS119.split, TS19.split, TS20.split, ...) in the current working directory like so:

awk 'BEGIN {RS=">"; ORS=RS} {split($1, arr, "_"); f=arr[1]".split"; print > f}' V2.fasta
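">
With many distinct sample names, each `print > f` keeps a file handle open and awk can hit the open-file limit. A sketch of a variant of the command above that closes each file after writing, with an `NF` guard added so the empty record before the first `>` doesn't create a stray `.split` file:

```shell
# Split records into one file per sample prefix; close each handle
# immediately so thousands of samples don't exhaust file descriptors.
# NF skips the empty first record produced by RS=">".
awk 'BEGIN {RS = ">"; ORS = RS}
     NF {split($1, arr, "_"); f = arr[1] ".split"; print >> f; close(f)}' V2.fasta
```

Note that `>>` appends, so remove old `.split` files before re-running.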
DEEP avatar
id flag
Hi @Raffa- Thanks a lot. It comes close but is not perfect. I have seen the following command return 21469 lines: `cat V2.fasta | grep ">TS19_" | wc -l`. But your command generates 21435 lines, as obtained from this command: `awk 'BEGIN {RS=">"; ORS=RS} /TS19_/' V2.fasta | grep ">TS19_" | wc -l`. I am uploading my file under **edit:2** with the question so that you can help me properly.
DEEP avatar
id flag
Hi @Raffa- I found some sample IDs that are not being extracted with your command, and they have a common pattern: four different strings separated by three spaces in the ID lines starting with `>`. Here are some examples of such IDs: `>TS19_ct4.121197 CTAACGCAGTCA Corrected: CTAATGCAGTCA, Changes: (pos 4, T -> C) >TS19_ct4.121626 CTAACGCAGTCA >TS19_ct4.131873 CTAACGCAGTCA Corrected: CTTACGCAGTCA, Changes: (pos 2, T -> A) >TS19_ct4.131930 CTAACGCAGTCA`
Raffa avatar
jp flag
Hi @DEEP `grep` will match lines (each one) directly ... While the above `awk` code will identify blocks (records) separated by `>` first, then print the whole record if a match is found in that record ... I apologize for not having time to inspect your file ... But I have downloaded it and hopefully I will look into it when I have time. :-)
DEEP avatar
id flag
ok @Raffa- Thanks
Raffa avatar
jp flag
@DEEP I quickly looked into your file and found leading white-space in some lines ... That can be solved with `awk 'BEGIN {RS=">"; ORS=RS} gsub(/^[ \t]+/,""); /TS19_/' V2.fasta` ... Please try it and let me know.
DEEP avatar
id flag
Many many thanks @Raffa. It is working perfectly. I really appreciate your helping hands towards me. However, is it possible to write a loop for all sample IDs like `TS19_, TS119_, TS20_` and so on?
Raffa avatar
jp flag
@DEEP Sure, you can loop over patterns in an array or from another file ... Please see my answer here: https://askubuntu.com/a/1420653 ... It includes an example with `awk` that you can adapt with the above code ... It's my pleasure to help :-)
DEEP avatar
id flag
Sorry @Raffa- I found a problem with your last command `awk 'BEGIN {RS=">"; ORS=RS} gsub(/^[ \t]+/,""); /TS19_/' V2.fasta`. It actually adds some distorted lines at the end of the file, where the lines start with something like `>C)` or `>G)`, etc. An example line is: `>G) CTGGGCCGTGTCTCAGTCCCAATGTGGCCGGTCACCCTCTCAGGTCGGCTACTGATCGTCGGCTTGGTGGGCCGTTACCTCACCAACTACCTAATCAGACGCGGGTCCATCTCATACCACCGGAGTTTTACCACTGTACCATGCAGTACTGTGGTCTTATGCGGTATTAGCAGCCATTTCTAACTGTTACTCCCCCTGTATGAGGCAGGTTACCCACGCGTTACTCCACCCGTCCCG`
Raffa avatar
jp flag
@DEEP It's OK ... I tested and had the same result, so I inspected your file and found `>` in the middle of some lines, treated as a record separator where it shouldn't be ... Your file is inconsistent in this regard ... That can be fixed: `awk 'BEGIN {RS="\n>"; ORS=RS} gsub(/^[ \t]+/,""); /TS19_/' V2.fasta`
DEEP avatar
id flag
I found it works except that it adds a single `>` at the end, like: `>TS19_ok4.40771 CTAACGCAGTCA TTGGGCCGTGTCTCAGTCCCAATGTGCTCTCAGGTCGGCTACTGATCGTCGCTTTGGTAGGCCGATACCCCACCAACCGGCTAATCAGACGCGGGTCCATCTCATACCACCGGAGTTTTTACCCCTCGCACCATGCGGTGCTGTGGTCTTATGCGGTATTAGCAGTCATTTCTTGACTGTTTATTTCCCCTCGTATGAGGCAGGTTACCCACGCGTTACTCACCCG >TS19_ok4.40971 CTAACGCAGTCA CTGGGCCGTGTCTCAGTCCCAATGTGGCCGGTCACCCTCTCAGGTCGGCTACTGATCATCGCCTTGGTGGGCCGTTACCCCGCCAACAAGCTAATCAGACGCGGGTCCATCTCATACCACCGGAGTTTTTCACACTGTACCGGTACTGTGCGCTTATGCGGTATTACCAGCCGTTTCCAGCTGCTATCCCCATCTGAAGGGCAGGTTGCTTACGCGG >`
Raffa avatar
jp flag
@DEEP That is because setting `ORS=RS` ... It sets the output record separator as the input record separator and is invoked after each `print` action (not before) and as a result the first record is printed without a `>` as well ... One solution is: `awk 'BEGIN {RS="\n>"} gsub(/^[ \t]+/,""); /TS19_/ {print ">"$0}' V2.fasta`
DEEP avatar
id flag
Thanks @Raffa- It works. For generating individual files for each sample name (TS19, TS190, TS191, etc.) I have generated a file (`samplenames.txt`) with the sample names from `V2.fasta`. Its content is like: `TS19 TS190 TS191 TS19.2`. Now I wrote this script to loop over the samples: `#!/usr/bin/bash filename="samplenames.txt" samples=$(cat $filename) for f in $samples; do awk 'BEGIN {RS="\n>"} gsub(/^[ \t]+/,""); /$f/ {print ">"$0}' V2.fasta > ${f}.fasta; done`. It generates only blank output files (`TS19.fasta`, `TS190.fasta`, etc.). Do you have any idea what the problem is?
Raffa avatar
jp flag
@DEEP You need to pass the shell variable to awk like so `awk -v f="$f" '…..'` and use it inside awk without `$` … just `f`
Raffa avatar
jp flag
@DEEP `awk -v f="$f" 'BEGIN {RS="\n>"} gsub(/^[ \t]+/,""); $0 ~ f {print ">"$0}' V2.fasta`
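Putting that together, a sketch of the full loop (the file names `samplenames.txt` and `V2.fasta` are taken from the comments above). Two small changes from the commands in the thread: the cleanup calls are moved into an action block, so `gsub()`'s return value can never trigger a print on its own, and the leading `>` that `RS="\n>"` leaves on the first record is stripped:

```shell
#!/usr/bin/env bash
# One output file per sample name listed in samplenames.txt.
while read -r f; do
    awk -v f="$f" 'BEGIN {RS = "\n>"}
        {sub(/^>/, ""); gsub(/^[ \t]+/, "")}   # strip leading ">" and whitespace
        $0 ~ f {print ">" $0}' V2.fasta > "${f}.fasta"
done < samplenames.txt
```

Reading the names with `while read -r` also avoids the word-splitting surprises of `for f in $(cat …)`.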
DEEP avatar
id flag
Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/140070/discussion-between-deep-and-raffa).
Raffa avatar
jp flag
@DEEP Put that awk command back in your for loop in place of the similar awk command part and it will work :-) … I am on my mobile phone now so I can't write code correctly, but I will get in front of a PC in a couple of hours if you still need me
DEEP avatar
id flag
Thanks @Raffa for being patient with my problems. It works, but I discovered one problem. I found that for `f in TS1.2_`, instead of only `TS1.2_`, all of the following sample names are matched at once: `TS1.2_ TS132_ TS142_ TS152_ TS162_ TS182_ TS192_`. I think the dot `.` character in `TS1.2_` is being treated as a wildcard, which is creating the problem.
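Since `~` does regular-expression matching, the `.` in `TS1.2_` matches any character, so `TS132_` also matches. One way around this is a literal-string comparison with `index()`, which treats the sample name as plain text and requires it at the very start of the record. A sketch using the `TS1.2_` example from the comment (in the loop, `f` would come from `-v f="$f"` as before):

```shell
# index($0, f) == 1 means the record begins with the literal sample
# name; no character in f is interpreted as a regex metacharacter.
awk -v f="TS1.2_" 'BEGIN {RS = "\n>"}
    {sub(/^>/, "")}
    index($0, f) == 1 {print ">" $0}' V2.fasta
```

Anchoring at position 1 also prevents `TS19_` from matching a record for `TS119_`, since the name must start the header.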
Score:1
cn flag

Edit 1: Take away the `-n 7` ... you won't need it.

csplit -z v2.fastq -f TestSample '/>TS/' '{*}'

Will generate files TestSample00, TestSample01, TestSample02, TestSample03, ..., TestSamplennnnnn based upon your file.

Finally, you'll want a prefix to identify all these files. Sorry, my solution doesn't rename the files to follow the test-sample naming convention, but at least you can vary the prefix each time you run the command with the -f parameter.

Edit 2
If, however, you need all of the data with the same test sample identifier collected together in one file, then follow up with a command such as

find . -name "TestSample*" | xargs grep -l TS19_ | awk '{print "cat " $1"  >> My_TS19_.fasta " }' | sh

The new file (My_TS19_.fasta) will contain all the sequences that pertain to TS19_, or whatever case-sensitive string you put after grep.

I've added the xargs command to stream the list of files rather than choking the find command.

The awk command takes the file names and appends each one to an initially non-existent or empty file. Be careful to use a new file each time to avoid making duplicates.
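As a sketch of an alternative that avoids generating shell commands from awk, the matching pieces can be appended directly with a second `xargs` (file and pattern names follow the example above; GNU `xargs -r` skips running `cat` when nothing matches):

```shell
# Collect every TestSample* piece mentioning TS19_ into one file.
find . -name 'TestSample*' -print0 \
  | xargs -0 grep -l 'TS19_' \
  | xargs -r cat >> My_TS19_.fasta
```

This sidesteps quoting pitfalls in the awk-generated `cat` commands; the same caveat applies about starting with a fresh output file each run.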

DEEP avatar
id flag
Hi @mondotofu- probably my question is not clear enough. Actually, **TS19_EWP4IQK02JPFP5, TS19_EWP4IQK02GBB8K and TS19_EWP4IQK02FUJRM** here belong to the same sample, **TS19**. They should be kept inside a single file named **TS19**. Is it possible to do that?
mondotofu avatar
cn flag
I've edited my answer by adding another command at the end.
mondotofu avatar
cn flag
If this answer resolved your question, then please show your appreciation by "accepting" it: click the checkmark next to the question.
DEEP avatar
id flag
Hi @mondotofu- Actually my system hangs with the `csplit` command, probably because it results in millions of output files. Also, the `find` command returns the following error: `bash: /usr/bin/find: Argument list too long`. I would request you to take a look at **edit:1** in the question itself, where I have put a command that comes close to the desired answer, and I need your help to make it perfect. Please take a look.
mondotofu avatar
cn flag
Revised the commands and used xargs to eliminate the Argument list too long error. Take note of new commands in **Edit 1** and **Edit 2** @DEEP
mondotofu avatar
cn flag
Tried out the commands on your file in Google Drive. csplit completed in about a minute. The second command finished in about 10 seconds. I have Ubuntu 22.04.1 LTS on a Thinkpad with 32 GB RAM.
DEEP avatar
id flag
Yeah @mondotofu- probably because of RAM issue it is not running. I have only 8GB of RAM
mondotofu avatar
cn flag
I'd wipe out all the TestSample files and try again with the revised commands in Edit 1 and Edit 2 in my answer. There should be enough gradualism in the csplit command that it doesn't need large amounts of RAM to complete. The xargs command also takes away the long list of files, handling one at a time, and the sh command makes sure it executes once for each file detected.