Score:1

How to make separate files based on a pattern within a single file

id flag

I have a file containing 2.3M lines, which looks like this:

$ less V2.fastq

>TS19_EWP4IQK02JPFP5
CATGCTGCCTCCCGTAGGAGTTTGGTCCGTGTCTCAGTACCAATGTGGGGGACCTTCCTC
TCAGAACCCTATCCATCGTCGGTTTGGTGGGCCGTTACCCGCCAACTGCCTAATGGAACG
CATGCCTATCTATCAGCGATGAATCTTTAGCAAATATCCCCATGCGGGGCCCTGCTTCAT
GCGGTATTAGTCCGACTTTCGCCGGGTTATCCCCTCTGATAGGTAAGTTGCATACGCGTT
ACTCACCGTGCGCCGG
>TS20_EWP4IQK02FSQQL
CATGCTGCCTCCCGTAGGAGTTTGGACCGGTGTCTCAGTTCCAACTGTGGGGGGACCTTC
CTCTCCAGAACCCCCTATCCCATCGAAG
>TS19_EWP4IQK02GBB8K
CATGCTGCCTCCCGTAGGAGTCTGGGCCGTGTCTCAGTCCCAGTGTGGCCGATCACCCTC
TCAGGTCGGCTATGTATCGTCGCCTAGGTGAGCCGTTACCTCACCTACTAGCTAATACAA
CGCAGGTCCATCTTGTAGTGGAGCATTTGCCCCTTTCAAATAAATGACATGAGTCACCCA
TTGTTATGCGGTATTAGCTATCGTTTCCAATAGTTATCCCCCGCTACAAGGCAGGTTACC
TACGCG
>TS19_EWP4IQK02FUJRM
CATGCTGCCTCCCGTAGGAGTTTGGACCGTGTCTCAGTTCCAATGTGGGGGACCTTCCTC
TCAGAACCCCTATCCATCGAAGACTAGGTGGGCCGTTACCCCGCCTACTATCTAATGGAA
CGCACCCCCATCTTACACCGGTAAACCTTTAATCATGCGAAAATGCTTACTCATGATAAC
ATCTTGTATTAATCTCCCTTTCAGAAGGCTGTCCAAGAGTGTAAGGCAGGTTGGATACGC
GTTACTCACCCGTGCGCCGGTCG
>TS119_EWP4IQK02I2KHZ
CATGCTGCCTCCCGTAGGAGTTTGGACCGTGTCTCAGTTCCAATGTGGGGGACCTTCCTC
TCAGAACCCCTATCCATCGATGGCTTGGTGGGCCGTTACCCCGCCAACAACCTAATGGAA
CGCATCCCCATCAATGACCGAAATTCTTTAATAGCTGAAAGATGCCTTTCAGATATACCA
TCGGGTATTAATCTTTCTTTCGAAAGGCTATCCCCGAGTCATCGGCAGGTTGGATACGTG
TTACTCACCCGTGCGCCGTCG

Each line that starts with ">" denotes a single SampleID. The sample name is the term before "_" in that line, for example: TS19, TS20, TS119, etc. I want to make a separate output file for each such sample, containing the SampleIDs and the sequences under them. Can anyone please help me?

Many thanks

edit:1 To get the output for sample TS19 we can use this command, which returns the following output: Command

sed -n '/>TS19_/, />/p' V2.fasta 

Output (a few lines out of thousands)

>TS19_ok4.40713 CTAACGCAGTCA
TTGGGCCGTGTCTCAGTCCCAATGTGGCCGGTCACCCTCTCAGGTCGGCTACTGATCGTCGGCTTGGTAGGCCGTTACCCCACCAACTACCTAATCAGACGCGGGTCCATCTCATACCACCGGAGCTTTTTCACACCGTACCATGCGGTACTGTGCGCTTATGCGGTATTAGCAGTCGTTTCCAACTGTTATCCCCTGTATGAGGCAGGTTACCCACGCGTTACTCACCCGTCCG
>TS6.2_ok4.40714 CGTCAGACGGAT
>TS19_ok4.40771 CTAACGCAGTCA
TTGGGCCGTGTCTCAGTCCCAATGTGGCCGGTCACCCTCTCAGGTCGGCTACTGATCGTCGCTTTGGTAGGCCGATACCCCACCAACCGGCTAATCAGACGCGGGTCCATCTCATACCACCGGAGTTTTTACCCCTCGCACCATGCGGTGCTGTGGTCTTATGCGGTATTAGCAGTCATTTCTTGACTGTTTATTTCCCCTCGTATGAGGCAGGTTACCCACGCGTTACTCACCCG
>TS8_ok4.40772 TCGAGACGCTTA
>TS19_ok4.40971 CTAACGCAGTCA
CTGGGCCGTGTCTCAGTCCCAATGTGGCCGGTCACCCTCTCAGGTCGGCTACTGATCATCGCCTTGGTGGGCCGTTACCCCGCCAACAAGCTAATCAGACGCGGGTCCATCTCATACCACCGGAGTTTTTCACACTGTACCATGTGGTACTGTGCGCTTATGCGGTATTACCAGCCGTTTCCAGCTGCTATCCCCATCTGAAGGGCAGGTTGCTTACGCGG
>TS127_ok4.40972 GACCGAGCTATG

I just want to remove the lines that start with > but don't begin with >TS19_. Can anyone help me?

edit:2 https://drive.google.com/file/d/17MC0tiIE6axOJqNZukzsQX5bVpuvV312/view?usp=sharing

kanehekili avatar
zw flag
Look at [Stackoverflow](https://stackoverflow.com/questions/11313852/split-one-file-into-multiple-files-based-on-delimiter). You'll find a huge number of similar questions there
hr flag
... or U&L, ex. [Split file into multiple files based on pattern](https://unix.stackexchange.com/questions/263904/split-file-into-multiple-files-based-on-pattern)
es flag
A simple C program would do it with very little fuss
DEEP avatar
id flag
Hi @steeldriver- The solution in the link works only when the SampleID is always the same. In my case, only a portion of each line starting with ">" denotes the sample name.
DEEP avatar
id flag
Hi, @steeldriver- the following command, which includes the start pattern (**>TS19**) and end pattern (**>**), works to some extent. But it also returns the line with the end pattern, which I don't want. Can you please adjust the command so that it works perfectly? **sed -n '/>TS19_/, />/p' V2.fasta**
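A sed range always prints the line that closes it, and it also loses records that sit back to back, so an awk state machine is easier to get right here. A minimal sketch (an assumption: headers always start at column 1) sets a flag on every header line and prints while the flag is on:

```shell
# Keep every ">TS19_" record, including its multi-line sequence,
# and drop all other records. On each header line, (re)decide whether
# to keep; non-header lines inherit the current decision.
awk '/^>/ {keep = /^>TS19_/} keep' V2.fasta
```

Unlike the sed range, this handles two consecutive TS19_ records correctly, because each header line makes a fresh decision.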
Score:1
tr flag

I wrote a Perl script a while ago, specifically for this.

The script takes a fasta file and creates individual files for all sequences. It will also clean the fasta file: line breaks within the sequence, empty lines, and leading whitespace in headers (> id) are removed by default. Additionally, non-ACGT characters can be converted to N, and lowercase sequence characters can be converted to uppercase.

The script is called split_fasta.pl and you can find it on my GitHub: https://github.com/nterhoeven/sequence_processing

mondotofu avatar
cn flag
Specifically tailored scripts for working with fasta or fastq are welcome
Score:1
jp flag

With awk, you can set > as the record separator and process (match) whole records instead of lines, searching for e.g. records containing "TS19" like so:

awk 'BEGIN {RS=">"; ORS=RS} /TS19/' V2.fasta

Or automatically split each record type into a file with a .split extension (i.e. TS119.split, TS19.split, TS20.split, ...) in the current working directory like so:

awk 'BEGIN {RS=">"; ORS=RS} {split($1, arr, "_"); f=arr[1]".split"; print > f}' V2.fasta
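">
With many distinct sample names, each `print > f` keeps a file handle open and awk can hit the open-file limit. A sketch of a variant of the command above that closes each file after writing, with an `NF` guard added so the empty record before the first `>` doesn't create a stray `.split` file:

```shell
# Split records into one file per sample prefix; close each handle
# immediately so thousands of samples don't exhaust file descriptors.
# NF skips the empty first record produced by RS=">".
awk 'BEGIN {RS = ">"; ORS = RS}
     NF {split($1, arr, "_"); f = arr[1] ".split"; print >> f; close(f)}' V2.fasta
```

Note that `>>` appends, so remove old `.split` files before re-running.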
DEEP avatar
id flag
Hi @Raffa- Thanks a lot. It comes close but is not perfect. I have seen the following command return 21469 lines: `cat V2.fasta | grep ">TS19_" | wc -l`. But your command generates 21435 lines, as obtained from this command: `awk 'BEGIN {RS=">"; ORS=RS} /TS19_/' V2.fasta | grep ">TS19_" | wc -l`. I am uploading my file under **edit:2** with the question so that you can help me properly.
DEEP avatar
id flag
Hi @Raffa- I found some sample IDs that are not being extracted with your command, and they have a common pattern: four different strings separated by three spaces in the ID lines starting with `>`. Here are some examples of such IDs: `>TS19_ct4.121197 CTAACGCAGTCA Corrected: CTAATGCAGTCA, Changes: (pos 4, T -> C) >TS19_ct4.121626 CTAACGCAGTCA >TS19_ct4.131873 CTAACGCAGTCA Corrected: CTTACGCAGTCA, Changes: (pos 2, T -> A) >TS19_ct4.131930 CTAACGCAGTCA`
Raffa avatar
jp flag
Hi @DEEP `grep` will match lines (each one) directly ... While the above `awk` code will identify blocks (records) separated by `>` first, then print the whole record if a match is found in that record ... I apologize for not having time to inspect your file ... But I have downloaded it and hopefully I will look into it when I have time. :-)
DEEP avatar
id flag
ok @Raffa- Thanks
Raffa avatar
jp flag
@DEEP I quickly looked into your file and found leading white-space in some lines ... That can be solved with `awk 'BEGIN {RS=">"; ORS=RS} gsub(/^[ \t]+/,""); /TS19_/' V2.fasta` ... Please try it and let me know.
DEEP avatar
id flag
Many many thanks @Raffa. It is working perfectly. I really appreciate your helping hands towards me. However, is it possible to write a loop for all sample IDs like `TS19_, TS119_, TS20_` and so on?
Raffa avatar
jp flag
@DEEP Sure, you can loop over patterns in an array or from another file ... Please see my answer here: https://askubuntu.com/a/1420653 ... It includes an example with `awk` that you can adapt with the above code ... It's my pleasure to help :-)
DEEP avatar
id flag
Sorry @Raffa- I found a problem with your last command `awk 'BEGIN {RS=">"; ORS=RS} gsub(/^[ \t]+/,""); /TS19_/' V2.fasta`. It actually adds some distorted lines at the end of the file, where the lines start with something like `>C)` or `>G)`, etc. An example line is: `>G) CTGGGCCGTGTCTCAGTCCCAATGTGGCCGGTCACCCTCTCAGGTCGGCTACTGATCGTCGGCTTGGTGGGCCGTTACCTCACCAACTACCTAATCAGACGCGGGTCCATCTCATACCACCGGAGTTTTACCACTGTACCATGCAGTACTGTGGTCTTATGCGGTATTAGCAGCCATTTCTAACTGTTACTCCCCCTGTATGAGGCAGGTTACCCACGCGTTACTCCACCCGTCCCG`
Raffa avatar
jp flag
@DEEP It's OK ... I tested and had the same result, so I inspected your file and found `>` in the middle of some lines, treated as a record separator where it shouldn't be ... Your file is inconsistent in this regard ... That can be fixed: `awk 'BEGIN {RS="\n>"; ORS=RS} gsub(/^[ \t]+/,""); /TS19_/' V2.fasta`
DEEP avatar
id flag
I found it works except that it adds a single `>` at the end, like: `>TS19_ok4.40771 CTAACGCAGTCA TTGGGCCGTGTCTCAGTCCCAATGTGCTCTCAGGTCGGCTACTGATCGTCGCTTTGGTAGGCCGATACCCCACCAACCGGCTAATCAGACGCGGGTCCATCTCATACCACCGGAGTTTTTACCCCTCGCACCATGCGGTGCTGTGGTCTTATGCGGTATTAGCAGTCATTTCTTGACTGTTTATTTCCCCTCGTATGAGGCAGGTTACCCACGCGTTACTCACCCG >TS19_ok4.40971 CTAACGCAGTCA CTGGGCCGTGTCTCAGTCCCAATGTGGCCGGTCACCCTCTCAGGTCGGCTACTGATCATCGCCTTGGTGGGCCGTTACCCCGCCAACAAGCTAATCAGACGCGGGTCCATCTCATACCACCGGAGTTTTTCACACTGTACCGGTACTGTGCGCTTATGCGGTATTACCAGCCGTTTCCAGCTGCTATCCCCATCTGAAGGGCAGGTTGCTTACGCGG >`
Raffa avatar
jp flag
@DEEP That is because setting `ORS=RS` ... It sets the output record separator as the input record separator and is invoked after each `print` action (not before) and as a result the first record is printed without a `>` as well ... One solution is: `awk 'BEGIN {RS="\n>"} gsub(/^[ \t]+/,""); /TS19_/ {print ">"$0}' V2.fasta`
DEEP avatar
id flag
Thanks @Raffa- It works. For generating individual files for each sample name (TS19, TS190, TS191, etc.) I have generated a file (`samplenames.txt`) with the sample names from `V2.fasta`. Its content is like: `TS19 TS190 TS191 TS19.2`. Now I wrote this script to loop over the samples: `#!/usr/bin/bash filename="samplenames.txt" samples=$(cat $filename) for f in $samples; do awk 'BEGIN {RS="\n>"} gsub(/^[ \t]+/,""); /$f/ {print ">"$0}' V2.fasta > ${f}.fasta; done`. It generates only blank output files (`TS19.fasta`, `TS190.fasta`, etc.). Do you have any idea what the problem is?
Raffa avatar
jp flag
@DEEP You need to pass the shell variable to awk like so `awk -v f="$f" '…..'` and use it inside awk without `$` … just `f`
Raffa avatar
jp flag
@DEEP `awk -v f="$f" 'BEGIN {RS="\n>"} gsub(/^[ \t]+/,""); $0 ~ f {print ">"$0}' V2.fasta`
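Putting that together, a sketch of the full loop (the file names `samplenames.txt` and `V2.fasta` are taken from the comments above). Two small changes from the commands in the thread: the cleanup calls are moved into an action block, so `gsub()`'s return value can never trigger a print on its own, and the leading `>` that `RS="\n>"` leaves on the first record is stripped:

```shell
#!/usr/bin/env bash
# One output file per sample name listed in samplenames.txt.
while read -r f; do
    awk -v f="$f" 'BEGIN {RS = "\n>"}
        {sub(/^>/, ""); gsub(/^[ \t]+/, "")}   # strip leading ">" and whitespace
        $0 ~ f {print ">" $0}' V2.fasta > "${f}.fasta"
done < samplenames.txt
```

Reading the names with `while read -r` also avoids the word-splitting surprises of `for f in $(cat …)`.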
DEEP avatar
id flag
Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/140070/discussion-between-deep-and-raffa).
Raffa avatar
jp flag
@DEEP Put that awk command back in your for loop in place of the similar awk command part and it will work :-) … I am on my mobile phone now so I can't write code correctly, but I will get in front of a PC in a couple of hours if you still need me
DEEP avatar
id flag
Thanks @Raffa for being patient with my problems. It works, but I discovered one problem. I found that for `f in TS1.2_`, instead of only `TS1.2_`, all of the following sample names are matched at once: `TS1.2_ TS132_ TS142_ TS152_ TS162_ TS182_ TS192_`. I think the dot `.` character in `TS1.2_` is being treated as a wildcard, which is creating the problem.
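Since `~` does regular-expression matching, the `.` in `TS1.2_` matches any character, so `TS132_` also matches. One way around this is a literal-string comparison with `index()`, which treats the sample name as plain text and requires it at the very start of the record. A sketch using the `TS1.2_` example from the comment (in the loop, `f` would come from `-v f="$f"` as before):

```shell
# index($0, f) == 1 means the record begins with the literal sample
# name; no character in f is interpreted as a regex metacharacter.
awk -v f="TS1.2_" 'BEGIN {RS = "\n>"}
    {sub(/^>/, "")}
    index($0, f) == 1 {print ">" $0}' V2.fasta
```

Anchoring at position 1 also prevents `TS19_` from matching a record for `TS119_`, since the name must start the header.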
Score:1
cn flag

Edit 1: Take away the `-n 7` ... you won't need it.

csplit -z v2.fastq -f TestSample '/>TS/' '{*}'

Will generate files TestSample00, TestSample01, TestSample02, TestSample03, ..., TestSamplennnnnn based upon your file.

Finally, you'll want a prefix to identify all these files. Sorry, my solution doesn't rename the files to follow the test-sample naming convention, but at least you can vary the prefix each time you run the command with the -f parameter.

Edit 2
If, however, you need all of the data with the same test sample identifier collected together in one file, then follow up with a command such as

find . -name "TestSample*" | xargs grep -l TS19_ | awk '{print "cat " $1"  >> My_TS19_.fasta " }' | sh

The new file (My_TS19_.fasta) will contain all the sequences that pertain to TS19_, or whatever case-sensitive string you put after grep.

I've added the xargs command to stream the list of files rather than choking the find command.

The awk command takes the file names and appends each one to an initially non-existent or empty file. Be careful to use a new file each time to avoid making duplicates.
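As a sketch of an alternative that avoids generating shell commands from awk, the matching pieces can be appended directly with a second `xargs` (file and pattern names follow the example above; GNU `xargs -r` skips running `cat` when nothing matches):

```shell
# Collect every TestSample* piece mentioning TS19_ into one file.
find . -name 'TestSample*' -print0 \
  | xargs -0 grep -l 'TS19_' \
  | xargs -r cat >> My_TS19_.fasta
```

This sidesteps quoting pitfalls in the awk-generated `cat` commands; the same caveat applies about starting with a fresh output file each run.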

DEEP avatar
id flag
Hi @mondotofu- probably my question is not clear enough. Actually, **TS19_EWP4IQK02JPFP5, TS19_EWP4IQK02GBB8K and TS19_EWP4IQK02FUJRM** here belong to the same sample, **TS19**. They should be kept inside a single file named **TS19**. Is it possible to do that?
mondotofu avatar
cn flag
I've edited my answer by adding another command at the end.
mondotofu avatar
cn flag
If this answer resolved your question, then please show your appreciation by "accepting" it: click the checkmark next to the question.
DEEP avatar
id flag
Hi @mondotofu- Actually my system hangs with the `csplit` command, probably because it results in millions of output files. Also, the `find` command returns the following error: `bash: /usr/bin/find: Argument list too long`. I would request you to take a look at **edit:1** in the question itself, where I have put a command that comes close to the desired answer, and I need your help to make it perfect. Please take a look.
mondotofu avatar
cn flag
Revised the commands and used xargs to eliminate the Argument list too long error. Take note of new commands in **Edit 1** and **Edit 2** @DEEP
mondotofu avatar
cn flag
Tried out the commands on your file in Google Drive. csplit completed in about a minute. The second command finished in about 10 seconds. I have Ubuntu 22.04.1 LTS on a Thinkpad with 32 GB RAM.
DEEP avatar
id flag
Yeah @mondotofu- probably because of RAM issue it is not running. I have only 8GB of RAM
mondotofu avatar
cn flag
I'd wipe out all the TestSample files and try again with the revised commands in Edit 1 and Edit 2 in my answer. There should be enough gradualism in the csplit command that it doesn't need large amounts of RAM to complete. The xargs command also takes away the long list of files, handling one at a time, and the sh command makes sure it executes once for each file detected.