Extracting specific data from different tables and merging it

Question

Score:1

Ubuntu

Extracting specific data from different tables and merging it

Guillaume Tahon

3/25/23, 9:04 AM

I would like to extract specific rows from specific columns from different tables, based on the column header. However, my knowledge seems to be too limited to solve it myself (both in command-line and python).

I have a directory with a growing number of .tsv files (OTU tables). These files all have the same layout:

Row 1 contains headers.
Rows 2 - x contain data. The number of columns can differ between files.
Column 1 always contains the OTU ids.
The last column always contains the Taxonomy.

text-formatted files can be found at the bottom

For example, file 1 looks like this:

OTU_id	20210801_sampleA	20210801_sampleB	20210801_sampleC	Taxonomy
otuA	5	1	0	Hans is cool
otuB	2	0	0	Patricia has blue eyes
otuF	0	6	2	Rutger is actually Bart
otuZ	1	3	0	Felix was lost

File 2 looks like this:

OTU_id	20211111_sampleT	20211111_sampleG	20211111_sampleA	Taxonomy
otuA	0	3	3	Hans is cool
otuB	2	1	0	Patricia has blue eyes
otuP	1	6	0	Thomas is great
otuT	2	22	56	Anna and Elsa
otuZ	1	8	4	Felix was lost

Of course, the number of samples and otus are different per file. As you can also see, the headers of the samples are always preceded by a date. In a certain file, this is always the same date.

What I want to achieve now is the following:

If I have a certain sample name, for example sampleA, I want to extract the results for this sample from all tables (.tsv files) in the directory. Of course sampleA should only be extracted from the files in which it is present.

In the first table it will find sampleA in the second column and it should take over the results for which the value in the below rows is larger than 0. In the second table, it will find sampleA in the 4th column.

The final table should look like this:

OTU_id	20210801_sampleA	20211111_sampleA	Taxonomy
otuA	5	3	Hans is cool
otuB	2	0	Patricia has blue eyes
otuZ	1	4	Felix was lost
otuT	0	56	Anna and Elsa

SampleA has data for otuA,B,Z in the first file. These results are taken over (so the OTU_id, the values for header 20210801_sampleA and the Taxonomy).

As you will see, otuF was not extracted from the first table, because sampleA had a value 0 there. otuF also isn't present in table 2, so hence the total absence in the output file.

otuA and otuZ also had values > 0 in table 2, so these values were taken over.

otuB had a value of 0 for sampleA in table 2. Normally it would not extract it, but since it had been extracted from table 1, a 0 is added in that position.

otuT is only present in table 2 and thus was extracted there (OTU_id, value for 20211111_sampleA and Taxonomy). Because it was not found in table 1, a 0 is added in that position.

In the results file, the first column should thus also be the OTU_id. columns 2 to x should contain the data for sampleA, extracted from the files in the directory. The last column should contain the Taxonomy.

I hope you manage to understand everything. It has been bothering me for some time and lack the skills to figure it out. Please let me know if something isn't clear!

Thanks a lot for your help and sorry to bother you with this!

Cheers Guillaume

Text-formatted table 1.tsv:

OTU_id  20210801_sampleA    20210801_sampleB    20210801_sampleC    Taxonomy
otuA    5   1   0   Hans is cool
otuB    2   0   0   Patricia has blue eyes
otuF    0   6   2   Rutger is actually Bart
otuZ    1   3   0   Felix was lost

text-formatted table2.tsv

OTU_id  20211111_sampleT    20211111_sampleG    20211111_sampleA    Taxonomy
otuA    0   3   3   Hans is cool
otuB    2   1   0   Patricia has blue eyes
otuP    1   6   0   Thomas is great
otuT    2   22  56  Anna and Elsa
otuZ    1   8   4   Felix was lost

text-formatted output.tsv

OTU_id  20210801_sampleA    20211111_sampleA    Taxonomy
otuA    5   3   Hans is cool
otuB    2   0   Patricia has blue eyes
otuZ    1   4   Felix was lost
otuT    0   56  Anna and Elsa

48

0 + 0

python

command-line

scripts

text-processing

Extracting specific data from different tables and merging it

Post an answer