Directory organization in Bioinformatics analysis


A well-organized directory structure makes a bioinformatics analysis more scalable and reproducible, and makes it easier to parallelize.

Here’s my way.

The directory raw contains all paired-end read files, which are carefully named with consistent suffixes: _1.fastq.gz for read 1 and _2.fastq.gz for read 2.

$ ls raw | head -n 4

I use a Python script, cluster_files, to cluster the paired-end reads by sample:

$ cluster_files -p '(.+?)_[12]\.fastq\.gz$' raw/ -o raw.cluster

$ tree
├── raw
│   ├── delta_G_9-30_1.fastq.gz
│   ├── delta_G_9-30_2.fastq.gz
│   ├── WT_9-30_1.fastq.gz
│   └── WT_9-30_2.fastq.gz
└── raw.cluster
    ├── delta_G_9-30
    │   ├── delta_G_9-30_1.fastq.gz -> ../../raw/delta_G_9-30_1.fastq.gz
    │   └── delta_G_9-30_2.fastq.gz -> ../../raw/delta_G_9-30_2.fastq.gz
    └── WT_9-30
        ├── WT_9-30_1.fastq.gz -> ../../raw/WT_9-30_1.fastq.gz
        └── WT_9-30_2.fastq.gz -> ../../raw/WT_9-30_2.fastq.gz
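
The clustering step above can be sketched in a few lines of Python. This is not the actual cluster_files script, only a minimal illustration of the idea under some assumptions: files are grouped by the first capture group of the pattern, and each one is linked into its group directory through a relative symlink. The function name and its arguments are hypothetical.

```python
import os
import re
from pathlib import Path


def cluster_files(src_dir, out_dir, pattern):
    """Group files under src_dir by the first regex capture group and
    symlink each one into out_dir/<group>/ using a relative target.

    Hypothetical re-implementation for illustration only.
    """
    regex = re.compile(pattern)
    src = Path(src_dir)
    out = Path(out_dir)
    groups = {}
    for path in sorted(src.rglob("*")):
        if not path.is_file():
            continue
        match = regex.search(path.name)
        if match is None:
            continue  # skip files that do not match the pattern
        group_dir = out / match.group(1)
        group_dir.mkdir(parents=True, exist_ok=True)
        link = group_dir / path.name
        # A relative target keeps the links valid if the whole project moves.
        target = os.path.relpath(path, start=group_dir)
        if not link.is_symlink():
            link.symlink_to(target)
        groups.setdefault(match.group(1), []).append(link)
    return groups
```

Calling `cluster_files("raw", "raw.cluster", r"(.+?)_[12]\.fastq\.gz$")` on the layout above would produce links such as `raw.cluster/WT_9-30/WT_9-30_1.fastq.gz -> ../../raw/WT_9-30_1.fastq.gz`.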

For every analysis step, I create a new directory with the same structure, in which the read files are soft links. This has several benefits:

  1. Safety. Files produced by earlier steps live outside the current working space, so they are not deleted by accident.
  2. Clear file organization. Every analysis step has its own working space.
  3. Easy parallelization with tools like GNU parallel and rush.

Now let’s clean the reads. In GNU parallel, the replacement string {/} represents the basename of the input path; here it is the sample name.

$ ls -d raw.cluster.clean/* | parallel -j 6 --dryrun \
    java -jar ${TRIMMOMATIC} PE -threads 4 -phred33 \
    {}/{/}_1.fastq.gz {}/{/}_2.fastq.gz \
    {}/{/}_1.fq.gz {}/{/}_U_1.fq.gz {}/{/}_2.fq.gz {}/{/}_U_2.fq.gz \
    ILLUMINACLIP:trimSeq.fasta:2:7:7 MINLEN:20

java -jar PE -threads 4 -phred33 raw.cluster.clean/delta_G_9-30/delta_G_9-30_1.fastq.gz raw.cluster.clean/delta_G_9-30/delta_G_9-30_2.fastq.gz raw.cluster.clean/delta_G_9-30/delta_G_9-30_1.fq.gz raw.cluster.clean/delta_G_9-30/delta_G_9-30_U_1.fq.gz raw.cluster.clean/delta_G_9-30/delta_G_9-30_2.fq.gz raw.cluster.clean/delta_G_9-30/delta_G_9-30_U_2.fq.gz ILLUMINACLIP:trimSeq.fasta:2:7:7 MINLEN:20

java -jar PE -threads 4 -phred33 raw.cluster.clean/WT_9-30/WT_9-30_1.fastq.gz raw.cluster.clean/WT_9-30/WT_9-30_2.fastq.gz raw.cluster.clean/WT_9-30/WT_9-30_1.fq.gz raw.cluster.clean/WT_9-30/WT_9-30_U_1.fq.gz raw.cluster.clean/WT_9-30/WT_9-30_2.fq.gz raw.cluster.clean/WT_9-30/WT_9-30_U_2.fq.gz ILLUMINACLIP:trimSeq.fasta:2:7:7 MINLEN:20
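
To make the role of the two replacement strings concrete, here is a small Python sketch that builds the same command for one sample directory. It is only an illustration: `trimmomatic_command` is a hypothetical helper, and the default jar path `trimmomatic.jar` is an assumed placeholder for whatever `${TRIMMOMATIC}` points to.

```python
import shlex
from pathlib import Path


def trimmomatic_command(sample_dir, jar="trimmomatic.jar", threads=4):
    """Build the paired-end Trimmomatic command for one sample directory.

    `jar` is a placeholder path (assumption), not a real install location.
    """
    d = Path(sample_dir)  # plays the role of {} in GNU parallel
    name = d.name         # plays the role of {/}, i.e. the sample name
    return [
        "java", "-jar", jar, "PE", "-threads", str(threads), "-phred33",
        str(d / f"{name}_1.fastq.gz"), str(d / f"{name}_2.fastq.gz"),
        str(d / f"{name}_1.fq.gz"), str(d / f"{name}_U_1.fq.gz"),
        str(d / f"{name}_2.fq.gz"), str(d / f"{name}_U_2.fq.gz"),
        "ILLUMINACLIP:trimSeq.fasta:2:7:7", "MINLEN:20",
    ]


if __name__ == "__main__":
    # Dry run: print one command per sample directory, run nothing.
    for sample_dir in sorted(Path("raw.cluster.clean").glob("*")):
        if sample_dir.is_dir():
            print(shlex.join(trimmomatic_command(sample_dir)))
```

Because each sample has its own directory, every command reads and writes only inside that directory, which is what makes running the samples side by side safe.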

In the next step, the Trimmomatic results are clustered for further analysis, e.g., assembly.

$ cluster_files -p '(.+?)_\d.*\.fq\.gz$' raw.cluster.clean -o raw.cluster.clean.assembly

$ tree raw.cluster.clean.assembly
├── delta_G
│   ├── delta_G_9-30_1.fq.gz -> ../../raw.cluster.clean/delta_G_9-30/delta_G_9-30_1.fq.gz
│   ├── delta_G_9-30_2.fq.gz -> ../../raw.cluster.clean/delta_G_9-30/delta_G_9-30_2.fq.gz
│   ├── delta_G_9-30_U_1.fq.gz -> ../../raw.cluster.clean/delta_G_9-30/delta_G_9-30_U_1.fq.gz
│   └── delta_G_9-30_U_2.fq.gz -> ../../raw.cluster.clean/delta_G_9-30/delta_G_9-30_U_2.fq.gz
└── WT
    ├── WT_9-30_1.fq.gz -> ../../raw.cluster.clean/WT_9-30/WT_9-30_1.fq.gz
    ├── WT_9-30_2.fq.gz -> ../../raw.cluster.clean/WT_9-30/WT_9-30_2.fq.gz
    ├── WT_9-30_U_1.fq.gz -> ../../raw.cluster.clean/WT_9-30/WT_9-30_U_1.fq.gz
    └── WT_9-30_U_2.fq.gz -> ../../raw.cluster.clean/WT_9-30/WT_9-30_U_2.fq.gz
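
The group names differ between the two clustering steps because the lazy capture group stops at a different place in each pattern. The following Python check illustrates this with the file names shown above; `group_name` is a hypothetical helper used only for this demonstration.

```python
import re

# Patterns copied from the two cluster_files calls.
RAW_PATTERN = re.compile(r"(.+?)_[12]\.fastq\.gz$")
CLEAN_PATTERN = re.compile(r"(.+?)_\d.*\.fq\.gz$")


def group_name(pattern, filename):
    """Return the first capture group, or None if the name does not match."""
    match = pattern.search(filename)
    return match.group(1) if match else None


# The raw pattern stops only at the final _1/_2, keeping the full sample name.
print(group_name(RAW_PATTERN, "delta_G_9-30_1.fastq.gz"))
# The clean pattern stops at the first underscore followed by a digit.
print(group_name(CLEAN_PATTERN, "delta_G_9-30_U_1.fq.gz"))
# The untrimmed .fastq.gz links are ignored by the clean pattern.
print(group_name(CLEAN_PATTERN, "delta_G_9-30_1.fastq.gz"))
```

The first call yields `delta_G_9-30`, the second yields `delta_G`, and the third yields `None`, so only the trimmed .fq.gz files end up in the assembly working space.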

# assemble with spades:
ls -d raw.cluster.clean.assembly/* | parallel 'spades.py -k 21,33,47,55,63,77 -t 4 -m 10 --careful -o {}/spades -1 {}/{/}_1.fq.gz -2 {}/{/}_2.fq.gz -s {}/{/}_1.unpaired.fq.gz -s {}/{/}_2.unpaired.fq.gz'

Sometimes I want to assemble with another tool, so I create another working space:

$ cluster_files -p '(.+?)_\d.*\.fq\.gz$' raw.cluster.clean -o raw.cluster.clean.assembly_with_xxx

Here’s one project of mine:


PS: You may want to try rush, a GNU parallel-like tool written in Go that supports Linux, OS X, and Windows. rush has some unique features, e.g., custom-defined variables similar to awk -v.