A well-organized directory structure makes a bioinformatics project more scalable and reproducible, and makes parallelization easy. Here's my way.
The directory raw contains all paired-end read files, which are carefully named with the same suffixes: _1.fastq.gz for read 1 and _2.fastq.gz for read 2.
$ ls raw | head -n 4
delta_G_9-30_1.fastq.gz
delta_G_9-30_2.fastq.gz
WT_9-30_1.fastq.gz
WT_9-30_2.fastq.gz
I use a Python script, cluster_files, to cluster the paired-end read files by sample:
$ cluster_files -p '(.+?)_[12]\.fastq\.gz$' raw/ -o raw.cluster
$ tree
.
├── raw
│   ├── delta_G_9-30_1.fastq.gz
│   ├── delta_G_9-30_2.fastq.gz
│   ├── WT_9-30_1.fastq.gz
│   └── WT_9-30_2.fastq.gz
└── raw.cluster
    ├── delta_G_9-30
    │   ├── delta_G_9-30_1.fastq.gz -> ../../raw/delta_G_9-30_1.fastq.gz
    │   └── delta_G_9-30_2.fastq.gz -> ../../raw/delta_G_9-30_2.fastq.gz
    └── WT_9-30
        ├── WT_9-30_1.fastq.gz -> ../../raw/WT_9-30_1.fastq.gz
        └── WT_9-30_2.fastq.gz -> ../../raw/WT_9-30_2.fastq.gz
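For readers without cluster_files, its effect for this particular pattern can be reproduced in a few lines of plain shell. This is only a hypothetical sketch (the real script accepts an arbitrary regular expression); it is demonstrated here on dummy empty files in a temporary directory:

```shell
#!/bin/sh
set -e
# Sketch of cluster_files for the pattern '(.+?)_[12]\.fastq\.gz$':
# group paired-end files by sample (the name before _1/_2) and symlink
# each pair into its own subdirectory.
cd "$(mktemp -d)"
mkdir raw
touch raw/WT_9-30_1.fastq.gz raw/WT_9-30_2.fastq.gz \
      raw/delta_G_9-30_1.fastq.gz raw/delta_G_9-30_2.fastq.gz

mkdir raw.cluster
for f in raw/*.fastq.gz; do
    base=$(basename "$f")
    sample=${base%_[12].fastq.gz}        # strip the _1/_2 suffix -> sample name
    mkdir -p "raw.cluster/$sample"
    ln -sf "../../raw/$base" "raw.cluster/$sample/$base"
done
ls raw.cluster
```

The links are relative (../../raw/...), so the whole project directory can be moved or archived without breaking them.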
For every analysis step, I create a new directory with the same structure, in which the read files are symbolic links. This keeps each step isolated, leaves the raw data untouched, and makes per-sample parallelization straightforward.
Now let's clean the reads with Trimmomatic. First, create the working directory for this step in the same way (here $TRIMMOMATIC should point to the Trimmomatic jar file):
$ cluster_files -p '(.+?)_[12]\.fastq\.gz$' raw/ -o raw.cluster.clean
In the parallel command below, the replacement string {/} represents the basename of the input file path, which here is the sample name.
$ ls -d raw.cluster.clean/* | parallel -j 6 --dryrun \
java -jar ${TRIMMOMATIC} PE -threads 4 -phred33 \
{}/{/}_1.fastq.gz {}/{/}_2.fastq.gz \
{}/{/}_1.fq.gz {}/{/}_U_1.fq.gz {}/{/}_2.fq.gz {}/{/}_U_2.fq.gz \
ILLUMINACLIP:trimSeq.fasta:2:7:7 MINLEN:20
java -jar PE -threads 4 -phred33 raw.cluster.clean/delta_G_9-30/delta_G_9-30_1.fastq.gz raw.cluster.clean/delta_G_9-30/delta_G_9-30_2.fastq.gz raw.cluster.clean/delta_G_9-30/delta_G_9-30_1.fq.gz raw.cluster.clean/delta_G_9-30/delta_G_9-30_U_1.fq.gz raw.cluster.clean/delta_G_9-30/delta_G_9-30_2.fq.gz raw.cluster.clean/delta_G_9-30/delta_G_9-30_U_2.fq.gz ILLUMINACLIP:trimSeq.fasta:2:7:7 MINLEN:20
java -jar PE -threads 4 -phred33 raw.cluster.clean/WT_9-30/WT_9-30_1.fastq.gz raw.cluster.clean/WT_9-30/WT_9-30_2.fastq.gz raw.cluster.clean/WT_9-30/WT_9-30_1.fq.gz raw.cluster.clean/WT_9-30/WT_9-30_U_1.fq.gz raw.cluster.clean/WT_9-30/WT_9-30_2.fq.gz raw.cluster.clean/WT_9-30/WT_9-30_U_2.fq.gz ILLUMINACLIP:trimSeq.fasta:2:7:7 MINLEN:20
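If GNU parallel's replacement strings are unfamiliar, here is what they expand to for the two sample directories above, emulated with a plain shell loop (a sketch; the paths are treated as plain strings and need not exist):

```shell
#!/bin/sh
# Emulate what GNU parallel substitutes for each input line:
#   {}  -> the input line itself (a per-sample directory path)
#   {/} -> its basename (the sample name)
for dir in raw.cluster.clean/delta_G_9-30 raw.cluster.clean/WT_9-30; do
    sample=$(basename "$dir")
    echo "{} = $dir    {/} = $sample"
done
```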
Next, the Trimmomatic results are clustered for further analysis, e.g., assembly.
$ cluster_files -p '(.+?)_\d.*\.fq\.gz$' raw.cluster.clean -o raw.cluster.clean.assembly
$ tree raw.cluster.clean.assembly
raw.cluster.clean.assembly
├── delta_G
│   ├── delta_G_9-30_1.fq.gz -> ../../raw.cluster.clean/delta_G_9-30/delta_G_9-30_1.fq.gz
│   ├── delta_G_9-30_2.fq.gz -> ../../raw.cluster.clean/delta_G_9-30/delta_G_9-30_2.fq.gz
│   ├── delta_G_9-30_U_1.fq.gz -> ../../raw.cluster.clean/delta_G_9-30/delta_G_9-30_U_1.fq.gz
│   └── delta_G_9-30_U_2.fq.gz -> ../../raw.cluster.clean/delta_G_9-30/delta_G_9-30_U_2.fq.gz
└── WT
    ├── WT_9-30_1.fq.gz -> ../../raw.cluster.clean/WT_9-30/WT_9-30_1.fq.gz
    ├── WT_9-30_2.fq.gz -> ../../raw.cluster.clean/WT_9-30/WT_9-30_2.fq.gz
    ├── WT_9-30_U_1.fq.gz -> ../../raw.cluster.clean/WT_9-30/WT_9-30_U_1.fq.gz
    └── WT_9-30_U_2.fq.gz -> ../../raw.cluster.clean/WT_9-30/WT_9-30_U_2.fq.gz
# assemble with SPAdes:
$ ls -d raw.cluster.clean.assembly/* | parallel spades.py -k 21,33,47,55,63,77 -t 4 -m 10 --careful \
    -o {}/spades -1 {}/{/}_1.fq.gz -2 {}/{/}_2.fq.gz -s {}/{/}_U_1.fq.gz -s {}/{/}_U_2.fq.gz
(Note: {/} is the cluster name, so this works when each cluster holds a single sample whose file prefix equals the cluster name; when a cluster groups several time points, list the files explicitly.)
Sometimes I want to assemble with another tool, so I simply create another workspace:
$ cluster_files -p '(.+?)_\d.*\.fq\.gz$' raw.cluster.clean -o raw.cluster.clean.assembly_with_xxx
Here’s one project of mine:
raw
raw.fastqc
raw.kaiju
raw.cluster
raw.cluster.clean
raw.cluster.clean.mapping
raw.cluster.clean.mapping.pilon
raw.cluster.clean.mapping.breseq
raw.cluster.clean.spades.plasmid
raw.cluster.clean.spades.result
PS: You may also try rush, a GNU-parallel-like tool written in Go that supports Linux, macOS, and Windows. rush has some unique features, e.g., awk -v-style custom variables.