I did some analysis on high-throughput sequencing data this week. Here are some experiences worth sharing.
1. Make scripts flexible and reproducible
Use an argument parser to handle different running conditions/steps, and make sure parameters are easy to change from the command line.
Use a counting "verbose" option together with the logging module to get multiple levels of output, e.g. "quiet" -> "info" -> "verbose" -> "debug", instead of repeatedly editing debug code.
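As a rough illustration, here is a minimal Python sketch of that idea; the option names and the mapping from the -v count to logging levels are my own choices, not a fixed convention:

```python
#!/usr/bin/env python
# Minimal sketch: a counting -v flag mapped to logging levels.
import argparse
import logging

parser = argparse.ArgumentParser(description="example analysis step")
parser.add_argument("-v", "--verbose", action="count", default=0,
                    help="-v for info, -vv for debug (default: warnings only)")
args = parser.parse_args()

# 0 -> WARNING ("quiet"), 1 -> INFO, 2+ -> DEBUG
level = [logging.WARNING, logging.INFO, logging.DEBUG][min(args.verbose, 2)]
logging.basicConfig(level=level, format="[%(levelname)s] %(message)s")

logging.info("run started")
logging.debug("full arguments: %s", args)
```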
2. Version control
I stored the code and documentation of one project on GitHub. Every stage of the code was committed, so I can track the project's progress and, when necessary, roll back to a previous version, which is very important.
3. Make scripts compatible with common file types
For example, NGS reads are often saved as fq.gz files; tools should accept both plain FASTQ and gzipped fq.gz files, deciding how to read them according to the file name.
Save big files as gzip files to save disk space.
Tools should also support reading from standard input (stdin) and writing to standard output (stdout), so they can be integrated into shell pipes.
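A minimal sketch of this, assuming the file name alone decides how the input is opened; the helper name xopen and the "-" convention for stdin/stdout are just illustrative:

```python
# Minimal sketch: open plain, gzipped, or "-" (stdin/stdout) transparently,
# based only on the file name. The helper name xopen is hypothetical.
import gzip
import sys

def xopen(filename, mode="rt"):
    """'-' means stdin/stdout, '.gz' means gzip, anything else is plain text."""
    if filename == "-":
        return sys.stdin if "r" in mode else sys.stdout
    if filename.endswith(".gz"):
        return gzip.open(filename, mode)
    return open(filename, mode)

# Works the same for reads.fq, reads.fq.gz, or piped data:
#   cat reads.fq | python count_lines.py -
if __name__ == "__main__":
    name = sys.argv[1] if len(sys.argv) > 1 else "-"
    with xopen(name) as fh:
        print(sum(1 for _ in fh), "lines")
```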
4. Organize files and directories well
A script may be used to handle different datasets, so it's good practice to save all output into one dedicated directory, not the current directory.
Avoid writing thousands of files to ONE directory. Organize files in a hierarchical tree by file-name prefix (a small path-building sketch follows the tree). For example:
└── output20150213
    ├── A
    │   ├── A
    │   │   ├── AAA.txt
    │   │   ├── AAC.txt
    │   │   ├── AAG.txt
    │   │   └── AAT.txt
    │   ├── C
    │   │   ├── ACA.txt
    │   │   ├── ACC.txt
    │   │   ├── ACG.txt
    │   │   └── ACT.txt
    │   ├── G
    │   └── T
    ├── C
    ├── G
    └── T
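Paths like these can be built by splitting the leading characters of the file name into directory levels. Here is a minimal sketch; the depth of two and the function name are my own choices:

```python
# Minimal sketch: nest an output file under directories taken from
# the first characters of its name, e.g. AAA.txt -> A/A/AAA.txt.
import os

def nested_path(outdir, name, depth=2):
    """Build a nested path from the first `depth` characters of name."""
    subdirs = list(name[:depth])
    directory = os.path.join(outdir, *subdirs)
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, name)

path = nested_path("output20150213", "AAA.txt")   # output20150213/A/A/AAA.txt
with open(path, "w") as fh:
    fh.write("some result\n")
```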
5. Optimize code for better performance
Reading and writing files are usually the performance bottleneck, so frequent small writes should be avoided. Try caching data in memory and writing it out in batches, with the cache size easy to change from a command-line option, as in the sketch below.
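A minimal sketch of batched writing with a configurable cache size; the option name --cache-size, the default value, and the output file name are made up for illustration:

```python
# Minimal sketch: buffer records in memory and write them in batches.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--cache-size", type=int, default=100000,
                    help="number of records to buffer before writing")
args = parser.parse_args()

buffer = []
with open("results.txt", "w") as out:
    for i in range(1000000):            # stands in for the real computation
        buffer.append("record\t%d\n" % i)
        if len(buffer) >= args.cache_size:
            out.writelines(buffer)      # one large write instead of many small ones
            buffer = []
    if buffer:                          # flush whatever is left
        out.writelines(buffer)
```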
Do not stop optimizing the code. Other aspects, including data structures, algorithms, and parallelization, should also be taken into consideration.